summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2018-10-19KVM: arm64: Safety check PSTATE when entering guest and handle ILChristoffer Dall
This commit adds a paranoid check when entering the guest to make sure we don't attempt running guest code in an equally or more privilged mode than the hypervisor. We also catch other accidental programming of the SPSR_EL2 which results in an illegal exception return and report this safely back to the user. Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-18arm/arm64: KVM: Enable 32 bits kvm vcpu events supportDongjiu Geng
The commit 539aee0edb9f ("KVM: arm64: Share the parts of get/set events useful to 32bit") shares the get/set events helper for arm64 and arm32, but forgot to share the cap extension code. User space will check whether KVM supports vcpu events by checking the KVM_CAP_VCPU_EVENTS extension Acked-by: James Morse <james.morse@arm.com> Reviewed-by : Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-18arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()Dongjiu Geng
Rename kvm_arch_dev_ioctl_check_extension() to kvm_arch_vm_ioctl_check_extension(), because it does not have any relationship with device. Renaming this function can make code readable. Cc: James Morse <james.morse@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-03arm64: KVM: Remove some extra semicolon in kvm_target_cpuzhong jiang
There are some extra semicolon in kvm_target_cpu, remove it. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-03KVM: arm/arm64: Rename kvm_arm_config_vm to kvm_arm_setup_stage2Marc Zyngier
VM tends to be a very overloaded term in KVM, so let's keep it to describe the virtual machine. For the virtual memory setup, let's use the "stage2" suffix. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-03kvm: arm64: Allow tuning the physical address size for VMSuzuki K Poulose
Allow specifying the physical address size limit for a new VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This allows us to finalise the stage2 page table as early as possible and hence perform the right checks on the memory slots without complication. The size is encoded as Log2(PA_Size) in bits[7:0] of the type field. For backward compatibility the value 0 is reserved and implies 40bits. Also, lift the limit of the IPA to host limit and allow lower IPA sizes (e.g, 32). The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE for the availability of this feature. The cap check returns the maximum limit for the physical address shift supported by the host. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Cc: Peter Maydell <peter.maydell@linaro.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-03kvm: arm64: Limit the minimum number of page table levelsSuzuki K Poulose
Since we are about to remove the lower limit on the IPA size, make sure that we do not go to 1 level page table (e.g, with 32bit IPA on 64K host with concatenation) to avoid splitting the host PMD huge pages at stage2. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-03kvm: arm64: Set a limit on the IPA sizeSuzuki K Poulose
So far we have restricted the IPA size of the VM to the default value (40bits). Now that we can manage the IPA size per VM and support dynamic stage2 page tables, we can allow VMs to have larger IPA. This patch introduces a the maximum IPA size supported on the host. This is decided by the following factors : 1) Maximum PARange supported by the CPUs - This can be inferred from the system wide safe value. 2) Maximum PA size supported by the host kernel (48 vs 52) 3) Number of levels in the host page table (as we base our stage2 tables on the host table helpers). Since the stage2 page table code is dependent on the stage1 page table, we always ensure that : Number of Levels at Stage1 >= Number of Levels at Stage2 So we limit the IPA to make sure that the above condition is satisfied. This will affect the following combinations of VA_BITS and IPA for different page sizes. Host configuration | Unsupported IPA ranges 39bit VA, 4K | [44, 48] 36bit VA, 16K | [41, 48] 42bit VA, 64K | [47, 52] Supporting the above combinations need independent stage2 page table manipulation code, which would need substantial changes. We could purse the solution independently and switch the page table code once we have it ready. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Add 52bit support for PAR to HPFAR conversoinSuzuki K Poulose
Add support for handling 52bit addresses in PAR to HPFAR conversion. Instead of hardcoding the address limits, we now use PHYS_MASK_SHIFT. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Switch to per VM IPA limitSuzuki K Poulose
Now that we can manage the stage2 page table per VM, switch the configuration details to per VM instance. The VTCR is updated with the values specific to the VM based on the configuration. We store the IPA size and the number of stage2 page table levels for the guest already in VTCR. Decode it back from the vtcr field wherever we need it. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Configure VTCR_EL2.SL0 per VMSuzuki K Poulose
VTCR_EL2 holds the following key stage2 translation table parameters: SL0 - Entry level in the page table lookup. T0SZ - Denotes the size of the memory addressed by the table. We have been using fixed values for the SL0 depending on the page size as we have a fixed IPA size. But since we are about to make it dynamic, we need to calculate the SL0 at runtime per VM. This patch adds a helper to compute the value of SL0 for a VM based on the IPA size. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Dynamic configuration of VTTBR maskSuzuki K Poulose
On arm64 VTTBR_EL2:BADDR holds the base address for the stage2 translation table. The Arm ARM mandates that the bits BADDR[x-1:0] should be 0, where 'x' is defined for a given IPA Size and the number of levels for a translation granule size. It is defined using some magical constants. This patch is a reverse engineered implementation to calculate the 'x' at runtime for a given ipa and number of page table levels. See patch for more details. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Make stage2 page table layout dynamicSuzuki K Poulose
Switch to dynamic stage2 page table layout based on the given VM. So far we had a common stage2 table layout determined at compile time. Make decision based on the VM instance depending on the IPA limit for the VM. Adds helpers to compute the stage2 parameters based on the guest's IPA and uses them to make the decisions. The IPA limit is still fixed to 40bits and the build time check to ensure the stage2 doesn't exceed the host kernels page table levels is retained. Also make sure that we use the pud/pmd level helpers from the host only when they are not folded. Cc: Christoffer Dall <cdall@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Prepare for dynamic stage2 page table layoutSuzuki K Poulose
Our stage2 page table helpers are statically defined based on the fixed IPA of 40bits and the host page size. As we are about to add support for configurable IPA size for VMs, we need to make the page table checks for each VM. This patch prepares the stage2 helpers to make the transition to a VM dependent table layout easier. Instead of statically defining the table helpers based on the page table levels, we now check the page table levels in the helpers to do the right thing. In effect, it simply converts the macros to static inline functions. Cc: Eric Auger <eric.auger@redhat.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm/arm64: Prepare for VM specific stage2 translationsSuzuki K Poulose
Right now the stage2 page table for a VM is hard coded, assuming an IPA of 40bits. As we are about to add support for per VM IPA, prepare the stage2 page table helpers to accept the kvm instance to make the right decision for the VM. No functional changes. Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves some of the definitions in arm32 to align with the arm64. Also drop the _AC() specifier constants wherever possible. Cc: Christoffer Dall <cdall@kernel.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Configure VTCR_EL2 per VMSuzuki K Poulose
Add support for setting the VTCR_EL2 per VM, rather than hard coding a value at boot time per CPU. This would allow us to tune the stage2 page table parameters per VM in the later changes. We compute the VTCR fields based on the system wide sanitised feature registers, except for the hardware management of Access Flags (VTCR_EL2.HA). It is fine to run a system with a mix of CPUs that may or may not update the page table Access Flags. Since the bit is RES0 on CPUs that don't support it, the bit should be ignored on them. Suggested-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <cdall@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm/arm64: Allow arch specific configurations for VMSuzuki K Poulose
Allow the arch backends to perform VM specific initialisation. This will be later used to handle IPA size configuration and per-VM VTCR configuration on arm64. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Clean up VTCR_EL2 initialisationSuzuki K Poulose
Use the new helper for converting the parange to the physical shift. Also, add the missing definitions for the VTCR_EL2 register fields and use them instead of hard coding numbers. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01arm64: Add a helper for PARange to physical shift conversionSuzuki K Poulose
On arm64, ID_AA64MMFR0_EL1.PARange encodes the maximum Physical Address range supported by the CPU. Add a helper to decode this to actual physical shift. If we hit an unallocated value, return the maximum range supported by the kernel. This will be used by KVM to set the VTCR_EL2.T0SZ, as it is about to move its place. Having this helper keeps the code movement cleaner. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-01kvm: arm64: Add helper for loading the stage2 setting for a VMSuzuki K Poulose
We load the stage2 context of a guest for different operations, including running the guest and tlb maintenance on behalf of the guest. As of now only the vttbr is private to the guest, but this is about to change with IPA per VM. Add a helper to load the stage2 configuration for a VM, which could do the right thing with the future changes. Cc: Christoffer Dall <cdall@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-09-23Merge tag 'for-linus-4.19d-rc5-tag' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Juergen writes: "xen: Two small fixes for xen drivers." * tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: issue warning message when out of grant maptrack entries xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code
2018-09-23Merge branch 'x86-urgent-for-linus' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Thomas writes: "A set of fixes for x86: - Resolve the kvmclock regression on AMD systems with memory encryption enabled. The rework of the kvmclock memory allocation during early boot results in encrypted storage, which is not shareable with the hypervisor. Create a new section for this data which is mapped unencrypted and take care that the later allocations for shared kvmclock memory is unencrypted as well. - Fix the build regression in the paravirt code introduced by the recent spectre v2 updates. - Ensure that the initial static page tables cover the fixmap space correctly so early console always works. This worked so far by chance, but recent modifications to the fixmap layout can - depending on kernel configuration - move the relevant entries to a different place which is not covered by the initial static page tables. - Address the regressions and issues which got introduced with the recent extensions to the Intel Recource Director Technology code. - Update maintainer entries to document reality" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Expand static page table for fixmap space MAINTAINERS: Add X86 MM entry x86/intel_rdt: Add Reinette as co-maintainer for RDT MAINTAINERS: Add Borislav to the x86 maintainers x86/paravirt: Fix some warning messages x86/intel_rdt: Fix incorrect loop end condition x86/intel_rdt: Fix exclusive mode handling of MBA resource x86/intel_rdt: Fix incorrect loop end condition x86/intel_rdt: Do not allow pseudo-locking of MBA resource x86/intel_rdt: Fix unchecked MSR access x86/intel_rdt: Fix invalid mode warning when multiple resources are managed x86/intel_rdt: Global closid helper to support future fixes x86/intel_rdt: Fix size reporting of MBA resource x86/intel_rdt: Fix data type in parsing callbacks x86/kvm: Use __bss_decrypted attribute in shared variables x86/mm: Add .bss..decrypted section to hold shared variables
2018-09-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmGreg Kroah-Hartman
Paolo writes: "It's mostly small bugfixes and cleanups, mostly around x86 nested virtualization. One important change, not related to nested virtualization, is that the ability for the guest kernel to trap CPUID instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is now masked by default. This is because the feature is detected through an MSR; a very bad idea that Intel seems to like more and more. Some applications choke if the other fields of that MSR are not initialized as on real hardware, hence we have to disable the whole MSR by default, as was the case before Linux 4.12." * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits) KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLs kvm: selftests: Add platform_info_test KVM: x86: Control guest reads of MSR_PLATFORM_INFO KVM: x86: Turbo bits in MSR_PLATFORM_INFO nVMX x86: Check VPID value on vmentry of L2 guests nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2 KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv KVM: VMX: check nested state and CR4.VMXE against SMM kvm: x86: make kvm_{load|put}_guest_fpu() static x86/hyper-v: rename ipi_arg_{ex,non_ex} structures KVM: VMX: use preemption timer to force immediate VMExit KVM: VMX: modify preemption timer bit only when arming timer KVM: VMX: immediately mark preemption timer expired only for zero value KVM: SVM: Switch to bitmap_zalloc() KVM/MMU: Fix comment in walk_shadow_page_lockless_end() kvm: selftests: use -pthread instead of -lpthread KVM: x86: don't reset root in kvm_mmu_setup() kvm: mmu: Don't read PDPTEs when paging is not enabled x86/kvm/lapic: always disable MMIO interface in x2APIC mode KVM: s390: Make huge pages unavailable in ucontrol VMs ...
2018-09-20x86/mm: Expand static page table for fixmap spaceFeng Tang
We met a kernel panic when enabling earlycon, which is due to the fixmap address of earlycon is not statically setup. Currently the static fixmap setup in head_64.S only covers 2M virtual address space, while it actually could be in 4M space with different kernel configurations, e.g. when VSYSCALL emulation is disabled. So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2, and add a build time check to ensure that the fixmap is covered by the initial static page tables. Fixes: 1ad83c858c7d ("x86_64,vsyscall: Make vsyscall emulation configurable") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: kernel test robot <rong.a.chen@intel.com> Reviewed-by: Juergen Gross <jgross@suse.com> (Xen parts) Cc: H Peter Anvin <hpa@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com
2018-09-20KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLsLiran Alon
The handlers of IOCTLs in kvm_arch_vcpu_ioctl() are expected to set their return value in "r" local var and break out of switch block when they encounter some error. This is because vcpu_load() is called before the switch block which have a proper cleanup of vcpu_put() afterwards. However, KVM_{GET,SET}_NESTED_STATE IOCTLs handlers just return immediately on error without performing above mentioned cleanup. Thus, change these handlers to behave as expected. Fixes: 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE") Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: x86: Control guest reads of MSR_PLATFORM_INFODrew Schmitt
Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access to reads of MSR_PLATFORM_INFO. Disabling access to reads of this MSR gives userspace the control to "expose" this platform-dependent information to guests in a clear way. As it exists today, guests that read this MSR would get unpopulated information if userspace hadn't already set it (and prior to this patch series, only the CPUID faulting information could have been populated). This existing interface could be confusing if guests don't handle the potential for incorrect/incomplete information gracefully (e.g. zero reported for base frequency). Signed-off-by: Drew Schmitt <dasch@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: x86: Turbo bits in MSR_PLATFORM_INFODrew Schmitt
Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously, only the CPUID faulting bit was settable. But now any bit in MSR_PLATFORM_INFO would be settable. This can be used, for example, to convey frequency information about the platform on which the guest is running. Signed-off-by: Drew Schmitt <dasch@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20nVMX x86: Check VPID value on vmentry of L2 guestsKrish Sadhukhan
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check needs to be enforced on vmentry of L2 guests: If the 'enable VPID' VM-execution control is 1, the value of the of the VPID VM-execution control field must not be 0000H. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2Krish Sadhukhan
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check needs to be enforced on vmentry of L2 guests: - Bits 5:0 of the posted-interrupt descriptor address are all 0. - The posted-interrupt descriptor address does not set any bits beyond the processor's physical-address width. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICvLiran Alon
In case L1 do not intercept L2 HLT or enter L2 in HLT activity-state, it is possible for a vCPU to be blocked while it is in guest-mode. According to Intel SDM 26.6.5 Interrupt-Window Exiting and Virtual-Interrupt Delivery: "These events wake the logical processor if it just entered the HLT state because of a VM entry". Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU should be waken from the HLT state and injected with the interrupt. In addition, if while the vCPU is blocked (while it is in guest-mode), it receives a nested posted-interrupt, then the vCPU should also be waken and injected with the posted interrupt. To handle these cases, this patch enhances kvm_vcpu_has_events() to also check if there is a pending interrupt in L2 virtual APICv provided by L1. That is, it evaluates if there is a pending virtual interrupt for L2 by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1 Evaluation of Pending Interrupts. Note that this also handles the case of nested posted-interrupt by the fact RVI is updated in vmx_complete_nested_posted_interrupt() which is called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() -> kvm_vcpu_running() -> vmx_check_nested_events() -> vmx_complete_nested_posted_interrupt(). Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: VMX: check nested state and CR4.VMXE against SMMPaolo Bonzini
VMX cannot be enabled under SMM, check it when CR4 is set and when nested virtualization state is restored. This should fix some WARNs reported by syzkaller, mostly around alloc_shadow_vmcs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20kvm: x86: make kvm_{load|put}_guest_fpu() staticSebastian Andrzej Siewior
The functions kvm_load_guest_fpu() kvm_put_guest_fpu() are only used locally, make them static. This requires also that both functions are moved because they are used before their implementation. Those functions were exported (via EXPORT_SYMBOL) before commit e5bb40251a920 ("KVM: Drop kvm_{load,put}_guest_fpu() exports"). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20x86/hyper-v: rename ipi_arg_{ex,non_ex} structuresVitaly Kuznetsov
These structures are going to be used from KVM code so let's make their names reflect their Hyper-V origin. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Acked-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: VMX: use preemption timer to force immediate VMExitSean Christopherson
A VMX preemption timer value of '0' is guaranteed to cause a VMExit prior to the CPU executing any instructions in the guest. Use the preemption timer (if it's supported) to trigger immediate VMExit in place of the current method of sending a self-IPI. This ensures that pending VMExit injection to L1 occurs prior to executing any instructions in the guest (regardless of nesting level). When deferring VMExit injection, KVM generates an immediate VMExit from the (possibly nested) guest by sending itself an IPI. Because hardware interrupts are blocked prior to VMEnter and are unblocked (in hardware) after VMEnter, this results in taking a VMExit(INTR) before any guest instruction is executed. But, as this approach relies on the IPI being received before VMEnter executes, it only works as intended when KVM is running as L0. Because there are no architectural guarantees regarding when IPIs are delivered, when running nested the INTR may "arrive" long after L2 is running e.g. L0 KVM doesn't force an immediate switch to L1 to deliver an INTR. For the most part, this unintended delay is not an issue since the events being injected to L1 also do not have architectural guarantees regarding their timing. The notable exception is the VMX preemption timer[1], which is architecturally guaranteed to cause a VMExit prior to executing any instructions in the guest if the timer value is '0' at VMEnter. Specifically, the delay in injecting the VMExit causes the preemption timer KVM unit test to fail when run in a nested guest. Note: this approach is viable even on CPUs with a broken preemption timer, as broken in this context only means the timer counts at the wrong rate. There are no known errata affecting timer value of '0'. [1] I/O SMIs also have guarantees on when they arrive, but I have no idea if/how those are emulated in KVM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Use a hook for SVM instead of leaving the default in x86.c - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: VMX: modify preemption timer bit only when arming timerSean Christopherson
Provide a singular location where the VMX preemption timer bit is set/cleared so that future usages of the preemption timer can ensure the VMCS bit is up-to-date without having to modify unrelated code paths. For example, the preemption timer can be used to force an immediate VMExit. Cache the status of the timer to avoid redundant VMREAD and VMWRITE, e.g. if the timer stays armed across multiple VMEnters/VMExits. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: VMX: immediately mark preemption timer expired only for zero valueSean Christopherson
A VMX preemption timer value of '0' at the time of VMEnter is architecturally guaranteed to cause a VMExit prior to the CPU executing any instructions in the guest. This architectural definition is in place to ensure that a previously expired timer is correctly recognized by the CPU as it is possible for the timer to reach zero and not trigger a VMexit due to a higher priority VMExit being signalled instead, e.g. a pending #DB that morphs into a VMExit. Whether by design or coincidence, commit f4124500c2c1 ("KVM: nVMX: Fully emulate preemption timer") special cased timer values of '0' and '1' to ensure prompt delivery of the VMExit. Unlike '0', a timer value of '1' has no has no architectural guarantees regarding when it is delivered. Modify the timer emulation to trigger immediate VMExit if and only if the timer value is '0', and document precisely why '0' is special. Do this even if calibration of the virtual TSC failed, i.e. VMExit will occur immediately regardless of the frequency of the timer. Making only '0' a special case gives KVM leeway to be more aggressive in ensuring the VMExit is injected prior to executing instructions in the nested guest, and also eliminates any ambiguity as to why '1' is a special case, e.g. why wasn't the threshold for a "short timeout" set to 10, 100, 1000, etc... Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: SVM: Switch to bitmap_zalloc()Andy Shevchenko
Switch to bitmap_zalloc() to show clearly what we are allocating. Besides that it returns pointer of bitmap type instead of opaque void *. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM/MMU: Fix comment in walk_shadow_page_lockless_end()Tianyu Lan
kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page() This patch is to fix the commit. Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20KVM: x86: don't reset root in kvm_mmu_setup()Wei Yang
Here is the code path which shows kvm_mmu_setup() is invoked after kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path, this means the root_hpa and prev_roots are guaranteed to be invalid. And it is not necessary to reset it again. kvm_vm_ioctl_create_vcpu() kvm_arch_vcpu_create() vmx_create_vcpu() kvm_vcpu_init() kvm_arch_vcpu_init() kvm_mmu_create() kvm_arch_vcpu_setup() kvm_mmu_setup() kvm_init_mmu() This patch set reset_roots to false in kmv_mmu_setup(). Fixes: 50c28f21d045dde8c52548f8482d456b3f0956f5 Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20kvm: mmu: Don't read PDPTEs when paging is not enabledJunaid Shahid
kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and CR4.PAE = 1. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20x86/kvm/lapic: always disable MMIO interface in x2APIC modeVitaly Kuznetsov
When VMX is used with flexpriority disabled (because of no support or if disabled with module parameter) MMIO interface to lAPIC is still available in x2APIC mode while it shouldn't be (kvm-unit-tests): PASS: apic_disable: Local apic enabled in x2APIC mode PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set FAIL: apic_disable: *0xfee00030: 50014 The issue appears because we basically do nothing while switching to x2APIC mode when APIC access page is not used. apic_mmio_{read,write} only check if lAPIC is disabled before proceeding to actual write. When APIC access is virtualized we correctly manipulate with VMX controls in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes in x2APIC mode so there's no issue. Disabling MMIO interface seems to be easy. The question is: what do we do with these reads and writes? If we add apic_x2apic_mode() check to apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will go to userspace. When lAPIC is in kernel, Qemu uses this interface to inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This somehow works with disabled lAPIC but when we're in xAPIC mode we will get a real injected MSI from every write to lAPIC. Not good. The simplest solution seems to be to just ignore writes to the region and return ~0 for all reads when we're in x2APIC mode. This is what this patch does. However, this approach is inconsistent with what currently happens when flexpriority is enabled: we allocate APIC access page and create KVM memory region so in x2APIC modes all reads and writes go to this pre-allocated page which is, btw, the same for all vCPUs. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-19xen/x86/vpmu: Zero struct pt_regs before calling into sample handling codeBoris Ostrovsky
Otherwise we may leak kernel stack for events that sample user registers. Reported-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: stable@vger.kernel.org
2018-09-19x86/paravirt: Fix some warning messagesDan Carpenter
The first argument to WARN_ONCE() is a condition. Fixes: 5800dc5c19f3 ("x86/paravirt: Fix spectre-v2 mitigations for paravirt guests") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alok Kataria <akataria@vmware.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: virtualization@lists.linux-foundation.org Cc: kernel-janitors@vger.kernel.org Link: https://lkml.kernel.org/r/20180919103553.GD9238@mwanda
2018-09-19Merge branch 'linus' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Crypto stuff from Herbert: "This push fixes a potential boot hang in ccp and an incorrect CPU capability check in aegis/morus on x86." * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: x86/aegis,morus - Do not require OSXSAVE for SSE2 crypto: ccp - add timeout support in the SEV command
2018-09-18x86/intel_rdt: Fix incorrect loop end conditionReinette Chatre
In order to determine a sane default cache allocation for a new CAT/CDP resource group, all resource groups are checked to determine which cache portions are available to share. At this time all possible CLOSIDs that can be supported by the resource is checked. This is problematic if the resource supports more CLOSIDs than another CAT/CDP resource. In this case, the number of CLOSIDs that could be allocated are fewer than the number of CLOSIDs that can be supported by the resource. Limit the check of closids to that what is supported by the system based on the minimum across all resources. Fixes: 95f0b77ef ("x86/intel_rdt: Initialize new resource group with sane defaults") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Xiaochen Shen" <xiaochen.shen@intel.com> Cc: "Chen Yu" <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/1537048707-76280-10-git-send-email-fenghua.yu@intel.com
2018-09-18x86/intel_rdt: Fix exclusive mode handling of MBA resourceReinette Chatre
It is possible for a resource group to consist out of MBA as well as CAT/CDP resources. The "exclusive" resource mode only applies to the CAT/CDP resources since MBA allocations cannot be specified to overlap or not. When a user requests a resource group to become "exclusive" then it can only be successful if there are CAT/CDP resources in the group and none of their CBMs associated with the group's CLOSID overlaps with any other resource group. Fix the "exclusive" mode setting by failing if there isn't any CAT/CDP resource in the group and ensuring that the CBM checking is only done on CAT/CDP resources. Fixes: 49f7b4efa ("x86/intel_rdt: Enable setting of exclusive mode") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Xiaochen Shen" <xiaochen.shen@intel.com> Cc: "Chen Yu" <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/1537048707-76280-9-git-send-email-fenghua.yu@intel.com
2018-09-18x86/intel_rdt: Fix incorrect loop end conditionReinette Chatre
A loop is used to check if a CAT resource's CBM of one CLOSID overlaps with the CBM of another CLOSID of the same resource. The loop is run over all CLOSIDs supported by the resource. The problem with running the loop over all CLOSIDs supported by the resource is that its number of supported CLOSIDs may be more than the number of supported CLOSIDs on the system, which is the minimum number of CLOSIDs supported across all resources. Fix the loop to only consider the number of system supported CLOSIDs, not all that are supported by the resource. Fixes: 49f7b4efa ("x86/intel_rdt: Enable setting of exclusive mode") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Xiaochen Shen" <xiaochen.shen@intel.com> Cc: "Chen Yu" <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/1537048707-76280-8-git-send-email-fenghua.yu@intel.com
2018-09-18x86/intel_rdt: Do not allow pseudo-locking of MBA resourceReinette Chatre
A system supporting pseudo-locking may have MBA as well as CAT resources of which only the CAT resources could support cache pseudo-locking. When the schemata to be pseudo-locked is provided it should be checked that that schemata does not attempt to pseudo-lock a MBA resource. Fixes: e0bdfe8e3 ("x86/intel_rdt: Support creation/removal of pseudo-locked region") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Xiaochen Shen" <xiaochen.shen@intel.com> Cc: "Chen Yu" <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/1537048707-76280-7-git-send-email-fenghua.yu@intel.com
2018-09-18x86/intel_rdt: Fix unchecked MSR accessReinette Chatre
When a new resource group is created, it is initialized with sane defaults that currently assume the resource being initialized is a CAT resource. This code path is also followed by a MBA resource that is not allocated the same as a CAT resource and as a result we encounter the following unchecked MSR access error: unchecked MSR access error: WRMSR to 0xd51 (tried to write 0x0000 000000000064) at rIP: 0xffffffffae059994 (native_write_msr+0x4/0x20) Call Trace: mba_wrmsr+0x41/0x80 update_domains+0x125/0x130 rdtgroup_mkdir+0x270/0x500 Fix the above by ensuring the initial allocation is only attempted on a CAT resource. Fixes: 95f0b77ef ("x86/intel_rdt: Initialize new resource group with sane defaults") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Xiaochen Shen" <xiaochen.shen@intel.com> Cc: "Chen Yu" <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/1537048707-76280-6-git-send-email-fenghua.yu@intel.com
2018-09-18x86/intel_rdt: Fix invalid mode warning when multiple resources are managedReinette Chatre
When multiple resources are managed by RDT, the number of CLOSIDs used is the minimum of the CLOSIDs supported by each resource. In the function rdt_bit_usage_show(), the annotated bitmask is created to depict how the CAT supporting caches are being used. During this annotated bitmask creation, each resource group is queried for its mode that is used as a label in the annotated bitmask. The maximum number of resource groups is currently assumed to be the number of CLOSIDs supported by the resource for which the information is being displayed. This is incorrect since the number of active CLOSIDs is the minimum across all resources. If information for a cache instance with more CLOSIDs than another is being generated we thus encounter a warning like: invalid mode for closid 8 WARNING: CPU: 88 PID: 1791 at [SNIP]/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c :827 rdt_bit_usage_show+0x221/0x2b0 Fix this by ensuring that only the number of supported CLOSIDs are considered. Fixes: e651901187ab8 ("x86/intel_rdt: Introduce "bit_usage" to display cache allocations details") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Xiaochen Shen" <xiaochen.shen@intel.com> Cc: "Chen Yu" <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/1537048707-76280-5-git-send-email-fenghua.yu@intel.com