aboutsummaryrefslogtreecommitdiff
path: root/accel/kvm
AgeCommit message (Collapse)AuthorFilesLines
8 daysMerge tag 'accel-20250704' of https://github.com/philmd/qemu into stagingStefan Hajnoczi2-13/+15
Accelerators patches - Generic API consolidation, cleanups (dead code removal, documentation added) - Remove monitor TCG 'info opcount' and @x-query-opcount - Have HVF / NVMM / WHPX use generic CPUState::vcpu_dirty field - Expose nvmm_enabled() and whpx_enabled() to common code - Have hmp_info_registers() dump vector registers # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmhnql4ACgkQ4+MsLN6t # wN6Lfg//R4h6dyAg02hyopwb/DSI97hAsD9kap15ro1qszYrIOkJcEPoE37HDi6d # O0Ls+8NPpJcnMwdghHvVaRGoIH2OY5ogXKo6UK1BbOn8iAGxRrT/IPVCyFbPmQoe # Bk78Z/wne/YgCXiW4HGHSJO5sL04AQqcFYnwjisHHf3Ox8RR85LbhWqthZluta4i # a/Y8W5UO7jfwhAl1/Zb2cU+Rv75I6xcaLQAfmbt4j+wHP52I2cjLpIYo4sCn+ULJ # AVX4q4MKrkDrr6CYPXxdGJzYEzVn9evynVcQoRzL6bLZFMpa284AzVd3kQg9NWAb # p1hvKJTA57q4XDoD50qVGLhP207VVSUcdm0r2ZJA2jag5ddoT+x2talz8/f6In1b # 7BrSM/pla8x9KvTne/ko0wSL0o2dOWyig8mBxARLZWPxk+LBVs1PBZfvn+3j1pYA # rWV25Ht4QJlUYMbe3NvEIomsVThKg8Fh3b4mEuyPM+LZ1brgmhrzJG1SF+G4fH8A # aig/RVqgNHtajSnG4A723k2/QzlvnAiT7E3dKB5FogjTcVzFRaWFKsUb4ORqsCAz # c/AheCJY4PP3pAnb0ODISSVviXwAXqCLbtZhDGhHNYl3C69EyGPPMiVxCaIxKDxU # bF7AIYhRTTMyNSbnkcRS3UDO/gZS7x5/K+/YAM9akQEYADIodYM= # =Vb39 # -----END PGP SIGNATURE----- # gpg: Signature made Fri 04 Jul 2025 06:18:06 EDT # gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE # gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [full] # Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE * tag 'accel-20250704' of https://github.com/philmd/qemu: (31 commits) MAINTAINERS: Add me as reviewer of overall accelerators section monitor/hmp-cmds-target: add CPU_DUMP_VPU in hmp_info_registers() accel: Pass AccelState argument to gdbstub_supported_sstep_flags() accel: Remove unused MachineState argument of AccelClass::setup_post() accel: Directly pass AccelState argument to AccelClass::has_memory() accel/kvm: Directly pass KVMState argument to do_kvm_create_vm() accel/kvm: Prefer local AccelState over global MachineState::accel accel/tcg: Prefer local AccelState over global current_accel() accel: Propagate AccelState to AccelClass::init_machine() accel: Keep reference to AccelOpsClass in AccelClass accel: Expose and register generic_handle_interrupt() accel/dummy: Extract 'dummy-cpus.h' header from 'system/cpus.h' accel/whpx: Expose whpx_enabled() to common code accel/nvmm: Expose nvmm_enabled() to common code accel/system: Document cpu_synchronize_state_post_init/reset() accel/system: Document cpu_synchronize_state() accel/kvm: Remove kvm_cpu_synchronize_state() stub accel/whpx: Replace @dirty field by generic CPUState::vcpu_dirty field accel/nvmm: Replace @dirty field by generic CPUState::vcpu_dirty field accel/hvf: Replace @dirty field by generic CPUState::vcpu_dirty field ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8 daysaccel: Pass AccelState argument to gdbstub_supported_sstep_flags()Philippe Mathieu-Daudé1-1/+1
In order to have AccelClass methods instrospect their state, we need to pass AccelState by argument. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250703173248.44995-37-philmd@linaro.org>
8 daysaccel: Directly pass AccelState argument to AccelClass::has_memory()Philippe Mathieu-Daudé1-2/+2
Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-Id: <20250703173248.44995-34-philmd@linaro.org>
8 daysaccel/kvm: Directly pass KVMState argument to do_kvm_create_vm()Philippe Mathieu-Daudé1-5/+2
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250703173248.44995-35-philmd@linaro.org>
8 daysaccel/kvm: Prefer local AccelState over global MachineState::accelPhilippe Mathieu-Daudé1-3/+1
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250703173248.44995-32-philmd@linaro.org>
8 daysaccel: Propagate AccelState to AccelClass::init_machine()Philippe Mathieu-Daudé1-1/+1
In order to avoid init_machine() to call current_accel(), pass AccelState along. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Message-Id: <20250703173248.44995-31-philmd@linaro.org>
8 daysaccel: Expose and register generic_handle_interrupt()Philippe Mathieu-Daudé1-0/+1
In order to dispatch over AccelOpsClass::handle_interrupt(), we need it always defined, not calling a hidden handler under the hood. Make AccelOpsClass::handle_interrupt() mandatory. Expose generic_handle_interrupt() prototype and register it for each accelerator. Suggested-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20250703173248.44995-29-philmd@linaro.org>
9 daysaccel/kvm: Reduce kvm_create_vcpu() declaration scopePhilippe Mathieu-Daudé1-1/+7
kvm_create_vcpu() is only used within the same file unit. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Message-Id: <20250703173248.44995-7-philmd@linaro.org>
9 daysmigration: close kvm after cprSteve Sistare1-0/+32
cpr-transfer breaks vfio network connectivity to and from the guest, and the host system log shows: irq bypass consumer (token 00000000a03c32e5) registration fails: -16 which is EBUSY. This occurs because KVM descriptors are still open in the old QEMU process. Close them. Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/qemu-devel/1751493538-202042-4-git-send-email-steven.sistare@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-06-23physmem: Support coordinated discarding of RAM with guest_memfdChenyi Qiang1-0/+9
A new field, attributes, was introduced in RAMBlock to link to a RamBlockAttributes object, which centralizes all guest_memfd related information (such as fd and status bitmap) within a RAMBlock. Create and initialize the RamBlockAttributes object upon ram_block_add(). Meanwhile, register the object in the target RAMBlock's MemoryRegion. After that, guest_memfd-backed RAMBlock is associated with the RamDiscardManager interface, and the users can execute RamDiscardManager specific handling. For example, VFIO will register the RamDiscardListener and get notifications when the state_change() helper invokes. As coordinate discarding of RAM with guest_memfd is now supported, only block uncoordinated discard. Tested-by: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Alexey Kardashevskiy <aik@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Link: https://lore.kernel.org/r/20250612082747.51539-6-chenyi.qiang@intel.com Signed-off-by: Peter Xu <peterx@redhat.com>
2025-06-06i386/kvm: Prefault memory on page state changeTom Lendacky1-0/+2
A page state change is typically followed by an access of the page(s) and results in another VMEXIT in order to map the page into the nested page table. Depending on the size of page state change request, this can generate a number of additional VMEXITs. For example, under SNP, when Linux is utilizing lazy memory acceptance, memory is typically accepted in 4M chunks. A page state change request is submitted to mark the pages as private, followed by validation of the memory. Since the guest_memfd currently only supports 4K pages, each page validation will result in VMEXIT to map the page, resulting in 1024 additional exits. When performing a page state change, invoke KVM_PRE_FAULT_MEMORY for the size of the page state change in order to pre-map the pages and avoid the additional VMEXITs. This helps speed up boot times. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/f5411c42340bd2f5c14972551edb4e959995e42b.1743193824.git.thomas.lendacky@amd.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-28cpu: Don't set vcpu_dirty when guest_state_protectedXiaoyao Li1-1/+3
QEMU calls kvm_arch_put_registers() when vcpu_dirty is true in kvm_vcpu_exec(). However, for confidential guest, like TDX, putting registers is disallowed due to guest state is protected. Only set vcpu_dirty to true with guest state is not protected when creating the vcpu. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-43-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-28kvm: Check KVM_CAP_MAX_VCPUS at vm levelXiaoyao Li1-1/+1
KVM with TDX support starts to report different KVM_CAP_MAX_VCPUS per different VM types. So switch to check the KVM_CAP_MAX_VCPUS at vm level. KVM still returns the global KVM_CAP_MAX_VCPUS when the KVM is old that doesn't report different value at vm level. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-31-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-28kvm: Introduce kvm_arch_pre_create_vcpu()Xiaoyao Li1-0/+5
Introduce kvm_arch_pre_create_vcpu(), to perform arch-dependent work prior to create any vcpu. This is for i386 TDX because it needs call TDX_INIT_VM before creating any vcpu. The specific implementation for i386 will be added in the future patch. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-8-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-25accel/kvm: Use target_needs_bswap()Philippe Mathieu-Daudé1-14/+16
Check whether we need to swap at runtime using target_needs_bswap(). Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250417131004.47205-3-philmd@linaro.org>
2025-04-25qom: Have class_init() take a const data argumentPhilippe Mathieu-Daudé2-2/+2
Mechanical change using gsed, then style manually adapted to pass checkpatch.pl script. Suggested-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250424194905.82506-4-philmd@linaro.org>
2025-04-23accel/kvm: move KVM_HAVE_MCE_INJECTION define to kvm-all.cPierrick Bouvier1-0/+5
This define is used only in accel/kvm/kvm-all.c, so we push directly the definition there. Add more visibility to kvm_arch_on_sigbus_vcpu() to allow removing this define from any header. The architectures defining KVM_HAVE_MCE_INJECTION are i386, x86_64 and aarch64. Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org> Message-ID: <20250325045915.994760-18-pierrick.bouvier@linaro.org>
2025-04-23include/system: Move exec/ram_addr.h to system/ram_addr.hRichard Henderson1-1/+1
Convert the existing includes with sed. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2025-04-23include/system: Move exec/memory.h to system/memory.hRichard Henderson1-1/+1
Convert the existing includes with sed -i ,exec/memory.h,system/memory.h,g Move the include within cpu-all.h into a !CONFIG_USER_ONLY block. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2025-04-17target/i386: Reset parked vCPUs together with the online onesMaciej S. Szmigiero1-3/+5
Commit 3f2a05b31ee9 ("target/i386: Reset TSCs of parked vCPUs too on VM reset") introduced a way to reset TSCs of parked vCPUs during VM reset to prevent them getting desynchronized with the online vCPUs and therefore causing the KVM PV clock to lose PVCLOCK_TSC_STABLE_BIT. The way this was done was by registering a parked vCPU-specific QEMU reset callback via qemu_register_reset(). However, it turns out that on particularly device-rich VMs QEMU reset callbacks can take a long time to execute (which isn't surprising, considering that they involve resetting all of VM devices). In particular, their total runtime can exceed the 1-second TSC synchronization window introduced in KVM commit 5d3cb0f6a8e3 ("KVM: Improve TSC offset matching"). Since the TSCs of online vCPUs are only reset from "synchronize_post_reset" AccelOps handler (which runs after all qemu_register_reset() handlers) this essentially makes that fix ineffective on these VMs. The easiest way to guarantee that these parked vCPUs are reset at the same time as the online ones (regardless how long it takes for VM devices to reset) is to piggyback on post-reset vCPU synchronization handler for one of online vCPUs - as there is no generic post-reset AccelOps handler that isn't per-vCPU. The first online vCPU was selected for that since it is easily available under "first_cpu" define. This does not create an ordering issue since the order of vCPU TSC resets does not matter. Fixes: 3f2a05b31ee9 ("target/i386: Reset TSCs of parked vCPUs too on VM reset") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/r/e8b85a5915f79aa177ca49eccf0e9b534470c1cd.1743099810.git.maciej.szmigiero@oracle.com Cc: qemu-stable@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-06accel/kvm: Remove unused 'system/cpus.h' header in kvm-cpus.hPhilippe Mathieu-Daudé1-2/+0
Missed in commit b86f59c7155 ("accel: replace struct CpusAccel with AccelOpsClass") which removed the single CpusAccel use. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250123234415.59850-7-philmd@linaro.org>
2025-03-06accel: Forward-declare AccelOpsClass in 'qemu/typedefs.h'Philippe Mathieu-Daudé1-0/+1
The heavily imported "system/cpus.h" header includes "accel-ops.h" to get AccelOpsClass type declaration. Reduce headers pressure by forward declaring it in "qemu/typedefs.h", where we already declare the AccelCPUState type. Reduce "system/cpus.h" inclusions by only including "system/accel-ops.h" when necessary. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250123234415.59850-14-philmd@linaro.org>
2025-02-12system/physmem: handle hugetlb correctly in qemu_ram_remap()William Roche1-1/+1
The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb page; hugetlb pages cannot be partially mapped. Signed-off-by: William Roche <william.roche@oracle.com> Co-developed-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250211212707.302391-2-william.roche@oracle.com Signed-off-by: Peter Xu <peterx@redhat.com>
2024-12-21Merge tag 'exec-20241220' of https://github.com/philmd/qemu into stagingStefan Hajnoczi3-13/+13
Accel & Exec patch queue - Ignore writes to CNTP_CTL_EL0 on HVF ARM (Alexander) - Add '-d invalid_mem' logging option (Zoltan) - Create QOM containers explicitly (Peter) - Rename sysemu/ -> system/ (Philippe) - Re-orderning of include/exec/ headers (Philippe) Move a lot of declarations from these legacy mixed bag headers: . "exec/cpu-all.h" . "exec/cpu-common.h" . "exec/cpu-defs.h" . "exec/exec-all.h" . "exec/translate-all" to these more specific ones: . "exec/page-protection.h" . "exec/translation-block.h" . "user/cpu_loop.h" . "user/guest-host.h" . "user/page-protection.h" # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmdlnyAACgkQ4+MsLN6t # wN6mBw//QFWi7CrU+bb8KMM53kOU9C507tjn99LLGFb5or73/umDsw6eo/b8DHBt # KIwGLgATel42oojKfNKavtAzLK5rOrywpboPDpa3SNeF1onW+99NGJ52LQUqIX6K # A6bS0fPdGG9ZzEuPpbjDXlp++0yhDcdSgZsS42fEsT7Dyj5gzJYlqpqhiXGqpsn8 # 4Y0UMxSL21K3HEexlzw2hsoOBFA3tUm2ujNDhNkt8QASr85yQVLCypABJnuoe/// # 5Ojl5wTBeDwhANET0rhwHK8eIYaNboiM9fHopJYhvyw1bz6yAu9jQwzF/MrL3s/r # xa4OBHBy5mq2hQV9Shcl3UfCQdk/vDaYaWpgzJGX8stgMGYfnfej1SIl8haJIfcl # VMX8/jEFdYbjhO4AeGRYcBzWjEJymkDJZoiSWp2NuEDi6jqIW+7yW1q0Rnlg9lay # ShAqLK5Pv4zUw3t0Jy3qv9KSW8sbs6PQxtzXjk8p97rTf76BJ2pF8sv1tVzmsidP # 9L92Hv5O34IqzBu2oATOUZYJk89YGmTIUSLkpT7asJZpBLwNM2qLp5jO00WVU0Sd # +kAn324guYPkko/TVnjC/AY7CMu55EOtD9NU35k3mUAnxXT9oDUeL4NlYtfgrJx6 # x1Nzr2FkS68+wlPAFKNSSU5lTjsjNaFM0bIJ4LCNtenJVP+SnRo= # =cjz8 # -----END PGP SIGNATURE----- # gpg: Signature made Fri 20 Dec 2024 11:45:20 EST # gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE # gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [unknown] # gpg: WARNING: This key is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE * tag 'exec-20241220' of https://github.com/philmd/qemu: (59 commits) util/qemu-timer: fix indentation meson: Do not define CONFIG_DEVICES on user emulation system/accel-ops: Remove unnecessary 'exec/cpu-common.h' header system/numa: Remove unnecessary 'exec/cpu-common.h' header hw/xen: Remove unnecessary 'exec/cpu-common.h' header target/mips: Drop left-over comment about Jazz machine target/mips: Remove tswap() calls in semihosting uhi_fstat_cb() target/xtensa: Remove tswap() calls in semihosting simcall() helper accel/tcg: Un-inline translator_is_same_page() accel/tcg: Include missing 'exec/translation-block.h' header accel/tcg: Move tcg_cflags_has/set() to 'exec/translation-block.h' accel/tcg: Restrict curr_cflags() declaration to 'internal-common.h' qemu/coroutine: Include missing 'qemu/atomic.h' header exec/translation-block: Include missing 'qemu/atomic.h' header accel/tcg: Declare cpu_loop_exit_requested() in 'exec/cpu-common.h' exec/cpu-all: Include 'cpu.h' earlier so MMU_USER_IDX is always defined target/sparc: Move sparc_restore_state_to_opc() to cpu.c target/sparc: Uninline cpu_get_tb_cpu_state() target/loongarch: Declare loongarch_cpu_dump_state() locally user: Move various declarations out of 'exec/exec-all.h' ... Conflicts: hw/char/riscv_htif.c hw/intc/riscv_aplic.c target/s390x/cpu.c Apply sysemu header path changes to not in the pull request. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-12-20include: Rename sysemu/ -> system/Philippe Mathieu-Daudé3-13/+13
Headers in include/sysemu/ are not only related to system *emulation*, they are also used by virtualization. Rename as system/ which is clearer. Files renamed manually then mechanical change using sed tool. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Tested-by: Lei Yang <leiyang@redhat.com> Message-Id: <20241203172445.28576-1-philmd@linaro.org>
2024-12-19target/i386: Reset TSCs of parked vCPUs too on VM resetMaciej S. Szmigiero1-0/+11
Since commit 5286c3662294 ("target/i386: properly reset TSC on reset") QEMU writes the special value of "1" to each online vCPU TSC on VM reset to reset it. However parked vCPUs don't get that handling and due to that their TSCs get desynchronized when the VM gets reset. This in turn causes KVM to turn off PVCLOCK_TSC_STABLE_BIT in its exported PV clock. Note that KVM has no understanding of vCPU being currently parked. Without PVCLOCK_TSC_STABLE_BIT the sched clock is marked unstable in the guest's kvm_sched_clock_init(). This causes a performance regressions to show in some tests. Fix this issue by writing the special value of "1" also to TSCs of parked vCPUs on VM reset. Reproducing the issue: 1) Boot a VM with "-smp 2,maxcpus=3" or similar 2) device_add host-x86_64-cpu,id=vcpu,node-id=0,socket-id=0,core-id=2,thread-id=0 3) Wait a few seconds 4) device_del vcpu 5) Inside the VM run: # echo "t" >/proc/sysrq-trigger; dmesg | grep sched_clock_stable Observe the sched_clock_stable() value is 1. 6) Reboot the VM 7) Once the VM boots once again run inside it: # echo "t" >/proc/sysrq-trigger; dmesg | grep sched_clock_stable Observe the sched_clock_stable() value is now 0. Fixes: 5286c3662294 ("target/i386: properly reset TSC on reset") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/r/5a605a88e9a231386dc803c60f5fed9b48108139.1734014926.git.maciej.szmigiero@oracle.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-19kvm: consistently return 0/-errno from kvm_convert_memoryPaolo Bonzini1-4/+4
In case of incorrect parameters, kvm_convert_memory() was returning -1 instead of -EINVAL. The guest won't notice because it will move anyway to RUN_STATE_INTERNAL_ERROR, but fix this for consistency and clarity. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17accel/kvm: check for KVM_CAP_MEMORY_ATTRIBUTES on vmPaolo Bonzini1-6/+6
The exact set of available memory attributes can vary by VM. In the future it might vary depending on enabled capabilities, too. Query the extension on the VM level instead of on the KVM level, and only after architecture-specific initialization. Inspired by an analogous patch by Tom Dohrmann. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17accel/kvm: check for KVM_CAP_MULTI_ADDRESS_SPACE on vmPaolo Bonzini1-6/+6
KVM_CAP_MULTI_ADDRESS_SPACE used to be a global capability, but with the introduction of AMD SEV-SNP confidential VMs, the number of address spaces can vary by VM type. Query the extension on the VM level instead of on the KVM level. Inspired by an analogous patch by Tom Dohrmann. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17accel/kvm: check for KVM_CAP_READONLY_MEM on VMTom Dohrmann1-1/+1
KVM_CAP_READONLY_MEM used to be a global capability, but with the introduction of AMD SEV-SNP confidential VMs, this extension is not always available on all VM types [1,2]. Query the extension on the VM level instead of on the KVM level. [1] https://patchwork.kernel.org/project/kvm/patch/20240809190319.1710470-2-seanjc@google.com/ [2] https://patchwork.kernel.org/project/kvm/patch/20240902144219.3716974-1-erbse.13@gmx.de/ Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Tom Dohrmann <erbse.13@gmx.de> Link: https://lore.kernel.org/r/20240903062953.3926498-1-erbse.13@gmx.de Cc: qemu-stable@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17KVM: Rename KVMState->nr_slots to nr_slots_maxPeter Xu1-6/+6
This value used to reflect the maximum supported memslots from KVM kernel. Rename it to be clearer. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240917163835.194664-5-peterx@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17KVM: Rename KVMMemoryListener.nr_used_slots to nr_slots_usedPeter Xu1-3/+3
This will make all nr_slots counters to be named in the same manner. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240917163835.194664-4-peterx@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17KVM: Define KVM_MEMSLOTS_NUM_MAX_DEFAULTPeter Xu1-1/+3
Make the default max nr_slots a macro, it's only used when KVM reports nothing. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240917163835.194664-3-peterx@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17KVM: Dynamic sized kvm memslots arrayPeter Xu2-15/+73
Zhiyi reported an infinite loop issue in VFIO use case. The cause of that was a separate discussion, however during that I found a regression of dirty sync slowness when profiling. Each KVMMemoryListerner maintains an array of kvm memslots. Currently it's statically allocated to be the max supported by the kernel. However after Linux commit 4fc096a99e ("KVM: Raise the maximum number of user memslots"), the max supported memslots reported now grows to some number large enough so that it may not be wise to always statically allocate with the max reported. What's worse, QEMU kvm code still walks all the allocated memslots entries to do any form of lookups. It can drastically slow down all memslot operations because each of such loop can run over 32K times on the new kernels. Fix this issue by making the memslots to be allocated dynamically. Here the initial size was set to 16 because it should cover the basic VM usages, so that the hope is the majority VM use case may not even need to grow at all (e.g. if one starts a VM with ./qemu-system-x86_64 by default it'll consume 9 memslots), however not too large to waste memory. There can also be even better way to address this, but so far this is the simplest and should be already better even than before we grow the max supported memslots. For example, in the case of above issue when VFIO was attached on a 32GB system, there are only ~10 memslots used. So it could be good enough as of now. In the above VFIO context, measurement shows that the precopy dirty sync shrinked from ~86ms to ~3ms after this patch applied. It should also apply to any KVM enabled VM even without VFIO. NOTE: we don't have a FIXES tag for this patch because there's no real commit that regressed this in QEMU. Such behavior existed for a long time, but only start to be a problem when the kernel reports very large nr_slots_max value. However that's pretty common now (the kernel change was merged in 2021) so we attached cc:stable because we'll want this change to be backported to stable branches. Cc: qemu-stable <qemu-stable@nongnu.org> Reported-by: Zhiyi Guo <zhguo@redhat.com> Tested-by: Zhiyi Guo <zhguo@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20240917163835.194664-2-peterx@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03kvm: Allow kvm_arch_get/put_registers to accept Error**Julia Suvorova1-9/+32
This is necessary to provide discernible error messages to the caller. Signed-off-by: Julia Suvorova <jusual@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240927104743.218468-2-jusual@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03accel/kvm: refactor dirty ring setupAni Sinha1-38/+50
Refactor setting up of dirty ring code in kvm_init() so that is can be reused in the future patchsets. Signed-off-by: Ani Sinha <anisinha@redhat.com> Link: https://lore.kernel.org/r/20240912061838.4501-1-anisinha@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03kvm: refactor core virtual machine creation into its own functionAni Sinha1-33/+56
Refactoring the core logic around KVM_CREATE_VM into its own separate function so that it can be called from other functions in subsequent patches. There is no functional change in this patch. CC: pbonzini@redhat.com CC: zhao1.liu@intel.com Signed-off-by: Ani Sinha <anisinha@redhat.com> Link: https://lore.kernel.org/r/20240808113838.1697366-1-anisinha@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-02kvm: replace fprintf with error_report()/printf() in kvm_init()Ani Sinha1-22/+18
error_report() is more appropriate for error situations. Replace fprintf with error_report() and error_printf() as appropriate. Some improvement in error reporting also happens as a part of this change. For example: From: $ ./qemu-system-x86_64 --accel kvm Could not access KVM kernel module: No such file or directory To: $ ./qemu-system-x86_64 --accel kvm qemu-system-x86_64: --accel kvm: Could not access KVM kernel module: No such file or directory CC: qemu-trivial@nongnu.org CC: zhao1.liu@intel.com CC: armbru@redhat.com Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Ani Sinha <anisinha@redhat.com> Link: https://lore.kernel.org/r/20240828124539.62672-1-anisinha@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-19kvm: Remove unreachable code in kvm_dirty_ring_reaper_thread()Peter Maydell1-5/+1
The code at the tail end of the loop in kvm_dirty_ring_reaper_thread() is unreachable, because there is no way for execution to leave the loop. Replace it with a g_assert_not_reached(). (The code has always been unreachable, right from the start when the function was added in commit b4420f198dd8.) Resolves: Coverity CID 1547687 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-id: 20240815131206.3231819-3-peter.maydell@linaro.org
2024-09-19kvm: Make 'mmap_size' be 'int' in kvm_init_vcpu(), do_kvm_destroy_vcpu()Peter Maydell1-2/+2
In kvm_init_vcpu()and do_kvm_destroy_vcpu(), the return value from kvm_ioctl(..., KVM_GET_VCPU_MMAP_SIZE, ...) is an 'int', but we put it into a 'long' logal variable mmap_size. Coverity then complains that there might be a truncation when we copy that value into the 'int ret' which we use for returning a value in an error-exit codepath. This can't ever actually overflow because the value was in an 'int' to start with, but it makes more sense to use 'int' for mmap_size so we don't do the widen-then-narrow sequence in the first place. Resolves: Coverity CID 1547515 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-id: 20240815131206.3231819-2-peter.maydell@linaro.org
2024-09-13kvm: Use 'unsigned long' for request argument in functions wrapping ioctl()Johannes Stoelp2-8/+8
Change the data type of the ioctl _request_ argument from 'int' to 'unsigned long' for the various accel/kvm functions which are essentially wrappers around the ioctl() syscall. The correct type for ioctl()'s 'request' argument is confused: * POSIX defines the request argument as 'int' * glibc uses 'unsigned long' in the prototype in sys/ioctl.h * the glibc info documentation uses 'int' * the Linux manpage uses 'unsigned long' * the Linux implementation of the syscall uses 'unsigned int' If we wrap ioctl() with another function which uses 'int' as the type for the request argument, then requests with the 0x8000_0000 bit set will be sign-extended when the 'int' is cast to 'unsigned long' for the call to ioctl(). On x86_64 one such example is the KVM_IRQ_LINE_STATUS request. Bit requests with the _IOC_READ direction bit set, will have the high bit set. Fortunately the Linux Kernel truncates the upper 32bit of the request on 64bit machines (because it uses 'unsigned int', and see also Linus Torvalds' comments in https://sourceware.org/bugzilla/show_bug.cgi?id=14362 ) so this doesn't cause active problems for us. However it is more consistent to follow the glibc ioctl() prototype when we define functions that are essentially wrappers around ioctl(). This resolves a Coverity issue where it points out that in kvm_get_xsave() we assign a value (KVM_GET_XSAVE or KVM_GET_XSAVE2) to an 'int' variable which can't hold it without overflow. Resolves: Coverity CID 1547759 Signed-off-by: Johannes Stoelp <johannes.stoelp@gmail.com> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20240815122747.3053871-1-peter.maydell@linaro.org [PMM: Rebased patch, adjusted commit message, included note about Coverity fix, updated the type of the local var in kvm_get_xsave, updated the comment in the KVMState struct definition] Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2024-08-01accel/kvm/kvm-all: Fixes the missing break in vCPU unpark logicSalil Mehta1-0/+1
Loop should exit prematurely on successfully finding out the parked vCPU (struct KVMParkedVcpu) in the 'struct KVMState' maintained 'kvm_parked_vcpus' list of parked vCPUs. Fixes: Coverity CID 1558552 Fixes: 08c3286822 ("accel/kvm: Extract common KVM vCPU {creation,parking} code") Reported-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Message-id: 20240725145132.99355-1-salil.mehta@huawei.com Suggested-by: Peter Maydell <peter.maydell@linaro.org> Message-ID: <CAFEAcA-3_d1c7XSXWkFubD-LsW5c5i95e6xxV09r2C9yGtzcdA@mail.gmail.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2024-07-26accel/kvm: Introduce kvm_create_and_park_vcpu() helperHarsh Prateek Bora1-0/+12
There are distinct helpers for creating and parking a KVM vCPU. However, there can be cases where a platform needs to create and immediately park the vCPU during early stages of vcpu init which can later be reused when vcpu thread gets initialized. This would help detect failures with kvm_create_vcpu at an early stage. Suggested-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
2024-07-24Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into stagingRichard Henderson1-0/+27
* target/i386/kvm: support for reading RAPL MSRs using a helper program * hpet: emulation improvements # -----BEGIN PGP SIGNATURE----- # # iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmaelL4UHHBib256aW5p # QHJlZGhhdC5jb20ACgkQv/vSX3jHroMXoQf+K77lNlHLETSgeeP3dr7yZPOmXjjN # qFY/18jiyLw7MK1rZC09fF+n9SoaTH8JDKupt0z9M1R10HKHLIO04f8zDE+dOxaE # Rou3yKnlTgFPGSoPPFr1n1JJfxtYlLZRoUzaAcHUaa4W7JR/OHJX90n1Rb9MXeDk # jV6P0v1FWtIDdM6ERm9qBGoQdYhj6Ra2T4/NZKJFXwIhKEkxgu4yO7WXv8l0dxQz # jE4fKotqAvrkYW1EsiVZm30lw/19duhvGiYeQXoYhk8KKXXjAbJMblLITSNWsCio # 3l6Uud/lOxekkJDAq5nH3H9hCBm0WwvwL+0vRf3Mkr+/xRGvrhtmUdp8NQ== # =00mB # -----END PGP SIGNATURE----- # gpg: Signature made Tue 23 Jul 2024 03:19:58 AM AEST # gpg: using RSA key F13338574B662389866C7682BFFBD25F78C7AE83 # gpg: issuer "pbonzini@redhat.com" # gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full] # gpg: aka "Paolo Bonzini <pbonzini@redhat.com>" [full] * tag 'for-upstream' of https://gitlab.com/bonzini/qemu: hpet: avoid timer storms on periodic timers hpet: store full 64-bit target value of the counter hpet: accept 64-bit reads and writes hpet: place read-only bits directly in "new_val" hpet: remove unnecessary variable "index" hpet: ignore high bits of comparator in 32-bit mode hpet: fix and cleanup persistence of interrupt status Add support for RAPL MSRs in KVM/Qemu tools: build qemu-vmsr-helper qio: add support for SO_PEERCRED for socket channel target/i386: do not crash if microvm guest uses SGX CPUID leaves Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-07-22accel/kvm: Extract common KVM vCPU {creation,parking} codeSalil Mehta3-34/+67
KVM vCPU creation is done once during the vCPU realization when Qemu vCPU thread is spawned. This is common to all the architectures as of now. Hot-unplug of vCPU results in destruction of the vCPU object in QOM but the corresponding KVM vCPU object in the Host KVM is not destroyed as KVM doesn't support vCPU removal. Therefore, its representative KVM vCPU object/context in Qemu is parked. Refactor architecture common logic so that some APIs could be reused by vCPU Hotplug code of some architectures likes ARM, Loongson etc. Update new/old APIs with trace events. New APIs qemu_{create,park,unpark}_vcpu() can be externally called. No functional change is intended here. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Xianglai Li <lixianglai@loongson.cn> Tested-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Message-Id: <20240716111502.202344-2-salil.mehta@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-22Add support for RAPL MSRs in KVM/QemuAnthony Harivel1-0/+27
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL interface (Running Average Power Limit) for advertising the accumulated energy consumption of various power domains (e.g. CPU packages, DRAM, etc.). The consumption is reported via MSRs (model specific registers) like MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64 bits registers that represent the accumulated energy consumption in micro Joules. They are updated by microcode every ~1ms. For now, KVM always returns 0 when the guest requests the value of these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle these MSRs dynamically in userspace. To limit the amount of system calls for every MSR call, create a new thread in QEMU that updates the "virtual" MSR values asynchronously. Each vCPU has its own vMSR to reflect the independence of vCPUs. The thread updates the vMSR values with the ratio of energy consumed of the whole physical CPU package the vCPU thread runs on and the thread's utime and stime values. All other non-vCPU threads are also taken into account. Their energy consumption is evenly distributed among all vCPUs threads running on the same physical CPU package. To overcome the problem that reading the RAPL MSR requires priviliged access, a socket communication between QEMU and the qemu-vmsr-helper is mandatory. You can specified the socket path in the parameter. This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock Actual limitation: - Works only on Intel host CPU because AMD CPUs are using different MSR adresses. - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the moment. Signed-off-by: Anthony Harivel <aharivel@redhat.com> Link: https://lore.kernel.org/r/20240522153453.1230389-4-aharivel@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-17accel/kvm/kvm-all: Fix superfluous trailing semicolonZhao Liu1-1/+1
Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2024-06-24gdbstub: move enums into separate headerAlex Bennée1-1/+1
This is an experiment to further reduce the amount we throw into the exec headers. It might not be as useful as I initially thought because just under half of the users also need gdbserver_start(). Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240620152220.2192768-3-alex.bennee@linaro.org>
2024-06-17migration/dirtyrate: Fix segmentation faultMasato Imai1-1/+1
Since the kvm_dirty_ring_enabled function accesses a null kvm_state pointer when the KVM acceleration parameter is not specified, running calc_dirty_rate with the -r or -b option causes a segmentation fault. Signed-off-by: Masato Imai <mii@sfc.wide.ad.jp> Message-ID: <20240507025010.1968881-1-mii@sfc.wide.ad.jp> [Assert kvm_state when kvm_dirty_ring_enabled was called to fix it. - Hyman] Signed-off-by: Hyman Huang <yong.huang@smartx.com>
2024-06-04cpu: move Qemu[Thread|Cond] setup into common codeAlex Bennée1-3/+0
Aside from the round robin threads this is all common code. By moving the halt_cond setup we also no longer need hacks to work around the race between QOM object creation and thread creation. It is a little ugly to free stuff up for the round robin thread but better it deal with its own specialises than making the other accelerators jump through hoops. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Message-ID: <20240530194250.1801701-3-alex.bennee@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>