path: root/include/hw/ppc
Age | Commit message | Author | Files | Lines
2020-09-08 | spapr, spapr_nvdimm: fold NVDIMM validation in the same place | Daniel Henrique Barboza | 1 | -2/+2
NVDIMM has different constraints and conditions than the regular DIMM and we'll need to add at least one more. Instead of relying on 'if (nvdimm)' conditionals in the body of spapr_memory_pre_plug(), use the existing spapr_nvdimm_validate_opts() and put all NVDIMM handling code there. Rename it to spapr_nvdimm_validate() to reflect that the function is now checking more than the nvdimm device options. This makes spapr_memory_pre_plug() a bit easier to follow, and we can tune in NVDIMM parameters and validation in the same place. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20200825215749.213536-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-09-08 | spapr/xive: Add a 'hv-prio' property to represent the KVM escalation priority | Cédric Le Goater | 1 | -0/+2
On POWER9, the KVM XIVE device uses priority 7 for the escalation interrupts. On POWER10, the host can use a reduced set of priorities and KVM will configure the escalation priority to a lower number. In any case, the guest is allowed to use priorities in a single range : [ 0 .. (maxprio - 1) ]. Introduce a 'hv-prio' property to represent the escalation priority number and use it to compute the "ibm,plat-res-int-priorities" property defining the priority ranges reserved by the hypervisor. Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20200819130843.2230799-2-clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-09-08 | spapr: Remove unnecessary DRC type-checker macros | David Gibson | 1 | -42/+1
spapr_drc.h includes typechecker macro boilerplate for the many different DRC subclasses. However, most of these types don't actually have different data in their class and/or instance, making these unneeded, unused, and in fact a bad idea. Remove them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org>
2020-08-27 | spapr: Move typedef SpaprMachineState to spapr.h | Eduardo Habkost | 3 | -19/+21
Move the typedef from spapr_irq.h to spapr.h, and use "struct SpaprMachineState" in the spapr_*.h headers (to avoid circular header dependencies). This will make future conversion to OBJECT_DECLARE* easier. Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Tested-By: Roman Bolshakov <r.bolshakov@yadro.com> Message-Id: <20200825192110.3528606-28-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-08-13 | spapr/xive: Simplify error handling of kvmppc_xive_cpu_synchronize_state() | Greg Kurz | 1 | -1/+1
Now that kvmppc_xive_cpu_get_state() returns negative on error, use that and get rid of the temporary Error object and error_propagate(). Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159707852916.1489912.8376334685349668124.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-08-13 | spapr/xive: Rework error handling of kvmppc_xive_set_source_config() | Greg Kurz | 1 | -2/+2
Since kvm_device_access() returns a negative errno on failure, convert kvmppc_xive_set_source_config() to use it for error checking. This allows getting rid of the local_err boilerplate. Propagate the return value so that callers may use it as well to check failures. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159707848764.1489912.17078842252160674523.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-08-13 | spapr/xive: Rework error handling of kvmppc_xive_[gs]et_queue_config() | Greg Kurz | 1 | -2/+2
Since kvm_device_access() returns a negative errno on failure, convert kvmppc_xive_get_queue_config() and kvmppc_xive_set_queue_config() to use it for error checking. This allows getting rid of the local_err boilerplate. Propagate the return value so that callers may use it as well to check failures. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159707847357.1489912.2032291280645236480.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-08-13 | spapr/xive: Rework error handling of kvmppc_xive_cpu_[gs]et_state() | Greg Kurz | 1 | -2/+2
kvm_set_one_reg() returns a negative errno on failure, use that instead of errno. Also propagate it to callers so they can use it to check for failures and hopefully get rid of their local_err boilerplate. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159707846665.1489912.14267225652103441921.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-08-13 | spapr/xive: Rework error handling of kvmppc_xive_cpu_connect() | Greg Kurz | 1 | -1/+1
Use error_setg_errno() instead of error_setg(strerror()). While here, use -ret instead of errno since kvm_vcpu_enable_cap() returns a negative errno on failure. Use ERRP_GUARD() to ensure that errp can be passed to error_append_hint(), and get rid of the local_err boilerplate. Propagate the return value so that callers may use it as well to check failures. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159707844549.1489912.4862921680328017645.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-08-13 | ppc/xive: Introduce dedicated kvm_irqchip_in_kernel() wrappers | Greg Kurz | 1 | -0/+1
Calls to the KVM XIVE device are guarded by kvm_irqchip_in_kernel(). This ensures that QEMU won't try to use the device if KVM is disabled or if an in-kernel irqchip isn't required. When using ic-mode=dual with the pseries machine, we have two possible interrupt controllers: XIVE and XICS. The kvm_irqchip_in_kernel() helper will return true as soon as any of the KVM devices is created. It might lure QEMU to think that the other one is also around, while it is not. This is exactly what happens with ic-mode=dual at machine init when claiming IRQ numbers, which must be done on all possible IRQ backends, eg. RTAS event sources or the PHB0 LSI table: only the KVM XICS device is active but we end up calling kvmppc_xive_source_reset_one() anyway, which fails. This doesn't cause any trouble because of another bug: kvmppc_xive_source_reset_one() lacks an error_setg() and callers don't see the failure. Most of the other kvmppc_xive_* functions have similar xive->fd checks to filter out the case when KVM XIVE isn't active. It might look safer to have idempotent functions but it doesn't really help to understand what's going on when debugging. Since we already have all the kvm_irqchip_in_kernel() in place, have the callers check xive->fd as well before calling KVM XIVE specific code. This is straightforward for the spapr specific XIVE code. Some more care is needed for the platform agnostic XIVE code since it cannot access xive->fd directly. Introduce new in_kernel() methods in some base XIVE classes for this purpose and implement them only in spapr. In all cases, we still need to call kvm_irqchip_in_kernel() so that compilers can optimize the kvmppc_xive_* calls away when CONFIG_KVM isn't defined, thus avoiding the need for stubs. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159679993438.876294.7285654331498605426.stgit@bahia.lan> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
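The two-level guard described in this message can be sketched as follows. This is a minimal Python model and not QEMU code: `KvmState`, `XiveDevice` and `xive_in_kernel` are hypothetical names, and the use of -1 for "no KVM XIVE device open" is an assumption. The only facts taken from the message are that the global check becomes true once any KVM irqchip device exists, and that an inactive KVM XIVE device is detected through its fd.

```python
from dataclasses import dataclass


@dataclass
class KvmState:
    # True as soon as ANY in-kernel irqchip device (XICS or XIVE) is created.
    irqchip_in_kernel: bool = False


@dataclass
class XiveDevice:
    # Assumption: -1 stands for "no KVM XIVE device open".
    fd: int = -1


def xive_in_kernel(kvm: KvmState, xive: XiveDevice) -> bool:
    """Hypothetical helper: the global and the per-device check must both pass."""
    return kvm.irqchip_in_kernel and xive.fd != -1


# ic-mode=dual at machine init: only the KVM XICS device is active, so the
# global check is true while the XIVE device has no fd.
kvm = KvmState(irqchip_in_kernel=True)
xive = XiveDevice()
```

With only the global check, the XIVE-specific path would be taken in this state; the combined check keeps it out until the XIVE device is really backed by KVM.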
2020-08-13 | ppc/xive: Rework setup of XiveSource::esb_mmio | Greg Kurz | 1 | -0/+6
Depending on whether XIVE is emulated or backed with a KVM XIVE device, the ESB MMIOs of a XIVE source point to an I/O memory region or a mapped memory region. This is currently handled by checking that kvm_irqchip_in_kernel() returns false in xive_source_realize(). This is a bit awkward as we usually need to do extra things when we're using the in-kernel backend, not less. But most important, we can do better: turn the existing "xive.esb" memory region into a plain container, introduce an "xive.esb-emulated" I/O subregion and rename the existing "xive.esb" subregion in the KVM code to "xive.esb-kvm". Since "xive.esb-kvm" is added with overlap and a higher priority, it prevails over "xive.esb-emulated" (ie. a guest using KVM XIVE will interact with "xive.esb-kvm" instead of the default "xive.esb-emulated" region). While here, consolidate the computation of the MMIO region size in a common helper. Suggested-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <159679992680.876294.7520540158586170894.stgit@bahia.lan> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
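The container-plus-priority arrangement can be illustrated with a small Python model. It is only a sketch of the idea, not the QEMU memory API: `Region`, `Container`, `add_overlap` and `resolve` are hypothetical names, the priority values 0 and 1 are assumptions, and both subregions are assumed to cover the same range so that priority alone decides which one a guest access reaches.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    priority: int = 0


@dataclass
class Container:
    name: str
    subregions: list = field(default_factory=list)

    def add_overlap(self, region: Region) -> None:
        # Overlapping subregions are allowed; priority breaks the tie.
        self.subregions.append(region)

    def resolve(self) -> str:
        """Name of the subregion an access to the container would hit."""
        return max(self.subregions, key=lambda r: r.priority).name


esb = Container("xive.esb")
esb.add_overlap(Region("xive.esb-emulated", priority=0))
emulated_only = esb.resolve()  # only the emulated subregion exists

# The KVM code adds its mapped region on top with a higher priority.
esb.add_overlap(Region("xive.esb-kvm", priority=1))
with_kvm = esb.resolve()
```

The emulated subregion stays in place in both cases, which is why the realize path no longer needs to test for the in-kernel backend.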
2020-07-20 | spapr: Add a new level of NUMA for GPUs | Reza Arbab | 1 | -0/+1
NUMA nodes corresponding to GPU memory currently have the same affinity/distance as normal memory nodes. Add a third NUMA associativity reference point enabling us to give GPU nodes more distance. This is guest visible information, which shouldn't change under a running guest across migration between different qemu versions, so make the change effective only in new (pseries > 5.0) machine types.

Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):

    node distances:
    node   0   1   2   3   4   5
      0:  10  40  40  40  40  40
      1:  40  10  40  40  40  40
      2:  40  40  10  40  40  40
      3:  40  40  40  10  40  40
      4:  40  40  40  40  10  40
      5:  40  40  40  40  40  10

After:

    node distances:
    node   0   1   2   3   4   5
      0:  10  40  80  80  80  80
      1:  40  10  80  80  80  80
      2:  80  80  10  80  80  80
      3:  80  80  80  10  80  80
      4:  80  80  80  80  10  80
      5:  80  80  80  80  80  10

These are the same distances as on the host, mirroring the change made to host firmware in skiboot commit f845a648b8cb ("numa/associativity: Add a new level of NUMA for GPU's"). Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Message-Id: <20200716225655.24289-1-arbab@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
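The two matrices can be reproduced with a short Python sketch. It assumes the guest derives distances the way Linux on pseries is understood to: start at 10 and double once for every associativity reference point at which two nodes fall in different domains, stopping at the first match. The domain tuples below are hypothetical values chosen to illustrate the effect of a third reference point; they are not the values QEMU puts in the device tree.

```python
LOCAL_DISTANCE = 10


def node_distance(assoc_a, assoc_b):
    """Double the distance for each reference point where the domains differ."""
    distance = LOCAL_DISTANCE
    for dom_a, dom_b in zip(assoc_a, assoc_b):
        if dom_a == dom_b:
            break
        distance *= 2
    return distance


def distance_matrix(assoc):
    return [[node_distance(a, b) for b in assoc] for a in assoc]


# Hypothetical domains per node, most local reference point first.
# Two reference points: every pair of distinct nodes differs at both levels.
before = [(n, n) for n in range(6)]
# Three reference points: normal nodes 0-1 share the outermost domain,
# GPU nodes 2-5 each sit in a domain of their own.
after = [(n, n, 0) for n in range(2)] + [(n, n, n) for n in range(2, 6)]
```

Under these assumptions, normal nodes stay at 40 from each other because they match at the new outermost level, while every pair involving a GPU node differs at all three levels and lands at 80.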
2020-06-26 | spapr: Fix typos in comments and macro indentation | Gustavo Romero | 1 | -1/+1
This commit fixes typos in spapr_vio_reg_to_irq() comments and a macro indentation. Signed-off-by: Gustavo Romero <gromero@linux.ibm.com> Message-Id: <1590710681-12873-1-git-send-email-gromero@linux.ibm.com> Acked-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-06-15 | pnv/psi: Correct the pnv-psi* devices not to be sysbus devices | Markus Armbruster | 1 | -1/+1
pnv_chip_power8_instance_init() creates a "pnv-psi-POWER8" sysbus device in a way that leaves it unplugged. pnv_chip_power9_instance_init() and pnv_chip_power10_instance_init() do the same for "pnv-psi-POWER9" and "pnv-psi-POWER10", respectively. These devices aren't actually sysbus devices. Correct that. Cc: "Cédric Le Goater" <clg@kaod.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: qemu-ppc@nongnu.org Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20200609122339.937862-18-armbru@redhat.com>
2020-05-27 | ppc/spapr: Add hotremovable flag on DIMM LMBs on drmem_v2 | Leonardo Bras | 1 | -0/+1
On reboot, all memory that was previously added using object_add and device_add is placed in this DIMM area. The new SPAPR_LMB_FLAGS_HOTREMOVABLE flag helps Linux to put this memory in the correct memory zone, so no unmovable allocations are made there, allowing the object to be easily hot-removed by device_del and object_del. This new flag was accepted in Power Architecture documentation. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Bharata B Rao <bharata@linux.ibm.com> Message-Id: <20200511200201.58537-1-leobras.c@gmail.com> [dwg: Fixed syntax error spotted by Cédric Le Goater] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-15 | Drop more @errp parameters after previous commit | Markus Armbruster | 1 | -1/+1
Several functions can't fail anymore: ich9_pm_add_properties(), device_add_bootindex_property(), ppc_compat_add_property(), spapr_caps_add_properties(), PropertyInfo.create(). Drop their @errp parameter. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200505152926.18877-16-armbru@redhat.com>
2020-05-07 | spapr: Drop CAS reboot flag | Greg Kurz | 1 | -1/+0
The CAS reboot flag is false by default and all the locations that could set it to true have been dropped. This means that all code blocks depending on the flag being set are dead code and the other code blocks should be executed always. Just do that and drop the now unneeded CAS reboot flag. Fix a comment on the way to make checkpatch happy. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158514994893.478799.11772512888322840990.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-07 | spapr/cas: Separate CAS handling from rebuilding the FDT | Alexey Kardashevskiy | 1 | -0/+7
At the moment "ibm,client-architecture-support" ("CAS") is implemented in SLOF and QEMU assists via the custom H_CAS hypercall which copies an updated flattened device tree (FDT) blob to the SLOF memory which it then uses to update its internal tree. When we enable the OpenFirmware client interface in QEMU, we won't need to copy the FDT to the guest as the client is expected to fetch the device tree using the client interface. This moves FDT rebuild out to a separate helper which is going to be called from the "ibm,client-architecture-support" handler and leaves writing FDT to the guest in the H_CAS handler. This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200310050733.29805-3-aik@ozlabs.ru> Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158514994229.478799.2178881312094922324.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-04-07 | ppc/pnv: Create BMC devices only when defaults are enabled | Cédric Le Goater | 1 | -0/+2
Commit e2392d4395dd ("ppc/pnv: Create BMC devices at machine init") introduced default BMC devices which can be a problem when the same devices are defined on the command line with:
    -device ipmi-bmc-sim,id=bmc0 -device isa-ipmi-bt,bmc=bmc0,irq=10
QEMU fails with:
    qemu-system-ppc64: error creating device tree: node: FDT_ERR_EXISTS
Use defaults_enabled() when creating the default BMC devices to let the user provide their own BMC devices using '-nodefaults'. If no BMC device is provided, output a warning but let QEMU run as this is a supported configuration. However, when multiple BMC devices are defined, stop QEMU with a clear error as the results are unexpected. Fixes: e2392d4395dd ("ppc/pnv: Create BMC devices at machine init") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20200404153655.166834-1-clg@kaod.org> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 | ppc/spapr: Add FWNMI System Reset state | Nicholas Piggin | 1 | -1/+2
The FWNMI option must deliver system reset interrupts to their registered address, and there are a few constraints on the handler addresses specified in PAPR. Add the system reset address state and checks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20200316142613.121089-4-npiggin@gmail.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 | ppc/spapr: Change FWNMI names | Nicholas Piggin | 1 | -10/+17
The option is called "FWNMI", and it involves more than just machine checks, also machine checks can be delivered without the FWNMI option, so re-name various things to reflect that. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20200316142613.121089-3-npiggin@gmail.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 | spapr: Rename DT functions to newer naming convention | David Gibson | 1 | -2/+2
In the spapr code we've been gradually moving towards a convention that functions which create pieces of the device tree are called spapr_dt_*(). This patch speeds that along by renaming most of the things that don't yet match that so that they do. For now we leave the *_dt_populate() functions which are actual methods used in the DRCClass::dt_populate method. While we're there we remove a few comments that don't really say anything useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org>
2020-03-17 | spapr/rtas: Reserve space for RTAS blob and log | Alexey Kardashevskiy | 1 | -0/+1
At the moment SLOF reserves space for RTAS and instantiates the RTAS blob, which is a 20-byte binary blob calling a hypercall. The rest of the RTAS area is a log which SLOF has no idea about but QEMU does. This moves RTAS sizing to QEMU and this overrides the size from SLOF. The only remaining problem is that SLOF copies the number of bytes it reserved (2KB for now) so QEMU needs to reserve at least this much; SLOF will be fixed separately to check that rtas-size from QEMU is enough for those 20 bytes for the H_RTAS hcall. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200316011841.99970-1-aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 | ppc/spapr: Move GPRs setup to one place | Alexey Kardashevskiy | 1 | -1/+3
At the moment "pseries" starts in SLOF which only expects the FDT blob pointer in r3. As we are going to introduce OpenFirmware support in QEMU, we will be booting OF clients directly and these expect a stack pointer in r1; Linux looks at r3/r4 for the initramdisk location (vmlinux can find this from the device tree, but zImage from distro kernels cannot). This extends spapr_cpu_set_entry_state() to take more registers. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200310050733.29805-2-aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 | spapr: Don't clamp RMA to 16GiB on new machine types | David Gibson | 1 | -0/+1
In spapr_machine_init() we clamp the size of the RMA to 16GiB and the comment saying why doesn't make a whole lot of sense. In fact, this was done because the real mode handling code elsewhere limited the RMA in TCG mode to the maximum value configurable in LPCR[RMLS], 16GiB. But,
 * Actually LPCR[RMLS] has been able to encode a 256GiB size for a very long time, we just didn't implement it properly in the softmmu
 * LPCR[RMLS] shouldn't really be relevant anyway, it only was because we used to abuse the RMOR based translation mode in order to handle the fact that we're not modelling the hypervisor parts of the cpu
We've now removed those limitations in the modelling so the 16GiB clamp no longer serves a function. However, we can't just remove the limit universally: that would break migration to earlier qemu versions, where the 16GiB RMLS limit still applies, no matter how bad the reasons for it are. So, we replace the 16GiB clamp, with a clamp to a limit defined in the machine type class. We set it to 16 GiB for machine types 4.2 and earlier, but set it to 0 meaning unlimited for the new 5.0 machine type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
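The per-machine-type clamp described in this message amounts to the following. This is a minimal Python sketch with hypothetical names (`clamp_rma`, `RMA_LIMIT_4_2_AND_EARLIER`, `RMA_LIMIT_5_0`); only the values 16 GiB and 0, and the convention that 0 means unlimited, come from the message.

```python
GiB = 1 << 30

# Limits as described in the commit message; the constant names are made up.
RMA_LIMIT_4_2_AND_EARLIER = 16 * GiB
RMA_LIMIT_5_0 = 0  # 0 means "no limit"


def clamp_rma(rma_size: int, rma_limit: int) -> int:
    """Clamp the RMA to the machine type's limit, if it has one."""
    if rma_limit:
        return min(rma_size, rma_limit)
    return rma_size


old_machine_rma = clamp_rma(32 * GiB, RMA_LIMIT_4_2_AND_EARLIER)
new_machine_rma = clamp_rma(32 * GiB, RMA_LIMIT_5_0)
```

Keeping the limit in the machine type class is what lets older machine types present the same RMA after migration while the 5.0 type drops the clamp.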
2020-03-17 | spapr: Don't attempt to clamp RMA to VRMA constraint | David Gibson | 1 | -2/+1
The Real Mode Area (RMA) is the part of memory which a guest can access when in real (MMU off) mode. Of course, for a guest under KVM, the MMU isn't really turned off, it's just in a special translation mode - Virtual Real Mode Area (VRMA) - which looks like real mode in guest mode. The mechanics of how this works when using the hash MMU (HPT) put a constraint on the size of the RMA, which depends on the size of the HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA we advertise to the guest based on this VRMA limit. There are several things wrong with this: 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum of Node 0 memory size and the VRMA limit. That will *often* work the same as clamping, but there can be other constraints on RMA size which supersede Node 0 memory size. We have real bugs caused by this (currently worked around in the guest kernel) 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where we're past the point that we can actually advertise an RMA limit to the guest 3) But most fundamentally, the VRMA limit depends on host configuration (page size) which shouldn't be visible to the guest, but this partially exposes it. This can cause problems with migration in certain edge cases, although we will mostly get away with it. In practice, this clamping is almost never applied anyway. With 64kiB pages and the normal rules for sizing of the HPT, the theoretical VRMA limit will be 4x(guest memory size) and so never hit. It will hit with 4kiB pages, where it will be (guest memory size)/4. However all mainstream distro kernels for POWER have used a 64kiB page size for at least 10 years. So, simply replace this logic with a check that the RMA we've calculated based only on guest visible configuration will fit within the host implied VRMA limit. This can break if running HPT guests on a host kernel with 4kiB page size. As noted that's very rare. 
There also exist several possible workarounds:
 * Change the host kernel to use 64kiB pages
 * Use radix MMU (RPT) guests instead of HPT
 * Use 64kiB hugepages on the host to back guest memory
 * Increase the guest memory size so that the RMA hits one of the fixed limits before the RMA limit. This is relatively easy on POWER8 which has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
 * Use a guest NUMA configuration which artificially constrains the RMA within the VRMA limit (the RMA must always fit within Node 0).
Previously, on KVM, we also temporarily reduced the rma_size to 256M so that we'd load the kernel and initrd safely, regardless of the VRMA limit. This was a) confusing, b) could significantly limit the size of images we could load and c) introduced a behavioural difference between KVM and TCG. So we remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org>
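The "4x(guest memory size)" and "(guest memory size)/4" figures quoted in this message can be checked with a little arithmetic. The sketch below rests on two assumptions that are not stated in the message: that the HPT is normally sized at 1/128 of guest memory (rounded up to a power of two), and that the VRMA limit is 2^(page_shift + hpt_shift - 7). The function names are hypothetical; the formula was picked because it is consistent with both figures.

```python
GiB = 1 << 30


def hpt_shift_for(mem_size: int) -> int:
    # Assumption: HPT size is guest memory / 128, rounded up to a power of two.
    return (mem_size - 1).bit_length() - 7


def vrma_limit(page_shift: int, hpt_shift: int) -> int:
    # Assumption: limit = 2^(page_shift + hpt_shift - 7).
    return 1 << (page_shift + hpt_shift - 7)


mem = 4 * GiB
limit_64k = vrma_limit(16, hpt_shift_for(mem))  # 64 KiB host pages
limit_4k = vrma_limit(12, hpt_shift_for(mem))   # 4 KiB host pages
```

For a 4 GiB guest this gives 16 GiB with 64 KiB pages, which the RMA can never exceed, and 1 GiB with 4 KiB pages, which is the case where the new check can fail.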
2020-03-17 | spapr: Handle pending hot plug/unplug requests at CAS | Greg Kurz | 1 | -0/+1
If a hot plug or unplug request is pending at CAS, we currently trigger a CAS reboot, which severely increases the guest boot time. This is because SLOF doesn't handle hot plug events and we had no way to fix the FDT that gets presented to the guest. We can do better thanks to recent changes in QEMU and SLOF: - we now return a full FDT to SLOF during CAS - SLOF was fixed to correctly detect any device that was either added or removed since boot time and to update its internal DT accordingly. The right solution is to process all pending hot plug/unplug requests during CAS: convert hot plugged devices to cold plugged devices and remove the hot unplugged ones, which is exactly what spapr_drc_reset() does. Also clear all hot plug events that are currently queued since they're no longer relevant. Note that SLOF cannot currently populate hot plugged PCI bridges or PHBs at CAS. Until this limitation is lifted, SLOF will reset the machine when this scenario occurs : this will allow the FDT to be fully processed when SLOF is started again (ie. the same effect as the CAS reboot that would occur anyway without this patch). Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158257222352.4102917.8984214333937947307.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-25 | Merge tag 'patchew/20200219160953.13771-1-imammedo@redhat.com' of https://github.com/patchew-project/qemu into HEAD | Paolo Bonzini | 1 | -5/+4
This series removes ad hoc RAM allocation API (memory_region_allocate_system_memory) and consolidates it around hostmem backend. It allows to
 * resolve conflicts between global -mem-prealloc and hostmem's "policy" option, fixing premature allocation before binding policy is applied
 * simplify complicated memory allocation routines which had to deal with 2 ways to allocate RAM.
 * reuse hostmem backends of a choice for main RAM without adding extra CLI options to duplicate hostmem features. A recent case was -mem-shared, to enable vhost-user on targets that don't support hostmem backends [1] (ex: s390)
 * move RAM allocation from individual boards into generic machine code and provide them with prepared MemoryRegion.
 * clean up deprecated NUMA features which were tied to the old API (see patches)
    - "numa: remove deprecated -mem-path fallback to anonymous RAM"
    - (POSTPONED, waiting on libvirt side) "forbid '-numa node,mem' for 5.0 and newer machine types"
    - (POSTPONED) "numa: remove deprecated implicit RAM distribution between nodes"
Introduce a new machine.memory-backend property and wrapper code that aliases global -mem-path and -mem-prealloc into automatically created hostmem backend properties (provided memory-backend was not set explicitly by the user). A bulk of trivial patches then follow to incrementally convert individual boards to using machine.memory-backend provided MemoryRegion.
Board conversion typically involves:
 * providing MachineClass::default_ram_size and MachineClass::default_ram_id so generic code could create default backend if user didn't explicitly provide memory-backend or -m options
 * dropping memory_region_allocate_system_memory() call
 * using convenience MachineState::ram MemoryRegion, which points to MemoryRegion allocated by ram-memdev
On top of that for some boards:
 * missing ram_size checks are added (typically these were boards with fixed ram size)
 * ram_size fixups are replaced by checks and hard errors, forcing user to provide correct "-m" values instead of ignoring it and continuing running.
After all boards are converted, the old API is removed and memory allocation routines are cleaned up.
2020-02-21 | spapr: Don't use spapr_drc_needed() in CAS code | Greg Kurz | 1 | -1/+3
We currently don't support hotplug of devices between boot and CAS. If this happens a CAS reboot is triggered. We detect this during CAS using the spapr_drc_needed() function which is essentially a VMStateDescription .needed callback. Even if the condition for CAS reboot happens to be the same as for DRC migration, it looks wrong to piggyback a migration helper for this. Introduce a helper with slightly more explicit name and use it in both CAS and DRC migration code. Since a subsequent patch will enhance this helper to cover the case of hot unplug, let's go for spapr_drc_transient(). While here convert spapr_hotplugged_dev_before_cas() to the "transient" wording as well. This doesn't change any behaviour. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158169248180.3465937.9531405453362718771.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 | spapr: Allow changing offset for -kernel image | Alexey Kardashevskiy | 1 | -0/+1
This allows moving the kernel in the guest memory. The option is useful for step debugging (as Linux is linked at 0x0); it also allows loading grub which is normally linked to run at 0x20000. This uses the existing kernel address by default. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200203032943.121178-6-aik@ozlabs.ru> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 | spapr: Add Hcalls to support PAPR NVDIMM device | Shivaprasad G Bhat | 1 | -1/+7
This patch implements a few of the necessary hcalls for the nvdimm support. PAPR semantics is such that each NVDIMM device comprises multiple SCM (Storage Class Memory) blocks. The guest requests the hypervisor to bind each of the SCM blocks of the NVDIMM device using hcalls. There can be SCM block unbind requests in case of driver errors or unplug (not supported now) use cases. The NVDIMM label read/writes are done through hcalls. Since each virtual NVDIMM device is divided into multiple SCM blocks, the bind, unbind, and queries using hcalls on those blocks can come independently. This doesn't fit well into the qemu device semantics, where the map/unmap are done at the (whole) device/object level granularity. The patch doesn't actually bind/unbind on hcalls but lets it happen at the device_add/del phase itself instead. The guest kernel makes bind/unbind requests for the virtual NVDIMM device at the region level granularity. Without interleaving, each virtual NVDIMM device is presented as a separate guest physical address range. So, there is no way a partial bind/unbind request can come for the vNVDIMM in a hcall for a subset of SCM blocks of a virtual NVDIMM. Hence it is safe to do bind/unbind everything during the device_add/del. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Message-Id: <158131059899.2897.11515211602702956854.stgit@lep8c.aus.stglabs.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 | spapr: Add NVDIMM device support | Shivaprasad G Bhat | 2 | -0/+46
Add support for NVDIMM devices for sPAPR. Piggyback on existing nvdimm device interface in QEMU to support virtual NVDIMM devices for Power. Create the required DT entries for the device (some entries have dummy values right now). The patch creates the required DT node and sends a hotplug interrupt to the guest. Guest is expected to undertake the normal DR resource add path in response and start issuing PAPR SCM hcalls. The device support is verified based on the machine version unlike x86. This is how it can be used.
For coldplug, the device is added on the qemu command line as shown below:
    -object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
    -device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
For hotplug, the device is added from the monitor as below:
    object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
    device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> [Early implementation] Message-Id: <158131058078.2897.12767731856697459923.stgit@lep8c.aus.stglabs.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
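The odd-looking backend size in the example (1073872896 bytes) is 1 GiB plus the 128 KiB label area. The Python sketch below works through that arithmetic. It assumes an SCM block size of 256 MiB and assumes that the size excluding the label must be a multiple of it; neither figure appears in the commit message, and `scm_block_count` is a hypothetical helper, not a QEMU function.

```python
KiB = 1 << 10
MiB = 1 << 20

# Assumption: one SCM block is 256 MiB.
SCM_BLOCK_SIZE = 256 * MiB


def scm_block_count(backend_size: int, label_size: int) -> int:
    """Number of SCM blocks exposed by a backend, excluding the label area."""
    usable = backend_size - label_size
    if usable <= 0 or usable % SCM_BLOCK_SIZE:
        raise ValueError(
            "size excluding the label must be a positive multiple of the SCM block size"
        )
    return usable // SCM_BLOCK_SIZE


# Values from the example: size=1073872896, label-size=128k.
blocks = scm_block_count(1073872896, 128 * KiB)
```

Under these assumptions the example device exposes four SCM blocks, and forgetting to add the label size to the backend size would make the check fail.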
2020-02-19 | ppc/{ppc440_bamboo, sam460ex}: use memdev for RAM | Igor Mammedov | 1 | -1/+1
memory_region_allocate_system_memory() API is going away, so replace it with memdev allocated MemoryRegion. The latter is initialized by generic code, so the board only needs to opt in to the memdev scheme by providing MachineClass::default_ram_id and using MachineState::ram instead of manually initializing the RAM memory region. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu> Acked-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20200219160953.13771-67-imammedo@redhat.com>
2020-02-19 | ppc/{ppc440_bamboo, sam460ex}: drop RAM size fixup | Igor Mammedov | 1 | -5/+4
If the user provides a nonsense RAM size, the board will complain and continue running with the max RAM size supported, or sometimes crash like this:
    %QEMU -M bamboo -m 1
    exec.c:1926: find_ram_offset: Assertion `size != 0' failed.
    Aborted (core dumped)
Also RAM is going to be allocated by generic code, so it won't be possible for the board to fix things up for the user. Make it an error message and exit to force the user to fix the CLI, instead of accepting nonsense CLI values. That also fixes the crash issue, since the wrongly calculated size isn't used to allocate RAM. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu> Acked-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20200219160953.13771-66-imammedo@redhat.com>
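The change from "fix up and continue" to "check and fail" can be sketched as below. This is an illustrative Python model, not the board code: `split_into_banks`, the greedy bank selection, and the list of supported bank sizes are all assumptions made for the example. What it takes from the message is the policy itself: a RAM size that cannot be represented is rejected with an error instead of being silently adjusted.

```python
MiB = 1 << 20

# Hypothetical list of bank sizes a memory controller could support.
BANK_SIZES = [256 * MiB, 128 * MiB, 64 * MiB, 32 * MiB, 16 * MiB, 8 * MiB]


def split_into_banks(ram_size: int, nr_banks: int, bank_sizes) -> list:
    """Greedily map ram_size onto banks; fail instead of fixing the size up."""
    remaining = ram_size
    banks = []
    for _ in range(nr_banks):
        # Largest supported bank size that still fits, or 0 if none does.
        size = next(
            (s for s in sorted(bank_sizes, reverse=True) if s <= remaining), 0
        )
        if size == 0:
            break
        banks.append(size)
        remaining -= size
    if remaining:
        raise ValueError(
            f"invalid RAM size: {remaining} bytes cannot be mapped to a bank"
        )
    return banks


banks_384m = split_into_banks(384 * MiB, 4, BANK_SIZES)
```

In this model a request like `-m 1` leaves 1 MiB that no bank can hold, so the caller gets an error up front rather than a zero-sized allocation later.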
2020-02-03 | migration: Include migration support for machine check handling | Aravinda Prasad | 1 | -0/+2
This patch includes migration support for machine check handling. Especially this patch blocks VM migration requests until the machine check error handling is complete as these errors are specific to the source hardware and is irrelevant on the target hardware. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [Do not set FWNMI cap in post_load, now its done in .apply hook] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Message-Id: <20200130184423.20519-7-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 | ppc: spapr: Handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls | Aravinda Prasad | 1 | -1/+3
This patch adds support in QEMU to handle the "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls.

The machine check notification address is saved when the OS issues the "ibm,nmi-register" RTAS call.

This patch also handles the case where multiple processors experience a machine check at or about the same time by handling the "ibm,nmi-interlock" call. In such cases, as per PAPR, subsequent processors serialize, waiting for the first processor to issue the "ibm,nmi-interlock" call. The second processor that also received a machine check error waits until the first processor is done reading the error log. The first processor issues the "ibm,nmi-interlock" call when the error log is consumed.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[Register fwnmi RTAS calls in core_rtas_register_types() where other RTAS calls are registered]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Message-Id: <20200130184423.20519-6-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
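The serialization described above can be modeled with a single "owner" field: the first processor to take a machine check claims the error log, later processors must wait, and the interlock call releases it. The sketch below is a single-threaded illustration with hypothetical names; it is not the QEMU implementation, which would also need real synchronization between vCPU threads.

```c
#include <stdbool.h>

#define FWNMI_NO_OWNER (-1)

/* Hypothetical state: which CPU currently owns the error log. */
typedef struct {
    int owner_cpu;
} FwnmiState;

/*
 * Called when a CPU takes a machine check. Returns true if the CPU
 * now owns the error log, false if another CPU holds it and this
 * one has to wait for the interlock.
 */
static bool fwnmi_try_claim(FwnmiState *s, int cpu)
{
    if (s->owner_cpu != FWNMI_NO_OWNER) {
        return false;
    }
    s->owner_cpu = cpu;
    return true;
}

/*
 * Models the interlock call: only the owning CPU may release the
 * error log. Returns false when the caller is not the owner.
 */
static bool fwnmi_interlock(FwnmiState *s, int cpu)
{
    if (s->owner_cpu != cpu) {
        return false;
    }
    s->owner_cpu = FWNMI_NO_OWNER;
    return true;
}
```

In this model a second processor retries fwnmi_try_claim() until the first one has called fwnmi_interlock(), which mirrors the ordering the commit message attributes to PAPR.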
2020-02-03 | target/ppc: Build rtas error log upon an MCE | Aravinda Prasad | 1 | -1/+5
Upon a machine check exception (MCE) in a guest address space, KVM causes a guest exit to enable QEMU to build and pass the error to the guest in the PAPR defined rtas error log format.

This patch builds the rtas error log, copies it to the rtas_addr and then invokes the guest registered machine check handler. The handler in the guest takes suitable action(s) depending on the type and criticality of the error. For example, if an error is unrecoverable memory corruption in an application inside the guest, then the guest kernel sends a SIGBUS to the application. For recoverable errors, the guest performs recovery actions and logs the error.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[Assume SLOF has allocated enough room for rtas error log]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200130184423.20519-5-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 | target/ppc: Handle NMI guest exit | Aravinda Prasad | 1 | -0/+10
Memory errors such as bit flips that cannot be corrected by hardware are passed on to the kernel for handling. If the memory address in error belongs to a guest, then the guest kernel is responsible for taking suitable action. Patch [1] enhances KVM to exit the guest with the exit reason set to KVM_EXIT_NMI in such cases. This patch handles the KVM_EXIT_NMI exit.

[1] https://www.spinics.net/lists/kvm-ppc/msg12637.html (e20bbd3d and related commits)

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <20200130184423.20519-4-ganeshgr@linux.ibm.com>
[dwg: #ifdefs to fix compile for 32-bit target]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 | ppc: spapr: Introduce FWNMI capability | Aravinda Prasad | 1 | -1/+4
Introduce fwnmi as an spapr capability and add a helper function that tries to enable it; the helper is used by the following patch of the series. This patch by itself does not change the existing behavior.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[eliminate cap_ppc_fwnmi, add fwnmi cap to migration state and rephrase the commit message]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200130184423.20519-3-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 | ppc/pnv: Add models for POWER8 PHB3 PCIe Host bridge | Cédric Le Goater | 3 | -0/+18
This is a model of the PCIe Host Bridge (PHB3) found on a POWER8 processor. It includes the PowerBus logic interface (PBCQ), IOMMU support, a single PCIe Gen.3 Root Complex, and support for MSI and LSI interrupt sources as found on a POWER8 system using the XICS interrupt controller.

The POWER8 processor comes in different flavors: Venice, Murano, Naples, each having a different number of PHBs. To make things simpler, the model provides 3 PHB3 per chip. Some platforms, like the Firestone, can also couple PHBs on the first chip to provide more bandwidth, but this is too specific to model in QEMU.

XICS requires some adjustment to support the PHB3 MSI. The changes are provided here but they could be decoupled in prereq patches.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200127144506.11132-3-clg@kaod.org>
[dwg: Use device_class_set_props()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 | ppc/pnv: Add models for POWER9 PHB4 PCIe Host bridge | Benjamin Herrenschmidt | 2 | -0/+18
These changes introduce models for the PCIe Host Bridge (PHB4) of the POWER9 processor. It includes the PowerBus logic interface (PBCQ), IOMMU support, a single PCIe Gen.4 Root Complex, and support for MSI and LSI interrupt sources as found on a POWER9 system using the XIVE interrupt controller.

The POWER9 processor comes with 3 PHB4 PECs (PCI Express Controllers) and each PEC can have several PHBs. By default,
  * PEC0 provides 1 PHB (PHB0)
  * PEC1 provides 2 PHBs (PHB1 and PHB2)
  * PEC2 provides 3 PHBs (PHB3, PHB4 and PHB5)

Each PEC has a set of "global" registers and some "per-stack" (per-PHB) registers. Those are organized in two XSCOM ranges, the "Nest" range and the "PCI" range; each range contains both some "PEC" registers and some "per-stack" registers.

No default device layout is provided, and PCI devices can be added on any of the available PCIe Root Ports (pcie.0 .. 2 of a Power9 chip) with address 0x0, as the firmware (skiboot) only accepts a single device per root port. To run a simple system with network and storage adapters, use command line options such as:
    -device e1000e,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pcie.0,addr=0x0
    -netdev bridge,id=net0,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=hostnet0
    -device megasas,id=scsi0,bus=pcie.1,addr=0x0
    -drive file=$disk,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2
If more are needed, include a bridge.

Multi chip is supported, each chip adding its set of PHB4 controllers and its PCI busses. The model doesn't emulate the EEH error handling.

This model is not ready for hotplug yet.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[ clg: - numerous cleanups
       - commit log
       - fix for broken LSI support
       - PHB pic printinfo
       - large QOM rework ]
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200127144506.11132-2-clg@kaod.org>
[dwg: Use device_class_set_props()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
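The default PEC-to-PHB layout listed in the commit above (PEC0 with 1 PHB, PEC1 with 2, PEC2 with 3) implies a simple mapping from a chip-wide PHB index to its PEC. A minimal sketch of that lookup follows; the table and function names are hypothetical and are not the identifiers used in the QEMU model.

```c
#define NUM_PECS 3

/* Number of PHBs behind each PEC, per the default layout above. */
static const int phbs_per_pec[NUM_PECS] = { 1, 2, 3 };

/*
 * Map a chip-wide PHB index (0..5) to the index of the PEC that
 * provides it. Returns -1 for an index outside the layout.
 */
static int phb_to_pec(int phb)
{
    int first = 0; /* first PHB index served by the current PEC */

    for (int pec = 0; pec < NUM_PECS; pec++) {
        if (phb >= first && phb < first + phbs_per_pec[pec]) {
            return pec;
        }
        first += phbs_per_pec[pec];
    }
    return -1;
}
```

For example, PHB2 resolves to PEC1 and PHB5 to PEC2, matching the list in the commit message.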
2020-02-02 | spapr: Implement get_dt_compatible() callback | Stefan Berger | 1 | -0/+1
For devices that cannot be statically initialized, implement a get_dt_compatible() callback that allows us to ask the device for the 'compatible' value.

Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200121152935.649898-3-stefanb@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 | ppc/pnv: Add support for "hostboot" mode | Cédric Le Goater | 2 | -0/+3
When the "hb-mode" option is activated on the powernv machine, the firmware is mapped at 0x8000000 and the HRMOR of each HW thread is set to the same address.

The PNOR mapping on the FW address space of the LPC bus is left enabled to let the firmware load any other images required to boot the host.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200127144154.10170-4-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 | hw/ppc/prep: Remove the deprecated "prep" machine and the OpenHackware BIOS | Thomas Huth | 1 | -1/+0
It's been deprecated since QEMU v3.1. The 40p machine should be used nowadays instead.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Hervé Poussineau <hpoussin@reactos.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20200114114617.28854-1-thuth@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 | ppc/pnv: fix check on return value of blk_getlength() | Cédric Le Goater | 1 | -1/+1
blk_getlength() returns an int64_t but the result is stored in a uint32_t. Errors (negative values) won't be caught by the check in pnv_pnor_realize() and blk_blockalign() will allocate a very large buffer in such cases.

Fixes Coverity issue CID 1412226.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200107171809.15556-3-clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
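The truncation problem described above can be reproduced in isolation. In the sketch below the argument stands in for the value returned by a length query that yields a size or a negative error code; the function names are hypothetical and no QEMU API is called.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy pattern: the 64-bit result is stored in a uint32_t before
 * being checked. A negative error code wraps to a large positive
 * value, so a "size <= 0" test (which reduces to "size == 0" for an
 * unsigned type) never sees the error.
 */
static bool error_detected_buggy(int64_t len_or_err)
{
    uint32_t size = (uint32_t)len_or_err;

    return size == 0;
}

/*
 * Fixed pattern: keep the value in a signed 64-bit variable so that
 * negative error codes are still negative when checked.
 */
static bool error_detected_fixed(int64_t len_or_err)
{
    int64_t size = len_or_err;

    return size <= 0;
}
```

With an input of -5, the buggy variant sees 4294967291 and reports no error, which is how an oversized allocation request can follow; the fixed variant rejects it.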
2020-01-08 | pnv/xive: Deduce the PnvXive pointer from XiveTCTX::xptr | Greg Kurz | 1 | -2/+0
And use it instead of reaching out to the machine. This makes it possible to get rid of pnv_get_chip().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-11-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 | xive: Add a "presenter" link property to the TCTX object | Cédric Le Goater | 1 | -3/+5
This will be used in subsequent patches to access the XIVE associated to a TCTX without reaching out to the machine through qdev_get_machine().

Signed-off-by: Cédric Le Goater <clg@kaod.org>
[ groug: - split patch
         - write subject and changelog ]
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-9-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 | ppc/pnv: Add a "pnor" const link property to the BMC internal simulator | Greg Kurz | 1 | -1/+1
This makes it possible to get rid of a call to qdev_get_machine().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-8-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 | ppc/pnv: Add an "nr-threads" property to the base chip class | Greg Kurz | 1 | -0/+1
Set it at chip creation and forward it to the cores. This makes it possible to drop a call to qdev_get_machine().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200106145645.4539-7-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 | spapr, pnv, xive: Add a "xive-fabric" link to the XIVE router | Greg Kurz | 1 | -2/+3
In order to get rid of qdev_get_machine(), first add a pointer to the XIVE fabric under the XIVE router and make it configurable through a QOM link property.

Configure it in the spapr and pnv machines. In the case of pnv, the XIVE routers are under the chip, so this is done with a QOM alias property of the POWER9 pnv chip.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-5-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>