Diffstat (limited to 'docs/specs')
-rw-r--r-- | docs/specs/acpi_hest_ghes.rst | 6
-rw-r--r-- | docs/specs/acpi_hw_reduced_hotplug.rst | 3
-rw-r--r-- | docs/specs/aspeed-intc.rst | 136
-rw-r--r-- | docs/specs/fw_cfg.rst | 4
-rw-r--r-- | docs/specs/index.rst | 6
-rw-r--r-- | docs/specs/pci-ids.rst | 10
-rw-r--r-- | docs/specs/rapl-msr.rst | 154
-rw-r--r-- | docs/specs/riscv-aia.rst | 83
-rw-r--r-- | docs/specs/riscv-iommu.rst | 116
-rw-r--r-- | docs/specs/rocker.rst (renamed from docs/specs/rocker.txt) | 181
-rw-r--r-- | docs/specs/spdm.rst | 134
-rw-r--r-- | docs/specs/tpm.rst | 8
12 files changed, 740 insertions, 101 deletions
diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst index 68f1fbe..c3e9f8d 100644 --- a/docs/specs/acpi_hest_ghes.rst +++ b/docs/specs/acpi_hest_ghes.rst @@ -67,8 +67,10 @@ Design Details (3) The address registers table contains N Error Block Address entries and N Read Ack Register entries. The size for each entry is 8-byte. The Error Status Data Block table contains N Error Status Data Block - entries. The size for each entry is 4096(0x1000) bytes. The total size - for the "etc/hardware_errors" fw_cfg blob is (N * 8 * 2 + N * 4096) bytes. + entries. The size for each entry is defined at the source code as + ACPI_GHES_MAX_RAW_DATA_LENGTH (currently 1024 bytes). The total size + for the "etc/hardware_errors" fw_cfg blob is + (N * 8 * 2 + N * ACPI_GHES_MAX_RAW_DATA_LENGTH) bytes. N is the number of the kinds of hardware error sources. (4) QEMU generates the ACPI linker/loader script for the firmware. The diff --git a/docs/specs/acpi_hw_reduced_hotplug.rst b/docs/specs/acpi_hw_reduced_hotplug.rst index 0bd3f93..3acd6fc 100644 --- a/docs/specs/acpi_hw_reduced_hotplug.rst +++ b/docs/specs/acpi_hw_reduced_hotplug.rst @@ -64,7 +64,8 @@ GED IO interface (4 byte access) 0: Memory hotplug event 1: System power down event 2: NVDIMM hotplug event - 3-31: Reserved + 3: CPU hotplug event + 4-31: Reserved **write_access:** diff --git a/docs/specs/aspeed-intc.rst b/docs/specs/aspeed-intc.rst new file mode 100644 index 0000000..9cefd7f --- /dev/null +++ b/docs/specs/aspeed-intc.rst @@ -0,0 +1,136 @@ +=========================== +ASPEED Interrupt Controller +=========================== + +AST2700 +------- +There are a total of 480 interrupt sources in AST2700. Due to the limitation of +interrupt numbers of processors, the interrupts are merged every 32 sources for +interrupt numbers greater than 127. + +There are two levels of interrupt controllers, INTC (CPU Die) and INTCIO +(I/O Die). 
+ +Interrupt Mapping +----------------- +- INTC: Handles interrupt sources 0 - 127 and integrates signals from INTCIO. +- INTCIO: Handles interrupt sources 128 - 319 independently. + +QEMU Support +------------ +Currently, only GIC 192 to 201 are supported, and their source interrupts are +from INTCIO and connected to INTC at input pin 0 and output pins 0 to 9 for +GIC 192-201. + +Design for GICINT 196 +--------------------- +The orgate has interrupt sources ranging from 0 to 31, with its output pin +connected to INTCIO "T0 GICINT_196". The output pin is then connected to INTC +"GIC_192_201" at bit 4, and its bit 4 output pin is connected to GIC 196. + +INTC GIC_192_201 Output Pin Mapping +----------------------------------- +The design of INTC GIC_192_201 have 10 output pins, mapped as following: + +==== ==== +Bit GIC +==== ==== +0 192 +1 193 +2 194 +3 195 +4 196 +5 197 +6 198 +7 199 +8 200 +9 201 +==== ==== + +AST2700 A0 +---------- +It has only one INTC controller, and currently, only GIC 128-136 is supported. +To support both AST2700 A1 and AST2700 A0, there are 10 OR gates in the INTC, +with gates 1 to 9 supporting GIC 128-136. + +Design for GICINT 132 +--------------------- +The orgate has interrupt sources ranging from 0 to 31, with its output pin +connected to INTC. The output pin is then connected to GIC 132. + +Block Diagram of GICINT 196 for AST2700 A1 and GICINT 132 for AST2700 A0 +------------------------------------------------------------------------ + +.. 
code-block:: + + |-------------------------------------------------------------------------------------------------------| + | AST2700 A1 Design | + | To GICINT196 | + | | + | ETH1 |-----------| |--------------------------| |--------------| | + | -------->|0 | | INTCIO | | orgates[0] | | + | ETH2 | 4| orgates[0]------>|inpin[0]-------->outpin[0]|------->| 0 | | + | -------->|1 5| orgates[1]------>|inpin[1]-------->outpin[1]|------->| 1 | | + | ETH3 | 6| orgates[2]------>|inpin[2]-------->outpin[2]|------->| 2 | | + | -------->|2 19| orgates[3]------>|inpin[3]-------->outpin[3]|------->| 3 OR[0:9] |-----| | + | UART0 | 20|-->orgates[4]------>|inpin[4]-------->outpin[4]|------->| 4 | | | + | -------->|7 21| orgates[5]------>|inpin[5]-------->outpin[5]|------->| 5 | | | + | UART1 | 22| orgates[6]------>|inpin[6]-------->outpin[6]|------->| 6 | | | + | -------->|8 23| orgates[7]------>|inpin[7]-------->outpin[7]|------->| 7 | | | + | UART2 | 24| orgates[8]------>|inpin[8]-------->outpin[8]|------->| 8 | | | + | -------->|9 25| orgates[9]------>|inpin[9]-------->outpin[9]|------->| 9 | | | + | UART3 | 26| |--------------------------| |--------------| | | + | ---------|10 27| | | + | UART5 | 28| | | + | -------->|11 29| | | + | UART6 | | | | + | -------->|12 30| |-----------------------------------------------------------------------| | + | UART7 | 31| | | + | -------->|13 | | | + | UART8 | OR[0:31] | | |------------------------------| |----------| | + | -------->|14 | | | INTC | | GIC | | + | UART9 | | | |inpin[0:0]--------->outpin[0] |---------->|192 | | + | -------->|15 | | |inpin[0:1]--------->outpin[1] |---------->|193 | | + | UART10 | | | |inpin[0:2]--------->outpin[2] |---------->|194 | | + | -------->|16 | | |inpin[0:3]--------->outpin[3] |---------->|195 | | + | UART11 | | |--------------> |inpin[0:4]--------->outpin[4] |---------->|196 | | + | -------->|17 | |inpin[0:5]--------->outpin[5] |---------->|197 | | + | UART12 | | |inpin[0:6]--------->outpin[6] 
|---------->|198 | | + | -------->|18 | |inpin[0:7]--------->outpin[7] |---------->|199 | | + | |-----------| |inpin[0:8]--------->outpin[8] |---------->|200 | | + | |inpin[0:9]--------->outpin[9] |---------->|201 | | + |-------------------------------------------------------------------------------------------------------| + |-------------------------------------------------------------------------------------------------------| + | ETH1 |-----------| orgates[1]------->|inpin[1]----------->outpin[10]|---------->|128 | | + | -------->|0 | orgates[2]------->|inpin[2]----------->outpin[11]|---------->|129 | | + | ETH2 | 4| orgates[3]------->|inpin[3]----------->outpin[12]|---------->|130 | | + | -------->|1 5| orgates[4]------->|inpin[4]----------->outpin[13]|---------->|131 | | + | ETH3 | 6|---->orgates[5]------->|inpin[5]----------->outpin[14]|---------->|132 | | + | -------->|2 19| orgates[6]------->|inpin[6]----------->outpin[15]|---------->|133 | | + | UART0 | 20| orgates[7]------->|inpin[7]----------->outpin[16]|---------->|134 | | + | -------->|7 21| orgates[8]------->|inpin[8]----------->outpin[17]|---------->|135 | | + | UART1 | 22| orgates[9]------->|inpin[9]----------->outpin[18]|---------->|136 | | + | -------->|8 23| |------------------------------| |----------| | + | UART2 | 24| | + | -------->|9 25| AST2700 A0 Design | + | UART3 | 26| | + | -------->|10 27| | + | UART5 | 28| | + | -------->|11 29| GICINT132 | + | UART6 | | | + | -------->|12 30| | + | UART7 | 31| | + | -------->|13 | | + | UART8 | OR[0:31] | | + | -------->|14 | | + | UART9 | | | + | -------->|15 | | + | UART10 | | | + | -------->|16 | | + | UART11 | | | + | -------->|17 | | + | UART12 | | | + | -------->|18 | | + | |-----------| | + | | + |-------------------------------------------------------------------------------------------------------| diff --git a/docs/specs/fw_cfg.rst b/docs/specs/fw_cfg.rst index 5ad47a9..31ae315 100644 --- a/docs/specs/fw_cfg.rst +++ b/docs/specs/fw_cfg.rst 
@@ -54,11 +54,11 @@ Data Register ------------- * Read/Write (writes ignored as of QEMU v2.4, but see the DMA interface) -* Location: platform dependent (IOport [#]_ or MMIO) +* Location: platform dependent (IOport\ [#placement]_ or MMIO) * Width: 8-bit (if IOport), 8/16/32/64-bit (if MMIO) * Endianness: string-preserving -.. [#] +.. [#placement] On platforms where the data register is exposed as an IOport, its port number will always be one greater than the port number of the selector register. In other words, the two ports overlap, and can not diff --git a/docs/specs/index.rst b/docs/specs/index.rst index 1484e3e..f19d73c 100644 --- a/docs/specs/index.rst +++ b/docs/specs/index.rst @@ -29,7 +29,13 @@ guest hardware that is specific to QEMU. edu ivshmem-spec pvpanic + spdm standard-vga virt-ctlr vmcoreinfo vmgenid + rapl-msr + rocker + riscv-iommu + riscv-aia + aspeed-intc diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst index c0a3dec..261b0f3 100644 --- a/docs/specs/pci-ids.rst +++ b/docs/specs/pci-ids.rst @@ -77,13 +77,17 @@ PCI devices (other than virtio): 1b36:0008 PCIe host bridge 1b36:0009 - PCI Expander Bridge (-device pxb) + PCI Expander Bridge (``-device pxb``) 1b36:000a PCI-PCI bridge (multiseat) 1b36:000b - PCIe Expander Bridge (-device pxb-pcie) + PCIe Expander Bridge (``-device pxb-pcie``) +1b36:000c + PCIe Root Port (``-device pcie-root-port``) 1b36:000d PCI xhci usb host adapter +1b36:000e + PCIe-to-PCI bridge (``-device pcie-pci-bridge``) 1b36:000f mdpy (mdev sample device), ``linux/samples/vfio-mdev/mdpy.c`` 1b36:0010 @@ -94,6 +98,8 @@ PCI devices (other than virtio): PCI ACPI ERST device (``-device acpi-erst``) 1b36:0013 PCI UFS device (``-device ufs``) +1b36:0014 + PCI RISC-V IOMMU device All these devices are documented in :doc:`index`. 
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst new file mode 100644 index 0000000..aaf0db9 --- /dev/null +++ b/docs/specs/rapl-msr.rst @@ -0,0 +1,154 @@ +================ +RAPL MSR support +================ + +The RAPL interface (Running Average Power Limit) is advertising the accumulated +energy consumption of various power domains (e.g. CPU packages, DRAM, etc.). + +The consumption is reported via MSRs (model specific registers) like +MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64 bits +registers that represent the accumulated energy consumption in micro Joules. + +Thanks to KVM's `MSR filtering <msr-filter-patch_>`__ functionality, +not all MSRs are handled by KVM. Some of them can now be handled by the +userspace (QEMU); a list of MSRs is given at VM creation time to KVM, and +a userspace exit occurs when they are accessed. + +.. _msr-filter-patch: https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/ + +At the moment the following MSRs are involved: + +.. code:: C + + #define MSR_RAPL_POWER_UNIT 0x00000606 + #define MSR_PKG_POWER_LIMIT 0x00000610 + #define MSR_PKG_ENERGY_STATUS 0x00000611 + #define MSR_PKG_POWER_INFO 0x00000614 + +The ``*_POWER_UNIT``, ``*_POWER_LIMIT``, ``*_POWER INFO`` are part of the RAPL +spec and specify the power limit of the package, provide range of parameter(min +power, max power,..) and also the information of the multiplier for the energy +counter to calculate the power. Those MSRs are populated once at the beginning +by reading the host CPU MSRs and are given back to the guest 1:1 when +requested. + +The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of +energy consumed since the last time the register was cleared. If you multiply +it with the UNIT provided above you'll get the power in micro-joules. This +counter is always increasing and it increases more or less faster depending on +the consumption of the package. 
This counter is supposed to overflow at some +point. + +Each core belonging to the same Package reading the MSR_PKG_ENERGY_STATUS (i.e +"rdmsr 0x611") will retrieve the same value. The value represents the energy +for the whole package. Whatever Core reading it will get the same value and a +core that belongs to PKG-0 will not be able to get the value of PKG-1 and +vice-versa. + +High level implementation +------------------------- + +In order to update the value of the virtual MSR, a QEMU thread is created. +The thread is basically just an infinity loop that does: + +1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in + Userspace and System) + +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where + the QEMU threads are running on. + +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads + will do what they have to do and so the energy counter will increase. + +4. Repeat 2. and 3. and calculate the delta of every metrics representing the + time spent scheduled for each QEMU thread *and* the energy spent by the + packages during the pause. + +5. Filter the vcpu threads and the non-vcpu threads. + +6. Retrieve the topology of the Virtual Machine. This helps identify which + vCPU is running on which virtual package. + +7. The total energy spent by the non-vcpu threads is divided by the number + of vcpu threads so that each vcpu thread will get an equal part of the + energy spent by the QEMU workers. + +8. Calculate the ratio of energy spent per vcpu threads. + +9. Calculate the energy for each virtual package. + +10. The virtual MSRs are updated for each virtual package. Each vCPU that + belongs to the same package will return the same value when accessing the + the MSR. + +11. Loop back to 1. + +Ratio calculation +----------------- + +In Linux, a process has an execution time associated with it. The scheduler is +dividing the time in clock ticks. 
The number of clock ticks per second can be +found by the sysconf system call. A typical value of clock ticks per second is +100. So a core can run a process at the maximum of 100 ticks per second. If a +package has 4 cores, 400 ticks maximum can be scheduled on all the cores +of the package for a period of 1 second. + +`/proc/[pid]/stat <stat_>`__ is a procfs file that can give the executed +time of a process with the [pid] as the process ID. It gives the amount +of ticks the process has been scheduled in userspace (utime) and kernel +space (stime). + +.. _stat: https://man7.org/linux/man-pages/man5/proc.5.html + +By reading those metrics for a thread, one can calculate the ratio of time the +package has spent executing the thread. + +Example: + +A 4 cores package can schedule a maximum of 400 ticks per second with 100 ticks +per second per core. If a thread was scheduled for 100 ticks between a second +on this package, that means my thread has been scheduled for 1/4 of the whole +package. With that, the calculation of the energy spent by the thread on this +package during this whole second is 1/4 of the total energy spent by the +package. + +Usage +----- + +Currently this feature is only working on an Intel CPU that has the RAPL driver +mounted and available in the sysfs. if not, QEMU fails at start-up. + +This feature is activated with -accel +kvm,rapl=true,rapl-helper-socket=/path/sock.sock + +It is important that the socket path is the same as the one +:program:`qemu-vmsr-helper` is listening to. + +qemu-vmsr-helper +---------------- + +The qemu-vmsr-helper is working very much like the qemu-pr-helper. Instead of +making persistent reservation, qemu-vmsr-helper is here to overcome the +CVE-2020-8694 which remove user access to the rapl msr attributes. + +A socket communication is established between QEMU processes that has the RAPL +MSR support activated and the qemu-vmsr-helper. 
A systemd service and socket +activation is provided in contrib/systemd/qemu-vmsr-helper.(service/socket). + +The systemd socket uses 600, like contrib/systemd/qemu-pr-helper.socket. The +socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be +changed (e.g. 660 and root:kvm for a Debian system for example). Libvirt could +also start a separate helper if needed. All in all, the policy is left to the +user. + +See the qemu-pr-helper documentation or manpage for further details. + +Current Limitations +------------------- + +- Works only on Intel host CPUs because AMD CPUs are using different MSR + addresses. + +- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the + moment. + diff --git a/docs/specs/riscv-aia.rst b/docs/specs/riscv-aia.rst new file mode 100644 index 0000000..8097e2f --- /dev/null +++ b/docs/specs/riscv-aia.rst @@ -0,0 +1,83 @@ +.. _riscv-aia: + +RISC-V AIA support for RISC-V machines +====================================== + +AIA (Advanced Interrupt Architecture) support is implemented in the ``virt`` +RISC-V machine for TCG and KVM accelerators. + +The support consists of two main modes: + +- "aia=aplic": adds one or more APLIC (Advanced Platform Level Interrupt Controller) + devices +- "aia=aplic-imsic": adds one or more APLIC device and an IMSIC (Incoming MSI + Controller) device for each CPU + +From an user standpoint, these modes will behave the same regardless of the accelerator +used. From a developer standpoint the accelerator settings will change what it being +emulated in userspace versus what is being emulated by an in-kernel irqchip. + +When running TCG, all controllers are emulated in userspace, including machine mode +(m-mode) APLIC and IMSIC (when applicable). 
+ +When running KVM: + +- no m-mode is provided, so there is no m-mode APLIC or IMSIC emulation regardless of + the AIA mode chosen +- with "aia=aplic", s-mode APLIC will be emulated by userspace +- with "aia=aplic-imsic" there are two possibilities. If no additional KVM option + is provided there will be no APLIC or IMSIC emulation in userspace, and the virtual + machine will use the provided in-kernel APLIC and IMSIC controllers. If the user + chooses to use the irqchip in split mode via "-accel kvm,kernel-irqchip=split", + s-mode APLIC will be emulated while using the s-mode IMSIC from the irqchip + +The following table summarizes how the AIA and accelerator options defines what +we will emulate in userspace: + + +.. list-table:: How AIA and accel options changes controller emulation + :widths: 25 25 25 25 25 25 25 + :header-rows: 1 + + * - Accel + - Accel props + - AIA type + - APLIC m-mode + - IMSIC m-mode + - APLIC s-mode + - IMSIC s-mode + * - tcg + - --- + - aplic + - emul + - n/a + - emul + - n/a + * - tcg + - --- + - aplic-imsic + - emul + - emul + - emul + - emul + * - kvm + - --- + - aplic + - n/a + - n/a + - emul + - n/a + * - kvm + - none + - aplic-imsic + - n/a + - n/a + - in-kernel + - in-kernel + * - kvm + - irqchip=split + - aplic-imsic + - n/a + - n/a + - emul + - in-kernel diff --git a/docs/specs/riscv-iommu.rst b/docs/specs/riscv-iommu.rst new file mode 100644 index 0000000..991d376 --- /dev/null +++ b/docs/specs/riscv-iommu.rst @@ -0,0 +1,116 @@ +.. _riscv-iommu: + +RISC-V IOMMU support for RISC-V machines +======================================== + +QEMU implements a RISC-V IOMMU emulation based on the RISC-V IOMMU spec +version 1.0 `iommu1.0.0`_. + +The emulation includes a PCI reference device (riscv-iommu-pci) and a platform +bus device (riscv-iommu-sys) that QEMU RISC-V boards can use. The 'virt' +RISC-V machine is compatible with both devices. 
+ +riscv-iommu-pci reference device +-------------------------------- + +This device implements the RISC-V IOMMU emulation as recommended by the section +"Integrating an IOMMU as a PCIe device" of `iommu1.0.0`_: a PCI device with base +class 08h, sub-class 06h and programming interface 00h. + +As a reference device it doesn't implement anything outside of the specification, +so it uses a generic default PCI ID given by QEMU: 1b36:0014. + +To include the device in the 'virt' machine: + +.. code-block:: bash + + $ qemu-system-riscv64 -M virt -device riscv-iommu-pci,[optional_pci_opts] (...) + +This will add a RISC-V IOMMU PCI device in the board following any additional +PCI parameters (like PCI bus address). The behavior of the RISC-V IOMMU is +defined by the spec but its operation is OS dependent. + +As of this writing the existing Linux kernel support `linux-v8`_, not yet merged, +does not have support for features like VFIO passthrough. The IOMMU emulation +was tested using a public Ventana Micro Systems kernel repository in +`ventana-linux`_. This kernel is based on `linux-v8`_ with additional patches that +enable features like KVM VFIO passthrough with irqbypass. Until the kernel support +is feature complete feel free to use the kernel available in the Ventana Micro Systems +mirror. + +The current Linux kernel support will use the IOMMU device to create IOMMU groups +with any eligible cards available in the system, regardless of factors such as the +order in which the devices are added in the command line. + +This means that these command lines are equivalent as far as the current +IOMMU kernel driver behaves: + +.. code-block:: bash + + $ qemu-system-riscv64 \ + -M virt,aia=aplic-imsic,aia-guests=5 \ + -device riscv-iommu-pci,addr=1.0,vendor-id=0x1efd,device-id=0xedf1 \ + -device e1000e,netdev=net1 -netdev user,id=net1,net=192.168.0.0/24 \ + -device e1000e,netdev=net2 -netdev user,id=net2,net=192.168.200.0/24 \ + (...) 
+ + $ qemu-system-riscv64 \ + -M virt,aia=aplic-imsic,aia-guests=5 \ + -device e1000e,netdev=net1 -netdev user,id=net1,net=192.168.0.0/24 \ + -device e1000e,netdev=net2 -netdev user,id=net2,net=192.168.200.0/24 \ + -device riscv-iommu-pci,addr=1.0,vendor-id=0x1efd,device-id=0xedf1 \ + (...) + +Both will create iommu groups for the two e1000e cards. + +Another thing to notice on `linux-v8`_ and `ventana-linux`_ is that the kernel driver +considers an IOMMU identified as a Rivos device, i.e. it uses Rivos vendor ID. To +use the riscv-iommu-pci device with the existing kernel support we need to emulate +a Rivos PCI IOMMU by setting 'vendor-id' and 'device-id': + +.. code-block:: bash + + $ qemu-system-riscv64 -M virt \ + -device riscv-iommu-pci,vendor-id=0x1efd,device-id=0xedf1 (...) + +Several options are available to control the capabilities of the device, namely: + +- "bus": the bus that the IOMMU device uses +- "ioatc-limit": size of the Address Translation Cache (default to 2Mb) +- "intremap": enable/disable MSI support +- "ats": enable ATS support +- "off" (Out-of-reset translation mode: 'on' for DMA disabled, 'off' for 'BARE' (passthrough)) +- "s-stage": enable s-stage support +- "g-stage": enable g-stage support +- "hpm-counters": number of hardware performance counters available. Maximum value is 31. + Default value is 31. Use 0 (zero) to disable HPM support + +riscv-iommu-sys device +---------------------- + +This device implements the RISC-V IOMMU emulation as a platform bus device that +RISC-V boards can use. + +For the 'virt' board the device is disabled by default. To enable it use the +'iommu-sys' machine option: + +.. code-block:: bash + + $ qemu-system-riscv64 -M virt,iommu-sys=on (...) + +There is no options to configure the capabilities of this device in the 'virt' +board using the QEMU command line. 
The device is configured with the following +riscv-iommu options: + +- "ioatc-limit": default value (2Mb) +- "intremap": enabled +- "ats": enabled +- "off": on (DMA disabled) +- "s-stage": enabled +- "g-stage": enabled + +.. _iommu1.0.0: https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0.0/riscv-iommu.pdf + +.. _linux-v8: https://lore.kernel.org/linux-riscv/cover.1718388908.git.tjeznach@rivosinc.com/ + +.. _ventana-linux: https://github.com/ventanamicro/linux/tree/dev-upstream diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.rst index 1857b31..3a7fc6a 100644 --- a/docs/specs/rocker.txt +++ b/docs/specs/rocker.rst @@ -1,23 +1,23 @@ Rocker Network Switch Register Programming Guide -Copyright (c) Scott Feldman <sfeldma@gmail.com> -Copyright (c) Neil Horman <nhorman@tuxdriver.com> -Version 0.11, 12/29/2014 +************************************************ -LICENSE -======= +.. + Copyright (c) Scott Feldman <sfeldma@gmail.com> + Copyright (c) Neil Horman <nhorman@tuxdriver.com> + Version 0.11, 12/29/2014 -This program is free software; you can redistribute it and/or modify -it under the terms of the GNU General Public License as published by -the Free Software Foundation; either version 2 of the License, or -(at your option) any later version. + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. -This program is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty of -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -GNU General Public License for more details. + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + GNU General Public License for more details. -SECTION 1: Introduction -======================= +Introduction +============ Overview -------- @@ -29,25 +29,25 @@ software. Notations and Conventions ------------------------- -o In register descriptions, [n:m] indicates a range from bit n to bit m, -inclusive. -o Use of leading 0x indicates a hexadecimal number. -o Use of leading 0b indicates a binary number. -o The use of RSVD or Reserved indicates that a bit or field is reserved for -future use. -o Field width is in bytes, unless otherwise noted. -o Register are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear -on read -o TLV values in network-byte-order are designated with (N). +* In register descriptions, [n:m] indicates a range from bit n to bit m, + inclusive. +* Use of leading 0x indicates a hexadecimal number. +* Use of leading 0b indicates a binary number. +* The use of RSVD or Reserved indicates that a bit or field is reserved for + future use. +* Field width is in bytes, unless otherwise noted. +* Register are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear + on read +* TLV values in network-byte-order are designated with (N). 
-SECTION 2: PCI Configuration Registers -====================================== +PCI Configuration Registers +=========================== PCI Configuration Space ----------------------- -Each switch instance registers as a PCI device with PCI configuration space: +Each switch instance registers as a PCI device with PCI configuration space:: offset width description value --------------------------------------------- @@ -74,11 +74,10 @@ Each switch instance registers as a PCI device with PCI configuration space: 0x41 1 Retry count 0x42 2 Reserved + * Assigned by sub-system implementation -* Assigned by sub-system implementation - -SECTION 3: Memory-Mapped Register Space -======================================= +Memory-Mapped Register Space +============================ There are two memory-mapped BARs. BAR0 maps device register space and is 0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in @@ -89,7 +88,7 @@ byte registers with one 4-byte access, and 8 byte registers with either two 4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses, access must be lower and then upper 4-bytes, in that order. -BAR0 device register space is organized as follows: +BAR0 device register space is organized as follows:: offset description ------------------------------------------------------ @@ -105,7 +104,7 @@ Reads to reserved registers read back as 0. No fancy stuff like write-combining is enabled on any of the registers. 
-BAR1 MSI-X register space is organized as follows: +BAR1 MSI-X register space is organized as follows:: offset description ------------------------------------------------------ @@ -113,8 +112,8 @@ BAR1 MSI-X register space is organized as follows: 0x1000-0x1fff MSI-X PBA table -SECTION 4: Interrupts, DMA, and Endianness -========================================== +Interrupts, DMA, and Endianness +=============================== PCI Interrupts -------------- @@ -122,7 +121,7 @@ PCI Interrupts The device supports only MSI-X interrupts. BAR1 memory-mapped region contains the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors. -The vector assignment is: +The vector assignment is:: vector description ----------------------------------------------------- @@ -134,7 +133,7 @@ The vector assignment is: Tx vector is even Rx vector is odd -A MSI-X vector table entry is 16 bytes: +A MSI-X vector table entry is 16 bytes:: field offset width description ------------------------------------------------------------- @@ -170,7 +169,7 @@ ring, and hardware will set this bit when the descriptor is complete. Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries. Descriptor rings' base address must be 8-byte aligned. Descriptors must be packed within ring. Each descriptor in each ring must also be aligned on an 8 -byte boundary. Each descriptor ring will have these registers: +byte boundary. Each descriptor ring will have these registers:: DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W) DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W) @@ -180,7 +179,7 @@ byte boundary. Each descriptor ring will have these registers: DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W) DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W) -Where x is descriptor ring index: +Where x is descriptor ring index:: index ring -------------------- @@ -203,14 +202,14 @@ written past TAIL. To do so would wrap the ring. 
An empty ring is when HEAD == TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and TAIL increment and modulo wrap at the ring size. -CTRL register bits: +CTRL register bits:: bit name description ------------------------------------------------------------------------ [0] CTRL_RESET Reset the descriptor ring [1:31] Reserved -All descriptor types share some common fields: +All descriptor types share some common fields:: field width description ------------------------------------------------------------------- @@ -234,7 +233,7 @@ filled in by the switch. Likewise, the switch will ignore unknown fields filled in by software. Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The -value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is: +value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is:: field width description ----------------------------- @@ -246,7 +245,7 @@ The alignment requirements for descriptors and TLVs are to avoid unaligned access exceptions in software. Note that the payload for each TLV is also 8 byte aligned. -Figure 1 shows an example descriptor buffer with two TLVs. +Figure 1 shows an example descriptor buffer with two TLVs:: <------- 8 bytes -------> @@ -316,11 +315,11 @@ network packet data. All non-network-packet TLV multi-byte values will be LE. TLV values in network-byte-order are designated with (N). -SECTION 5: Test Registers -========================= +Test Registers +============== Rocker has several test registers to support troubleshooting register access, -interrupt generation, and DMA operations: +interrupt generation, and DMA operations:: TEST_REG, offset 0x0010, 32-bit (R/W) TEST_REG64, offset 0x0018, 64-bit (R/W) @@ -338,7 +337,7 @@ for that vector. To test basic DMA operations, allocate a DMA-able host buffer and put the buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. 
Then, write to -TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are: +TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are:: operation value description ----------------------------------------------------------- @@ -351,14 +350,14 @@ issue exists. In particular, buffers that start on odd-8-byte boundary and/or span multiple PAGE sizes should be tested. -SECTION 6: Ports -================ +Ports +===== Physical and Logical Ports ------------------------------------ The switch supports up to 62 physical (front-panel) ports. Register -PORT_PHYS_COUNT returns the actual number of physical ports available: +PORT_PHYS_COUNT returns the actual number of physical ports available:: PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R) @@ -369,7 +368,7 @@ Front-panel ports and logical tunnel ports are mapped into a single 32-bit port space. A special CPU port is assigned port 0. The front-panel ports are mapped to ports 1-62. A special loopback port is assigned port 63. Logical tunnel ports are assigned ports 0x0001000-0x0001ffff. -To summarize the port assignments: +To summarize the port assignments:: port mapping ------------------------------------------------------- @@ -391,14 +390,14 @@ set/get the mode for front-panel ports, see port settings, below. Port Settings ------------- -Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS: +Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS:: PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R) Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62 read 1 for link UP and 0 for link DOWN for respective front-panel ports. 
-Other properties for front-panel ports are available via DMA CMD descriptors: +Other properties for front-panel ports are available via DMA CMD descriptors:: Get PORT_SETTINGS descriptor: @@ -438,7 +437,7 @@ Port Enable ----------- Front-panel ports are initially disabled, which means port ingress and egress -packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE: +packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE:: PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W) @@ -447,15 +446,15 @@ packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE: Default is 0. -SECTION 7: Switch Control -========================= +Switch Control +============== This section covers switch-wide register settings. Control ------- -This register is used for low level control of the switch. +This register is used for low level control of the switch:: CONTROL: offset 0x0300, 32-bit, (W) @@ -468,18 +467,18 @@ Switch ID --------- The switch has a SWITCH_ID to be used by software to uniquely identify the -switch: +switch:: SWITCH_ID: offset 0x0320, 64-bit, (R) Value is opaque to switch software and no special encoding is implied. -SECTION 8: Events -================= +Events +====== Non-I/O asynchronous events from the device are notified to the host using the -event ring. The TLV structure for events is: +event ring. The TLV structure for events is:: field width description --------------------------------------------------- @@ -491,7 +490,7 @@ event ring. The TLV structure for events is: Link Changed Event ------------------ -When link status changes on a physical port, this event is generated. +When link status changes on a physical port, this event is generated:: field width description --------------------------------------------------- @@ -510,6 +509,8 @@ driver should install to the device the MAC/VLAN on the port into the bridge table. 
Once installed, the MAC/VLAN is known on the port and this event will no longer be generated. +:: + field width description --------------------------------------------------- INFO <nest> @@ -518,8 +519,8 @@ no longer be generated. VLAN 2 VLAN ID -SECTION 9: CPU Packet Processing -================================ +CPU Packet Processing +===================== Ingress packets directed to the host CPU for further processing are delivered in the DMA RX ring. Likewise, host CPU originating packets destined to egress @@ -540,7 +541,7 @@ software that Tx is complete and software resources (e.g. skb) backing packet can be released. Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A -TLV is used for each packet fragment. +TLV is used for each packet fragment:: pkt frag 1 +–––––––+ +–+ @@ -570,7 +571,7 @@ TLV is used for each packet fragment. fig 2. -The TLVs for Tx descriptor buffer are: +The TLVs for Tx descriptor buffer are:: field width description --------------------------------------------------------------------- @@ -600,7 +601,7 @@ The TLVs for Tx descriptor buffer are: TX_FRAG_ADDR 8 DMA address of packet fragment TX_FRAG_LEN 2 Packet fragment length -Possible status return codes in descriptor on completion are: +Possible status return codes in descriptor on completion are:: DESC_COMP_ERR reason -------------------------------------------------------------------- @@ -623,7 +624,7 @@ worst-case packet size. A single Rx descriptor will contain the entire Rx packet data in one RX_FRAG. Other Rx TLVs describe and hardware offloads performed on the packet, such as checksum validation. -The TLVs for Rx descriptor buffer are: +The TLVs for Rx descriptor buffer are:: field width description --------------------------------------------------- @@ -649,7 +650,7 @@ The TLVs for Rx descriptor buffer are: Offload forward RX_FLAG indicates the device has already forwarded the packet so the host CPU should not also forward the packet. 
-Possible status return codes in descriptor on completion are: +Possible status return codes in descriptor on completion are:: DESC_COMP_ERR reason -------------------------------------------------------------------- @@ -660,14 +661,14 @@ Possible status return codes in descriptor on completion are: packet data TLV and other TLVs. -SECTION 10: OF-DPA Mode -====================== +OF-DPA Mode +=========== OF-DPA mode allows the switch to offload flow packet processing functions to hardware. An OpenFlow controller would communicate with an OpenFlow agent installed on the switch. The OpenFlow agent would (directly or indirectly) communicate with the Rocker switch driver, which in turn would program switch -hardware with flow functionality, as defined in OF-DPA. The block diagram is: +hardware with flow functionality, as defined in OF-DPA. The block diagram is:: +–––––––––––––––----–––+ | OF | @@ -696,14 +697,14 @@ OF-DPA Flow Table Interface There are commands to add, modify, delete, and get stats of flow table entries. The commands are issued using the DMA CMD descriptor ring. 
The following -commands are defined: +commands are defined:: CMD_ADD: add an entry to flow table CMD_MOD: modify an entry in flow table CMD_DEL: delete an entry from flow table CMD_GET_STATS: get stats for flow entry -TLVs for add and modify commands are: +TLVs for add and modify commands are:: field width description ---------------------------------------------------- @@ -723,14 +724,14 @@ TLVs for add and modify commands are: Additional TLVs based on flow table ID: -Table ID 0: ingress port +Table ID 0: ingress port:: field width description ---------------------------------------------------- OF_DPA_IN_PPORT 4 ingress physical port number OF_DPA_GOTO_TBL 2 goto table ID; zero to drop -Table ID 10: vlan +Table ID 10: vlan:: field width description ---------------------------------------------------- @@ -740,7 +741,7 @@ Table ID 10: vlan OF_DPA_GOTO_TBL 2 goto table ID; zero to drop OF_DPA_NEW_VLAN_ID 2 (N) new vlan ID -Table ID 20: termination mac +Table ID 20: termination mac:: field width description ---------------------------------------------------- @@ -757,7 +758,7 @@ Table ID 20: termination mac OF_DPA_OUT_PPORT 2 if specified, must be controller, set zero otherwise -Table ID 30: unicast routing +Table ID 30: unicast routing:: field width description ---------------------------------------------------- @@ -772,7 +773,7 @@ Table ID 30: unicast routing OF_DPA_GROUP_ID 4 data for GROUP action must be an L3 Unicast group entry -Table ID 40: multicast routing +Table ID 40: multicast routing:: field width description ---------------------------------------------------- @@ -797,7 +798,7 @@ Table ID 40: multicast routing OF_DPA_GROUP_ID 4 data for GROUP action must be an L3 multicast group entry -Table ID 50: bridging +Table ID 50: bridging:: field width description ---------------------------------------------------- @@ -818,7 +819,7 @@ Table ID 50: bridging restricted to CONTROLLER, set to 0 otherwise -Table ID 60: acl policy +Table ID 60: acl policy:: field 
width description ---------------------------------------------------- @@ -890,7 +891,7 @@ Table ID 60: acl policy dropped (all other instructions ignored) -TLVs for flow delete and get stats command are: +TLVs for flow delete and get stats command are:: field width description --------------------------------------------------- @@ -898,7 +899,7 @@ TLVs for flow delete and get stats command are: OF_DPA_COOKIE 8 Cookie On completion of get stats command, the descriptor buffer is written back with -the following TLVs: +the following TLVs:: field width description --------------------------------------------------- @@ -906,7 +907,7 @@ the following TLVs: OF_DPA_STAT_RX_PKTS 8 Received packets OF_DPA_STAT_TX_PKTS 8 Transmit packets -Possible status return codes in descriptor on completion are: +Possible status return codes in descriptor on completion are:: DESC_COMP_ERR command reason -------------------------------------------------------------------- @@ -928,14 +929,14 @@ Group Table Interface There are commands to add, modify, delete, and get stats of group table entries. The commands are issued using the DMA CMD descriptor ring. 
The -following commands are defined: +following commands are defined:: CMD_ADD: add an entry to group table CMD_MOD: modify an entry in group table CMD_DEL: delete an entry from group table CMD_GET_STATS: get stats for group entry -TLVs for add and modify commands are: +TLVs for add and modify commands are:: field width description ----------------------------------------------------------- @@ -969,7 +970,7 @@ TLVs for add and modify commands are: FLOW_SRC_MAC 6 (types 1, 2, 5) FLOW_DST_MAC 6 (types 1, 2) -TLVs for flow delete and get stats command are: +TLVs for flow delete and get stats command are:: field width description ----------------------------------------------------------- @@ -977,7 +978,7 @@ TLVs for flow delete and get stats command are: FLOW_GROUP_ID 2 Flow group ID On completion of get stats command, the descriptor buffer is written back with -the following TLVs: +the following TLVs:: field width description --------------------------------------------------- @@ -986,7 +987,7 @@ the following TLVs: FLOW_STAT_REF_COUNT 4 Flow reference count FLOW_STAT_BUCKET_COUNT 4 Flow bucket count -Possible status return codes in descriptor on completion are: +Possible status return codes in descriptor on completion are:: DESC_COMP_ERR command reason -------------------------------------------------------------------- diff --git a/docs/specs/spdm.rst b/docs/specs/spdm.rst new file mode 100644 index 0000000..f7de080 --- /dev/null +++ b/docs/specs/spdm.rst @@ -0,0 +1,134 @@ +====================================================== +QEMU Security Protocols and Data Models (SPDM) Support +====================================================== + +SPDM enables authentication, attestation and key exchange to assist in +providing infrastructure security enablement. It's a standard published +by the `DMTF`_. + +QEMU supports connecting to a SPDM responder implementation. This allows an +external application to emulate the SPDM responder logic for an SPDM device. 
+ +Setting up a SPDM server +======================== + +When using QEMU with SPDM devices QEMU will connect to a server which +implements the SPDM functionality. + +SPDM-Utils +---------- + +You can use `SPDM Utils`_ to emulate a responder. This is the simplest method. + +SPDM-Utils is a Linux application to manage, test and develop devices +supporting DMTF Security Protocol and Data Model (SPDM). It is written in Rust +and utilises libspdm. + +To use SPDM-Utils you will need to do the following steps. Details are included +in the SPDM-Utils README. + + 1. `Build libspdm`_ + 2. `Build SPDM Utils`_ + 3. `Run it as a server`_ + +spdm-emu +-------- + +You can use `spdm emu`_ to model the +SPDM responder. + +.. code-block:: shell + + $ cd spdm-emu + $ git submodule init; git submodule update --recursive + $ mkdir build; cd build + $ cmake -DARCH=x64 -DTOOLCHAIN=GCC -DTARGET=Debug -DCRYPTO=openssl .. + $ make -j32 + $ make copy_sample_key # Build certificates, required for SPDM authentication. + +It is worth noting that the certificates should be in compliance with +PCIe r6.1 sec 6.31.3. This means you will need to add the following to +openssl.cnf + +.. code-block:: + + subjectAltName = otherName:2.23.147;UTF8:Vendor=1b36:Device=0010:CC=010802:REV=02:SSVID=1af4:SSID=1100 + 2.23.147 = ASN1:OID:2.23.147 + +and then manually regenerate some certificates with: + +..
code-block:: shell + + $ openssl req -nodes -newkey ec:param.pem -keyout end_responder.key \ + -out end_responder.req -sha384 -batch \ + -subj "/CN=DMTF libspdm ECP384 responder cert" + + $ openssl x509 -req -in end_responder.req -out end_responder.cert \ + -CA inter.cert -CAkey inter.key -sha384 -days 3650 -set_serial 3 \ + -extensions v3_end -extfile ../openssl.cnf + + $ openssl asn1parse -in end_responder.cert -out end_responder.cert.der + + $ cat ca.cert.der inter.cert.der end_responder.cert.der > bundle_responder.certchain.der + +You can use SPDM-Utils instead as it will generate the correct certificates +automatically. + +The responder can then be launched with + +.. code-block:: shell + + $ cd bin + $ ./spdm_responder_emu --trans PCI_DOE + +Connecting an SPDM NVMe device +============================== + +Once a SPDM server is running we can start QEMU and connect to the server. + +For an NVMe device first let's set up a block we can use + +.. code-block:: shell + + $ cd qemu-spdm/linux/image + $ dd if=/dev/zero of=blknvme bs=1M count=2096 # 2GB NVMe Drive + +Then you can add this to your QEMU command line: + +.. code-block:: shell + + -drive file=blknvme,if=none,id=mynvme,format=raw \ + -device nvme,drive=mynvme,serial=deadbeef,spdm_port=2323 + +At which point QEMU will try to connect to the SPDM server. + +Note that if using x86-64 you will want to use the q35 machine instead +of the default. So the entire QEMU command might look like this + +.. code-block:: shell + + qemu-system-x86_64 -M q35 \ + --kernel bzImage \ + -drive file=rootfs.ext2,if=virtio,format=raw \ + -append "root=/dev/vda console=ttyS0" \ + -net none -nographic \ + -drive file=blknvme,if=none,id=mynvme,format=raw \ + -device nvme,drive=mynvme,serial=deadbeef,spdm_port=2323 + +.. _DMTF: + https://www.dmtf.org/standards/SPDM + +.. _SPDM Utils: + https://github.com/westerndigitalcorporation/spdm-utils + +.. _spdm emu: + https://github.com/dmtf/spdm-emu + +..
_Build libspdm: + https://github.com/westerndigitalcorporation/spdm-utils?tab=readme-ov-file#build-libspdm + +.. _Build SPDM Utils: + https://github.com/westerndigitalcorporation/spdm-utils?tab=readme-ov-file#build-the-binary + +.. _Run it as a server: + https://github.com/westerndigitalcorporation/spdm-utils#qemu-spdm-device-emulation diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst index 1ad36ad..b630a35 100644 --- a/docs/specs/tpm.rst +++ b/docs/specs/tpm.rst @@ -205,8 +205,8 @@ to be used with the passthrough backend or the swtpm backend. QEMU files related to TPM backends: - ``backends/tpm.c`` - - ``include/sysemu/tpm.h`` - - ``include/sysemu/tpm_backend.h`` + - ``include/system/tpm.h`` + - ``include/system/tpm_backend.h`` The QEMU TPM passthrough device ------------------------------- @@ -240,7 +240,7 @@ PCRs. QEMU files related to the TPM passthrough device: - ``backends/tpm/tpm_passthrough.c`` - ``backends/tpm/tpm_util.c`` - - ``include/sysemu/tpm_util.h`` + - ``include/system/tpm_util.h`` Command line to start QEMU with the TPM passthrough device using the host's @@ -301,7 +301,7 @@ command. QEMU files related to the TPM emulator device: - ``backends/tpm/tpm_emulator.c`` - ``backends/tpm/tpm_util.c`` - - ``include/sysemu/tpm_util.h`` + - ``include/system/tpm_util.h`` The following commands start the swtpm with a UnixIO control channel over a socket interface. They do not need to be run as root. |