Diffstat (limited to 'docs')
37 files changed, 2850 insertions, 473 deletions
diff --git a/docs/about/build-platforms.rst b/docs/about/build-platforms.rst index 5252155..8ecbd6b 100644 --- a/docs/about/build-platforms.rst +++ b/docs/about/build-platforms.rst @@ -101,7 +101,7 @@ Python runtime option of the ``configure`` script to point QEMU to a supported version of the Python runtime. - As of QEMU |version|, the minimum supported version of Python is 3.8. + As of QEMU |version|, the minimum supported version of Python is 3.9. Python build dependencies Some of QEMU's build dependencies are written in Python. Usually these @@ -118,9 +118,14 @@ Rust build dependencies include bindgen or have an older version, it is recommended to install a newer version using ``cargo install bindgen-cli``. - Developers may want to use Cargo-based tools in the QEMU source tree; - this requires Cargo 1.74.0. Note that Cargo is not required in order - to build QEMU. + QEMU requires Rust 1.77.0. This is available on all supported platforms + with one exception, namely the ``mips64el`` architecture on Debian bookworm. + For all other architectures, Debian bookworm provides a new-enough Rust + compiler in the ``rustc-web`` package. + + Also, on Ubuntu 22.04 or 24.04 this requires the ``rustc-1.77`` + (or newer) package. The path to ``rustc`` and ``rustdoc`` must be + provided manually to the configure script. Optional build dependencies Build components whose absence does not affect the ability to build QEMU diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst index 0538144..4203713 100644 --- a/docs/about/deprecated.rst +++ b/docs/about/deprecated.rst @@ -156,19 +156,41 @@ threads (for example, it only reports source side of multifd threads, without reporting any destination threads, or non-multifd source threads). For debugging purpose, please use ``-name $VM,debug-threads=on`` instead. 
-Incorrectly typed ``device_add`` arguments (since 6.2) -'''''''''''''''''''''''''''''''''''''''''''''''''''''' +``block-job-pause`` (since 10.1) +'''''''''''''''''''''''''''''''' -Due to shortcomings in the internal implementation of ``device_add``, QEMU -incorrectly accepts certain invalid arguments: Any object or list arguments are -silently ignored. Other argument types are not checked, but an implicit -conversion happens, so that e.g. string values can be assigned to integer -device properties or vice versa. +Use ``job-pause`` instead. The only difference is that ``job-pause`` +always reports GenericError on failure when ``block-job-pause`` reports +DeviceNotActive when block-job is not found. -This is a bug in QEMU that will be fixed in the future so that previously -accepted incorrect commands will return an error. Users should make sure that -all arguments passed to ``device_add`` are consistent with the documented -property types. +``block-job-resume`` (since 10.1) +''''''''''''''''''''''''''''''''' + +Use ``job-resume`` instead. The only difference is that ``job-resume`` +always reports GenericError on failure when ``block-job-resume`` reports +DeviceNotActive when block-job is not found. + +``block-job-complete`` (since 10.1) +''''''''''''''''''''''''''''''''''' + +Use ``job-complete`` instead. The only difference is that ``job-complete`` +always reports GenericError on failure when ``block-job-complete`` reports +DeviceNotActive when block-job is not found. + +``block-job-dismiss`` (since 10.1) +'''''''''''''''''''''''''''''''''' + +Use ``job-dismiss`` instead. + +``block-job-finalize`` (since 10.1) +''''''''''''''''''''''''''''''''''' + +Use ``job-finalize`` instead. + +``migrate`` argument ``detach`` (since 10.1) +'''''''''''''''''''''''''''''''''''''''''''' + +This argument has always been ignored. Host Architectures ------------------ @@ -278,6 +300,13 @@ CPU implementation for a while before removing all support. 
System emulator machines ------------------------ +Versioned machine types (aarch64, arm, i386, m68k, ppc64, s390x, x86_64) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +In accordance with our versioned machine type deprecation policy, all machine +types with version |VER_MACHINE_DEPRECATION_VERSION|, or older, have been +deprecated. + Arm ``virt`` machine ``dtb-kaslr-seed`` property (since 7.1) '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' @@ -286,12 +315,6 @@ deprecated; use the new name ``dtb-randomness`` instead. The new name better reflects the way this property affects all random data within the device tree blob, not just the ``kaslr-seed`` node. -Big-Endian variants of MicroBlaze ``petalogix-ml605`` and ``xlnx-zynqmp-pmu`` machines (since 9.2) -'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' - -Both ``petalogix-ml605`` and ``xlnx-zynqmp-pmu`` were added for little endian -CPUs. Big endian support is not tested. - Mips ``mipssim`` machine (since 10.0) ''''''''''''''''''''''''''''''''''''' @@ -322,6 +345,19 @@ machine must ensure that they're setting the ``spike`` machine in the command line (``-M spike``). +System emulator binaries +------------------------ + +``qemu-system-microblazeel`` (since 10.1) +''''''''''''''''''''''''''''''''''''''''' + +The ``qemu-system-microblaze`` binary can emulate little-endian machines +now, too, so the separate binary ``qemu-system-microblazeel`` (with the +``el`` suffix) for little-endian targets is not required anymore. The +``petalogix-s3adsp1800`` machine can now be switched to little endian by +setting its ``endianness`` property to ``little``. + + Backend options --------------- @@ -493,14 +529,6 @@ PCIe passthrough shall be the mainline solution. CPU device properties ''''''''''''''''''''' -``pcommit`` on x86 (since 9.1) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The PCOMMIT instruction was never included in any physical processor. 
-It was implemented as a no-op instruction in TCG up to QEMU 9.0, but -only with ``-cpu max`` (which does not guarantee migration compatibility -across versions). - ``pmu-num=n`` on RISC-V CPUs (since 8.2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -510,6 +538,14 @@ be calculated with ``((2 ^ n) - 1) << 3``. The least significant three bits must be left clear. +``pcommit`` on x86 (since 9.1) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The PCOMMIT instruction was never included in any physical processor. +It was implemented as a no-op instruction in TCG up to QEMU 9.0, but +only with ``-cpu max`` (which does not guarantee migration compatibility +across versions). + Backwards compatibility ----------------------- diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst index a72591e..456d01d 100644 --- a/docs/about/emulation.rst +++ b/docs/about/emulation.rst @@ -811,6 +811,10 @@ This plugin can limit the number of Instructions Per Second that are executed:: * - ips=N - Maximum number of instructions per cpu that can be executed in one second. The plugin will sleep when the given number of instructions is reached. + * - ipq=N + - Instructions per quantum. How many instructions before we re-calculate time. + The lower the number the more accurate time will be, but the less efficient the plugin. + Defaults to ips/10 Other emulation features ------------------------ diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst index 790a5e4..d7c2113 100644 --- a/docs/about/removed-features.rst +++ b/docs/about/removed-features.rst @@ -162,6 +162,12 @@ specified with ``-mem-path`` can actually provide the guest RAM configured with The ``name`` parameter of the ``-net`` option was a synonym for the ``id`` parameter, which should now be used instead. 
+RISC-V firmware not booted by default (removed in 5.1) +'''''''''''''''''''''''''''''''''''''''''''''''''''''' + +QEMU 5.1 changes the default behaviour from ``-bios none`` to ``-bios default`` +for the RISC-V ``virt`` machine and ``sifive_u`` machine. + ``-numa node,mem=...`` (removed in 5.1) ''''''''''''''''''''''''''''''''''''''' @@ -324,12 +330,6 @@ devices. Drives the board doesn't pick up can no longer be used with This option was undocumented and not used in the field. Use ``-device usb-ccid`` instead. -RISC-V firmware not booted by default (removed in 5.1) -'''''''''''''''''''''''''''''''''''''''''''''''''''''' - -QEMU 5.1 changes the default behaviour from ``-bios none`` to ``-bios default`` -for the RISC-V ``virt`` machine and ``sifive_u`` machine. - ``-no-quit`` (removed in 7.0) ''''''''''''''''''''''''''''' @@ -722,6 +722,15 @@ Use ``multifd-channels`` instead. Use ``multifd-compression`` instead. +Incorrectly typed ``device_add`` arguments (since 9.2) +'''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Due to shortcomings in the internal implementation of ``device_add``, +QEMU used to incorrectly accept certain invalid arguments. Any object +or list arguments were silently ignored. Other argument types were not +checked, but an implicit conversion happened, so that e.g. string +values could be assigned to integer device properties or vice versa. + QEMU Machine Protocol (QMP) events ---------------------------------- @@ -902,14 +911,6 @@ The RISC-V no MMU cpus have been removed. The two CPUs: ``rv32imacu-nommu`` and ``rv64imacu-nommu`` can no longer be used. Instead the MMU status can be specified via the CPU ``mmu`` option when using the ``rv32`` or ``rv64`` CPUs. -RISC-V 'any' CPU type ``-cpu any`` (removed in 9.2) -''''''''''''''''''''''''''''''''''''''''''''''''''' - -The 'any' CPU type was introduced back in 2018 and was around since the -initial RISC-V QEMU port. 
Its usage was always been unclear: users don't know -what to expect from a CPU called 'any', and in fact the CPU does not do anything -special that isn't already done by the default CPUs rv32/rv64. - ``compat`` property of server class POWER CPUs (removed in 6.0) ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' @@ -956,6 +957,14 @@ The CRIS architecture was pulled from Linux in 4.17 and the compiler was no longer packaged in any distro making it harder to run the ``check-tcg`` tests. +RISC-V 'any' CPU type ``-cpu any`` (removed in 9.2) +''''''''''''''''''''''''''''''''''''''''''''''''''' + +The 'any' CPU type was introduced back in 2018 and was around since the +initial RISC-V QEMU port. Its usage was always been unclear: users don't know +what to expect from a CPU called 'any', and in fact the CPU does not do anything +special that isn't already done by the default CPUs rv32/rv64. + System accelerators ------------------- @@ -966,25 +975,27 @@ Userspace local APIC with KVM (x86, removed in 8.0) a local APIC. The ``split`` setting is supported, as is using ``-M kernel-irqchip=off`` when the CPU does not have a local APIC. -HAXM (``-accel hax``) (removed in 8.2) -'''''''''''''''''''''''''''''''''''''' - -The HAXM project has been retired (see https://github.com/intel/haxm#status). -Use "whpx" (on Windows) or "hvf" (on macOS) instead. - MIPS "Trap-and-Emulate" KVM support (removed in 8.0) '''''''''''''''''''''''''''''''''''''''''''''''''''' The MIPS "Trap-and-Emulate" KVM host and guest support was removed from Linux in 2021, and is not supported anymore by QEMU either. +HAXM (``-accel hax``) (removed in 8.2) +'''''''''''''''''''''''''''''''''''''' + +The HAXM project has been retired (see https://github.com/intel/haxm#status). +Use "whpx" (on Windows) or "hvf" (on macOS) instead. 
+ System emulator machines ------------------------ -Note: Versioned machine types that have been introduced in a QEMU version -that has initially been released more than 6 years before are considered -obsolete and will be removed without further notice in this document. -Please use newer machine types instead. +Versioned machine types (aarch64, arm, i386, m68k, ppc64, s390x, x86_64) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +In accordance with our versioned machine type deprecation policy, all machine +types with version |VER_MACHINE_DELETION_VERSION|, or older, have been +removed. ``s390-virtio`` (removed in 2.6) '''''''''''''''''''''''''''''''' @@ -1033,16 +1044,6 @@ Aspeed ``swift-bmc`` machine (removed in 7.0) This machine was removed because it was unused. Alternative AST2500 based OpenPOWER machines are ``witherspoon-bmc`` and ``romulus-bmc``. -Aspeed ``tacoma-bmc`` machine (removed in 10.0) -''''''''''''''''''''''''''''''''''''''''''''''' - -The ``tacoma-bmc`` machine was removed because it didn't bring much -compared to the ``rainier-bmc`` machine. Also, the ``tacoma-bmc`` was -a board used for bring up of the AST2600 SoC that never left the -labs. It can be easily replaced by the ``rainier-bmc`` machine, which -was the actual final product, or by the ``ast2600-evb`` with some -tweaks. - ppc ``taihu`` machine (removed in 7.2) ''''''''''''''''''''''''''''''''''''''''''''' @@ -1073,6 +1074,16 @@ for all machine types using the PXA2xx and OMAP2 SoCs. We are also dropping the ``cheetah`` OMAP1 board, because we don't have any test images for it and don't know of anybody who does. +Aspeed ``tacoma-bmc`` machine (removed in 10.0) +''''''''''''''''''''''''''''''''''''''''''''''' + +The ``tacoma-bmc`` machine was removed because it didn't bring much +compared to the ``rainier-bmc`` machine. Also, the ``tacoma-bmc`` was +a board used for bring up of the AST2600 SoC that never left the +labs. 
It can be easily replaced by the ``rainier-bmc`` machine, which +was the actual final product, or by the ``ast2600-evb`` with some +tweaks. + ppc ``ref405ep`` machine (removed in 10.0) '''''''''''''''''''''''''''''''''''''''''' @@ -1080,6 +1091,15 @@ This machine was removed because PPC 405 CPU have no known users, firmware images are not available, OpenWRT dropped support in 2019, U-Boot in 2017, and Linux in 2024. +Big-Endian variants of ``petalogix-ml605`` and ``xlnx-zynqmp-pmu`` machines (removed in 10.1) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Both the MicroBlaze ``petalogix-ml605`` and ``xlnx-zynqmp-pmu`` machines +were added for little endian CPUs. Big endian support was never tested +and likely never worked. Starting with QEMU v10.1, the machines are now +only available as little-endian machines. + + linux-user mode CPUs -------------------- diff --git a/docs/conf.py b/docs/conf.py index 7b5712e..f892a6e 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -117,6 +117,32 @@ finally: else: version = release = "unknown version" +bits = version.split(".") + +major = int(bits[0]) +minor = int(bits[1]) +micro = int(bits[2]) + +# Check for a dev snapshot, so we can adjust to next +# predicted release version. +# +# This assumes we do 3 releases per year, so must bump +# major if minor == 2 +if micro >= 50: + micro = 0 + if minor == 2: + major += 1 + minor = 0 + else: + minor += 1 + +# These thresholds must match the constants +# MACHINE_VER_DELETION_MAJOR & MACHINE_VER_DEPRECATION_MAJOR +# defined in include/hw/boards.h and the introductory text in +# docs/about/deprecated.rst +ver_machine_deprecation_version = "%d.%d.0" % (major - 3, minor) +ver_machine_deletion_version = "%d.%d.0" % (major - 6, minor) + # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. 
# @@ -145,7 +171,18 @@ suppress_warnings = ["ref.option"] # environment variable is not set is for the benefit of readthedocs # style document building; our Makefile always sets the variable. confdir = os.getenv('CONFDIR', "/etc/qemu") -rst_epilog = ".. |CONFDIR| replace:: ``" + confdir + "``\n" + +vars = { + "CONFDIR": confdir, + "VER_MACHINE_DEPRECATION_VERSION": ver_machine_deprecation_version, + "VER_MACHINE_DELETION_VERSION": ver_machine_deletion_version, +} + +rst_epilog = "".join([ + ".. |" + key + "| replace:: ``" + vars[key] + "``\n" + for key in vars.keys() +]) + # We slurp in the defs.rst.inc and literally include it into rst_epilog, # because Sphinx's include:: directive doesn't work with absolute paths # and there isn't any one single relative path that will work for all diff --git a/docs/devel/build-environment.rst b/docs/devel/build-environment.rst index f133ef2..661f6ea 100644 --- a/docs/devel/build-environment.rst +++ b/docs/devel/build-environment.rst @@ -97,11 +97,11 @@ build QEMU in MSYS2 itself. :: - pacman -S wget + pacman -S wget base-devel git wget https://raw.githubusercontent.com/msys2/MINGW-packages/refs/heads/master/mingw-w64-qemu/PKGBUILD # Some packages may be missing for your environment, installation will still # be done though. - makepkg -s PKGBUILD || true + makepkg --syncdeps --nobuild PKGBUILD || true Build on windows-aarch64 ++++++++++++++++++++++++ diff --git a/docs/devel/build-system.rst b/docs/devel/build-system.rst index 258cfad..2c88419 100644 --- a/docs/devel/build-system.rst +++ b/docs/devel/build-system.rst @@ -168,7 +168,7 @@ The required versions of the packages are stored in a configuration file ``pythondeps.toml``. The format is custom to QEMU, but it is documented at the top of the file itself and it should be easy to understand. The requirements should make it possible to use the version that is packaged -that is provided by supported distros. +by QEMU's supported distros. 
When dependencies are downloaded, instead, ``configure`` uses a "known good" version that is also listed in ``pythondeps.toml``. In this diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst new file mode 100644 index 0000000..b5aae2e --- /dev/null +++ b/docs/devel/code-provenance.rst @@ -0,0 +1,338 @@ +.. _code-provenance: + +Code provenance +=============== + +Certifying patch submissions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The QEMU community **mandates** all contributors to certify provenance of +patch submissions they make to the project. To put it another way, +contributors must indicate that they are legally permitted to contribute to +the project. + +Certification is achieved with a low overhead by adding a single line to the +bottom of every git commit:: + + Signed-off-by: YOUR NAME <YOUR@EMAIL> + +The addition of this line asserts that the author of the patch is contributing +in accordance with the clauses specified in the +`Developer's Certificate of Origin <https://developercertificate.org>`__: + +.. _dco: + + Developer's Certificate of Origin 1.1 + + By making a contribution to this project, I certify that: + + (a) The contribution was created in whole or in part by me and I + have the right to submit it under the open source license + indicated in the file; or + + (b) The contribution is based upon previous work that, to the best + of my knowledge, is covered under an appropriate open source + license and I have the right under that license to submit that + work with modifications, whether created in whole or in part + by me, under the same open source license (unless I am + permitted to submit under a different license), as indicated + in the file; or + + (c) The contribution was provided directly to me by some other + person who certified (a), (b) or (c) and I have not modified + it. 
+ + (d) I understand and agree that this project and the contribution + are public and that a record of the contribution (including all + personal information I submit with it, including my sign-off) is + maintained indefinitely and may be redistributed consistent with + this project or the open source license(s) involved. + +The name used with "Signed-off-by" does not need to be your legal name, nor +birth name, nor appear on any government ID. It is the identity you choose to +be known by in the community, but should not be anonymous, nor misrepresent +who you are. + +It is generally expected that the name and email addresses used in one of the +``Signed-off-by`` lines, matches that of the git commit ``Author`` field. +It's okay if you subscribe or contribute to the list via more than one +address, but using multiple addresses in one commit just confuses +things. + +If the person sending the mail is not one of the patch authors, they are +nonetheless expected to add their own ``Signed-off-by`` to comply with the +DCO clause (c). + +Multiple authorship +~~~~~~~~~~~~~~~~~~~ + +It is not uncommon for a patch to have contributions from multiple authors. In +this scenario, git commits will usually be expected to have a ``Signed-off-by`` +line for each contributor involved in creation of the patch. Some edge cases: + + * The non-primary author's contributions were so trivial that they can be + considered not subject to copyright. In this case the secondary authors + need not include a ``Signed-off-by``. + + This case most commonly applies where QEMU reviewers give short snippets + of code as suggested fixes to a patch. The reviewers don't need to have + their own ``Signed-off-by`` added unless their code suggestion was + unusually large, but it is common to add ``Suggested-by`` as a credit + for non-trivial code. + + * Both contributors work for the same employer and the employer requires + copyright assignment. 
+ + It can be said that in this case a ``Signed-off-by`` is indicating that + the person has permission to contribute from their employer who is the + copyright holder. It is nonetheless still preferable to include a + ``Signed-off-by`` for each contributor, as in some countries employees are + not able to assign copyright to their employer, and it also covers any + time invested outside working hours. + +When multiple ``Signed-off-by`` tags are present, they should be strictly kept +in order of authorship, from oldest to newest. + +Other commit tags +~~~~~~~~~~~~~~~~~ + +While the ``Signed-off-by`` tag is mandatory, there are a number of other tags +that are commonly used during QEMU development: + + * **``Reviewed-by``**: when a QEMU community member reviews a patch on the + mailing list, if they consider the patch acceptable, they should send an + email reply containing a ``Reviewed-by`` tag. Subsystem maintainers who + review a patch should add this even if they are also adding their + ``Signed-off-by`` to the same commit. + + * **``Acked-by``**: when a QEMU subsystem maintainer approves a patch that + touches their subsystem, but intends to allow a different maintainer to + queue it and send a pull request, they would send a mail containing a + ``Acked-by`` tag. Where a patch touches multiple subsystems, ``Acked-by`` + only implies review of the maintainers' own areas of responsibility. If a + maintainer wants to indicate they have done a full review they should use + a ``Reviewed-by`` tag. + + * **``Tested-by``**: when a QEMU community member has functionally tested the + behaviour of the patch in some manner, they should send an email reply + containing a ``Tested-by`` tag. + + * **``Reported-by``**: when a QEMU community member reports a problem via the + mailing list, or some other informal channel that is not the issue tracker, + it is good practice to credit them by including a ``Reported-by`` tag on + any patch fixing the issue. 
When the problem is reported via the GitLab + issue tracker, however, it is sufficient to just include a link to the + issue. + + * **``Suggested-by``**: when a reviewer or other 3rd party makes non-trivial + suggestions for how to change a patch, it is good practice to credit them + by including a ``Suggested-by`` tag. + +Subsystem maintainer requirements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When a subsystem maintainer accepts a patch from a contributor, in addition to +the normal code review points, they are expected to validate the presence of +suitable ``Signed-off-by`` tags. + +At the time they queue the patch in their subsystem tree, the maintainer +**must** also then add their own ``Signed-off-by`` to indicate that they have +done the aforementioned validation. This is in addition to any of their own +``Reviewed-by`` tags the subsystem maintainer may wish to include. + +When the maintainer modifies the patch after pulling into their tree, they +should record their contribution. This is typically done via a note in the +commit message, just prior to the maintainer's ``Signed-off-by``:: + + Signed-off-by: Cory Contributor <cory.contributor@example.com> + [Comment rephrased for clarity] + Signed-off-by: Mary Maintainer <mary.maintainer@mycorp.test> + + +Tools for adding ``Signed-off-by`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are a variety of ways tools can support adding ``Signed-off-by`` tags +for patches, avoiding the need for contributors to manually type in this +repetitive text each time. + +git commands +^^^^^^^^^^^^ + +When creating, or amending, a commit the ``-s`` flag to ``git commit`` will +append a suitable line matching the configured git author details. + +If preparing patches using the ``git format-patch`` tool, the ``-s`` flag can +be used to append a suitable line in the emails it creates, without modifying +the local commits. 
Alternatively, to modify all the local commits on a branch:: + + git rebase master -x 'git commit --amend --no-edit -s' + +emacs +^^^^^ + +In the file ``$HOME/.emacs.d/abbrev_defs`` add: + +.. code:: elisp + + (define-abbrev-table 'global-abbrev-table + '( + ("8rev" "Reviewed-by: YOUR NAME <your@email.addr>" nil 1) + ("8ack" "Acked-by: YOUR NAME <your@email.addr>" nil 1) + ("8test" "Tested-by: YOUR NAME <your@email.addr>" nil 1) + ("8sob" "Signed-off-by: YOUR NAME <your@email.addr>" nil 1) + )) + +with this change, if you type (for example) ``8rev`` followed by ``<space>`` +or ``<enter>`` it will expand to the whole phrase. + +vim +^^^ + +In the file ``$HOME/.vimrc`` add:: + + iabbrev 8rev Reviewed-by: YOUR NAME <your@email.addr> + iabbrev 8ack Acked-by: YOUR NAME <your@email.addr> + iabbrev 8test Tested-by: YOUR NAME <your@email.addr> + iabbrev 8sob Signed-off-by: YOUR NAME <your@email.addr> + +with this change, if you type (for example) ``8rev`` followed by ``<space>`` +or ``<enter>`` it will expand to the whole phrase. + +Re-starting abandoned work +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For a variety of reasons there are some patches that get submitted to QEMU but +never merged. An unrelated contributor may decide (months or years later) to +continue working from the abandoned patch and re-submit it with extra changes. + +The general principles when picking up abandoned work are: + + * Continue to credit the original author for their work, by maintaining their + original ``Signed-off-by`` + * Indicate where the original patch was obtained from (mailing list, bug + tracker, author's git repo, etc) when sending it for review + * Acknowledge the extra work of the new contributor by including their + ``Signed-off-by`` in the patch in addition to the original author's + * Indicate who is responsible for what parts of the patch. 
This is typically + done via a note in the commit message, just prior to the new contributor's + ``Signed-off-by``:: + + Signed-off-by: Some Person <some.person@example.com> + [Rebased and added support for 'foo'] + Signed-off-by: New Person <new.person@mycorp.test> + +In complicated cases, or if otherwise unsure, ask for advice on the project +mailing list. + +It is also recommended to attempt to contact the original author to let them +know you are interested in taking over their work, in case they still intended +to return to the work, or had any suggestions about the best way to continue. + +Inclusion of generated files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Files in patches contributed to QEMU are generally expected to be provided +only in the preferred format for making modifications. The implication of +this is that the output of code generators or compilers is usually not +appropriate to contribute to QEMU. + +For reasons of practicality there are some exceptions to this rule, where +generated code is permitted, provided it is also accompanied by the +corresponding preferred source format. This is done where it is impractical +to expect those building QEMU to run the code generation or compilation +process. A non-exhaustive list of examples is: + + * Images: where a bitmap image is created from a vector file it is common + to include the rendered bitmaps at desired resolution(s), since subtle + changes in the rasterization process / tools may affect quality. The + original vector file is expected to accompany any generated bitmaps. + + * Firmware: QEMU includes pre-compiled binary ROMs for a variety of guest + firmwares. When such binary ROMs are contributed, the corresponding source + must also be provided, either directly, or through a git submodule link. + + * Dockerfiles: the majority of the dockerfiles are automatically generated + from a canonical list of build dependencies maintained in tree, together + with the libvirt-ci git submodule link. 
The generated dockerfiles are + included in tree because it is desirable to be able to directly build + container images from a clean git checkout. + + * eBPF: QEMU includes some generated eBPF machine code, since the required + eBPF compilation tools are not broadly available on all targeted OS + distributions. The corresponding eBPF C code for the binary is also + provided. This is a time-limited exception until the eBPF toolchain is + sufficiently broadly available in distros. + +In all cases above, the existence of generated files must be acknowledged +and justified in the commit that introduces them. + +Tools which perform changes to existing code with deterministic algorithmic +manipulation, driven by user specified inputs, are not generally considered +to be "generators". + +For instance, using Coccinelle to convert code from one pattern to another +pattern, or fixing documentation typos with a spell checker, or transforming +code using sed / awk / etc, are not considered to be acts of code +generation. Where an automated manipulation is performed on code, however, +this should be declared in the commit message. + +At times contributors may use or create scripts/tools to generate an initial +boilerplate code template which is then filled in to produce the final patch. +The output of such a tool would still be considered the "preferred format", +since it is intended to be a foundation for further human authored changes. +Such tools are acceptable to use, provided there is clearly defined copyright +and licensing for their output. Note in particular the caveats applying to AI +content generators below. + +Use of AI content generators +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +TL;DR: + + **Current QEMU project policy is to DECLINE any contributions which are + believed to include or derive from AI generated content. This includes 
This includes + ChatGPT, Claude, Copilot, Llama and similar tools.** + +The increasing prevalence of AI-assisted software development results in a +number of difficult legal questions and risks for software projects, including +QEMU. Of particular concern is content generated by `Large Language Models +<https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs). + +The QEMU community requires that contributors certify their patch submissions +are made in accordance with the rules of the `Developer's Certificate of +Origin (DCO) <dco>`. + +To satisfy the DCO, the patch contributor has to fully understand the +copyright and license status of content they are contributing to QEMU. With AI +content generators, the copyright and license status of the output is +ill-defined with no generally accepted, settled legal foundation. + +Where the training material is known, it is common for it to include large +volumes of material under restrictive licensing/copyright terms. Even where +the training material is all known to be under open source licenses, it is +likely to be under a variety of terms, not all of which will be compatible +with QEMU's licensing requirements. + +How contributors could comply with DCO terms (b) or (c) for the output of AI +content generators commonly available today is unclear. The QEMU project is +not willing or able to accept the legal risks of non-compliance. + +The QEMU project thus requires that contributors refrain from using AI content +generators on patches intended to be submitted to the project, and will +decline any contribution if use of AI is either known or suspected. + +This policy does not apply to other uses of AI, such as researching APIs or +algorithms, static analysis, or debugging, provided their output is not to be +included in contributions. 
+ +Examples of tools impacted by this policy include GitHub's Copilot, OpenAI's +ChatGPT, Anthropic's Claude, and Meta's Code Llama, and code/content +generation agents which are built on top of such tools. + +This policy may evolve as AI tools mature and the legal situation is +clarified. In the meanwhile, requests for exceptions to this policy will be +evaluated by the QEMU project on a case by case basis. To be granted an +exception, a contributor will need to demonstrate clarity of the license and +copyright status for the tool's output in relation to its training model and +code, to the satisfaction of the project maintainers. diff --git a/docs/devel/codebase.rst b/docs/devel/codebase.rst index 40273e7..2a31437 100644 --- a/docs/devel/codebase.rst +++ b/docs/devel/codebase.rst @@ -116,7 +116,7 @@ yet, so sometimes the source code is all you have. * `monitor <https://gitlab.com/qemu-project/qemu/-/tree/master/monitor>`_: `Monitor <QEMU monitor>` implementation (HMP & QMP). * `nbd <https://gitlab.com/qemu-project/qemu/-/tree/master/nbd>`_: - QEMU `NBD (Network Block Device) <nbd>` server. + QEMU NBD (Network Block Device) server. * `net <https://gitlab.com/qemu-project/qemu/-/tree/master/net>`_: Network (host) support. * `pc-bios <https://gitlab.com/qemu-project/qemu/-/tree/master/pc-bios>`_: diff --git a/docs/devel/index-process.rst b/docs/devel/index-process.rst index cb7c664..5807752 100644 --- a/docs/devel/index-process.rst +++ b/docs/devel/index-process.rst @@ -13,6 +13,7 @@ Notes about how to interact with the community and how and where to submit patch maintainers style submitting-a-patch + code-provenance trivial-patches stable-process submitting-a-pull-request diff --git a/docs/devel/rust.rst b/docs/devel/rust.rst index 88bdec1..dc8c441 100644 --- a/docs/devel/rust.rst +++ b/docs/devel/rust.rst @@ -37,12 +37,16 @@ output directory (typically ``rust/target/``).
A vanilla invocation of Cargo will complain that it cannot find the generated sources, which can be fixed in different ways: -* by using special shorthand targets in the QEMU build directory:: +* by using Makefile targets, provided by Meson, that run ``clippy`` or + ``rustdoc``: make clippy - make rustfmt make rustdoc +A target for ``rustfmt`` is also declared in ``rust/meson.build``: + + make rustfmt + * by invoking ``cargo`` through the Meson `development environment`__ feature:: @@ -50,7 +54,7 @@ which can be fixed in different ways: pyvenv/bin/meson devenv -w ../rust cargo fmt If you are going to use ``cargo`` repeatedly, ``pyvenv/bin/meson devenv`` - will enter a shell where commands like ``cargo clippy`` just work. + will enter a shell where commands like ``cargo fmt`` just work. __ https://mesonbuild.com/Commands.html#devenv @@ -66,41 +70,14 @@ be run via ``meson test`` or ``make``:: make check-rust -Building Rust code with ``--enable-modules`` is not supported yet. +Note that doctests require all ``.o`` files from the build to be available. Supported tools ''''''''''''''' -QEMU supports rustc version 1.63.0 and newer. Notably, the following features +QEMU supports rustc version 1.77.0 and newer. Notably, the following features are missing: -* ``core::ffi`` (1.64.0). Use ``std::os::raw`` and ``std::ffi`` instead. - -* ``cast_mut()``/``cast_const()`` (1.65.0). Use ``as`` instead. - -* "let ... else" (1.65.0). Use ``if let`` instead. This is currently patched - in QEMU's vendored copy of the bilge crate. - -* Generic Associated Types (1.65.0) - -* ``CStr::from_bytes_with_nul()`` as a ``const`` function (1.72.0). - -* "Return position ``impl Trait`` in Traits" (1.75.0, blocker for including - the pinned-init create). - -* ``MaybeUninit::zeroed()`` as a ``const`` function (1.75.0). QEMU's - ``Zeroable`` trait can be implemented without ``MaybeUninit::zeroed()``, - so this would be just a cleanup. - -* ``c"" literals`` (stable in 1.77.0). 
QEMU provides a ``c_str!()`` macro - to define ``CStr`` constants easily - -* ``offset_of!`` (stable in 1.77.0). QEMU uses ``offset_of!()`` heavily; it - provides a replacement in the ``qemu_api`` crate, but it does not support - lifetime parameters and therefore ``&'a Something`` fields in the struct - may have to be replaced by ``NonNull<Something>``. *Nested* ``offset_of!`` - was only stabilized in Rust 1.82.0, but it is not used. - * inline const expression (stable in 1.79.0), currently worked around with associated constants in the ``FnCall`` trait. @@ -119,18 +96,17 @@ are missing: architecture (VMState). Right now, VMState lacks type safety because it is hard to place the ``VMStateField`` definitions in traits. +* NUL-terminated file names with ``#[track_caller]`` are scheduled for + inclusion as ``#![feature(location_file_nul)]``, but it will be a while + before QEMU can use them. For now, there is special code in + ``util/error.c`` to support non-NUL-terminated file names. + * associated const equality would be nice to have for some users of ``callbacks::FnCall``, but is still experimental. ``ASSERT_IS_SOME`` replaces it. __ https://github.com/rust-lang/rust/pull/125258 -It is expected that QEMU will advance its minimum supported version of -rustc to 1.77.0 as soon as possible; as of January 2025, blockers -for that right now are Debian bookworm and 32-bit MIPS processors. -This unfortunately means that references to statics in constants will -remain an issue. - QEMU also supports version 0.60.x of bindgen, which is missing option ``--generate-cstr``. This option requires version 0.66.x and will be adopted as soon as supporting these older versions is not necessary @@ -152,9 +128,8 @@ QEMU includes four crates: for the ``hw/char/pl011.c`` and ``hw/timer/hpet.c`` files. .. [#issues] The ``pl011`` crate is synchronized with ``hw/char/pl011.c`` - as of commit 02b1f7f61928. The ``hpet`` crate is synchronized as of - commit f32352ff9e. 
Both are lacking tracing functionality; ``hpet`` - is also lacking support for migration. + as of commit 3e0f118f82. The ``hpet`` crate is synchronized as of + commit 1433e38cc8. Both are lacking tracing functionality. This section explains how to work with them. @@ -184,12 +159,12 @@ module status ``bitops`` complete ``callbacks`` complete ``cell`` stable -``c_str`` complete ``errno`` complete +``error`` stable ``irq`` complete +``log`` proof of concept ``memory`` stable ``module`` complete -``offset_of`` stable ``qdev`` stable ``qom`` stable ``sysbus`` stable @@ -441,7 +416,7 @@ Adding dependencies Generally, the set of dependent crates is kept small. Think twice before adding a new external crate, especially if it comes with a large set of dependencies itself. Sometimes QEMU only needs a small subset of the -functionality; see for example QEMU's ``assertions`` or ``c_str`` modules. +functionality; see for example QEMU's ``assertions`` module. On top of this recommendation, adding external crates to QEMU is a slightly complicated process, mostly due to the need to teach Meson how diff --git a/docs/devel/submitting-a-patch.rst b/docs/devel/submitting-a-patch.rst index 65c6407..f7917b8 100644 --- a/docs/devel/submitting-a-patch.rst +++ b/docs/devel/submitting-a-patch.rst @@ -344,28 +344,9 @@ Patch emails must include a ``Signed-off-by:`` line Your patches **must** include a Signed-off-by: line. This is a hard requirement because it's how you say "I'm legally okay to contribute -this and happy for it to go into QEMU". The process is modelled after -the `Linux kernel -<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__ -policy. - -If you wrote the patch, make sure your "From:" and "Signed-off-by:" -lines use the same spelling. 
It's okay if you subscribe or contribute to -the list via more than one address, but using multiple addresses in one -commit just confuses things. If someone else wrote the patch, git will -include a "From:" line in the body of the email (different from your -envelope From:) that will give credit to the correct author; but again, -that author's Signed-off-by: line is mandatory, with the same spelling. - -The name used with "Signed-off-by" does not need to be your legal name, -nor birth name, nor appear on any government ID. It is the identity you -choose to be known by in the community, but should not be anonymous, -nor misrepresent whom you are. - -There are various tooling options for automatically adding these tags -include using ``git commit -s`` or ``git format-patch -s``. For more -information see `SubmittingPatches 1.12 -<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__. +this and happy for it to go into QEMU". For full guidance, read the +:ref:`code-provenance` documentation. + .. _include_a_meaningful_cover_letter: diff --git a/docs/devel/testing/functional.rst b/docs/devel/testing/functional.rst index 8030cb4..9e56dd1 100644 --- a/docs/devel/testing/functional.rst +++ b/docs/devel/testing/functional.rst @@ -274,7 +274,7 @@ speed mode in the meson.build file, while the "quick" speed mode is fine for functional tests that can be run without downloading files. ``make check`` then only runs the quick functional tests along with the other quick tests from the other test suites. If you choose to -run only run ``make check-functional``, the "thorough" tests will be +run only ``make check-functional``, the "thorough" tests will be executed, too. 
And to run all functional tests along with the others, you can use something like:: diff --git a/docs/igd-assign.txt b/docs/igd-assign.txt index 3aed795..af4e839 100644 --- a/docs/igd-assign.txt +++ b/docs/igd-assign.txt @@ -47,6 +47,7 @@ Intel document [1] shows how to dump VBIOS to file. For UEFI Option ROM, see QEMU also provides a "Legacy" mode that implicitly enables full functionality on IGD, it is automatically enabled when +* IGD generation is 6 to 9 (Sandy Bridge to Comet Lake) * Machine type is i440fx * IGD is assigned to guest BDF 00:02.0 * ROM BAR or romfile is present @@ -101,7 +102,7 @@ digital formats work well. Options ======= -* x-igd-opregion=[on|*off*] +* x-igd-opregion=[*on*|off] Copy host IGD OpRegion and expose it to guest with fw_cfg * x-igd-lpc=[on|*off*] @@ -123,7 +124,7 @@ Examples * Adding IGD with OpRegion and LPC ID hack, but without VGA ranges (For UEFI guests) - -device vfio-pci,host=00:02.0,id=hostdev0,addr=2.0,x-igd-legacy-mode=off,x-igd-opregion=on,x-igd-lpc=on,romfile=efi_oprom.rom + -device vfio-pci,host=00:02.0,id=hostdev0,addr=2.0,x-igd-legacy-mode=off,x-igd-lpc=on,romfile=efi_oprom.rom Guest firmware @@ -156,6 +157,12 @@ fw_cfg requirements on the VM firmware: it's expected that this fw_cfg file is only relevant to a single PCI class VGA device with Intel vendor ID, appearing at PCI bus address 00:02.0. + Starting from Meteor Lake, IGD devices access stolen memory via its MMIO + BAR2 (LMEMBAR) and removed the BDSM register in config space. There is + no need for guest firmware to allocate data stolen memory in guest address + space and write it to BDSM register. Value of this fw_cfg file is 0 in + such case. + Upstream Seabios has OpRegion and BDSM (pre-Gen11 device only) support. However, the support is not accepted by upstream EDK2/OVMF. 
A recommended solution is to create a virtual OpRom with following DXE drivers: diff --git a/docs/interop/bitmaps.rst b/docs/interop/bitmaps.rst index ddf8947..7536f0b 100644 --- a/docs/interop/bitmaps.rst +++ b/docs/interop/bitmaps.rst @@ -97,7 +97,7 @@ time. - Persistent storage formats may impose their own requirements on bitmap names and namespaces. Presently, only qcow2 supports persistent bitmaps. See - docs/interop/qcow2.txt for more details on restrictions. Notably: + :doc:`qcow2` for more details on restrictions. Notably: - qcow2 bitmap names are limited to between 1 and 1023 bytes long. diff --git a/docs/interop/index.rst b/docs/interop/index.rst index 999e44e..d830c5c 100644 --- a/docs/interop/index.rst +++ b/docs/interop/index.rst @@ -17,12 +17,15 @@ are useful for making QEMU interoperate with other software. nbd parallels prl-xml + qcow2 + qed_spec pr-helper qmp-spec qemu-ga qemu-ga-ref qemu-qmp-ref qemu-storage-daemon-qmp-ref + vfio-user vhost-user vhost-user-gpu vhost-vdpa diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.rst index 2c46183..5948591 100644 --- a/docs/interop/qcow2.txt +++ b/docs/interop/qcow2.rst @@ -1,6 +1,8 @@ -== General == +======================= +Qcow2 Image File Format +======================= -A qcow2 image file is organized in units of constant size, which are called +A ``qcow2`` image file is organized in units of constant size, which are called (host) clusters. A cluster is the unit in which all allocations are done, both for actual guest data and for image metadata. @@ -9,10 +11,10 @@ clusters of the same size. All numbers in qcow2 are stored in Big Endian byte order. +Header +------ -== Header == - -The first cluster of a qcow2 image contains the file header: +The first cluster of a qcow2 image contains the file header:: Byte 0 - 3: magic QCOW magic string ("QFI\xfb") @@ -38,7 +40,7 @@ The first cluster of a qcow2 image contains the file header: within a cluster (1 << cluster_bits is the cluster size). 
Must not be less than 9 (i.e. 512 byte clusters). - Note: qemu as of today has an implementation limit of 2 MB + Note: QEMU as of today has an implementation limit of 2 MB as the maximum cluster size and won't be able to open images with larger cluster sizes. @@ -48,7 +50,7 @@ The first cluster of a qcow2 image contains the file header: 24 - 31: size Virtual disk size in bytes. - Note: qemu has an implementation limit of 32 MB as + Note: QEMU has an implementation limit of 32 MB as the maximum L1 table size. With a 2 MB cluster size, it is unable to populate a virtual cluster beyond 2 EB (61 bits); with a 512 byte cluster @@ -87,7 +89,8 @@ The first cluster of a qcow2 image contains the file header: For version 2, the header is exactly 72 bytes in length, and finishes here. For version 3 or higher, the header length is at least 104 bytes, including -the next fields through header_length. +the next fields through ``header_length``. +:: 72 - 79: incompatible_features Bitmask of incompatible features. An implementation must @@ -185,7 +188,8 @@ the next fields through header_length. of 8. -=== Additional fields (version 3 and higher) === +Additional fields (version 3 and higher) +---------------------------------------- In general, these fields are optional and may be safely ignored by the software, as well as filled by zeros (which is equal to field absence), if software needs @@ -193,21 +197,25 @@ to set field B, but does not care about field A which precedes B. More formally, additional fields have the following compatibility rules: 1. If the value of the additional field must not be ignored for correct -handling of the file, it will be accompanied by a corresponding incompatible -feature bit. + handling of the file, it will be accompanied by a corresponding incompatible + feature bit. 2. If there are no unrecognized incompatible feature bits set, an unknown -additional field may be safely ignored other than preserving its value when -rewriting the image header. 
+ additional field may be safely ignored other than preserving its value when + rewriting the image header. + +.. _ref_rules_3: 3. An explicit value of 0 will have the same behavior as when the field is not -present*, if not altered by a specific incompatible bit. + present*, if not altered by a specific incompatible bit. -*. A field is considered not present when header_length is less than or equal +(*) A field is considered not present when ``header_length`` is less than or equal to the field's offset. Also, all additional fields are not present for version 2. - 104: compression_type +:: + + 104: compression_type Defines the compression method used for compressed clusters. All compressed clusters in an image use the same compression @@ -219,8 +227,8 @@ version 2. or must be zero (which means deflate). Available compression type values: - 0: deflate <https://www.ietf.org/rfc/rfc1951.txt> - 1: zstd <http://github.com/facebook/zstd> + - 0: deflate <https://www.ietf.org/rfc/rfc1951.txt> + - 1: zstd <http://github.com/facebook/zstd> The deflate compression type is called "zlib" <https://www.zlib.net/> in QEMU. However, clusters with the @@ -228,19 +236,21 @@ version 2. 105 - 111: Padding, contents defined below. -=== Header padding === +Header padding +-------------- -@header_length must be a multiple of 8, which means that if the end of the last +``header_length`` must be a multiple of 8, which means that if the end of the last additional field is not aligned, some padding is needed. This padding must be zeroed, so that if some existing (or future) additional field will fall into -the padding, it will be interpreted accordingly to point [3.] of the previous +the padding, it will be interpreted accordingly to point `[3.] <#ref_rules_3>`_ of the previous paragraph, i.e. in the same manner as when this field is not present. 
-=== Header extensions === +Header extensions +----------------- Directly after the image header, optional sections called header extensions can -be stored. Each extension has a structure like the following: +be stored. Each extension has a structure like the following:: Byte 0 - 3: Header extension type: 0x00000000 - End of the header extension area @@ -270,17 +280,19 @@ data of compatible features that it doesn't support. Compatible features that need space for additional data can use a header extension. -== String header extensions == +String header extensions +------------------------ Some header extensions (such as the backing file format name and the external data file name) are just a single string. In this case, the header extension -length is the string length and the string is not '\0' terminated. (The header -extension padding can make it look like a string is '\0' terminated, but +length is the string length and the string is not ``\0`` terminated. (The header +extension padding can make it look like a string is ``\0`` terminated, but neither is padding always necessary nor is there a guarantee that zero bytes are used for padding.) -== Feature name table == +Feature name table +------------------ The feature name table is an optional header extension that contains the name for features used by the image. It can be used by applications that don't know @@ -288,7 +300,7 @@ the respective feature (e.g. because the feature was introduced only later) to display a useful error message. The number of entries in the feature name table is determined by the length of -the header extension data. Each entry look like this: +the header extension data. Each entry looks like this:: Byte 0: Type of feature (select feature bitmap) 0: Incompatible feature @@ -302,7 +314,8 @@ the header extension data. 
Each entry look like this: terminated if it has full length) -== Bitmaps extension == +Bitmaps extension +----------------- The bitmaps extension is an optional header extension. It provides the ability to store bitmaps related to a virtual disk. For now, there is only one bitmap @@ -310,9 +323,9 @@ type: the dirty tracking bitmap, which tracks virtual disk changes from some point in time. The data of the extension should be considered consistent only if the -corresponding auto-clear feature bit is set, see autoclear_features above. +corresponding auto-clear feature bit is set, see ``autoclear_features`` above. -The fields of the bitmaps extension are: +The fields of the bitmaps extension are:: Byte 0 - 3: nb_bitmaps The number of bitmaps contained in the image. Must be @@ -331,15 +344,17 @@ The fields of the bitmaps extension are: Offset into the image file at which the bitmap directory starts. Must be aligned to a cluster boundary. -== Full disk encryption header pointer == +Full disk encryption header pointer +----------------------------------- The full disk encryption header must be present if, and only if, the -'crypt_method' header requires metadata. Currently this is only true -of the 'LUKS' crypt method. The header extension must be absent for +``crypt_method`` header requires metadata. Currently this is only true +of the ``LUKS`` crypt method. The header extension must be absent for other methods. This header provides the offset at which the crypt method can store its additional data, as well as the length of such data. +:: Byte 0 - 7: Offset into the image file at which the encryption header starts in bytes. Must be aligned to a cluster @@ -357,10 +372,10 @@ The first 592 bytes of the header clusters will contain the LUKS partition header. This is then followed by the key material data areas. The size of the key material data areas is determined by the number of stripes in the key slot and key size. 
Refer to the LUKS format -specification ('docs/on-disk-format.pdf' in the cryptsetup source +specification (``docs/on-disk-format.pdf`` in the cryptsetup source package) for details of the LUKS partition header format. -In the LUKS partition header, the "payload-offset" field will be +In the LUKS partition header, the ``payload-offset`` field will be calculated as normal for the LUKS spec. ie the size of the LUKS header, plus key material regions, plus padding, relative to the start of the LUKS header. This offset value is not required to be @@ -369,11 +384,12 @@ context of qcow2, since the qcow2 file format itself defines where the real payload offset is, but none the less a valid payload offset should always be present. -In the LUKS key slots header, the "key-material-offset" is relative +In the LUKS key slots header, the ``key-material-offset`` is relative to the start of the LUKS header clusters in the qcow2 container, not the start of the qcow2 file. Logically the layout looks like +:: +-----------------------------+ | QCow2 header | @@ -405,7 +421,8 @@ Logically the layout looks like | | +-----------------------------+ -== Data encryption == +Data encryption +--------------- When an encryption method is requested in the header, the image payload data must be encrypted/decrypted on every write/read. The image headers @@ -413,7 +430,7 @@ and metadata are never encrypted. The algorithms used for encryption vary depending on the method - - AES: + - ``AES``: The AES cipher, in CBC mode, with 256 bit keys. @@ -425,7 +442,7 @@ The algorithms used for encryption vary depending on the method supported in the command line tools for the sake of back compatibility and data liberation. - - LUKS: + - ``LUKS``: The algorithms are specified in the LUKS header. @@ -433,7 +450,8 @@ The algorithms used for encryption vary depending on the method in the LUKS header, with the physical disk sector as the input tweak. 
-== Host cluster management == +Host cluster management +----------------------- qcow2 manages the allocation of host clusters by maintaining a reference count for each host cluster. A refcount of 0 means that the cluster is free, 1 means @@ -453,14 +471,15 @@ Although a large enough refcount table can reserve clusters past 64 PB large), note that some qcow2 metadata such as L1/L2 tables must point to clusters prior to that point. -Note: qemu has an implementation limit of 8 MB as the maximum refcount -table size. With a 2 MB cluster size and a default refcount_order of -4, it is unable to reference host resources beyond 2 EB (61 bits); in -the worst case, with a 512 cluster size and refcount_order of 6, it is -unable to access beyond 32 GB (35 bits). +.. note:: + QEMU has an implementation limit of 8 MB as the maximum refcount + table size. With a 2 MB cluster size and a default refcount_order of + 4, it is unable to reference host resources beyond 2 EB (61 bits); in + the worst case, with a 512 cluster size and refcount_order of 6, it is + unable to access beyond 32 GB (35 bits). Given an offset into the image file, the refcount of its cluster can be -obtained as follows: +obtained as follows:: refcount_block_entries = (cluster_size * 8 / refcount_bits) @@ -470,7 +489,7 @@ obtained as follows: refcount_block = load_cluster(refcount_table[refcount_table_index]); return refcount_block[refcount_block_index]; -Refcount table entry: +Refcount table entry:: Bit 0 - 8: Reserved (set to 0) @@ -482,14 +501,15 @@ Refcount table entry: been allocated. All refcounts managed by this refcount block are 0. -Refcount block entry (x = refcount_bits - 1): +Refcount block entry ``(x = refcount_bits - 1)``:: Bit 0 - x: Reference count of the cluster. If refcount_bits implies a sub-byte width, note that bit 0 means the least significant bit in this context. 
-== Cluster mapping == +Cluster mapping +--------------- Just as for refcounts, qcow2 uses a two-level structure for the mapping of guest clusters to host clusters. They are called L1 and L2 table. @@ -509,7 +529,7 @@ compressed clusters to reside below 512 TB (49 bits), and this limit cannot be relaxed without an incompatible layout change). Given an offset into the virtual disk, the offset into the image file can be -obtained as follows: +obtained as follows:: l2_entries = (cluster_size / sizeof(uint64_t)) [*] @@ -523,7 +543,7 @@ obtained as follows: [*] this changes if Extended L2 Entries are enabled, see next section -L1 table entry: +L1 table entry:: Bit 0 - 8: Reserved (set to 0) @@ -538,7 +558,7 @@ L1 table entry: refcount is exactly one. This information is only accurate in the active L1 table. -L2 table entry: +L2 table entry:: Bit 0 - 61: Cluster descriptor @@ -555,7 +575,7 @@ L2 table entry: mapping for guest cluster offsets), so this bit should be 1 for all allocated clusters. -Standard Cluster Descriptor: +Standard Cluster Descriptor:: Bit 0: If set to 1, the cluster reads as all zeros. The host cluster offset can be used to describe a preallocation, @@ -577,7 +597,7 @@ Standard Cluster Descriptor: 56 - 61: Reserved (set to 0) -Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)): +Compressed Clusters Descriptor ``(x = 62 - (cluster_bits - 8))``:: Bit 0 - x-1: Host cluster offset. This is usually _not_ aligned to a cluster or sector boundary! If cluster_bits is @@ -601,7 +621,8 @@ file (except if bit 0 in the Standard Cluster Descriptor is set). If there is no backing file or the backing file is smaller than the image, they shall read zeros for all parts that are not covered by the backing file. -== Extended L2 Entries == +Extended L2 Entries +------------------- An image uses Extended L2 Entries if bit 4 is set on the incompatible_features field of the header. 
@@ -615,6 +636,8 @@ subclusters so they are treated the same as in images without this feature. The size of an extended L2 entry is 128 bits so the number of entries per table is calculated using this formula: +.. code:: + l2_entries = (cluster_size / (2 * sizeof(uint64_t))) The first 64 bits have the same format as the standard L2 table entry described @@ -623,7 +646,7 @@ descriptor. The last 64 bits contain a subcluster allocation bitmap with this format: -Subcluster Allocation Bitmap (for standard clusters): +Subcluster Allocation Bitmap (for standard clusters):: Bit 0 - 31: Allocation status (one bit per subcluster) @@ -647,13 +670,14 @@ Subcluster Allocation Bitmap (for standard clusters): Bits are assigned starting from the least significant one (i.e. bit x is used for subcluster x - 32). -Subcluster Allocation Bitmap (for compressed clusters): +Subcluster Allocation Bitmap (for compressed clusters):: Bit 0 - 63: Reserved (set to 0) Compressed clusters don't have subclusters, so this field is not used. -== Snapshots == +Snapshots +--------- qcow2 supports internal snapshots. Their basic principle of operation is to switch the active L1 table, so that a different set of host clusters are @@ -672,7 +696,7 @@ in the image file, whose starting offset and length are given by the header fields snapshots_offset and nb_snapshots. The entries of the snapshot table have variable length, depending on the length of ID, name and extra data. -Snapshot table entry: +Snapshot table entry:: Byte 0 - 7: Offset into the image file at which the L1 table for the snapshot starts. Must be aligned to a cluster boundary. @@ -728,7 +752,8 @@ Snapshot table entry: next multiple of 8. -== Bitmaps == +Bitmaps +------- As mentioned above, the bitmaps extension provides the ability to store bitmaps related to a virtual disk. This section describes how these bitmaps are stored. @@ -739,20 +764,23 @@ each bitmap size is equal to the virtual disk size. 
Each bit of the bitmap is responsible for strictly defined range of the virtual disk. For bit number bit_nr the corresponding range (in bytes) will be: +.. code:: + [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1] Granularity is a property of the concrete bitmap, see below. -=== Bitmap directory === +Bitmap directory +---------------- Each bitmap saved in the image is described in a bitmap directory entry. The bitmap directory is a contiguous area in the image file, whose starting offset -and length are given by the header extension fields bitmap_directory_offset and -bitmap_directory_size. The entries of the bitmap directory have variable +and length are given by the header extension fields ``bitmap_directory_offset`` and +``bitmap_directory_size``. The entries of the bitmap directory have variable length, depending on the lengths of the bitmap name and extra data. -Structure of a bitmap directory entry: +Structure of a bitmap directory entry:: Byte 0 - 7: bitmap_table_offset Offset into the image file at which the bitmap table @@ -833,7 +861,8 @@ Structure of a bitmap directory entry: next multiple of 8. All bytes of the padding must be zero. -=== Bitmap table === +Bitmap table +------------ Each bitmap is stored using a one-level structure (as opposed to two-level structures like for refcounts and guest clusters mapping) for the mapping of @@ -843,7 +872,7 @@ Each bitmap table has a variable size (stored in the bitmap directory entry) and may use multiple clusters, however, it must be contiguous in the image file. -Structure of a bitmap table entry: +Structure of a bitmap table entry:: Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero. If bits 9 - 55 are zero: @@ -860,11 +889,12 @@ Structure of a bitmap table entry: 56 - 63: Reserved and must be zero. -=== Bitmap data === +Bitmap data +----------- As noted above, bitmap data is stored in separate clusters, described by the bitmap table. 
Given an offset (in bytes) into the bitmap data, the offset into -the image file can be obtained as follows: +the image file can be obtained as follows:: image_offset(bitmap_data_offset) = bitmap_table[bitmap_data_offset / cluster_size] + @@ -875,7 +905,7 @@ above). Given an offset byte_nr into the virtual disk and the bitmap's granularity, the bit offset into the image file to the corresponding bit of the bitmap can be -calculated like this: +calculated like this:: bit_offset(byte_nr) = image_offset(byte_nr / granularity / 8) * 8 + @@ -886,21 +916,22 @@ last cluster of the bitmap data contains some unused tail bits. These bits must be zero. -=== Dirty tracking bitmaps === +Dirty tracking bitmaps +---------------------- -Bitmaps with 'type' field equal to one are dirty tracking bitmaps. +Bitmaps with ``type`` field equal to one are dirty tracking bitmaps. -When the virtual disk is in use dirty tracking bitmap may be 'enabled' or -'disabled'. While the bitmap is 'enabled', all writes to the virtual disk +When the virtual disk is in use dirty tracking bitmap may be ``enabled`` or +``disabled``. While the bitmap is ``enabled``, all writes to the virtual disk should be reflected in the bitmap. A set bit in the bitmap means that the corresponding range of the virtual disk (see above) was written to while the -bitmap was 'enabled'. An unset bit means that this range was not written to. +bitmap was ``enabled``. An unset bit means that this range was not written to. The software doesn't have to sync the bitmap in the image file with its -representation in RAM after each write or metadata change. Flag 'in_use' +representation in RAM after each write or metadata change. Flag ``in_use`` should be set while the bitmap is not synced. -In the image file the 'enabled' state is reflected by the 'auto' flag. If this -flag is set, the software must consider the bitmap as 'enabled' and start +In the image file the ``enabled`` state is reflected by the ``auto`` flag. 
If this +flag is set, the software must consider the bitmap as ``enabled`` and start tracking virtual disk changes to this bitmap from the first write to the virtual disk. If this flag is not set then the bitmap is disabled. diff --git a/docs/interop/qed_spec.rst b/docs/interop/qed_spec.rst new file mode 100644 index 0000000..cd6c7d9 --- /dev/null +++ b/docs/interop/qed_spec.rst @@ -0,0 +1,219 @@ +=================================== +QED Image File Format Specification +=================================== + +The file format looks like this:: + + +----------+----------+----------+-----+ + | cluster0 | cluster1 | cluster2 | ... | + +----------+----------+----------+-----+ + +The first cluster begins with the ``header``. The header contains information +about where regular clusters start; this allows the header to be extensible and +store extra information about the image file. A regular cluster may be +a ``data cluster``, an ``L2``, or an ``L1 table``. L1 and L2 tables are composed +of one or more contiguous clusters. + +Normally the file size will be a multiple of the cluster size. If the file size +is not a multiple, extra information after the last cluster may not be preserved +if data is written. Legitimate extra information should use space between the header +and the first regular cluster. + +All fields are little-endian. 
+
+Header
+------
+
+::
+
+    Header {
+        uint32_t magic;               /* QED\0 */
+
+        uint32_t cluster_size;        /* in bytes */
+        uint32_t table_size;          /* for L1 and L2 tables, in clusters */
+        uint32_t header_size;         /* in clusters */
+
+        uint64_t features;            /* format feature bits */
+        uint64_t compat_features;     /* compat feature bits */
+        uint64_t autoclear_features;  /* self-resetting feature bits */
+
+        uint64_t l1_table_offset;     /* in bytes */
+        uint64_t image_size;          /* total logical image size, in bytes */
+
+        /* if (features & QED_F_BACKING_FILE) */
+        uint32_t backing_filename_offset; /* in bytes from start of header */
+        uint32_t backing_filename_size;   /* in bytes */
+    }
+
+Field descriptions:
+~~~~~~~~~~~~~~~~~~~
+
+- ``cluster_size`` must be a power of 2 in range [2^12, 2^26].
+- ``table_size`` must be a power of 2 in range [1, 16].
+- ``header_size`` is the number of clusters used by the header and any additional
+  information stored before regular clusters.
+- ``features``, ``compat_features``, and ``autoclear_features`` are file format
+  extension bitmaps. They work as follows:
+
+  - An image with unknown ``features`` bits enabled must not be opened. File format
+    changes that are not backwards-compatible must use ``features`` bits.
+  - An image with unknown ``compat_features`` bits enabled can be opened safely.
+    The unknown features are simply ignored and represent backwards-compatible
+    changes to the file format.
+  - An image with unknown ``autoclear_features`` bits enabled can be opened safely
+    after clearing the unknown bits. This allows for backwards-compatible changes
+    to the file format which degrade gracefully and can be re-enabled again by a
+    new program later.
+- ``l1_table_offset`` is the offset of the first byte of the L1 table in the image
+  file and must be a multiple of ``cluster_size``.
+- ``image_size`` is the block device size seen by the guest and must be a multiple
+  of 512 bytes.
+- ``backing_filename_offset`` and ``backing_filename_size`` describe a string in
+  (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints.
+  The string must be stored within the first ``header_size`` clusters. The backing filename
+  may be an absolute path or relative to the image file.
+
+Feature bits:
+~~~~~~~~~~~~~
+
+- ``QED_F_BACKING_FILE = 0x01``. The image uses a backing file.
+- ``QED_F_NEED_CHECK = 0x02``. The image needs a consistency check before use.
+- ``QED_F_BACKING_FORMAT_NO_PROBE = 0x04``. The backing file is a raw disk image
+  and no file format autodetection should be attempted. This should be used to
+  ensure that raw backing files are never detected as an image format if they happen
+  to contain magic constants.
+
+There are currently no defined ``compat_features`` or ``autoclear_features`` bits.
+
+Fields predicated on a feature bit are only used when that feature is set.
+The fields always take up header space, regardless of whether or not the feature
+bit is set.
+
+Tables
+------
+
+Tables provide the translation from logical offsets in the block device to cluster
+offsets in the file.
+
+::
+
+    #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
+
+    Table {
+        uint64_t offsets[TABLE_NOFFSETS];
+    }
+
+The tables are organized as follows::
+
+                     +----------+
+                     | L1 table |
+                     +----------+
+                ,------'  |  '------.
+           +----------+   |    +----------+
+           | L2 table |  ...   | L2 table |
+           +----------+        +----------+
+       ,------'  |  '------.
+  +----------+   |    +----------+
+  |   Data   |  ...   |   Data   |
+  +----------+        +----------+
+
+A table is made up of one or more contiguous clusters. The ``table_size`` header
+field determines table size for an image file. For example, ``cluster_size=64 KB``
+and ``table_size=4`` results in 256 KB tables.
+
+The logical image size must be less than or equal to the maximum possible size of
+clusters rooted by the L1 table:
+
+.. code::
+
+  header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
+
+L1, L2, and data cluster offsets must be aligned to ``header.cluster_size``.
+The following offsets have special meanings:
+
+L2 table offsets
+~~~~~~~~~~~~~~~~
+
+- 0 - unallocated. The L2 table is not yet allocated.
+
+Data cluster offsets
+~~~~~~~~~~~~~~~~~~~~
+
+- 0 - unallocated. The data cluster is not yet allocated.
+- 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
+
+Future format extensions may wish to store per-offset information. The least
+significant 12 bits of an offset are reserved for this purpose and must be set
+to zero. Image files with ``cluster_size`` > 2^12 will have more unused bits
+which should also be zeroed.
+
+Unallocated L2 tables and data clusters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reads to an unallocated area of the image file access the backing file. If there
+is no backing file, then zeroes are produced. The backing file may be smaller
+than the image file and reads of unallocated areas beyond the end of the backing
+file produce zeroes.
+
+Writes to an unallocated area cause a new data cluster to be allocated, and a new
+L2 table if that is also unallocated. The new data cluster is populated with data
+from the backing file (or zeroes if no backing file) and the data being written.
+
+Zero data clusters
+~~~~~~~~~~~~~~~~~~
+
+Zero data clusters are a space-efficient way of storing zeroed regions of the image.
+
+Reads to a zero data cluster produce zeroes.
+
+.. note::
+    The difference between an unallocated and a zero data cluster is that zero data
+    clusters stop the reading of contents from the backing file.
+
+Writes to a zero data cluster cause a new data cluster to be allocated. The new
+data cluster is populated with zeroes and the data being written.
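The read rules for unallocated and zero data clusters described above can be sketched in Python. This is a minimal illustration under stated assumptions, not QEMU's implementation: ``read_image`` and ``read_backing`` are hypothetical callbacks standing in for real file I/O, and only whole-cluster reads are modelled.

```python
UNALLOCATED = 0  # L2 entry value: data cluster not yet allocated
ZERO = 1         # L2 entry value: cluster reads as zeroes, nothing allocated


def read_data_cluster(l2_entry, cluster_size, read_image, read_backing=None):
    """Return the bytes a read of one whole data cluster would produce.

    read_image(offset, length) and read_backing(length) are hypothetical
    callbacks; read_backing may return fewer bytes than requested when the
    backing file ends early.
    """
    if l2_entry == UNALLOCATED:
        if read_backing is None:
            # No backing file: unallocated areas read as zeroes.
            return bytes(cluster_size)
        data = read_backing(cluster_size)
        # A backing file may be shorter than the image; pad with zeroes.
        return data + bytes(cluster_size - len(data))
    if l2_entry == ZERO:
        # Zero clusters never fall through to the backing file.
        return bytes(cluster_size)
    # Any other value is the cluster's byte offset in the image file.
    return read_image(l2_entry, cluster_size)
```

The only behavioural difference between the two special values shows up when a backing file exists: offset 0 consults it, offset 1 does not.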
+
+Logical offset translation
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Logical offsets are translated into cluster offsets as follows::
+
+   table_bits table_bits    cluster_bits
+   <--------> <--------> <--------------->
+  +----------+----------+-----------------+
+  | L1 index | L2 index |   byte offset   |
+  +----------+----------+-----------------+
+
+        Structure of a logical offset
+
+  offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
+
+  def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
+    l2_offset = l1_table[l1_index]
+    l2_table = load_table(l2_offset)
+    cluster_offset = l2_table[l2_index] & offset_mask
+    return cluster_offset + byte_offset
+
+Consistency checking
+--------------------
+
+This section is informational and included to provide background on the use
+of the ``QED_F_NEED_CHECK`` ``features`` bit.
+
+The ``QED_F_NEED_CHECK`` bit is used to mark an image as dirty before starting
+an operation that could leave the image in an inconsistent state if interrupted
+by a crash or power failure. A dirty image must be checked on open because its
+metadata may not be consistent.
+
+Consistency check includes the following invariants:
+
+- Each cluster is referenced once and only once. It is an inconsistency to have
+  a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked
+  if it has no references.
+- Offsets must be within the image file size and must be ``cluster_size`` aligned.
+- Table offsets must be at least ``table_size`` * ``cluster_size`` bytes from the end
+  of the image file so that there is space for the entire table.
+
+The consistency check process starts from ``l1_table_offset`` and scans all L2 tables.
+After the check completes with no other errors besides leaks, the ``QED_F_NEED_CHECK``
+bit can be cleared and the image can be accessed.
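The pseudocode in the translation section takes the L1 index, L2 index and byte offset as given. A minimal sketch of how a logical offset might be split into those three fields, and of the image size bound from the Tables section, follows. The function names are assumptions made for this illustration; the arithmetic relies only on ``TABLE_NOFFSETS`` and the power-of-2 constraints stated in the spec.

```python
def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets in one table (TABLE_NOFFSETS in the spec)."""
    return table_size * cluster_size // 8


def split_logical_offset(pos, cluster_size, table_size):
    """Split a logical byte position into (l1_index, l2_index, byte_offset)."""
    n = table_noffsets(cluster_size, table_size)
    byte_offset = pos & (cluster_size - 1)   # low cluster_bits
    cluster_index = pos // cluster_size
    l2_index = cluster_index % n             # middle table_bits
    l1_index = cluster_index // n            # high table_bits
    return l1_index, l2_index, byte_offset


def max_image_size(cluster_size, table_size):
    """Upper bound on header.image_size for the given geometry."""
    n = table_noffsets(cluster_size, table_size)
    return n * n * cluster_size
```

For the spec's own example of 64 KB clusters and ``table_size=4``, each table holds 32768 offsets and the bound works out to 2^46 bytes (64 TiB).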
diff --git a/docs/interop/qed_spec.txt b/docs/interop/qed_spec.txt deleted file mode 100644 index 7982e05..0000000 --- a/docs/interop/qed_spec.txt +++ /dev/null @@ -1,138 +0,0 @@ -=Specification= - -The file format looks like this: - - +----------+----------+----------+-----+ - | cluster0 | cluster1 | cluster2 | ... | - +----------+----------+----------+-----+ - -The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters. - -Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster. - -All fields are little-endian. - -==Header== - Header { - uint32_t magic; /* QED\0 */ - - uint32_t cluster_size; /* in bytes */ - uint32_t table_size; /* for L1 and L2 tables, in clusters */ - uint32_t header_size; /* in clusters */ - - uint64_t features; /* format feature bits */ - uint64_t compat_features; /* compat feature bits */ - uint64_t autoclear_features; /* self-resetting feature bits */ - - uint64_t l1_table_offset; /* in bytes */ - uint64_t image_size; /* total logical image size, in bytes */ - - /* if (features & QED_F_BACKING_FILE) */ - uint32_t backing_filename_offset; /* in bytes from start of header */ - uint32_t backing_filename_size; /* in bytes */ - } - -Field descriptions: -* ''cluster_size'' must be a power of 2 in range [2^12, 2^26]. -* ''table_size'' must be a power of 2 in range [1, 16]. -* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters. 
-* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows: -** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits. -** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format. -** An image with unknown ''autoclear_features'' bits enable can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later. -* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''. -* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes. -* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file. - -Feature bits: -* QED_F_BACKING_FILE = 0x01. The image uses a backing file. -* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use. -* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants. - -There are currently no defined ''compat_features'' or ''autoclear_features'' bits. - -Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set. 
- -==Tables== - -Tables provide the translation from logical offsets in the block device to cluster offsets in the file. - - #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t)) - - Table { - uint64_t offsets[TABLE_NOFFSETS]; - } - -The tables are organized as follows: - - +----------+ - | L1 table | - +----------+ - ,------' | '------. - +----------+ | +----------+ - | L2 table | ... | L2 table | - +----------+ +----------+ - ,------' | '------. - +----------+ | +----------+ - | Data | ... | Data | - +----------+ +----------+ - -A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables. - -The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table: - header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size - -L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings: - -===L2 table offsets=== -* 0 - unallocated. The L2 table is not yet allocated. - -===Data cluster offsets=== -* 0 - unallocated. The data cluster is not yet allocated. -* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated. - -Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed. - -===Unallocated L2 tables and data clusters=== -Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes. 
- -Writes to an unallocated area cause a new data clusters to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written. - -===Zero data clusters=== -Zero data clusters are a space-efficient way of storing zeroed regions of the image. - -Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file. - -Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written. - -===Logical offset translation=== -Logical offsets are translated into cluster offsets as follows: - - table_bits table_bits cluster_bits - <--------> <--------> <---------------> - +----------+----------+-----------------+ - | L1 index | L2 index | byte offset | - +----------+----------+-----------------+ - - Structure of a logical offset - - offset_mask = ~(cluster_size - 1) # mask for the image file byte offset - - def logical_to_cluster_offset(l1_index, l2_index, byte_offset): - l2_offset = l1_table[l1_index] - l2_table = load_table(l2_offset) - cluster_offset = l2_table[l2_index] & offset_mask - return cluster_offset + byte_offset - -==Consistency checking== - -This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit. - -The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent. - -Consistency check includes the following invariants: -# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. 
A cluster has been leaked if it has no references. -# Offsets must be within the image file size and must be ''cluster_size'' aligned. -# Table offsets must at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table. - -The consistency check process starts by from ''l1_table_offset'' and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed. diff --git a/docs/interop/vfio-user.rst b/docs/interop/vfio-user.rst new file mode 100644 index 0000000..0b06f02 --- /dev/null +++ b/docs/interop/vfio-user.rst @@ -0,0 +1,1520 @@ +.. include:: <isonum.txt> +.. SPDX-License-Identifier: GPL-2.0-or-later + +================================ +vfio-user Protocol Specification +================================ + +.. contents:: Table of Contents + +Introduction +============ +vfio-user is a protocol that allows a device to be emulated in a separate +process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist +of a generic VFIO device type, living inside the VMM, which we call the client, +and the core device implementation, living outside the VMM, which we call the +server. + +The vfio-user specification is partly based on the +`Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_. + +VFIO is a mature and stable API, backed by an extensively used framework. The +existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely +re-used, though there is nothing in this specification that requires that +particular implementation. None of the VFIO kernel modules are required for +supporting the protocol, on either the client or server side. Some source +definitions in VFIO are re-used for vfio-user. + +The main idea is to allow a virtual device to function in a separate process in +the same host over a UNIX domain socket. 
A UNIX domain socket (``AF_UNIX``) is +chosen because file descriptors can be trivially sent over it, which in turn +allows: + +* Sharing of client memory for DMA with the server. +* Sharing of server memory with the client for fast MMIO. +* Efficient sharing of eventfd's for triggering interrupts. + +Other socket types could be used which allow the server to run in a separate +guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically +the underlying transport does not necessarily have to be a socket, however we do +not examine such alternatives. In this protocol version we focus on using a UNIX +domain socket and introduce basic support for the other two types of sockets +without considering performance implications. + +While passing of file descriptors is desirable for performance reasons, support +is not necessary for either the client or the server in order to implement the +protocol. There is always an in-band, message-passing fall back mechanism. + +Overview +======== + +VFIO is a framework that allows a physical device to be securely passed through +to a user space process; the device-specific kernel driver does not drive the +device at all. Typically, the user space process is a VMM and the device is +passed through to it in order to achieve high performance. VFIO provides an API +and the required functionality in the kernel. QEMU has adopted VFIO to allow a +guest to directly access physical devices, instead of emulating them in +software. + +vfio-user reuses the core VFIO concepts defined in its API, but implements them +as messages to be sent over a socket. It does not change the kernel-based VFIO +in any way, in fact none of the VFIO kernel modules need to be loaded to use +vfio-user. It is also possible for the client to concurrently use the current +kernel-based VFIO for one device, and vfio-user for another device. + +VFIO Device Model +----------------- + +A device under VFIO presents a standard interface to the user process. 
Many of
+the VFIO operations in the existing interface use the ``ioctl()`` system call, and
+references to the existing interface are called the ``ioctl()`` implementation in
+this document.
+
+The following sections describe the set of messages that implement the vfio-user
+interface over a socket. In many cases, the messages are analogous to data
+structures used in the ``ioctl()`` implementation. Messages derived from the
+``ioctl()`` will have a name derived from the ``ioctl()`` command name. E.g., the
+``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
+``VFIO_USER_DEVICE_GET_INFO`` message. The purpose of this reuse is to share as
+much code as feasible with the ``ioctl()`` implementation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+
+After the client connects to the server, the initial client message is
+``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
+apply to the session. The server replies with a compatible version and set of
+capabilities it supports, or closes the connection if it cannot support the
+advertised version.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
+information about the device. This information includes:
+
+* The device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
+* the number of device regions, and
+* the number of interrupt types the device supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
+server for information about the device's regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
+* Region index, size, and offset.
+
+When a device region can be mapped by the client, the server provides a file
+descriptor which the client can ``mmap()``.
The server is responsible for +polling for client updates to memory mapped regions. + +Region Capabilities +""""""""""""""""""" + +Some regions have additional capabilities that cannot be described adequately +by the region info data structure. These capabilities are returned in the +region info reply in a list similar to PCI capabilities in a PCI device's +configuration space. + +Sparse Regions +"""""""""""""" +A region can be memory-mappable in whole or in part. When only a subset of a +region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` +capability is included in the region info reply. This capability describes +which portions can be mapped by the client. + +.. Note:: + For example, in a virtual NVMe controller, sparse regions can be used so + that accesses to the NVMe registers (found in the beginning of BAR0) are + trapped (an infrequent event), while allowing direct access to the doorbells + (an extremely frequent event as every I/O submission requires a write to + BAR0), found in the next page after the NVMe registers in BAR0. + +Device-Specific Regions +""""""""""""""""""""""" + +A device can define regions additional to the standard ones (e.g. PCI indexes +0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability +in the region info reply of a device-specific region. Such regions are reflected +in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this +value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``. + +Region I/O via file descriptors +------------------------------- + +For unmapped regions, region I/O from the client is done via +``VFIO_USER_REGION_READ/WRITE``. As an optimization, ioeventfds or ioregionfds +may be configured for sub-regions of some regions. 
A client may request +information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by +configuring the returned file descriptors as ioeventfds or ioregionfds, the +server can be directly notified of I/O (for example, by KVM) without taking a +trip through the client. + +Interrupts +^^^^^^^^^^ + +The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server +for the device's interrupt types. The interrupt types are specific to the bus +the device is attached to, and the client is expected to know the capabilities +of each interrupt type. The server can signal an interrupt by directly injecting +interrupts into the guest via an event file descriptor. The client configures +how the server signals an interrupt with ``VFIO_USER_SET_IRQS`` messages. + +Device Read and Write +^^^^^^^^^^^^^^^^^^^^^ + +When the guest executes load or store operations to an unmapped device region, +the client forwards these operations to the server with +``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server +will reply with data from the device on read operations or an acknowledgement on +write operations. See `Read and Write Operations`_. + +Client memory access +-------------------- + +The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to +inform the server of the valid DMA ranges that the server can access on behalf +of a device (typically, VM guest memory). DMA memory may be accessed by the +server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the +socket. In this case, the "DMA" part of the naming is a misnomer. + +Actual direct memory access of client memory from the server is possible if the +client provides file descriptors the server can ``mmap()``. Note that ``mmap()`` +privileges cannot be revoked by the client, therefore file descriptors should +only be exported in environments where the client trusts the server not to +corrupt guest memory. + +See `Read and Write Operations`_. 
+ +Client/server interactions +========================== + +Socket +------ + +A server can serve: + +1) one or more clients, and/or +2) one or more virtual devices, belonging to one or more clients. + +The current protocol specification requires a dedicated socket per +client/server connection. It is a server-side implementation detail whether a +single server handles multiple virtual devices from the same or multiple +clients. The location of the socket is implementation-specific. Multiplexing +clients, devices, and servers over the same socket is not supported in this +version of the protocol. + +Authentication +-------------- + +For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files, +therefore it is up to the management layer to set up the socket as required. +Socket types that span guests or hosts will require a proper authentication +mechanism. Defining that mechanism is deferred to a future version of the +protocol. + +Command Concurrency +------------------- + +A client may pipeline multiple commands without waiting for previous command +replies. The server will process commands in the order they are received. A +consequence of this is if a client issues a command with the *No_reply* bit, +then subsequently issues a command without *No_reply*, the older command will +have been processed before the reply to the younger command is sent by the +server. The client must be aware of the device's capability to process +concurrent commands if pipelining is used. For example, pipelining allows +multiple client threads to concurrently access device regions; the client must +ensure these accesses obey device semantics. + +An example is a frame buffer device, where the device may allow concurrent +access to different areas of video memory, but may have indeterminate behavior +if concurrent accesses are performed to command or status registers. 
+ +Note that unrelated messages sent from the server to the client can appear in +between a client to server request/reply and vice versa. + +Implementers should be prepared for certain commands to exhibit potentially +unbounded latencies. For example, ``VFIO_USER_DEVICE_RESET`` may take an +arbitrarily long time to complete; clients should take care not to block +unnecessarily. + +Socket Disconnection Behavior +----------------------------- +The server and the client can disconnect from each other, either intentionally +or unexpectedly. Both the client and the server need to know how to handle such +events. + +Server Disconnection +^^^^^^^^^^^^^^^^^^^^ +A server disconnecting from the client may indicate that: + +1) A virtual device has been restarted, either intentionally (e.g. because of a + device update) or unintentionally (e.g. because of a crash). +2) A virtual device has been shut down with no intention to be restarted. + +It is impossible for the client to know whether or not a failure is +intermittent or innocuous and should be retried, therefore the client should +reset the VFIO device when it detects the socket has been disconnected. +Error recovery will be driven by the guest's device error handling +behavior. + +Client Disconnection +^^^^^^^^^^^^^^^^^^^^ +The client disconnecting from the server primarily means that the client +has exited. Currently, this means that the guest is shut down so the device is +no longer needed therefore the server can automatically exit. However, there +can be cases where a client disconnection should not result in a server exit: + +1) A single server serving multiple clients. +2) A multi-process QEMU upgrading itself step by step, which is not yet + implemented. 
+
+Therefore, in order for the protocol to be forward compatible, the server should
+respond to a client disconnection as follows:
+
+ - all client memory regions are unmapped and cleaned up (including closing any
+   passed file descriptors)
+ - all IRQ file descriptors passed from the old client are closed
+ - the device state should otherwise be retained
+
+The expectation is that when a client reconnects, it will re-establish IRQ and
+client memory mappings.
+
+If anything happens to the client (for example, QEMU really did exit), the control
+stack will know about it and can clean up resources accordingly.
+
+Security Considerations
+-----------------------
+
+Speaking generally, vfio-user clients should not trust servers, and vice versa.
+Standard tools and mechanisms should be used on both sides to validate input and
+protect against denial of service scenarios, buffer overflow, etc.
+
+Request Retry and Response Timeout
+----------------------------------
+A failed command is a command that has been successfully sent and has been
+responded to with an error code. Failure to send the command in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+.. Note::
+   QEMU's VFIO retries certain operations if they fail. While this makes sense
+   for real HW, we don't know for sure whether it makes sense for virtual
+   devices.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+Message sizes
+-------------
+
+Some requests have an ``argsz`` field. In a request, it defines the maximum
+expected reply payload size, which should be at least the size of the fixed
+reply payload headers defined here. The *request* payload size is defined by the
+usual ``msg_size`` field in the header, not the ``argsz`` field.
+
+In a reply, the server sets the ``argsz`` field to the size needed for a full
+payload. This may be less than the requested maximum size.
This may be +larger than the requested maximum size: in that case, the full payload is not +included in the reply, but the ``argsz`` field in the reply indicates the needed +size, allowing a client to allocate a larger buffer for holding the reply before +trying again. + +In addition, during negotiation (see `Version`_), the client and server may +each specify a ``max_data_xfer_size`` value; this defines the maximum data that +may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE`` +messages; see `Read and Write Operations`_. + +Protocol Specification +====================== + +To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed +with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the +endianness of the host system, although this may be relaxed in future +revisions in cases where the client and server run on different hosts +with different endianness. + +Unless otherwise specified, all sizes should be presumed to be in bytes. + +.. _Commands: + +Commands +-------- +The following table lists the VFIO message command IDs, and whether the +message command is sent from the client or the server. 
+ +====================================== ========= ================= +Name Command Request Direction +====================================== ========= ================= +``VFIO_USER_VERSION`` 1 client -> server +``VFIO_USER_DMA_MAP`` 2 client -> server +``VFIO_USER_DMA_UNMAP`` 3 client -> server +``VFIO_USER_DEVICE_GET_INFO`` 4 client -> server +``VFIO_USER_DEVICE_GET_REGION_INFO`` 5 client -> server +``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` 6 client -> server +``VFIO_USER_DEVICE_GET_IRQ_INFO`` 7 client -> server +``VFIO_USER_DEVICE_SET_IRQS`` 8 client -> server +``VFIO_USER_REGION_READ`` 9 client -> server +``VFIO_USER_REGION_WRITE`` 10 client -> server +``VFIO_USER_DMA_READ`` 11 server -> client +``VFIO_USER_DMA_WRITE`` 12 server -> client +``VFIO_USER_DEVICE_RESET`` 13 client -> server +``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server +====================================== ========= ================= + +Header +------ + +All messages, both command messages and reply messages, are preceded by a +16-byte header that contains basic information about the message. The header is +followed by message-specific data described in the sections below. 
+ ++----------------+--------+-------------+ +| Name | Offset | Size | ++================+========+=============+ +| Message ID | 0 | 2 | ++----------------+--------+-------------+ +| Command | 2 | 2 | ++----------------+--------+-------------+ +| Message size | 4 | 4 | ++----------------+--------+-------------+ +| Flags | 8 | 4 | ++----------------+--------+-------------+ +| | +-----+------------+ | +| | | Bit | Definition | | +| | +=====+============+ | +| | | 0-3 | Type | | +| | +-----+------------+ | +| | | 4 | No_reply | | +| | +-----+------------+ | +| | | 5 | Error | | +| | +-----+------------+ | ++----------------+--------+-------------+ +| Error | 12 | 4 | ++----------------+--------+-------------+ +| <message data> | 16 | variable | ++----------------+--------+-------------+ + +* *Message ID* identifies the message, and is echoed in the command's reply + message. Message IDs belong entirely to the sender, can be re-used (even + concurrently) and the receiver must not make any assumptions about their + uniqueness. +* *Command* specifies the command to be executed, listed in Commands_. It is + also set in the reply header. +* *Message size* contains the size of the entire message, including the header. +* *Flags* contains attributes of the message: + + * The *Type* bits indicate the message type. + + * *Command* (value 0x0) indicates a command message. + * *Reply* (value 0x1) indicates a reply message acknowledging a previous + command with the same message ID. + * *No_reply* in a command message indicates that no reply is needed for this + command. This is commonly used when multiple commands are sent, and only + the last needs acknowledgement. + * *Error* in a reply message indicates the command being acknowledged had + an error. In this case, the *Error* field will be valid. + +* *Error* in a reply message is an optional UNIX errno value. It may be zero + even if the Error bit is set in Flags. It is reserved in a command message. 
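The header layout above can be sketched in Python as follows. This is only an illustration: it assumes host byte order (as this revision specifies) and no padding between fields, and the names ``HEADER``, ``pack_header`` and ``unpack_header`` are hypothetical helpers, not part of the protocol or of any existing library.

```python
import struct

# Message ID (2), Command (2), Message size (4), Flags (4), Error (4).
# "=" selects native byte order with standard sizes and no padding.
HEADER = struct.Struct("=HHIII")

TYPE_MASK = 0xF          # Flags bits 0-3: message type
TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1
FLAG_NO_REPLY = 1 << 4   # Flags bit 4
FLAG_ERROR = 1 << 5      # Flags bit 5


def pack_header(msg_id, command, payload_len, flags=TYPE_COMMAND, error=0):
    """Build a header; Message size covers the header plus the payload."""
    return HEADER.pack(msg_id, command, HEADER.size + payload_len, flags, error)


def unpack_header(buf):
    """Split a received header into its fields and decoded flag bits."""
    msg_id, command, msg_size, flags, error = HEADER.unpack_from(buf)
    return {
        "msg_id": msg_id,
        "command": command,
        "msg_size": msg_size,
        "is_reply": (flags & TYPE_MASK) == TYPE_REPLY,
        "no_reply": bool(flags & FLAG_NO_REPLY),
        "error_set": bool(flags & FLAG_ERROR),
        "error": error,
    }


# A VFIO_USER_DEVICE_GET_INFO command (ID 4) carrying a 16-byte payload.
hdr = unpack_header(pack_header(msg_id=7, command=4, payload_len=16))
```

Because a Message ID may be re-used by its sender, a receiver built along these lines should only echo the ID back and not use it as a unique key.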
+ +Each command message in Commands_ must be replied to with a reply message, +unless the message sets the *No_reply* bit. The reply consists of the header +with the *Reply* bit set, plus any additional data. + +If an error occurs, the reply message must include only the reply header. + +As the header is standard in both requests and replies, it is not included in +the command-specific specifications below; each message definition should be +appended to the standard header, and the offsets are given from the end of the +standard header. + +``VFIO_USER_VERSION`` +--------------------- + +.. _Version: + +This is the initial message sent by the client after the socket connection is +established; the same format is used for the server's reply. + +Upon establishing a connection, the client must send a ``VFIO_USER_VERSION`` +message proposing a protocol version and a set of capabilities. The server +compares these with the versions and capabilities it supports and sends a +``VFIO_USER_VERSION`` reply according to the following rules. + +* The major version in the reply must be the same as proposed. If the server + does not support the proposed major, it closes the connection. +* The minor version in the reply must be equal to or less than the minor + version proposed. +* The capability list must be a subset of those proposed. If the server + requires a capability the client did not include, it closes the connection. + +The protocol major version will only change when incompatible protocol changes +are made, such as changing the message format. The minor version may change +when compatible changes are made, such as adding new messages or capabilities. +Both the client and server must support all minor versions less than the +maximum minor version they support. E.g., an implementation that supports +version 1.3 must also support 1.0 through 1.2.
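The negotiation rules above can be sketched from the server's side as below. This is a simplified illustration under stated assumptions: capabilities are modelled as a set of names only (their values are ignored), and ``negotiate``, ``SERVER`` and the dictionary keys are hypothetical names chosen for this example.

```python
def negotiate(proposal, server):
    """Return the (major, minor, capabilities) reply, or None to close."""
    major, minor, caps = proposal
    # The reply must carry the proposed major, so an unsupported major
    # leaves the server no valid reply: close the connection.
    if major != server["major"]:
        return None
    # A capability the server requires but the client did not propose
    # also ends the connection.
    if not server["required"] <= caps:
        return None
    # The minor may only go down, and the capabilities must be a subset
    # of those proposed.
    return (major, min(minor, server["minor"]), caps & server["supported"])


# Hypothetical server configuration, for illustration only.
SERVER = {
    "major": 0,
    "minor": 3,
    "supported": {"max_msg_fds", "pgsizes", "write_multiple"},
    "required": {"pgsizes"},
}

reply = negotiate((0, 5, {"pgsizes", "migration"}), SERVER)
```

Here the client proposes a newer minor version and a capability the server does not know; the reply lowers the minor version and drops the unknown capability rather than failing.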
+ +When making a change to this specification, the protocol version number must +be included in the form "added in version X.Y" + +Request +^^^^^^^ + +============== ====== ==== +Name Offset Size +============== ====== ==== +version major 0 2 +version minor 2 2 +version data 4 variable (including terminating NUL). Optional. +============== ====== ==== + +The version data is an optional UTF-8 encoded JSON byte array with the following +format: + ++--------------+--------+-----------------------------------+ +| Name | Type | Description | ++==============+========+===================================+ +| capabilities | object | Contains common capabilities that | +| | | the sender supports. Optional. | ++--------------+--------+-----------------------------------+ + +Capabilities: + ++--------------------+---------+------------------------------------------------+ +| Name | Type | Description | ++====================+=========+================================================+ +| max_msg_fds | number | Maximum number of file descriptors that can be | +| | | received by the sender in one message. | +| | | Optional. If not specified then the receiver | +| | | must assume a value of ``1``. | ++--------------------+---------+------------------------------------------------+ +| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | +| | | see `Read and Write Operations`_. Optional, | +| | | with a default value of 1048576 bytes. | ++--------------------+---------+------------------------------------------------+ +| pgsizes | number | Page sizes supported in DMA map operations | +| | | or'ed together. Optional, with a default value | +| | | of supporting only 4k pages. | ++--------------------+---------+------------------------------------------------+ +| max_dma_maps | number | Maximum number DMA map windows that can be | +| | | valid simultaneously. Optional, with a | +| | | value of 65535 (64k-1). 
| ++--------------------+---------+------------------------------------------------+ +| migration | object | Migration capability parameters. If missing | +| | | then migration is not supported by the sender. | ++--------------------+---------+------------------------------------------------+ +| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages | +| | | are supported if the value is ``true``. | ++--------------------+---------+------------------------------------------------+ + +The migration capability contains the following name/value pairs: + ++-----------------+--------+--------------------------------------------------+ +| Name | Type | Description | ++=================+========+==================================================+ +| pgsize | number | Page size of dirty pages bitmap. The smaller of | +| | | the client's and the server's values is used. | ++-----------------+--------+--------------------------------------------------+ +| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` | +| | | and ``VFIO_USER_DMA_UNMAP`` messages. Optional, | +| | | with a default value of 256MB. | ++-----------------+--------+--------------------------------------------------+ + +Reply +^^^^^ + +The same message format is used in the server's reply with the semantics +described above. + +``VFIO_USER_DMA_MAP`` +--------------------- + +This command message is sent by the client to the server to inform it of the +memory regions the server can access. It must be sent before the server can +perform any DMA to the client. It is normally sent directly after the version +handshake is completed, but may also occur when memory is added to the client, +or if the client uses a vIOMMU.
+ +Request +^^^^^^^ + +The request payload for this message is a structure of the following format: + ++-------------+--------+-------------+ +| Name | Offset | Size | ++=============+========+=============+ +| argsz | 0 | 4 | ++-------------+--------+-------------+ +| flags | 4 | 4 | ++-------------+--------+-------------+ +| | +-----+------------+ | +| | | Bit | Definition | | +| | +=====+============+ | +| | | 0 | readable | | +| | +-----+------------+ | +| | | 1 | writeable | | +| | +-----+------------+ | ++-------------+--------+-------------+ +| offset | 8 | 8 | ++-------------+--------+-------------+ +| address | 16 | 8 | ++-------------+--------+-------------+ +| size | 24 | 8 | ++-------------+--------+-------------+ + +* *argsz* is the size of the above structure. Note there is no reply payload, + so this field differs from other message types. +* *flags* contains the following region attributes: + + * *readable* indicates that the region can be read from. + + * *writeable* indicates that the region can be written to. + +* *offset* is the file offset of the region with respect to the associated file + descriptor, or zero if the region is not mappable. +* *address* is the base DMA address of the region. +* *size* is the size of the region. + +This structure is 32 bytes in size, so the message size is 16 + 32 = 48 bytes. + +If the DMA region being added can be directly mapped by the server, a file +descriptor must be sent as part of the message meta-data. The region can be +mapped via the mmap() system call. On ``AF_UNIX`` sockets, the file descriptor +must be passed as ``SCM_RIGHTS`` type ancillary data. Otherwise, if the DMA +region cannot be directly mapped by the server, a file descriptor must not be sent +as part of the message meta-data and the DMA region can be accessed by the +server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages, +explained in `Read and Write Operations`_.
A command to map over an existing +region must be failed by the server with ``EEXIST`` set in error field in the +reply. + +Reply +^^^^^ + +There is no payload in the reply message. + +``VFIO_USER_DMA_UNMAP`` +----------------------- + +This command message is sent by the client to the server to inform it that a +DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command +message, is no longer available for DMA. It typically occurs when memory is +subtracted from the client or if the client uses a vIOMMU. The DMA region is +described by the following structure: + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format: + ++--------------+--------+------------------------+ +| Name | Offset | Size | ++==============+========+========================+ +| argsz | 0 | 4 | ++--------------+--------+------------------------+ +| flags | 4 | 4 | ++--------------+--------+------------------------+ +| address | 8 | 8 | ++--------------+--------+------------------------+ +| size | 16 | 8 | ++--------------+--------+------------------------+ + +* *argsz* is the maximum size of the reply payload. +* *flags* is unused in this version. +* *address* is the base DMA address of the DMA region. +* *size* is the size of the DMA region. + +The address and size of the DMA region being unmapped must match exactly a +previous mapping. + +Reply +^^^^^ + +Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is +mapped then the server must release all references to that DMA region before +replying, which potentially includes in-flight DMA transactions. + +The server responds with the original DMA entry in the request. + + +``VFIO_USER_DEVICE_GET_INFO`` +----------------------------- + +This command message is sent by the client to the server to query for basic +information about the device. 
+ +Request +^^^^^^^ + ++-------------+--------+--------------------------+ +| Name | Offset | Size | ++=============+========+==========================+ +| argsz | 0 | 4 | ++-------------+--------+--------------------------+ +| flags | 4 | 4 | ++-------------+--------+--------------------------+ +| | +-----+-------------------------+ | +| | | Bit | Definition | | +| | +=====+=========================+ | +| | | 0 | VFIO_DEVICE_FLAGS_RESET | | +| | +-----+-------------------------+ | +| | | 1 | VFIO_DEVICE_FLAGS_PCI | | +| | +-----+-------------------------+ | ++-------------+--------+--------------------------+ +| num_regions | 8 | 4 | ++-------------+--------+--------------------------+ +| num_irqs | 12 | 4 | ++-------------+--------+--------------------------+ + +* *argsz* is the maximum size of the reply payload +* all other fields must be zero. + +Reply +^^^^^ + ++-------------+--------+--------------------------+ +| Name | Offset | Size | ++=============+========+==========================+ +| argsz | 0 | 4 | ++-------------+--------+--------------------------+ +| flags | 4 | 4 | ++-------------+--------+--------------------------+ +| | +-----+-------------------------+ | +| | | Bit | Definition | | +| | +=====+=========================+ | +| | | 0 | VFIO_DEVICE_FLAGS_RESET | | +| | +-----+-------------------------+ | +| | | 1 | VFIO_DEVICE_FLAGS_PCI | | +| | +-----+-------------------------+ | ++-------------+--------+--------------------------+ +| num_regions | 8 | 4 | ++-------------+--------+--------------------------+ +| num_irqs | 12 | 4 | ++-------------+--------+--------------------------+ + +* *argsz* is the size required for the full reply payload (16 bytes today) +* *flags* contains the following device attributes. + + * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the + ``VFIO_USER_DEVICE_RESET`` message. + * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device. 
+ +* *num_regions* is the number of memory regions that the device exposes. +* *num_irqs* is the number of distinct interrupt types that the device supports. + +This version of the protocol only supports PCI devices. Additional devices may +be supported in future versions. + +``VFIO_USER_DEVICE_GET_REGION_INFO`` +------------------------------------ + +This command message is sent by the client to the server to query for +information about device regions. The VFIO region info structure is defined in +``<linux/vfio.h>`` (``struct vfio_region_info``). + +Request +^^^^^^^ + ++------------+--------+------------------------------+ +| Name | Offset | Size | ++============+========+==============================+ +| argsz | 0 | 4 | ++------------+--------+------------------------------+ +| flags | 4 | 4 | ++------------+--------+------------------------------+ +| index | 8 | 4 | ++------------+--------+------------------------------+ +| cap_offset | 12 | 4 | ++------------+--------+------------------------------+ +| size | 16 | 8 | ++------------+--------+------------------------------+ +| offset | 24 | 8 | ++------------+--------+------------------------------+ + +* *argsz* the maximum size of the reply payload +* *index* is the index of memory region being queried, it is the only field + that is required to be set in the command message. +* all other fields must be zero. 
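A request of this shape, together with the ``argsz`` retry rule from the Message sizes section, might be built as in the sketch below. It assumes host byte order and a 32-byte structure with no padding; ``REGION_INFO``, ``build_region_info_request`` and ``needed_argsz`` are hypothetical names used only for this example.

```python
import struct

# argsz (4), flags (4), index (4), cap_offset (4), size (8), offset (8).
REGION_INFO = struct.Struct("=IIIIQQ")


def build_region_info_request(index, argsz=REGION_INFO.size):
    """Only argsz and index are set; every other field must be zero."""
    return REGION_INFO.pack(argsz, 0, index, 0, 0, 0)


def needed_argsz(request_argsz, reply_payload):
    """Return the larger size to retry with, or None if the reply is complete."""
    (reply_argsz,) = struct.unpack_from("=I", reply_payload)
    return reply_argsz if reply_argsz > request_argsz else None


request = build_region_info_request(index=2)

# Simulated reply whose argsz (56) exceeds the 32 bytes the client asked
# for: the capabilities did not fit, so the client should retry with 56.
truncated_reply = REGION_INFO.pack(56, 0x8, 2, 32, 4096, 0)
retry_size = needed_argsz(REGION_INFO.size, truncated_reply)
```

Starting with the fixed structure size and growing only when the reply asks for more keeps the common case to a single round trip.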
+ +Reply +^^^^^ + ++------------+--------+------------------------------+ +| Name | Offset | Size | ++============+========+==============================+ +| argsz | 0 | 4 | ++------------+--------+------------------------------+ +| flags | 4 | 4 | ++------------+--------+------------------------------+ +| | +-----+-----------------------------+ | +| | | Bit | Definition | | +| | +=====+=============================+ | +| | | 0 | VFIO_REGION_INFO_FLAG_READ | | +| | +-----+-----------------------------+ | +| | | 1 | VFIO_REGION_INFO_FLAG_WRITE | | +| | +-----+-----------------------------+ | +| | | 2 | VFIO_REGION_INFO_FLAG_MMAP | | +| | +-----+-----------------------------+ | +| | | 3 | VFIO_REGION_INFO_FLAG_CAPS | | +| | +-----+-----------------------------+ | ++------------+--------+------------------------------+ +| index | 8 | 4 | ++------------+--------+------------------------------+ +| cap_offset | 12 | 4 | ++------------+--------+------------------------------+ +| size | 16 | 8 | ++------------+--------+------------------------------+ +| offset | 24 | 8 | ++------------+--------+------------------------------+ + +* *argsz* is the size required for the full reply payload (region info structure + plus the size of any region capabilities) +* *flags* are attributes of the region: + + * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region. + * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region. + * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region. + When this flag is set, the reply will include a file descriptor in its + meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as + ``SCM_RIGHTS`` type ancillary data. + * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the + reply.
+ +* *index* is the index of the memory region being queried; it is the only field + that is required to be set in the command message. +* *cap_offset* describes where additional region capabilities can be found. + cap_offset is relative to the beginning of the VFIO region info structure. + The data structure it points to is a VFIO cap header defined in + ``<linux/vfio.h>``. +* *size* is the size of the region. +* *offset* is the offset that should be given to the mmap() system call for + regions with the MMAP attribute. It is also used as the base offset when + mapping a VFIO sparse mmap area, described below. + +VFIO region capabilities +"""""""""""""""""""""""" + +The VFIO region information can also include a capabilities list. This list is +similar to a PCI capability list: each entry has a common header that +identifies a capability and where the next capability in the list can be found. +The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct +vfio_info_cap_header``). + +VFIO cap header format +"""""""""""""""""""""" + ++---------+--------+------+ +| Name | Offset | Size | ++=========+========+======+ +| id | 0 | 2 | ++---------+--------+------+ +| version | 2 | 2 | ++---------+--------+------+ +| next | 4 | 4 | ++---------+--------+------+ + +* *id* is the capability identifier. +* *version* is a capability-specific version number. +* *next* specifies the offset of the next capability in the capability list. It + is relative to the beginning of the VFIO region info structure.
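A client could walk the capability list roughly as sketched below. The sketch assumes host byte order, assumes that a *next* value of zero ends the list (the convention used by ``<linux/vfio.h>``; this section does not state it), and uses the hypothetical names ``iter_caps``, ``REGION_INFO`` and ``CAP_HEADER``. Since clients should not trust servers, it also rejects offsets that fall outside the reply or that loop.

```python
import struct

# Region info: argsz, flags, index, cap_offset (4 each), size, offset (8 each).
REGION_INFO = struct.Struct("=IIIIQQ")
# Capability header: id (2), version (2), next (4).
CAP_HEADER = struct.Struct("=HHI")
FLAG_CAPS = 1 << 3  # VFIO_REGION_INFO_FLAG_CAPS is bit 3 of flags


def iter_caps(reply):
    """Yield (id, version, position) for each capability in a reply payload."""
    _argsz, flags, _index, cap_offset, _size, _offset = REGION_INFO.unpack_from(reply)
    if not flags & FLAG_CAPS:
        return
    seen = set()
    pos = cap_offset  # relative to the start of the region info structure
    while pos:  # assumption: next == 0 terminates the list
        if pos in seen or pos < REGION_INFO.size or pos + CAP_HEADER.size > len(reply):
            raise ValueError("malformed capability list")
        seen.add(pos)
        cap_id, version, nxt = CAP_HEADER.unpack_from(reply, pos)
        yield cap_id, version, pos
        pos = nxt


# Simulated reply with two chained capability headers at offsets 32 and 40.
reply = (
    REGION_INFO.pack(48, FLAG_CAPS, 0, 32, 4096, 0)
    + CAP_HEADER.pack(1, 1, 40)
    + CAP_HEADER.pack(2, 1, 0)
)
caps = list(iter_caps(reply))
```

The capability payload itself (for example the sparse mmap areas) starts immediately after each header and would be parsed according to the capability's *id*.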
+ +VFIO sparse mmap cap header +""""""""""""""""""""""""""" + ++------------------+----------------------------------+ +| Name | Value | ++==================+==================================+ +| id | VFIO_REGION_INFO_CAP_SPARSE_MMAP | ++------------------+----------------------------------+ +| version | 0x1 | ++------------------+----------------------------------+ +| next | <next> | ++------------------+----------------------------------+ +| sparse mmap info | VFIO region info sparse mmap | ++------------------+----------------------------------+ + +This capability is defined when only a subrange of the region supports +direct access by the client via mmap(). The VFIO sparse mmap area is defined in +``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct +vfio_region_info_cap_sparse_mmap``). + +VFIO region info cap sparse mmap +"""""""""""""""""""""""""""""""" + ++----------+--------+------+ +| Name | Offset | Size | ++==========+========+======+ +| nr_areas | 0 | 4 | ++----------+--------+------+ +| reserved | 4 | 4 | ++----------+--------+------+ +| offset | 8 | 8 | ++----------+--------+------+ +| size | 16 | 8 | ++----------+--------+------+ +| ... | | | ++----------+--------+------+ + +* *nr_areas* is the number of sparse mmap areas in the region. +* *offset* and size describe a single area that can be mapped by the client. + There will be *nr_areas* pairs of offset and size. The offset will be added to + the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` to form the + offset argument of the subsequent mmap() call. + +The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct +vfio_region_info_cap_sparse_mmap``). + + +``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` +-------------------------------------- + +Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by +``mmap()`` of a file descriptor provided by the server. 
+ +``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via +file descriptors. This is an optional feature intended for performance +improvements where an underlying sub-system (such as KVM) supports communication +across such file descriptors to the vfio-user server, without needing to +round-trip through the client. + +The server returns an array of sub-regions for the requested region. Each +sub-region describes a span (offset and size) of a region, along with the +requested file descriptor notification mechanism to use. Each sub-region in the +response message may choose to use a different method, as defined below. The +two mechanisms supported in this specification are ioeventfds and ioregionfds. + +The server in addition returns a file descriptor in the ancillary data; clients +are expected to configure each sub-region's file descriptor with the requested +notification method. For example, a client could configure KVM with the +requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``. + +Request +^^^^^^^ + ++-------------+--------+------+ +| Name | Offset | Size | ++=============+========+======+ +| argsz | 0 | 4 | ++-------------+--------+------+ +| flags | 4 | 4 | ++-------------+--------+------+ +| index | 8 | 4 | ++-------------+--------+------+ +| count | 12 | 4 | ++-------------+--------+------+ + +* *argsz* the maximum size of the reply payload +* *index* is the index of memory region being queried +* all other fields must be zero + +The client must set ``flags`` to zero and specify the region being queried in +the ``index``. + +Reply +^^^^^ + ++-------------+--------+------+ +| Name | Offset | Size | ++=============+========+======+ +| argsz | 0 | 4 | ++-------------+--------+------+ +| flags | 4 | 4 | ++-------------+--------+------+ +| index | 8 | 4 | ++-------------+--------+------+ +| count | 12 | 4 | ++-------------+--------+------+ +| sub-regions | 16 | ... 
| ++-------------+--------+------+ + +* *argsz* is the size of the region IO FD info structure plus the + total size of the sub-region array. Thus, each array entry "i" is at offset + i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO + FD types, but this is not to be relied on. As elsewhere, this indicates the + full reply payload size needed. +* *flags* must be zero +* *index* is the index of memory region being queried +* *count* is the number of sub-regions in the array +* *sub-regions* is the array of Sub-Region IO FD info structures + +The reply message will additionally include at least one file descriptor in the +ancillary data. Note that more than one sub-region may share the same file +descriptor. + +Note that it is the client's responsibility to verify the requested values (for +example, that the requested offset does not exceed the region's bounds). + +Each sub-region given in the response has one of two possible structures, +depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or +``VFIO_USER_IO_FD_TYPE_IOREGIONFD``: + +Sub-Region IO FD info format (ioeventfd) +"""""""""""""""""""""""""""""""""""""""" + ++-----------+--------+------+ +| Name | Offset | Size | ++===========+========+======+ +| offset | 0 | 8 | ++-----------+--------+------+ +| size | 8 | 8 | ++-----------+--------+------+ +| fd_index | 16 | 4 | ++-----------+--------+------+ +| type | 20 | 4 | ++-----------+--------+------+ +| flags | 24 | 4 | ++-----------+--------+------+ +| padding | 28 | 4 | ++-----------+--------+------+ +| datamatch | 32 | 8 | ++-----------+--------+------+ + +* *offset* is the offset of the start of the sub-region within the region + requested ("physical address offset" for the region) +* *size* is the length of the sub-region. This may be zero if the access size is + not relevant, which may allow for optimizations +* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd + notification; it may be shared. 
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` +* *flags* is any of: + + * ``KVM_IOEVENTFD_FLAG_DATAMATCH`` + * ``KVM_IOEVENTFD_FLAG_PIO`` + * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?) + +* *datamatch* is the datamatch value if needed + +See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59 +KVM_IOEVENTFD* for further context on the ioeventfd-specific fields. + +Sub-Region IO FD info format (ioregionfd) +""""""""""""""""""""""""""""""""""""""""" + ++-----------+--------+------+ +| Name | Offset | Size | ++===========+========+======+ +| offset | 0 | 8 | ++-----------+--------+------+ +| size | 8 | 8 | ++-----------+--------+------+ +| fd_index | 16 | 4 | ++-----------+--------+------+ +| type | 20 | 4 | ++-----------+--------+------+ +| flags | 24 | 4 | ++-----------+--------+------+ +| padding | 28 | 4 | ++-----------+--------+------+ +| user_data | 32 | 8 | ++-----------+--------+------+ + +* *offset* is the offset of the start of the sub-region within the region + requested ("physical address offset" for the region) +* *size* is the length of the sub-region. This may be zero if the access size is + not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES`` + must be set in *flags* in this case +* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd + messages; it may be shared +* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD`` +* *flags* is any of: + + * ``KVM_IOREGION_PIO`` + * ``KVM_IOREGION_POSTED_WRITES`` + +* *user_data* is an opaque value passed back to the server via a message on the + file descriptor + +For further information on the ioregionfd-specific fields, see: +https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/ + +(FIXME: update with final API docs.) + +``VFIO_USER_DEVICE_GET_IRQ_INFO`` +--------------------------------- + +This command message is sent by the client to the server to query for +information about device interrupt types. 
The VFIO IRQ info structure is +defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``). + +Request +^^^^^^^ + ++-------+--------+---------------------------+ +| Name | Offset | Size | ++=======+========+===========================+ +| argsz | 0 | 4 | ++-------+--------+---------------------------+ +| flags | 4 | 4 | ++-------+--------+---------------------------+ +| | +-----+--------------------------+ | +| | | Bit | Definition | | +| | +=====+==========================+ | +| | | 0 | VFIO_IRQ_INFO_EVENTFD | | +| | +-----+--------------------------+ | +| | | 1 | VFIO_IRQ_INFO_MASKABLE | | +| | +-----+--------------------------+ | +| | | 2 | VFIO_IRQ_INFO_AUTOMASKED | | +| | +-----+--------------------------+ | +| | | 3 | VFIO_IRQ_INFO_NORESIZE | | +| | +-----+--------------------------+ | ++-------+--------+---------------------------+ +| index | 8 | 4 | ++-------+--------+---------------------------+ +| count | 12 | 4 | ++-------+--------+---------------------------+ + +* *argsz* is the maximum size of the reply payload (16 bytes today) +* index is the index of IRQ type being queried (e.g. 
``VFIO_PCI_MSIX_IRQ_INDEX``) +* all other fields must be zero + +Reply +^^^^^ + ++-------+--------+---------------------------+ +| Name | Offset | Size | ++=======+========+===========================+ +| argsz | 0 | 4 | ++-------+--------+---------------------------+ +| flags | 4 | 4 | ++-------+--------+---------------------------+ +| | +-----+--------------------------+ | +| | | Bit | Definition | | +| | +=====+==========================+ | +| | | 0 | VFIO_IRQ_INFO_EVENTFD | | +| | +-----+--------------------------+ | +| | | 1 | VFIO_IRQ_INFO_MASKABLE | | +| | +-----+--------------------------+ | +| | | 2 | VFIO_IRQ_INFO_AUTOMASKED | | +| | +-----+--------------------------+ | +| | | 3 | VFIO_IRQ_INFO_NORESIZE | | +| | +-----+--------------------------+ | ++-------+--------+---------------------------+ +| index | 8 | 4 | ++-------+--------+---------------------------+ +| count | 12 | 4 | ++-------+--------+---------------------------+ + +* *argsz* is the size required for the full reply payload (16 bytes today) +* *flags* defines IRQ attributes: + + * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd + signalling. + * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK`` + and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message. + * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being + triggered, and the client must send an ``UNMASK`` action to receive new + interrupts. + * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS`` operations set up + interrupts as a set, and new sub-indexes cannot be enabled without disabling + the entire type. + +* *index* is the index of the IRQ type being queried. +* *count* describes the number of interrupts of the queried type. + +``VFIO_USER_DEVICE_SET_IRQS`` +----------------------------- + +This command message is sent by the client to the server to set actions for +device interrupt types.
The VFIO IRQ set structure is defined in +``<linux/vfio.h>`` (``struct vfio_irq_set``). + +Request +^^^^^^^ + ++-------+--------+------------------------------+ +| Name | Offset | Size | ++=======+========+==============================+ +| argsz | 0 | 4 | ++-------+--------+------------------------------+ +| flags | 4 | 4 | ++-------+--------+------------------------------+ +| | +-----+-----------------------------+ | +| | | Bit | Definition | | +| | +=====+=============================+ | +| | | 0 | VFIO_IRQ_SET_DATA_NONE | | +| | +-----+-----------------------------+ | +| | | 1 | VFIO_IRQ_SET_DATA_BOOL | | +| | +-----+-----------------------------+ | +| | | 2 | VFIO_IRQ_SET_DATA_EVENTFD | | +| | +-----+-----------------------------+ | +| | | 3 | VFIO_IRQ_SET_ACTION_MASK | | +| | +-----+-----------------------------+ | +| | | 4 | VFIO_IRQ_SET_ACTION_UNMASK | | +| | +-----+-----------------------------+ | +| | | 5 | VFIO_IRQ_SET_ACTION_TRIGGER | | +| | +-----+-----------------------------+ | ++-------+--------+------------------------------+ +| index | 8 | 4 | ++-------+--------+------------------------------+ +| start | 12 | 4 | ++-------+--------+------------------------------+ +| count | 16 | 4 | ++-------+--------+------------------------------+ +| data | 20 | variable | ++-------+--------+------------------------------+ + +* *argsz* is the size of the VFIO IRQ set request payload, including any *data* + field. Note there is no reply payload, so this field differs from other + message types. +* *flags* defines the action performed on the interrupt range. The ``DATA`` + flags describe the data field sent in the message; the ``ACTION`` flags + describe the action to be performed. The flags are mutually exclusive for + both sets. + + * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command. + The action is performed unconditionally. + * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean + bytes. 
The action is performed if the corresponding boolean is true. + * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors + was sent in the message meta-data. These descriptors will be signalled when + the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the + descriptors are sent as ``SCM_RIGHTS`` type ancillary data. + If no file descriptors are provided, this de-assigns the specified + previously configured interrupts. + * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with + ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an interrupt, + or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks + the interrupt. + * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used + with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an + interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the + guest unmasks the interrupt. + * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used + with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an + interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the + server triggers the interrupt. + +* *index* is the index of IRQ type being setup. +* *start* is the start of the sub-index being set. +* *count* describes the number of sub-indexes being set. As a special case, a + count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables + all interrupts of the index. +* *data* is an optional field included when the + ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans + that specify whether the action is to be performed on the corresponding + index. It's used when the action is only performed on a subset of the range + specified. + +Not all interrupt types support every combination of data and action flags. 
+The client must know the capabilities of the device and IRQ index before it +sends a ``VFIO_USER_DEVICE_SET_IRQS`` message. + +In typical operation, a specific IRQ may operate as follows: + +1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along + with an eventfd. This associates the IRQ with a particular eventfd on the + server side. + +#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along + with another eventfd. This associates the given eventfd with the + mask/unmask state on the server side. + +#. The server may trigger the IRQ by writing 1 to the eventfd. + +#. The server may mask/unmask an IRQ, which writes 1 to the corresponding + mask/unmask eventfd, if there is one. + +#. A client may trigger a device IRQ itself, by sending a + ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``. + +#. A client may mask or unmask the IRQ, by sending a + ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``. + +Reply +^^^^^ + +There is no payload in the reply. + +.. _Read and Write Operations: + +Note that all of these operations must be supported by the client and/or server, +even if the corresponding memory or device region has been shared as mappable. + +The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the +peer, for both reads and writes. + +``VFIO_USER_REGION_READ`` +------------------------- + +If a device region is not mappable, it's not directly accessible by the client +via ``mmap()`` of the underlying file descriptor. In this case, a client can +read from a device region with this message.
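A minimal client-side sketch of this message's payloads, packed with Python's ``struct`` module. The field order and sizes follow the Request and Reply tables of this section (*offset*: 8 bytes, *region*: 4, *count*: 4, then *data* in the reply). The little-endian byte order, the helper names, and the sample ``max_data_xfer_size`` value are assumptions of this example, and the vfio-user message header is left out.

```python
import struct

# offset (8 bytes), region (4 bytes), count (4 bytes); "<" selects an
# assumed little-endian byte order with no padding.
REGION_ACCESS = struct.Struct("<QII")


def pack_region_read(offset, region, count, max_data_xfer_size):
    """Build a REGION_READ request payload (hypothetical helper)."""
    if count > max_data_xfer_size:
        raise ValueError("count exceeds the peer's max_data_xfer_size")
    return REGION_ACCESS.pack(offset, region, count)


def parse_region_read_reply(payload):
    """Split a reply payload into (offset, region, data)."""
    offset, region, count = REGION_ACCESS.unpack_from(payload)
    data = payload[REGION_ACCESS.size:REGION_ACCESS.size + count]
    if len(data) != count:
        raise ValueError("reply is shorter than its count field claims")
    return offset, region, data


# The transfer limit here is an arbitrary value chosen for the example.
request = pack_region_read(0x1000, region=2, count=4, max_data_xfer_size=1 << 20)
```

The same ``REGION_ACCESS`` layout would also serve as the fixed part of a ``VFIO_USER_REGION_WRITE`` request, with the data appended after it.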
+ +Request +^^^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ + +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred. + +Reply +^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | variable | ++--------+--------+----------+ + +* *offset* into the region accessed. +* *region* is the index of the region accessed. +* *count* is the size of the data transferred. +* *data* is the data that was read from the device region. + +``VFIO_USER_REGION_WRITE`` +-------------------------- + +If a device region is not mappable, it's not directly accessible by the client +via mmap() of the underlying fd. In this case, a client can write to a device +region with this message. + +Request +^^^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | variable | ++--------+--------+----------+ + +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred. 
+* *data* is the data to write + +Reply +^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ + +* *offset* into the region accessed. +* *region* is the index of the region accessed. +* *count* is the size of the data transferred. + +``VFIO_USER_DMA_READ`` +----------------------- + +If the client has not shared mappable memory, the server can use this message to +read from guest memory. + +Request +^^^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 8 | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. This address must have + been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message. +* *count* is the size of the data to be transferred. + +Reply +^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 8 | ++---------+--------+----------+ +| data | 16 | variable | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. +* *count* is the size of the data transferred. +* *data* is the data read. + +``VFIO_USER_DMA_WRITE`` +----------------------- + +If the client has not shared mappable memory, the server can use this message to +write to guest memory. + +Request +^^^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 8 | ++---------+--------+----------+ +| data | 16 | variable | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. 
This address must have + been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message. +* *count* is the size of the data to be transferred. +* *data* is the data to write. + +Reply +^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 4 | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. +* *count* is the size of the data transferred. + +``VFIO_USER_DEVICE_RESET`` +-------------------------- + +This command message is sent from the client to the server to reset the device. +Neither the request nor the reply has a payload. + +``VFIO_USER_REGION_WRITE_MULTI`` +-------------------------------- + +This message can be used to coalesce multiple device write operations +into a single message. It is only used as an optimization when the +outgoing message queue is relatively full. + +Request +^^^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| wr_cnt | 0 | 8 | ++---------+--------+----------+ +| wrs | 8 | variable | ++---------+--------+----------+ + +* *wr_cnt* is the number of device writes coalesced in the message. +* *wrs* is an array of device writes defined below. + +Single Device Write Format +"""""""""""""""""""""""""" + ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | 8 | ++--------+--------+----------+ + +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred. This format can + only describe writes of 8 bytes or fewer. +* *data* is the data to write.
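As a rough illustration of the layout above, the following Python sketch packs a ``VFIO_USER_REGION_WRITE_MULTI`` request payload: an 8-byte *wr_cnt* followed by 24-byte write entries. The helper name ``pack_write_multi`` and the little-endian byte order are assumptions made for this example; the text only fixes the field offsets and sizes, and the vfio-user message header is omitted.

```python
import struct

# wr_cnt (8 bytes), then entries of offset (8), region (4), count (4),
# data (8). Little-endian packing is an assumption of this sketch.
WR_CNT = struct.Struct("<Q")
WR_ENTRY = struct.Struct("<QII8s")


def pack_write_multi(writes):
    """Pack (offset, region, data) tuples into one request payload.

    Hypothetical helper, not part of QEMU or libvfio-user.
    """
    out = bytearray(WR_CNT.pack(len(writes)))
    for offset, region, data in writes:
        if len(data) > 8:
            raise ValueError("a coalesced write carries at most 8 bytes")
        # The "8s" field pads shorter data with zero bytes on the right.
        out += WR_ENTRY.pack(offset, region, len(data), data)
    return bytes(out)


payload = pack_write_multi([
    (0x10, 0, b"\x01\x02\x03\x04"),
    (0x18, 0, b"\xff"),
])
```

Because every entry has a fixed size, a receiver can locate entry ``i`` at byte offset ``8 + 24 * i`` without parsing the preceding entries.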
+ +Reply +^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| wr_cnt | 0 | 8 | ++---------+--------+----------+ + +* *wr_cnt* is the number of device writes completed. + + +Appendices +========== + +Unused VFIO ``ioctl()`` commands +-------------------------------- + +The following VFIO commands do not have an equivalent vfio-user command: + +* ``VFIO_GET_API_VERSION`` +* ``VFIO_CHECK_EXTENSION`` +* ``VFIO_SET_IOMMU`` +* ``VFIO_GROUP_GET_STATUS`` +* ``VFIO_GROUP_SET_CONTAINER`` +* ``VFIO_GROUP_UNSET_CONTAINER`` +* ``VFIO_GROUP_GET_DEVICE_FD`` +* ``VFIO_IOMMU_GET_INFO`` + +However, once support for live migration for VFIO devices is finalized some +of the above commands may have to be handled by the client in their +corresponding vfio-user form. This will be addressed in a future protocol +version. + +VFIO groups and containers +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The current VFIO implementation includes group and container idioms that +describe how a device relates to the host IOMMU. In the vfio-user +implementation, the IOMMU is implemented in SW by the client, and is not +visible to the server. The simplest idea would be that the client put each +device into its own group and container. + +Backend Program Conventions +--------------------------- + +vfio-user backend program conventions are based on the vhost-user ones. + +* The backend program must not daemonize itself. +* No assumptions must be made as to what access the backend program has on the + system. +* File descriptors 0, 1 and 2 must exist, must have regular + stdin/stdout/stderr semantics, and can be redirected. +* The backend program must honor the SIGTERM signal. 
+* The backend program must accept the following command-line options: + + * ``--socket-path=PATH``: path to UNIX domain socket, + * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with + ``--socket-path`` +* The backend program must be accompanied by a JSON file stored under + ``/usr/share/vfio-user``. + +TODO add schema similar to docs/interop/vhost-user.json. diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt index 5f763aa..204a574 100644 --- a/docs/qcow2-cache.txt +++ b/docs/qcow2-cache.txt @@ -15,7 +15,7 @@ not a straightforward operation. This document attempts to give an overview of the L2 and refcount caches, and how to configure them. -Please refer to the docs/interop/qcow2.txt file for an in-depth +Please refer to the docs/interop/qcow2.rst file for an in-depth technical description of the qcow2 file format. diff --git a/docs/sphinx/qapi_domain.py b/docs/sphinx/qapi_domain.py index c94af57..ebc46a7 100644 --- a/docs/sphinx/qapi_domain.py +++ b/docs/sphinx/qapi_domain.py @@ -20,16 +20,6 @@ from typing import ( from docutils import nodes from docutils.parsers.rst import directives - -from compat import ( - CompatField, - CompatGroupedField, - CompatTypedField, - KeywordNode, - ParserFix, - Signature, - SpaceNode, -) from sphinx import addnodes from sphinx.directives import ObjectDescription from sphinx.domains import ( @@ -44,6 +34,16 @@ from sphinx.util import logging from sphinx.util.docutils import SphinxDirective from sphinx.util.nodes import make_id, make_refnode +from compat import ( + CompatField, + CompatGroupedField, + CompatTypedField, + KeywordNode, + ParserFix, + Signature, + SpaceNode, +) + if TYPE_CHECKING: from typing import ( @@ -56,7 +56,6 @@ if TYPE_CHECKING: ) from docutils.nodes import Element, Node - from sphinx.addnodes import desc_signature, pending_xref from sphinx.application import Sphinx from sphinx.builders import Builder @@ -168,6 +167,8 @@ class QAPIDescription(ParserFix): """ def
handle_signature(self, sig: str, signode: desc_signature) -> Signature: + # pylint: disable=unused-argument + # Do nothing. The return value here is the "name" of the entity # being documented; for QAPI, this is the same as the # "signature", which is just a name. @@ -210,6 +211,8 @@ class QAPIDescription(ParserFix): def add_target_and_index( self, name: Signature, sig: str, signode: desc_signature ) -> None: + # pylint: disable=unused-argument + # name is the return value of handle_signature. # sig is the original, raw text argument to handle_signature. # For QAPI, these are identical, currently. diff --git a/docs/sphinx/qapidoc.py b/docs/sphinx/qapidoc.py index 661b2c4..8011ac9 100644 --- a/docs/sphinx/qapidoc.py +++ b/docs/sphinx/qapidoc.py @@ -27,6 +27,7 @@ https://www.sphinx-doc.org/en/master/development/index.html from __future__ import annotations + __version__ = "2.0" from contextlib import contextmanager @@ -56,8 +57,6 @@ from qapi.schema import ( QAPISchemaVisitor, ) from qapi.source import QAPISourceInfo - -from qapidoc_legacy import QAPISchemaGenRSTVisitor # type: ignore from sphinx import addnodes from sphinx.directives.code import CodeBlock from sphinx.errors import ExtensionError @@ -65,6 +64,8 @@ from sphinx.util import logging from sphinx.util.docutils import SphinxDirective, switch_source_input from sphinx.util.nodes import nested_parse_with_titles +from qapidoc_legacy import QAPISchemaGenRSTVisitor # type: ignore + if TYPE_CHECKING: from typing import ( diff --git a/docs/system/arm/aspeed.rst b/docs/system/arm/aspeed.rst index 97fd6a0..43d27d8 100644 --- a/docs/system/arm/aspeed.rst +++ b/docs/system/arm/aspeed.rst @@ -1,12 +1,11 @@ Aspeed family boards (``ast2500-evb``, ``ast2600-evb``, ``ast2700-evb``, ``bletchley-bmc``, ``fuji-bmc``, ``fby35-bmc``, ``fp5280g2-bmc``, ``g220a-bmc``, ``palmetto-bmc``, ``qcom-dc-scm-v1-bmc``, ``qcom-firework-bmc``, ``quanta-q71l-bmc``, ``rainier-bmc``, ``romulus-bmc``, ``sonorapass-bmc``, ``supermicrox11-bmc``, 
``supermicrox11spi-bmc``, ``tiogapass-bmc``, ``witherspoon-bmc``, ``yosemitev2-bmc``) -================================================================================================================================================================================================================================================================================================================================================================================================================== +================================================================================================================================================================================================================================================================================================================================================================================================================================= The QEMU Aspeed machines model BMCs of various OpenPOWER systems and Aspeed evaluation boards. They are based on different releases of the Aspeed SoC : the AST2400 integrating an ARM926EJ-S CPU (400MHz), the AST2500 with an ARM1176JZS CPU (800MHz), the AST2600 -with dual cores ARM Cortex-A7 CPUs (1.2GHz) and more recently the AST2700 -with quad cores ARM Cortex-A35 64 bits CPUs (1.6GHz) +with dual cores ARM Cortex-A7 CPUs (1.2GHz). The SoC comes with RAM, Gigabit ethernet, USB, SD/MMC, USB, SPI, I2C, etc. @@ -39,10 +38,6 @@ AST2600 SoC based machines : - ``qcom-dc-scm-v1-bmc`` Qualcomm DC-SCM V1 BMC - ``qcom-firework-bmc`` Qualcomm Firework BMC -AST2700 SoC based machines : - -- ``ast2700-evb`` Aspeed AST2700 Evaluation board (Cortex-A35) - Supported devices ----------------- @@ -247,10 +242,78 @@ under Linux), use : -M ast2500-evb,bmc-console=uart3 +Aspeed 2700 family boards (``ast2700-evb``) +================================================================== + +The QEMU Aspeed machines model BMCs of Aspeed evaluation boards. 
+They are based on different releases of the Aspeed SoC : +the AST2700 with quad cores ARM Cortex-A35 64 bits CPUs (1.6GHz). + +The SoC comes with RAM, Gigabit ethernet, USB, SD/MMC, USB, SPI, I2C, +etc. + +AST2700 SoC based machines : + +- ``ast2700-evb`` Aspeed AST2700 Evaluation board (Cortex-A35) +- ``ast2700fc`` Aspeed AST2700 Evaluation board (Cortex-A35 + Cortex-M4) + +Supported devices +----------------- + * Interrupt Controller + * Timer Controller + * RTC Controller + * I2C Controller + * System Control Unit (SCU) + * SRAM mapping + * X-DMA Controller (basic interface) + * Static Memory Controller (SMC or FMC) - Only SPI Flash support + * SPI Memory Controller + * USB 2.0 Controller + * SD/MMC storage controllers + * SDRAM controller (dummy interface for basic settings and training) + * Watchdog Controller + * GPIO Controller (Master only) + * UART + * Ethernet controllers + * Front LEDs (PCA9552 on I2C bus) + * LPC Peripheral Controller (a subset of subdevices are supported) + * Hash/Crypto Engine (HACE) - Hash support only. TODO: Crypto + * ADC + * eMMC Boot Controller (dummy) + * PECI Controller (minimal) + * I3C Controller + * Internal Bridge Controller (SLI dummy) + +Missing devices +--------------- + * PWM and Fan Controller + * Slave GPIO Controller + * Super I/O Controller + * PCI-Express 1 Controller + * Graphic Display Controller + * MCTP Controller + * Mailbox Controller + * Virtual UART + * eSPI Controller + +Boot options +------------ + +Images can be downloaded from the ASPEED Forked OpenBMC GitHub release repository : + + https://github.com/AspeedTech-BMC/openbmc/releases + Booting the ast2700-evb machine ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Boot the AST2700 machine from the flash image, use an MTD drive : +Boot the AST2700 machine from the flash image. 
+ +There are two supported methods for booting the AST2700 machine with a flash image: + +Manual boot using ``-device loader``: + +This method causes all four CPU cores to start execution from address ``0x430000000``, which +corresponds to the BL31 image load address. .. code-block:: bash @@ -270,6 +333,89 @@ Boot the AST2700 machine from the flash image, use an MTD drive : -drive file=${IMGDIR}/image-bmc,format=raw,if=mtd \ -nographic +Boot using a virtual boot ROM (``-bios``): + +If users do not specify the ``-bios`` option, QEMU will attempt to load the +default vbootrom image ``ast27x0_bootrom.bin`` from either the current working +directory or the ``pc-bios`` directory within the QEMU source tree. + +.. code-block:: bash + + $ qemu-system-aarch64 -M ast2700-evb \ + -drive file=image-bmc,format=raw,if=mtd \ + -nographic + +The ``-bios`` option allows users to specify a custom path for the vbootrom +image to be loaded during boot. The following example loads the vbootrom image +from the ``${HOME}`` directory. + +.. code-block:: bash + + -bios ${HOME}/ast27x0_bootrom.bin + +Booting the ast2700fc machine +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +AST2700 features four Cortex-A35 primary processors and two Cortex-M4 coprocessors. +The **ast2700-evb** machine focuses on emulating the four Cortex-A35 primary processors; +the **ast2700fc** machine extends **ast2700-evb** by adding support for the two Cortex-M4 coprocessors. + +Steps to boot the AST2700fc machine: + +1. Ensure you have the following AST2700A1 binaries available in a directory + + * u-boot-nodtb.bin + * u-boot.dtb + * bl31.bin + * optee/tee-raw.bin + * image-bmc + * zephyr-aspeed-ssp.elf (for SSP firmware, CPU 5) + * zephyr-aspeed-tsp.elf (for TSP firmware, CPU 6) + +2. Execute the following command to start the ``ast2700fc`` machine: + +..
code-block:: bash + + IMGDIR=ast2700-default + UBOOT_SIZE=$(stat --format=%s -L ${IMGDIR}/u-boot-nodtb.bin) + + $ qemu-system-aarch64 -M ast2700fc \ + -device loader,force-raw=on,addr=0x400000000,file=${IMGDIR}/u-boot-nodtb.bin \ + -device loader,force-raw=on,addr=$((0x400000000 + ${UBOOT_SIZE})),file=${IMGDIR}/u-boot.dtb \ + -device loader,force-raw=on,addr=0x430000000,file=${IMGDIR}/bl31.bin \ + -device loader,force-raw=on,addr=0x430080000,file=${IMGDIR}/optee/tee-raw.bin \ + -device loader,cpu-num=0,addr=0x430000000 \ + -device loader,cpu-num=1,addr=0x430000000 \ + -device loader,cpu-num=2,addr=0x430000000 \ + -device loader,cpu-num=3,addr=0x430000000 \ + -drive file=${IMGDIR}/image-bmc,if=mtd,format=raw \ + -device loader,file=${IMGDIR}/zephyr-aspeed-ssp.elf,cpu-num=4 \ + -device loader,file=${IMGDIR}/zephyr-aspeed-tsp.elf,cpu-num=5 \ + -serial pty -serial pty -serial pty \ + -snapshot \ + -S -nographic + +After launching QEMU, serial devices will be automatically redirected. +Example output: + +.. code-block:: bash + + char device redirected to /dev/pts/55 (label serial0) + char device redirected to /dev/pts/56 (label serial1) + char device redirected to /dev/pts/57 (label serial2) + +- serial0: Console for the four Cortex-A35 primary processors. +- serial1 and serial2: Consoles for the two Cortex-M4 coprocessors. + +Use ``tio`` or another terminal emulator to connect to the consoles: + +.. 
code-block:: bash + + $ tio /dev/pts/55 + $ tio /dev/pts/56 + $ tio /dev/pts/57 + + Aspeed minibmc family boards (``ast1030-evb``) ================================================================== diff --git a/docs/system/confidential-guest-support.rst b/docs/system/confidential-guest-support.rst index 0c490db..66129fb 100644 --- a/docs/system/confidential-guest-support.rst +++ b/docs/system/confidential-guest-support.rst @@ -38,6 +38,7 @@ Supported mechanisms Currently supported confidential guest mechanisms are: * AMD Secure Encrypted Virtualization (SEV) (see :doc:`i386/amd-memory-encryption`) +* Intel Trust Domain Extension (TDX) (see :doc:`i386/tdx`) * POWER Protected Execution Facility (PEF) (see :ref:`power-papr-protected-execution-facility-pef`) * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`) diff --git a/docs/system/device-emulation.rst b/docs/system/device-emulation.rst index a1b0d79..9113816 100644 --- a/docs/system/device-emulation.rst +++ b/docs/system/device-emulation.rst @@ -85,6 +85,7 @@ Emulated Devices devices/can.rst devices/ccid.rst devices/cxl.rst + devices/vfio-user.rst devices/ivshmem.rst devices/ivshmem-flat.rst devices/keyboard.rst diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst index 882b036..e307caf 100644 --- a/docs/system/devices/cxl.rst +++ b/docs/system/devices/cxl.rst @@ -308,7 +308,7 @@ A very simple setup with just one directly attached CXL Type 3 Persistent Memory -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ - -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \ + -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0,sn=0x1 \ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G A very simple setup with just one directly attached CXL Type 3 Volatile Memory device:: @@ -349,13 
+349,13 @@ the CXL Type3 device directly attached (no switches).:: -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ - -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \ + -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0,sn=0x1 \ -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \ - -device cxl-type3,bus=root_port14,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \ + -device cxl-type3,bus=root_port14,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1,sn=0x2 \ -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \ - -device cxl-type3,bus=root_port15,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \ + -device cxl-type3,bus=root_port15,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2,sn=0x3 \ -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \ - -device cxl-type3,bus=root_port16,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \ + -device cxl-type3,bus=root_port16,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3,sn=0x4 \ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave:: @@ -375,13 +375,13 @@ An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave:: -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \ -device cxl-upstream,bus=root_port0,id=us0 \ -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \ - -device cxl-type3,bus=swport0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ + -device cxl-type3,bus=swport0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,sn=0x1 \ -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \ - -device cxl-type3,bus=swport1,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1 \ + -device 
cxl-type3,bus=swport1,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,sn=0x2 \ -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \ - -device cxl-type3,bus=swport2,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2 \ + -device cxl-type3,bus=swport2,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,sn=0x3 \ -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \ - -device cxl-type3,bus=swport3,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3 \ + -device cxl-type3,bus=swport3,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,sn=0x4 \ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k Deprecations diff --git a/docs/system/devices/vfio-user.rst b/docs/system/devices/vfio-user.rst new file mode 100644 index 0000000..b6dcaa5 --- /dev/null +++ b/docs/system/devices/vfio-user.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +========= +vfio-user +========= + +QEMU includes a ``vfio-user`` client. The ``vfio-user`` specification allows for +implementing (PCI) devices in userspace outside of QEMU; it is similar to +``vhost-user`` in this respect (see :doc:`vhost-user`), but can emulate arbitrary +PCI devices, not just ``virtio``. Whereas ``vfio`` is handled by the host +kernel, ``vfio-user``, while similar in implementation, is handled entirely in +userspace. + +For example, SPDK includes a virtual PCI NVMe controller implementation; by +setting up a ``vfio-user`` UNIX socket between QEMU and SPDK, a VM can send NVMe +I/O to the SPDK process. + +Presuming a suitable ``vfio-user`` server has opened a socket at +``/tmp/vfio-user.sock``, a device can be configured with for example: + +.. code-block:: console + +-device '{"driver": "vfio-user-pci","socket": {"path": "/tmp/vfio-user.sock", "type": "unix"}}' + +See `libvfio-user <https://github.com/nutanix/libvfio-user/>`_ for further +information. 
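When QEMU is launched from a script, the JSON form of the ``-device`` argument shown above can be generated rather than hand-quoted. The sketch below does this in Python; the helper name ``vfio_user_device_arg`` is hypothetical, and only the ``driver``, ``socket.path`` and ``socket.type`` keys from the example above are used.

```python
import json
import shlex


def vfio_user_device_arg(socket_path):
    """Return the argv fragment for a vfio-user-pci device (hypothetical helper)."""
    device = {
        "driver": "vfio-user-pci",
        "socket": {"path": socket_path, "type": "unix"},
    }
    # json.dumps produces the JSON object that follows -device.
    return ["-device", json.dumps(device)]


args = vfio_user_device_arg("/tmp/vfio-user.sock")
# shlex.join adds the shell quoting needed to paste the result into a terminal.
print(shlex.join(args))
```

Passing ``args`` as list elements to ``subprocess.run`` avoids shell quoting altogether; ``shlex.join`` is only needed when the command is shown to a user or written into a shell script.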
diff --git a/docs/system/gdb.rst b/docs/system/gdb.rst index 4228cb5..d50470b 100644 --- a/docs/system/gdb.rst +++ b/docs/system/gdb.rst @@ -20,7 +20,7 @@ connection, use the ``-gdb dev`` option instead of ``-s``. See .. parsed-literal:: - |qemu_system| -s -S -kernel bzImage -hda rootdisk.img -append "root=/dev/hda" + |qemu_system| -s -S -kernel bzImage -drive file=rootdisk.img,format=raw -append "root=/dev/sda" QEMU will launch but will silently wait for gdb to connect. diff --git a/docs/system/i386/tdx.rst b/docs/system/i386/tdx.rst new file mode 100644 index 0000000..8131750 --- /dev/null +++ b/docs/system/i386/tdx.rst @@ -0,0 +1,161 @@ +Intel Trusted Domain eXtension (TDX) +==================================== + +Intel Trusted Domain eXtensions (TDX) refers to an Intel technology that extends +Virtual Machine Extensions (VMX) and Multi-Key Total Memory Encryption (MKTME) +with a new kind of virtual machine guest called a Trust Domain (TD). A TD runs +in a CPU mode that is designed to protect the confidentiality of its memory +contents and its CPU state from any other software, including the hosting +Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself. + +Prerequisites +------------- + +To run TD, the physical machine needs to have TDX module loaded and initialized +while KVM hypervisor has TDX support and has TDX enabled. If those requirements +are met, the ``KVM_CAP_VM_TYPES`` will report the support of ``KVM_X86_TDX_VM``. + +Trust Domain Virtual Firmware (TDVF) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Trust Domain Virtual Firmware (TDVF) is required to provide TD services to boot +TD Guest OS. TDVF needs to be copied to guest private memory and measured before +the TD boots. + +KVM vcpu ioctl ``KVM_TDX_INIT_MEM_REGION`` can be used to populate the TDVF +content into its private memory. + +Since TDX doesn't support readonly memslot, TDVF cannot be mapped as pflash +device and it actually works as RAM. "-bios" option is chosen to load TDVF. 
+ +OVMF is the open-source firmware that implements the TDVF support. Thus the +command line to specify and load TDVF is ``-bios OVMF.fd``. + +Feature Configuration +--------------------- + +Unlike a non-TDX VM, the CPU features (enumerated by CPU or MSR) of a TD are not +under the full control of the VMM. The VMM can only configure a subset of a TD's features, via the +``KVM_TDX_INIT_VM`` command of the VM-scope ``MEMORY_ENCRYPT_OP`` ioctl. + +The configurable features have three types: + +- Attributes: + - PKS (bit 30) controls whether Supervisor Protection Keys is exposed to TD, + which determines related CPUID bit and CR4 bit; + - PERFMON (bit 63) controls whether PMU is exposed to TD. + +- XSAVE related features (XFAM): + XFAM is a 64-bit mask, which has the same format as XCR0 or IA32_XSS MSR. It + determines the set of extended features available for use by the guest TD. + +- CPUID features: + Only some bits of some CPUID leaves are directly configurable by VMM. + +What features can be configured is reported via TDX capabilities. + +TDX capabilities +~~~~~~~~~~~~~~~~ + +The VM-scope ``MEMORY_ENCRYPT_OP`` ioctl provides the command ``KVM_TDX_CAPABILITIES`` +to get the TDX capabilities from KVM. It returns a data structure of +``struct kvm_tdx_capabilities``, which describes the supported configuration of +attributes, XFAM and CPUIDs. + +TD attributes +~~~~~~~~~~~~~ + +QEMU supports configuring raw 64-bit TD attributes directly via the "attributes" +property of the "tdx-guest" object. Note that it is the user's responsibility to provide a +valid value, because some bits may not be supported by current QEMU or KVM yet. + +QEMU also supports the configuration of individual attribute bits that are +supported by it, via properties of the "tdx-guest" object. +E.g., "sept-ve-disable" (bit 28). + +MSR based features +~~~~~~~~~~~~~~~~~~ + +Current KVM doesn't support MSR based feature (e.g., MSR_IA32_ARCH_CAPABILITIES) +configuration for TDX; enabling it in QEMU is future work, pending +KVM support.
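Returning to the raw 64-bit TD attributes value described under "TD attributes": the sketch below composes such a value from the bit positions this document names (sept-ve-disable at bit 28, PKS at bit 30, PERFMON at bit 63). The constant and function names are this example's own, not identifiers from QEMU or KVM, and whether a given combination is accepted depends on what the TDX capabilities report.

```python
# Bit positions as stated in this document; the names are illustrative only.
TD_ATTR_SEPT_VE_DISABLE = 1 << 28
TD_ATTR_PKS = 1 << 30
TD_ATTR_PERFMON = 1 << 63


def td_attributes(sept_ve_disable=False, pks=False, perfmon=False):
    """Compose a raw 64-bit TD attributes value from individual bits."""
    value = 0
    if sept_ve_disable:
        value |= TD_ATTR_SEPT_VE_DISABLE
    if pks:
        value |= TD_ATTR_PKS
    if perfmon:
        value |= TD_ATTR_PERFMON
    return value


def set_bits(value):
    """List the positions of the bits set in a 64-bit value, lowest first."""
    return [bit for bit in range(64) if value & (1 << bit)]


raw = td_attributes(sept_ve_disable=True, pks=True)
print(hex(raw), set_bits(raw))
```

Decoding a value back into bit positions with ``set_bits`` is a quick way to check a hand-written raw value against the bits the host actually supports before passing it to the "attributes" property.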
+ +Feature check +~~~~~~~~~~~~~ + +QEMU checks whether the final (CPU) features, determined by the given CPU model and +explicit feature adjustment of "+featureA/-featureB", can be supported or not. +It can produce a feature-not-supported warning such as + + "warning: host doesn't support requested feature: CPUID.07H:EBX.intel-pt [bit 25]" + +It can also produce a warning such as + + "warning: TDX forcibly sets the feature: CPUID.80000007H:EDX.invtsc [bit 8]" + +if the fixed-1 feature is requested to be disabled explicitly. This is newly +added to QEMU for TDX because TDX has fixed-1 features that are forcibly enabled +by the TDX module and the VMM cannot disable them. + +Launching a TD (TDX VM) +----------------------- + +To launch a TD, the necessary command line options are the tdx-guest object and +split kernel-irqchip, as below: + +.. parsed-literal:: + + |qemu_system_x86| \\ + -accel kvm \\ + -cpu host \\ + -object tdx-guest,id=tdx0 \\ + -machine ...,confidential-guest-support=tdx0 \\ + -bios OVMF.fd + +Restrictions +------------ + + - kernel-irqchip must be split; + + This is set by default for TDX guest if kernel-irqchip is left on its default + 'auto' setting. + + - No readonly support for private memory; + + - No SMM support: SMM support requires manipulating the guest register states + which is not allowed; + +Debugging +--------- + +Bit 0 of the TD attributes is the DEBUG bit, which decides if the TD runs in off-TD +debug mode. When in off-TD debug mode, the TD's VCPU state and private memory are +accessible via given SEAMCALLs. This requires KVM to expose APIs to invoke those +SEAMCALLs and corresponding QEMU changes. + +It's targeted as future work. + +TD attestation +-------------- + +In a TD guest, the attestation process is used to verify the TDX guest's +trustworthiness to other entities before provisioning secrets to the guest. + +TD attestation is initiated first by calling TDG.MR.REPORT inside TD to get the +REPORT.
Then the REPORT data needs to be converted into a remotely verifiable +Quote by SGX Quoting Enclave (QE). + +It's a future work in QEMU to add support of TD attestation since it lacks +support in current KVM. + +Live Migration +-------------- + +Future work. + +References +---------- + +- `TDX Homepage <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html>`__ + +- `SGX QE <https://github.com/intel/SGXDataCenterAttestationPrimitives/tree/master/QuoteGeneration>`__ diff --git a/docs/system/index.rst b/docs/system/index.rst index c21065e..718e9d3 100644 --- a/docs/system/index.rst +++ b/docs/system/index.rst @@ -39,3 +39,4 @@ or Hypervisor.Framework. multi-process confidential-guest-support vm-templating + sriov diff --git a/docs/system/linuxboot.rst b/docs/system/linuxboot.rst index 5db2e56..2328b4a 100644 --- a/docs/system/linuxboot.rst +++ b/docs/system/linuxboot.rst @@ -11,7 +11,7 @@ The syntax is: .. parsed-literal:: - |qemu_system| -kernel bzImage -hda rootdisk.img -append "root=/dev/hda" + |qemu_system| -kernel bzImage -drive file=rootdisk.img,format=raw -append "root=/dev/sda" Use ``-kernel`` to provide the Linux kernel image and ``-append`` to give the kernel command line arguments. The ``-initrd`` option can be @@ -23,8 +23,8 @@ virtual serial port and the QEMU monitor to the console with the .. parsed-literal:: - |qemu_system| -kernel bzImage -hda rootdisk.img \ - -append "root=/dev/hda console=ttyS0" -nographic + |qemu_system| -kernel bzImage -drive file=rootdisk.img,format=raw \ + -append "root=/dev/sda console=ttyS0" -nographic Use Ctrl-a c to switch between the serial console and the monitor (see :ref:`GUI_keys`). 
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc index cfe1acb..384e95b 100644 --- a/docs/system/qemu-block-drivers.rst.inc +++ b/docs/system/qemu-block-drivers.rst.inc @@ -500,8 +500,6 @@ What you should *never* do: - expect it to work when loadvm'ing - write to the FAT directory on the host system while accessing it with the guest system -.. _nbd: - NBD access ~~~~~~~~~~ diff --git a/docs/system/riscv/microchip-icicle-kit.rst b/docs/system/riscv/microchip-icicle-kit.rst index 40798b1..9809e94 100644 --- a/docs/system/riscv/microchip-icicle-kit.rst +++ b/docs/system/riscv/microchip-icicle-kit.rst @@ -5,10 +5,10 @@ Microchip PolarFire SoC Icicle Kit integrates a PolarFire SoC, with one SiFive's E51 plus four U54 cores and many on-chip peripherals and an FPGA. For more details about Microchip PolarFire SoC, please see: -https://www.microsemi.com/product-directory/soc-fpgas/5498-polarfire-soc-fpga +https://www.microchip.com/en-us/products/fpgas-and-plds/system-on-chip-fpgas/polarfire-soc-fpgas The Icicle Kit board information can be found here: -https://www.microsemi.com/existing-parts/parts/152514 +https://www.microchip.com/en-us/development-tool/mpfs-icicle-kit-es Supported devices ----------------- @@ -26,95 +26,48 @@ The ``microchip-icicle-kit`` machine supports the following devices: * 2 GEM Ethernet controllers * 1 SDHC storage controller +The memory is set to 1537 MiB by default. A sanity check on RAM size is +performed in the machine init routine to prompt user to increase the RAM size +to > 1537 MiB when less than 1537 MiB RAM is detected. + Boot options ------------ -The ``microchip-icicle-kit`` machine can start using the standard -bios -functionality for loading its BIOS image, aka Hart Software Services (HSS_). -HSS loads the second stage bootloader U-Boot from an SD card. Then a kernel -can be loaded from U-Boot. 
It also supports direct kernel booting via the --kernel option along with the device tree blob via -dtb. When direct kernel -boot is used, the OpenSBI fw_dynamic BIOS image is used to boot a payload -like U-Boot or OS kernel directly. - -The user provided DTB should have the following requirements: - -* The /cpus node should contain at least one subnode for E51 and the number - of subnodes should match QEMU's ``-smp`` option -* The /memory reg size should match QEMU’s selected ram_size via ``-m`` -* Should contain a node for the CLINT device with a compatible string - "riscv,clint0" - -QEMU follows below truth table to select which payload to execute: - -===== ========== ========== ======= --bios -kernel -dtb payload -===== ========== ========== ======= - N N don't care HSS - Y don't care don't care HSS - N Y Y kernel -===== ========== ========== ======= - -The memory is set to 1537 MiB by default which is the minimum required high -memory size by HSS. A sanity check on ram size is performed in the machine -init routine to prompt user to increase the RAM size to > 1537 MiB when less -than 1537 MiB ram is detected. - -Running HSS ------------ - -HSS 2020.12 release is tested at the time of writing. To build an HSS image -that can be booted by the ``microchip-icicle-kit`` machine, type the following -in the HSS source tree: - -.. code-block:: bash - - $ export CROSS_COMPILE=riscv64-linux- - $ cp boards/mpfs-icicle-kit-es/def_config .config - $ make BOARD=mpfs-icicle-kit-es - -Download the official SD card image released by Microchip and prepare it for -QEMU usage: - -.. code-block:: bash - - $ wget ftp://ftpsoc.microsemi.com/outgoing/core-image-minimal-dev-icicle-kit-es-sd-20201009141623.rootfs.wic.gz - $ gunzip core-image-minimal-dev-icicle-kit-es-sd-20201009141623.rootfs.wic.gz - $ qemu-img resize core-image-minimal-dev-icicle-kit-es-sd-20201009141623.rootfs.wic 4G - -Then we can boot the machine by: - -.. 
code-block:: bash - - $ qemu-system-riscv64 -M microchip-icicle-kit -smp 5 \ - -bios path/to/hss.bin -sd path/to/sdcard.img \ - -nic user,model=cadence_gem \ - -nic tap,ifname=tap,model=cadence_gem,script=no \ - -display none -serial stdio \ - -chardev socket,id=serial1,path=serial1.sock,server=on,wait=on \ - -serial chardev:serial1 +The ``microchip-icicle-kit`` machine provides some options to run a firmware +(BIOS) or a kernel image. QEMU follows below truth table to select the +firmware: -With above command line, current terminal session will be used for the first -serial port. Open another terminal window, and use ``minicom`` to connect the -second serial port. +============= =========== ====================================== +-bios -kernel firmware +============= =========== ====================================== +none N this is an error +none Y the kernel image +NULL, default N hss.bin +NULL, default Y opensbi-riscv64-generic-fw_dynamic.bin +other don't care the BIOS image +============= =========== ====================================== -.. code-block:: bash +Direct Kernel Boot +------------------ - $ minicom -D unix\#serial1.sock +Use the ``-kernel`` option to directly run a kernel image. When a direct +kernel boot is requested, a device tree blob may be specified via the ``-dtb`` +option. Unlike other QEMU machines, this machine does not generate a device +tree for the kernel. It shall be provided by the user. The user provided DTB +should meet the following requirements: -HSS output is on the first serial port (stdio) and U-Boot outputs on the -second serial port. U-Boot will automatically load the Linux kernel from -the SD card image. +* The ``/cpus`` node should contain at least one subnode for E51 and the number + of subnodes should match QEMU's ``-smp`` option. -Direct Kernel Boot ------------------- +* The ``/memory`` reg size should match QEMU’s selected RAM size via the ``-m`` + option. 
-Sometimes we just want to test booting a new kernel, and transforming the -kernel image to the format required by the HSS bootflow is tedious. We can -use '-kernel' for direct kernel booting just like other RISC-V machines do. +* It should contain a node for the CLINT device with a compatible string + "riscv,clint0". -In this mode, the OpenSBI fw_dynamic BIOS image for 'generic' platform is -used to boot an S-mode payload like U-Boot or OS kernel directly. +When ``-bios`` is not specified or set to ``default``, the OpenSBI +``fw_dynamic`` BIOS image for the ``generic`` platform is used to boot an +S-mode payload like U-Boot or OS kernel directly. For example, the following commands show building a U-Boot image from U-Boot mainline v2021.07 for the Microchip Icicle Kit board: @@ -146,4 +99,13 @@ CAVEATS: ``u-boot.bin`` has to be used which does contain one. To use the ELF image, we need to change to CONFIG_OF_EMBED or CONFIG_OF_PRIOR_STAGE. +Running HSS +----------- + +The machine ``microchip-icicle-kit`` used to run the Hart Software Services +(HSS_), however, the HSS development progressed and the QEMU machine +implementation lags behind. Currently, running the HSS no longer works. +There is missing support in the clock and memory controller devices. In +particular, reading from the SD card does not work. + .. _HSS: https://github.com/polarfire-soc/hart-software-services diff --git a/docs/system/sriov.rst b/docs/system/sriov.rst new file mode 100644 index 0000000..d12178f --- /dev/null +++ b/docs/system/sriov.rst @@ -0,0 +1,37 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +Composable SR-IOV device +======================== + +SR-IOV (Single Root I/O Virtualization) is an optional extended capability of a +PCI Express device. It allows a single physical function (PF) to appear as +multiple virtual functions (VFs) for the main purpose of eliminating software +overhead in I/O from virtual machines.
+ +There are devices with predefined SR-IOV configurations, but it is also possible +to compose an SR-IOV device yourself. Composing an SR-IOV device is currently +only supported by virtio-net-pci. + +Users can configure an SR-IOV-capable virtio-net device by adding +virtio-net-pci functions to a bus. Below is a command line example: + +.. code-block:: shell + + -netdev user,id=n -netdev user,id=o + -netdev user,id=p -netdev user,id=q + -device pcie-root-port,id=b + -device virtio-net-pci,bus=b,addr=0x0.0x3,netdev=q,sriov-pf=f + -device virtio-net-pci,bus=b,addr=0x0.0x2,netdev=p,sriov-pf=f + -device virtio-net-pci,bus=b,addr=0x0.0x1,netdev=o,sriov-pf=f + -device virtio-net-pci,bus=b,addr=0x0.0x0,netdev=n,id=f + +The VFs specify the paired PF with the ``sriov-pf`` property. The PF must be +added after all VFs. It is the user's responsibility to ensure that VFs have +function numbers larger than that of the PF, and that the function numbers +have a consistent stride. Both the PF and VFs are ARI-capable, so you can have +at most 255 VFs. + +You may also need to perform additional steps to activate the SR-IOV feature on +your guest. For Linux, refer to [1]_. + +..
[1] https://docs.kernel.org/PCI/pci-iov-howto.html diff --git a/docs/system/target-i386.rst b/docs/system/target-i386.rst index ab7af1a..43b09c7 100644 --- a/docs/system/target-i386.rst +++ b/docs/system/target-i386.rst @@ -31,6 +31,7 @@ Architectural features i386/kvm-pv i386/sgx i386/amd-memory-encryption + i386/tdx OS requirements ~~~~~~~~~~~~~~~ diff --git a/docs/system/target-mips.rst b/docs/system/target-mips.rst index 83239fb..9028c3b 100644 --- a/docs/system/target-mips.rst +++ b/docs/system/target-mips.rst @@ -112,5 +112,5 @@ https://mipsdistros.mips.com/LinuxDistro/nanomips/kernels/v4.15.18-432-gb2eb9a8b Start system emulation of Malta board with nanoMIPS I7200 CPU:: qemu-system-mipsel -cpu I7200 -kernel <kernel_image_file> \ - -M malta -serial stdio -m <memory_size> -hda <disk_image_file> \ + -M malta -serial stdio -m <memory_size> -drive file=<disk_image_file>,format=raw \ -append "mem=256m@0x0 rw console=ttyS0 vga=cirrus vesa=0x111 root=/dev/sda"
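The new ``docs/system/sriov.rst`` in this diff leaves it to the user to pick function numbers that satisfy its rules: every VF above the PF, a consistent stride, and at most 255 VFs. The following is a minimal sketch of how such a layout might be checked before building a command line. The function name is hypothetical, and reading "consistent stride" as "equal spacing between consecutive VF function numbers" is an assumption of this sketch, not a statement of how QEMU validates the devices.

```python
def check_sriov_functions(pf_fn, vf_fns, max_vfs=255):
    """Return a list of problems with a PF/VF function-number layout.

    An empty list means the layout satisfies the rules as interpreted
    by this sketch.
    """
    problems = []
    if len(vf_fns) > max_vfs:
        problems.append("too many VFs")
    ordered = sorted(vf_fns)
    # Every VF must have a function number larger than the PF's.
    if any(fn <= pf_fn for fn in ordered):
        problems.append("VF function number not larger than the PF's")
    # Assumption: a consistent stride means equal, non-zero spacing
    # between consecutive VF function numbers.
    strides = {b - a for a, b in zip(ordered, ordered[1:])}
    if 0 in strides:
        problems.append("duplicate VF function number")
    elif len(strides) > 1:
        problems.append("inconsistent stride between VFs")
    return problems


if __name__ == "__main__":
    # Layout from the sriov.rst example: PF at function 0, VFs at 3, 2, 1.
    print(check_sriov_functions(0, [3, 2, 1]))
    print(check_sriov_functions(0, [1, 2, 4]))
```

The first call mirrors the ``addr=0x0.0x0`` to ``addr=0x0.0x3`` layout of the example command line; the second shows a layout with uneven spacing being reported.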