author William Henderson <william.henderson@nutanix.com> 2023-09-15 16:07:01 +0100
committer GitHub <noreply@github.com> 2023-09-15 16:07:01 +0100
commit 190f85bf9c114bf7c981bb8908394368f84c0c04
tree 92273a811fc3a8af74a5f62cec8871f345d6999b /docs/vfio-user.rst
parent 1569a37a54ecb63bd4008708c76339ccf7d06115
adapt to VFIO live migration v2 (#782)
This commit adapts the vfio-user protocol specification and the libvfio-user
implementation to v2 of the VFIO live migration interface, as used in the
kernel and QEMU.

The differences between v1 and v2 are discussed in this email thread [1], and
we slightly differ from upstream VFIO v2 in that instead of transferring data
over a new FD, we use the existing UNIX socket with new commands
VFIO_USER_MIG_DATA_READ/WRITE. We also don't yet use P2P states.

The updated spec was submitted to qemu-devel [2].

[1] https://lore.kernel.org/all/20220130160826.32449-9-yishaih@nvidia.com/
[2] https://lore.kernel.org/all/20230718094150.110183-1-william.henderson@nutanix.com/

Signed-off-by: William Henderson <william.henderson@nutanix.com>
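
As a rough illustration of the socket-based transport described above, the
following Python sketch packs a VFIO_USER_MIG_DATA_READ request payload and
unpacks a reply payload, using the field layout from the updated spec (argsz
and size as 4-byte fields at offsets 0 and 4, with reply data at offset 8).
The helper names are hypothetical, the 16-byte vfio-user message header is
left out, and setting argsz to the 8-byte fixed part plus the requested size
is an assumption rather than something the spec text mandates.

```python
import struct

# argsz (offset 0, 4 bytes) followed by size (offset 4, 4 bytes).
# "=" selects host byte order with standard sizes, matching the spec's
# "endianness of the host system" wording.
MIG_DATA_HDR = struct.Struct("=II")


def build_mig_data_read_request(size):
    """Pack the request payload asking for up to `size` bytes of migration data."""
    # Assumption: the largest reply is the fixed fields plus `size` data bytes.
    argsz = MIG_DATA_HDR.size + size
    return MIG_DATA_HDR.pack(argsz, size)


def parse_mig_data_read_reply(payload):
    """Return (argsz, data) from a reply payload; data starts at offset 8."""
    argsz, size = MIG_DATA_HDR.unpack_from(payload, 0)
    data = payload[MIG_DATA_HDR.size:MIG_DATA_HDR.size + size]
    if len(data) != size:
        raise ValueError("reply payload is shorter than its size field")
    return argsz, data
```

A VFIO_USER_MIG_DATA_WRITE request uses the same three-field shape as the
reply parsed here, so the same header structure could be reused for it.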
Diffstat (limited to 'docs/vfio-user.rst')
-rw-r--r-- docs/vfio-user.rst | 836
1 files changed, 463 insertions, 373 deletions
diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst
index 3c26da5..b83b359 100644
--- a/docs/vfio-user.rst
+++ b/docs/vfio-user.rst
@@ -1,11 +1,10 @@
.. include:: <isonum.txt>
-
********************************
vfio-user Protocol Specification
********************************
--------------
-Version_ 0.9.1
+Version_ 0.9.2
--------------
.. contents:: Table of Contents
@@ -342,9 +341,9 @@ usual ``msg_size`` field in the header, not the ``argsz`` field.
In a reply, the server sets the ``argsz`` field to the size needed for a full
payload. This may be less than the requested maximum size. This may be
-larger than the requested maximum size: in that case, the payload reply header
-is returned, but the ``argsz`` field in the reply indicates the needed size,
-allowing a client to allocate a larger buffer for holding the reply before
+larger than the requested maximum size: in that case, the full payload is not
+included in the reply, but the ``argsz`` field in the reply indicates the needed
+size, allowing a client to allocate a larger buffer for holding the reply before
trying again.
In addition, during negotiation (see `Version`_), the client and server may
@@ -357,8 +356,9 @@ Protocol Specification
To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
-little-endian format, although this may be relaxed in future revisions in cases
-where the client and server are both big-endian.
+endianness of the host system, although this may be relaxed in future
+revisions in cases where the client and server run on different hosts
+with different endianness.
Unless otherwise specified, all sizes should be presumed to be in bytes.
@@ -385,7 +385,10 @@ Name Command Request Direction
``VFIO_USER_DMA_READ`` 11 server -> client
``VFIO_USER_DMA_WRITE`` 12 server -> client
``VFIO_USER_DEVICE_RESET`` 13 client -> server
-``VFIO_USER_DIRTY_PAGES`` 14 client -> server
+``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server
+``VFIO_USER_DEVICE_FEATURE`` 16 client -> server
+``VFIO_USER_MIG_DATA_READ`` 17 client -> server
+``VFIO_USER_MIG_DATA_WRITE`` 18 client -> server
====================================== ========= =================
Header
@@ -508,34 +511,33 @@ format:
Capabilities:
-+--------------------+--------+------------------------------------------------+
-| Name | Type | Description |
-+====================+========+================================================+
-| max_msg_fds | number | Maximum number of file descriptors that can be |
-| | | received by the sender in one message. |
-| | | Optional. If not specified then the receiver |
-| | | must assume a value of ``1``. |
-+--------------------+--------+------------------------------------------------+
-| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; |
-| | | see `Read and Write Operations`_. Optional, |
-| | | with a default value of 1048576 bytes. |
-+--------------------+--------+------------------------------------------------+
-| migration | object | Migration capability parameters. If missing |
-| | | then migration is not supported by the sender. |
-+--------------------+--------+------------------------------------------------+
-| twin_socket | object | Parameters for twin-socket mode, which handles |
-| | | server-to-client commands and their replies on |
-| | | a separate socket. Optional. |
-+--------------------+--------+------------------------------------------------+
-
-The migration capability contains the following name/value pairs:
-
-+--------+--------+-----------------------------------------------+
-| Name | Type | Description |
-+========+========+===============================================+
-| pgsize | number | Page size of dirty pages bitmap. The smallest |
-| | | between the client and the server is used. |
-+--------+--------+-----------------------------------------------+
++--------------------+---------+-----------------------------------------------+
+| Name | Type | Description |
++====================+=========+===============================================+
+| max_msg_fds | number | Maximum number of file descriptors that can |
+| | | be received by the sender in one message. |
+| | | Optional. If not specified then the receiver |
+| | | must assume a value of ``1``. |
++--------------------+---------+-----------------------------------------------+
+| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; |
+| | | see `Read and Write Operations`_. Optional, |
+| | | with a default value of 1048576 bytes. |
++--------------------+---------+-----------------------------------------------+
+| max_dma_maps | number | Maximum number of DMA map windows that can be |
+| | | valid simultaneously. Optional, with a |
+| | | default value of 65535 (64k-1). |
++--------------------+---------+-----------------------------------------------+
+| pgsizes | number | Page sizes supported in DMA map operations |
+| | | or'ed together. Optional, with a default |
+| | | value of supporting only 4k pages. |
++--------------------+---------+-----------------------------------------------+
+| twin_socket | object | Parameters for twin-socket mode, which |
+| | | handles server-to-client commands and their |
+| | | replies on a separate socket. Optional. |
++--------------------+---------+-----------------------------------------------+
+| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages |
+| | | are supported if the value is ``true``. |
++--------------------+---------+-----------------------------------------------+
The ``twin_socket`` capability object holds these name/value pairs:
@@ -678,56 +680,18 @@ The request payload for this message is a structure of the following format:
+--------------+--------+------------------------+
| flags | 4 | 4 |
+--------------+--------+------------------------+
-| | +-----+-----------------------+ |
-| | | Bit | Definition | |
-| | +=====+=======================+ |
-| | | 0 | get dirty page bitmap | |
-| | +-----+-----------------------+ |
-| | | 1 | unmap all regions | |
-| | +-----+-----------------------+ |
-+--------------+--------+------------------------+
| address | 8 | 8 |
+--------------+--------+------------------------+
| size | 16 | 8 |
+--------------+--------+------------------------+
* *argsz* is the maximum size of the reply payload.
-* *flags* contains the following DMA region attributes:
-
- * *get dirty page bitmap* indicates that a dirty page bitmap must be
- populated before unmapping the DMA region. The client must provide a
- `VFIO Bitmap`_ structure, explained below, immediately following this
- entry.
- * *unmap all regions* indicates to unmap all the regions previously
- mapped via `VFIO_USER_DMA_MAP`. This flag cannot be combined with
- *get dirty page bitmap* and expects *address* and *size* to be 0.
-
+* *flags* is unused in this version.
* *address* is the base DMA address of the DMA region.
* *size* is the size of the DMA region.
The address and size of the DMA region being unmapped must match exactly a
-previous mapping. The size of request message depends on whether or not the
-*get dirty page bitmap* bit is set in Flags:
-
-* If not set, the size of the total request message is: 16 + 24.
-
-* If set, the size of the total request message is: 16 + 24 + 16.
-
-.. _VFIO Bitmap:
-
-VFIO Bitmap Format
-""""""""""""""""""
-
-+--------+--------+------+
-| Name | Offset | Size |
-+========+========+======+
-| pgsize | 0 | 8 |
-+--------+--------+------+
-| size | 8 | 8 |
-+--------+--------+------+
-
-* *pgsize* is the page size for the bitmap, in bytes.
-* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
+previous mapping.
Reply
^^^^^
@@ -736,14 +700,8 @@ Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
mapped then the server must release all references to that DMA region before
replying, which potentially includes in-flight DMA transactions.
-The server responds with the original DMA entry in the request. If the
-*get dirty page bitmap* bit is set in flags in the request, then
-the server also includes the `VFIO Bitmap`_ structure sent in the request,
-followed by the corresponding dirty page bitmap, where each bit represents
-one page of size *pgsize* in `VFIO Bitmap`_ .
+The server responds with the original DMA entry in the request.
-The total size of the total reply message is:
-16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set).
``VFIO_USER_DEVICE_GET_INFO``
-----------------------------
@@ -959,7 +917,7 @@ VFIO region info cap sparse mmap
+----------+--------+------+
| offset | 8 | 8 |
+----------+--------+------+
-| size | 16 | 9 |
+| size | 16 | 8 |
+----------+--------+------+
| ... | | |
+----------+--------+------+
@@ -973,39 +931,6 @@ VFIO region info cap sparse mmap
The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
vfio_region_info_cap_sparse_mmap``).
-VFIO region type cap header
-"""""""""""""""""""""""""""
-
-+------------------+---------------------------+
-| Name | Value |
-+==================+===========================+
-| id | VFIO_REGION_INFO_CAP_TYPE |
-+------------------+---------------------------+
-| version | 0x1 |
-+------------------+---------------------------+
-| next | <next> |
-+------------------+---------------------------+
-| region info type | VFIO region info type |
-+------------------+---------------------------+
-
-This capability is defined when a region is specific to the device.
-
-VFIO region info type cap
-"""""""""""""""""""""""""
-
-The VFIO region info type is defined in ``<linux/vfio.h>``
-(``struct vfio_region_info_cap_type``).
-
-+---------+--------+------+
-| Name | Offset | Size |
-+=========+========+======+
-| type | 0 | 4 |
-+---------+--------+------+
-| subtype | 4 | 4 |
-+---------+--------+------+
-
-The only device-specific region type and subtype supported by vfio-user is
-``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).
``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
--------------------------------------
@@ -1071,7 +996,7 @@ Reply
* *argsz* is the size of the region IO FD info structure plus the
total size of the sub-region array. Thus, each array entry "i" is at offset
- i * ((argsz - 16) / count). Note that currently this is 40 bytes for both IO
+ i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
FD types, but this is not to be relied on. As elsewhere, this indicates the
full reply payload size needed.
* *flags* must be zero
@@ -1087,8 +1012,8 @@ Note that it is the client's responsibility to verify the requested values (for
example, that the requested offset does not exceed the region's bounds).
Each sub-region given in the response has one of two possible structures,
-depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` (0) or
-``VFIO_USER_IO_FD_TYPE_IOREGIONFD`` (1):
+depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
+``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
Sub-Region IO FD info format (ioeventfd)
""""""""""""""""""""""""""""""""""""""""
@@ -1552,290 +1477,455 @@ Reply
This command message is sent from the client to the server to reset the device.
Neither the request nor the reply has a payload.
-``VFIO_USER_DIRTY_PAGES``
--------------------------
+``VFIO_USER_REGION_WRITE_MULTI``
+--------------------------------
+
+This message can be used to coalesce multiple device write operations
+into a single message. It is only used as an optimization when the
+outgoing message queue is relatively full.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name | Offset | Size |
++=========+========+==========+
+| wr_cnt | 0 | 8 |
++---------+--------+----------+
+| wrs | 8 | variable |
++---------+--------+----------+
-This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``. It is sent by the client
-to the server in order to control logging of dirty pages, usually during a live
-migration.
+* *wr_cnt* is the number of device writes coalesced in the message
+* *wrs* is an array of device writes defined below
-Dirty page tracking is optional for server implementation; clients should not
-rely on it.
+Single Device Write Format
+""""""""""""""""""""""""""
+
++--------+--------+----------+
+| Name | Offset | Size |
++========+========+==========+
+| offset | 0 | 8 |
++--------+--------+----------+
+| region | 8 | 4 |
++--------+--------+----------+
+| count | 12 | 4 |
++--------+--------+----------+
+| data | 16 | 8 |
++--------+--------+----------+
+
+* *offset* is the offset into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred. This format can
+ only describe writes of 8 bytes or less.
+* *data* is the data to write.
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name | Offset | Size |
++=========+========+==========+
+| wr_cnt | 0 | 8 |
++---------+--------+----------+
+
+* *wr_cnt* is the number of device writes completed.
+
+``VFIO_USER_DEVICE_FEATURE``
+----------------------------
+
+This command is analogous to ``VFIO_DEVICE_FEATURE``. It is used to get, set, or
+probe feature data of the device.
Request
^^^^^^^
-+-------+--------+-----------------------------------------+
-| Name | Offset | Size |
-+=======+========+=========================================+
-| argsz | 0 | 4 |
-+-------+--------+-----------------------------------------+
-| flags | 4 | 4 |
-+-------+--------+-----------------------------------------+
-| | +-----+----------------------------------------+ |
-| | | Bit | Definition | |
-| | +=====+========================================+ |
-| | | 0 | VFIO_IOMMU_DIRTY_PAGES_FLAG_START | |
-| | +-----+----------------------------------------+ |
-| | | 1 | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP | |
-| | +-----+----------------------------------------+ |
-| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
-| | +-----+----------------------------------------+ |
-+-------+--------+-----------------------------------------+
-
-* *argsz* is the size of the VFIO dirty bitmap info structure for
- ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply payload
-
-* *flags* defines the action to be performed by the server:
-
- * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start logging
- pages it dirties. Logging continues until explicitly disabled by
- ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``.
-
- * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logging
- dirty pages.
-
- * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to return
- the dirty bitmap for a specific IOVA range. The IOVA range is specified by
- a "VFIO Bitmap Range" structure, which must immediately follow this
- "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_.
- This operation is only valid if logging of dirty pages has been previously
- started.
-
- These flags are mutually exclusive with each other.
-
-This part of the request is analogous to VFIO's ``struct
-vfio_iommu_type1_dirty_bitmap``.
-
-.. _VFIO Bitmap Range Format:
-
-VFIO Bitmap Range Format
+The request payload for this message is a structure of the following format.
+
++-------+--------+--------------------------------+
+| Name | Offset | Size |
++=======+========+================================+
+| argsz | 0 | 4 |
++-------+--------+--------------------------------+
+| flags | 4 | 4 |
++-------+--------+--------------------------------+
+| | +---------+---------------------------+ |
+| | | Bit | Definition | |
+| | +=========+===========================+ |
+| | | 0 to 15 | Feature index | |
+| | +---------+---------------------------+ |
+| | | 16 | VFIO_DEVICE_FEATURE_GET | |
+| | +---------+---------------------------+ |
+| | | 17 | VFIO_DEVICE_FEATURE_SET | |
+| | +---------+---------------------------+ |
+| | | 18 | VFIO_DEVICE_FEATURE_PROBE | |
+| | +---------+---------------------------+ |
++-------+--------+--------------------------------+
+| data | 8 | variable |
++-------+--------+--------------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+
+* *flags* defines the action to be performed by the server and upon which
+ feature:
+
+ * The feature index consists of the least significant 16 bits of the flags
+ field, and can be accessed using the ``VFIO_DEVICE_FEATURE_MASK`` bit mask.
+
+ * ``VFIO_DEVICE_FEATURE_GET`` instructs the server to get the data for the
+ given feature.
+
+ * ``VFIO_DEVICE_FEATURE_SET`` instructs the server to set the feature data to
+ that given in the ``data`` field of the payload.
+
+ * ``VFIO_DEVICE_FEATURE_PROBE`` instructs the server to probe for feature
+ support. If ``VFIO_DEVICE_FEATURE_GET`` and/or ``VFIO_DEVICE_FEATURE_SET``
+ are also set, the probe will only return success if all of the indicated
+ methods are supported.
+
+ ``VFIO_DEVICE_FEATURE_GET`` and ``VFIO_DEVICE_FEATURE_SET`` are mutually
+ exclusive, except for use with ``VFIO_DEVICE_FEATURE_PROBE``.
+
+* *data* is specific to the particular feature. It is not used for probing.
+
+This part of the request is analogous to VFIO's ``struct vfio_device_feature``.
+
+Reply
+^^^^^
+
+The reply payload must be the same as the request payload for setting or
+probing a feature. For getting a feature's data, the data is added in the data
+section and its length is added to ``argsz``.
+
+Device Features
+^^^^^^^^^^^^^^^
+
+The only device features supported by vfio-user are those related to migration,
+although this may change in the future. They are a subset of those supported in
+the VFIO implementation of the Linux kernel.
+
++----------------------------------------+---------------+
+| Name | Feature Index |
++========================================+===============+
+| VFIO_DEVICE_FEATURE_MIGRATION | 1 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE | 2 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_DMA_LOGGING_START | 6 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP | 7 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT | 8 |
++----------------------------------------+---------------+
+
+``VFIO_DEVICE_FEATURE_MIGRATION``
+"""""""""""""""""""""""""""""""""
+
+This feature indicates that the device can support the migration API through
+``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``. If ``GET`` succeeds, the ``RUNNING``
+and ``ERROR`` states are always supported. Support for additional states is
+indicated via the flags field; at least ``VFIO_MIGRATION_STOP_COPY`` must be
+set.
+
+The request message has no data field.
+
+The data field of the reply message is structured as follows:
+
++-------+--------+---------------------------+
+| Name | Offset | Size |
++=======+========+===========================+
+| flags | 0 | 8 |
++-------+--------+---------------------------+
+| | +-----+--------------------------+ |
+| | | Bit | Definition | |
+| | +=====+==========================+ |
+| | | 0 | VFIO_MIGRATION_STOP_COPY | |
+| | +-----+--------------------------+ |
+| | | 1 | VFIO_MIGRATION_P2P | |
+| | +-----+--------------------------+ |
+| | | 2 | VFIO_MIGRATION_PRE_COPY | |
+| | +-----+--------------------------+ |
++-------+--------+---------------------------+
+
+These flags are interpreted in the same way as in VFIO.
+
+``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``
+""""""""""""""""""""""""""""""""""""""""
+
+Upon ``VFIO_DEVICE_FEATURE_SET``, execute a migration state change on the VFIO
+device. The new state is supplied in ``device_state``. The state transition must
+fully complete before the reply is sent.
+
+The data field of the reply message, as well as the ``SET`` request message, is
+structured as follows:
+
++--------------+--------+------+
+| Name | Offset | Size |
++==============+========+======+
+| device_state | 0 | 4 |
++--------------+--------+------+
+| data_fd | 4 | 4 |
++--------------+--------+------+
+
+* *device_state* is the current state of the device (for ``GET``) or the
+ state to transition to (for ``SET``). It is defined by the
+ ``vfio_device_mig_state`` enum as detailed below. These states are the states
+ of the device migration Finite State Machine.
+
++--------------------------------+-------+---------------------------------------------------------------------+
+| Name | State | Description |
++================================+=======+=====================================================================+
+| VFIO_DEVICE_STATE_ERROR | 0 | The device has failed and must be reset. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_STOP | 1 | The device does not change the internal or external state. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_RUNNING | 2 | The device is running normally. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_STOP_COPY | 3 | The device internal state can be read out. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_RESUMING | 4 | The device is stopped and is loading a new internal state. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_RUNNING_P2P | 5 | (not used in vfio-user) |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_PRE_COPY | 6 | The device is running normally but tracking internal state changes. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_PRE_COPY_P2P | 7 | (not used in vfio-user) |
++--------------------------------+-------+---------------------------------------------------------------------+
+
+* *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and
+ ``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data
+ transport.
+
+Direct State Transitions
""""""""""""""""""""""""
+The device migration FSM is a Mealy machine, so actions are taken upon the arcs
+between FSM states. The following transitions need to be supported by the
+server, a subset of those defined in ``<linux/vfio.h>``
+(``enum vfio_device_mig_state``).
+
+* ``RUNNING -> STOP``, ``STOP_COPY -> STOP``: Stop the operation of the device.
+ The ``STOP_COPY`` arc terminates the data transfer session.
+
+* ``RESUMING -> STOP``: Terminate the data transfer session. Complete processing
+ of the migration data. Stop the operation of the device. If the delivered data
+ is found to be incomplete, inconsistent, or otherwise invalid, fail the
+ ``SET`` command and optionally transition to the ``ERROR`` state.
+
+* ``PRE_COPY -> RUNNING``: Terminate the data transfer session. The device is
+ now fully operational.
+
+* ``STOP -> RUNNING``: Start the operation of the device.
+
+* ``RUNNING -> PRE_COPY``, ``STOP -> STOP_COPY``: Begin the process of saving
+ the device state. The device operation is unchanged, but data transfer begins.
+ ``PRE_COPY`` and ``STOP_COPY`` are referred to as the "saving group" of
+ states.
+
+* ``PRE_COPY -> STOP_COPY``: Continue to transfer migration data, but stop
+ device operation.
+
+* ``STOP -> RESUMING``: Start the process of restoring the device state. The
+ internal device state may be changed to prepare the device to receive the
+ migration data.
+
+The ``STOP_COPY -> PRE_COPY`` transition is explicitly not allowed and should
+return an error if requested.
+
+``ERROR`` cannot be specified as a device state, but any transition request can
+be failed and then move the state into ``ERROR`` if the server was unable to
+execute the requested arc AND was unable to restore the device into any valid
+state. To recover from ``ERROR``, ``VFIO_USER_DEVICE_RESET`` must be used to
+return to ``RUNNING``.
+
+If ``PRE_COPY`` is not supported, arcs touching it are removed.
+
+Complex State Transitions
+"""""""""""""""""""""""""
+
+The remaining possible transitions are to be implemented as combinations of the
+above FSM arcs. As there are multiple paths, the path should be selected based
+on the following rules:
+
+* Select the shortest path.
+
+* The path cannot have saving group states as interior arcs, only start/end
+ states.
+
+``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` / ``VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP``
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Upon ``VFIO_DEVICE_FEATURE_SET``, start/stop DMA logging. These features can
+also be probed to determine whether the device supports DMA logging.
+
+When DMA logging is started, a range of IOVAs to monitor is provided and the
+device can optimize its logging to cover only the IOVA range given. Only DMA
+writes are logged.
+
+The data field of the ``SET`` request is structured as follows:
+
++------------+--------+----------+
+| Name | Offset | Size |
++============+========+==========+
+| page_size | 0 | 8 |
++------------+--------+----------+
+| num_ranges | 8 | 4 |
++------------+--------+----------+
+| reserved | 12 | 4 |
++------------+--------+----------+
+| ranges | 16 | variable |
++------------+--------+----------+
+
+* *page_size* hints what tracking granularity the device should try to achieve.
+ If the device cannot do the hinted page size then it's the driver's choice
+ which page size to pick based on its support. On output the device will return
+ the page size it selected.
+
+* *num_ranges* is the number of IOVA ranges to monitor. A value of zero
+ indicates that all writes should be logged.
+
+* *ranges* is an array of ``vfio_user_device_feature_dma_logging_range``
+ entries:
+
+--------+--------+------+
| Name | Offset | Size |
+========+========+======+
| iova | 0 | 8 |
+--------+--------+------+
-| size | 8 | 8 |
-+--------+--------+------+
-| bitmap | 16 | 24 |
+| length | 8 | 8 |
+--------+--------+------+
-* *iova* is the IOVA offset
+ * *iova* is the base IO virtual address
+ * *length* is the length of the range to log
+
+Upon success, the response data field will be the same as the request, unless
+the page size was changed, in which case this will be reflected in the response.
+
+``VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT``
+""""""""""""""""""""""""""""""""""""""""""
+
+Upon ``VFIO_DEVICE_FEATURE_GET``, returns the dirty bitmap for a specific IOVA
+range. This operation is only valid if logging of dirty pages has been
+previously started by setting ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START``.
+
+The data field of the request is structured as follows:
+
++-----------+--------+------+
+| Name | Offset | Size |
++===========+========+======+
+| iova | 0 | 8 |
++-----------+--------+------+
+| length | 8 | 8 |
++-----------+--------+------+
+| page_size | 16 | 8 |
++-----------+--------+------+
+
+* *iova* is the base IO virtual address
+
+* *length* is the length of the range
+
+* *page_size* is the unit of granularity of the bitmap, and must be a power of
+ two. It doesn't have to match the value given to
+ ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` because the driver will format its
+ internal logging to match the reporting page size possibly by replicating bits
+ if the internal page size is lower than requested.
+
+The data field of the response is identical, except with the bitmap added on
+the end at offset 24.
+
+The bitmap is an array of u64s that holds the output bitmap, with 1 bit
+reporting a *page_size* unit of IOVA. The bits outside of the requested range
+must be zero.
+
+The mapping of IOVA to bits is given by:
+
+``bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))``
+
+``VFIO_USER_MIG_DATA_READ``
+---------------------------
+
+This command is used to read data from the source migration server while it is
+in a saving group state (``PRE_COPY`` or ``STOP_COPY``).
+
+This command, and ``VFIO_USER_MIG_DATA_WRITE``, are used in place of the
+``data_fd`` file descriptor in ``<linux/vfio.h>``
+(``struct vfio_device_feature_mig_state``) to enable all data transport to use
+the single already-established UNIX socket. Hence, the migration data is
+treated like a stream, so the client must continue reading until no more
+migration data remains.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format.
-* *size* is the size of the IOVA region
++-------+--------+------+
+| Name | Offset | Size |
++=======+========+======+
+| argsz | 0 | 4 |
++-------+--------+------+
+| size | 4 | 4 |
++-------+--------+------+
-* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_.
+* *argsz* is the maximum size of the reply payload.
-This part of the request is analogous to VFIO's ``struct
-vfio_iommu_type1_dirty_bitmap_get``.
+* *size* is the size of the migration data to read.
Reply
^^^^^
-For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or
-``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload.
-
-For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as follows:
-
-+--------------+--------+-----------------------------------------+
-| Name | Offset | Size |
-+==============+========+=========================================+
-| argsz | 0 | 4 |
-+--------------+--------+-----------------------------------------+
-| flags | 4 | 4 |
-+--------------+--------+-----------------------------------------+
-| | +-----+----------------------------------------+ |
-| | | Bit | Definition | |
-| | +=====+========================================+ |
-| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
-| | +-----+----------------------------------------+ |
-+--------------+--------+-----------------------------------------+
-| bitmap range | 8 | 40 |
-+--------------+--------+-----------------------------------------+
-| bitmap | 48 | variable |
-+--------------+--------+-----------------------------------------+
-
-* *argsz* is the size required for the full reply payload (dirty pages structure
- + bitmap range structure + actual bitmap)
-* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``
-* *bitmap range* is the same bitmap range struct provided in the request, as
- defined in `VFIO Bitmap Range Format`_.
-* *bitmap* is the actual dirty pages bitmap corresponding to the range request
-
-VFIO Device Migration Info
---------------------------
+The reply payload for this message is a structure of the following format.
+
++-------+--------+----------+
+| Name  | Offset | Size     |
++=======+========+==========+
+| argsz | 0      | 4        |
++-------+--------+----------+
+| size  | 4      | 4        |
++-------+--------+----------+
+| data  | 8      | variable |
++-------+--------+----------+
-A device may contain a migration region (of type
-``VFIO_REGION_TYPE_MIGRATION``). The beginning of the region must contain
-``struct vfio_device_migration_info``, defined in ``<linux/vfio.h>``. This
-subregion is accessed like any other part of a standard vfio-user region
-using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``.
-
-+---------------+--------+--------------------------------+
-| Name          | Offset | Size                           |
-+===============+========+================================+
-| device_state  | 0      | 4                              |
-+---------------+--------+--------------------------------+
-|               | +-----+-------------------------------+ |
-|               | | Bit | Definition                    | |
-|               | +=====+===============================+ |
-|               | | 0   | VFIO_DEVICE_STATE_V1_RUNNING  | |
-|               | +-----+-------------------------------+ |
-|               | | 1   | VFIO_DEVICE_STATE_V1_SAVING   | |
-|               | +-----+-------------------------------+ |
-|               | | 2   | VFIO_DEVICE_STATE_V1_RESUMING | |
-|               | +-----+-------------------------------+ |
-+---------------+--------+--------------------------------+
-| reserved      | 4      | 4                              |
-+---------------+--------+--------------------------------+
-| pending_bytes | 8      | 8                              |
-+---------------+--------+--------------------------------+
-| data_offset   | 16     | 8                              |
-+---------------+--------+--------------------------------+
-| data_size     | 24     | 8                              |
-+---------------+--------+--------------------------------+
-
-* *device_state* defines the state of the device:
-
- The client initiates device state transition by writing the intended state.
- The server must respond only after it has successfully transitioned to the new
- state. If an error occurs then the server must respond to the
- ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingly and
- must remain at the previous state, or in case of internal error it must
- transition to the error state, defined as
- ``VFIO_DEVICE_STATE_V1_RESUMING | VFIO_DEVICE_STATE_V1_SAVING``. The client
- must re-read the device state in order to determine it afresh.
-
- The following device states are defined:
-
- +-----------+---------+----------+-----------------------------------+
- | _RESUMING | _SAVING | _RUNNING | Description                       |
- +===========+=========+==========+===================================+
- | 0         | 0       | 0        | Device is stopped.                |
- +-----------+---------+----------+-----------------------------------+
- | 0         | 0       | 1        | Device is running, default state. |
- +-----------+---------+----------+-----------------------------------+
- | 0         | 1       | 0        | Stop-and-copy state               |
- +-----------+---------+----------+-----------------------------------+
- | 0         | 1       | 1        | Pre-copy state                    |
- +-----------+---------+----------+-----------------------------------+
- | 1         | 0       | 0        | Resuming                          |
- +-----------+---------+----------+-----------------------------------+
- | 1         | 0       | 1        | Invalid state                     |
- +-----------+---------+----------+-----------------------------------+
- | 1         | 1       | 0        | Error state                       |
- +-----------+---------+----------+-----------------------------------+
- | 1         | 1       | 1        | Invalid state                     |
- +-----------+---------+----------+-----------------------------------+
-
- Valid state transitions are shown in the following table:
-
- +-------------------------+---------+---------+---------------+----------+----------+
- | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming |
- +=========================+=========+=========+===============+==========+==========+
- | Stopped                 | \-      | 1       | 0             | 0        | 0        |
- +-------------------------+---------+---------+---------------+----------+----------+
- | Running                 | 1       | \-      | 1             | 1        | 1        |
- +-------------------------+---------+---------+---------------+----------+----------+
- | Stop-and-copy           | 1       | 1       | \-            | 0        | 0        |
- +-------------------------+---------+---------+---------------+----------+----------+
- | Pre-copy                | 0       | 0       | 1             | \-       | 0        |
- +-------------------------+---------+---------+---------------+----------+----------+
- | Resuming                | 0       | 1       | 0             | 0        | \-       |
- +-------------------------+---------+---------+---------------+----------+----------+
-
- A device is migrated to the destination as follows:
-
- * The source client transitions the device state from the running state to
- the pre-copy state. This transition is optional for the client but must be
- supported by the server. The source server starts sending device state data
- to the source client through the migration region while the device is
- running.
-
- * The source client transitions the device state from the running state or the
- pre-copy state to the stop-and-copy state. The source server stops the
- device, saves device state and sends it to the source client through the
- migration region.
-
- The source client is responsible for sending the migration data to the
- destination client.
-
- A device is resumed on the destination as follows:
-
- * The destination client transitions the device state from the running state
- to the resuming state. The destination server uses the device state data
- received through the migration region to resume the device.
-
- * The destination client provides saved device state to the destination
- server and then transitions the device to back to the running state.
-
-* *reserved* This field is reserved and any access to it must be ignored by the
- server.
-
-* *pending_bytes* Remaining bytes to be migrated by the server. This field is
- read only.
-
-* *data_offset* Offset in the migration region where the client must:
-
- * read from, during the pre-copy or stop-and-copy state, or
-
- * write to, during the resuming state.
-
- This field is read only.
-
-* *data_size* Contains the size, in bytes, of the amount of data copied to:
-
- * the source migration region by the source server during the pre-copy or
- stop-and copy state, or
-
- * the destination migration region by the destination client during the
- resuming state.
-
-Device-specific data must be stored at any position after
-``struct vfio_device_migration_info``. Note that the migration region can be
-memory mappable, even partially. In practise, only the migration data portion
-can be memory mapped.
-
-The client processes device state data during the pre-copy and the
-stop-and-copy state in the following iterative manner:
-
- 1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads
- of this field is an idempotent operation. If there are no migration data
- to be consumed then the next step depends on the current device state:
-
- * pre-copy: the client must try again.
+* *argsz* is the size of the above structure, including the size of the data.
- * stop-and-copy: this procedure can end and the device can now start
- resuming on the destination.
+* *size* indicates the size of returned migration data. If this is less than the
+ requested size, there is no more migration data to read.
- 2. The client reads ``data_offset``; at this point the server must make
- available a portion of migration data at this offset to be read by the
- client, which must happen *before* completing the read operation. The
- amount of data to be read must be stored in the ``data_size`` field, which
- the client reads next.
+* *data* contains the migration data.
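The stream semantics described above (keep reading until a reply returns fewer bytes than requested) could be sketched as follows. This is only an illustration: ``send_request`` is a hypothetical callable standing in for the vfio-user transport (request payload bytes in, reply payload bytes out), and native byte order and the ``argsz`` value are assumptions.

```python
import struct

# argsz and size, both unsigned 32-bit; native byte order is an assumption.
HEADER = struct.Struct("=II")


def read_all_migration_data(send_request, chunk_size):
    """Drain migration data with repeated VFIO_USER_MIG_DATA_READ requests.

    send_request is a hypothetical transport callable, not a library API.
    """
    chunks = []
    while True:
        # Assumption: argsz leaves room for the reply header plus chunk_size bytes.
        request = HEADER.pack(HEADER.size + chunk_size, chunk_size)
        reply = send_request(request)
        _argsz, size = HEADER.unpack_from(reply, 0)
        data = reply[HEADER.size:HEADER.size + size]
        if len(data) != size:
            raise ValueError("reply is shorter than its size field claims")
        chunks.append(data)
        # A short read signals that no more migration data remains.
        if size < chunk_size:
            break
    return b"".join(chunks)
```

Note that when the total amount of data is an exact multiple of ``chunk_size``, the loop ends on a final reply whose ``size`` is zero.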
- 3. The client reads ``data_size`` to determine the amount of migration data
- available.
+``VFIO_USER_MIG_DATA_WRITE``
+----------------------------
- 4. The client reads and processes the migration data.
+This command is used to write data to the destination migration server while it
+is in the ``RESUMING`` state.
- 5. Go to step 1.
+As above, this replaces the ``data_fd`` file descriptor for transport of
+migration data, and as such, the migration data is treated like a stream.
-Note that the client can transition the device from the pre-copy state to the
-stop-and-copy state at any time; ``pending_bytes`` does not need to become zero.
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format.
+
++-------+--------+----------+
+| Name  | Offset | Size     |
++=======+========+==========+
+| argsz | 0      | 4        |
++-------+--------+----------+
+| size  | 4      | 4        |
++-------+--------+----------+
+| data  | 8      | variable |
++-------+--------+----------+
+
+* *argsz* is the maximum size of the reply payload.
+
+* *size* is the size of the migration data to be written.
+
+* *data* contains the migration data.
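A destination client could split saved migration data into a sequence of write requests along these lines. Both helper names are hypothetical, native byte order is an assumption, and the ``argsz`` value passed here (header plus data size) is only a placeholder, since this command has no reply payload.

```python
import struct

# argsz and size, both unsigned 32-bit; native byte order is an assumption.
HEADER = struct.Struct("=II")


def build_mig_data_write_request(data, argsz):
    """Encode one VFIO_USER_MIG_DATA_WRITE request payload (hypothetical helper)."""
    return HEADER.pack(argsz, len(data)) + data


def split_into_write_requests(blob, chunk_size):
    """Yield request payloads that carry blob in order, chunk_size bytes at a time."""
    for offset in range(0, len(blob), chunk_size):
        chunk = blob[offset:offset + chunk_size]
        # Assumption: argsz is set to the header size plus the data size.
        yield build_mig_data_write_request(chunk, HEADER.size + len(chunk))
```

Because the data is treated like a stream, the requests must be sent in the order they are yielded.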
-The client initializes the device state on the destination by setting the
-device state in the resuming state and writing the migration data to the
-destination migration region at ``data_offset`` offset. The client can write the
-source migration data in an iterative manner and the server must consume this
-data before completing each write operation, updating the ``data_offset`` field.
-The server must apply the source migration data on the device resume state. The
-client must write data on the same order and transaction size as read.
+Reply
+^^^^^
-If an error occurs then the server must fail the read or write operation. It is
-an implementation detail of the client how to handle errors.
+There is no reply payload for this message.
Appendices
==========