Diffstat (limited to 'docs/vfio-user.rst')
-rw-r--r-- | docs/vfio-user.rst | 836 |
1 files changed, 463 insertions, 373 deletions
diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst index 3c26da5..b83b359 100644 --- a/docs/vfio-user.rst +++ b/docs/vfio-user.rst @@ -1,11 +1,10 @@ .. include:: <isonum.txt> - ******************************** vfio-user Protocol Specification ******************************** -------------- -Version_ 0.9.1 +Version_ 0.9.2 -------------- .. contents:: Table of Contents @@ -342,9 +341,9 @@ usual ``msg_size`` field in the header, not the ``argsz`` field. In a reply, the server sets ``argsz`` field to the size needed for a full payload size. This may be less than the requested maximum size. This may be -larger than the requested maximum size: in that case, the payload reply header -is returned, but the ``argsz`` field in the reply indicates the needed size, -allowing a client to allocate a larger buffer for holding the reply before +larger than the requested maximum size: in that case, the full payload is not +included in the reply, but the ``argsz`` field in the reply indicates the needed +size, allowing a client to allocate a larger buffer for holding the reply before trying again. In addition, during negotiation (see `Version`_), the client and server may @@ -357,8 +356,9 @@ Protocol Specification To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the -little-endian format, although this may be relaxed in future revisions in cases -where the client and server are both big-endian. +endianness of the host system, although this may be relaxed in future +revisions in cases where the client and server run on different hosts +with different endianness. Unless otherwise specified, all sizes should be presumed to be in bytes. 
@@ -385,7 +385,10 @@ Name Command Request Direction ``VFIO_USER_DMA_READ`` 11 server -> client ``VFIO_USER_DMA_WRITE`` 12 server -> client ``VFIO_USER_DEVICE_RESET`` 13 client -> server -``VFIO_USER_DIRTY_PAGES`` 14 client -> server +``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server +``VFIO_USER_DEVICE_FEATURE`` 16 client -> server +``VFIO_USER_MIG_DATA_READ`` 17 client -> server +``VFIO_USER_MIG_DATA_WRITE`` 18 client -> server ====================================== ========= ================= Header @@ -508,34 +511,33 @@ format: Capabilities: -+--------------------+--------+------------------------------------------------+ -| Name | Type | Description | -+====================+========+================================================+ -| max_msg_fds | number | Maximum number of file descriptors that can be | -| | | received by the sender in one message. | -| | | Optional. If not specified then the receiver | -| | | must assume a value of ``1``. | -+--------------------+--------+------------------------------------------------+ -| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | -| | | see `Read and Write Operations`_. Optional, | -| | | with a default value of 1048576 bytes. | -+--------------------+--------+------------------------------------------------+ -| migration | object | Migration capability parameters. If missing | -| | | then migration is not supported by the sender. | -+--------------------+--------+------------------------------------------------+ -| twin_socket | object | Parameters for twin-socket mode, which handles | -| | | server-to-client commands and their replies on | -| | | a separate socket. Optional. 
| -+--------------------+--------+------------------------------------------------+ - -The migration capability contains the following name/value pairs: - -+--------+--------+-----------------------------------------------+ -| Name | Type | Description | -+========+========+===============================================+ -| pgsize | number | Page size of dirty pages bitmap. The smallest | -| | | between the client and the server is used. | -+--------+--------+-----------------------------------------------+ ++--------------------+---------+-----------------------------------------------+ +| Name | Type | Description | ++====================+=========+===============================================+ +| max_msg_fds | number | Maximum number of file descriptors that can | +| | | be received by the sender in one message. | +| | | Optional. If not specified then the receiver | +| | | must assume a value of ``1``. | ++--------------------+---------+-----------------------------------------------+ +| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | +| | | see `Read and Write Operations`_. Optional, | +| | | with a default value of 1048576 bytes. | ++--------------------+---------+-----------------------------------------------+ +| max_dma_maps | number | Maximum number DMA map windows that can be | +| | | valid simultaneously. Optional, with a | +| | | value of 65535 (64k-1). | ++--------------------+---------+-----------------------------------------------+ +| pgsizes | number | Page sizes supported in DMA map operations | +| | | or'ed together. Optional, with a default | +| | | value of supporting only 4k pages. | ++--------------------+---------+-----------------------------------------------+ +| twin_socket | object | Parameters for twin-socket mode, which | +| | | handles server-to-client commands and their | +| | | replies on a separate socket. Optional. 
| ++--------------------+---------+-----------------------------------------------+ +| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages | +| | | are supported if the value is ``true``. | ++--------------------+---------+-----------------------------------------------+ The ``twin_socket`` capability object holds these name/value pairs: @@ -678,56 +680,18 @@ The request payload for this message is a structure of the following format: +--------------+--------+------------------------+ | flags | 4 | 4 | +--------------+--------+------------------------+ -| | +-----+-----------------------+ | -| | | Bit | Definition | | -| | +=====+=======================+ | -| | | 0 | get dirty page bitmap | | -| | +-----+-----------------------+ | -| | | 1 | unmap all regions | | -| | +-----+-----------------------+ | -+--------------+--------+------------------------+ | address | 8 | 8 | +--------------+--------+------------------------+ | size | 16 | 8 | +--------------+--------+------------------------+ * *argsz* is the maximum size of the reply payload. -* *flags* contains the following DMA region attributes: - - * *get dirty page bitmap* indicates that a dirty page bitmap must be - populated before unmapping the DMA region. The client must provide a - `VFIO Bitmap`_ structure, explained below, immediately following this - entry. - * *unmap all regions* indicates to unmap all the regions previously - mapped via `VFIO_USER_DMA_MAP`. This flag cannot be combined with - *get dirty page bitmap* and expects *address* and *size* to be 0. - +* *flags* is unused in this version. * *address* is the base DMA address of the DMA region. * *size* is the size of the DMA region. The address and size of the DMA region being unmapped must match exactly a -previous mapping. The size of request message depends on whether or not the -*get dirty page bitmap* bit is set in Flags: - -* If not set, the size of the total request message is: 16 + 24. 
- -* If set, the size of the total request message is: 16 + 24 + 16. - -.. _VFIO Bitmap: - -VFIO Bitmap Format -"""""""""""""""""" - -+--------+--------+------+ -| Name | Offset | Size | -+========+========+======+ -| pgsize | 0 | 8 | -+--------+--------+------+ -| size | 8 | 8 | -+--------+--------+------+ - -* *pgsize* is the page size for the bitmap, in bytes. -* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header. +previous mapping. Reply ^^^^^ @@ -736,14 +700,8 @@ Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is mapped then the server must release all references to that DMA region before replying, which potentially includes in-flight DMA transactions. -The server responds with the original DMA entry in the request. If the -*get dirty page bitmap* bit is set in flags in the request, then -the server also includes the `VFIO Bitmap`_ structure sent in the request, -followed by the corresponding dirty page bitmap, where each bit represents -one page of size *pgsize* in `VFIO Bitmap`_ . +The server responds with the original DMA entry in the request. -The total size of the total reply message is: -16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set). ``VFIO_USER_DEVICE_GET_INFO`` ----------------------------- @@ -959,7 +917,7 @@ VFIO region info cap sparse mmap +----------+--------+------+ | offset | 8 | 8 | +----------+--------+------+ -| size | 16 | 9 | +| size | 16 | 8 | +----------+--------+------+ | ... | | | +----------+--------+------+ @@ -973,39 +931,6 @@ VFIO region info cap sparse mmap The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct vfio_region_info_cap_sparse_mmap``). 
-VFIO region type cap header -""""""""""""""""""""""""""" - -+------------------+---------------------------+ -| Name | Value | -+==================+===========================+ -| id | VFIO_REGION_INFO_CAP_TYPE | -+------------------+---------------------------+ -| version | 0x1 | -+------------------+---------------------------+ -| next | <next> | -+------------------+---------------------------+ -| region info type | VFIO region info type | -+------------------+---------------------------+ - -This capability is defined when a region is specific to the device. - -VFIO region info type cap -""""""""""""""""""""""""" - -The VFIO region info type is defined in ``<linux/vfio.h>`` -(``struct vfio_region_info_cap_type``). - -+---------+--------+------+ -| Name | Offset | Size | -+=========+========+======+ -| type | 0 | 4 | -+---------+--------+------+ -| subtype | 4 | 4 | -+---------+--------+------+ - -The only device-specific region type and subtype supported by vfio-user is -``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1). ``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` -------------------------------------- @@ -1071,7 +996,7 @@ Reply * *argsz* is the size of the region IO FD info structure plus the total size of the sub-region array. Thus, each array entry "i" is at offset - i * ((argsz - 16) / count). Note that currently this is 40 bytes for both IO + i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO FD types, but this is not to be relied on. As elsewhere, this indicates the full reply payload size needed. * *flags* must be zero @@ -1087,8 +1012,8 @@ Note that it is the client's responsibility to verify the requested values (for example, that the requested offset does not exceed the region's bounds). 
Each sub-region given in the response has one of two possible structures, -depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` (0) or -``VFIO_USER_IO_FD_TYPE_IOREGIONFD`` (1): +depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or +``VFIO_USER_IO_FD_TYPE_IOREGIONFD``: Sub-Region IO FD info format (ioeventfd) """""""""""""""""""""""""""""""""""""""" @@ -1552,290 +1477,455 @@ Reply This command message is sent from the client to the server to reset the device. Neither the request or reply have a payload. -``VFIO_USER_DIRTY_PAGES`` -------------------------- +``VFIO_USER_REGION_WRITE_MULTI`` +-------------------------------- + +This message can be used to coalesce multiple device write operations +into a single message. It is only used as an optimization when the +outgoing message queue is relatively full. + +Request +^^^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| wr_cnt | 0 | 8 | ++---------+--------+----------+ +| wrs | 8 | variable | ++---------+--------+----------+ -This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``. It is sent by the client -to the server in order to control logging of dirty pages, usually during a live -migration. +* *wr_cnt* is the number of device writes coalesced in the message +* *wrs* is an array of device writes defined below -Dirty page tracking is optional for server implementation; clients should not -rely on it. +Single Device Write Format +"""""""""""""""""""""""""" + ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | 8 | ++--------+--------+----------+ + +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred.
This format can + only describe writes of 8 bytes or less. +* *data* is the data to write. + +Reply +^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| wr_cnt | 0 | 8 | ++---------+--------+----------+ + +* *wr_cnt* is the number of device writes completed. + +``VFIO_USER_DEVICE_FEATURE`` +---------------------------- + +This command is analogous to ``VFIO_DEVICE_FEATURE``. It is used to get, set, or +probe feature data of the device. Request ^^^^^^^ -+-------+--------+-----------------------------------------+ -| Name | Offset | Size | -+=======+========+=========================================+ -| argsz | 0 | 4 | -+-------+--------+-----------------------------------------+ -| flags | 4 | 4 | -+-------+--------+-----------------------------------------+ -| | +-----+----------------------------------------+ | -| | | Bit | Definition | | -| | +=====+========================================+ | -| | | 0 | VFIO_IOMMU_DIRTY_PAGES_FLAG_START | | -| | +-----+----------------------------------------+ | -| | | 1 | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP | | -| | +-----+----------------------------------------+ | -| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | | -| | +-----+----------------------------------------+ | -+-------+--------+-----------------------------------------+ - -* *argsz* is the size of the VFIO dirty bitmap info structure for - ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply payload - -* *flags* defines the action to be performed by the server: - - * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start logging - pages it dirties. Logging continues until explicitly disabled by - ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``. - - * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logging - dirty pages. - - * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to return - the dirty bitmap for a specific IOVA range. 
The IOVA range is specified by - a "VFIO Bitmap Range" structure, which must immediately follow this - "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_. - This operation is only valid if logging of dirty pages has been previously - started. - - These flags are mutually exclusive with each other. - -This part of the request is analogous to VFIO's ``struct -vfio_iommu_type1_dirty_bitmap``. - -.. _VFIO Bitmap Range Format: - -VFIO Bitmap Range Format +The request payload for this message is a structure of the following format. + ++-------+--------+--------------------------------+ +| Name | Offset | Size | ++=======+========+================================+ +| argsz | 0 | 4 | ++-------+--------+--------------------------------+ +| flags | 4 | 4 | ++-------+--------+--------------------------------+ +| | +---------+---------------------------+ | +| | | Bit | Definition | | +| | +=========+===========================+ | +| | | 0 to 15 | Feature index | | +| | +---------+---------------------------+ | +| | | 16 | VFIO_DEVICE_FEATURE_GET | | +| | +---------+---------------------------+ | +| | | 17 | VFIO_DEVICE_FEATURE_SET | | +| | +---------+---------------------------+ | +| | | 18 | VFIO_DEVICE_FEATURE_PROBE | | +| | +---------+---------------------------+ | ++-------+--------+--------------------------------+ +| data | 8 | variable | ++-------+--------+--------------------------------+ + +* *argsz* is the maximum size of the reply payload. + +* *flags* defines the action to be performed by the server and upon which + feature: + + * The feature index consists of the least significant 16 bits of the flags + field, and can be accessed using the ``VFIO_DEVICE_FEATURE_MASK`` bit mask. + + * ``VFIO_DEVICE_FEATURE_GET`` instructs the server to get the data for the + given feature. + + * ``VFIO_DEVICE_FEATURE_SET`` instructs the server to set the feature data to + that given in the ``data`` field of the payload. 
+ + * ``VFIO_DEVICE_FEATURE_PROBE`` instructs the server to probe for feature + support. If ``VFIO_DEVICE_FEATURE_GET`` and/or ``VFIO_DEVICE_FEATURE_SET`` + are also set, the probe will only return success if all of the indicated + methods are supported. + + ``VFIO_DEVICE_FEATURE_GET`` and ``VFIO_DEVICE_FEATURE_SET`` are mutually + exclusive, except for use with ``VFIO_DEVICE_FEATURE_PROBE``. + +* *data* is specific to the particular feature. It is not used for probing. + +This part of the request is analogous to VFIO's ``struct vfio_device_feature``. + +Reply +^^^^^ + +The reply payload must be the same as the request payload for setting or +probing a feature. For getting a feature's data, the data is added in the data +section and its length is added to ``argsz``. + +Device Features +^^^^^^^^^^^^^^^ + +The only device features supported by vfio-user are those related to migration, +although this may change in the future. They are a subset of those supported in +the VFIO implementation of the Linux kernel. + ++----------------------------------------+---------------+ +| Name | Feature Index | ++========================================+===============+ +| VFIO_DEVICE_FEATURE_MIGRATION | 1 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE | 2 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_START | 6 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP | 7 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT | 8 | ++----------------------------------------+---------------+ + +``VFIO_DEVICE_FEATURE_MIGRATION`` +""""""""""""""""""""""""""""""""" + +This feature indicates that the device can support the migration API through +``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``. If ``GET`` succeeds, the ``RUNNING`` +and ``ERROR`` states are always supported. 
Support for additional states is +indicated via the flags field; at least ``VFIO_MIGRATION_STOP_COPY`` must be +set. + +There is no data field of the request message. + +The data field of the reply message is structured as follows: + ++-------+--------+---------------------------+ +| Name | Offset | Size | ++=======+========+===========================+ +| flags | 0 | 8 | ++-------+--------+---------------------------+ +| | +-----+--------------------------+ | +| | | Bit | Definition | | +| | +=====+==========================+ | +| | | 0 | VFIO_MIGRATION_STOP_COPY | | +| | +-----+--------------------------+ | +| | | 1 | VFIO_MIGRATION_P2P | | +| | +-----+--------------------------+ | +| | | 2 | VFIO_MIGRATION_PRE_COPY | | +| | +-----+--------------------------+ | ++-------+--------+---------------------------+ + +These flags are interpreted in the same way as VFIO. + +``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE`` +"""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_SET``, execute a migration state change on the VFIO +device. The new state is supplied in ``device_state``. The state transition must +fully complete before the reply is sent. + +The data field of the reply message, as well as the ``SET`` request message, is +structured as follows: + ++--------------+--------+------+ +| Name | Offset | Size | ++==============+========+======+ +| device_state | 0 | 4 | ++--------------+--------+------+ +| data_fd | 4 | 4 | ++--------------+--------+------+ + +* *device_state* is the current state of the device (for ``GET``) or the + state to transition to (for ``SET``). It is defined by the + ``vfio_device_mig_state`` enum as detailed below. These states are the states + of the device migration Finite State Machine. 
+ ++--------------------------------+-------+---------------------------------------------------------------------+ +| Name | State | Description | ++================================+=======+=====================================================================+ +| VFIO_DEVICE_STATE_ERROR | 0 | The device has failed and must be reset. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_STOP | 1 | The device does not change the internal or external state. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RUNNING | 2 | The device is running normally. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_STOP_COPY | 3 | The device internal state can be read out. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RESUMING | 4 | The device is stopped and is loading a new internal state. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RUNNING_P2P | 5 | (not used in vfio-user) | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_PRE_COPY | 6 | The device is running normally but tracking internal state changes. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_PRE_COPY_P2P | 7 | (not used in vfio-user) | ++--------------------------------+-------+---------------------------------------------------------------------+ + +* *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and + ``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data + transport. 
+ +Direct State Transitions """""""""""""""""""""""" +The device migration FSM is a Mealy machine, so actions are taken upon the arcs +between FSM states. The following transitions need to be supported by the +server, a subset of those defined in ``<linux/vfio.h>`` +(``enum vfio_device_mig_state``). + +* ``RUNNING -> STOP``, ``STOP_COPY -> STOP``: Stop the operation of the device. + The ``STOP_COPY`` arc terminates the data transfer session. + +* ``RESUMING -> STOP``: Terminate the data transfer session. Complete processing + of the migration data. Stop the operation of the device. If the delivered data + is found to be incomplete, inconsistent, or otherwise invalid, fail the + ``SET`` command and optionally transition to the ``ERROR`` state. + +* ``PRE_COPY -> RUNNING``: Terminate the data transfer session. The device is + now fully operational. + +* ``STOP -> RUNNING``: Start the operation of the device. + +* ``RUNNING -> PRE_COPY``, ``STOP -> STOP_COPY``: Begin the process of saving + the device state. The device operation is unchanged, but data transfer begins. + ``PRE_COPY`` and ``STOP_COPY`` are referred to as the "saving group" of + states. + +* ``PRE_COPY -> STOP_COPY``: Continue to transfer migration data, but stop + device operation. + +* ``STOP -> RESUMING``: Start the process of restoring the device state. The + internal device state may be changed to prepare the device to receive the + migration data. + +The ``STOP_COPY -> PRE_COPY`` transition is explicitly not allowed and should +return an error if requested. + +``ERROR`` cannot be specified as a device state, but any transition request can +be failed and then move the state into ``ERROR`` if the server was unable to +execute the requested arc AND was unable to restore the device into any valid +state. To recover from ``ERROR``, ``VFIO_USER_DEVICE_RESET`` must be used to +return back to ``RUNNING``. + +If ``PRE_COPY`` is not supported, arcs touching it are removed. 
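The direct arcs listed above can be captured in a small lookup table. The following C sketch uses hypothetical names; only the arcs and state values come from the text, and it covers direct arcs only, not the multi-step paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State values as given in the state table; names are hypothetical. */
enum vfu_state {
    VFU_ERROR = 0, VFU_STOP = 1, VFU_RUNNING = 2,
    VFU_STOP_COPY = 3, VFU_RESUMING = 4, VFU_PRE_COPY = 6
};

struct vfu_arc { enum vfu_state from, to; };

/* The nine direct arcs enumerated in the list above. */
static const struct vfu_arc vfu_direct_arcs[] = {
    { VFU_RUNNING,   VFU_STOP      },
    { VFU_STOP_COPY, VFU_STOP      },
    { VFU_RESUMING,  VFU_STOP      },
    { VFU_PRE_COPY,  VFU_RUNNING   },
    { VFU_STOP,      VFU_RUNNING   },
    { VFU_RUNNING,   VFU_PRE_COPY  },
    { VFU_STOP,      VFU_STOP_COPY },
    { VFU_PRE_COPY,  VFU_STOP_COPY },
    { VFU_STOP,      VFU_RESUMING  },
};

/* True if from -> to is a direct arc. When PRE_COPY is unsupported,
 * every arc touching it is treated as removed. */
bool vfu_is_direct_arc(enum vfu_state from, enum vfu_state to,
                       bool pre_copy_supported)
{
    if (!pre_copy_supported &&
        (from == VFU_PRE_COPY || to == VFU_PRE_COPY))
        return false;

    size_t n = sizeof(vfu_direct_arcs) / sizeof(vfu_direct_arcs[0]);
    for (size_t i = 0; i < n; i++) {
        if (vfu_direct_arcs[i].from == from && vfu_direct_arcs[i].to == to)
            return true;
    }
    return false;
}
```

``ERROR`` never appears as a target in the table, which matches the rule that it cannot be requested; ``STOP_COPY -> PRE_COPY`` is likewise absent and so is rejected.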
+ +Complex State Transitions +""""""""""""""""""""""""" + +The remaining possible transitions are to be implemented as combinations of the +above FSM arcs. As there are multiple paths, the path should be selected based +on the following rules: + +* Select the shortest path. + +* The path cannot have saving group states as interior arcs, only start/end + states. + +``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` / ``VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP`` +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_SET``, start/stop DMA logging. These features can +also be probed to determine whether the device supports DMA logging. + +When DMA logging is started, a range of IOVAs to monitor is provided and the +device can optimize its logging to cover only the IOVA range given. Only DMA +writes are logged. + +The data field of the ``SET`` request is structured as follows: + ++------------+--------+----------+ +| Name | Offset | Size | ++============+========+==========+ +| page_size | 0 | 8 | ++------------+--------+----------+ +| num_ranges | 8 | 4 | ++------------+--------+----------+ +| reserved | 12 | 4 | ++------------+--------+----------+ +| ranges | 16 | variable | ++------------+--------+----------+ + +* *page_size* hints what tracking granularity the device should try to achieve. + If the device cannot do the hinted page size then it's the driver's choice + which page size to pick based on its support. On output the device will return + the page size it selected. + +* *num_ranges* is the number of IOVA ranges to monitor. A value of zero + indicates that all writes should be logged. 
+ +* *ranges* is an array of ``vfio_user_device_feature_dma_logging_range`` + entries: + +--------+--------+------+ | Name | Offset | Size | +========+========+======+ | iova | 0 | 8 | +--------+--------+------+ -| size | 8 | 8 | -+--------+--------+------+ -| bitmap | 16 | 24 | +| length | 8 | 8 | +--------+--------+------+ -* *iova* is the IOVA offset + * *iova* is the base IO virtual address + * *length* is the length of the range to log + +Upon success, the response data field will be the same as the request, unless +the page size was changed, in which case this will be reflected in the response. + +``VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT`` +"""""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_GET``, returns the dirty bitmap for a specific IOVA +range. This operation is only valid if logging of dirty pages has been +previously started by setting ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START``. + +The data field of the request is structured as follows: + ++-----------+--------+------+ +| Name | Offset | Size | ++===========+========+======+ +| iova | 0 | 8 | ++-----------+--------+------+ +| length | 8 | 8 | ++-----------+--------+------+ +| page_size | 16 | 8 | ++-----------+--------+------+ + +* *iova* is the base IO virtual address + +* *length* is the length of the range + +* *page_size* is the unit of granularity of the bitmap, and must be a power of + two. It doesn't have to match the value given to + ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` because the driver will format its + internal logging to match the reporting page size possibly by replicating bits + if the internal page size is lower than requested + +The data field of the response is identical, except with the bitmap added on +the end at offset 24. + +The bitmap is an array of u64s that holds the output bitmap, with 1 bit +reporting a *page_size* unit of IOVA. The bits outside of the requested range +must be zero. 
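Since the bitmap is an array of u64s with one bit per *page_size* unit, a client could size its reply buffer along these lines. This C sketch uses a hypothetical function name and assumes that a partial trailing page still needs its own bit, which the text does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 64-bit words needed to report `length` bytes of IOVA at
 * `page_size` granularity. Returns 0 if page_size is not a power of
 * two, which the text requires. Rounding a partial trailing page up
 * to a full bit is an assumption of this sketch. */
uint64_t vfu_dirty_bitmap_words(uint64_t length, uint64_t page_size)
{
    if (page_size == 0 || (page_size & (page_size - 1)) != 0)
        return 0;

    uint64_t pages = length / page_size + (length % page_size != 0);
    return pages / 64 + (pages % 64 != 0);
}
```

For a 1 MiB range at 4 KiB granularity this gives 256 bits, that is 4 words or 32 bytes of bitmap after the 24-byte header.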
+ +The mapping of IOVA to bits is given by: + +``bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))`` + +``VFIO_USER_MIG_DATA_READ`` +--------------------------- + +This command is used to read data from the source migration server while it is +in a saving group state (``PRE_COPY`` or ``STOP_COPY``). + +This command, and ``VFIO_USER_MIG_DATA_WRITE``, are used in place of the +``data_fd`` file descriptor in ``<linux/vfio.h>`` +(``struct vfio_device_feature_mig_state``) to enable all data transport to use +the single already-established UNIX socket. Hence, the migration data is +treated like a stream, so the client must continue reading until no more +migration data remains. + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. -* *size* is the size of the IOVA region ++-------+--------+------+ +| Name | Offset | Size | ++=======+========+======+ +| argsz | 0 | 4 | ++-------+--------+------+ +| size | 4 | 4 | ++-------+--------+------+ -* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_. +* *argsz* is the maximum size of the reply payload. -This part of the request is analogous to VFIO's ``struct -vfio_iommu_type1_dirty_bitmap_get``. +* *size* is the size of the migration data to read. Reply ^^^^^ -For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or -``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload. 
-
-For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as follows:
-
-+--------------+--------+-----------------------------------------+
-| Name         | Offset | Size                                    |
-+==============+========+=========================================+
-| argsz        | 0      | 4                                       |
-+--------------+--------+-----------------------------------------+
-| flags        | 4      | 4                                       |
-+--------------+--------+-----------------------------------------+
-|              | +-----+----------------------------------------+ |
-|              | | Bit | Definition                             | |
-|              | +=====+========================================+ |
-|              | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
-|              | +-----+----------------------------------------+ |
-+--------------+--------+-----------------------------------------+
-| bitmap range | 8      | 40                                      |
-+--------------+--------+-----------------------------------------+
-| bitmap       | 48     | variable                                |
-+--------------+--------+-----------------------------------------+
-
-* *argsz* is the size required for the full reply payload (dirty pages structure
-  + bitmap range structure + actual bitmap)
-* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``
-* *bitmap range* is the same bitmap range struct provided in the request, as
-  defined in `VFIO Bitmap Range Format`_.
-* *bitmap* is the actual dirty pages bitmap corresponding to the range request
-
-VFIO Device Migration Info
---------------------------
+The reply payload for this message is a structure of the following format.
+
++-------+--------+----------+
+| Name  | Offset | Size     |
++=======+========+==========+
+| argsz | 0      | 4        |
++-------+--------+----------+
+| size  | 4      | 4        |
++-------+--------+----------+
+| data  | 8      | variable |
++-------+--------+----------+

-A device may contain a migration region (of type
-``VFIO_REGION_TYPE_MIGRATION``). The beginning of the region must contain
-``struct vfio_device_migration_info``, defined in ``<linux/vfio.h>``. This
-subregion is accessed like any other part of a standard vfio-user region
-using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``.
-
-+---------------+--------+--------------------------------+
-| Name          | Offset | Size                           |
-+===============+========+================================+
-| device_state  | 0      | 4                              |
-+---------------+--------+--------------------------------+
-|               | +-----+-------------------------------+ |
-|               | | Bit | Definition                    | |
-|               | +=====+===============================+ |
-|               | | 0   | VFIO_DEVICE_STATE_V1_RUNNING  | |
-|               | +-----+-------------------------------+ |
-|               | | 1   | VFIO_DEVICE_STATE_V1_SAVING   | |
-|               | +-----+-------------------------------+ |
-|               | | 2   | VFIO_DEVICE_STATE_V1_RESUMING | |
-|               | +-----+-------------------------------+ |
-+---------------+--------+--------------------------------+
-| reserved      | 4      | 4                              |
-+---------------+--------+--------------------------------+
-| pending_bytes | 8      | 8                              |
-+---------------+--------+--------------------------------+
-| data_offset   | 16     | 8                              |
-+---------------+--------+--------------------------------+
-| data_size     | 24     | 8                              |
-+---------------+--------+--------------------------------+
-
-* *device_state* defines the state of the device:
-
-  The client initiates device state transition by writing the intended state.
-  The server must respond only after it has successfully transitioned to the new
-  state. If an error occurs then the server must respond to the
-  ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingly and
-  must remain at the previous state, or in case of internal error it must
-  transition to the error state, defined as
-  ``VFIO_DEVICE_STATE_V1_RESUMING | VFIO_DEVICE_STATE_V1_SAVING``. The client
-  must re-read the device state in order to determine it afresh.
-
-  The following device states are defined:
-
-  +-----------+---------+----------+-----------------------------------+
-  | _RESUMING | _SAVING | _RUNNING | Description                       |
-  +===========+=========+==========+===================================+
-  | 0         | 0       | 0        | Device is stopped.                |
-  +-----------+---------+----------+-----------------------------------+
-  | 0         | 0       | 1        | Device is running, default state. |
-  +-----------+---------+----------+-----------------------------------+
-  | 0         | 1       | 0        | Stop-and-copy state               |
-  +-----------+---------+----------+-----------------------------------+
-  | 0         | 1       | 1        | Pre-copy state                    |
-  +-----------+---------+----------+-----------------------------------+
-  | 1         | 0       | 0        | Resuming                          |
-  +-----------+---------+----------+-----------------------------------+
-  | 1         | 0       | 1        | Invalid state                     |
-  +-----------+---------+----------+-----------------------------------+
-  | 1         | 1       | 0        | Error state                       |
-  +-----------+---------+----------+-----------------------------------+
-  | 1         | 1       | 1        | Invalid state                     |
-  +-----------+---------+----------+-----------------------------------+
-
-  Valid state transitions are shown in the following table:
-
-  +-------------------------+---------+---------+---------------+----------+----------+
-  | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming |
-  +=========================+=========+=========+===============+==========+==========+
-  | Stopped                 | \-      | 1       | 0             | 0        | 0        |
-  +-------------------------+---------+---------+---------------+----------+----------+
-  | Running                 | 1       | \-      | 1             | 1        | 1        |
-  +-------------------------+---------+---------+---------------+----------+----------+
-  | Stop-and-copy           | 1       | 1       | \-            | 0        | 0        |
-  +-------------------------+---------+---------+---------------+----------+----------+
-  | Pre-copy                | 0       | 0       | 1             | \-       | 0        |
-  +-------------------------+---------+---------+---------------+----------+----------+
-  | Resuming                | 0       | 1       | 0             | 0        | \-       |
-  +-------------------------+---------+---------+---------------+----------+----------+
-
-  A device is migrated to the destination as follows:
-
-  * The source client transitions the device state from the running state to
-    the pre-copy state. This transition is optional for the client but must be
-    supported by the server. The source server starts sending device state data
-    to the source client through the migration region while the device is
-    running.
-
-  * The source client transitions the device state from the running state or the
-    pre-copy state to the stop-and-copy state. The source server stops the
-    device, saves device state and sends it to the source client through the
-    migration region.
-
-  The source client is responsible for sending the migration data to the
-  destination client.
-
-  A device is resumed on the destination as follows:
-
-  * The destination client transitions the device state from the running state
-    to the resuming state. The destination server uses the device state data
-    received through the migration region to resume the device.
-
-  * The destination client provides saved device state to the destination
-    server and then transitions the device to back to the running state.
-
-* *reserved* This field is reserved and any access to it must be ignored by the
-  server.
-
-* *pending_bytes* Remaining bytes to be migrated by the server. This field is
-  read only.
-
-* *data_offset* Offset in the migration region where the client must:
-
-  * read from, during the pre-copy or stop-and-copy state, or
-
-  * write to, during the resuming state.
-
-  This field is read only.
-
-* *data_size* Contains the size, in bytes, of the amount of data copied to:
-
-  * the source migration region by the source server during the pre-copy or
-    stop-and copy state, or
-
-  * the destination migration region by the destination client during the
-    resuming state.
-
-Device-specific data must be stored at any position after
-``struct vfio_device_migration_info``. Note that the migration region can be
-memory mappable, even partially. In practise, only the migration data portion
-can be memory mapped.
-
-The client processes device state data during the pre-copy and the
-stop-and-copy state in the following iterative manner:
-
-  1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads
-     of this field is an idempotent operation. If there are no migration data
-     to be consumed then the next step depends on the current device state:
-
-     * pre-copy: the client must try again.
+* *argsz* is the size of the above structure, including the size of the data.

-     * stop-and-copy: this procedure can end and the device can now start
-       resuming on the destination.
+* *size* indicates the size of returned migration data. If this is less than the
+  requested size, there is no more migration data to read.

-  2. The client reads ``data_offset``; at this point the server must make
-     available a portion of migration data at this offset to be read by the
-     client, which must happen *before* completing the read operation. The
-     amount of data to be read must be stored in the ``data_size`` field, which
-     the client reads next.
+* *data* contains the migration data.

-  3. The client reads ``data_size`` to determine the amount of migration data
-     available.
+``VFIO_USER_MIG_DATA_WRITE``
+----------------------------

-  4. The client reads and processes the migration data.
+This command is used to write data to the destination migration server while it
+is in the ``RESUMING`` state.

-  5. Go to step 1.
+As above, this replaces the ``data_fd`` file descriptor for transport of
+migration data, and as such, the migration data is treated like a stream.

-Note that the client can transition the device from the pre-copy state to the
-stop-and-copy state at any time; ``pending_bytes`` does not need to become zero.
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format.
+
++-------+--------+----------+
+| Name  | Offset | Size     |
++=======+========+==========+
+| argsz | 0      | 4        |
++-------+--------+----------+
+| size  | 4      | 4        |
++-------+--------+----------+
+| data  | 8      | variable |
++-------+--------+----------+
+
+* *argsz* is the maximum size of the reply payload.
+
+* *size* is the size of the migration data to be written.
+
+* *data* contains the migration data.

-The client initializes the device state on the destination by setting the
-device state in the resuming state and writing the migration data to the
-destination migration region at ``data_offset`` offset. The client can write the
-source migration data in an iterative manner and the server must consume this
-data before completing each write operation, updating the ``data_offset`` field.
-The server must apply the source migration data on the device resume state. The
-client must write data on the same order and transaction size as read.
+Reply
+^^^^^

-If an error occurs then the server must fail the read or write operation. It is
-an implementation detail of the client how to handle errors.
+There is no reply payload for this message.

 Appendices
 ==========