diff options
Diffstat (limited to 'docs/interop/vhost-user.rst')
-rw-r--r-- | docs/interop/vhost-user.rst | 301 |
1 files changed, 281 insertions, 20 deletions
diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index 768fb5c..9f1103f 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -108,6 +108,43 @@ A vring state description :num: a 32-bit number +A vring descriptor index for split virtqueues +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------------+---------------------+ +| vring index | index in avail ring | ++-------------+---------------------+ + +:vring index: 32-bit index of the respective virtqueue + +:index in avail ring: 32-bit value, of which currently only the lower 16 + bits are used: + + - Bits 0–15: Index of the next *Available Ring* descriptor that the + back-end will process. This is a free-running index that is not + wrapped by the ring size. + - Bits 16–31: Reserved (set to zero) + +Vring descriptor indices for packed virtqueues +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------------+--------------------+ +| vring index | descriptor indices | ++-------------+--------------------+ + +:vring index: 32-bit index of the respective virtqueue + +:descriptor indices: 32-bit value: + + - Bits 0–14: Index of the next *Available Ring* descriptor that the + back-end will process. This is a free-running index that is not + wrapped by the ring size. + - Bit 15: Driver (Available) Ring Wrap Counter + - Bits 16–30: Index of the entry in the *Used Ring* where the back-end + will place the next descriptor. This is a free-running index that + is not wrapped by the ring size. + - Bit 31: Device (Used) Ring Wrap Counter + A vring address description ^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -285,6 +322,32 @@ VhostUserShared :UUID: 16 bytes UUID, whose first three components (a 32-bit value, then two 16-bit values) are stored in big endian. +Device state transfer parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++--------------------+-----------------+ +| transfer direction | migration phase | ++--------------------+-----------------+ + +:transfer direction: a 32-bit enum, describing the direction in which + the state is transferred: + + - 0: Save: Transfer the state from the back-end to the front-end, + which happens on the source side of migration + - 1: Load: Transfer the state from the front-end to the back-end, + which happens on the destination side of migration + +:migration phase: a 32-bit enum, describing the state in which the VM + guest and devices are: + + - 0: Stopped (in the period after the transfer of memory-mapped + regions before switch-over to the destination): The VM guest is + stopped, and the vhost-user device is suspended (see + :ref:`Suspended device state <suspended_device_state>`). + + In the future, additional phases might be added e.g. to allow + iterative migration while the device is running. + C structure ----------- @@ -344,6 +407,7 @@ in the ancillary data: * ``VHOST_USER_SET_VRING_ERR`` * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``) * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) +* ``VHOST_USER_SET_DEVICE_STATE_FD`` If *front-end* is unable to send the full message or receives a wrong reply it will close the connection. An optional reconnection mechanism @@ -374,35 +438,50 @@ negotiation. Ring states ----------- -Rings can be in one of three states: +Rings have two independent states: started/stopped, and enabled/disabled. -* stopped: the back-end must not process the ring at all. +* While a ring is stopped, the back-end must not process the ring at + all, regardless of whether it is enabled or disabled. The + enabled/disabled state should still be tracked, though, so it can come + into effect once the ring is started. -* started but disabled: the back-end must process the ring without +* started and disabled: The back-end must process the ring without causing any side effects. For example, for a networking device, in the disabled state the back-end must not supply any new RX packets, but must process and discard any TX packets. -* started and enabled. +* started and enabled: The back-end must process the ring normally, i.e. + process all requests and execute them. -Each ring is initialized in a stopped state. The back-end must start -ring upon receiving a kick (that is, detecting that file descriptor is -readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK`` -or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated, -and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``. +Each ring is initialized in a stopped and disabled state. The back-end +must start a ring upon receiving a kick (that is, detecting that file +descriptor is readable) on the descriptor specified by +``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message +``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving +``VHOST_USER_GET_VRING_BASE``. Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``. -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the -ring starts directly in the enabled state. - -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is -initialized in a disabled state and is enabled by -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1. +In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from +the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the +back-end must enable all rings immediately. While processing the rings (whether they are enabled or not), the back-end must support changing some configuration aspects on the fly. +.. _suspended_device_state: + +Suspended device state +^^^^^^^^^^^^^^^^^^^^^^ + +While all vrings are stopped, the device is *suspended*. In addition to +not processing any vring (because they are stopped), the device must: + +* not write to any guest memory regions, +* not send any notifications to the guest, +* not send any messages to the front-end, +* still process and reply to messages from the front-end. + Multiple queue support ---------------------- @@ -490,7 +569,8 @@ ancillary data, it may be used to inform the front-end that the log has been modified. Once the source has finished migration, rings will be stopped by the -source. No further update must be done before rings are restarted. +source (:ref:`Suspended device state <suspended_device_state>`). No +further update must be done before rings are restarted. In postcopy migration the back-end is started before all the memory has been received from the source host, and care must be taken to avoid @@ -502,6 +582,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled back-end. The front-end indicates support for this via the ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature. +.. _migrating_backend_state: + +Migrating back-end state +^^^^^^^^^^^^^^^^^^^^^^^^ + +Migrating device state involves transferring the state from one +back-end, called the source, to another back-end, called the +destination. After migration, the destination transparently resumes +operation without requiring the driver to re-initialize the device at +the VIRTIO level. If the migration fails, then the source can +transparently resume operation until another migration attempt is made. + +Generally, the front-end is connected to a virtual machine guest (which +contains the driver), which has its own state to transfer between source +and destination, and therefore will have an implementation-specific +mechanism to do so. The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature +provides functionality to have the front-end include the back-end's +state in this transfer operation so the back-end does not need to +implement its own mechanism, and so the virtual machine may have its +complete state, including vhost-user devices' states, contained within a +single stream of data. + +To do this, the back-end state is transferred from back-end to front-end +on the source side, and vice versa on the destination side. This +transfer happens over a channel that is negotiated using the +``VHOST_USER_SET_DEVICE_STATE_FD`` message. This message has two +parameters: + +* Direction of transfer: On the source, the data is saved, transferring + it from the back-end to the front-end. On the destination, the data + is loaded, transferring it from the front-end to the back-end. + +* Migration phase: Currently, the only supported phase is the period + after the transfer of memory-mapped regions before switch-over to the + destination, when both the source and destination devices are + suspended (:ref:`Suspended device state <suspended_device_state>`). + In the future, additional phases might be supported to allow iterative + migration while the device is running. + +The nature of the channel is implementation-defined, but it must +generally behave like a pipe: The writing end will write all the data it +has into it, signalling the end of data by closing its end. The reading +end must read all of this data (until encountering the end of file) and +process it. + +* When saving, the writing end is the source back-end, and the reading + end is the source front-end. After reading the state data from the + channel, the source front-end must transfer it to the destination + front-end through an implementation-defined mechanism. + +* When loading, the writing end is the destination front-end, and the + reading end is the destination back-end. After reading the state data + from the channel, the destination back-end must deserialize its + internal state from that data and set itself up to allow the driver to + seamlessly resume operation on the VIRTIO level. + +Seamlessly resuming operation means that the migration must be +transparent to the guest driver, which operates on the VIRTIO level. +This driver will not perform any re-initialization steps, but continue +to use the device as if no migration had occurred. The vhost-user +front-end, however, will re-initialize the vhost state on the +destination, following the usual protocol for establishing a connection +to a vhost-user back-end: This includes, for example, setting up memory +mappings and kick and call FDs as necessary, negotiating protocol +features, or setting the initial vring base indices (to the same value +as on the source side, so that operation can resume). + +Both on the source and on the destination side, after the respective +front-end has seen all data transferred (when the transfer FD has been +closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to +verify that data transfer was successful in the back-end, too. The +back-end responds once it knows whether the transfer and processing was +successful or not. + Memory access ------------- @@ -896,6 +1050,7 @@ Protocol features #define VHOST_USER_PROTOCOL_F_STATUS 16 #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17 #define VHOST_USER_PROTOCOL_F_SHARED_OBJECT 18 + #define VHOST_USER_PROTOCOL_F_DEVICE_STATE 19 Front-end message types ----------------------- @@ -1042,18 +1197,54 @@ Front-end message types ``VHOST_USER_SET_VRING_BASE`` :id: 10 :equivalent ioctl: ``VHOST_SET_VRING_BASE`` - :request payload: vring state description + :request payload: vring descriptor index/indices :reply payload: N/A - Sets the base offset in the available vring. + Sets the next index to use for descriptors in this vring: + + * For a split virtqueue, sets only the next descriptor index to + process in the *Available Ring*. The device is supposed to read the + next index in the *Used Ring* from the respective vring structure in + guest memory. + + * For a packed virtqueue, both indices are supplied, as they are not + explicitly available in memory. + + Consequently, the payload type is specific to the type of virt queue + (*a vring descriptor index for split virtqueues* vs. *vring descriptor + indices for packed virtqueues*). ``VHOST_USER_GET_VRING_BASE`` :id: 11 :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` :request payload: vring state description - :reply payload: vring state description + :reply payload: vring descriptor index/indices + + Stops the vring and returns the current descriptor index or indices: + + * For a split virtqueue, returns only the 16-bit next descriptor + index to process in the *Available Ring*. Note that this may + differ from the available ring index in the vring structure in + memory, which points to where the driver will put new available + descriptors. For the *Used Ring*, the device only needs the next + descriptor index at which to put new descriptors, which is the + value in the vring structure in memory, so this value is not + covered by this message. + + * For a packed virtqueue, neither index is explicitly available to + read from memory, so both indices (as maintained by the device) are + returned. + + Consequently, the payload type is specific to the type of virt queue + (*a vring descriptor index for split virtqueues* vs. *vring descriptor + indices for packed virtqueues*). - Get the available vring base offset. + When and as long as all of a device’s vrings are stopped, it is + *suspended*, see :ref:`Suspended device state + <suspended_device_state>`. + + The request payload’s *num* field is currently reserved and must be + set to 0. ``VHOST_USER_SET_VRING_KICK`` :id: 12 @@ -1464,6 +1655,76 @@ Front-end message types the requested UUID. Back-end will reply passing the fd when the operation is successful, or no fd otherwise. +``VHOST_USER_SET_DEVICE_STATE_FD`` + :id: 42 + :equivalent ioctl: N/A + :request payload: device state transfer parameters + :reply payload: ``u64`` + + Front-end and back-end negotiate a channel over which to transfer the + back-end’s internal state during migration. Either side (front-end or + back-end) may create the channel. The nature of this channel is not + restricted or defined in this document, but whichever side creates it + must create a file descriptor that is provided to the respectively + other side, allowing access to the channel. This FD must behave as + follows: + + * For the writing end, it must allow writing the whole back-end state + sequentially. Closing the file descriptor signals the end of + transfer. + + * For the reading end, it must allow reading the whole back-end state + sequentially. The end of file signals the end of the transfer. + + For example, the channel may be a pipe, in which case the two ends of + the pipe fulfill these requirements respectively. + + Initially, the front-end creates a channel along with such an FD. It + passes the FD to the back-end as ancillary data of a + ``VHOST_USER_SET_DEVICE_STATE_FD`` message. The back-end may create a + different transfer channel, passing the respective FD back to the + front-end as ancillary data of the reply. If so, the front-end must + then discard its channel and use the one provided by the back-end. + + Whether the back-end should decide to use its own channel is decided + based on efficiency: If the channel is a pipe, both ends will most + likely need to copy data into and out of it. Any channel that allows + for more efficient processing on at least one end, e.g. through + zero-copy, is considered more efficient and thus preferred. If the + back-end can provide such a channel, it should decide to use it. + + The request payload contains parameters for the subsequent data + transfer, as described in the :ref:`Migrating back-end state + <migrating_backend_state>` section. + + The value returned is both an indication for success, and whether a + file descriptor for a back-end-provided channel is returned: Bits 0–7 + are 0 on success, and non-zero on error. Bit 8 is the invalid FD + flag; this flag is set when there is no file descriptor returned. + When this flag is not set, the front-end must use the returned file + descriptor as its end of the transfer channel. The back-end must not + both indicate an error and return a file descriptor. + + Using this function requires prior negotiation of the + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. + +``VHOST_USER_CHECK_DEVICE_STATE`` + :id: 43 + :equivalent ioctl: N/A + :request payload: N/A + :reply payload: ``u64`` + + After transferring the back-end’s internal state during migration (see + the :ref:`Migrating back-end state <migrating_backend_state>` + section), check whether the back-end was able to successfully fully + process the state. + + The value returned indicates success or error; 0 is success, any + non-zero value is an error. + + Using this function requires prior negotiation of the + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. + Back-end message types ---------------------- |