author     Stefan Hajnoczi <stefanha@redhat.com>  2025-07-13 01:45:30 -0400
committer  Stefan Hajnoczi <stefanha@redhat.com>  2025-07-13 01:45:30 -0400
commit     52af79811f0f0d38b8e99d2df68a3a14d79353ca (patch)
tree       fea6219b61bcf93e81f1fa19cb078f42f6ff2343
parent     0edc2afe0c8197bbcb98f948c609fb74c9b1ffd5 (diff)
parent     beeac2df5ff0850299e58f4ad27f83dae64c54df (diff)
Merge tag 'migration-20250711-pull-request' of https://gitlab.com/farosas/qemu into staging
Migration pull request
- General cleanups around: postcopy, bg-snapshot, migration hooks,
  migration completion and formatting of 'info migrate'.
- Overhaul of postcopy blocktime tracking.
# -----BEGIN PGP SIGNATURE-----
#
# iQJEBAABCAAuFiEEqhtIsKIjJqWkw2TPx5jcdBvsMZ0FAmhxGdgQHGZhcm9zYXNA
# c3VzZS5kZQAKCRDHmNx0G+wxnahoD/9uNXirlmRk3tDnhiJsiYx+HnXYPFEORSZq
# zlpUyqvhQ1POp3Fa5pRf+bJ5mmPw8h8PdOR2StMpnW2Xa1OatAZj5m1uityAVWOl
# EkVfZLl0j6j9HCCmE3c4dztOGIBsd9YY0GWizL05XHYZPrdX4zOpolMN4m53RwQY
# HUVD6T2y9eFDnCO6MsoA9EfmkFYCRvqlS0VzTcYzQFN4H+QHlcpDfweqJpTLPa+1
# trahAN9PBuMjoewjDqwkNkf0CLaCXHszAfj6yv62Vi8Cbp9DDPywIYJKFnxspElW
# Fjg1b4MdsbYZNmeKgIawzgTOL1RrojvKkoi7KWp3D7M+/ZZl9kBwQuUcBXKI7N0R
# Y0GNfkkTycn18nM0JU/6QWSuVeiPbLArxQUGP1cLgvcHSSNgD9JxWbNBu5+1fFOG
# Gg3qnyYatJ6xJDiCrdKqV8fwozNlm/G6b9BiCDeVq+4nA2OKQ0shiNA1GZHvVSQL
# X4uAPexETdHfA/LeA2w5sgVBEw7BewBdjLntZDIFsyBnLrvqrDcU5Aav0wiHoI8U
# QBC2aIpJfMLHiIQ93mVX96NltXC7KvJTIZVl3iwfiYEYCvQtTYgdJ09ELXFJYxFX
# XpTTazqpmPSfuZpPRgx9YbDP/kS8Fg/PTOlPeD0T/frFgd1S6Thh6OW455PavMp8
# ht2lE4sxjA==
# =vtRD
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 10:04:08 EDT
# gpg: using RSA key AA1B48B0A22326A5A4C364CFC798DC741BEC319D
# gpg: issuer "farosas@suse.de"
# gpg: Good signature from "Fabiano Rosas <farosas@suse.de>" [unknown]
# gpg: aka "Fabiano Almeida Rosas <fabiano.rosas@suse.com>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: AA1B 48B0 A223 26A5 A4C3 64CF C798 DC74 1BEC 319D
* tag 'migration-20250711-pull-request' of https://gitlab.com/farosas/qemu: (26 commits)
migration: Rename save_live_complete_precopy_thread to save_complete_precopy_thread
migration/postcopy: Add latency distribution report for blocktime
migration/postcopy: blocktime allows track / report non-vCPU faults
migration/postcopy: Optimize blocktime fault tracking with hashtable
migration/postcopy: Cleanup the total blocktime accounting
migration/postcopy: Cache the tid->vcpu mapping for blocktime
migration/postcopy: Initialize blocktime context only until listen
migration/postcopy: Report fault latencies in blocktime
migration/postcopy: Add blocktime fault counts per-vcpu
migration/postcopy: Bring blocktime layer to ns level
migration/postcopy: Drop PostcopyBlocktimeContext.start_time
migration/postcopy: Make all blocktime vars 64bits
migration/postcopy: Drop all atomic ops in blocktime feature
migration/postcopy: Push blocktime start/end into page req mutex
migration: Add option to set postcopy-blocktime
migration/postcopy: Avoid clearing dirty bitmap for postcopy too
migration: Rewrite the migration complete detect logic
migration/ram: Add tracepoints for ram_save_complete()
migration/ram: One less indent for ram_find_and_save_block()
migration: qemu_savevm_complete*() helpers
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
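
As context for the blocktime overhaul in this series, a minimal sketch of how
the metric is enabled and queried, based on the docs and QAPI changes below
(the reply values are purely illustrative and the field list is abbreviated):

    # Destination monitor, before the switchover to postcopy:
    (qemu) migrate_set_capability postcopy-blocktime on

    # This series also adds a -global property for the capability:
    #   -global migration.postcopy-blocktime=on

    # During/after postcopy, query the stats over QMP; blocktimes are in
    # milliseconds, latencies in nanoseconds:
    -> { "execute": "query-migrate" }
    <- { "return": { "postcopy-blocktime": 320,
                     "postcopy-vcpu-blocktime": [12, 0, 54, 254],
                     "postcopy-latency": 180000,
                     "postcopy-vcpu-latency": [150000, 0, 210000, 160000],
                     "postcopy-non-vcpu-latency": 90000,
                     "postcopy-latency-dist": [0, 0, 4, 25, ...],
                     ... } }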
-rw-r--r--  docs/devel/migration/main.rst           |   4
-rw-r--r--  docs/devel/migration/postcopy.rst       |  36
-rw-r--r--  docs/devel/migration/vfio.rst           |  16
-rw-r--r--  hw/ppc/spapr.c                          |   2
-rw-r--r--  hw/s390x/s390-stattrib.c                |   2
-rw-r--r--  hw/vfio/migration-multifd.c             |   4
-rw-r--r--  hw/vfio/migration-multifd.h             |   2
-rw-r--r--  hw/vfio/migration.c                     |   4
-rw-r--r--  include/migration/misc.h                |   8
-rw-r--r--  include/migration/register.h            |  34
-rw-r--r--  include/qemu/typedefs.h                 |   6
-rw-r--r--  migration/block-dirty-bitmap.c          |   3
-rw-r--r--  migration/migration-hmp-cmds.c          | 159
-rw-r--r--  migration/migration.c                   |  87
-rw-r--r--  migration/migration.h                   |   2
-rw-r--r--  migration/multifd-device-state.c        |  10
-rw-r--r--  migration/options.c                     |   2
-rw-r--r--  migration/postcopy-ram.c                | 563
-rw-r--r--  migration/postcopy-ram.h                |   2
-rw-r--r--  migration/ram.c                         |  32
-rw-r--r--  migration/savevm.c                      |  89
-rw-r--r--  migration/trace-events                  |   9
-rw-r--r--  qapi/migration.json                     |  38
-rw-r--r--  tests/qtest/migration/migration-qmp.c   |   5

24 files changed, 798 insertions(+), 321 deletions(-)
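
The core API change in this series folds the save_live_complete_precopy and
save_live_complete_postcopy hooks into a single save_complete hook. As quick
orientation before the per-file diffs, a minimal sketch of a device's handlers
after the rename (the mydev_* names are hypothetical; the real hook signature
is in include/migration/register.h below):

    /* Hypothetical device following the renamed hook; compare the
     * spapr/s390/vfio hunks below, which make exactly this change. */
    static int mydev_save_complete(QEMUFile *f, void *opaque)
    {
        MyDevState *s = opaque;             /* hypothetical state struct */

        /* Transmit the last section with any remaining device data;
         * return 0 on success, negative on error. */
        return mydev_flush_remaining(f, s); /* hypothetical helper */
    }

    static SaveVMHandlers savevm_mydev_handlers = {
        .save_setup        = mydev_save_setup,
        .save_live_iterate = mydev_save_iterate,
        /* was .save_live_complete_precopy (and, for postcopy-able
         * devices, .save_live_complete_postcopy): */
        .save_complete     = mydev_save_complete,
        .load_state        = mydev_load_state,
    };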
diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst index cdd4f4a..6493c1d 100644 --- a/docs/devel/migration/main.rst +++ b/docs/devel/migration/main.rst @@ -508,8 +508,8 @@ An iterative device must provide: the point that stream bandwidth limits tell it to stop. Each call generates one section. - - A ``save_live_complete_precopy`` function that must transmit the - last section for the device containing any remaining data. + - A ``save_complete`` function that must transmit the last section for + the device containing any remaining data. - A ``load_state`` function used to load sections generated by any of the save functions that generate sections. diff --git a/docs/devel/migration/postcopy.rst b/docs/devel/migration/postcopy.rst index 82e7a84..e319388 100644 --- a/docs/devel/migration/postcopy.rst +++ b/docs/devel/migration/postcopy.rst @@ -33,25 +33,6 @@ will now cause the transition from precopy to postcopy. It can be issued immediately after migration is started or any time later on. Issuing it after the end of a migration is harmless. -Blocktime is a postcopy live migration metric, intended to show how -long the vCPU was in state of interruptible sleep due to pagefault. -That metric is calculated both for all vCPUs as overlapped value, and -separately for each vCPU. These values are calculated on destination -side. To enable postcopy blocktime calculation, enter following -command on destination monitor: - -``migrate_set_capability postcopy-blocktime on`` - -Postcopy blocktime can be retrieved by query-migrate qmp command. -postcopy-blocktime value of qmp command will show overlapped blocking -time for all vCPU, postcopy-vcpu-blocktime will show list of blocking -time per vCPU. - -.. note:: - During the postcopy phase, the bandwidth limits set using - ``migrate_set_parameter`` is ignored (to avoid delaying requested pages that - the destination is waiting for). - Postcopy internals ================== @@ -312,3 +293,20 @@ explicitly) to be sent in a separate preempt channel, rather than queued in the background migration channel. Anyone who cares about latencies of page faults during a postcopy migration should enable this feature. By default, it's not enabled. + +Postcopy blocktime statistics +----------------------------- + +Blocktime is a postcopy live migration metric, intended to show how +long the vCPU was in state of interruptible sleep due to pagefault. +That metric is calculated both for all vCPUs as overlapped value, and +separately for each vCPU. These values are calculated on destination +side. To enable postcopy blocktime calculation, enter following +command on destination monitor: + +``migrate_set_capability postcopy-blocktime on`` + +Postcopy blocktime can be retrieved by query-migrate qmp command. +postcopy-blocktime value of qmp command will show overlapped blocking +time for all vCPU, postcopy-vcpu-blocktime will show list of blocking +time per vCPU. diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst index 673e354..2d8e5ca 100644 --- a/docs/devel/migration/vfio.rst +++ b/docs/devel/migration/vfio.rst @@ -75,12 +75,12 @@ VFIO implements the device hooks for the iterative approach as follows: in the non-multifd mode. In the multifd mode it just emits either a dummy EOS marker. -* A ``save_live_complete_precopy`` function that sets the VFIO device in - _STOP_COPY state and iteratively copies the data for the VFIO device until - the vendor driver indicates that no data remains. 
- In the multifd mode it just emits a dummy EOS marker. +* A ``save_complete`` function that sets the VFIO device in _STOP_COPY + state and iteratively copies the data for the VFIO device until the + vendor driver indicates that no data remains. In the multifd mode it + just emits a dummy EOS marker. -* A ``save_live_complete_precopy_thread`` function that in the multifd mode +* A ``save_complete_precopy_thread`` function that in the multifd mode provides thread handler performing multifd device state transfer. It sets the VFIO device to _STOP_COPY state, iteratively reads the data from the VFIO device and queues it for multifd transmission until the vendor @@ -195,12 +195,12 @@ Live migration save path | Then the VFIO device is put in _STOP_COPY state (FINISH_MIGRATE, _ACTIVE, _STOP_COPY) - .save_live_complete_precopy() is called for each active device + .save_complete() is called for each active device For the VFIO device: in the non-multifd mode iterate in - .save_live_complete_precopy() until + .save_complete() until pending data is 0 In the multifd mode this iteration is done in - .save_live_complete_precopy_thread() instead. + .save_complete_precopy_thread() instead. | (POSTMIGRATE, _COMPLETED, _STOP_COPY) Migraton thread schedules cleanup bottom half and exits diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c index 08615f6..40f53ad 100644 --- a/hw/ppc/spapr.c +++ b/hw/ppc/spapr.c @@ -2518,7 +2518,7 @@ static void htab_save_cleanup(void *opaque) static SaveVMHandlers savevm_htab_handlers = { .save_setup = htab_save_setup, .save_live_iterate = htab_save_iterate, - .save_live_complete_precopy = htab_save_complete, + .save_complete = htab_save_complete, .save_cleanup = htab_save_cleanup, .load_state = htab_load, }; diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c index f74cf32..13a678a 100644 --- a/hw/s390x/s390-stattrib.c +++ b/hw/s390x/s390-stattrib.c @@ -338,7 +338,7 @@ static const TypeInfo qemu_s390_stattrib_info = { static SaveVMHandlers savevm_s390_stattrib_handlers = { .save_setup = cmma_save_setup, .save_live_iterate = cmma_save_iterate, - .save_live_complete_precopy = cmma_save_complete, + .save_complete = cmma_save_complete, .state_pending_exact = cmma_state_pending, .state_pending_estimate = cmma_state_pending, .save_cleanup = cmma_save_cleanup, diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c index 850a319..5563548 100644 --- a/hw/vfio/migration-multifd.c +++ b/hw/vfio/migration-multifd.c @@ -583,7 +583,7 @@ vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev, /* * This thread is spawned by the migration core directly via - * .save_live_complete_precopy_thread SaveVMHandler. + * .save_complete_precopy_thread SaveVMHandler. * * It exits after either: * * completing saving the remaining device state and device config, OR: @@ -592,7 +592,7 @@ vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev, * multifd_device_state_save_thread_should_exit() returning true. 
*/ bool -vfio_multifd_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d, +vfio_multifd_save_complete_precopy_thread(SaveCompletePrecopyThreadData *d, Error **errp) { VFIODevice *vbasedev = d->handler_opaque; diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h index 0bab632..ebf22a7 100644 --- a/hw/vfio/migration-multifd.h +++ b/hw/vfio/migration-multifd.h @@ -26,7 +26,7 @@ bool vfio_multifd_load_state_buffer(void *opaque, char *data, size_t data_size, void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f); bool -vfio_multifd_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d, +vfio_multifd_save_complete_precopy_thread(SaveCompletePrecopyThreadData *d, Error **errp); int vfio_multifd_switchover_start(VFIODevice *vbasedev); diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index b76697bd..c329578 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -824,7 +824,7 @@ static const SaveVMHandlers savevm_vfio_handlers = { .state_pending_exact = vfio_state_pending_exact, .is_active_iterate = vfio_is_active_iterate, .save_live_iterate = vfio_save_iterate, - .save_live_complete_precopy = vfio_save_complete_precopy, + .save_complete = vfio_save_complete_precopy, .save_state = vfio_save_state, .load_setup = vfio_load_setup, .load_cleanup = vfio_load_cleanup, @@ -835,7 +835,7 @@ static const SaveVMHandlers savevm_vfio_handlers = { */ .load_state_buffer = vfio_multifd_load_state_buffer, .switchover_start = vfio_switchover_start, - .save_live_complete_precopy_thread = vfio_multifd_save_complete_precopy_thread, + .save_complete_precopy_thread = vfio_multifd_save_complete_precopy_thread, }; /* ---------------------------------------------------------------------- */ diff --git a/include/migration/misc.h b/include/migration/misc.h index 8fd36eb..a261f99 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -119,19 +119,19 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel, Error **errp); /* migration/multifd-device-state.c */ -typedef struct SaveLiveCompletePrecopyThreadData { - SaveLiveCompletePrecopyThreadHandler hdlr; +typedef struct SaveCompletePrecopyThreadData { + SaveCompletePrecopyThreadHandler hdlr; char *idstr; uint32_t instance_id; void *handler_opaque; -} SaveLiveCompletePrecopyThreadData; +} SaveCompletePrecopyThreadData; bool multifd_queue_device_state(char *idstr, uint32_t instance_id, char *data, size_t len); bool multifd_device_state_supported(void); void -multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr, +multifd_spawn_device_state_save_thread(SaveCompletePrecopyThreadHandler hdlr, char *idstr, uint32_t instance_id, void *opaque); diff --git a/include/migration/register.h b/include/migration/register.h index b79dc81..ae79794 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -78,51 +78,43 @@ typedef struct SaveVMHandlers { void (*save_cleanup)(void *opaque); /** - * @save_live_complete_postcopy + * @save_complete * - * Called at the end of postcopy for all postcopyable devices. + * Transmits the last section for the device containing any + * remaining data at the end phase of migration. * - * @f: QEMUFile where to send the data - * @opaque: data pointer passed to register_savevm_live() + * For precopy, this will be invoked _during_ the switchover phase + * after source VM is stopped. 
* - * Returns zero to indicate success and negative for error - */ - int (*save_live_complete_postcopy)(QEMUFile *f, void *opaque); - - /** - * @save_live_complete_precopy - * - * Transmits the last section for the device containing any - * remaining data at the end of a precopy phase. When postcopy is - * enabled, devices that support postcopy will skip this step, - * where the final data will be flushed at the end of postcopy via - * @save_live_complete_postcopy instead. + * For postcopy, this will be invoked _after_ the switchover phase + * (except some very unusual cases, like PMEM ramblocks), while + * destination VM can be running. * * @f: QEMUFile where to send the data * @opaque: data pointer passed to register_savevm_live() * * Returns zero to indicate success and negative for error */ - int (*save_live_complete_precopy)(QEMUFile *f, void *opaque); + int (*save_complete)(QEMUFile *f, void *opaque); /** - * @save_live_complete_precopy_thread (invoked in a separate thread) + * @save_complete_precopy_thread (invoked in a separate thread) * * Called at the end of a precopy phase from a separate worker thread * in configurations where multifd device state transfer is supported * in order to perform asynchronous transmission of the remaining data in - * parallel with @save_live_complete_precopy handlers. + * parallel with @save_complete handlers. * When postcopy is enabled, devices that support postcopy will skip this * step. * - * @d: a #SaveLiveCompletePrecopyThreadData containing parameters that the + * @d: a #SaveCompletePrecopyThreadData containing parameters that the * handler may need, including this device section idstr and instance_id, * and opaque data pointer passed to register_savevm_live(). * @errp: pointer to Error*, to store an error if it happens. * * Returns true to indicate success and false for errors. */ - SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread; + SaveCompletePrecopyThreadHandler save_complete_precopy_thread; /* This runs both outside and inside the BQL. 
*/ diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h index 507f081..4a94af9 100644 --- a/include/qemu/typedefs.h +++ b/include/qemu/typedefs.h @@ -109,7 +109,7 @@ typedef struct QString QString; typedef struct RAMBlock RAMBlock; typedef struct Range Range; typedef struct ReservedRegion ReservedRegion; -typedef struct SaveLiveCompletePrecopyThreadData SaveLiveCompletePrecopyThreadData; +typedef struct SaveCompletePrecopyThreadData SaveCompletePrecopyThreadData; typedef struct SHPCDevice SHPCDevice; typedef struct SSIBus SSIBus; typedef struct TCGCPUOps TCGCPUOps; @@ -135,7 +135,7 @@ typedef struct IRQState *qemu_irq; typedef void (*qemu_irq_handler)(void *opaque, int n, int level); typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit, Error **errp); -typedef bool (*SaveLiveCompletePrecopyThreadHandler)(SaveLiveCompletePrecopyThreadData *d, - Error **errp); +typedef bool (*SaveCompletePrecopyThreadHandler)(SaveCompletePrecopyThreadData *d, + Error **errp); #endif /* QEMU_TYPEDEFS_H */ diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c index f2c352d..a061aad 100644 --- a/migration/block-dirty-bitmap.c +++ b/migration/block-dirty-bitmap.c @@ -1248,8 +1248,7 @@ static bool dirty_bitmap_has_postcopy(void *opaque) static SaveVMHandlers savevm_dirty_bitmap_handlers = { .save_setup = dirty_bitmap_save_setup, - .save_live_complete_postcopy = dirty_bitmap_save_complete, - .save_live_complete_precopy = dirty_bitmap_save_complete, + .save_complete = dirty_bitmap_save_complete, .has_postcopy = dirty_bitmap_has_postcopy, .state_pending_exact = dirty_bitmap_state_pending, .state_pending_estimate = dirty_bitmap_state_pending, diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index e8a563c..cef5608 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -52,6 +52,88 @@ static void migration_global_dump(Monitor *mon) ms->clear_bitmap_shift); } +static const gchar *format_time_str(uint64_t us) +{ + const char *units[] = {"us", "ms", "sec"}; + int index = 0; + + while (us > 1000) { + us /= 1000; + if (++index >= (sizeof(units) - 1)) { + break; + } + } + + return g_strdup_printf("%"PRIu64" %s", us, units[index]); +} + +static void migration_dump_blocktime(Monitor *mon, MigrationInfo *info) +{ + if (info->has_postcopy_blocktime) { + monitor_printf(mon, "Postcopy Blocktime (ms): %" PRIu32 "\n", + info->postcopy_blocktime); + } + + if (info->has_postcopy_vcpu_blocktime) { + uint32List *item = info->postcopy_vcpu_blocktime; + const char *sep = ""; + int count = 0; + + monitor_printf(mon, "Postcopy vCPU Blocktime (ms):\n ["); + + while (item) { + monitor_printf(mon, "%s%"PRIu32, sep, item->value); + item = item->next; + /* Each line 10 vcpu results, newline if there's more */ + sep = ((++count % 10 == 0) && item) ? 
",\n " : ", "; + } + monitor_printf(mon, "]\n"); + } + + if (info->has_postcopy_latency) { + monitor_printf(mon, "Postcopy Latency (ns): %" PRIu64 "\n", + info->postcopy_latency); + } + + if (info->has_postcopy_non_vcpu_latency) { + monitor_printf(mon, "Postcopy non-vCPU Latencies (ns): %" PRIu64 "\n", + info->postcopy_non_vcpu_latency); + } + + if (info->has_postcopy_vcpu_latency) { + uint64List *item = info->postcopy_vcpu_latency; + const char *sep = ""; + int count = 0; + + monitor_printf(mon, "Postcopy vCPU Latencies (ns):\n ["); + + while (item) { + monitor_printf(mon, "%s%"PRIu64, sep, item->value); + item = item->next; + /* Each line 10 vcpu results, newline if there's more */ + sep = ((++count % 10 == 0) && item) ? ",\n " : ", "; + } + monitor_printf(mon, "]\n"); + } + + if (info->has_postcopy_latency_dist) { + uint64List *item = info->postcopy_latency_dist; + int count = 0; + + monitor_printf(mon, "Postcopy Latency Distribution:\n"); + + while (item) { + g_autofree const gchar *from = format_time_str(1UL << count); + g_autofree const gchar *to = format_time_str(1UL << (count + 1)); + + monitor_printf(mon, " [ %8s - %8s ]: %10"PRIu64"\n", + from, to, item->value); + item = item->next; + count++; + } + } +} + void hmp_info_migrate(Monitor *mon, const QDict *qdict) { bool show_all = qdict_get_try_bool(qdict, "all", false); @@ -69,7 +151,7 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) } if (info->has_status) { - monitor_printf(mon, "Status: %s", + monitor_printf(mon, "Status: \t\t%s", MigrationStatus_str(info->status)); if (info->status == MIGRATION_STATUS_FAILED && info->error_desc) { monitor_printf(mon, " (%s)\n", info->error_desc); @@ -78,7 +160,7 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) } if (info->total_time) { - monitor_printf(mon, "Time (ms): total=%" PRIu64, + monitor_printf(mon, "Time (ms): \t\ttotal=%" PRIu64, info->total_time); if (info->has_setup_time) { monitor_printf(mon, ", setup=%" PRIu64, @@ -110,48 +192,51 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) } if (info->ram) { + g_autofree char *str_psize = size_to_str(info->ram->page_size); + g_autofree char *str_total = size_to_str(info->ram->total); + g_autofree char *str_transferred = size_to_str(info->ram->transferred); + g_autofree char *str_remaining = size_to_str(info->ram->remaining); + g_autofree char *str_precopy = size_to_str(info->ram->precopy_bytes); + g_autofree char *str_multifd = size_to_str(info->ram->multifd_bytes); + g_autofree char *str_postcopy = size_to_str(info->ram->postcopy_bytes); + monitor_printf(mon, "RAM info:\n"); - monitor_printf(mon, " Throughput (Mbps): %0.2f\n", + monitor_printf(mon, " Throughput (Mbps): \t%0.2f\n", info->ram->mbps); - monitor_printf(mon, " Sizes (KiB): pagesize=%" PRIu64 - ", total=%" PRIu64 ",\n", - info->ram->page_size >> 10, - info->ram->total >> 10); - monitor_printf(mon, " transferred=%" PRIu64 - ", remain=%" PRIu64 ",\n", - info->ram->transferred >> 10, - info->ram->remaining >> 10); - monitor_printf(mon, " precopy=%" PRIu64 - ", multifd=%" PRIu64 - ", postcopy=%" PRIu64, - info->ram->precopy_bytes >> 10, - info->ram->multifd_bytes >> 10, - info->ram->postcopy_bytes >> 10); + monitor_printf(mon, " Sizes: \t\tpagesize=%s, total=%s\n", + str_psize, str_total); + monitor_printf(mon, " Transfers: \t\ttransferred=%s, remain=%s\n", + str_transferred, str_remaining); + monitor_printf(mon, " Channels: \t\tprecopy=%s, " + "multifd=%s, postcopy=%s", + str_precopy, str_multifd, str_postcopy); if (info->vfio) { - monitor_printf(mon, 
", vfio=%" PRIu64, - info->vfio->transferred >> 10); + g_autofree char *str_vfio = size_to_str(info->vfio->transferred); + + monitor_printf(mon, ", vfio=%s", str_vfio); } monitor_printf(mon, "\n"); - monitor_printf(mon, " Pages: normal=%" PRIu64 ", zero=%" PRIu64 - ", rate_per_sec=%" PRIu64 "\n", - info->ram->normal, - info->ram->duplicate, + monitor_printf(mon, " Page Types: \tnormal=%" PRIu64 + ", zero=%" PRIu64 "\n", + info->ram->normal, info->ram->duplicate); + monitor_printf(mon, " Page Rates (pps): \ttransfer=%" PRIu64, info->ram->pages_per_second); - monitor_printf(mon, " Others: dirty_syncs=%" PRIu64, - info->ram->dirty_sync_count); - if (info->ram->dirty_pages_rate) { - monitor_printf(mon, ", dirty_pages_rate=%" PRIu64, + monitor_printf(mon, ", dirty=%" PRIu64, info->ram->dirty_pages_rate); } + monitor_printf(mon, "\n"); + + monitor_printf(mon, " Others: \t\tdirty_syncs=%" PRIu64, + info->ram->dirty_sync_count); if (info->ram->postcopy_requests) { monitor_printf(mon, ", postcopy_req=%" PRIu64, info->ram->postcopy_requests); } if (info->ram->downtime_bytes) { - monitor_printf(mon, ", downtime_ram=%" PRIu64, + monitor_printf(mon, ", downtime_bytes=%" PRIu64, info->ram->downtime_bytes); } if (info->ram->dirty_sync_missed_zero_copy) { @@ -199,23 +284,7 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) info->dirty_limit_ring_full_time); } - if (info->has_postcopy_blocktime) { - monitor_printf(mon, "Postcopy Blocktime (ms): %" PRIu32 "\n", - info->postcopy_blocktime); - } - - if (info->has_postcopy_vcpu_blocktime) { - Visitor *v; - char *str; - v = string_output_visitor_new(false, &str); - visit_type_uint32List(v, NULL, &info->postcopy_vcpu_blocktime, - &error_abort); - visit_complete(v, &str); - monitor_printf(mon, "Postcopy vCPU Blocktime: %s\n", str); - g_free(str); - visit_free(v); - } - + migration_dump_blocktime(mon, info); out: qapi_free_MigrationInfo(info); } diff --git a/migration/migration.c b/migration/migration.c index 4098870..10c216d 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -576,22 +576,27 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis, } int migrate_send_rp_req_pages(MigrationIncomingState *mis, - RAMBlock *rb, ram_addr_t start, uint64_t haddr) + RAMBlock *rb, ram_addr_t start, uint64_t haddr, + uint32_t tid) { void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb)); bool received = false; WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) { received = ramblock_recv_bitmap_test_byte_offset(rb, start); - if (!received && !g_tree_lookup(mis->page_requested, aligned)) { - /* - * The page has not been received, and it's not yet in the page - * request list. Queue it. Set the value of element to 1, so that - * things like g_tree_lookup() will return TRUE (1) when found. - */ - g_tree_insert(mis->page_requested, aligned, (gpointer)1); - qatomic_inc(&mis->page_requested_count); - trace_postcopy_page_req_add(aligned, mis->page_requested_count); + if (!received) { + if (!g_tree_lookup(mis->page_requested, aligned)) { + /* + * The page has not been received, and it's not yet in the + * page request list. Queue it. Set the value of element + * to 1, so that things like g_tree_lookup() will return + * TRUE (1) when found. 
+ */ + g_tree_insert(mis->page_requested, aligned, (gpointer)1); + qatomic_inc(&mis->page_requested_count); + trace_postcopy_page_req_add(aligned, mis->page_requested_count); + } + mark_postcopy_blocktime_begin(haddr, tid, rb); } } @@ -3436,33 +3441,60 @@ static MigIterateState migration_iteration_run(MigrationState *s) Error *local_err = NULL; bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE; bool can_switchover = migration_can_switchover(s); + bool complete_ready; + /* Fast path - get the estimated amount of pending data */ qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy); pending_size = must_precopy + can_postcopy; trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy); - if (pending_size < s->threshold_size) { - qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy); - pending_size = must_precopy + can_postcopy; - trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy); + if (in_postcopy) { + /* + * Iterate in postcopy until all pending data flushed. Note that + * postcopy completion doesn't rely on can_switchover, because when + * POSTCOPY_ACTIVE it means switchover already happened. + */ + complete_ready = !pending_size; + } else { + /* + * Exact pending reporting is only needed for precopy. Taking RAM + * as example, there'll be no extra dirty information after + * postcopy started, so ESTIMATE should always match with EXACT + * during postcopy phase. + */ + if (pending_size < s->threshold_size) { + qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy); + pending_size = must_precopy + can_postcopy; + trace_migrate_pending_exact(pending_size, must_precopy, + can_postcopy); + } + + /* Should we switch to postcopy now? */ + if (must_precopy <= s->threshold_size && + can_switchover && qatomic_read(&s->start_postcopy)) { + if (postcopy_start(s, &local_err)) { + migrate_set_error(s, local_err); + error_report_err(local_err); + } + return MIG_ITERATE_SKIP; + } + + /* + * For precopy, migration can complete only if: + * + * (1) Switchover is acknowledged by destination + * (2) Pending size is no more than the threshold specified + * (which was calculated from expected downtime) + */ + complete_ready = can_switchover && (pending_size <= s->threshold_size); } - if ((!pending_size || pending_size < s->threshold_size) && can_switchover) { + if (complete_ready) { trace_migration_thread_low_pending(pending_size); migration_completion(s); return MIG_ITERATE_BREAK; } - /* Still a significant amount to transfer */ - if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover && - qatomic_read(&s->start_postcopy)) { - if (postcopy_start(s, &local_err)) { - migrate_set_error(s, local_err); - error_report_err(local_err); - } - return MIG_ITERATE_SKIP; - } - /* Just another iteration step */ qemu_savevm_state_iterate(s->to_dst_file, in_postcopy); return MIG_ITERATE_RESUME; @@ -3887,9 +3919,8 @@ static void *bg_migration_thread(void *opaque) while (migration_is_active()) { MigIterateState iter_state = bg_migration_iteration_run(s); - if (iter_state == MIG_ITERATE_SKIP) { - continue; - } else if (iter_state == MIG_ITERATE_BREAK) { + + if (iter_state == MIG_ITERATE_BREAK) { break; } diff --git a/migration/migration.h b/migration/migration.h index 739289d..01329bf 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -546,7 +546,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis, void migrate_send_rp_pong(MigrationIncomingState *mis, uint32_t value); int 
migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb, - ram_addr_t start, uint64_t haddr); + ram_addr_t start, uint64_t haddr, uint32_t tid); int migrate_send_rp_message_req_pages(MigrationIncomingState *mis, RAMBlock *rb, ram_addr_t start); void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis, diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c index 94222d0..fce64f0 100644 --- a/migration/multifd-device-state.c +++ b/migration/multifd-device-state.c @@ -131,7 +131,7 @@ bool multifd_device_state_supported(void) static void multifd_device_state_save_thread_data_free(void *opaque) { - SaveLiveCompletePrecopyThreadData *data = opaque; + SaveCompletePrecopyThreadData *data = opaque; g_clear_pointer(&data->idstr, g_free); g_free(data); @@ -139,7 +139,7 @@ static void multifd_device_state_save_thread_data_free(void *opaque) static int multifd_device_state_save_thread(void *opaque) { - SaveLiveCompletePrecopyThreadData *data = opaque; + SaveCompletePrecopyThreadData *data = opaque; g_autoptr(Error) local_err = NULL; if (!data->hdlr(data, &local_err)) { @@ -170,18 +170,18 @@ bool multifd_device_state_save_thread_should_exit(void) } void -multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr, +multifd_spawn_device_state_save_thread(SaveCompletePrecopyThreadHandler hdlr, char *idstr, uint32_t instance_id, void *opaque) { - SaveLiveCompletePrecopyThreadData *data; + SaveCompletePrecopyThreadData *data; assert(multifd_device_state_supported()); assert(multifd_send_device_state); assert(!qatomic_read(&multifd_send_device_state->threads_abort)); - data = g_new(SaveLiveCompletePrecopyThreadData, 1); + data = g_new(SaveCompletePrecopyThreadData, 1); data->hdlr = hdlr; data->idstr = g_strdup(idstr); data->instance_id = instance_id; diff --git a/migration/options.c b/migration/options.c index 162c72c..4e923a2 100644 --- a/migration/options.c +++ b/migration/options.c @@ -187,6 +187,8 @@ const Property migration_properties[] = { DEFINE_PROP_MIG_CAP("x-postcopy-ram", MIGRATION_CAPABILITY_POSTCOPY_RAM), DEFINE_PROP_MIG_CAP("x-postcopy-preempt", MIGRATION_CAPABILITY_POSTCOPY_PREEMPT), + DEFINE_PROP_MIG_CAP("postcopy-blocktime", + MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME), DEFINE_PROP_MIG_CAP("x-colo", MIGRATION_CAPABILITY_X_COLO), DEFINE_PROP_MIG_CAP("x-release-ram", MIGRATION_CAPABILITY_RELEASE_RAM), DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH), diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index 75fd310..45af9a3 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -110,19 +110,104 @@ void postcopy_thread_create(MigrationIncomingState *mis, #include <sys/eventfd.h> #include <linux/userfaultfd.h> +/* + * Here we use 24 buckets, which means the last bucket will cover [2^24 us, + * 2^25 us) ~= [16, 32) seconds. It should be far enough to record even + * extreme (perf-wise broken) 1G pages moving over, which can sometimes + * take a few seconds due to various reasons. Anything more than that + * might be unsensible to account anymore. 
+ */ +#define BLOCKTIME_LATENCY_BUCKET_N (24) + +/* All the time records are in unit of nanoseconds */ typedef struct PostcopyBlocktimeContext { - /* time when page fault initiated per vCPU */ - uint32_t *page_fault_vcpu_time; - /* page address per vCPU */ - uintptr_t *vcpu_addr; - uint32_t total_blocktime; /* blocktime per vCPU */ - uint32_t *vcpu_blocktime; + uint64_t *vcpu_blocktime_total; + /* count of faults per vCPU */ + uint64_t *vcpu_faults_count; + /* + * count of currently blocked faults per vCPU. + * + * NOTE: Normally there should only be one fault in-progress per vCPU + * thread, so logically it _seems_ vcpu_faults_count[] for any vCPU + * should be either zero or one. However, there can be reasons we see + * >1 faults on the same vCPU thread. + * + * CASE (1): since the process to resolve faults (ioctl(UFFDIO_COPY), + * for example) is done before taking the mutex that protects the + * blocktime context, it can happen that we read more than one faulted + * addresses per vCPU. + * + * One example when we can see >1 faulted addresses for one vCPU: + * + * vcpu1 thread fault thread resolve thread + * ============ ============ ============== + * + * faulted on addr1 + * read uffd msg (addr1) + * MUTEX_LOCK + * add entry (cpu1, addr1) + * MUTEX_UNLOCK + * request remote fault (addr1) + * resolve fault (addr1) + * addr1 resolved, continue.. + * faulted on addr2 + * read uffd msg (addr2) + * MUTEX_LOCK + * add entry (cpu1, addr2) <--------------- [A] + * MUTEX_UNLOCK + * MUTEX_LOCK + * remove entry (cpu1, addr1) + * MUTEX_UNLOCK + * + * In above case, we may see (cpu1, addr1) and (cpu1, addr2) entries to + * appear together at [A], when it gets the lock before the resolve + * thread. Use this counter to maintain such case, and only when it + * reaches zero we know the vCPU is not blocked anymore. + * + * CASE (2): theoretically (the author admit to not have verified + * this..), one vCPU thread can also generate more than one userfaultfd + * message on the same address. It can happen e.g. for whatever reason + * the fault got retried before a resolution arrives. In that extremely + * rare case, we could also see two (cpu1, addr1) entries. + * + * In all cases, be prepared with such re-entrancies with this array. + * + * Using uint8_t should be far enough for now. For example, when + * there're only one resolve thread (postcopy ram listening thread), + * the max (concurrent fault entries) should be two. + */ + uint8_t *vcpu_faults_current; + /* + * The hash that contains addr1->[(cpu1,ts1),(cpu2,ts2) ...] mappings. + * Each of the entry is a tuple of (CPU index, fault timestamp) showing + * that a fault was requested. + */ + GHashTable *vcpu_addr_hash; + /* + * Each bucket stores the count of faults that were resolved within the + * bucket window [2^N us, 2^(N+1) us). + */ + uint64_t latency_buckets[BLOCKTIME_LATENCY_BUCKET_N]; + /* total blocktime when all vCPUs are stopped */ + uint64_t total_blocktime; /* point in time when last page fault was initiated */ - uint32_t last_begin; + uint64_t last_begin; /* number of vCPU are suspended */ int smp_cpus_down; - uint64_t start_time; + + /* + * Fast path for looking up vcpu_index from tid. NOTE: this result + * only reflects the vcpu setup when postcopy is running. It may not + * always match with the current vcpu setup because vcpus can be hot + * attached/detached after migration completes. However this should be + * stable when blocktime is using the structure. + */ + GHashTable *tid_to_vcpu_hash; + /* Count of non-vCPU faults. 
This is only for debugging purpose. */ + uint64_t non_vcpu_faults; + /* total blocktime when a non-vCPU thread is stopped */ + uint64_t non_vcpu_blocktime_total; /* * Handler for exit event, necessary for @@ -131,11 +216,41 @@ typedef struct PostcopyBlocktimeContext { Notifier exit_notifier; } PostcopyBlocktimeContext; +typedef struct { + /* The time the fault was triggered */ + uint64_t fault_time; + /* + * The vCPU index that was blocked, when cpu==-1, it means it's a + * fault from non-vCPU threads. + */ + int cpu; +} BlocktimeVCPUEntry; + +/* Alloc an entry to record a vCPU fault */ +static BlocktimeVCPUEntry * +blocktime_vcpu_entry_alloc(int cpu, uint64_t fault_time) +{ + BlocktimeVCPUEntry *entry = g_new(BlocktimeVCPUEntry, 1); + + entry->fault_time = fault_time; + entry->cpu = cpu; + + return entry; +} + +/* Free a @GList of @BlocktimeVCPUEntry */ +static void blocktime_vcpu_list_free(gpointer data) +{ + g_list_free_full(data, g_free); +} + static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx) { - g_free(ctx->page_fault_vcpu_time); - g_free(ctx->vcpu_addr); - g_free(ctx->vcpu_blocktime); + g_hash_table_destroy(ctx->tid_to_vcpu_hash); + g_hash_table_destroy(ctx->vcpu_addr_hash); + g_free(ctx->vcpu_blocktime_total); + g_free(ctx->vcpu_faults_count); + g_free(ctx->vcpu_faults_current); g_free(ctx); } @@ -146,32 +261,65 @@ static void migration_exit_cb(Notifier *n, void *data) destroy_blocktime_context(ctx); } +static GHashTable *blocktime_init_tid_to_vcpu_hash(void) +{ + /* + * TID as an unsigned int can be directly used as the key. However, + * CPU index can NOT be directly used as value, because CPU index can + * be 0, which means NULL. Then when lookup we can never know whether + * it's 0 or "not found". Hence use an indirection for CPU index. + */ + GHashTable *table = g_hash_table_new_full(g_direct_hash, g_direct_equal, + NULL, g_free); + CPUState *cpu; + + /* + * Initialize the tid->cpu_id mapping for lookups. The caller needs to + * make sure when reaching here the CPU topology is frozen and will be + * stable for the whole blocktime trapping period. 
+ */ + CPU_FOREACH(cpu) { + int *value = g_new(int, 1); + + *value = cpu->cpu_index; + g_hash_table_insert(table, + GUINT_TO_POINTER((uint32_t)cpu->thread_id), + value); + trace_postcopy_blocktime_tid_cpu_map(cpu->cpu_index, cpu->thread_id); + } + + return table; +} + static struct PostcopyBlocktimeContext *blocktime_context_new(void) { MachineState *ms = MACHINE(qdev_get_machine()); unsigned int smp_cpus = ms->smp.cpus; PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1); - ctx->page_fault_vcpu_time = g_new0(uint32_t, smp_cpus); - ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus); - ctx->vcpu_blocktime = g_new0(uint32_t, smp_cpus); - ctx->exit_notifier.notify = migration_exit_cb; - ctx->start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); - qemu_add_exit_notifier(&ctx->exit_notifier); - return ctx; -} + /* Initialize all counters to be zeros */ + memset(ctx->latency_buckets, 0, sizeof(ctx->latency_buckets)); -static uint32List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx) -{ - MachineState *ms = MACHINE(qdev_get_machine()); - uint32List *list = NULL; - int i; + ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus); + ctx->vcpu_faults_count = g_new0(uint64_t, smp_cpus); + ctx->vcpu_faults_current = g_new0(uint8_t, smp_cpus); + ctx->tid_to_vcpu_hash = blocktime_init_tid_to_vcpu_hash(); - for (i = ms->smp.cpus - 1; i >= 0; i--) { - QAPI_LIST_PREPEND(list, ctx->vcpu_blocktime[i]); - } + /* + * The key (host virtual addresses) will always be gpointer-sized on + * either 32bits or 64bits systems, so it'll fit as a direct key. + * + * The value will be a list of BlocktimeVCPUEntry entries. + */ + ctx->vcpu_addr_hash = g_hash_table_new_full(g_direct_hash, + g_direct_equal, + NULL, + blocktime_vcpu_list_free); + + ctx->exit_notifier.notify = migration_exit_cb; + qemu_add_exit_notifier(&ctx->exit_notifier); - return list; + return ctx; } /* @@ -185,18 +333,64 @@ void fill_destination_postcopy_migration_info(MigrationInfo *info) { MigrationIncomingState *mis = migration_incoming_get_current(); PostcopyBlocktimeContext *bc = mis->blocktime_ctx; + MachineState *ms = MACHINE(qdev_get_machine()); + uint64_t latency_total = 0, faults = 0; + uint32List *list_blocktime = NULL; + uint64List *list_latency = NULL; + uint64List *latency_buckets = NULL; + int i; if (!bc) { return; } + for (i = ms->smp.cpus - 1; i >= 0; i--) { + uint64_t latency, total, count; + + /* Convert ns -> ms */ + QAPI_LIST_PREPEND(list_blocktime, + (uint32_t)(bc->vcpu_blocktime_total[i] / SCALE_MS)); + + /* The rest in nanoseconds */ + total = bc->vcpu_blocktime_total[i]; + latency_total += total; + count = bc->vcpu_faults_count[i]; + faults += count; + + if (count) { + latency = total / count; + } else { + /* No fault detected */ + latency = 0; + } + + QAPI_LIST_PREPEND(list_latency, latency); + } + + for (i = BLOCKTIME_LATENCY_BUCKET_N - 1; i >= 0; i--) { + QAPI_LIST_PREPEND(latency_buckets, bc->latency_buckets[i]); + } + + latency_total += bc->non_vcpu_blocktime_total; + faults += bc->non_vcpu_faults; + + info->has_postcopy_non_vcpu_latency = true; + info->postcopy_non_vcpu_latency = bc->non_vcpu_faults ? 
+ (bc->non_vcpu_blocktime_total / bc->non_vcpu_faults) : 0; info->has_postcopy_blocktime = true; - info->postcopy_blocktime = bc->total_blocktime; + /* Convert ns -> ms */ + info->postcopy_blocktime = (uint32_t)(bc->total_blocktime / SCALE_MS); info->has_postcopy_vcpu_blocktime = true; - info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(bc); + info->postcopy_vcpu_blocktime = list_blocktime; + info->has_postcopy_latency = true; + info->postcopy_latency = faults ? (latency_total / faults) : 0; + info->has_postcopy_vcpu_latency = true; + info->postcopy_vcpu_latency = list_latency; + info->has_postcopy_latency_dist = true; + info->postcopy_latency_dist = latency_buckets; } -static uint32_t get_postcopy_total_blocktime(void) +static uint64_t get_postcopy_total_blocktime(void) { MigrationIncomingState *mis = migration_incoming_get_current(); PostcopyBlocktimeContext *bc = mis->blocktime_ctx; @@ -300,13 +494,13 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis, } #ifdef UFFD_FEATURE_THREAD_ID + /* + * Postcopy blocktime conditionally needs THREAD_ID feature (introduced + * to Linux in 2017). Always try to enable it when QEMU is compiled + * with such environment. + */ if (UFFD_FEATURE_THREAD_ID & supported_features) { asked_features |= UFFD_FEATURE_THREAD_ID; - if (migrate_postcopy_blocktime()) { - if (!mis->blocktime_ctx) { - mis->blocktime_ctx = blocktime_context_new(); - } - } } #endif @@ -752,8 +946,12 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, pagesize); } +/* + * NOTE: @tid is only used when postcopy-blocktime feature is enabled, and + * also optional: when zero is provided, the fault accounting will be ignored. + */ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb, - ram_addr_t start, uint64_t haddr) + ram_addr_t start, uint64_t haddr, uint32_t tid) { void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb)); @@ -772,7 +970,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb, return received ? 0 : postcopy_place_page_zero(mis, aligned, rb); } - return migrate_send_rp_req_pages(mis, rb, start, haddr); + return migrate_send_rp_req_pages(mis, rb, start, haddr, tid); } /* @@ -793,83 +991,204 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb, qemu_ram_get_idstr(rb), rb_offset); return postcopy_wake_shared(pcfd, client_addr, rb); } - postcopy_request_page(mis, rb, aligned_rbo, client_addr); + /* TODO: support blocktime tracking */ + postcopy_request_page(mis, rb, aligned_rbo, client_addr, 0); return 0; } -static int get_mem_fault_cpu_index(uint32_t pid) +static int blocktime_get_vcpu(PostcopyBlocktimeContext *ctx, uint32_t tid) { - CPUState *cpu_iter; + int *found; - CPU_FOREACH(cpu_iter) { - if (cpu_iter->thread_id == pid) { - trace_get_mem_fault_cpu_index(cpu_iter->cpu_index, pid); - return cpu_iter->cpu_index; - } + found = g_hash_table_lookup(ctx->tid_to_vcpu_hash, GUINT_TO_POINTER(tid)); + if (!found) { + /* + * NOTE: this is possible, because QEMU's non-vCPU threads can + * also access a missing page. Or, when KVM async pf is enabled, a + * fault can even happen from a kworker.. + */ + return -1; } - trace_get_mem_fault_cpu_index(-1, pid); - return -1; + + return *found; } -static uint32_t get_low_time_offset(PostcopyBlocktimeContext *dc) +static uint64_t get_current_ns(void) { - int64_t start_time_offset = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - - dc->start_time; - return start_time_offset < 1 ? 
1 : start_time_offset & UINT32_MAX; + return (uint64_t)qemu_clock_get_ns(QEMU_CLOCK_REALTIME); +} + +/* + * Inject an (cpu, fault_time) entry into the database, using addr as key. + * When cpu==-1, it means it's a non-vCPU fault. + */ +static void blocktime_fault_inject(PostcopyBlocktimeContext *ctx, + uintptr_t addr, int cpu, uint64_t time) +{ + BlocktimeVCPUEntry *entry = blocktime_vcpu_entry_alloc(cpu, time); + GHashTable *table = ctx->vcpu_addr_hash; + gpointer key = (gpointer)addr; + GList *head, *list; + gboolean result; + + head = g_hash_table_lookup(table, key); + if (head) { + /* + * If existed, steal the @head for list operation rather than + * freeing it, making sure steal succeeded. + */ + result = g_hash_table_steal(table, key); + assert(result == TRUE); + } + + /* + * Now the key is guaranteed to be absent. Two cases: + * + * (1) There's no existing entry, list contains the only one. Insert. + * (2) There're existing entries, after stealing we own it, prepend the + * result and re-insert. + */ + list = g_list_prepend(head, entry); + g_hash_table_insert(table, key, list); + + trace_postcopy_blocktime_begin(addr, time, cpu, !!head); } /* - * This function is being called when pagefault occurs. It - * tracks down vCPU blocking time. + * This function is being called when pagefault occurs. It tracks down vCPU + * blocking time. It's protected by @page_request_mutex. * * @addr: faulted host virtual address * @ptid: faulted process thread id * @rb: ramblock appropriate to addr */ -static void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid, - RAMBlock *rb) +void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid, + RAMBlock *rb) { - int cpu, already_received; + int cpu; MigrationIncomingState *mis = migration_incoming_get_current(); PostcopyBlocktimeContext *dc = mis->blocktime_ctx; - uint32_t low_time_offset; + uint64_t current; if (!dc || ptid == 0) { return; } - cpu = get_mem_fault_cpu_index(ptid); - if (cpu < 0) { - return; + + /* + * The caller should only inject a blocktime entry when the page is + * yet missing. + */ + assert(!ramblock_recv_bitmap_test(rb, (void *)addr)); + + current = get_current_ns(); + cpu = blocktime_get_vcpu(dc, ptid); + + if (cpu >= 0) { + /* How many faults on this vCPU in total? */ + dc->vcpu_faults_count[cpu]++; + + /* + * Account how many concurrent faults on this vCPU we trapped. See + * comments above vcpu_faults_current[] on why it can be more than one. + */ + if (dc->vcpu_faults_current[cpu]++ == 0) { + dc->smp_cpus_down++; + /* + * We use last_begin to cover (1) the 1st fault on this specific + * vCPU, but meanwhile (2) the last vCPU that got blocked. It's + * only used to calculate system-wide blocktime. + */ + dc->last_begin = current; + } + + /* Making sure it won't overflow - it really should never! */ + assert(dc->vcpu_faults_current[cpu] <= 255); + } else { + /* + * For non-vCPU thread faults, we don't care about tid or cpu index + * or time the thread is blocked (e.g., a kworker trying to help + * KVM when async_pf=on is OK to be blocked and not affect guest + * responsiveness), but we care about latency. Track it with + * cpu=-1. + * + * Note that this will NOT affect blocktime reports on vCPU being + * blocked, but only about system-wide latency reports. 
+ */ + dc->non_vcpu_faults++; } - low_time_offset = get_low_time_offset(dc); - if (dc->vcpu_addr[cpu] == 0) { - qatomic_inc(&dc->smp_cpus_down); + blocktime_fault_inject(dc, addr, cpu, current); +} + +static void blocktime_latency_account(PostcopyBlocktimeContext *ctx, + uint64_t time_us) +{ + /* + * Convert time (in us) to bucket index it belongs. Take extra caution + * of time_us==0 even if normally rare - when happens put into bucket 0. + */ + int index = time_us ? (63 - clz64(time_us)) : 0; + + assert(index >= 0); + + /* If it's too large, put into top bucket */ + if (index >= BLOCKTIME_LATENCY_BUCKET_N) { + index = BLOCKTIME_LATENCY_BUCKET_N - 1; } - qatomic_xchg(&dc->last_begin, low_time_offset); - qatomic_xchg(&dc->page_fault_vcpu_time[cpu], low_time_offset); - qatomic_xchg(&dc->vcpu_addr[cpu], addr); + ctx->latency_buckets[index]++; +} + +typedef struct { + PostcopyBlocktimeContext *ctx; + uint64_t current; + int affected_cpus; + int affected_non_cpus; +} BlockTimeVCPUIter; + +static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data) +{ + BlockTimeVCPUIter *iter = user_data; + PostcopyBlocktimeContext *ctx = iter->ctx; + BlocktimeVCPUEntry *entry = data; + uint64_t time_passed; + int cpu = entry->cpu; /* - * check it here, not at the beginning of the function, - * due to, check could occur early than bitmap_set in - * qemu_ufd_copy_ioctl + * Time should never go back.. so when the fault is resolved it must be + * later than when it was faulted. */ - already_received = ramblock_recv_bitmap_test(rb, (void *)addr); - if (already_received) { - qatomic_xchg(&dc->vcpu_addr[cpu], 0); - qatomic_xchg(&dc->page_fault_vcpu_time[cpu], 0); - qatomic_dec(&dc->smp_cpus_down); + assert(iter->current >= entry->fault_time); + time_passed = iter->current - entry->fault_time; + + /* Latency buckets are in microseconds */ + blocktime_latency_account(ctx, time_passed / SCALE_US); + + if (cpu >= 0) { + /* + * If we resolved all pending faults on one vCPU due to this page + * resolution, take a note. + */ + if (--ctx->vcpu_faults_current[cpu] == 0) { + ctx->vcpu_blocktime_total[cpu] += time_passed; + iter->affected_cpus += 1; + } + trace_postcopy_blocktime_end_one(cpu, ctx->vcpu_faults_current[cpu]); + } else { + iter->affected_non_cpus++; + ctx->non_vcpu_blocktime_total += time_passed; + /* + * We do not maintain how many pending non-vCPU faults because we + * do not care about blocktime, only latency. + */ + trace_postcopy_blocktime_end_one(-1, 0); } - trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu], - cpu, already_received); } /* - * This function just provide calculated blocktime per cpu and trace it. - * Total blocktime is calculated in mark_postcopy_blocktime_end. - * + * This function just provide calculated blocktime per cpu and trace it. + * Total blocktime is calculated in mark_postcopy_blocktime_end. It's + * protected by @page_request_mutex. 
* * Assume we have 3 CPU * @@ -899,48 +1218,45 @@ static void mark_postcopy_blocktime_end(uintptr_t addr) PostcopyBlocktimeContext *dc = mis->blocktime_ctx; MachineState *ms = MACHINE(qdev_get_machine()); unsigned int smp_cpus = ms->smp.cpus; - int i, affected_cpu = 0; - bool vcpu_total_blocktime = false; - uint32_t read_vcpu_time, low_time_offset; + BlockTimeVCPUIter iter = { + .current = get_current_ns(), + .affected_cpus = 0, + .affected_non_cpus = 0, + .ctx = dc, + }; + gpointer key = (gpointer)addr; + GHashTable *table; + GList *list; if (!dc) { return; } - low_time_offset = get_low_time_offset(dc); - /* lookup cpu, to clear it, - * that algorithm looks straightforward, but it's not - * optimal, more optimal algorithm is keeping tree or hash - * where key is address value is a list of */ - for (i = 0; i < smp_cpus; i++) { - uint32_t vcpu_blocktime = 0; - - read_vcpu_time = qatomic_fetch_add(&dc->page_fault_vcpu_time[i], 0); - if (qatomic_fetch_add(&dc->vcpu_addr[i], 0) != addr || - read_vcpu_time == 0) { - continue; - } - qatomic_xchg(&dc->vcpu_addr[i], 0); - vcpu_blocktime = low_time_offset - read_vcpu_time; - affected_cpu += 1; - /* we need to know is that mark_postcopy_end was due to - * faulted page, another possible case it's prefetched - * page and in that case we shouldn't be here */ - if (!vcpu_total_blocktime && - qatomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) { - vcpu_total_blocktime = true; - } - /* continue cycle, due to one page could affect several vCPUs */ - dc->vcpu_blocktime[i] += vcpu_blocktime; + table = dc->vcpu_addr_hash; + /* the address wasn't tracked at all? */ + list = g_hash_table_lookup(table, key); + if (!list) { + return; } - qatomic_sub(&dc->smp_cpus_down, affected_cpu); - if (vcpu_total_blocktime) { - dc->total_blocktime += low_time_offset - qatomic_fetch_add( - &dc->last_begin, 0); + /* + * Loop over the set of vCPUs that got blocked on this addr, do the + * blocktime accounting. After that, remove the whole list. + */ + g_list_foreach(list, blocktime_cpu_list_iter_fn, &iter); + g_hash_table_remove(table, key); + + /* + * If all vCPUs used to be down, and copying this page would free some + * vCPUs, then the system-level blocktime ends here. 
+ */ + if (dc->smp_cpus_down == smp_cpus && iter.affected_cpus) { + dc->total_blocktime += iter.current - dc->last_begin; } - trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime, - affected_cpu); + dc->smp_cpus_down -= iter.affected_cpus; + + trace_postcopy_blocktime_end(addr, iter.current, iter.affected_cpus, + iter.affected_non_cpus); } static void postcopy_pause_fault_thread(MigrationIncomingState *mis) @@ -1068,17 +1384,14 @@ static void *postcopy_ram_fault_thread(void *opaque) qemu_ram_get_idstr(rb), rb_offset, msg.arg.pagefault.feat.ptid); - mark_postcopy_blocktime_begin( - (uintptr_t)(msg.arg.pagefault.address), - msg.arg.pagefault.feat.ptid, rb); - retry: /* * Send the request to the source - we want to request one * of our host page sizes (which is >= TPS) */ ret = postcopy_request_page(mis, rb, rb_offset, - msg.arg.pagefault.address); + msg.arg.pagefault.address, + msg.arg.pagefault.feat.ptid); if (ret) { /* May be network failure, try to wait for recovery */ postcopy_pause_fault_thread(mis); @@ -1221,6 +1534,11 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis) return -1; } + if (migrate_postcopy_blocktime()) { + assert(mis->blocktime_ctx == NULL); + mis->blocktime_ctx = blocktime_context_new(); + } + /* Now an eventfd we use to tell the fault-thread to quit */ mis->userfault_event_fd = eventfd(0, EFD_CLOEXEC); if (mis->userfault_event_fd == -1) { @@ -1299,8 +1617,8 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr, qemu_cond_signal(&mis->page_request_cond); } } - qemu_mutex_unlock(&mis->page_request_mutex); mark_postcopy_blocktime_end((uintptr_t)host_addr); + qemu_mutex_unlock(&mis->page_request_mutex); } return ret; } @@ -1430,6 +1748,11 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, { g_assert_not_reached(); } + +void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid, + RAMBlock *rb) +{ +} #endif /* ------------------------------------------------------------------------- */ diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h index a6df1b2..3852141 100644 --- a/migration/postcopy-ram.h +++ b/migration/postcopy-ram.h @@ -196,5 +196,7 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file); void postcopy_preempt_setup(MigrationState *s); int postcopy_preempt_establish_channel(MigrationState *s); bool postcopy_is_paused(MigrationStatus status); +void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid, + RAMBlock *rb); #endif diff --git a/migration/ram.c b/migration/ram.c index 2140785..7208bc1 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -835,8 +835,10 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs, * protections isn't needed as we know there will be either (1) no * further writes if migration will complete, or (2) migration fails * at last then tracking isn't needed either. + * + * Do the same for postcopy due to the same reason. */ - if (!rs->last_stage) { + if (!rs->last_stage && !migration_in_postcopy()) { /* * Clear dirty bitmap if needed. 
diff --git a/migration/ram.c b/migration/ram.c
index 2140785..7208bc1 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -835,8 +835,10 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
      * protections isn't needed as we know there will be either (1) no
      * further writes if migration will complete, or (2) migration fails
      * at last then tracking isn't needed either.
+     *
+     * Do the same for postcopy due to the same reason.
      */
-    if (!rs->last_stage) {
+    if (!rs->last_stage && !migration_in_postcopy()) {
         /*
          * Clear dirty bitmap if needed.  This _must_ be called before we
          * send any of the page in the chunk because we need to make sure
@@ -2286,16 +2288,18 @@ static int ram_find_and_save_block(RAMState *rs)
         if (!get_queued_page(rs, pss)) {
             /* priority queue empty, so just search for something dirty */
             int res = find_dirty_block(rs, pss);
-            if (res != PAGE_DIRTY_FOUND) {
-                if (res == PAGE_ALL_CLEAN) {
-                    break;
-                } else if (res == PAGE_TRY_AGAIN) {
-                    continue;
-                } else if (res < 0) {
-                    pages = res;
-                    break;
-                }
+
+            if (res == PAGE_ALL_CLEAN) {
+                break;
+            } else if (res == PAGE_TRY_AGAIN) {
+                continue;
+            } else if (res < 0) {
+                pages = res;
+                break;
             }
+
+            /* Otherwise we must have a dirty page to move */
+            assert(res == PAGE_DIRTY_FOUND);
         }
         pages = ram_save_host_page(rs, pss);
         if (pages) {
@@ -3288,6 +3292,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     RAMState *rs = *temp;
     int ret = 0;

+    trace_ram_save_complete(rs->migration_dirty_pages, 0);
+
     rs->last_stage = !migration_in_colo_state();

     WITH_RCU_READ_LOCK_GUARD() {
@@ -3351,6 +3357,9 @@
     }

     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+
+    trace_ram_save_complete(rs->migration_dirty_pages, 1);
+
     return qemu_fflush(f);
 }
@@ -4548,8 +4557,7 @@ void postcopy_preempt_shutdown_file(MigrationState *s)
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
-    .save_live_complete_postcopy = ram_save_complete,
-    .save_live_complete_precopy = ram_save_complete,
+    .save_complete = ram_save_complete,
     .has_postcopy = ram_has_postcopy,
     .state_pending_exact = ram_state_pending_exact,
     .state_pending_estimate = ram_state_pending_estimate,
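The handler change above is the user-visible core of the save_complete unification: ram registered the same function for both completion hooks, and other iterable devices carried the same duplication. A sketch of what a hypothetical device registration looks like after the change (the mydev_* names are invented for illustration; unrelated callbacks elided):

    /* A hypothetical iterable device, written against the new interface. */
    static int mydev_save_complete(QEMUFile *f, void *opaque)
    {
        /* flush the final chunk of device state here */
        return 0;
    }

    static bool mydev_has_postcopy(void *opaque)
    {
        return true;
    }

    static SaveVMHandlers mydev_handlers = {
        /* .save_setup / .save_live_iterate / ... as before */
        .save_complete = mydev_save_complete,   /* replaces both old hooks */
        .has_postcopy  = mydev_has_postcopy,
    };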
diff --git a/migration/savevm.c b/migration/savevm.c
index bb04a45..fabbeb2 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1484,37 +1484,54 @@ bool should_send_vmdesc(void)
     return !machine->suppress_vmdesc;
 }

+static bool qemu_savevm_complete_exists(SaveStateEntry *se)
+{
+    return se->ops && se->ops->save_complete;
+}
+
+/*
+ * Invoke the ->save_complete() if necessary.
+ * Returns: 0 if the SE was skipped or completed successfully, <0 on error.
+ */
+static int qemu_savevm_complete(SaveStateEntry *se, QEMUFile *f)
+{
+    int ret;
+
+    if (se->ops->is_active) {
+        if (!se->ops->is_active(se->opaque)) {
+            return 0;
+        }
+    }
+
+    trace_savevm_section_start(se->idstr, se->section_id);
+    save_section_header(f, se, QEMU_VM_SECTION_END);
+    ret = se->ops->save_complete(f, se->opaque);
+    trace_savevm_section_end(se->idstr, se->section_id, ret);
+    save_section_footer(f, se);
+
+    if (ret < 0) {
+        qemu_file_set_error(f, ret);
+    }
+
+    return ret;
+}
+
 /*
- * Calls the save_live_complete_postcopy methods
- * causing the last few pages to be sent immediately and doing any associated
- * cleanup.
+ * Complete saving any postcopy-able devices.
+ *
  * Note postcopy also calls qemu_savevm_state_complete_precopy to complete
  * all the other devices, but that happens at the point we switch to postcopy.
  */
 void qemu_savevm_state_complete_postcopy(QEMUFile *f)
 {
     SaveStateEntry *se;
-    int ret;

     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (!se->ops || !se->ops->save_live_complete_postcopy) {
+        if (!qemu_savevm_complete_exists(se)) {
             continue;
         }
-        if (se->ops->is_active) {
-            if (!se->ops->is_active(se->opaque)) {
-                continue;
-            }
-        }
-        trace_savevm_section_start(se->idstr, se->section_id);
-        /* Section type */
-        qemu_put_byte(f, QEMU_VM_SECTION_END);
-        qemu_put_be32(f, se->section_id);
-        ret = se->ops->save_live_complete_postcopy(f, se->opaque);
-        trace_savevm_section_end(se->idstr, se->section_id, ret);
-        save_section_footer(f, se);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
+
+        if (qemu_savevm_complete(se, f) < 0) {
             return;
         }
     }
@@ -1560,20 +1577,19 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 {
     int64_t start_ts_each, end_ts_each;
     SaveStateEntry *se;
-    int ret;
     bool multifd_device_state = multifd_device_state_supported();

     if (multifd_device_state) {
         QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-            SaveLiveCompletePrecopyThreadHandler hdlr;
+            SaveCompletePrecopyThreadHandler hdlr;

             if (!se->ops ||
                 (in_postcopy && se->ops->has_postcopy &&
                  se->ops->has_postcopy(se->opaque)) ||
-                !se->ops->save_live_complete_precopy_thread) {
+                !se->ops->save_complete_precopy_thread) {
                 continue;
             }
-            hdlr = se->ops->save_live_complete_precopy_thread;
+            hdlr = se->ops->save_complete_precopy_thread;
             multifd_spawn_device_state_save_thread(hdlr,
                                                    se->idstr, se->instance_id,
                                                    se->opaque);
@@ -1581,32 +1597,25 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (!se->ops ||
-            (in_postcopy && se->ops->has_postcopy &&
-             se->ops->has_postcopy(se->opaque)) ||
-            !se->ops->save_live_complete_precopy) {
+        if (!qemu_savevm_complete_exists(se)) {
             continue;
         }
-        if (se->ops->is_active) {
-            if (!se->ops->is_active(se->opaque)) {
-                continue;
-            }
+
+        if (in_postcopy && se->ops->has_postcopy &&
+            se->ops->has_postcopy(se->opaque)) {
+            /*
+             * If postcopy will start soon, and if the SE supports
+             * postcopy, then we can skip the SE here; it will be
+             * completed during the postcopy phase instead.
+             */
+            continue;
         }

         start_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
-        trace_savevm_section_start(se->idstr, se->section_id);
-
-        save_section_header(f, se, QEMU_VM_SECTION_END);
-
-        ret = se->ops->save_live_complete_precopy(f, se->opaque);
-        trace_savevm_section_end(se->idstr, se->section_id, ret);
-        save_section_footer(f, se);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
+        if (qemu_savevm_complete(se, f) < 0) {
             goto ret_fail_abort_threads;
         }
         end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+
         trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
                                     end_ts_each - start_ts_each);
     }
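Pulling the header/footer bookkeeping into qemu_savevm_complete() leaves the two call sites with only their policy differences: the postcopy path completes every SE that has a ->save_complete(), while the precopy path additionally skips SEs that will keep running into postcopy. Condensed into one predicate (a sketch of the logic above, not a function that exists in the tree):

    /* Should ->save_complete() run for this SE during precopy completion? */
    static bool complete_in_precopy(SaveStateEntry *se, bool in_postcopy)
    {
        if (!se->ops || !se->ops->save_complete) {
            return false;   /* nothing to invoke */
        }
        if (in_postcopy && se->ops->has_postcopy &&
            se->ops->has_postcopy(se->opaque)) {
            return false;   /* deferred to the postcopy completion phase */
        }
        return true;        /* is_active() is still checked at call time */
    }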
diff --git a/migration/trace-events b/migration/trace-events
index c506e11..706db97 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -105,6 +105,7 @@ ram_load_postcopy_loop(int channel, uint64_t addr, int flags) "chan=%d addr=0x%"
 ram_postcopy_send_discard_bitmap(void) ""
 ram_save_page(const char *rbname, uint64_t offset, void *host) "%s: offset: 0x%" PRIx64 " host: %p"
 ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: 0x%zx len: 0x%zx"
+ram_save_complete(uint64_t dirty_pages, int done) "dirty=%" PRIu64 ", done=%d"
 ram_dirty_bitmap_request(char *str) "%s"
 ram_dirty_bitmap_reload_begin(char *str) "%s"
 ram_dirty_bitmap_reload_complete(char *str) "%s"
@@ -284,8 +285,6 @@ postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t
 postcopy_place_page(void *host_addr) "host=%p"
 postcopy_place_page_zero(void *host_addr) "host=%p"
 postcopy_ram_enable_notify(void) ""
-mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu, int received) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d, already_received: %d"
-mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"
 postcopy_pause_fault_thread(void) ""
 postcopy_pause_fault_thread_continued(void) ""
 postcopy_pause_fast_load(void) ""
@@ -309,8 +308,10 @@ postcopy_preempt_tls_handshake(void) ""
 postcopy_preempt_new_channel(void) ""
 postcopy_preempt_thread_entry(void) ""
 postcopy_preempt_thread_exit(void) ""
-
-get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
+postcopy_blocktime_tid_cpu_map(int cpu, uint32_t tid) "cpu: %d, tid: %u"
+postcopy_blocktime_begin(uint64_t addr, uint64_t time, int cpu, bool exists) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", cpu: %d, exist: %d"
+postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu, int affected_non_cpus) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d, affected_non_cpus: %d"
+postcopy_blocktime_end_one(int cpu, uint8_t left_faults) "cpu: %d, left_faults: %" PRIu8

 # exec.c
 migration_exec_outgoing(const char *cmd) "cmd=%s"
diff --git a/qapi/migration.json b/qapi/migration.json
index e8a7d3b..2d39a8f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -236,6 +236,31 @@
 #     This is only present when the postcopy-blocktime migration
 #     capability is enabled.  (Since 3.0)
 #
+# @postcopy-latency: average remote page fault latency (in ns).  Note
+#     that this doesn't include all faults, but only the ones that
+#     require a remote page request, so it should always be larger than
+#     the true average page fault latency.  This is only present when
+#     the postcopy-blocktime migration capability is enabled.
+#     (Since 10.1)
+#
+# @postcopy-latency-dist: remote page fault latency distribution.  Each
+#     element of the array is the number of faults that fall into that
+#     bucket's latency window.  For the N-th bucket (N>=0), the window
+#     is [2^N us, 2^(N+1) us).  For example, the 8th element stores how
+#     many remote faults got resolved within [256us, 512us).  This is
+#     only present when the postcopy-blocktime migration capability is
+#     enabled.  (Since 10.1)
+#
+# @postcopy-vcpu-latency: average remote page fault latency per vCPU
+#     (in ns).  It has the same definition as @postcopy-latency, except
+#     that these are per-vCPU statistics.  This is only present when
+#     the postcopy-blocktime migration capability is enabled.
+#     (Since 10.1)
+#
+# @postcopy-non-vcpu-latency: average remote page fault latency for all
+#     faults that happened in non-vCPU threads (in ns).  It has the
+#     same definition as @postcopy-latency, but covers only non-vCPU
+#     faults.  This is only present when the postcopy-blocktime
+#     migration capability is enabled.  (Since 10.1)
+#
 # @socket-address: Only used for tcp, to know what the real port is
 #     (Since 4.0)
 #
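A worked example makes the bucket rule concrete: a fault resolved in 300us satisfies 2^8 = 256 <= 300 < 512 = 2^9, so it is counted in element 8, matching the [256us, 512us) example in the doc text. A sketch of the index computation (hypothetical helper, not taken from the patch; sub-microsecond faults are lumped into bucket 0 here as an assumption):

    #include <stdint.h>

    /* Map a fault latency (ns) to its distribution bucket: bucket N
     * counts faults resolved within [2^N us, 2^(N+1) us). */
    static unsigned latency_bucket(uint64_t latency_ns)
    {
        uint64_t us = latency_ns / 1000;
        unsigned n = 0;

        while (us > 1) {
            us >>= 1;   /* each halving steps down one power of two */
            n++;
        }
        return n;       /* e.g. 300000 ns -> 300 us -> bucket 8 */
    }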
@@ -260,6 +285,11 @@
 #     average memory load of the virtual CPU indirectly.  Note that
 #     zero means guest doesn't dirty memory.  (Since 8.1)
 #
+# Features:
+#
+# @unstable: Members @postcopy-latency, @postcopy-vcpu-latency,
+#     @postcopy-latency-dist and @postcopy-non-vcpu-latency are
+#     experimental.
+#
 # Since: 0.14
 ##
 { 'struct': 'MigrationInfo',
@@ -275,6 +305,14 @@
   '*blocked-reasons': ['str'],
   '*postcopy-blocktime': 'uint32',
   '*postcopy-vcpu-blocktime': ['uint32'],
+  '*postcopy-latency': {
+      'type': 'uint64', 'features': [ 'unstable' ] },
+  '*postcopy-latency-dist': {
+      'type': ['uint64'], 'features': [ 'unstable' ] },
+  '*postcopy-vcpu-latency': {
+      'type': ['uint64'], 'features': [ 'unstable' ] },
+  '*postcopy-non-vcpu-latency': {
+      'type': 'uint64', 'features': [ 'unstable' ] },
   '*socket-address': ['SocketAddress'],
   '*dirty-limit-throttle-time-per-round': 'uint64',
   '*dirty-limit-ring-full-time': 'uint64'} }
diff --git a/tests/qtest/migration/migration-qmp.c b/tests/qtest/migration/migration-qmp.c
index fb59741..66dd369 100644
--- a/tests/qtest/migration/migration-qmp.c
+++ b/tests/qtest/migration/migration-qmp.c
@@ -358,6 +358,11 @@ void read_blocktime(QTestState *who)

     rsp_return = migrate_query_not_failed(who);
     g_assert(qdict_haskey(rsp_return, "postcopy-blocktime"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-blocktime"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-latency"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-latency-dist"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-latency"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-non-vcpu-latency"));
     qobject_unref(rsp_return);
 }
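For reference, this is roughly how the new fields surface in a query-migrate response while the postcopy-blocktime capability is enabled; the exchange below is illustrative only, with made-up values (latencies in ns):

    -> { "execute": "query-migrate" }
    <- { "return": {
           "status": "postcopy-active",
           "postcopy-blocktime": 12,
           "postcopy-vcpu-blocktime": [3, 0, 9],
           "postcopy-latency": 181000,
           "postcopy-latency-dist": [0, 0, 0, 0, 5, 41, 370, 280, 36, 2],
           "postcopy-vcpu-latency": [170000, 0, 190000],
           "postcopy-non-vcpu-latency": 210000,
           ... } }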