|
Define the cpr_is_incoming helper, to be used in several cpr fix patches.
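For illustration, a minimal sketch of what such a helper could look like,
assuming CPR records the incoming mode it loaded state for (the variable
and where it gets set are hypothetical):
  /* migration/cpr.c (sketch): incoming_mode set during CPR state load */
  static MigMode incoming_mode = MIG_MODE_NONE;

  bool cpr_is_incoming(void)
  {
      /* true only when this QEMU was started from a CPR transfer */
      return incoming_mode != MIG_MODE_NONE;
  }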
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <1741380954-341079-2-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Address an error in RDMA-based migration by ensuring RDMA is prioritized
when saving pages in `ram_save_target_page()`.
Previously, the RDMA protocol's page-saving step was placed after the
other protocols due to a refactoring in commit bc38dc2f5f3. This led to
migration failures with unknown control messages and state-loading errors.
destination:
(qemu) qemu-system-x86_64: Unknown control message QEMU FILE
qemu-system-x86_64: error while loading state section id 1(ram)
qemu-system-x86_64: load of migration failed: Operation not permitted
source:
(qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
qemu-system-x86_64: rdma migration: recv polling control error!
qemu-system-x86_64: warning: Early error. Sending error.
qemu-system-x86_64: warning: rdma migration: send polling control error
RDMA migration implements its own protocol to send pages to the
destination side, so hand over to RDMA first to prevent pages from being
saved by another protocol.
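A sketch of the fixed ordering in ram_save_target_page() (helper names
are illustrative, not the exact ones in the tree):
  static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
  {
      /* RDMA speaks its own wire protocol and must see the page
       * first; any other path would emit stream data that the RDMA
       * peer cannot parse (the errors quoted above). */
      if (migrate_rdma()) {
          return rdma_save_target_page(rs, pss);    /* hypothetical */
      }
      if (migrate_multifd()) {
          return ram_save_multifd_page(rs, pss);    /* hypothetical */
      }
      return ram_save_page_legacy(rs, pss);         /* hypothetical */
  }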
Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-2-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Unlike cpr-reboot mode, cpr-transfer mode cannot save volatile ram blocks
in the migration stream file and recreate them later, because the physical
memory for the blocks is pinned and registered for vfio. Add a blocker
for volatile ram blocks.
Also add a blocker for RAM_GUEST_MEMFD. Preserving guest_memfd may be
sufficient for CPR, but it has not been tested yet.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <1740667681-257312-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
On the incoming migration side, QEMU uses a coroutine to load all the VM
states. Inside, it may reference MigrationState global state such as
migration capabilities, parameters, error state, shared mutexes and more.
However, there's nothing yet to make sure MigrationState won't get
destroyed (e.g. after migration_shutdown()). Meanwhile, there's also no
API available to remove the incoming coroutine in migration_shutdown(),
to prevent it from accessing the freed elements.
There's a bug report showing this can happen and crash the destination
QEMU when migration is cancelled on the source.
When it happens, the dest main thread is trying to clean up everything:
#0 qemu_aio_coroutine_enter
#1 aio_dispatch_handler
#2 aio_poll
#3 monitor_cleanup
#4 qemu_cleanup
#5 qemu_default_main
It then finds the incoming migration coroutine and schedules it (even
after migration_shutdown()), causing a crash:
#0 __pthread_kill_implementation
#1 __pthread_kill_internal
#2 __GI_raise
#3 __GI_abort
#4 __assert_fail_base
#5 __assert_fail
#6 qemu_mutex_lock_impl
#7 qemu_lockable_mutex_lock
#8 qemu_lockable_lock
#9 qemu_lockable_auto_lock
#10 migrate_set_error
#11 process_incoming_migration_co
#12 coroutine_trampoline
To fix it, take a refcount after the incoming setup is properly done,
when qmp_migrate_incoming() succeeds the first time. As this happens
during a QMP handler, which needs the BQL, the main loop must still be
alive (it hasn't gone into the cleanups, which also need the BQL).
The refcount is only released once the incoming migration coroutine
finishes or fails. Hence the refcount covers both (1) the setup phase of
the incoming ports, mostly IO watches (e.g. qio_channel_add_watch_full()),
and (2) the incoming coroutine itself (process_incoming_migration_co()).
Note that we can't unref in migration_incoming_state_destroy(), because
both qmp_xen_load_devices_state() and load_snapshot() use it without an
incoming migration. Those hold the BQL, so they're not prone to this issue.
PS: I suspect nobody uses Xen's command at all, as it didn't register
yank; hence, AFAIU, the command should crash on master when trying to
unregister yank in migration_incoming_state_destroy()... but that's
another story.
Also note that in some incoming failure cases we may not always unref the
MigrationState refcount; that's a trade-off to keep things simple. We
could make it accurate, but it would be overkill. Some examples:
- Unlike most of the other protocols, socket_start_incoming_migration()
may create the net listener after the incoming port is set up
successfully. It means we can't unref in
migration_channel_process_incoming() as a generic path, because the
socket protocol might keep using MigrationState.
- For either socket or file, multiple IO watches might be created, which
logically means each IO watch would need to take one refcount on
MigrationState to be 100% accurate about ownership of the refcount taken.
In general, we would at least need per-protocol handling to make it
accurate, which can be overkill when we know the incoming side failed
after all. Add a short comment to explain that when taking the refcount
in qmp_migrate_incoming().
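The core of the fix is then a single reference, sketched here inside
qmp_migrate_incoming() (exact placement is an assumption):
  /* Pin MigrationState for the incoming side: the IO watches and
   * process_incoming_migration_co() may otherwise outlive main-loop
   * cleanup.  Dropped when the coroutine finishes or fails; some
   * failure paths intentionally skip the unref, as noted above. */
  object_ref(OBJECT(migrate_get_current()));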
Bugzilla: https://issues.redhat.com/browse/RHEL-69775
Tested-by: Yan Fu <yafu@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250220132459.512610-1-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Automatic memory management helps avoid memory safety issues.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/2fd01d773a783d572dcf538a064a98cc09e75c12.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
This SaveVMHandler lets a device provide its own asynchronous
transmission of the remaining data at the end of the precopy phase via
multifd channels, in parallel with the transfer done by the
save_live_complete_precopy handlers.
These threads are launched only when multifd device state transfer is
supported.
Management of these threads is done in the multifd migration code,
wrapping them in the generic thread pool.
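For illustration, the handler could be shaped roughly like this (types
and signature reconstructed as an assumption; the authoritative
definition lives in include/migration/register.h):
  /* Sketch only - exact field set and signature may differ */
  typedef struct SaveLiveCompletePrecopyThreadData {
      void *handler_opaque;   /* device state, e.g. a VFIODevice */
      char *idstr;            /* SaveStateEntry id string */
      uint32_t instance_id;
  } SaveLiveCompletePrecopyThreadData;

  /* Runs in a pool thread; returns false with *errp set on failure */
  typedef bool (*SaveLiveCompletePrecopyThreadHandler)(
      SaveLiveCompletePrecopyThreadData *d, Error **errp);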
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/qemu-devel/eac74a4ca7edd8968bbf72aa07b9041c76364a16.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of it.
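A sketch of such a query helper, assuming it simply combines the
existing option checks (the exact set of checks is an assumption):
  /* Can a device use multifd device state transfer right now? */
  bool multifd_device_state_supported(void)
  {
      return migrate_multifd() &&
             !migrate_mapped_ram() &&   /* mapped-ram has no packets */
             migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
  }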
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/1ff0d98b85f470e5a33687406e877583b8fab74e.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
The newly introduced device state buffer can be used for storing VFIO's
read() raw data, but it is already also possible to store generic device
state in it. After noticing that device state may not easily provide a
maximum buffer size (and also that the RAM MultiFDPages_t wants
flexibility in managing its offset[] array), it may not be a good idea to
stick with a union for MultiFDSendData, as a union won't play well with
such flexibility.
Switch MultiFDSendData to a struct.
It won't consume a lot more space in reality; after all, the real buffers
were already dynamically allocated, so it's only the two structs (pages,
device_state) that get duplicated, and they're small.
With this, we can remove the pretty hard to understand allocation-size
logic, because now we can allocate offset[] together with the SendData
and properly free it when the SendData is freed.
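The resulting layout is roughly the sketch below; the RAM payload goes
last because its flexible offset[] array is allocated together with the
SendData (field names assumed):
  /* Sketch: MultiFDSendData as a plain struct instead of a union */
  struct MultiFDSendData {
      MultiFDPayloadType type;
      MultiFDDeviceState_t device_state;
      /* Must be last: ends in a flexible offset[] array that is
       * allocated together with the SendData itself. */
      MultiFDPages_t ram;
  };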
[MSS: Make sure to clear possible device state payload before freeing
MultiFDSendData, remove placeholders for other patches not included]
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/qemu-devel/7b02baba8e6ddb23ef7c349d312b9b631db09d7e.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
A new function, multifd_queue_device_state(), is provided for a device to
queue its state for transmission via a multifd channel.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/ebd55768d3e5fecb5eb3f197bad9c0c07e5bc084.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
This way, if there are fields that need explicit disposal (like, for
example, some attached buffers), they will be handled appropriately.
Add a related assert to multifd_set_payload_type() in order to make sure
that this function is only used to fill a previously empty MultiFDSendData
with some payload, not the other way around.
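A sketch of the helper with the new assertion, close to what the text
describes:
  void multifd_set_payload_type(MultiFDSendData *data,
                                MultiFDPayloadType type)
  {
      assert(data->type == MULTIFD_PAYLOAD_NONE);   /* must be empty */
      assert(type != MULTIFD_PAYLOAD_NONE);         /* no clearing here */
      data->type = type;
  }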
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/6755205f2b95abbed251f87061feee1c0e410836.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
The multifd_send() function is currently not thread safe; make it thread
safe by holding a lock during its execution.
This way it will be possible to safely call it concurrently from multiple
threads.
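Conceptually the change looks like this sketch (the mutex name is
hypothetical; QEMU_LOCK_GUARD is the tree's scoped-lock macro):
  static QemuMutex multifd_send_mutex;   /* init'd during multifd setup */

  bool multifd_send(MultiFDSendData **send_data)
  {
      /* Channel selection and the payload swap must be atomic
       * with respect to other concurrent senders. */
      QEMU_LOCK_GUARD(&multifd_send_mutex);
      /* ... pick an idle channel, swap *send_data into it, kick it ... */
      return true;
  }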
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/dd0f3bcc02ca96a7d523ca58ea69e495a33b453b.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.
Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is read.
The received device state data is provided to the
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
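The recv-side dispatch then looks roughly like this sketch
(idstr/instance_id/buf are assumed to have been parsed from the device
state packet; the return convention of qemu_loadvm_load_state_buffer()
is an assumption):
  /* Sketch: pick the parser based on the packet type flag */
  if (flags & MULTIFD_FLAG_DEVICE_STATE) {
      /* MultiFDPacketDeviceState_t: hand the payload to the device */
      if (!qemu_loadvm_load_state_buffer(idstr, instance_id,
                                         buf, buf_len, &local_err)) {
          goto fail;   /* error path elided */
      }
  } else {
      /* MultiFDPacket_t: the existing RAM data path */
  }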
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/9b86f806c134e7815ecce0eee84f0e0e34aa0146.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Read the packet header first so that, in the future, we will be able to
differentiate between a RAM multifd packet and a device state multifd
packet.
Since these two are of different sizes, we can't read the packet body
until we know which packet type it is.
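A sketch of the resulting two-phase read (the header struct and its
field names are assumptions):
  /* Phase 1: fixed-size common header - magic, version, flags */
  MultiFDPacketHdr_t hdr;
  ret = qio_channel_read_all_eof(p->c, (char *)&hdr, sizeof(hdr),
                                 &local_err);
  if (ret <= 0) {
      return ret;   /* EOF or error */
  }
  /* Phase 2: only now do we know which body size to expect */
  if (be32_to_cpu(hdr.flags) & MULTIFD_FLAG_DEVICE_STATE) {
      /* read the MultiFDPacketDeviceState_t body */
  } else {
      /* read the MultiFDPacket_t body */
  }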
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/832ad055fe447561ac1ad565d61658660cb3f63f.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Some drivers might want to make use of auxiliary helper threads during VM
state loading, for example to make sure that their blocking (sync) I/O
operations don't block the rest of the migration process.
Add a thread pool managed by the migration core to facilitate this use
case. The migration core will wait for these threads to finish before
(re)starting the VM at the destination.
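Usage from a device could look like this sketch (the entry-point name
qemu_loadvm_start_load_thread() and its prototype are assumptions here):
  /* Sketch: runs in a migration-managed pool thread */
  static bool vfio_load_bufs_thread(void *opaque, bool *should_quit,
                                    Error **errp)
  {
      /* ... blocking (sync) I/O on the device goes here ... */
      return true;   /* false with *errp set on failure */
  }

  /* The core joins these threads before (re)starting the VM */
  qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);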
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/b09fd70369b6159c75847e69f235cb908b02570c.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
All callers of migration_incoming_state_destroy() other than
postcopy_ram_listen_thread() make this call with the BQL held.
Since migration_incoming_state_destroy() ultimately calls the
"load_cleanup" SaveVMHandlers, and it will soon call BQL-sensitive code,
it makes sense to always call that function under the BQL rather than
have it deal with both cases (with and without the BQL).
Add the necessary bql_lock() and bql_unlock() to
postcopy_ram_listen_thread().
qemu_loadvm_state_main() in postcopy_ram_listen_thread() could call
"load_state" SaveVMHandlers that expect the BQL to be held.
In principle, the only devices that should arrive on the migration
channel serviced by postcopy_ram_listen_thread() are those that are
postcopiable and whose load handlers are safe to call without the BQL
held.
But nothing currently prevents the source from sending data for "unsafe"
devices, which would cause trouble there.
Add a TODO comment there so it's clear that it would be good to improve
the handling of such an (erroneous) case in the future.
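The addition boils down to this sketch around the listen thread's
cleanup:
  /* Sketch: postcopy_ram_listen_thread() now holds the BQL here */
  bql_lock();
  /* "load_cleanup" SaveVMHandlers (and soon more BQL-sensitive
   * code) run inside; all other callers already hold the BQL. */
  migration_incoming_state_destroy();
  bql_unlock();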
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/21bb5ca337b1d5a802e697f553f37faf296b5ff4.1741193259.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
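Reconstructed shape of the hook and the dispatch helper (signatures are
assumptions; see include/migration/register.h and savevm.c for the real
ones):
  /* Sketch: per-device hook in SaveVMHandlers */
  bool (*load_state_buffer)(void *opaque, char *buf, size_t len,
                            Error **errp);

  /* Sketch: look up the SaveStateEntry by idstr + instance_id,
   * then call its load_state_buffer hook with the buffer */
  bool qemu_loadvm_load_state_buffer(const char *idstr,
                                     uint32_t instance_id,
                                     char *buf, size_t len, Error **errp);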
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/71ca753286b87831ced4afd422e2e2bed071af25.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler
are used to mark the switchover point in the main migration stream.
They can be used to inform the destination that all pre-switchover main
migration stream data has been sent/received, so it can start to process
post-switchover data that it might have received via other migration
channels, like the multifd ones.
Also add the relevant MigrationState bit stream compatibility property
and its hw_compat entry.
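On the source side, emitting the marker can be as small as this sketch
(qemu_savevm_command_send() is the existing QEMU_VM_COMMAND helper; the
command constant's name is an assumption):
  /* Sketch: send the switchover marker down the main stream */
  qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER_START, 0, NULL);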
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhang Chen <zhangckid@gmail.com> # for the COLO part
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/311be6da85fc7e49a7598684d80aa631778dcbce.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Memory pull request for 10.0
v2 changelog:
- Fix Mac (and possibly some other) build issues for two patches
- os: add an ability to lock memory on_fault
- memory: pass MemTxAttrs to memory_access_is_direct()
List of features:
- William's fix on ram hole punching when with file offset
- Daniil's patchset to introduce mem-lock=on-fault
- William's hugetlb hwpoison fix for size report & remap
- David's series to allow qemu debug writes to MMIOs
# gpg: Signature made Thu 13 Feb 2025 01:37:04 HKT
# gpg: using EDDSA key B9184DC20CC457DACF7DD1A93B5FCCCDF3ABD706
# gpg: issuer "peterx@redhat.com"
# gpg: Good signature from "Peter Xu <xzpeter@gmail.com>" [full]
# gpg: aka "Peter Xu <peterx@redhat.com>" [full]
# Primary key fingerprint: B918 4DC2 0CC4 57DA CF7D D1A9 3B5F CCCD F3AB D706
* tag 'mem-next-pull-request' of https://gitlab.com/peterx/qemu:
overcommit: introduce mem-lock=on-fault
system: introduce a new MlockState enum
system/vl: extract overcommit option parsing into a helper
os: add an ability to lock memory on_fault
system/physmem: poisoned memory discard on reboot
system/physmem: handle hugetlb correctly in qemu_ram_remap()
physmem: teach cpu_memory_rw_debug() to write to more memory regions
hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa()
memory: pass MemTxAttrs to memory_access_is_direct()
physmem: disallow direct access to RAM DEVICE in address_space_write_rom()
physmem: factor out direct access check into memory_region_supports_direct_access()
physmem: factor out RAM/ROMD check in memory_access_is_direct()
physmem: factor out memory_region_is_ram_device() check in memory_access_is_direct()
system/physmem: take into account fd_offset for file fallocate
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
|
|
qmp_migrate guarantees that cpr_channel is not null for
MIG_MODE_CPR_TRANSFER when cpr_state_save is called:
  qmp_migrate()
      if (s->parameters.mode == MIG_MODE_CPR_TRANSFER && !cpr_channel) {
          return;
      }
      cpr_state_save(cpr_channel)
but cpr_state_save checks the mode differently before using the channel,
and Coverity cannot infer that the two checks are equivalent in the
outgoing QEMU, so it warns that channel may be NULL:
  cpr_state_save(channel)
      MigMode mode = migrate_mode();
      if (mode == MIG_MODE_CPR_TRANSFER) {
          f = cpr_transfer_output(channel, errp);
To make Coverity happy, assert that channel != NULL in cpr_state_save.
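The fix itself is a one-line assertion, roughly:
  /* Sketch of cpr_state_save() after the fix */
  if (mode == MIG_MODE_CPR_TRANSFER) {
      assert(channel);   /* guaranteed by the qmp_migrate() check */
      f = cpr_transfer_output(channel, errp);
  }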
Resolves: Coverity CID 1590980
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Message-ID: <1738788841-211843-1-git-send-email-steven.sistare@oracle.com>
[assert instead of using parameters.mode in cpr_state_save]
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The expected outcome from qmp_migrate_cancel() is that the source
migration goes to the terminal state
MIGRATION_STATUS_CANCELLED. Anything different from this is a bug when
cancelling.
Make sure there is never a state transition from an unspecified state
into FAILED. Code that sets FAILED should always either make sure that
the old state is not CANCELLING or specify the old state.
Note that the destination is allowed to go into FAILED, so there's no
issue there.
(I don't think this is relevant as a backport because cancelling does
work, it just doesn't show the right state at the end)
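The enforced pattern relies on migrate_set_state() being a
compare-and-swap on the old state; for example:
  /* Only transitions if the current state is really ACTIVE, so a
   * concurrent CANCELLING is never clobbered into FAILED. */
  migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
                    MIGRATION_STATUS_FAILED);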
Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
Fixes: d0edb8a173 ("migration: Create the postcopy preempt channel asynchronously")
Fixes: 8518278a6a ("migration: implementation of background snapshot thread")
Fixes: bf78a046b9 ("migration: refactor migrate_fd_connect failures")
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-7-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
After postcopy has started, it's not possible to recover the source
machine in case a migration error occurs, because the destination has
already begun changing the state of the machine. For that same reason,
it doesn't make sense to try to cancel the migration after postcopy
has started. Reject the cancel command during postcopy.
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-6-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
If the destination side fails at migration_ioc_process_incoming()
before starting the coroutine, it will report the error but QEMU will
not exit.
Set the migration state to FAILED and exit the process if
exit-on-error allows.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2633
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-5-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Remove all instances of _fd_ from the migration generic code. These
functions have grown over time and the _fd_ part is now just
confusing.
migration_fd_error() -> migration_error() makes it a little
vague. Since it's only used for migration_connect() failures, change
it to migration_connect_set_error().
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-4-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
There's no need for two separate functions and this _fd_ is a historic
artifact that makes little sense nowadays.
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-3-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
There's no point passing the error into migration cancel only for it
to call migrate_set_error().
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-2-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
We're currently only checking the QEMUFile error after
qemu_loadvm_state(). This was causing a TLS termination error from
multifd recv threads to be ignored.
Start checking the migration error as well to avoid missing further
errors.
Regarding compatibility concerning the TLS termination error that was
being ignored: for QEMUs <= 9.2 - if the old QEMU is being used as the
migration source - the recently added migration property
multifd-tls-clean-termination needs to be set to OFF on the
*destination* machine.
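The added check amounts to something like this sketch after loadvm
(migrate_has_error() is used here as an assumption of the exact helper):
  /* Sketch: the QEMUFile error alone is not enough */
  ret = qemu_loadvm_state(mis->from_src_file);
  if (ret >= 0 && migrate_has_error(migrate_get_current())) {
      ret = -EINVAL;   /* e.g. TLS error recorded by a recv thread */
  }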
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
We're currently changing the way the source multifd migration handles
the shutdown of the multifd channels when TLS is in use to perform a
clean termination by calling gnutls_bye().
Older src QEMUs will always close the channel without terminating the
TLS session. New dst QEMUs treat an unclean termination as an error.
Add multifd_clean_tls_termination (default true), which can be switched
off on the destination whenever a src QEMU <= 9.2 is in use.
(Note that the compat property is only strictly necessary for src
QEMUs older than 9.1. Due to synchronization coincidences, src QEMUs
9.1 and 9.2 can put the destination in a condition where it doesn't
see the unclean termination. Still, make the property more inclusive
to facilitate potential backports.)
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The multifd recv side has been getting a TLS error of
GNUTLS_E_PREMATURE_TERMINATION at the end of migration when the send
side closes the sockets without ending the TLS session. This has been
masked by the code not checking the migration error after loadvm.
Start ending the TLS session at multifd_send_shutdown() so the recv
side always sees a clean termination (EOF) and we can start to
differentiate that from an actual premature termination that might
happen in the middle of the migration.
There's nothing to be done if a previous migration error has already
broken the connection, so add a comment explaining it and ignore any
errors coming from gnutls_bye().
This doesn't break compat with older recv-side QEMUs because EOF has
always caused the recv thread to exit cleanly.
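Conceptually the send side now finishes like this sketch (raw GnuTLS
shown; QEMU reaches it through its TLS channel wrapper):
  /* Sketch: end the TLS session so the peer reads a clean EOF.
   * If a previous migration error already broke the connection
   * there is nothing to terminate cleanly - ignore any error. */
  gnutls_bye(session, GNUTLS_SHUT_WR);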
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Locking the memory without MCL_ONFAULT instantly prefaults any mmapped
anonymous memory with a write fault, which introduces a lot of extra
overhead in terms of memory usage when all you want is to prevent
kcompactd from migrating and compacting QEMU pages. Add an option to
lock pages only lazily, as they're faulted in by the process, by using
MCL_ONFAULT if asked.
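On Linux this maps to mlockall(2) flags; a minimal sketch of the
resulting call (function name hypothetical):
  #include <sys/mman.h>
  #include <errno.h>

  /* Lock all memory, optionally only as pages are first touched */
  static int lock_memory(bool on_fault)
  {
      int flags = MCL_CURRENT | MCL_FUTURE;

      if (on_fault) {
          flags |= MCL_ONFAULT;   /* lazy: lock pages on first fault */
      }
      return mlockall(flags) ? -errno : 0;
  }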
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250212143920.1269754-5-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Replace the boolean value enable_mlock with an enum and add a helper to
decide whether we should be calling os_mlock.
This is a stepping stone towards introducing a new mlock mode, which
will be the third possible state of this enum.
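A sketch of the enum and helper (identifier names are assumptions):
  /* Tri-state replacing the old bool enable_mlock */
  typedef enum MlockState {
      MLOCK_OFF,        /* don't lock guest memory */
      MLOCK_ON,         /* lock and prefault everything (old behaviour) */
      MLOCK_ON_FAULT,   /* upcoming mode: lock lazily via MCL_ONFAULT */
  } MlockState;

  static inline bool should_mlock(MlockState state)
  {
      return state != MLOCK_OFF;
  }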
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250212143920.1269754-4-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
This will be used in the following commits to make it possible to only
lock memory on fault instead of right away.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Link: https://lore.kernel.org/r/20250212143920.1269754-2-d-tatianin@yandex-team.ru
[peterx: fail os_mlock(on_fault=1) when not supported]
[peterx: use G_GNUC_UNUSED instead of "(void)on_fault", per Dan]
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Block layer patches
- Managing inactive nodes (enables QSD migration with shared storage)
- Fix swapped values for BLOCK_IO_ERROR 'device' and 'qom-path'
- vpc: Read images exported from Azure correctly
- scripts/qemu-gdb: Support coroutine dumps in coredumps
- Minor cleanups
# gpg: Signature made Thu 06 Feb 2025 11:12:50 EST
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (25 commits)
block: remove unused BLOCK_OP_TYPE_DATAPLANE
iotests: Add (NBD-based) tests for inactive nodes
iotests: Add qsd-migrate case
iotests: Add filter_qtest()
nbd/server: Support inactive nodes
block/export: Add option to allow export of inactive nodes
block: Drain nodes before inactivating them
block/export: Don't ignore image activation error in blk_exp_add()
block: Support inactive nodes in blk_insert_bs()
block: Add blockdev-set-active QMP command
block: Add option to create inactive nodes
block: Fix crash on block_resize on inactive node
block: Don't attach inactive child to active node
migration/block-active: Remove global active flag
block: Inactivate external snapshot overlays when necessary
block: Allow inactivating already inactive nodes
block: Add 'active' field to BlockDeviceInfo
block-backend: Fix argument order when calling 'qapi_event_send_block_io_error()'
scripts/qemu-gdb: Support coroutine dumps in coredumps
scripts/qemu-gdb: Simplify fs_base fetching for coroutines
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
|
|
The general expectation is that header files should follow the same
file/path naming scheme as the corresponding source file. There are
various historical exceptions to this practice in QEMU, with one of
the most notable being the include/qapi/qmp/ directory. Most of the
headers there correspond to source files in qobject/.
This patch corrects most of that inconsistency by creating
include/qobject/ and moving the headers for qobject/ there.
This also fixes MAINTAINERS for include/qapi/qmp/dispatch.h:
scripts/get_maintainer.pl now reports "QAPI" instead of "No
maintainers found".
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com> #s390x
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20241118151235.2665921-2-armbru@redhat.com>
[Rebased]
|
|
Block devices have an individual active state; a single global flag
can't cover this correctly. This becomes more important as we allow
users to manually manage which nodes are active or inactive.
Now that it's allowed to call bdrv_inactivate_all() even when some
nodes are already inactive, we can remove the flag and just
unconditionally call bdrv_inactivate_all() and, more importantly,
bdrv_activate_all() before we make use of the nodes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-5-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
|
|
Refactor the legacy and multifd ram_save_target_page functions into
one. Besides simplifying the code, this leaves the 'migration_ops'
object unused, so it is expunged.
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250127120823.144949-3-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Two small cleanups in the same section of vmstate_save():
- Check vmdesc before the "mixed null/non-null data in array" logic, to
make it crystal clear that it's only about the JSON writer, not the
vmstate itself in the migration stream.
- Since we have the is_null variable now, use it to replace a check.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-17-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Now, after all the cleanups, we can finally merge the switchover startup
phase into one single function for precopy/postcopy.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-16-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
DEVICE state was introduced back in 2017:
https://lore.kernel.org/qemu-devel/20171020090556.18631-1-dgilbert@redhat.com/
Quoting from Dave's cover letter, when the pre-switchover phase was
enabled, the state transition looked like this:
The precopy flow is:
active->pre-switchover->device->completed
The postcopy flow is:
active->pre-switchover->postcopy-active->completed
To supplement the above, when the cap is not enabled:
The precopy flow is:
active->completed
The postcopy flow is:
active->postcopy-active->completed
It works for us, though we have some code just to special-case these
state transitions, so the DEVICE state is currently special only to
precopy, and only conditionally.
I had a quick discussion with the Libvirt developers; it turns out that
this may not be necessary. IOW, it seems okay to make the DEVICE state
generic, so that we don't have over-complicated state machines. It not
only helps align all the migration state machines and clean up the code
path, especially the pre-switchover handling (see the patch itself);
another side benefit is that we unconditionally have a specific state
marking the switchover phase, which might be helpful for debugging too.
This patch makes the DEVICE state always present, marking that the
source QEMU is switching over. Then the state machine will always be as
simple as:
active -> [pre-switchover ->] device -> [postcopy-active ->] completed
After the change, no matter whether pre-switchover or postcopy is enabled
or not, we always have the DEVICE state showing the switchover phase.
When pre-switchover is enabled, we'll have an extra stage before it.
When postcopy is enabled, we'll have an extra stage after it.
A few qtests need touching up in the QEMU tree for this change:
- A few iotest outputs (194, 203, 234, 262, 280)
- Teach libqos's migrate() about the "device" state
Cc: Jiri Denemark <jdenemar@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-15-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Now that qemu_savevm_state_complete_precopy() is never used in postcopy,
clean it up, as in_postcopy is now unconditionally false.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-14-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Postcopy has invoked qemu_savevm_state_complete_precopy() twice for a
long time, and that has caused way too much confusion. Let's clean this
up and make postcopy easier to read.
It's actually fairly straightforward: postcopy starts by saving the
non-postcopiable iterables, then later it saves again with non-iterables
only. Moving these two calls out makes everything much easier to follow.
Otherwise it's very unclear what qemu_savevm_state_complete_precopy()
does in either of the calls.
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-13-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Postcopy invokes qemu_savevm_state_complete_precopy() twice, which means
it will invoke the COMPLETE notification twice, and also fire twice the
tracepoints marking precopy complete.
Move that notification (along with the tracepoint) out to the caller, so
that postcopy will only notify once, right at the start of the switchover
phase from precopy. While at it, rename it to suit the file it now lives
in.
For precopy, there should be no functional change except that the
tracepoint has a name change.
For the other two users of qemu_savevm_state_complete_precopy(), namely
qemu_savevm_state() and qemu_savevm_live_state(): the notifier shouldn't
matter because they're not precopy at all. In these two contexts (aka
"savevm" and "colo") sometimes the precopy notifiers will still be
invoked, but that's outside the scope of this patch.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-12-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
This paves the way for a follow-up patch to modify migration states at
the end of postcopy_start(), which should better be done with the BQL
held, so that there's no way of concurrent cancellation.
We'll do slightly more under the BQL, but it's really trivial; hopefully
nothing will really change with this.
A side benefit is that we can drop another explicit lock() in the
failure path.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-11-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
I can't see why we must cache the state now that we have avoided the
possible CANCEL race: that's the only thing I can think of that could
modify the migration state concurrently with the migration thread
itself. Make all the state updates happen unconditionally; then we don't
need to cache the state anymore.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-10-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
In migration_maybe_pause(), QEMU may yield the BQL before waiting for a
semaphore. However, it yields the BQL too early, which gives the main
thread a chance to quickly take the BQL and change the state to
CANCELLING.
To avoid such a race condition from happening at all, always update the
migration states within the BQL. That will make sure no concurrent
cancellation can ever happen.
With that, IIUC there's a chance we can remove the extra parameter in
migration_maybe_pause() that updates the active state, but that will be
done separately later.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-9-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Precopy always uses unlimited bandwidth during switchover; that makes
sense because this phase is so critical that nobody would want to
throttle bandwidth during the VM blackout.
OTOH, postcopy surprisingly didn't do that. There's one line in the
middle of the postcopy switchover where it tries to switch to postcopy's
specified max-postcopy-bandwidth, but even so that happens somewhere in
the middle, which is strange.
This patch brings the two modes together: always use unlimited bandwidth
for switchover, and only apply the postcopy max bandwidth after the
switchover has completed.
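In terms of the existing rate-limit API, the described behaviour is
roughly:
  /* Sketch: both modes run the switchover unthrottled */
  migration_rate_set(RATE_LIMIT_DISABLED);
  /* ... blackout: device state, final dirty pages ... */

  /* Postcopy only: throttle again after switchover completes */
  migration_rate_set(migrate_max_postcopy_bandwidth());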
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-8-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Do a one-shot CPU sync at
qemu_savevm_state_complete_precopy_non_iterable(), instead of coding it
separately in two places.
Note that in the context of qemu_savevm_state_complete_precopy(), this
patch is also an optimization for the postcopy path, in that we can
avoid syncing the CPUs twice during switchover: before this patch,
postcopy_start() invoked qemu_savevm_state_complete_precopy() twice, and
each of them would try to sync the CPU info. In reality, only one of
them is enough.
For background snapshot, there's no intended functional change.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-7-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
This parameter is only used by one caller, which is the genuine precopy
completion path (migration_completion_precopy).
The parameter was introduced in a1fbe750fd ("migration: Fix race of image
locking between src and dst") to make sure the inactivation happens
before EOF, so that the destination will always be able to activate the
disk properly. However, there's no limitation on how early we inactivate
the disk. For the precopy completion path, we can always do that as long
as the VM is stopped.
Move the disk inactivation there; then we can remove this inactivate_disk
parameter from the whole call stack, because all the remaining users pass
in false anyway.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-6-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Postcopy can trigger this tracepoint twice, while only the first one is
valid. Avoid triggering the second tracepoint, just like what we do with
recording the total downtime.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-5-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
postcopy_start() is the entry function where postcopy starts. It also
marks the point after which the source QEMU will not dump the VM
description, i.e. the JSON writer is garbage from then on.
We could leave it to be cleaned up when migration completes; however,
while the JSON writer object is present, vmstate_save() will still try
to construct the JSON objects for the VM description, even though
they'll never be used later if it's postcopy.
To save those cycles, release the JSON writer earlier for postcopy. Then
vmstate_save() will be smart enough to skip the JSON object construction
completely. It can logically reduce downtime, because all such JSON
construction happens during the postcopy blackout.
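Releasing it early is essentially a one-liner in postcopy_start(), along
these lines:
  /* Sketch: postcopy never sends a vmdesc - drop the writer now */
  g_clear_pointer(&s->vmdesc, json_writer_free);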
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-4-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The QEMU machine has a property "suppress-vmdesc". When it is enabled,
QEMU stops attaching the JSON VM description at the end of the precopy
migration stream (postcopy is never affected, because postcopy never
attaches it).
However, even if it's suppressed by the user, the source QEMU will still
construct the JSON description, which is a complete waste of CPU and
memory resources.
To avoid that, only create the JSON writer object if suppress-vmdesc is
not specified.
Luckily, vmstate_save() already supports vmdesc==NULL, so only a few
spots are left to be prepared for vmdesc possibly being NULL now.
While at it, move the init / destroy of the JSON writer object to the
start / end of migration - the JSON writer object is a sub-struct of the
migration state, and it looks like the only object that was dynamically
allocated / destroyed within the migration process. Make it the same as
the rest of the objects that migration uses.
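The creation side then becomes conditional, roughly:
  /* Sketch: only build the JSON writer when a vmdesc will be sent */
  MachineState *machine = MACHINE(qdev_get_machine());

  if (!machine->suppress_vmdesc) {
      s->vmdesc = json_writer_new(false);
  }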
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-3-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|