Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250725201729.17100-3-pierrick.bouvier@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
|
|
When either the postcopy or return path capabilities are
enabled, the migration code will use the primary channel
for bidirectional I/O.
If either of those capabilities are enabled, the migration
code needs to mark the channel as expecting concurrent I/O
in order to activate the thread safety workarounds for
GNUTLS bug 1717
Closes: https://gitlab.com/qemu-project/qemu/-/issues/1937
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/qemu-devel/20250718150514.2635338-4-berrange@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The 'info migrate' command only shows the error message when the
migration state is 'failed'. When postcopy is used, however,
the 'postcopy-paused' state is used instead of 'failed', so we
must show the error message there too.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/qemu-devel/20250721133913.2914669-1-berrange@redhat.com
[line break to satisfy checkpatch]
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Fix the loop condition to avoid having a label with "1000 us" instead
of "1 ms".
Reported-by: Prasad Pandit <ppandit@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20250716182648.30202-3-farosas@suse.de
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Coverity has caught a bug in the formatting of time intervals for
postcopy latency distribution display in 'info migrate'.
While bounds checking the labels array, sizeof is incorrectly being
used. ARRAY_SIZE is the correct form of obtaining the size of an
array.
Fixes: 3345fb3b6d ("migration/postcopy: Add latency distribution report for blocktime")
Resolves: Coverity CID 1612248
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20250716182648.30202-2-farosas@suse.de
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
save_complete_precopy_thread
Recent patch [1] renames the save_live_complete_precopy handler to
save_complete, as the machine is not live in most cases when this
handler is executed. The same is true also for
save_live_complete_precopy_thread, therefore this patch removes the
"live" keyword from the handler itself and related types to keep the
naming unified.
In contrast to save_complete, this handler is only executed at the end
of precopy, therefore the "precopy" keyword is retained.
[1]: https://lore.kernel.org/all/20250613140801.474264-7-peterx@redhat.com/
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250626085235.294690-1-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Add the latency distribution too for blocktime, using order-of-two buckets.
It accounts for all the faults, from either vCPU or non-vCPU threads. With
prior rework, it's very easy to achieve by adding an array to account for
faults in each buckets.
Sample output for HMP (while for QMP it's simply an array):
Postcopy Latency Distribution:
[ 1 us - 2 us ]: 0
[ 2 us - 4 us ]: 0
[ 4 us - 8 us ]: 1
[ 8 us - 16 us ]: 2
[ 16 us - 32 us ]: 2
[ 32 us - 64 us ]: 3
[ 64 us - 128 us ]: 10169
[ 128 us - 256 us ]: 50151
[ 256 us - 512 us ]: 12876
[ 512 us - 1 ms ]: 97
[ 1 ms - 2 ms ]: 42
[ 2 ms - 4 ms ]: 44
[ 4 ms - 8 ms ]: 93
[ 8 ms - 16 ms ]: 138
[ 16 ms - 32 ms ]: 0
[ 32 ms - 65 ms ]: 0
[ 65 ms - 131 ms ]: 0
[ 131 ms - 262 ms ]: 0
[ 262 ms - 524 ms ]: 0
[ 524 ms - 1 sec ]: 0
[ 1 sec - 2 sec ]: 0
[ 2 sec - 4 sec ]: 0
[ 4 sec - 8 sec ]: 0
[ 8 sec - 16 sec ]: 0
Cc: Markus Armbruster <armbru@redhat.com>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-15-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
When used to report page fault latencies, the blocktime feature can be
almost useless when KVM async page fault is enabled, because in most cases
such remote fault will kickoff async page faults, then it's not trackable
from blocktime layer.
After all these recent rewrites to blocktime layer, it's finally so easy to
also support tracking non-vCPU faults. It'll be even faster if we could
always index fault records with TIDs, unfortunately we need to maintain the
blocktime API which report things in vCPU indexes.
Of course this can work not only for kworkers, but also any guest accesses
that may reach a missing page, for example, very likely when in the QEMU
main thread too (and all other threads whenever applicable).
In this case, we don't care about "how long the threads are blocked", but
we only care about "how long the fault will be resolved".
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-14-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Currently, the postcopy blocktime feature maintains vCPU fault information
using an array (vcpu_addr[]). It has two issues.
Issue 1: Performance Concern
============================
The old algorithm was almost OK and fast on inserts, except that the lookup
is slow and won't scale if there are a lot of vCPUs: when a page is copied
during postcopy, mark_postcopy_blocktime_end() will walk the whole array
trying to find which vCPUs are blocked by the address. So it needs
constant O(N) walk for each page resolution.
Alexey (the author of postcopy blocktime) mentioned the perf issue and how
to optimize it in a piece of comment in the page resolution path. The
comment was (interestingly..) not complete, but it's relatively clear what
he wanted to say about this perf issue.
Issue 2: Wrong Accounting on re-entrancies
==========================================
People might think that each vCPU should only and always get one fault at a
time, so that when the blocktime layer captured one fault on one vCPU, we
should never see another fault message on this vCPU.
It's almost correct, except in some extreme rare cases.
Case 1: it's possible the fault thread processes the userfaultfd messages
too fast so it can see >1 messages on one vCPU before the previous one was
resolved.
Case 2: it's theoretically also possible one vCPU can get even more than
one message on the same fault address if a fault is retried by the
kernel (e.g., handle_userfault() got interrupted before page resolution).
As this info might be important, instead of using commit message, I put
more details into the code as comment, when introducing an array
maintaining concurrent faults on one vCPU. Please refer to the comments
for details on both cases, especially case 1 which can be tricky.
Case 1 sounds rare, but it can be easily reproduced locally for me when we
run blocktime together with the migration-test on the vanilla postcopy.
New Design
==========
This patch should do almost what Alexey mentioned, but slightly
differently: instead of having an array to maintain vCPU fault addresses,
for each of the fault message we push a message into a hash, indexed by the
fault address.
With the hash, it can replace the old two structs: both the vcpu_addr[]
array, and also the array to store the start time of the fault. However
due to above we need one more counter array to account concurrent faults on
the same vCPU - that should even be needed in the old code, it's just that
the old code was buggy and it will blindly overwrite an existing
entry.. now we'll start to really track everything.
The hash structure might be more efficient than tree to maintain such
addr->(cpu, fault_time) information, so that the insert() and lookup()
paths should ideally both be ~O(1). After all, we do not need to sort.
Here we need to do one remove() though after the lookup(). It could be
slow but only if many vCPUs faulted exactly on the same address (so when
the list of cpu entries is long), which should be unlikely. Even with that,
it's still a worst case O(N) (consider 400 vCPUs faulted on the same
address and how likely is it..) rather than a constant O(N) complexity.
When at it, touch up the tracepoints to make them slightly more useful.
One tracepoint is added when walking all the fault entries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-13-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The variable vcpu_total_blocktime isn't easy to follow. In reality, it
wants to capture the case where all vCPUs are stopped, and now there will
be some vCPUs starts running.
The name now starts to conflict with vcpu_blocktime_total[], meanwhile it's
actually not necessary to have the variable at all: since nobody is
touching smp_cpus_down except ourselves, we can safely do the calculation
at the end before decrementing smp_cpus_down.
Hopefully this makes the logic easier to read, side benefit is we drop one
temp var.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Looking up the vCPU index for each fault can be expensive when there're
hundreds of vCPUs. Provide a cache for tid->vcpu instead with a hash
table, then lookup from there.
When at it, add another counter to record how many non-vCPU faults it gets.
For example, the main thread can also access a guest page that was missing.
These kind of faults are not accounted by blocktime so far.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-11-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Before this patch, the blocktime context can be created very early, because
postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during
migration object init.
The trick here is the blocktime context needs system vCPU information,
which seems to be possible to change after that point. I didn't verify it,
but it doesn't sound right.
Now move it out and initialize the context only when postcopy listen
starts. That is already during a migration so it should be guaranteed the
vCPU topology can never change on both sides.
While at it, assert that the ctx isn't created instead this time; the old
"if" trick isn't needed when we're sure it will only happen once now.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Blocktime so far only cares about the time one vcpu (or the whole system)
got blocked. It would be also be helpful if it can also report the latency
of page requests, which could be very sensitive during postcopy.
Blocktime itself is sometimes not very important, especially when one
thinks about KVM async PF support, which means vCPUs are literally almost
not blocked at all because the guest OS is smart enough to switch to
another task when a remote fault is needed.
However, latency is still sensitive and important because even if the guest
vCPU is running on threads that do not need a remote fault, the workload
that accesses some missing page is still affected.
Add two entries to the report, showing how long it takes to resolve a
remote fault. Mention in the QAPI doc that this is not the real average
fault latency, but only the ones that was requested for a remote fault.
Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
meanwhile add the entry checks in qtests for all postcopy tests.
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Add a field to count how many remote faults one vCPU has taken. So far
it's still not used, but will be soon.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
With 64-bit fields, it is trivial. The caution is when exposing any values
in QMP, it was still declared with milliseconds (ms). Hence it's needed to
do the convertion when exporting the values to existing QMP queries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Now with 64bits, the offseting using start_time is not needed anymore,
because the array can always remember the whole timestamp.
Then drop the unused parameter in get_low_time_offset() altogether.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
I am guessing it was used to be 32bits because of the atomic ops. Now all
the atomic ops are gone and we're protected by a mutex instead, it's ok we
can switch to 64 bits.
Reasons to move over:
- Allow further patches to change the unit from ms to us: with postcopy
preempt mode, we're really into hundreds of microseconds level on
blocktime. We'd better be able to trap those.
- This also paves way for some other tricks that the original version
used to avoid overflows, e.g., start_time was almost only useful before
to make sure the sampled timestamp won't overflow a 32-bit field.
- This prepares further reports on top of existing data collected,
e.g. average page fault latencies. When average operation is taken into
account, milliseconds are simply too coarse grained.
When at it:
- Rename page_fault_vcpu_time to vcpu_blocktime_start.
- Rename vcpu_blocktime to vcpu_blocktime_total.
- Touch up the trace-events to not dump blocktime ctx pointer
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Now with the mutex protection it's not needed anymore.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The postcopy blocktime feature was tricky that it used quite some atomic
operations over quite a few arrays and vars, without explaining how that
would be thread safe. The thread safety here is about concurrency between
the fault thread and the fault resolution threads, possible to access the
same chunk of data. All these atomic ops can be expensive too before
knowing clearly how it works.
OTOH, postcopy has one page_request_mutex used to serialize the received
bitmap updates. So far it's ok - we don't yet have a lot of threads
contending the lock. It might change after multifd will be supported, but
that's a separate story. What is important is, with that mutex, it's
pretty lightweight to move all the blocktime maintenance into the mutex
critical section. It's because the blocktime layer is lightweighted:
almost "remember which vcpu faulted on which address", and "ok we get some
fault resolved, calculate how long it takes". It's also an optional
feature for now (but I have thought of changing that, maybe in the future).
Let's push the blocktime layer into the mutex, so that it's always
thread-safe even without any atomic ops.
To achieve that, I'll need to add a tid parameter on fault path so that
it'll start to pass the faulted thread ID into deeper the stack, but not
too deep. When at it, add a comment for the shared fault handler (for
example, vhost-user devices running with postcopy), to mention a TODO. One
reason it might not be trivial is that vhost-user's userfaultfds should be
opened by vhost-user process, so it's pretty hard to control making sure
the TID feature will be around. It wasn't supported before, so keep it
like that for now.
Now we should be as ease when everything is protected by a mutex that we
always take anyway.
One side effect: we can finally remove one ramblock_recv_bitmap_test() in
mark_postcopy_blocktime_begin(), which was pretty weird and which also
includes a weird (but maybe necessary.. but maybe not?) operation to inject
a blocktime entry then quickly erase it.. When we're with the mutex, and
when we make sure it's invoked after checking the receive bitmap, it's not
needed anymore. Instead, we assert.
As another side effect, this paves way for removing all atomic ops in all
the mem accesses in blocktime layer.
Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
builds.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Add a global property to allow enabling postcopy-blocktime feature.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
This is a follow up on the other commit "migration/ram: avoid to do log
clear in the last round" but for postcopy.
https://lore.kernel.org/r/20250514115827.3216082-1-yanfei.xu@bytedance.com
I can observe more than 10% reduction of average page fault latency during
postcopy phase with this optimization:
Before: 268.00us (+-1.87%)
After: 232.67us (+-2.01%)
The test was done with a 16GB VM with 80 vCPUs, running a workload that
busy random writes to 13GB memory.
Cc: Yanfei Xu <yanfei.xu@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
There're a few things off here in that logic, rewrite it. When at it, add
rich comment to explain each of the decisions.
Since this is very sensitive path for migration, below are the list of
things changed with their reasonings.
(1) Exact pending size is only needed for precopy not postcopy
Fundamentally it's because "exact" version only does one more deep
sync to fetch the pending results, while in postcopy's case it's
never going to sync anything more than estimate as the VM on source
is stopped.
(2) Do _not_ rely on threshold_size anymore to decide whether postcopy
should complete
threshold_size was calculated from the expected downtime and
bandwidth only during precopy as an efficient way to decide when to
switchover. It's not sensible to rely on threshold_size in postcopy.
For precopy, if switchover is decided, the migration will complete
soon. It's not true for postcopy. Logically speaking, postcopy
should only complete the migration if all pending data is flushed.
Here it used to work because save_complete() used to implicitly
contain save_live_iterate() when there's pending size.
Even if that looks benign, having RAMs to be migrated in postcopy's
save_complete() has other bad side effects:
(a) Since save_complete() needs to be run once at a time, it means
when moving RAM there's no way moving other things (rather than
round-robin iterating the vmstate handlers like what we do with
ITERABLE phase). Not an immediate concern, but it may stop working
in the future when there're more than one iterables (e.g. vfio
postcopy).
(b) postcopy recovery, unfortunately, only works during ITERABLE
phase. IOW, if src QEMU moves RAM during postcopy's save_complete()
and network failed, then it'll crash both QEMUs... OTOH if it failed
during iteration it'll still be recoverable. IOW, this change should
further reduce the window QEMU split brain and crash in extreme cases.
If we enable the ram_save_complete() tracepoints, we'll see this
before this patch:
1267959@1748381938.294066:ram_save_complete dirty=9627, done=0
1267959@1748381938.308884:ram_save_complete dirty=0, done=1
It means in this migration there're 9627 pages migrated at complete()
of postcopy phase.
After this change, all the postcopy RAM should be migrated in iterable
phase, rather than save_complete():
1267959@1748381938.294066:ram_save_complete dirty=0, done=0
1267959@1748381938.308884:ram_save_complete dirty=0, done=1
(3) Adjust when to decide to switch to postcopy
This shouldn't be super important, the movement makes sure there's
only one in_postcopy check, then we are clear on what we do with the
two completely differnt use cases (precopy v.s. postcopy).
(4) Trivial touch up on threshold_size comparision
Which changes:
"(!pending_size || pending_size < s->threshold_size)"
into:
"(pending_size <= s->threshold_size)"
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-11-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Take notes on start/end state of dirty pages for the whole system.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The check over PAGE_DIRTY_FOUND isn't necessary. We could indent one less
and assert that instead.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Since we use the same save_complete() hook for both precopy and postcopy,
add a set of helpers to invoke the hook() to dedup the code.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Now after merging the precopy and postcopy version of complete() hook,
rename the precopy version from save_live_complete_precopy() to
save_complete().
Dropping the "live" when at it, because it's in most cases not live when
happening (in precopy).
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-7-peterx@redhat.com
[peterx: squash the fixup that covers a few more doc spots, per Juraj]
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
The hook is only defined in two vmstate users ("ram" and "block dirty
bitmap"), meanwhile both of them define the hook exactly the same as the
precopy version. Hence, this postcopy version isn't needed.
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
It's not possible to happen in bg-snapshot case.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Unfortunately, it was never correctly shown..
This is only found when I started to look into making the blocktime feature
more useful (so as to avoid using bpftrace, even though I'm not sure which
one will be harder to use..).
So the old dump would look like this:
Postcopy vCPU Blocktime: 0-1,4,10,21,33,46,48,59
Even though there're actually 40 vcpus, and the string will merge same
elements and also sort them.
To fix it, simply loop over the uint32List manually. Now it looks like:
Postcopy vCPU Blocktime (ms):
[15, 0, 0, 43, 29, 34, 36, 29, 37, 41,
33, 37, 45, 52, 50, 38, 40, 37, 40, 49,
40, 35, 35, 35, 81, 19, 18, 19, 18, 30,
22, 3, 0, 0, 0, 0, 0, 0, 0, 0]
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Cc: Alexey Perevalov <a.perevalov@samsung.com>
Cc: Markus Armbruster <armbru@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Dave suggested the HMP output for "info migrate" can not only leverage the
lines but also better grouping:
https://lore.kernel.org/r/aC4_-nMc7FwsMf9p@gallifrey
I followed Dave's suggestion, and some more modifications on top:
- Added all elements into the picture
- Use size_to_str() and drop most of the units: benefit is more friendly
to most human eyes, bad side effect is lose of details, but that should
be corner case per my uses, and one can still leverage the QMP interface
when necessary.
- Sub-grouping for "Transfers" ("Channels" and "Page Types").
- Better indentations
Sample output:
(qemu) info migrate
Status: postcopy-active
Time (ms): total=47317, setup=5, down=8
RAM info:
Throughput (Mbps): 1342.83
Sizes: pagesize=4 KiB, total=4.02 GiB
Transfers: transferred=1.41 GiB, remain=2.46 GiB
Channels: precopy=15.2 MiB, multifd=0 B, postcopy=1.39 GiB
Page Types: normal=367713, zero=41195
Page Rates (pps): transfer=40900, dirty=4
Others: dirty_syncs=2, postcopy_req=57503
Suggested-by: Dr. David Alan Gilbert <dave@treblig.org>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Define a list of vfio devices in CPR state, in a subsection so that
older QEMU can be live updated to this version. However, new QEMU
will not be live updateable to old QEMU. This is acceptable because
CPR is not yet commonly used, and updates to older versions are unusual.
The contents of each device object will be defined by the vfio subsystem
in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/1751493538-202042-14-git-send-email-steven.sistare@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Add the helper function cpr_get_fd_param, to use when preserving
a file descriptor that is opened externally and passed to QEMU.
cpr_get_fd_param returns a descriptor number either from a QEMU
command-line parameter, from a getfd command, or from CPR state.
When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
changes. Hence, during CPR, the command-line parameter is ignored
in new QEMU, and over-ridden by the value found in CPR state.
Similarly, if the descriptor was originally specified by a getfd
command in old QEMU, the fd number is not known outside of QEMU,
and it changes when sent to new QEMU via SCM_RIGHTS. Hence the
user cannot send getfd to new QEMU, but when the user sends a
hotplug command that references the fd, cpr_get_fd_param finds
its value in CPR state.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/qemu-devel/1751493538-202042-5-git-send-email-steven.sistare@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Update ReplayRamDiscard() function to return the result and unify the
ReplayRamPopulate() and ReplayRamDiscard() to ReplayRamDiscardState() at
the same time due to their identical definitions. This unification
simplifies related structures, such as VirtIOMEMReplayData, which makes
it cleaner.
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20250612082747.51539-4-chenyi.qiang@intel.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Syncing volatile memory provides no benefit, instead it can cause
performance issues in some cases. Only sync memory that is marked as
non-volatile after migration completes on destination.
Signed-off-by: Ben Chaney <bchaney@akamai.com>
Fixes: bd108a44bc29 (migration: ram: Switch to ram block writeback)
Link: https://lore.kernel.org/r/1CC43F59-336F-4A12-84AD-DB89E0A17A95@akamai.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Extend cpr_transfer_input to handle SOCKET_ADDRESS_TYPE_FD alongside
SOCKET_ADDRESS_TYPE_UNIX. This change supports the use of pre-listened
socket file descriptors for cpr migration channels.
This change is particularly useful in qtest environments, where the
socket may be created externally and passed via fd.
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
Link: https://lore.kernel.org/r/20250611205610.147008-3-jhkim@linux.ibm.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
There won't be any ram sync after the stage of save_complete, therefore
it's unnecessary to do manually protect for dirty pages being sent. Skip
to do this in last round can reduce noticeable downtime.
Signed-off-by: Yanfei Xu <yanfei.xu@bytedance.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250514115827.3216082-1-yanfei.xu@bytedance.com
[peterx: add comments]
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Define a vmstate priority that is lower than the default, so its handlers
run after all default priority handlers. Since 0 is no longer the default
priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
CPR for vfio will use this to install handlers for containers that run
after handlers for the devices that they contain.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/qemu-devel/1749569991-25171-3-git-send-email-steven.sistare@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
Add the cpr_incoming_needed, cpr_open_fd, and cpr_resave_fd helpers,
for use when adding cpr support for vfio and iommufd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/1749569991-25171-2-git-send-email-steven.sistare@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
|
|
thread_sync_sem is an one-shot event so it can be converted into
QemuEvent, which is more lightweight.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250529-event-v5-9-53b285203794@daynix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
colo_exit_sem and colo_incoming_sem represent one-shot events so they
can be converted into QemuEvent, which is more lightweight.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20250529-event-v5-8-53b285203794@daynix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
pause_event can utilize qemu_event_reset() to discard events.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20250529-event-v5-7-53b285203794@daynix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A new parameter "-a" is added to "info migrate" to dump all info, while
when not specified it only dumps the important ones. When at it, reorg
everything to make it easier to read for human.
The general rule is:
- Put important things at the top
- Reuse a single line when things are very relevant, hence reducing lines
needed to show the results
- Remove almost useless ones (e.g. "normal_bytes", while we also have
both "page size" and "normal" pages)
- Regroup things, so that related fields will show together
- etc.
Before this change, it looks like (one example of a completed case):
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
send-switchover-start: on
clear-bitmap-shift: 18
Migration status: completed
total time: 122952 ms
downtime: 76 ms
setup: 15 ms
transferred ram: 130825923 kbytes
throughput: 8717.68 mbps
remaining ram: 0 kbytes
total ram: 16777992 kbytes
duplicate: 997263 pages
normal: 32622225 pages
normal bytes: 130488900 kbytes
dirty sync count: 10
page size: 4 kbytes
multifd bytes: 117134260 kbytes
pages-per-second: 169431
postcopy request count: 5835
precopy ram: 15 kbytes
postcopy ram: 13691151 kbytes
After this change, sample output (default, no "-a" specified):
Status: postcopy-active
Time (ms): total=40504, setup=14, down=145
RAM info:
Bandwidth (mbps): 6102.65
Sizes (KB): psize=4, total=16777992,
transferred=37673019, remain=2136404,
precopy=3, multifd=26108780, postcopy=11563855
Pages: normal=9394288, zero=600672, rate_per_sec=185875
Others: dirty_syncs=3, dirty_pages_rate=278378, postcopy_req=4078
Sample output when "-a" specified:
Status: active
Time (ms): total=3040, setup=4, exp_down=300
RAM info:
Throughput (mbps): 10.51
Sizes (KB): psize=4, total=4211528,
transferred=3979, remain=4206452,
precopy=3978, multifd=0, postcopy=0
Pages: normal=992, zero=277, rate_per_sec=320
Others: dirty_syncs=1
Globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
send-switchover-start: on
clear-bitmap-shift: 18
XBZRLE: size=67108864, transferred=0, pages=0, miss=188451
miss_rate=0.00, encode_rate=0.00, overflow=0
CPU Throttle (%): 0
Dirty-limit Throttle (us): 0
Dirty-limit Ring Full (us): 0
Postcopy Blocktime (ms): 0
Postcopy vCPU Blocktime: ...
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Tested-by: Mario Casquero <mcasquer@redhat.com>
[peterx: print "," too in 1st line of RAM info]
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
With commit 82137e6c8c ("migration: enforce multifd and postcopy preempt to
be set before incoming"), and if postcopy preempt / multifd is enabled, one
cannot setup any capability because these checks would always fail.
(qemu) migrate_set_capability xbzrle off
Error: Postcopy preempt must be set before incoming starts
To fix it, check existing cap and only raise an error if the specific cap
changed.
Fixes: 82137e6c8c ("migration: enforce multifd and postcopy preempt to be set before incoming")
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
If zerocopy is enabled for multifd then QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
flag is forced into all multifd channel write calls via p->write_flags
that was setup in multifd_nocomp_send_setup().
However, device state packets aren't compatible with zerocopy - the data
buffer isn't getting kept pinned until multifd channel flush.
Make sure to mask that QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag in a multifd
send thread if the data being sent is device state.
Fixes: 0525b91a0b99 ("migration/multifd: Device state transfer support - send side")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/3bd5f48578e29f3a78f41b1e4fbea3d4b2d9b136.1747403393.git.maciej.szmigiero@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Enable Multifd and Postcopy migration together.
The migration_ioc_process_incoming() routine checks
magic value sent on each channel and helps to properly
setup multifd and postcopy channels.
The Precopy and Multifd threads work during the initial
guest RAM transfer. When migration moves to the Postcopy
phase, the multifd threads cease to send data on multifd
channels and Postcopy threads on the destination
request/pull data from the source side.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20250512125124.147064-3-ppandit@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
During multifd migration, zero pages are written if
they are migrated more than once.
This may result in a migration thread hang issue when
multifd and postcopy are enabled together.
When postcopy is enabled, always write zero pages as and
when they are migrated.
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250512125124.147064-2-ppandit@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
I hit following error which testing migration in pure RoCE env:
"-incoming rdma:[::]:8089: RDMA ERROR: You only have RoCE / iWARP devices in your
systems and your management software has specified '[::]', but IPv6 over RoCE /
iWARP is not supported in Linux.#012'."
In our setup, we use rdma bind on ipv6 on target host, while connect from source
with ipv4, remove the qemu_rdma_broken_ipv6_kernel, migration just work
fine.
Checking the git history, the function was added since introducing of
rdma migration, which is more than 10 years ago. linux-rdma has
improved support on RoCE/iWARP for ipv6 over past years. There are a few fixes
back in 2016 seems related to the issue, eg:
aeb76df46d11 ("IB/core: Set routable RoCE gid type for ipv4/ipv6 networks")
other fixes back in 2018, eg:
052eac6eeb56 RDMA/cma: Update RoCE multicast routines to use net namespace
8d20a1f0ecd5 RDMA/cma: Fix rdma_cm raw IB path setting for RoCE
9327c7afdce3 RDMA/cma: Provide a function to set RoCE path record L2 parameters
5c181bda77f4 RDMA/cma: Set default GID type as RoCE when resolving RoCE route
3c7f67d1880d IB/cma: Fix default RoCE type setting
be1d325a3358 IB/core: Set RoCEv2 MGID according to spec
63a5f483af0e IB/cma: Set default gid type to RoCEv2
So remove the outdated function and it's usage.
Cc: Peter Xu <peterx@redhat.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Yu Zhang <yu.zhang@ionos.com>
Cc: Fabiano Rosas <farosas@suse.de>
Cc: qemu-devel@nongnu.org
Cc: linux-rdma@vger.kernel.org
Cc: michael@flatgalaxy.com
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Tested-by: Li zhijian <lizhijian@fujitsu.com>
Reviewed-by: Michael Galaxy <mrgalaxy@nvidia.com>
Link: https://lore.kernel.org/r/20250402051306.6509-1-jinpu.wang@ionos.com
[peterx: some cosmetic changes]
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
The preempt mode postcopy has been introduced for a while. From latency
POV, it should always win the vanilla postcopy.
However there's one thing missing when preempt mode is enabled right now,
which is the spatial locality hint when there're page requests from the
destination side.
In vanilla postcopy, as long as a page request was unqueued, it will update
the PSS of the precopy background stream, so that after a page request the
background thread will move the pages after whatever was requested. It's
pretty much a natural behavior when there's only one channel anyway, and
one scanner to send the pages.
Preempt mode didn't follow that, because preempt mode has its own channel
and its own PSS (which doesn't linearly scan the guest memory, but
dedicated to resolve page requested from destination). So the page request
process and the background migration process are completely separate.
This patch adds the hint explicitly for preempt mode. With that, whenever
the preempt mode receives a page request on the source, it will service the
remote page fault in the return path, then it'll provide a hint to the
background thread so that we'll start sending the pages right after the
requested ones in the background, assuming the follow up pages have a
higher chance to be accessed later.
NOTE: since the background migration thread and return path thread run
completely concurrently, it doesn't always mean the hint will be applied
every single time. For example, it's possible that the return path thread
receives multiple page requests in a row without the background thread
getting the chance to consume one. In such case, the preempt thread only
provide the hint if the previous hint has been consumed. After all,
there's no point queuing hints when we only have one linear scanner.
This could measureably improve the simple sequential memory access pattern
during postcopy (when preempt is on). For random accesses, I can measure a
slight increase of remote page fault latency from ~500us -> ~600us, that
could be a trade-off to have such hint mechanism, and after all that's
still greatly improved comparing to vanilla postcopy on random (~10ms).
The patch is verified by our QE team in a video streaming test case, to
reduce the pause of the video from ~1min to a few seconds when switching
over to postcopy with preempt mode.
Reported-by: Xiaohui Li <xiaohli@redhat.com>
Tested-by: Xiaohui Li <xiaohli@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250424220705.195544-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
|
|
Implement save_postcopy_prepare(), preparing for the enablement
of both multifd and postcopy.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-5-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
|
Add a savevm handler for a module to opt-in sending extra sections right
before postcopy starts, and before VM is stopped.
RAM will start to use this new savevm handler in the next patch to do flush
and sync for multifd pages.
Note that we choose to do it before VM stopped because the current only
potential user is not sensitive to VM status, so doing it before VM is
stopped is preferred to enlarge any postcopy downtime.
It is still a bit unfortunate that we need to introduce such a new savevm
handler just for the only use case, however it's so far the cleanest.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-4-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
|