path: root/migration
2022-01-28  migration/ram: clean up unused comment.  (Xu Zheng; 1 file, -1/+0)
Just a removal of an unused comment. Commit a0a8aa147aa did many fixes and removed the parameter named "ms", but forgot to remove the corresponding comment in ram_save_host_page().
Signed-off-by: Xu Zheng <xuzheng@cmss.chinamobile.com>
Signed-off-by: Mao Zhongyi <maozhongyi@cmss.chinamobile.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>

2022-01-28  migration: Report the error returned when save_live_iterate fails  (David Edmondson; 1 file, -2/+3)
Should qemu_savevm_state_iterate() encounter a failure when calling a particular save_live_iterate function, report the error code returned by the function.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2022-01-28  migration/migration.c: Remove the MIGRATION_STATUS_ACTIVE when migration finished  (Zhang Chen; 1 file, -6/+0)
The MIGRATION_STATUS_ACTIVE state indicates that migration is running. Remove the explicit handling so that the state falls through to the default operation; it should be part of the unknown ending states.
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2022-01-28  migration/migration.c: Avoid COLO boot in postcopy migration  (Zhang Chen; 1 file, -5/+5)
COLO does not support postcopy migration; remove the FIXME.
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2022-01-28  migration/migration.c: Add missed default error handler for migration state  (Zhang Chen; 1 file, -1/+1)
In migration_completion(), no status other than the ones explicitly handled is expected; states such as MIGRATION_STATUS_CANCELLING or MIGRATION_STATUS_CANCELLED should never appear there, so route them to a default error handler.
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

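For illustration only, a minimal, hypothetical sketch of the pattern this commit applies: handle the expected states explicitly and route everything else, including states that should never occur here, into a default error path. The enum values mirror the commit text; the function is not the actual QEMU code.

    #include <stdio.h>

    typedef enum {
        MIGRATION_STATUS_COMPLETED,
        MIGRATION_STATUS_COLO,
        MIGRATION_STATUS_CANCELLING,
        MIGRATION_STATUS_CANCELLED,
    } MigrationStatus;

    /* Hypothetical sketch: any state not handled explicitly falls
     * through to the default error handler instead of being ignored. */
    static int complete_migration(MigrationStatus s)
    {
        switch (s) {
        case MIGRATION_STATUS_COMPLETED:
        case MIGRATION_STATUS_COLO:
            return 0;
        default:
            fprintf(stderr, "unexpected migration state %d\n", (int)s);
            return -1;
        }
    }

    int main(void)
    {
        /* CANCELLED is not expected here, so it must hit the default. */
        return complete_migration(MIGRATION_STATUS_CANCELLED) == -1 ? 0 : 1;
    }
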
2022-01-28  multifd: Rename pages_used to normal_pages  (Juan Quintela; 2 files, -3/+4)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  multifd: recv side only needs the RAMBlock host address  (Juan Quintela; 4 files, -9/+6)
So we can remove the MultiFDPages.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  multifd: Use normal pages array on the recv side  (Juan Quintela; 4 files, -34/+33)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
Rename num_normal_pages to total_normal_pages (peter)

2022-01-28  multifd: Use normal pages array on the send side  (Juan Quintela; 5 files, -21/+33)
We are only sending normal pages through multifd channels. Later in this series, we are going to also send zero pages. We are going to detect whether a page is zero or non-zero in the multifd channel thread, not on the main thread.

So we receive an array of pages page->offset[N], and we will end up with:

    p->normal[N - zero_pages]
    p->zero[zero_pages]

In this patch, we just copy all the pages in offset to normal:

    for (i = 0; i < pages->num; i++) {
        p->normal[p->normal_num] = pages->offset[i];
        p->normal_num++;
    }

Later in the series this becomes:

    for (i = 0; i < pages->num; i++) {
        if (buffer_is_zero(pages->offset[i])) {
            p->zero[p->zero_num] = pages->offset[i];
            p->zero_num++;
        } else {
            p->normal[p->normal_num] = pages->offset[i];
            p->normal_num++;
        }
    }

Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
Improve comment (dave)
Rename num_normal_pages to total_normal_pages (peter)

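As a self-contained illustration of the final loop (plain C, not QEMU code; QEMU's optimized buffer_is_zero() is replaced by a naive stand-in, and the page bookkeeping is simplified):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Naive stand-in for QEMU's optimized buffer_is_zero(). */
    static bool buffer_is_zero(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (buf[i]) {
                return false;
            }
        }
        return true;
    }

    int main(void)
    {
        static uint8_t ram[4 * PAGE_SIZE];
        memset(ram + 2 * PAGE_SIZE, 0xff, PAGE_SIZE); /* page 2 is non-zero */

        size_t offsets[4] = {0, PAGE_SIZE, 2 * PAGE_SIZE, 3 * PAGE_SIZE};
        size_t normal[4], zero[4];
        size_t normal_num = 0, zero_num = 0;

        /* Split the incoming offsets into normal and zero pages, as the
         * final form of the loop in the commit message does. */
        for (size_t i = 0; i < 4; i++) {
            if (buffer_is_zero(ram + offsets[i], PAGE_SIZE)) {
                zero[zero_num++] = offsets[i];
            } else {
                normal[normal_num++] = offsets[i];
            }
        }
        printf("normal pages: %zu, zero pages: %zu\n", normal_num, zero_num);
        return 0;
    }
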
2022-01-28  multifd: Unfold "used" variable by its value  (Juan Quintela; 1 file, -5/+3)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  multifd: Use a single writev on the send side  (Juan Quintela; 1 file, -12/+8)
Until now, we wrote the packet header with write() and the rest of the pages with writev(). Just increase the size of the iovec and do a single writev().
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

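A self-contained POSIX sketch of the same idea: carry the packet header in iov[0] and send header plus pages with one writev() instead of a write() followed by a writev(). Sizes and contents are illustrative; a production sender would also loop on partial writes.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        static char header[64] = "packet header";
        static char page0[PAGE_SIZE], page1[PAGE_SIZE];

        /* iov[0] carries the packet header; the remaining entries carry
         * page data. One writev() replaces the old write() + writev(). */
        struct iovec iov[3] = {
            { .iov_base = header, .iov_len = sizeof(header) },
            { .iov_base = page0,  .iov_len = sizeof(page0)  },
            { .iov_base = page1,  .iov_len = sizeof(page1)  },
        };

        int fd = open("/dev/null", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        ssize_t n = writev(fd, iov, 3);
        if (n < 0) {
            perror("writev");
            return 1;
        }
        printf("wrote %zd bytes with a single syscall\n", n);
        return 0;
    }
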
2022-01-28  multifd: Remove send_write() method  (Juan Quintela; 4 files, -54/+2)
Everything now uses iovs.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  multifd: Make zstd use iov's  (Juan Quintela; 1 file, -4/+4)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  multifd: Make zlib use iov's  (Juan Quintela; 1 file, -4/+4)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  multifd: Move iov from pages to params  (Juan Quintela; 2 files, -12/+30)
This will allow us to reduce the number of system calls in the next patch.
Signed-off-by: Juan Quintela <quintela@redhat.com>

2022-01-28  multifd: Use proper maximum compression values  (Juan Quintela; 2 files, -4/+4)
It happens that there are functions to calculate the worst possible compression size for a packet. Use them.
Suggested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

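For reference, the helpers in question are zlib's compressBound() and zstd's ZSTD_compressBound(), which return the worst-case compressed size for a given input size. A minimal zlib example follows (link with -lz); the packet size used is illustrative only.

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        /* 128 pages of 4 KiB, an illustrative multifd packet size. */
        uLong src_len = 128 * 4096;

        /* compressBound() returns the worst-case compressed size, so the
         * output buffer can be allocated safely up front. */
        uLong worst = compressBound(src_len);

        printf("source %lu bytes -> worst-case output %lu bytes\n",
               src_len, worst);
        return 0;
    }
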
2022-01-28  migration: Move ram_release_pages() call to save_zero_page_to_file()  (Juan Quintela; 1 file, -11/+10)
We always need to call it when we find a zero page, so put it in a single place.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>

2022-01-28  migration: simplify do_compress_ram_page  (Juan Quintela; 1 file, -8/+3)
The goto is not needed at all.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  migration: Remove masking for compression  (Juan Quintela; 1 file, -2/+2)
Remove the mask in the call to ram_release_pages(). Nothing else does it, and if the offset has those bits set, we have a lot of trouble.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2022-01-28  migration: ram_release_pages() always receives 1 page as argument  (Juan Quintela; 1 file, -4/+4)
Remove the pages argument, and s/pages/page/.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
- Use 1LL instead of casts (philmd)
- Change the whole 1ULL for TARGET_PAGE_SIZE

2022-01-28  migration: We only need last_stage in two places  (Juan Quintela; 1 file, -23/+18)
We only need last_stage in two places, yet we are passing it all around. Just add a field to RAMState that carries it.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
Repeat subject (philmd suggestion)

2022-01-28  migration: All these fields are unsigned  (Juan Quintela; 4 files, -43/+43)
So printing them as %d is wrong. Note that the channel id is a uint8_t, but I changed it anyway for consistency.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>

2022-01-12  aio-posix: split poll check from ready handler  (Stefan Hajnoczi; 1 file, -4/+4)
Adaptive polling measures the execution time of the polling check plus the handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was. For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take tens of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling.

By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring.

The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread.

Before: 168k IOPS, IOThread syscalls:

  9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0) = 16
  9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8
  9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8
  9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
  9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512) = 8
  9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512) = 8
  9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512) = 8
  9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0) = 32

After: 174k IOPS (+3.6%), IOThread syscalls:

  9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0) = 32
  9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8
  9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8
  9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50) = 32

Notice that the ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring.

As usual, polling is not implemented on Windows, so this patch ignores the new io_poll_read() callback in aio-win32.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20211207132336.36627-2-stefanha@redhat.com
[Fixed up aio_set_event_notifier() calls in tests/unit/test-fdmon-epoll.c added after this series was queued. --Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

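A hedged sketch of the accounting change, not the actual aio-posix code: charge only the polling check against the polling budget, and take the measurement before the ready handler runs, so a slow handler can no longer inflate the measured polling time.

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    static int64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
    }

    /* Hypothetical poll loop: *poll_time_ns is recorded before handler()
     * runs, so handler cost is excluded from the adaptive statistics. */
    static bool poll_once(bool (*check)(void), void (*handler)(void),
                          int64_t budget_ns, int64_t *poll_time_ns)
    {
        int64_t start = now_ns();

        while (now_ns() - start < budget_ns) {
            if (check()) {
                *poll_time_ns = now_ns() - start;  /* check time only */
                handler();                         /* not counted */
                return true;
            }
        }
        *poll_time_ns = budget_ns;  /* full budget spent, no event */
        return false;
    }

    static bool ready;
    static bool check(void) { return ready; }
    static void handler(void) { /* slow work would happen here */ }

    int main(void)
    {
        int64_t poll_ns;
        ready = true;
        return poll_once(check, handler, 32000, &poll_ns) ? 0 : 1;
    }
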
2021-12-22  failover: Silence warning messages during qtest  (Laurent Vivier; 1 file, -1/+3)
The virtio-net-failover test tries several device combinations that produce some expected warnings. These warnings can be confusing, so we disable them during the qtest sequence.
Reported-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20211220145314.390697-1-lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
[thuth: Fix memory leak by using error_free()]
Signed-off-by: Thomas Huth <thuth@redhat.com>

2021-12-15  multifd: Make zlib compression method not use iovs  (Juan Quintela; 1 file, -8/+9)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: Make zstd compression method not use iovs  (Juan Quintela; 1 file, -10/+10)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  COLO: Move some trace code behind qemu_mutex_unlock_iothread()  (Rao, Lei; 1 file, -3/+3)
There is no need to run this trace code inside the critical section, so moving it behind qemu_mutex_unlock_iothread() reduces the time the lock is held.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-12-15  multifd: Shut down the QIO channels to avoid blocking the send threads when they are terminated.  (Li Zhang; 1 file, -0/+3)
When doing live migration with 8, 16, or more multifd channels, the guest hangs in the presence of network errors such as missing TCP ACKs.

On the sender's side, the main thread is blocked on qemu_thread_join: migration_fd_cleanup is called because one thread fails on qio_channel_write_all when the network problem happens, while the other send threads are blocked on sendmsg. They cannot be terminated, so the main thread blocks on qemu_thread_join waiting for them.

(gdb) bt
0  0x00007f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0
1  0x000055cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at ../util/qemu-thread-posix.c:627
2  0x000055cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542
3  0x000055cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at ../migration/migration.c:1808
4  0x000055cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at ../migration/migration.c:1850
5  0x000055cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at ../util/async.c:141
6  0x000055cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at ../util/async.c:169
7  0x000055cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at ../util/aio-posix.c:381
8  0x000055cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, callback=0x0, user_data=0x0) at ../util/async.c:311
9  0x00007f30c9c8cdf4 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
10 0x000055cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232
11 0x000055cbb718521c in os_host_main_loop_wait (timeout=42251070366) at ../util/main-loop.c:255
12 0x000055cbb7185321 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:531
13 0x000055cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726
14 0x000055cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c578888, envp=0x7ffc0c578ab0) at ../softmmu/main.c:50

To make sure that the send threads can be terminated, the IO channels should be shut down to avoid waiting on I/O.

Signed-off-by: Li Zhang <lizhang@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-12-15  multifd: Fill offset and block for reception  (Juan Quintela; 1 file, -0/+2)
We were using the iov directly, but we will need this info in the following patch.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: remove used parameter from send_recv_pages() method  (Juan Quintela; 4 files, -14/+11)
It is already there as p->pages->num.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: remove used parameter from send_prepare() method  (Juan Quintela; 4 files, -15/+10)
It is already there as p->pages->num.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: The variable is only used inside the loop  (Juan Quintela; 1 file, -2/+1)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: Add missing documentation  (Juan Quintela; 3 files, -0/+5)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: Rename used field to num  (Juan Quintela; 2 files, -20/+20)
We will need to split it later into zero_num (number of zero pages) and normal_num (number of normal pages). This name is better.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  migration: Never call twice qemu_target_page_size()  (Juan Quintela; 3 files, -8/+11)
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  multifd: Delete useless operation  (Juan Quintela; 2 files, -18/+8)
We were dividing by page_size only to multiply again at the only use. While there, improve the comments.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-12-15  migration: Remove is_zero_range()  (Juan Quintela; 1 file, -7/+2)
It just calls buffer_is_zero(). Just change the callers.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

2021-12-15  migration/colo: Optimize COLO primary node start code path  (Zhang Chen; 2 files, -8/+7)
Optimize the COLO primary start path from:

    MIGRATION_STATUS_XXX --> MIGRATION_STATUS_ACTIVE --> MIGRATION_STATUS_COLO --> MIGRATION_STATUS_COMPLETED

to:

    MIGRATION_STATUS_XXX --> MIGRATION_STATUS_COLO --> MIGRATION_STATUS_COMPLETED

There is no need to start the primary COLO through "MIGRATION_STATUS_ACTIVE".
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-12-15  Fixed a QEMU hang when guest poweroff in COLO mode  (Rao, Lei; 2 files, -0/+26)
When the PVM guest powers off, the COLO thread may be waiting on a semaphore in colo_process_checkpoint(). So, we should wake up the COLO thread before migration shutdown.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-12-15  migration/colo: More accurate update checkpoint time  (Zhang Chen; 1 file, -3/+2)
Previous operations (like vm_start and replication_start_all) consume extra time before the timer is updated, so reduce that time in this patch.
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-12-15  migration/ram.c: Remove the qemu_mutex_lock in colo_flush_ram_cache.  (Rao, Lei; 1 file, -2/+0)
The code to acquire bitmap_mutex was added in commit 63268c4970a5f126cc9af75f3ccb8057abef5ec0. There is no need to acquire bitmap_mutex in colo_flush_ram_cache(), because colo_flush_ram_cache() is only called on the COLO secondary VM, which is the destination side. On the COLO secondary VM, only the COLO thread touches the bitmap of the ram cache.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-11-09  Reset the auto-converge counter at every checkpoint.  (Rao, Lei; 3 files, -0/+14)
If we don't reset the auto-converge counter, it will continue to run while COLO is running, and eventually the system will hang due to the CPU throttle reaching DEFAULT_MIGRATE_MAX_CPU_THROTTLE.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Tested-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-11-09  Reduce the PVM stop time during Checkpoint  (Rao, Lei; 1 file, -3/+45)
When flushing memory from the ram cache to ram during every checkpoint on the secondary VM, we can copy contiguous chunks of memory instead of 4096 bytes at a time, to reduce the VM stop time during the checkpoint.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Tested-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Juan Quintela <quintela@redhat.com>

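A standalone sketch of the optimization (plain C, not the QEMU code; names are illustrative): scan the dirty bitmap for runs of consecutive dirty pages and copy each run with a single memcpy() instead of one 4096-byte copy per page.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Copy each run of consecutive dirty pages with one memcpy()
     * instead of one 4096-byte copy per page. */
    static void flush_ram_cache(char *ram, const char *cache,
                                const bool *dirty, size_t npages)
    {
        size_t i = 0;

        while (i < npages) {
            if (!dirty[i]) {
                i++;
                continue;
            }
            size_t start = i;
            while (i < npages && dirty[i]) {
                i++;  /* extend the contiguous run */
            }
            memcpy(ram + start * PAGE_SIZE, cache + start * PAGE_SIZE,
                   (i - start) * PAGE_SIZE);
        }
    }

    int main(void)
    {
        static char ram[8 * PAGE_SIZE], cache[8 * PAGE_SIZE];
        bool dirty[8] = {false, true, true, true, false, false, true, false};

        memset(cache, 0xab, sizeof(cache));
        flush_ram_cache(ram, cache, dirty, 8);  /* two copies, not four */
        return ram[PAGE_SIZE] == (char)0xab ? 0 : 1;
    }
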
2021-11-06  migration: Check that postcopy fd's are not NULL  (Juan Quintela; 1 file, -0/+4)
If postcopy has finished, it frees the array, but vhost-user unregisters it at cleanup time.
Fixes: c4f7538
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

2021-11-03  colo: Don't dump colo cache if dump-guest-core=off  (Lukas Straub; 1 file, -0/+6)
One might set dump-guest-core=off to make coredumps smaller while still allowing many QEMU bugs to be debugged. Extend this option to the colo cache.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

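On Linux, the mechanism behind this is madvise(MADV_DONTDUMP), which excludes a mapping from core dumps (QEMU wraps it in its own qemu_madvise() helper). A minimal standalone example, with an anonymous mapping standing in for the colo cache:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16 * 4096;
        void *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (cache == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Exclude this region from core dumps, mirroring what
         * dump-guest-core=off does for the colo cache. */
        if (madvise(cache, len, MADV_DONTDUMP) < 0) {
            perror("madvise");
            return 1;
        }
        puts("cache region excluded from core dumps");
        return 0;
    }
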
2021-11-03  Changed the last-mode to none on first start of COLO  (Rao, Lei; 1 file, -5/+3)
When we first started COLO, the last-mode was as follows:

    { "execute": "query-colo-status" }
    {"return": {"last-mode": "primary", "mode": "primary", "reason": "none"}}

That last-mode is unreasonable. After the patch, it becomes:

    { "execute": "query-colo-status" }
    {"return": {"last-mode": "none", "mode": "primary", "reason": "none"}}

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-11-03  Removed the qemu_fclose() in colo_process_incoming_thread  (Rao, Lei; 1 file, -5/+0)
After the live migration, the related fd will be cleaned up in migration_incoming_state_destroy(). So the qemu_fclose() in colo_process_incoming_thread is not necessary.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

2021-11-03  colo: fixed 'Segmentation fault' when the simplex mode PVM poweroff  (Rao, Lei; 1 file, -0/+1)
The GDB stack is as follows:

Program terminated with signal SIGSEGV, Segmentation fault.
0  object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:832
   if (type->class->interfaces &&
[Current thread is 1 (Thread 0x7f756e97eb00 (LWP 1811577))]
(gdb) bt
0  object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:832
1  0x000055c8f2c3dd14 in object_dynamic_cast (obj=0x55c8f543ac00, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:763
2  0x000055c8f2c3ddce in object_dynamic_cast_assert (obj=0x55c8f543ac00, typename=0x55c8f2f7379e "qio-channel", file=0x55c8f2f73780 "migration/qemu-file-channel.c", line=117, func=0x55c8f2f73800 <__func__.18724> "channel_shutdown") at qom/object.c:786
3  0x000055c8f2bbc6ac in channel_shutdown (opaque=0x55c8f543ac00, rd=true, wr=true, errp=0x0) at migration/qemu-file-channel.c:117
4  0x000055c8f2bba56e in qemu_file_shutdown (f=0x7f7558070f50) at migration/qemu-file.c:67
5  0x000055c8f2ba5373 in migrate_fd_cancel (s=0x55c8f4ccf3f0) at migration/migration.c:1699
6  0x000055c8f2ba1992 in migration_shutdown () at migration/migration.c:187
7  0x000055c8f29a5b77 in main (argc=69, argv=0x7fff3e9e8c08, envp=0x7fff3e9e8e38) at vl.c:4512

The root cause is that we still try to shut down the from_dst_file in migrate_fd_cancel() after the qemu_close() in colo_process_checkpoint(). So, we should set s->rp_state.from_dst_file = NULL after qemu_close().

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

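The underlying pattern, as a hedged generic sketch (the struct and names are illustrative, not QEMU code): after releasing a resource that another code path may still inspect, clear the pointer so later cleanup can test for NULL instead of touching freed state.

    #include <stdio.h>

    struct conn {
        FILE *from_dst_file;  /* illustrative stand-in for the QEMUFile */
    };

    static void conn_close(struct conn *c)
    {
        if (c->from_dst_file) {
            fclose(c->from_dst_file);
            c->from_dst_file = NULL;  /* the fix: no dangling pointer */
        }
    }

    /* A later cancel path can now test the pointer safely instead of
     * shutting down an already-freed file. */
    static void conn_cancel(struct conn *c)
    {
        if (c->from_dst_file) {
            /* shutdown logic would go here */
        }
    }

    int main(void)
    {
        struct conn c = { .from_dst_file = tmpfile() };

        conn_close(&c);
        conn_cancel(&c);  /* safe: the pointer was cleared */
        return 0;
    }
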
2021-11-03  Fixed SVM hang when do failover before PVM crash  (Rao, Lei; 1 file, -0/+2)
This patch fixes the following deadlock:

Thread 1 (Thread 0x7f34ee738d80 (LWP 11212)):
#0  __pthread_clockjoin_ex (threadid=139847152957184, thread_return=0x7f30b1febf30, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1  0x0000563401998e36 in qemu_thread_join (thread=0x563402d66610) at util/qemu-thread-posix.c:587
#2  0x00005634017a79fa in process_incoming_migration_co (opaque=0x0) at migration/migration.c:502
#3  0x00005634019b59c9 in coroutine_trampoline (i0=63395504, i1=22068) at util/coroutine-ucontext.c:115
#4  0x00007f34ef860660 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f30b21ee730 in ?? ()
#6  0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f30b3dff700 (LWP 11747)):
#0  __lll_lock_wait (futex=futex@entry=0x56340218ffa0 <qemu_global_mutex>, private=0) at lowlevellock.c:52
#1  0x00007f34efa000a3 in _GI__pthread_mutex_lock (mutex=0x56340218ffa0 <qemu_global_mutex>) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000563401997f99 in qemu_mutex_lock_impl (mutex=0x56340218ffa0 <qemu_global_mutex>, file=0x563401b7a80e "migration/colo.c", line=806) at util/qemu-thread-posix.c:78
#3  0x0000563401407144 in qemu_mutex_lock_iothread_impl (file=0x563401b7a80e "migration/colo.c", line=806) at /home/workspace/colo-qemu/cpus.c:1899
#4  0x00005634017ba8e8 in colo_process_incoming_thread (opaque=0x563402d664c0) at migration/colo.c:806
#5  0x0000563401998b72 in qemu_thread_start (args=0x5634039f8370) at util/qemu-thread-posix.c:519
#6  0x00007f34ef9fd609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f34ef924293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The QEMU main thread is holding the lock:

(gdb) p qemu_global_mutex
$1 = {lock = {_data = {lock = 2, __count = 0, __owner = 11212, __nusers = 9, __kind = 0, __spins = 0, __elision = 0, __list = {_prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\314+\000\000\t", '\000' <repeats 26 times>, __align = 2}, file = 0x563401c07e4b "util/main-loop.c", line = 240, initialized = true}

From the call trace, we can see it is a deadlock: the QEMU main thread holds the global mutex while waiting for the COLO thread to end, and the COLO thread wants to acquire the global mutex, which causes a deadlock. So, we should release qemu_global_mutex before waiting for the COLO thread to end.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

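A minimal standalone reproduction of the deadlock shape and the fix (plain pthreads, compile with -pthread; not the QEMU code): the joining thread must drop the shared lock before pthread_join() if the thread being joined still needs that lock in order to exit.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *colo_like_thread(void *arg)
    {
        (void)arg;
        /* The worker needs the global lock before it can finish,
         * like the COLO thread taking the iothread lock. */
        pthread_mutex_lock(&global_mutex);
        puts("worker got the lock, exiting");
        pthread_mutex_unlock(&global_mutex);
        return NULL;
    }

    int main(void)
    {
        pthread_t th;

        pthread_mutex_lock(&global_mutex);
        pthread_create(&th, NULL, colo_like_thread, NULL);
        sleep(1);

        /* The fix: release the lock BEFORE joining. Joining while
         * still holding it deadlocks, as in the reported backtrace. */
        pthread_mutex_unlock(&global_mutex);
        pthread_join(th, NULL);

        pthread_mutex_lock(&global_mutex);   /* reacquire afterwards */
        pthread_mutex_unlock(&global_mutex);
        return 0;
    }
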
2021-11-03  Fixed qemu crash when guest power off in COLO mode  (Rao, Lei; 1 file, -1/+3)
This patch fixes the following crash:

qemu-system-x86_64: invalid runstate transition: 'shutdown' -> 'running'
Aborted (core dumped)

The gdb backtrace is as follows:

0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
1  0x00007faa3d613859 in __GI_abort () at abort.c:79
2  0x000055c5a21268fd in runstate_set (new_state=RUN_STATE_RUNNING) at vl.c:723
3  0x000055c5a1f8cae4 in vm_prepare_start () at /home/workspace/colo-qemu/cpus.c:2206
4  0x000055c5a1f8cb1b in vm_start () at /home/workspace/colo-qemu/cpus.c:2213
5  0x000055c5a2332bba in migration_iteration_finish (s=0x55c5a4658810) at migration/migration.c:3376
6  0x000055c5a2332f3b in migration_thread (opaque=0x55c5a4658810) at migration/migration.c:3527
7  0x000055c5a251d68a in qemu_thread_start (args=0x55c5a5491a70) at util/qemu-thread-posix.c:519
8  0x00007faa3d7e9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
9  0x00007faa3d710293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>