aboutsummaryrefslogtreecommitdiff
path: root/migration
AgeCommit message (Collapse)AuthorFilesLines
2023-10-11migration: Add migration_rp_wait|kick()Peter Xu3-11/+34
It's just a simple wrapper for rp_sem on either wait() or kick(), make it even clearer on how it is used. Prepared to be used even for other things. Reviewed-by: Fabiano Rosas <farosas@suse.de> Signed-off-by: Peter Xu <peterx@redhat.com> Message-ID: <20231004220240.167175-8-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2023-10-11migration: Remember num of ramblocks to sync during recoveryPeter Xu1-3/+14
Instead of only relying on the count of rp_sem, make the counter be part of RAMState so it can be used in both threads to synchronize on the process. rp_sem will be further reused in follow up patches, as a way to kick the main thread, e.g., on recovery failures. Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231004220240.167175-7-peterx@redhat.com>
2023-10-11qemufile: Always return a verbose errorPeter Xu1-3/+12
There're a lot of cases where we only have an errno set in last_error but without a detailed error description. When this happens, try to generate an error contains the errno as a descriptive error. This will be helpful in cases where one relies on the Error*. E.g., migration state only caches Error* in MigrationState.error. With this, we'll display correct error messages in e.g. query-migrate when the error was only set by qemu_file_set_error(). Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231004220240.167175-6-peterx@redhat.com>
2023-10-11migration: Introduce migrate_has_error()Peter Xu2-0/+8
Introduce a helper to detect whether MigrationState.error is set for whatever reason. This is preparation work for any thread (e.g. source return path thread) to setup errors in an unified way to MigrationState, rather than relying on its own way to set errors (mark_source_rp_bad()). Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231004220240.167175-3-peterx@redhat.com>
2023-10-11migration: Display error in query-migrate irrelevant of statusPeter Xu1-3/+5
Display it as long as being set, irrelevant of FAILED status. E.g., it may also be applicable to PAUSED stage of postcopy, to provide hint on what has gone wrong. The error_mutex seems to be overlooked when referencing the error, add it to be very safe. This will change QAPI behavior by showing up error message outside !FAILED status, but it's intended and doesn't expect to break anyone. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018404 Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231004220240.167175-2-peterx@redhat.com>
2023-10-11migration/rdma: Replace flawed device detail dump by tracingMarkus Armbruster2-15/+10
qemu_rdma_dump_id() dumps RDMA device details to stdout. rdma_start_outgoing_migration() calls it via qemu_rdma_source_init() and qemu_rdma_resolve_host() to show source device details. rdma_start_incoming_migration() arranges its call via rdma_accept_incoming_migration() and qemu_rdma_accept() to show destination device details. Two issues: 1. rdma_start_outgoing_migration() can run in HMP context. The information should arguably go the monitor, not stdout. 2. ibv_query_port() failure is reported as error. Its callers remain unaware of this failure (qemu_rdma_dump_id() can't fail), so reporting this to the user as an error is problematic. Fixable, but the device detail dump is noise, except when troubleshooting. Tracing is a better fit. Similar function qemu_rdma_dump_id() was converted to tracing in commit 733252deb8b (Tracify migration/rdma.c). Convert qemu_rdma_dump_id(), too. While there, touch up qemu_rdma_dump_gid()'s outdated comment. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-54-armbru@redhat.com>
2023-10-11migration/rdma: Use error_report() & friends instead of stderrMarkus Armbruster1-23/+21
error_report() obeys -msg, reports the current error location if any, and reports to the current monitor if any. Reporting to stderr directly with fprintf() or perror() is wrong, because it loses all this. Fix the offenders. Bonus: resolves a FIXME about problematic use of errno. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-53-armbru@redhat.com>
2023-10-11migration/rdma: Downgrade qemu_rdma_cleanup() errors to warningsMarkus Armbruster1-2/+2
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_source_init(), qemu_rdma_connect(), rdma_start_incoming_migration(), and rdma_start_outgoing_migration() violate this principle: they call error_report() via qemu_rdma_cleanup(). Moreover, qemu_rdma_cleanup() can't fail. It is called on error paths, and QIOChannel close and finalization. Are the conditions it reports really errors? I doubt it. Downgrade qemu_rdma_cleanup()'s errors to warnings. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-52-armbru@redhat.com>
2023-10-11migration/rdma: Silence qemu_rdma_register_and_get_keys()Markus Armbruster1-9/+0
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_write_one() violates this principle: it reports errors to stderr via qemu_rdma_register_and_get_keys(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up: silence qemu_rdma_register_and_get_keys(). I believe the caller's error reports suffice. If they don't, we need to convert to Error instead. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-51-armbru@redhat.com>
2023-10-11migration/rdma: Silence qemu_rdma_block_for_wrid()Markus Armbruster1-16/+0
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_post_send_control(), qemu_rdma_exchange_get_response(), and qemu_rdma_write_one() violate this principle: they call error_report(), fprintf(stderr, ...), and perror() via qemu_rdma_block_for_wrid(), qemu_rdma_poll(), and qemu_rdma_wait_comp_channel(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by dropping the error reporting from qemu_rdma_poll(), qemu_rdma_wait_comp_channel(), and qemu_rdma_block_for_wrid(). I believe the callers' error reports suffice. If they don't, we need to convert to Error instead. Bonus: resolves a FIXME about problematic use of errno. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-50-armbru@redhat.com>
2023-10-11migration/rdma: Don't report received completion events as errorMarkus Armbruster1-2/+2
When qemu_rdma_wait_comp_channel() receives an event from the completion channel, it reports an error "receive cm event while wait comp channel,cm event is T", where T is the numeric event type. However, the function fails only when T is a disconnect or device removal. Events other than these two are not actually an error, and reporting them as an error is wrong. If we need to report them to the user, we should use something else, and what to use depends on why we need to report them to the user. For now, report this error only when the function actually fails. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-49-armbru@redhat.com>
2023-10-11migration/rdma: Silence qemu_rdma_reg_control()Markus Armbruster1-1/+0
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_source_init() and qemu_rdma_accept() violate this principle: they call error_report() via qemu_rdma_reg_control(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by dropping the error reporting from qemu_rdma_reg_control(). I believe the callers' error reports suffice. If they don't, we need to convert to Error instead. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-48-armbru@redhat.com>
2023-10-11migration/rdma: Silence qemu_rdma_connect()Markus Armbruster1-10/+4
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_connect() violates this principle: it calls error_report() and perror(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up: replace perror() by changing error_setg() to error_setg_errno(), and drop error_report(). I believe the callers' error reports suffice then. If they don't, we need to convert to Error instead. Bonus: resolves a FIXME about problematic use of errno. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-47-armbru@redhat.com>
2023-10-11migration/rdma: Silence qemu_rdma_resolve_host()Markus Armbruster1-1/+0
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_resolve_host() violates this principle: it calls error_report(). Clean this up: drop error_report(). Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-46-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_alloc_pd_cq() to ErrorMarkus Armbruster1-13/+9
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_source_init() violates this principle: it calls error_report() via qemu_rdma_alloc_pd_cq(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_alloc_pd_cq() to Error. The conversion loses a piece of advice on one of two failure paths: Your mlock() limits may be too low. Please check $ ulimit -a # and search for 'ulimit -l' in the output Not worth retaining. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-45-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_post_recv_control() to ErrorMarkus Armbruster1-12/+10
Just for symmetry with qemu_rdma_post_send_control(). Error messages lose detail I consider of no use to users. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-44-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_post_send_control() to ErrorMarkus Armbruster1-14/+17
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_exchange_send() violates this principle: it calls error_report() via qemu_rdma_post_send_control(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_post_send_control() to Error. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-43-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_write() to ErrorMarkus Armbruster1-10/+6
Just for consistency with qemu_rdma_write_one() and qemu_rdma_write_flush(), and for slightly simpler code. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-42-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_write_one() to ErrorMarkus Armbruster1-20/+12
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_write_flush() violates this principle: it calls error_report() via qemu_rdma_write_one(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_write_one() to Error. Bonus: resolves a FIXME about problematic use of errno. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-41-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_write_flush() to ErrorMarkus Armbruster1-6/+13
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qio_channel_rdma_writev() violates this principle: it calls error_report() via qemu_rdma_write_flush(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_write_flush() to Error. Necessitates setting an error when qemu_rdma_write_one() failed. Since this error will go away later in this series, simply use "FIXME temporary error message" there. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-40-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_reg_whole_ram_blocks() to ErrorMarkus Armbruster1-13/+13
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_exchange_send() violates this principle: it calls error_report() via callback qemu_rdma_reg_whole_ram_blocks(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting the callback to Error. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-39-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_exchange_get_response() to ErrorMarkus Armbruster1-11/+11
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qemu_rdma_exchange_send() and qemu_rdma_exchange_recv() violate this principle: they call error_report() via qemu_rdma_exchange_get_response(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_exchange_get_response() to Error. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-38-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_exchange_send() to ErrorMarkus Armbruster1-13/+27
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qio_channel_rdma_writev() violates this principle: it calls error_report() via qemu_rdma_exchange_send(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_exchange_send() to Error. Necessitates setting an error when qemu_rdma_post_recv_control(), callback(), or qemu_rdma_exchange_get_response() failed. Since these errors will go away later in this series, simply use "FIXME temporary error message" there. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-37-armbru@redhat.com>
2023-10-11migration/rdma: Convert qemu_rdma_exchange_recv() to ErrorMarkus Armbruster1-6/+9
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. qio_channel_rdma_readv() violates this principle: it calls error_report() via qemu_rdma_exchange_recv(). I elected not to investigate how callers handle the error, i.e. precise impact is not known. Clean this up by converting qemu_rdma_exchange_recv() to Error. Necessitates setting an error when qemu_rdma_exchange_get_response() failed. Since this error will go away later in this series, simply use "FIXME temporary error message" there. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-36-armbru@redhat.com>
2023-10-11migration/rdma: Drop "@errp is clear" guards around error_setg()Markus Armbruster1-113/+51
These guards are all redundant now. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-35-armbru@redhat.com>
2023-10-11migration/rdma: Fix error handling around rdma_getaddrinfo()Markus Armbruster1-4/+21
qemu_rdma_resolve_host() and qemu_rdma_dest_init() iterate over addresses to find one that works, holding onto the first Error from qemu_rdma_broken_ipv6_kernel() for use when no address works. Issues: 1. If @errp was &error_abort or &error_fatal, we'd terminate instead of trying the next address. Can't actually happen, since no caller passes these arguments. 2. When @errp is a pointer to a variable containing NULL, and qemu_rdma_broken_ipv6_kernel() fails, the variable no longer contains NULL. Subsequent iterations pass it again, violating Error usage rules. Dangerous, as setting an error would then trip error_setv()'s assertion. Works only because qemu_rdma_broken_ipv6_kernel() and the code following the loops carefully avoids setting a second error. 3. If qemu_rdma_broken_ipv6_kernel() fails, and then a later iteration finds a working address, @errp still holds the first error from qemu_rdma_broken_ipv6_kernel(). If we then run into another error, we report the qemu_rdma_broken_ipv6_kernel() failure instead. 4. If we don't run into another error, we leak the Error object. Use a local error variable, and propagate to @errp. This fixes 3. and also cleans up 1 and partly 2. Free this error when we have a working address. This fixes 4. Pass the local error variable to qemu_rdma_broken_ipv6_kernel() only until it fails. Pass null on any later iterations. This cleans up the remainder of 2. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-34-armbru@redhat.com>
2023-10-11migration/rdma: Retire macro ERROR()Markus Armbruster1-48/+120
ERROR() has become "error_setg() unless an error has been set already". Hiding the conditional in the macro is in the way of further work. Replace the macro uses by their expansion, and delete the macro. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-33-armbru@redhat.com>
2023-10-11migration/rdma: Delete inappropriate error_report() in macro ERROR()Markus Armbruster1-4/+0
Functions that use an Error **errp parameter to return errors should not also report them to the user, because reporting is the caller's job. When the caller does, the error is reported twice. When it doesn't (because it recovered from the error), there is no error to report, i.e. the report is bogus. Macro ERROR() violates this principle. Delete the error_report() there. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-32-armbru@redhat.com>
2023-10-11migration/rdma: Plug a memory leak and improve a messageMarkus Armbruster1-2/+2
When migration capability @rdma-pin-all is true, but the server cannot honor it, qemu_rdma_connect() calls macro ERROR(), then returns success. ERROR() sets an error. Since qemu_rdma_connect() returns success, its caller rdma_start_outgoing_migration() duly assumes @errp is still clear. The Error object leaks. ERROR() additionally reports the situation to the user as an error: RDMA ERROR: Server cannot support pinning all memory. Will register memory dynamically. Is this an error or not? It actually isn't; we disable @rdma-pin-all and carry on. "Correcting" the user's configuration decisions that way feels problematic, but that's a topic for another day. Replace ERROR() by warn_report(). This plugs the memory leak, and emits a clearer message to the user. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-31-armbru@redhat.com>
2023-10-11migration/rdma: Check negative error values the same way everywhereMarkus Armbruster1-41/+41
When a function returns 0 on success, negative value on error, checking for non-zero suffices, but checking for negative is clearer. So do that. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-30-armbru@redhat.com>
2023-10-11migration/rdma: Drop superfluous assignments to @retMarkus Armbruster1-25/+10
Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-29-armbru@redhat.com>
2023-10-11migration/rdma: Replace int error_state by bool erroredMarkus Armbruster1-33/+33
All we do with the value of RDMAContext member @error_state is test whether it's zero. Change to bool and rename to @errored. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-28-armbru@redhat.com>
2023-10-11migration/rdma: Dumb down remaining int error values to -1Markus Armbruster1-22/+23
This is just to make the error value more obvious. Callers don't mind. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-27-armbru@redhat.com>
2023-10-11migration/rdma: Return -1 instead of negative errno codeMarkus Armbruster1-22/+22
Several functions return negative errno codes on failure. Callers check for specific codes exactly never. For some of the functions, callers couldn't check even if they wanted to, because the functions also return negative values that aren't errno codes, leaving readers confused on what the function actually returns. Clean up and simplify: return -1 instead of negative errno code. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-26-armbru@redhat.com>
2023-10-11migration/rdma: Fix rdma_getaddrinfo() error checkingMarkus Armbruster1-8/+6
rdma_getaddrinfo() returns 0 on success. On error, it returns one of the EAI_ error codes like getaddrinfo() does, or -1 with errno set. This is broken by design: POSIX implicitly specifies the EAI_ error codes to be non-zero, no more. They could clash with -1. Nothing we can do about this design flaw. Both callers of rdma_getaddrinfo() only recognize negative values as error. Works only because systems elect to make the EAI_ error codes negative. Best not to rely on that: change the callers to treat any non-zero value as failure. Also change them to return -1 instead of the value received from getaddrinfo() on failure, to avoid positive error values. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-25-armbru@redhat.com>
2023-10-11migration/rdma: Fix QEMUFileHooks method return valuesMarkus Armbruster1-42/+37
The QEMUFileHooks methods don't come with a written contract. Digging through the code calling them, we find: * save_page(): Negative values RAM_SAVE_CONTROL_DELAYED and RAM_SAVE_CONTROL_NOT_SUPP are special. Any other negative value is an unspecified error. qemu_rdma_save_page() returns -EIO or rdma->error_state on error. I believe the latter is always negative. Nothing stops either of them to clash with the special values, though. Feels unlikely, but fix it anyway to return only the special values and -1. * before_ram_iterate(), after_ram_iterate(): Negative value means error. qemu_rdma_registration_start() and qemu_rdma_registration_stop() comply as far as I can tell. Make them comply *obviously*, by returning -1 on error. * hook_ram_load: Negative value means error. rdma_load_hook() already returns -1 on error. Leave it alone. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-24-armbru@redhat.com>
2023-10-11migration/rdma: Drop dead qemu_rdma_data_init() code for !@host_portMarkus Armbruster1-17/+14
qemu_rdma_data_init() neglects to set an Error when it fails because @host_port is null. Fortunately, no caller passes null, so this is merely a latent bug. Drop the flawed code handling null argument. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-23-armbru@redhat.com>
2023-10-11migration/rdma: Fix qemu_get_cm_event_timeout() to always set errorMarkus Armbruster1-2/+8
qemu_get_cm_event_timeout() neglects to set an error when it fails because rdma_get_cm_event() fails. Harmless, as its caller qemu_rdma_connect() substitutes a generic error then. Fix it anyway. qemu_rdma_connect() also sets the generic error when its own call of rdma_get_cm_event() fails. Make the error handling more obvious: set a specific error right after rdma_get_cm_event() fails. Delete the generic error. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-22-armbru@redhat.com>
2023-10-11migration/rdma: Fix qemu_rdma_broken_ipv6_kernel() to set errorMarkus Armbruster1-0/+2
qemu_rdma_resolve_host() and qemu_rdma_dest_init() try addresses until they find on that works. If none works, they return the first Error set by qemu_rdma_broken_ipv6_kernel(), or else return a generic one. qemu_rdma_broken_ipv6_kernel() neglects to set an Error when ibv_open_device() fails. If a later address fails differently, we use that Error instead, or else the generic one. Harmless enough, but needs fixing all the same. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-21-armbru@redhat.com>
2023-10-11migration/rdma: Replace dangerous macro CHECK_ERROR_STATE()Markus Armbruster1-16/+27
Hiding return statements in macros is a bad idea. Use a function instead, and open code the return part. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-20-armbru@redhat.com>
2023-10-11migration/rdma: Fix io_writev(), io_readv() methods to obey contractMarkus Armbruster1-2/+10
QIOChannelClass methods qio_channel_rdma_readv() and qio_channel_rdma_writev() violate their method contract when rdma->error_state is non-zero: 1. They return whatever is in rdma->error_state then. Only -1 will be fine. -2 will be misinterpreted as "would block". Anything less than -2 isn't defined in the contract. A positive value would be misinterpreted as success, but I believe that's not actually possible. 2. They neglect to set an error then. If something up the call stack dereferences the error when failure is returned, it will crash. If it ignores the return value and checks the error instead, it will miss the error. Crap like this happens when return statements hide in macros, especially when their uses are far away from the definition. I elected not to investigate how callers are impacted. Expand the two bad macro uses, so we can set an error and return -1. The next commit will then get rid of the macro altogether. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-19-armbru@redhat.com>
2023-10-11migration/rdma: Ditch useless numeric error codes in error messagesMarkus Armbruster1-10/+10
Several error messages include numeric error codes returned by failed functions: * ibv_poll_cq() returns an unspecified negative value. Useless. * rdma_accept and rdma_get_cm_event() return -1. Useless. * qemu_rdma_poll() returns either -1 or an unspecified negative value. Useless. * qemu_rdma_block_for_wrid(), qemu_rdma_write_flush(), qemu_rdma_exchange_send(), qemu_rdma_exchange_recv(), qemu_rdma_write() return a negative value that may or may not be an errno value. While reporting human-readable errno information (which a number is not) can be useful, reporting an error code that may or may not be an errno value is useless. Drop these error codes from the error messages. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-18-armbru@redhat.com>
2023-10-11migration/rdma: Fix or document problematic uses of errnoMarkus Armbruster1-6/+39
We use errno after calling Libibverbs functions that are not documented to set errno (manual page does not mention errno), or where the documentation is unclear ("returns [...] the value of errno on failure"). While this could be read as "sets errno and returns it", a glance at the source code[*] kills that hope: static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) { return qp->context->ops.post_send(qp, wr, bad_wr); } The callback can be static int mana_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, struct ibv_send_wr **bad) { /* This version of driver supports RAW QP only. * Posting WR is done directly in the application. */ return EOPNOTSUPP; } Neither of them touches errno. One of these errno uses is easy to fix, so do that now. Several more will go away later in the series; add temporary FIXME commments. Three will remain; add TODO comments. TODO, not FIXME, because the bug might be in Libibverbs documentation. [*] https://github.com/linux-rdma/rdma-core.git commit 55fa316b4b18f258d8ac1ceb4aa5a7a35b094dcf Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-17-armbru@redhat.com>
2023-10-11migration/rdma: Use bool for two RDMAContext flagsMarkus Armbruster1-3/+3
@error_reported and @received_error are flags. The latter is even assigned bool true. Change them from int to bool. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-16-armbru@redhat.com>
2023-10-11migration/rdma: Make qemu_rdma_buffer_mergeable() return boolMarkus Armbruster1-10/+10
qemu_rdma_buffer_mergeable() is semantically a predicate. It returns int 0 or 1. Return bool instead, and fix the function name's spelling. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-15-armbru@redhat.com>
2023-10-11migration/rdma: Drop qemu_rdma_search_ram_block() error handlingMarkus Armbruster1-16/+8
qemu_rdma_search_ram_block() can't fail. Return void, and drop the unreachable error handling. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-14-armbru@redhat.com>
2023-10-11migration/rdma: Drop rdma_add_block() error handlingMarkus Armbruster1-21/+9
rdma_add_block() can't fail. Return void, and drop the unreachable error handling. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-13-armbru@redhat.com>
2023-10-11migration/rdma: Eliminate error_propagate()Markus Armbruster1-12/+7
When all we do with an Error we receive into a local variable is propagating to somewhere else, we can just as well receive it there right away. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-12-armbru@redhat.com>
2023-10-11migration/rdma: Put @errp parameter lastMarkus Armbruster1-3/+4
include/qapi/error.h demands: * - Functions that use Error to report errors have an Error **errp * parameter. It should be the last parameter, except for functions * taking variable arguments. qemu_rdma_connect() does not conform. Clean it up. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-11-armbru@redhat.com>
2023-10-11migration/rdma: Fix qemu_rdma_accept() to return failure on errorsMarkus Armbruster1-7/+12
qemu_rdma_accept() returns 0 in some cases even when it didn't complete its job due to errors. Impact is not obvious. I figure the caller will soon fail again with a misleading error message. Fix it to return -1 on any failure. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20230928132019.2544702-10-armbru@redhat.com>