From 20471e00e2ea400c5a7126a98fb58c6a83b50628 Mon Sep 17 00:00:00 2001
From: Simon Marchi
Date: Tue, 18 Jan 2022 14:33:57 -0500
Subject: gdbserver/linux: set lwp !stopped when failing to resume

I see some failures, at least in gdb.multi/multi-re-run.exp and
gdb.threads/interrupted-hand-call.exp.  Running `stress -C $(nproc)` at
the same time as the test makes those failures relatively frequent.

Let's take gdb.multi/multi-re-run.exp as an example.  The failure looks
like this, an unexpected "no resumed":

    continue
    Continuing.
    No unwaited-for children left.
    (gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=2: iter=1: continue until exit

The situation is:

 - Inferior 1 is stopped somewhere, it won't really play a role here.
 - Inferior 2 has 2 threads, both stopped.
 - We resume inferior 2, the leader thread is expected to exit, making
   the process exit.

From GDB's perspective, a failing run looks like this:

    [infrun] fetch_inferior_event: enter
    [infrun] scoped_disable_commit_resumed: reason=handling event
    [infrun] do_target_wait: Found 2 inferiors, starting at #1
    [infrun] random_pending_event_thread: None found.
    [remote] wait: enter
    [remote] Packet received: T0506:20dcffffff7f0000;07:20dcffffff7f0000;10:9551555555550000;thread:pae4cd.ae4cd;core:e;
    [remote] wait: exit
    [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
    [infrun] print_target_wait_results: 713933.713933.0 [Thread 713933.713933],
    [infrun] print_target_wait_results: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
    [infrun] handle_inferior_event: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
    [infrun] clear_step_over_info: clearing step over info
    [infrun] context_switch: Switching context from 0.0.0 to 713933.713933.0
    [infrun] handle_signal_stop: stop_pc=0x555555555195
    [infrun] start_step_over: enter
    [infrun] start_step_over: stealing global queue of threads to step, length = 0
    [infrun] operator(): step-over queue now empty
    [infrun] start_step_over: exit
    [infrun] process_event_stop_test: no stepping, continue
    [remote] Sending packet: $Z0,555555555194,1#8e
    [remote] Packet received: OK
    [infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [713933.713933.0] at 0x555555555195
    [remote] Sending packet: $QPassSignals:e;10;14;17;1a;1b;1c;21;24;25;2c;4c;97;#0a
    [remote] Packet received: OK
    [remote] Sending packet: $vCont;c:pae4cd.-1#9f
    [infrun] prepare_to_wait: prepare_to_wait
    [infrun] reset: reason=handling event
    [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
    [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
    [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
    [infrun] fetch_inferior_event: exit
    [infrun] fetch_inferior_event: enter
    [infrun] scoped_disable_commit_resumed: reason=handling event
    [infrun] do_target_wait: Found 2 inferiors, starting at #0
    [infrun] random_pending_event_thread: None found.
    [remote] wait: enter
    [remote] Packet received: N
    [remote] wait: exit
    [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
    [infrun] print_target_wait_results: -1.0.0 [process -1],
    [infrun] print_target_wait_results: status->kind = NO_RESUMED
    [infrun] handle_inferior_event: status->kind = NO_RESUMED
    [remote] Sending packet: $Hgp0.0#ad
    [remote] Packet received: OK
    [remote] Sending packet: $qXfer:threads:read::0,1000#92
    [remote] Packet received: l\n\n\n\n\n
    [infrun] stop_waiting: stop_waiting
    [remote] Sending packet: $qXfer:threads:read::0,1000#92
    [remote] Packet received: l\n\n\n\n\n
    [infrun] infrun_async: enable=0
    [infrun] reset: reason=handling event
    [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
    [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
    [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
    [infrun] fetch_inferior_event: exit

We can see that we resume the inferior with vCont;c, but get NO_RESUMED
back.  When the test passes, we get an EXITED status to indicate the
process has exited.

From GDBserver's point of view, it looks like this.  The logs include
some logging that I added and that is part of this patch.

    [remote] getpkt: getpkt ("vCont;c:pae4cf.-1"); [no ack sent]
    [threads] resume: enter
    [threads] thread_needs_step_over: Need step over [LWP 713931]? Ignoring, should remain stopped
    [threads] thread_needs_step_over: Need step over [LWP 713932]? Ignoring, should remain stopped
    [threads] get_pc: pc is 0x555555555195
    [threads] thread_needs_step_over: Need step over [LWP 713935]? No, no breakpoint found at 0x555555555195
    [threads] get_pc: pc is 0x7ffff7d35a95
    [threads] thread_needs_step_over: Need step over [LWP 713936]? No, no breakpoint found at 0x7ffff7d35a95
    [threads] resume: Resuming, no pending status or step over needed
    [threads] resume_one_thread: resuming LWP 713935
    [threads] proceed_one_lwp: lwp 713935
    [threads] resume_one_lwp_throw: continue from pc 0x555555555195
    [threads] resume_one_lwp_throw: Resuming lwp 713935 (continue, signal 0, stop not expected)
    [threads] resume_one_lwp_throw: NOW ptid=713935.713935.0 stopped=0 resumed=0
    [threads] resume_one_thread: resuming LWP 713936
    [threads] proceed_one_lwp: lwp 713936
    [threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
    [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
    [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
    [threads] resume: exit
    [threads] wait_1: enter
    [threads] wait_1: []
    [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
    [threads] resume_stopped_resumed_lwps: resuming stopped-resumed LWP LWP 713935.713936 at 7ffff7d35a95: step=0
    [threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
    [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
    [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
    [threads] operator(): check_zombie_leaders: leader_pid=713931, leader_lp!=NULL=1, num_lwps=2, zombie=0
    [threads] operator(): check_zombie_leaders: leader_pid=713935, leader_lp!=NULL=1, num_lwps=2, zombie=1
    [threads] operator(): Thread group leader 713935 zombie (it exited, or another thread execd).
    [threads] delete_lwp: deleting 713935
    [threads] wait_for_event_filtered: exit (no unwaited-for LWP)
    sigchld_handler
    [threads] wait_1: ret = null_ptid, TARGET_WAITKIND_NO_RESUMED
    [threads] wait_1: exit

What happens is:

 - We resume the leader (713935) successfully.
 - The leader exits.
 - We resume the secondary thread (713936), we get ESRCH.  This is
   expected since the leader has exited.
 - resume_one_lwp_throw throws, it's caught by resume_one_lwp.
 - resume_one_lwp checks with check_ptrace_stopped_lwp_gone that the
   failure can be explained by the LWP becoming zombie, and swallows the
   error.
 - Note that this means that the secondary lwp still has stopped==1.
 - wait_1 is called, probably because linux_process_target::resume marks
   the async pipe at the end.
 - The exit event isn't ready yet, probably because the machine is under
   load, so waitpid returns nothing.
 - check_zombie_leaders detects that the leader is zombie and deletes
   it.
 - We try to find a resumed (non-stopped) LWP to get an event from,
   there's none since the leader (that was resumed) is now deleted, and
   the secondary thread is still marked stopped.
   wait_for_event_filtered returns -1, causing wait_1 to return
   NO_RESUMED.

What I notice here is that there is some kind of race between the
availability of the process' exit notification and the call to wait_1
that results from marking the async pipe at the end of resume.

I think what we want from this wait_1 invocation is to keep waiting, as
we will eventually get thread exit notifications for both of our
threads.

The fix I came up with is to mark the secondary thread as !stopped (or
resumed) when we fail to resume it.  This makes wait_1 see that there is
at least one resumed lwp, so it won't return NO_RESUMED.  I think it
makes sense to consider it resumed, because we are going to receive an
exit event for it.  Here are the GDBserver logs with the fix applied:

    [threads] resume: enter
    [threads] thread_needs_step_over: Need step over [LWP 724595]? Ignoring, should remain stopped
    [threads] thread_needs_step_over: Need step over [LWP 724596]? Ignoring, should remain stopped
    [threads] get_pc: pc is 0x555555555195
    [threads] thread_needs_step_over: Need step over [LWP 724597]? No, no breakpoint found at 0x555555555195
    [threads] get_pc: pc is 0x7ffff7d35a95
    [threads] thread_needs_step_over: Need step over [LWP 724598]? No, no breakpoint found at 0x7ffff7d35a95
    [threads] resume: Resuming, no pending status or step over needed
    [threads] resume_one_thread: resuming LWP 724597
    [threads] proceed_one_lwp: lwp 724597
    [threads] resume_one_lwp_throw: continue from pc 0x555555555195
    [threads] resume_one_lwp_throw: Resuming lwp 724597 (continue, signal 0, stop not expected)
    [threads] resume_one_lwp_throw: NOW ptid=724597.724597.0 stopped=0 resumed=0
    [threads] resume_one_thread: resuming LWP 724598
    [threads] proceed_one_lwp: lwp 724598
    [threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
    [threads] resume_one_lwp_throw: Resuming lwp 724598 (continue, signal 0, stop not expected)
    [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
    [threads] resume: exit
    [threads] wait_1: enter
    [threads] wait_1: []
    sigchld_handler
    [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
    [threads] operator(): check_zombie_leaders: leader_pid=724595, leader_lp!=NULL=1, num_lwps=2, zombie=0
    [threads] operator(): check_zombie_leaders: leader_pid=724597, leader_lp!=NULL=1, num_lwps=2, zombie=1
    [threads] operator(): Thread group leader 724597 zombie (it exited, or another thread execd).
    [threads] delete_lwp: deleting 724597
    [threads] wait_for_event_filtered: sigsuspend'ing
    sigchld_handler
    [threads] wait_for_event_filtered: waitpid(-1, ...) returned 724598, ERRNO-OK
    [threads] wait_for_event_filtered: waitpid 724598 received 0 (exited)
    [threads] filter_event: 724598 exited
    [threads] wait_for_event_filtered: waitpid(-1, ...) returned 724597, ERRNO-OK
    [threads] wait_for_event_filtered: waitpid 724597 received 0 (exited)
    [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
    sigchld_handler
    [threads] wait_1: ret = LWP 724597.724598, exited with retcode 0
    [threads] wait_1: exit

Change-Id: Idf0bdb4cb0313f1b49e4864071650cc83fb3c100
---
 gdbserver/linux-low.cc | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

(limited to 'gdbserver')

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 7726a4a..d68d419 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -4091,7 +4091,15 @@ linux_process_target::resume_one_lwp_throw (lwp_info *lwp, int step,
           (PTRACE_TYPE_ARG4) (uintptr_t) signal);

   if (errno)
-    perror_with_name ("resuming thread");
+    {
+      int saved_errno = errno;
+
+      threads_debug_printf ("ptrace errno = %d (%s)",
+                            saved_errno, strerror (saved_errno));
+
+      errno = saved_errno;
+      perror_with_name ("resuming thread");
+    }

   /* Successfully resumed.  Clear state that no longer makes sense,
      and mark the LWP as running.  Must not do this before resuming
@@ -4152,7 +4160,15 @@ linux_process_target::resume_one_lwp (lwp_info *lwp, int step, int signal,
     }
   catch (const gdb_exception_error &ex)
     {
-      if (!check_ptrace_stopped_lwp_gone (lwp))
+      if (check_ptrace_stopped_lwp_gone (lwp))
+        {
+          /* This could be because we tried to resume an LWP after its
+             leader exited.  Mark it as resumed, so we can collect an
+             exit event from it.  */
+          lwp->stopped = 0;
+          lwp->stop_reason = TARGET_STOPPED_BY_NO_REASON;
+        }
+      else
         throw;
     }
 }
--
cgit v1.1
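
The race described in the commit message can be pictured with a small
stand-alone model.  This is only a sketch under stated assumptions:
toy_lwp, toy_wait and toy_race are hypothetical names, not gdbserver
code, and the model reduces wait_1 to the single question of whether any
known LWP is still marked !stopped.

```cpp
#include <cassert>
#include <vector>

/* Toy model only.  These types and functions are hypothetical and are
   not gdbserver's real data structures.  */
struct toy_lwp
{
  int pid;
  bool stopped;   /* True while we believe the LWP is ptrace-stopped.  */
};

enum class toy_wait_result { no_resumed, keep_waiting };

/* Model of the decision described above: if no known LWP is marked as
   resumed (!stopped), there is nothing left to wait for.  */
toy_wait_result
toy_wait (const std::vector<toy_lwp> &lwps)
{
  for (const toy_lwp &lwp : lwps)
    if (!lwp.stopped)
      return toy_wait_result::keep_waiting;

  return toy_wait_result::no_resumed;
}

/* Replay the sequence of events from the commit message.  The flag
   selects the behavior without (false) or with (true) the fix.  */
toy_wait_result
toy_race (bool mark_resumed_on_failure)
{
  std::vector<toy_lwp> lwps = { { 100, true }, { 101, true } };

  /* The leader is resumed successfully, then exits.  */
  lwps[0].stopped = false;

  /* Resuming the secondary thread fails (ESRCH in the real logs).  */
  if (mark_resumed_on_failure)
    lwps[1].stopped = false;

  /* The zombie leader is deleted before any exit event is available.  */
  lwps.erase (lwps.begin ());

  return toy_wait (lwps);
}

/* Small self-check of the two scenarios.  */
void
toy_check ()
{
  assert (toy_race (false) == toy_wait_result::no_resumed);
  assert (toy_race (true) == toy_wait_result::keep_waiting);
}
```

In this model, leaving the secondary thread marked stopped reproduces
the spurious "no resumed" outcome, while marking it !stopped on a failed
resume keeps the wait going until the exit events can be collected.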