Age | Commit message (Collapse) | Author | Files | Lines |
|
While looking at the regcache code, I noticed that the address space
(passed to regcache when constructing it, and available through
regcache::aspace) wasn't relevant for the regcache itself. Callers of
regcache::aspace use that method because it appears to be a convenient
way of getting the address space for a thread, if you already have the
regcache. But there is always another way to get the address space, as
the callers pretty much always know which thread they are dealing with.
The regcache code itself doesn't use the address space.
This patch removes anything related to address_space from the regcache
code, and updates callers to get it from the thread in context. This
removes a bit of unnecessary complexity from the regcache code.
The current get_thread_arch_regcache function gets an address_space for
the given thread using the target_thread_address_space function (which
calls the target_ops::thread_address_space method). This suggest that
there might have been the intention of supporting per-thread address
spaces. But digging through the history, I did not find any such case.
Maybe this method was just added because we needed a way to get an
address space from a ptid (because constructing a regcache required an
address space), and this seemed like the right way to do it, I don't
know.
The only implementations of thread_address_space and
process_stratum_target::thread_address_space and
linux_nat_target::thread_address_space, which essentially just return
the inferior's address space. And thread_address_space is only used in
the current get_thread_arch_regcache, which gets removed. So, I think
that the thread_address_space target method can be removed, and we can
assume that it's fine to use the inferior's address space everywhere.
Callers of regcache::aspace are updated to get the address space from
the relevant inferior, either using some context they already know
about, or in last resort using the current global context.
So, to summarize:
- remove everything in regcache related to address spaces
- in particular, remove get_thread_arch_regcache, and rename
get_thread_arch_aspace_regcache to get_thread_arch_regcache
- remove target_ops::thread_address_space, and
target_thread_address_space
- adjust all users of regcache::aspace to get the address space another
way
Change-Id: I04fd41b22c83fe486522af7851c75bcfb31c88c7
|
|
This implements support for the new GDB_THREAD_OPTION_EXIT thread
option for native Linux.
Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: Ia69fc0b9b96f9af7de7cefc1ddb1fba9bbb0bb90
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27338
|
|
Currently, infrun assumes that when TARGET_WAITKIND_THREAD_EXITED is
reported, the corresponding GDB thread has already been removed from
the GDB thread list.
Later in the series, that will no longer work, as infrun will need to
refer to the thread's thread_info when it processes
TARGET_WAITKIND_THREAD_EXITED.
As preparation, this patch makes deleting the GDB thread
responsibility of infrun, instead of the target.
Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: I013d87f61ffc9aaca49f0d6ce2a43e3ea69274de
|
|
This commit teaches the native Linux target about the
GDB_THREAD_OPTION_CLONE thread option. It's actually simpler to just
continue reporting all clone events unconditionally to the core.
There's never any harm in reporting a clone event when the option is
disabled. All we need to do is to report support for the option,
otherwise GDB falls back to use target_thread_events().
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: If90316e2dcd0c61d0fefa0d463c046011698acf9
|
|
(A good chunk of the problem statement in the commit log below is
Andrew's, adjusted for a different solution, and for covering
displaced stepping too. The testcase is mostly Andrew's too.)
This commit addresses bugs gdb/19675 and gdb/27830, which are about
stepping over a breakpoint set at a clone syscall instruction, one is
about displaced stepping, and the other about in-line stepping.
Currently, when a new thread is created through a clone syscall, GDB
sets the new thread running. With 'continue' this makes sense
(assuming no schedlock):
- all-stop mode, user issues 'continue', all threads are set running,
a newly created thread should also be set running.
- non-stop mode, user issues 'continue', other pre-existing threads
are not affected, but as the new thread is (sort-of) a child of the
thread the user asked to run, it makes sense that the new threads
should be created in the running state.
Similarly, if we are stopped at the clone syscall, and there's no
software breakpoint at this address, then the current behaviour is
fine:
- all-stop mode, user issues 'stepi', stepping will be done in place
(as there's no breakpoint to step over). While stepping the thread
of interest all the other threads will be allowed to continue. A
newly created thread will be set running, and then stopped once the
thread of interest has completed its step.
- non-stop mode, user issues 'stepi', stepping will be done in place
(as there's no breakpoint to step over). Other threads might be
running or stopped, but as with the continue case above, the new
thread will be created running. The only possible issue here is
that the new thread will be left running after the initial thread
has completed its stepi. The user would need to manually select
the thread and interrupt it, this might not be what the user
expects. However, this is not something this commit tries to
change.
The problem then is what happens when we try to step over a clone
syscall if there is a breakpoint at the syscall address.
- For both all-stop and non-stop modes, with in-line stepping:
+ user issues 'stepi',
+ [non-stop mode only] GDB stops all threads. In all-stop mode all
threads are already stopped.
+ GDB removes s/w breakpoint at syscall address,
+ GDB single steps just the thread of interest, all other threads
are left stopped,
+ New thread is created running,
+ Initial thread completes its step,
+ [non-stop mode only] GDB resumes all threads that it previously
stopped.
There are two problems in the in-line stepping scenario above:
1. The new thread might pass through the same code that the initial
thread is in (i.e. the clone syscall code), in which case it will
fail to hit the breakpoint in clone as this was removed so the
first thread can single step,
2. The new thread might trigger some other stop event before the
initial thread reports its step completion. If this happens we
end up triggering an assertion as GDB assumes that only the
thread being stepped should stop. The assert looks like this:
infrun.c:5899: internal-error: int finish_step_over(execution_control_state*): Assertion `ecs->event_thread->control.trap_expected' failed.
- For both all-stop and non-stop modes, with displaced stepping:
+ user issues 'stepi',
+ GDB starts the displaced step, moves thread's PC to the
out-of-line scratch pad, maybe adjusts registers,
+ GDB single steps the thread of interest, [non-stop mode only] all
other threads are left as they were, either running or stopped.
In all-stop, all other threads are left stopped.
+ New thread is created running,
+ Initial thread completes its step, GDB re-adjusts its PC,
restores/releases scratchpad,
+ [non-stop mode only] GDB resumes the thread, now past its
breakpoint.
+ [all-stop mode only] GDB resumes all threads.
There is one problem with the displaced stepping scenario above:
3. When the parent thread completed its step, GDB adjusted its PC,
but did not adjust the child's PC, thus that new child thread
will continue execution in the scratch pad, invoking undefined
behavior. If you're lucky, you see a crash. If unlucky, the
inferior gets silently corrupted.
What is needed is for GDB to have more control over whether the new
thread is created running or not. Issue #1 above requires that the
new thread not be allowed to run until the breakpoint has been
reinserted. The only way to guarantee this is if the new thread is
held in a stopped state until the single step has completed. Issue #3
above requires that GDB is informed of when a thread clones itself,
and of what is the child's ptid, so that GDB can fixup both the parent
and the child.
When looking for solutions to this problem I considered how GDB
handles fork/vfork as these have some of the same issues. The main
difference between fork/vfork and clone is that the clone events are
not reported back to core GDB. Instead, the clone event is handled
automatically in the target code and the child thread is immediately
set running.
Note we have support for requesting thread creation events out of the
target (TARGET_WAITKIND_THREAD_CREATED). However, those are reported
for the new/child thread. That would be sufficient to address in-line
stepping (issue #1), but not for displaced-stepping (issue #3). To
handle displaced-stepping, we need an event that is reported to the
_parent_ of the clone, as the information about the displaced step is
associated with the clone parent. TARGET_WAITKIND_THREAD_CREATED
includes no indication of which thread is the parent that spawned the
new child. In fact, for some targets, like e.g., Windows, it would be
impossible to know which thread that was, as thread creation there
doesn't work by "cloning".
The solution implemented here is to model clone on fork/vfork, and
introduce a new TARGET_WAITKIND_THREAD_CLONED event. This event is
similar to TARGET_WAITKIND_FORKED and TARGET_WAITKIND_VFORKED, except
that we end up with a new thread in the same process, instead of a new
thread of a new process. Like FORKED and VFORKED, THREAD_CLONED
waitstatuses have a child_ptid property, and the child is held stopped
until GDB explicitly resumes it. This addresses the in-line stepping
case (issues #1 and #2).
The infrun code that handles displaced stepping fixup for the child
after a fork/vfork event is thus reused for THREAD_CLONE, with some
minimal conditions added, addressing the displaced stepping case
(issue #3).
The native Linux backend is adjusted to unconditionally report
TARGET_WAITKIND_THREAD_CLONED events to the core.
Following the follow_fork model in core GDB, we introduce a
target_follow_clone target method, which is responsible for making the
new clone child visible to the rest of GDB.
Subsequent patches will add clone events support to the remote
protocol and gdbserver.
displaced_step_in_progress_thread becomes unused with this patch, but
a new use will reappear later in the series. To avoid deleting it and
readding it back, this patch marks it with attribute unused, and the
latter patch removes the attribute again. We need to do this because
the function is static, and with no callers, the compiler would warn,
(error with -Werror), breaking the build.
This adds a new gdb.threads/stepi-over-clone.exp testcase, which
exercises stepping over a clone syscall, with displaced stepping vs
inline stepping, and all-stop vs non-stop. We already test stepping
over clone syscalls with gdb.base/step-over-syscall.exp, but this test
uses pthreads, while the other test uses raw clone, and this one is
more thorough. The testcase passes on native GNU/Linux, but fails
against GDBserver. GDBserver will be fixed by a later patch in the
series.
Co-authored-by: Andrew Burgess <aburgess@redhat.com>
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
Change-Id: I95c06024736384ae8542a67ed9fdf6534c325c8e
Reviewed-By: Andrew Burgess <aburgess@redhat.com>
|
|
I noticed that on an Ubuntu 20.04 system, after a following patch
("Step over clone syscall w/ breakpoint,
TARGET_WAITKIND_THREAD_CLONED"), the gdb.threads/step-over-exec.exp
was passing cleanly, but still, we'd end up with four new unexpected
GDB core dumps:
=== gdb Summary ===
# of unexpected core files 4
# of expected passes 48
That said patch is making the pre-existing
gdb.threads/step-over-exec.exp testcase (almost silently) expose a
latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
#1 - a non-leader thread execs
#2 - the post-exec program stops somewhere
#3 - you kill the inferior
Instead of #3 directly, the testcase just returns, which ends up in
gdb_exit, tearing down GDB, which kills the inferior, and is thus
equivalent to #3 above.
Vis (after said patch is applied):
$ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
...
(top-gdb) r
...
(gdb) b main
...
(gdb) r
...
Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
69 argv0 = argv[0];
(gdb) c
Continuing.
[New Thread 0x7ffff7d89700 (LWP 2506975)]
Other going in exec.
Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
28 foo ();
(gdb) k
...
Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
393 return m_suspend.waitstatus_pending_p;
(top-gdb) bt
#0 0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
#1 0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
#2 0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
#3 0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
#4 0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
#5 0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
#6 0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
#7 0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
#8 0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
...
The root of the problem is that when a non-leader LWP execs, it just
changes its tid to the tgid, replacing the pre-exec leader thread,
becoming the new leader. There's no thread exit event for the execing
thread. It's as if the old pre-exec LWP vanishes without trace. The
ptrace man page says:
"PTRACE_O_TRACEEXEC (since Linux 2.5.46)
Stop the tracee at the next execve(2). A waitpid(2) by the
tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
If the execing thread is not a thread group leader, the thread
ID is reset to thread group leader's ID before this stop.
Since Linux 3.0, the former thread ID can be retrieved with
PTRACE_GETEVENTMSG."
When the core of GDB processes an exec events, it deletes all the
threads of the inferior. But, that is too late -- deleting the thread
does not delete the corresponding LWP, so we end leaving the pre-exec
non-leader LWP stale in the LWP list. That's what leads to the crash
above -- linux_nat_target::kill iterates over all LWPs, and after the
patch in question, that code will look for the corresponding
thread_info for each LWP. For the pre-exec non-leader LWP still
listed, won't find one.
This patch fixes it, by deleting the pre-exec non-leader LWP (and
thread) from the LWP/thread lists as soon as we get an exec event out
of ptrace.
GDBserver does not need an equivalent fix, because it is already doing
this, as side effect of mourning the pre-exec process, in
gdbserver/linux-low.cc:
else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
{
...
/* Delete the execing process and all its threads. */
mourn (proc);
switch_to_thread (nullptr);
The crash with gdb.threads/step-over-exec.exp is not observable on
newer systems, which postdate the glibc change to move "libpthread.so"
internals to "libc.so.6", because right after the exec, GDB traps a
load event for "libc.so.6", which leads to GDB trying to open
libthread_db for the post-exec inferior, and, on such systems that
succeeds. When we load libthread_db, we call
linux_stop_and_wait_all_lwps, which, as the name suggests, stops all
lwps, and then waits to see their stops. While doing this, GDB
detects that the pre-exec stale LWP is gone, and deletes it.
If we use "catch exec" to stop right at the exec before the
"libc.so.6" load event ever happens, and issue "kill" right there,
then GDB crashes on newer systems as well. So instead of tweaking
gdb.threads/step-over-exec.exp to cover the fix, add a new
gdb.threads/threads-after-exec.exp testcase that uses "catch exec".
The test also uses the new "maint info linux-lwps" command if testing
on Linux native, which also exposes the stale LWP problem with an
unfixed GDB.
Also tweak a comment in infrun.c:follow_exec referring to how
linux-nat.c used to behave, as it would become stale otherwise.
Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
|
|
This adds a maintenance command that lets you list all the LWPs under
control of the linux-nat target.
For example:
(gdb) maint info linux-lwps
LWP Ptid Thread ID
560948.561047.0 None
560948.560948.0 1.1
This shows that "560948.561047.0" LWP doesn't map to any thread_info
object, which is bogus. We'll be using this in a testcase in a
following patch.
Co-Authored-By: Pedro Alves <pedro@palves.net>
Change-Id: Ic4e9e123385976e5cd054391990124b7a20fb3f5
|
|
Overview
========
Consider the following situation, GDB is in non-stop mode, the main
thread is running while a second thread is stopped. The user has the
second thread selected as the current thread and asks GDB to detach.
At the exact moment of detach the main thread exits.
This situation currently causes crashes, assertion failures, and
unexpected errors to be reported from GDB for both native and remote
targets.
This commit addresses this situation for native and remote targets.
There are a number of different fixes, but all are required in order
to get this functionality working correct for native and remote
targets.
Native Linux Target
===================
For the native Linux target, detaching is handled in the function
linux_nat_target::detach. In here we call stop_wait_callback for each
thread, and it is this callback that will spot that the main thread
has exited.
GDB then detaches from everything except the main thread by calling
detach_callback.
After this the first problem is this assert:
/* Only the initial process should be left right now. */
gdb_assert (num_lwps (pid) == 1);
The num_lwps call will return 0 as the main thread has exited and all
of the other threads have now been detached. I fix this by changing
the assert to allow for 0 or 1 lwps at this point. As the 0 case can
only happen in non-stop mode, the assert becomes:
gdb_assert (num_lwps (pid) == 1
|| (target_is_non_stop_p () && num_lwps (pid) == 0));
The next problem is that we do:
main_lwp = find_lwp_pid (ptid_t (pid));
and then proceed assuming that main_lwp is not nullptr. In the case
that the main thread has exited though, main_lwp will be nullptr.
However, we only need main_lwp so that GDB can detach from the
thread. If the main thread has exited, and GDB has already detached
from every other thread, then GDB has finished detaching, GDB can skip
the calls that try to detach from the main thread, and then tell the
user that the detach was a success.
For Remote Targets
==================
On remote targets there are two problems.
First is that when the exit occurs during the early phase of the
detach, we see the stop notification arrive while GDB is removing the
breakpoints ahead of the detach. The 'set debug remote on' trace
looks like this:
[remote] Sending packet: $z0,7f1648fe0241,1#35
[remote] Notification received: Stop:W0;process:2a0ac8
# At this point an unpatched gdbserver segfaults, and the connection
# is broken. A patched gdbserver continues as below...
[remote] Packet received: E01
[remote] Sending packet: $z0,7f1648ff00a8,1#68
[remote] Packet received: E01
[remote] Sending packet: $z0,7f1648ff132f,1#6b
[remote] Packet received: E01
[remote] Sending packet: $D;2a0ac8#3e
[remote] Packet received: E01
I was originally running into Segmentation Faults, from within
gdbserver/mem-break.cc, in the function find_gdb_breakpoint. This
function calls current_process() and then dereferences the result to
find the breakpoint list.
However, in our case, the current process has already exited, and so
the current_process() call returns nullptr. At the point of failure,
the gdbserver backtrace looks like this:
#0 0x00000000004190e4 in find_gdb_breakpoint (z_type=48 '0', addr=4198762, kind=1) at ../../src/gdbserver/mem-break.cc:982
#1 0x000000000041930d in delete_gdb_breakpoint (z_type=48 '0', addr=4198762, kind=1) at ../../src/gdbserver/mem-break.cc:1093
#2 0x000000000042d8db in process_serial_event () at ../../src/gdbserver/server.cc:4372
#3 0x000000000042dcab in handle_serial_event (err=0, client_data=0x0) at ../../src/gdbserver/server.cc:4498
...
The problem is that, as a result non-stop being on, the process
exiting is only reported back to GDB after the request to remove a
breakpoint has been sent. Clearly gdbserver can't actually remove
this breakpoint -- the process has already exited -- so I think the
best solution is for gdbserver just to report an error, which is what
I've done.
The second problem I ran into was on the gdb side, as the process has
already exited, but GDB has not yet acknowledged the exit event, the
detach -- the 'D' packet in the above trace -- fails. This was being
reported to the user with a 'Can't detach process' error. As the test
actually calls detach from Python code, this error was then becoming a
Python exception.
Though clearly the detach has returned an error, and so, maybe, having
GDB throw an error would be fine, I think in this case, there's a good
argument that the remote error can be ignored -- if GDB tries to
detach and gets back an error, and if there's a pending exit event for
the pid we tried to detach, then just ignore the error and pretend the
detach worked fine.
We could possibly check for a pending exit event before sending the
detach packet, however, I believe that it might be possible (in
non-stop mode) for the stop notification to arrive after the detach is
sent, but before gdbserver has started processing the detach. In this
case we would still need to check for pending stop events after seeing
the detach fail, so I figure there's no point having two checks -- we
just send the detach request, and if it fails, check to see if the
process has already exited.
Testing
=======
In order to test this issue I needed to ensure that the exit event
arrives at the same time as the detach call. The window of
opportunity for getting the exit to arrive is so small I've never
managed to trigger this in real use -- I originally spotted this issue
while working on another patch, which did manage to trigger this
issue.
However, if we trigger both the exit and the detach from a single
Python function then we never return to GDB's event loop, as such GDB
never processes the exit event, and so the first time GDB gets a
chance to see the exit is during the detach call. And so that is the
approach I've taken for testing this patch.
Tested-By: Kevin Buettner <kevinb@redhat.com>
Approved-By: Kevin Buettner <kevinb@redhat.com>
|
|
This function is just a wrapper around the current inferior's gdbarch.
I find that having that wrapper just obscures where the arch is coming
from, and that it's often used as "I don't know which arch to use so
I'll use this magical target_gdbarch function that gets me an arch" when
the arch should in fact come from something in the context (a thread,
objfile, symbol, etc). I think that removing it and inlining
`current_inferior ()->arch ()` everywhere will make it a bit clearer
where that arch comes from and will trigger people into reflecting
whether this is the right place to get the arch or not.
Change-Id: I79f14b4e4934c88f91ca3a3155f5fc3ea2fadf6b
Reviewed-By: John Baldwin <jhb@FreeBSD.org>
Approved-By: Andrew Burgess <aburgess@redhat.com>
|
|
I noticed a comment by an include and remembered that I think these
don't really provide much value -- sometimes they are just editorial,
and sometimes they are obsolete. I think it's better to just remove
them. Tested by rebuilding.
Approved-By: Andrew Burgess <aburgess@redhat.com>
|
|
Currently, each target backend is responsible for printing "[Thread
...exited]" before deleting a thread. This leads to unnecessary
differences between targets, like e.g. with the remote target, we
never print such messages, even though we do print "[New Thread ...]".
E.g., debugging the gdb.threads/attach-many-short-lived-threads.exp
with gdbserver, letting it run for a bit, and then pressing Ctrl-C, we
currently see:
(gdb) c
Continuing.
^C[New Thread 3850398.3887449]
[New Thread 3850398.3887500]
[New Thread 3850398.3887551]
[New Thread 3850398.3887602]
[New Thread 3850398.3887653]
...
Thread 1 "attach-many-sho" received signal SIGINT, Interrupt.
0x00007ffff7e6a23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffffffda80, rem=rem@entry=0x7fffffffda80)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78 in ../sysdeps/unix/sysv/linux/clock_nanosleep.c
(gdb)
Above, we only see "New Thread" notifications, even though threads
were deleted.
After this patch, we'll see:
(gdb) c
Continuing.
^C[Thread 3558643.3577053 exited]
[Thread 3558643.3577104 exited]
[Thread 3558643.3577155 exited]
[Thread 3558643.3579603 exited]
...
[New Thread 3558643.3597415]
[New Thread 3558643.3600015]
[New Thread 3558643.3599965]
...
Thread 1 "attach-many-sho" received signal SIGINT, Interrupt.
0x00007ffff7e6a23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffffffda80, rem=rem@entry=0x7fffffffda80)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78 in ../sysdeps/unix/sysv/linux/clock_nanosleep.c
(gdb) q
This commit fixes this by moving the thread exit printing to common
code instead, triggered from within delete_thread (or rather,
set_thread_exited).
There's one wrinkle, though. While most targest want to print:
[Thread ... exited]
the Windows target wants to print:
[Thread ... exited with code <exit_code>]
... and sometimes wants to suppress the notification for the main
thread. To address that, this commits adds a delete_thread_with_code
function, only used by that target (so far).
This fix was originally posted as part of a larger series:
https://inbox.sourceware.org/gdb-patches/20221212203101.1034916-1-pedro@palves.net/
But didn't really need to be part of that series. In order to get
this fix merged sooner, I (Andrew Burgess) have rebased this commit
outside of the original series. Any bugs introduced while splitting
this patch out and rebasing, are entirely my own.
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30129
Co-Authored-By: Andrew Burgess <aburgess@redhat.com>
|
|
This commit adjusts some of the debug output in linux-nat.c, but makes
no other functional changes to GDB.
In resume_lwp I've added the word "sibling" to one of the debug
messages. All the other debug messages in this function talk about
operating on the sibling thread, so I think it makes sense, for
consistency, if the message I've updated also talks about the sibling
thread.
In resume_stopped_resumed_lwps I've reordered the condition checks so
that the vfork-parent check now happens after the checks for whether
the thread is already resumed or not. This makes no functional
difference to GDB, but does, I think, mean we see more helpful debug
messages first.
Consider the situation where a vfork-parent thread is already resumed,
and resume_stopped_resumed_lwps is called. Previously the message
saying that the thread was not being resumed due to being a
vfork-parent, was printed. This might give the impression that the
thread is left in a not resumed state, which is misleading.
After this change we now get a message saying that the thread is not
being resumed due to it not being stopped (i.e. is already resumed).
With this message the already resumed nature of the thread is much
clearer.
I found this change helpful when debugging some vfork related issues.
|
|
While working on some of the recent patches relating to vfork handling
I found this debug output very helpful, I'd like to propose adding
this into GDB.
With debug turned off there should be no user visible changes after
this commit.
|
|
While working on a later patch, which changes gdb.base/foll-vfork.exp,
I noticed that sometimes I would hit this assert:
x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
I eventually tracked it down to a combination of schedule-multiple
mode being on, target-non-stop being off, follow-fork-mode being set
to child, and some bad timing. The failing case is pretty simple, a
single threaded application performs a vfork, the child process then
execs some other application while the parent process (once the vfork
child has completed its exec) just exits. As best I understand
things, here's what happens when things go wrong:
1. The parent process performs a vfork, GDB sees the VFORKED event
and creates an inferior and thread for the vfork child,
2. GDB resumes the vfork child process. As schedule-multiple is on
and target-non-stop is off, this is translated into a request to
start all processes (see user_visible_resume_ptid),
3. In the linux-nat layer we spot that one of the threads we are
about to start is a vfork parent, and so don't start that
thread (see resume_lwp), the vfork child thread is resumed,
4. GDB waits for the next event, eventually entering
linux_nat_target::wait, which in turn calls linux_nat_wait_1,
5. In linux_nat_wait_1 we eventually call
resume_stopped_resumed_lwps, this should restart threads that have
stopped but don't actually have anything interesting to report.
6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
vfork parents like resume_lwp does, so at this point the vfork
parent is resumed. This feels like the start of the bug, and this
is where I'm proposing to fix things, but, resuming the vfork parent
isn't the worst thing in the world because....
7. As the vfork child is still alive the kernel holds the vfork
parent stopped,
8. Eventually the child performs its exec and GDB is sent and EXECD
event. However, because the parent is resumed, as soon as the child
performs its exec the vfork parent also sends a VFORK_DONE event to
GDB,
9. Depending on timing both of these events might seem to arrive in
GDB at the same time. Normally GDB expects to see the EXECD or
EXITED/SIGNALED event from the vfork child before getting the
VFORK_DONE in the parent. We know this because it is as a result of
the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
handle_vfork_child_exec_or_exit for details). Further the comment
in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
when we remain attached to the child (not the parent) we should not
expect to see a VFORK_DONE,
10. If both events arrive at the same time then GDB will randomly
choose one event to handle first, in some cases this will be the
VFORK_DONE. As described above, upon seeing a VFORK_DONE GDB
expects that (a) the vfork child has finished, however, in this case
this is not completely true, the child has finished, but GDB has not
processed the event associated with the completion yet, and (b) upon
seeing a VFORK_DONE GDB assumes we are remaining attached to the
parent, and so resumes the parent process,
11. GDB now handles the EXECD event. In our case we are detaching
from the parent, so GDB calls target_detach (see
handle_vfork_child_exec_or_exit),
12. While this has been going on the vfork parent is executing, and
might even exit,
13. In linux_nat_target::detach the first thing we do is stop all
threads in the process we're detaching from, the result of the stop
request will be cached on the lwp_info object,
14. In our case the vfork parent has exited though, so when GDB
waits for the thread, instead of a stop due to signal, we instead
get a thread exited status,
15. Later in the detach process we try to resume the threads just
prior to making the ptrace call to actually detach (see
detach_one_lwp), as part of the process to resume a thread we try to
touch some registers within the thread, and before doing this GDB
asserts that the thread is stopped,
16. An exited thread is not classified as stopped, and so the assert
triggers!
So there's two bugs I see here. The first, and most critical one here
is in step #6. I think that resume_stopped_resumed_lwps should not
resume a vfork parent, just like resume_lwp doesn't resume a vfork
parent.
With this change in place the vfork parent will remain stopped in step
instead GDB will only see the EXECD/EXITED/SIGNALLED event. The
problems in #9 and #10 are therefore skipped and we arrive at #11,
handling the EXECD event. As the parent is still stopped #12 doesn't
apply, and in #13 when we try to stop the process we will see that it
is already stopped, there's no risk of the vfork parent exiting before
we get to this point. And finally, in #15 we are safe to poke the
process registers because it will not have exited by this point.
However, I did mention two bugs.
The second bug I've not yet managed to actually trigger, but I'm
convinced it must exist: if we forget vforks for a moment, in step #13
above, when linux_nat_target::detach is called, we first try to stop
all threads in the process GDB is detaching from. If we imagine a
multi-threaded inferior with many threads, and GDB running in non-stop
mode, then, if the user tries to detach there is a chance that thread
could exit just as linux_nat_target::detach is entered, in which case
we should be able to trigger the same assert.
But, like I said, I've not (yet) managed to trigger this second bug,
and even if I could, the fix would not belong in this commit, so I'm
pointing this out just for completeness.
There's no test included in this commit. In a couple of commits time
I will expand gdb.base/foll-vfork.exp which is when this bug would be
exposed. Unfortunately there are at least two other bugs (separate
from the ones discussed above) that need fixing first, these will be
fixed in the next commits before the gdb.base/foll-vfork.exp test is
expanded.
If you do want to reproduce this failure then you will for certainly
need to run the gdb.base/foll-vfork.exp test in a loop as the failures
are all very timing sensitive. I've found that running multiple
copies in parallel makes the failure more likely to appear, I usually
run ~6 copies in parallel and expect to see a failure after within
10mins.
|
|
Since commit 05c06f318fd9 ("Linux: Access memory even if threads are
running"), GDB prefers pread64/pwrite64 to access inferior memory
instead of ptrace. That change broke reading shared libraries on
SPARC64 Linux, as reported by PR gdb/30525 ("gdb cannot read shared
libraries on SPARC64").
On SPARC64 Linux, surprisingly (to me), userspace shared libraries are
mapped at high 64-bit addresses:
(gdb) info sharedlibrary
Cannot access memory at address 0xfff80001002011e0
Cannot access memory at address 0xfff80001002011d8
Cannot access memory at address 0xfff80001002011d8
From To Syms Read Shared Object Library
0xfff80001000010a0 0xfff8000100021f80 Yes (*) /lib64/ld-linux.so.2
(*): Shared library is missing debugging information.
Those addresses are 64-bit addresses with the high bits set. When
interpreted as signed, they're negative.
The Linux kernel rejects pread64/pwrite64 if the offset argument of
type off_t (a signed type) is negative, which happens if the memory
address we're accessing has its high bit set. See
linux/fs/read_write.c sys_pread64 and sys_pwrite64 in Linux.
Thankfully, lseek does not fail in that situation. So the fix is to
use the 'lseek + read|write' path if the offset would be negative.
Fix this in both native GDB and GDBserver.
Tested on a SPARC64 GNU/Linux and x86-64 GNU/Linux.
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30525
Change-Id: I79c724f918037ea67b7396fadb521bc9d1b10dc5
|
|
Fix some more typos:
- distinquish -> distinguish
- actualy -> actually
- singe -> single
- frash -> frame
- chid -> child
- dissassembler -> disassembler
- uninitalized -> uninitialized
- precontidion -> precondition
- regsiters -> registers
- marge -> merge
- sate -> state
- garanteed -> guaranteed
- explictly -> explicitly
- prefices (nonstandard plural) -> prefixes
- bondary -> boundary
- formated -> formatted
- ithe -> the
- arrav -> array
- coresponding -> corresponding
- owend -> owned
- fials -> fails
- diasm -> disasm
- ture -> true
- tpye -> type
There's one code change, the name of macro SIG_CODE_BONDARY_FAULT changed to
SIG_CODE_BOUNDARY_FAULT.
Tested on x86_64-linux.
|
|
Make find_thread_ptid (the overload that takes a process_stratum_target)
a method of process_stratum_target.
Change-Id: Ib190a925a83c6b93e9c585dc7c6ab65efbdd8629
Reviewed-By: Tom Tromey <tom@tromey.com>
|
|
I noticed that some debug log output printing an lwp's pending status
wasn't considering lp->waitstatus. This fixes it, by introducing a
new pending_status_str function.
Also fix the comment in gdb/linux-nat.h describing
lwp_info::waitstatus and details the description of lwp_info::status
while at it.
Change-Id: I66e5c7a363d30a925b093b195d72925ce5b6b980
Approved-By: Andrew Burgess <aburgess@redhat.com>
|
|
Replace spaces with tabs in a bunch of places.
Change-Id: If0f87180f1d13028dc178e5a8af7882a067868b0
|
|
I've long wanted to remove 'struct buffer', and thanks to Simon's
earlier patch, I was finally able to do so. My feeling has been that
gdb already has several decent structures available for growing
strings: std::string of course, but also obstack and even objalloc
from BFD and dyn-string from libiberty. The previous patches in this
series removed all the uses of struct buffer, so this one can remove
the code and the remaining #includes.
|
|
While investigating something else, I noticed some weird code in
agent_run_command (use of memcpy rather than strcpy). Then I noticed
that 'cmd' is used as both an in and out parameter, despite being
const.
Casting away const like this is bad. This patch removes the const and
fixes the memcpy. I also added a static assert to assure myself that
the code in gdbserver is correct -- gdbserver is passing its own
buffer directly to agent_run_command.
Reviewed-By: Andrew Burgess <aburgess@redhat.com>
|
|
This commit is the result of running the gdb/copyright.py script,
which automated the update of the copyright year range for all
source files managed by the GDB project to be updated to include
year 2023.
|
|
As of 1bcb0708f229 ("gdb/linux-nat: Check whether /proc/pid/mem is
writable"), GDB checks if /proc/pid/mem is writable. This is done
early at GDB startup, in order to get a consistent warning, instead of
a warning that depends on whenever GDB writes to inferior memory.
PR gdb/29907 points out that some build systems (like QEMU's,
apparently) may call 'gdb --version' to check GDB's presence & its
version on the system, and that Gentoo's build process has sandboxing
which blocks the /proc/pid/mem access and thus GDB warns, which
results in build fails.
To help with that, this patch delays the /proc/pid/mem check until we
start or attach to an inferior. Ends up potentially emiting a warning
close where we already emit other ptrace- and /proc- related warnings,
which just Feels Right.
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29907
Change-Id: I5537653ecfbbe76a04ab035e40e59d09b4980763
|
|
Make the inferior_ptid bubble up to linux_nat_target::xfer_partial.
Change-Id: I62dbc5734c26648bb465f449c2003c73751cd812
|
|
I noticed we could reduce duplication a bit here.
Change-Id: If24e54d1dac71b46f7c1f68a18a073d4c084b644
|
|
Not a big deal, but it seems strange to check errno instead of the
ptrace return value to know whether it succeeded.
Change-Id: If0a6d0280ab0e5ecb077e546af0d6fe489c5b9fd
|
|
No caller cares about the value of *SIGINFO on failure. It's also
documented in the function doc that *SIGINFO is uninitialized (I
understand "untouched") on failure.
Change-Id: I5ef38a5f58e3635e109b919ddf6f827f38f1225a
|
|
Change return type to bool.
Change-Id: I1bf0360bfdd1b5994cd0f96c268d806f96fe51a4
|
|
No behavior change expected.
Change-Id: Ifaa64ecd619483646b024fd7c62e571e92a8eedb
|
|
Add a pid parameter to linux_proc_xfer_memory_partial, making the
inferior_ptid reference bubble up close to the target_ops::xfer_partial
boundary. No behavior change expected.
Change-Id: I58171b00ee1bba1ea22efdbb5dcab8b1ab3aac4c
|
|
linux_handle_extended_wait calls target_post_attach if we're handling
a PTRACE_EVENT_CLONE, and libthread_db.so isn't active.
target_post_attach just calls linux_init_ptrace_procfs to set the
lwp's ptrace options. However, this is completely unnecessary,
because, as man ptrace [1] says, options are inherited:
"Flags are inherited by new tracees created and "auto-attached" via
active PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE
options."
This removes the unnecessary call.
[1] - https://man7.org/linux/man-pages/man2/ptrace.2.html
Approved-By: Simon Marchi <simon.marchi@efficios.com>
Change-Id: I533eaa60b700f7e40760311fc0d344d0b3f19a78
|
|
Currently, every internal_error call must be passed __FILE__/__LINE__
explicitly, like:
internal_error (__FILE__, __LINE__, "foo %d", var);
The need to pass in explicit __FILE__/__LINE__ is there probably
because the function predates widespread and portable variadic macros
availability. We can use variadic macros nowadays, and in fact, we
already use them in several places, including the related
gdb_assert_not_reached.
So this patch renames the internal_error function to something else,
and then reimplements internal_error as a variadic macro that expands
__FILE__/__LINE__ itself.
The result is that we now should call internal_error like so:
internal_error ("foo %d", var);
Likewise for internal_warning.
The patch adjusts all calls sites. 99% of the adjustments were done
with a perl/sed script.
The non-mechanical changes are in gdbsupport/errors.h,
gdbsupport/gdb_assert.h, and gdb/gdbarch.py.
Approved-By: Simon Marchi <simon.marchi@efficios.com>
Change-Id: Ia6f372c11550ca876829e8fd85048f4502bdcf06
|
|
Converting from free-form macros to an enum gives a bit of type-safety.
This caught places where we would assign host error numbers to what
should contain a target fileio error number, for instance in
target_fileio_pread.
I added the FILEIO_SUCCESS enumerator, because
remote.c:remote_hostio_parse_result initializes the remote_errno output
variable to 0. It seems better to have an explicit enumerator than to
assign a value for which there is no enumerator. I considered
initializing this variable to FILEIO_EUNKNOWN instead, such that if the
remote side replies with an error and omits the errno value, we'll get
an errno that represents an error instead of 0 (which reprensents no
error). But it's not clear what the consequences of that change would
be, so I prefer to err on the side of caution and just keep the existing
behavior (there is no intended change in behavior with this patch).
Note that remote_hostio_parse_resul still reads blindly what the remote
side sends as a target errno into this variable, so we can still end up
with a nonsensical value here. It's not good, but out of the scope of
this patch.
Convert host_to_fileio_error and fileio_errno_to_host to return / accept
a fileio_error instead of an int, and cascade the change in the whole
chain that uses that.
Change-Id: I454b0e3fcf0732447bc872252fa8e57d138b0e03
|
|
Commit 05c06f318fd9a112529dfc313e6512b399a645e4 enabled GDB to access
memory while threads are running. It did this by accessing
/proc/PID/task/LWP/mem.
Unfortunately, this interface is not implemented for writing in older
kernels (such as RHEL6). This means that GDB is unable to insert
breakpoints on these hosts:
$ ./gdb -q gdb -ex start
Reading symbols from gdb...
Temporary breakpoint 1 at 0x40fdd5: file ../../src/gdb/gdb.c, line 28.
Starting program: /home/rhel6/fsf/linux/gdb/gdb
Warning:
Cannot insert breakpoint 1.
Cannot access memory at address 0x40fdd5
(gdb)
Before this patch, linux_proc_xfer_memory_partial (previously called
linux_proc_xfer_partial) would return TARGET_XFER_EOF if the write to
/proc/PID/mem failed. [More specifically, linux_proc_xfer_partial
would not "bother for one word," but the effect is the essentially
same.]
This status was checked by linux_nat_target::xfer_partial, which would
then fallback to using ptrace to perform the operation.
This is the specific hunk that removed the fallback:
- xfer = linux_proc_xfer_partial (object, annex, readbuf, writebuf,
- offset, len, xfered_len);
- if (xfer != TARGET_XFER_EOF)
- return xfer;
+ return linux_proc_xfer_memory_partial (readbuf, writebuf,
+ offset, len, xfered_len);
+ }
return inf_ptrace_target::xfer_partial (object, annex, readbuf, writebuf,
offset, len, xfered_len);
This patch makes linux_nat_target::xfer_partial go straight to writing
memory via ptrace if writing via /proc/pid/mem is not possible in the
running kernel, enabling GDB to insert breakpoints on these older
kernels. Note that a recent patch changed the return status from
TARGET_XFER_EOF to TARGET_XFER_E_IO.
Tested on {unix,native-gdbserver,native-extended-gdbserver}/-m{32,64}
on x86_64, s390x, aarch64, and ppc64le.
Change-Id: If1d884278e8c4ea71d8836bedd56e6a6c242a415
|
|
Probe whether /proc/pid/mem is writable, by using it to write to a GDB
variable. This will be used in the following patch to avoid falling
back to writing to inferior memory with ptrace if /proc/pid/mem _is_
writable.
Change-Id: If87eff0b46cbe5e32a583e2977a9e17d29d0ed3e
|
|
This changes the parameter of target_ops::async from int to bool.
Regression tested on x86-64 Fedora 34.
|
|
For every stop, Linux GDB and GDBserver save the stopped thread's PC,
in lwp->stop_pc. This is done in save_stop_reason, in both
gdb/linux-nat.c and gdbserver/linux-low.cc. However, while we're
going through the shell after "run", in startup_inferior, we shouldn't
be reading registers, as we haven't yet determined the target's
architecture -- the shell's architecture may not even be the same as
the final inferior's.
In gdb/linux-nat.c, lwp->stop_pc is only needed when the thread has
stopped for a breakpoint, and since when going through the shell, no
breakpoint is going to hit, we could simply teach save_stop_reason to
only record the stop pc when the thread stopped for a breakpoint.
However, in gdbserver/linux-low.cc, lwp->stop_pc is used in more cases
than breakpoint hits (e.g., it's used in tracepoints & the
"while-stepping" feature).
So to avoid GDB vs GDBserver divergence, we apply the same approach to
both implementations.
We set a flag in the inferior (process in GDBserver) whenever it is
being nursed through the shell, and when that flag is set,
save_stop_reason bails out early. While going through the shell,
we'll only ever get process exits (normal or signalled), random
signals, and exec events, so nothing is lost.
Change-Id: If0f01831514d3a74d17efd102875de7d2c6401ad
|
|
When accessing /proc/PID/mem, if pread64/pwrite64/read/write encounters
an error and return -1, linux_proc_xfer_memory_partial return
TARGET_XFER_EOF.
I think it should return TARGET_XFER_E_IO in this case. TARGET_XFER_EOF
is returned when pread64/pwrite64/read/frite returns 0, which indicates
that the address space is gone and the whole process has exited or
execed.
This patch makes this change.
Regression tested on x86_64-linux-gnu.
Change-Id: I6030412459663b8d7933483fdda22a6c2c5d7221
|
|
This changes target_pid_to_exec_file and target_ops::pid_to_exec_file
to return a "const char *". I couldn't build many of these targets,
but did examine the code by hand -- also, as this only affects the
return type, it's normally pretty safe. This brings gdb and gdbserver
a bit closer, and allows for the removal of a const_cast as well.
|
|
The current target_resume interface is a bit odd & non-intuitive.
I've found myself explaining it a couple times the recent past, while
reviewing patches that assumed STEP/SIGNAL always applied to the
passed in PTID. It goes like this today:
- if the passed in PTID is a thread, then the step/signal request is
for that thread.
- otherwise, if PTID is a wildcard (all threads or all threads of
process), the step/signal request is for inferior_ptid, and PTID
indicates which set of threads run free.
Because GDB always switches the current thread to "leader" thread
being resumed/stepped/signalled, we can simplify this a bit to:
- step/signal are always for inferior_ptid.
- PTID indicates the set of threads that run free.
Still not ideal, but it's a minimal change and at least there are no
special cases this way.
That's what this patch does. It renames the PTID parameter to
SCOPE_PTID, adds some assertions to target_resume, and tweaks
target_resume's description. In addition, it also renames PTID to
SCOPE_PTID in the remote and linux-nat targets, and simplifies their
implementation a little bit. Other targets could do the same, but
they don't have to.
Change-Id: I02a2ec2ab3a3e9b191de1e9a84f55c17cab7daaf
|
|
linux_handle_extended_wait
The check removed by this patch, using current_inferior, looks wrong.
When debugging multiple inferiors with the Linux native target and
linux_handle_extended_wait is called, there's no guarantee about which
is the current inferior. The vfork-done event we receive could be for
any inferior. If the vfork-done event is for a non-current inferior, we
end up wrongfully ignoring it. As a result, the core never processes a
TARGET_WAITKIND_VFORK_DONE event, program_space::breakpoints_not_allowed
is never cleared, and breakpoints are never reinserted. However,
because the Linux native target decided to ignore the event, it resumed
the thread - while breakpoints out. And that's bad.
The proposed fix is to remove this check. Always report vfork-done
events and let infrun's logic decide if it should be ignored. We don't
save much cycles by filtering the event here.
Add a test that replicates the situation described above. See comments
in the test for more details.
Change-Id: Ibe33c1716c3602e847be6c2093120696f2286fbf
|
|
Now that filtered and unfiltered output can be treated identically, we
can unify the printf family of functions. This is done under the name
"gdb_printf". Most of this patch was written by script.
|
|
A number of spots call printf_unfiltered only because they are in code
that should not be interrupted by the pager. However, I believe these
cases are all handled by infrun's blanket ban on paging, and so can be
converted to the default (_filtered) API.
After this patch, I think all the remaining _unfiltered calls are ones
that really ought to be. A few -- namely in complete_command -- could
be replaced by a scoped assignment to pagination_enabled, but for the
remainder, the code seems simple enough like this.
|
|
The current zombie leader detection code in linux-nat.c has a race --
if a multi-threaded inferior exits just before check_zombie_leaders
finds that the leader is now zombie via checking /proc/PID/status,
check_zombie_leaders deletes the leader, assuming we won't get an
event for that exit (which we won't in some scenarios, but not in this
one). That might seem mostly harmless, but it has some downsides:
- later when we continue pulling events out of the kernel, we will
collect the exit event of the non-leader threads, and once we see
the last lwp in our list exit, we return _that_ lwp's exit code as
whole-process exit code to infrun, instead of the leader's exit
code.
- this can cause a hang in stop_all_threads in infrun.c. Say there
are 2 threads in the process. stop_all_threads stops each of those
threads, and then waits for two stop or exit events, one for each
thread. If the whole process exits, and check_zombie_leaders hits
the false-positive case, linux-nat.c will only return one event to
GDB (the whole-process exit returned when we see the last thread,
the non-leader thread, exit), making stop_all_threads hang forever
waiting for a second event that will never come.
However, in this false-positive scenario, where the whole process is
exiting, as opposed to just the leader (with pthread_exit(), for
example), we _will_ get an exit event shortly for the leader, after we
collect the exit event of all the other non-leader threads. Or put
another way, we _always_ get an event for the leader after we see it
become zombie.
I tried a number of approaches to fix this:
#1 - My first thought to address the race was to make GDB always
report the whole-process exit status for the leader thread, not for
whatever is the last lwp in the list. We _always_ get a final exit
(or exec) event for the leader, and when the race triggers, we're not
collecting it.
#2 - My second thought was to try to plug the race in the first place.
I thought of making GDB call waitpid/WNOHANG for all non-leader
threads immediately when the zombie leader is detected, assuming there
would be an exit event pending for each of them waiting to be
collected. Turns out that that doesn't work -- you can see the leader
become zombie _before_ the kernel kills all other threads. Waitpid in
that small time window returns 0, indicating no-event. Thankfully we
hit that race window all the time, which avoided trading one race for
another. Looking at the non-leader thread's status in /proc doesn't
help either, the threads are still in running state for a bit, for the
same reason.
#3 - My next attempt, which seemed promising, was to synchronously
stop and wait for the stop for each of the non-leader threads. For
the scenario in question, this will collect all the exit statuses of
the non-leader threads. Then, if we are left with only the zombie
leader in the lwp list, it means we either have a normal while-process
exit or an exec, in which case we should not delete the leader. If
_only_ the leader exited, like in gdb.threads/leader-exit.exp, then
after pausing threads, we will still have at least one live non-leader
thread in the list, and so we delete the leader lwp. I got this
working and polished, and it was only after staring at the kernel code
to convince myself that this would really work (and it would, for the
scenario I considered), that I realized I had failed to account for
one scenario -- if any non-leader thread is _already_ stopped when
some thread triggers a group exit, like e.g., if you have some threads
stopped and then resume just one thread with scheduler-locking or
non-stop, and that thread exits the process. I also played with
PTRACE_EVENT_EXIT, see if it would help in any way to plug the race,
and I couldn't find a way that it would result in any practical
difference compared to looking at /proc/PID/status, with respect to
having a race.
So I concluded that there's no way to plug the race, we just have to
deal with it. Which means, going back to approach #1. That is the
approach taken by this patch.
Change-Id: I6309fd4727da8c67951f9cea557724b77e8ee979
|
|
Reorganize linux-nat.c:linux_nat_filter_event such that all the
handling for events for LWPs not in the LWP list is together. This
helps make a following patch clearer. The comments and debug messages
have also been tweaked - the end goal is to have them synchronized
with the gdbserver counterpart.
Change-Id: I8586d8dcd76d8bd3795145e3056fc660e3b2cd22
|
|
Subclasses of inf_ptrace_target have to opt-in to using the event_pipe
by implementing the can_async_p and async methods. For subclasses
which do this, inf_ptrace_target provides is_async_p, async_wait_fd
and closes the pipe in the close target method.
inf_ptrace_target also provides wrapper routines around the event pipe
(async_file_open, async_file_close, async_file_flush, and
async_file_mark) for use in target methods such as async.
inf_ptrace_target also exports a static async_file_mark_if_open
function which can be used in SIGCHLD signal handlers.
|
|
If the attach target supports async mode, enable it after the
attach target's ::attach method returns.
|
|
Now that target_resume always enables async mode after target::resume
returns, these calls are redundant.
The other place that target resume methods are invoked outside of
target_resume are as the beneath target in record_full_wait_1. In
this case, async mode should already be enabled when supported by the
target before the resume method is invoked due to the following:
In general, targets which support async mode run as async until
::wait returns TARGET_WAITKIND_NO_RESUMED to indicate that there are
no unwaited for children (either they have exited or are stopped).
When that occurs, the loop in wait_one disables async mode. Later
if a stopped child is resumed, async mode is re-enabled in
do_target_resume before waiting for the next event.
In the case of record_full_wait_1, this function is invoked from the
::wait target method when fetching an event. If the underlying
target supports async mode, then an earlier call to do_target_resume
to resume the child reporting an event in the loop in
record_full_wait_1 would have already enabled async mode before
::wait was invoked. In addition, nothing in the code executed in
the loop in record_full_wait_1 disables async mode. Async mode is
only disabled higher in the call stack in wait_one after ::wait
returns.
It is also true that async mode can be disabled by an
INF_EXEC_COMPLETE event passed to inferior_event_handle, but all of
the places that invoke that are in the gdb core which is "above" a
target ::wait method.
Note that there is an earlier call to enable async mode in
linux_nat_target::resume. That call also marks the async event pipe
to report an existing event after enabling async mode, so it needs to
stay.
|
|
Use event_pipe from gdbsupport in place of the existing file
descriptor array.
|
|
Change-Id: I80328fab7096221356864b5a4fb30858b48d2c10
|