diff options
author | Jason Molenda <jmolenda@apple.com> | 2024-02-14 13:06:20 -0800 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-02-14 13:06:20 -0800 |
commit | aab48c99c2234e348aa37657accfb6110c84c9b7 (patch) | |
tree | 2880adbcfe4b958d3295afdeffc55cece42bf28c /lldb/include | |
parent | f3b92fae138f47fb78a55254d73913f1e7935852 (diff) | |
download | llvm-aab48c99c2234e348aa37657accfb6110c84c9b7.zip llvm-aab48c99c2234e348aa37657accfb6110c84c9b7.tar.gz llvm-aab48c99c2234e348aa37657accfb6110c84c9b7.tar.bz2 |
[lldb] Detect a Darwin kernel issue and work around it (#81573)
On arm64 machines, when there is a hardware breakpoint or watchpoint
set, and lldb has instruction-stepped a thread, and then done a
Process::Resume, we will sometimes receive an extra "instruction step
completed" mach exception and the pc has not advanced. From a user's
perspective, they hit Continue and lldb stops again at the same spot.
From the testsuite's perspective, this has been a constant source of
testsuite failures for any test using hardware watchpoints and
breakpoints, the arm64 CI bots seem especially good at hitting this
issue.
Jim and I have been slowly looking at this for a few months now, and
finally I decided to try to detect this situation in lldb and silently
resume the process again when it happens.
We were already detecting this "got an insn-step finished mach exception
but this thread was not instruction stepping" combination in
StopInfoMachException where we take the mach exception and create a
StopInfo object for it. We had a lot of logging we used to understand
the failure as it was hit on the bots in assert builds.
This patch adds a new case to `Thread::GetPrivateStopInfo()` to call the
StopInfo's (new) `IsContinueInterrupted()` method. In
StopInfoMachException, where we previously had logging for assert
builds, I now note it in an ivar, and when
`Thread::GetPrivateStopInfo()` asks if this has happened, we check all
of the combination of events that this comes up: We have a hardware
breakpoint or watchpoint, we were not instruction stepping this thread
but got an insn-step mach exception, the pc is the same as the previous
stop's pc. And in that case, `Thread::GetPrivateStopInfo()` returns no
StopInfo -- indicating that this thread would like to resume execution.
The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and
`m_prev_frames_sp`. When a thread resumes execution, we move
`m_curr_frames_sp` in to `m_prev_frames_sp` and when it stops executing,
w euse `m_prev_frames_sp` to seed the new `m_curr_frames_sp` if most of
the stack is the same as before.
In this same location, I now save the Thread's RegisterContext::GetPC
into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs this
information to check all of the conditions I outlined above for
`IsContinueInterrupted`.
This has passed exhaustive testing and we do not have any testsuite
failures for hardware watchpoints and breakpoints due to this kernel bug
with the patch in place. In focusing on these tests for thousands of
runs, I have found two other uncommon race conditions for the
TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py (which
uses no hardware watchpoint/breakpoints) will sometimes only have 99
breakpoints when it expects 100, and any of the concurrent tests using
the shared harness (I've seen it in
TestConcurrentWatchBreakDelay.py,
TestConcurrentTwoBreakpointsOneSignal.py,
TestConcurrentSignalDelayWatch.py) can fail when the test harness checks
that there is only one thread still running at the end, and it finds two
-- one of them under pthread_exit / pthread_terminate. Both of these
failures happen on github main without my changes, and with my changes -
they are unrelated race conditions in these tests, and I'm sure I'll be
looking into them at some point if they hit the CI bots with frequency.
On my computer, these are in the 0.3-0.5% of the time class. But the CI
bots do have different timing.
Diffstat (limited to 'lldb/include')
-rw-r--r-- | lldb/include/lldb/Target/StopInfo.h | 5 | ||||
-rw-r--r-- | lldb/include/lldb/Target/Thread.h | 14 |
2 files changed, 19 insertions, 0 deletions
diff --git a/lldb/include/lldb/Target/StopInfo.h b/lldb/include/lldb/Target/StopInfo.h index 305fc5d..d1848fc 100644 --- a/lldb/include/lldb/Target/StopInfo.h +++ b/lldb/include/lldb/Target/StopInfo.h @@ -79,6 +79,11 @@ public: virtual bool IsValidForOperatingSystemThread(Thread &thread) { return true; } + /// A Continue operation can result in a false stop event + /// before any execution has happened. We need to detect this + /// and silently continue again one more time. + virtual bool WasContinueInterrupted(Thread &thread) { return false; } + // Sometimes the thread plan logic will know that it wants a given stop to // stop or not, regardless of what the ordinary logic for that StopInfo would // dictate. The main example of this is the ThreadPlanCallFunction, which diff --git a/lldb/include/lldb/Target/Thread.h b/lldb/include/lldb/Target/Thread.h index 30863ad..96ca95ad 100644 --- a/lldb/include/lldb/Target/Thread.h +++ b/lldb/include/lldb/Target/Thread.h @@ -11,6 +11,7 @@ #include <memory> #include <mutex> +#include <optional> #include <string> #include <vector> @@ -1226,6 +1227,16 @@ public: lldb::ValueObjectSP GetSiginfoValue(); + /// Request the pc value the thread had when previously stopped. + /// + /// When the thread performs execution, it copies the current RegisterContext + /// GetPC() value. This method returns that value, if it is available. + /// + /// \return + /// The PC value before execution was resumed. May not be available; + /// an empty std::optional is returned in that case. + std::optional<lldb::addr_t> GetPreviousFrameZeroPC(); + protected: friend class ThreadPlan; friend class ThreadList; @@ -1306,6 +1317,9 @@ protected: ///populated after a thread stops. lldb::StackFrameListSP m_prev_frames_sp; ///< The previous stack frames from ///the last time this thread stopped. + std::optional<lldb::addr_t> + m_prev_framezero_pc; ///< Frame 0's PC the last + /// time this thread was stopped. int m_resume_signal; ///< The signal that should be used when continuing this ///thread. lldb::StateType m_resume_state; ///< This state is used to force a thread to |