aboutsummaryrefslogtreecommitdiff
path: root/lldb/include
diff options
context:
space:
mode:
authorJason Molenda <jmolenda@apple.com>2024-02-14 13:06:20 -0800
committerGitHub <noreply@github.com>2024-02-14 13:06:20 -0800
commitaab48c99c2234e348aa37657accfb6110c84c9b7 (patch)
tree2880adbcfe4b958d3295afdeffc55cece42bf28c /lldb/include
parentf3b92fae138f47fb78a55254d73913f1e7935852 (diff)
downloadllvm-aab48c99c2234e348aa37657accfb6110c84c9b7.zip
llvm-aab48c99c2234e348aa37657accfb6110c84c9b7.tar.gz
llvm-aab48c99c2234e348aa37657accfb6110c84c9b7.tar.bz2
[lldb] Detect a Darwin kernel issue and work around it (#81573)
On arm64 machines, when there is a hardware breakpoint or watchpoint set, and lldb has instruction-stepped a thread, and then done a Process::Resume, we will sometimes receive an extra "instruction step completed" mach exception and the pc has not advanced. From a user's perspective, they hit Continue and lldb stops again at the same spot. From the testsuite's perspective, this has been a constant source of testsuite failures for any test using hardware watchpoints and breakpoints, the arm64 CI bots seem especially good at hitting this issue. Jim and I have been slowly looking at this for a few months now, and finally I decided to try to detect this situation in lldb and silently resume the process again when it happens. We were already detecting this "got an insn-step finished mach exception but this thread was not instruction stepping" combination in StopInfoMachException where we take the mach exception and create a StopInfo object for it. We had a lot of logging we used to understand the failure as it was hit on the bots in assert builds. This patch adds a new case to `Thread::GetPrivateStopInfo()` to call the StopInfo's (new) `IsContinueInterrupted()` method. In StopInfoMachException, where we previously had logging for assert builds, I now note it in an ivar, and when `Thread::GetPrivateStopInfo()` asks if this has happened, we check all of the combination of events that this comes up: We have a hardware breakpoint or watchpoint, we were not instruction stepping this thread but got an insn-step mach exception, the pc is the same as the previous stop's pc. And in that case, `Thread::GetPrivateStopInfo()` returns no StopInfo -- indicating that this thread would like to resume execution. The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and `m_prev_frames_sp`. When a thread resumes execution, we move `m_curr_frames_sp` in to `m_prev_frames_sp` and when it stops executing, w euse `m_prev_frames_sp` to seed the new `m_curr_frames_sp` if most of the stack is the same as before. In this same location, I now save the Thread's RegisterContext::GetPC into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs this information to check all of the conditions I outlined above for `IsContinueInterrupted`. This has passed exhaustive testing and we do not have any testsuite failures for hardware watchpoints and breakpoints due to this kernel bug with the patch in place. In focusing on these tests for thousands of runs, I have found two other uncommon race conditions for the TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py (which uses no hardware watchpoint/breakpoints) will sometimes only have 99 breakpoints when it expects 100, and any of the concurrent tests using the shared harness (I've seen it in TestConcurrentWatchBreakDelay.py, TestConcurrentTwoBreakpointsOneSignal.py, TestConcurrentSignalDelayWatch.py) can fail when the test harness checks that there is only one thread still running at the end, and it finds two -- one of them under pthread_exit / pthread_terminate. Both of these failures happen on github main without my changes, and with my changes - they are unrelated race conditions in these tests, and I'm sure I'll be looking into them at some point if they hit the CI bots with frequency. On my computer, these are in the 0.3-0.5% of the time class. But the CI bots do have different timing.
Diffstat (limited to 'lldb/include')
-rw-r--r--lldb/include/lldb/Target/StopInfo.h5
-rw-r--r--lldb/include/lldb/Target/Thread.h14
2 files changed, 19 insertions, 0 deletions
diff --git a/lldb/include/lldb/Target/StopInfo.h b/lldb/include/lldb/Target/StopInfo.h
index 305fc5d..d1848fc 100644
--- a/lldb/include/lldb/Target/StopInfo.h
+++ b/lldb/include/lldb/Target/StopInfo.h
@@ -79,6 +79,11 @@ public:
virtual bool IsValidForOperatingSystemThread(Thread &thread) { return true; }
+ /// A Continue operation can result in a false stop event
+ /// before any execution has happened. We need to detect this
+ /// and silently continue again one more time.
+ virtual bool WasContinueInterrupted(Thread &thread) { return false; }
+
// Sometimes the thread plan logic will know that it wants a given stop to
// stop or not, regardless of what the ordinary logic for that StopInfo would
// dictate. The main example of this is the ThreadPlanCallFunction, which
diff --git a/lldb/include/lldb/Target/Thread.h b/lldb/include/lldb/Target/Thread.h
index 30863ad..96ca95ad 100644
--- a/lldb/include/lldb/Target/Thread.h
+++ b/lldb/include/lldb/Target/Thread.h
@@ -11,6 +11,7 @@
#include <memory>
#include <mutex>
+#include <optional>
#include <string>
#include <vector>
@@ -1226,6 +1227,16 @@ public:
lldb::ValueObjectSP GetSiginfoValue();
+ /// Request the pc value the thread had when previously stopped.
+ ///
+ /// When the thread performs execution, it copies the current RegisterContext
+ /// GetPC() value. This method returns that value, if it is available.
+ ///
+ /// \return
+ /// The PC value before execution was resumed. May not be available;
+ /// an empty std::optional is returned in that case.
+ std::optional<lldb::addr_t> GetPreviousFrameZeroPC();
+
protected:
friend class ThreadPlan;
friend class ThreadList;
@@ -1306,6 +1317,9 @@ protected:
///populated after a thread stops.
lldb::StackFrameListSP m_prev_frames_sp; ///< The previous stack frames from
///the last time this thread stopped.
+ std::optional<lldb::addr_t>
+ m_prev_framezero_pc; ///< Frame 0's PC the last
+ /// time this thread was stopped.
int m_resume_signal; ///< The signal that should be used when continuing this
///thread.
lldb::StateType m_resume_state; ///< This state is used to force a thread to