Downstream, AMD is carrying a testcase
(gdb.rocm/continue-over-kernel-exit.exp) that exposes a couple issues
with the amd-dbgapi target's handling of exited threads. The test
can't be added upstream yet, unfortunately, due to dependency on DWARF
extensions that can't be upstreamed yet. However, it can be found on
the mailing list on the same series as this patch.
The test spawns a kernel with a number of waves. The waves do nothing
but exit. There is a breakpoint on the s_endpgm instruction. Once
that breakpoint is hit, the test issues a "continue" command. We
should see one breakpoint hit per wave, and then the whole program
exiting. We do see that; however, we also see this:
[New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
[AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
*repeat for other waves*
...
[Thread 0x7ffff626f640 (LWP 3048491) exited]
[Thread 0x7fffeb7ff640 (LWP 3048488) exited]
[Inferior 1 (process 3048475) exited normally]
That "New AMDGPU Wave" output comes from infrun.c itself adding the
thread to the GDB thread list, because it got an event for a thread
not on the thread list yet. The output shows "?"s instead of proper
coordinates, because the event was a TARGET_WAITKIND_THREAD_EXITED,
i.e., the wave was already gone when infrun.c added the thread to the
thread list.
That shouldn't ever happen for the amd-dbgapi target; threads should
only ever be added by the backend.
Note "New AMDGPU Wave ?:?:?:1" is for wave 1. What happened was that
wave 1 terminated previously, and a previous call to
amd_dbgapi_target::update_thread_list() noticed the wave had vanished
and removed it from the GDB thread list. However, because the wave
was stepping when it terminated (due to the displaced step over the
s_endpgm instruction), it is guaranteed that the amd-dbgapi library
queues a WAVE_COMMAND_TERMINATED event for the exit.
When we process that WAVE_COMMAND_TERMINATED event, in
amd-dbgapi-target.c:process_one_event, we return it to the core as a
TARGET_WAITKIND_THREAD_EXITED event:
static void
process_one_event (amd_dbgapi_event_id_t event_id,
                   amd_dbgapi_event_kind_t event_kind)
{
  ...
  if (status == AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID
      && event_kind == AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED)
    ws.set_thread_exited (0);
  ...
}
Recall the wave is already gone from the GDB thread list. So when GDB
sees that TARGET_WAITKIND_THREAD_EXITED event for a thread it doesn't
know about, it adds the thread to the thread list, resulting in that:
[New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
and then, because it was a TARGET_WAITKIND_THREAD_EXITED event, GDB
marks the thread exited right afterwards:
[AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
The fix is to make amd_dbgapi_target::update_thread_list() _not_
delete vanishing waves iff they were stepping or in progress of being
stopped. These two cases are the ones dbgapi guarantees will result
in a WAVE_COMMAND_TERMINATED event if the wave terminates:
/**
* A command for a wave was not able to complete because the wave has
* terminated.
*
* Commands that can result in this event are ::amd_dbgapi_wave_stop and
* ::amd_dbgapi_wave_resume in single step mode. Since the wave terminated
* before stopping, this event will be reported instead of
* ::AMD_DBGAPI_EVENT_KIND_WAVE_STOP.
*
* The wave that terminated is available by the ::AMD_DBGAPI_EVENT_INFO_WAVE
* query. However, the wave will be invalid since it has already terminated.
* It is the client's responsibility to know what command was being performed
* and was unable to complete due to the wave terminating.
*/
AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED = 2,
As the comment says, it's GDB's responsibility to know whether the
wave was stepping or being stopped. Since we now have a wave_info map
with one entry for each wave, that seems like the place to store that
information. However, I still decided to put all the coordinate
information in its own structure. I.e., basically renamed the
existing wave_info to wave_coordinates, and then added a new wave_info
structure that holds the new state, plus a wave_coordinates object.
This seemed cleaner as there are places where we only need to
instantiate a wave_coordinates object.
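To make the pruning rule concrete, here is a minimal standalone sketch
of it. It is not the code from this patch: the type and function names
(wave_id, the wave_info fields, prune_vanished_waves,
handle_command_terminated) are hypothetical stand-ins, and the set of
live waves is assumed to have been fetched from the library beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-ins for the real GDB / amd-dbgapi types.
using wave_id = std::uint64_t;

struct wave_coordinates
{
  int agent_id = -1, queue_id = -1, dispatch_id = -1;
};

struct wave_info
{
  wave_coordinates coords;  // cached while the wave is still alive
  bool stepping = false;    // last resumed in single-step mode
  bool stopping = false;    // a stop request is still in flight
};

// Remove waves that are no longer in LIVE, except those for which a
// WAVE_COMMAND_TERMINATED event is still expected.  Returns the number
// of entries removed.
inline std::size_t
prune_vanished_waves (std::unordered_map<wave_id, wave_info> &known,
                      const std::unordered_set<wave_id> &live)
{
  std::size_t removed = 0;

  for (auto it = known.begin (); it != known.end ();)
    {
      bool vanished = live.count (it->first) == 0;
      bool exit_event_pending = it->second.stepping || it->second.stopping;

      if (vanished && !exit_event_pending)
        {
          it = known.erase (it);
          ++removed;
        }
      else
        ++it;
    }

  return removed;
}

// When the termination event finally arrives, the wave must still be
// known; only then is its entry dropped.
inline void
handle_command_terminated (std::unordered_map<wave_id, wave_info> &known,
                           wave_id id)
{
  auto it = known.find (id);
  assert (it != known.end ());
  known.erase (it);
}
```

In this model, a wave that vanished while stepping stays in the map
until handle_command_terminated runs, so the exit event never refers
to an unknown thread.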
There's an extra twist. The testcase also exercises stopping at a new
kernel right after the first kernel fully exits. In that scenario, we
were hitting this assertion after the first kernel fully exits and the
hit of the breakpoint at the second kernel is handled:
[amd-dbgapi] process_event_queue: Pulled event from dbgapi: event_id.handle = 26, event_kind = WAVE_STOP
[amd-dbgapi-lib] suspending queue_3, queue_2, queue_1 (refresh wave list)
../../src/gdb/amd-dbgapi-target.c:1625: internal-error: amd_dbgapi_thread_deleted: Assertion `it != info->wave_info_map.end ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
This is the exact same problem as above, just a different
manifestation. In this scenario, we end up in update_thread_list
successfully deleting the exited thread (because it was no longer the
current thread) that was incorrectly added by infrun.c. Because it
was added by infrun.c and not by amd-dbgapi-target.c:add_gpu_thread,
it doesn't have an entry in the wave_info map, so
amd_dbgapi_thread_deleted trips on this assertion:
gdb_assert (it != info->wave_info_map.end ());
here:
...
-> stop_all_threads
-> update_thread_list
-> target_update_thread_list
-> amd_dbgapi_target::update_thread_list
-> thread_db_target::update_thread_list
-> linux_nat_target::update_thread_list
-> delete_exited_threads
-> delete_thread
-> delete_thread_1
-> gdb::observers::observable<thread_info*>::notify
-> amd_dbgapi_thread_deleted
-> internal_error_loc
The testcase thus tries both running to exit after the first kernel
exits, and running to a breakpoint in a second kernel after the first
kernel exits.
Approved-By: Lancelot Six <lancelot.six@amd.com> (amdgpu)
Change-Id: I43a66f060c35aad1fe0d9ff022ce2afd0537f028
|
|
Currently, if you step over kernel exit, you see:
stepi
[AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
Command aborted, thread exited.
(gdb)
Those '?' are because the thread/wave is already gone by the time GDB
prints the "exited" notification, so we can't ask dbgapi for any info
about the wave anymore.
This commit fixes it by caching the wave's coordinates as soon as GDB
sees the wave for the first time, and making
amd_dbgapi_target::pid_to_str use the cached info.
At first I thought of clearing the wave_info object from a
thread_exited observer. However, that is too soon, resulting in this:
(gdb) si
[AMDGPU Wave 1:4:1:1 (0,0,0)/0 exited]
Command aborted, thread exited.
(gdb) thread
[Current thread is 6 (AMDGPU Wave ?:?:?:0 (?,?,?)/?) (exited)]
We need instead to clear the wave info when the thread is ultimately
deleted, so we get:
(gdb) si
[AMDGPU Wave 1:4:1:1 (0,0,0)/0 exited]
Command aborted, thread exited.
(gdb) thread
[Current thread is 6 (AMDGPU Wave 1:4:1:1 (0,0,0)/0) (exited)]
And for that, we need a new thread_deleted observable.
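The lifetime rule can be illustrated with a small standalone model.
This is only a sketch under assumptions: wave_name_cache and its
method names are made up for illustration, and the meaning given to
each printed field is a guess based on the output above, not taken
from the GDB sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

using wave_id = std::uint64_t;

// Assumed meaning of the printed fields; the real structure may differ.
struct wave_coordinates
{
  int agent, queue, dispatch;
  int group_x, group_y, group_z;
  int wave_in_group;
};

class wave_name_cache
{
public:
  // Called the first time the wave is seen, while it can still be queried.
  void on_wave_seen (wave_id id, const wave_coordinates &coords)
  {
    m_cache.emplace (id, coords);
  }

  // Deliberately a no-op: the exited thread may still be printed later.
  void on_thread_exited (wave_id) {}

  // Only when the thread is deleted are the coordinates forgotten.
  void on_thread_deleted (wave_id id)
  {
    std::size_t erased = m_cache.erase (id);
    assert (erased == 1);
    (void) erased;
  }

  // Format the wave name from the cache, falling back to '?'s.
  std::string to_string (wave_id id) const
  {
    auto it = m_cache.find (id);
    if (it == m_cache.end ())
      return "AMDGPU Wave ?:?:?:" + std::to_string (id) + " (?,?,?)/?";

    const wave_coordinates &c = it->second;
    return ("AMDGPU Wave " + std::to_string (c.agent) + ":"
            + std::to_string (c.queue) + ":" + std::to_string (c.dispatch)
            + ":" + std::to_string (id) + " ("
            + std::to_string (c.group_x) + "," + std::to_string (c.group_y)
            + "," + std::to_string (c.group_z) + ")/"
            + std::to_string (c.wave_in_group));
  }

private:
  std::unordered_map<wave_id, wave_coordinates> m_cache;
};
```

Clearing the entry in on_thread_exited instead would reproduce the
'?' output of the "thread" command shown above.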
Approved-By: Simon Marchi <simon.marchi@efficios.com>
Approved-By: Lancelot Six <lancelot.six@amd.com> (amdgpu)
Change-Id: I6c3e22541f051e1205f75eb657b04dc15e547580
|
|
Since GDB now requires C++17, we don't need the internally maintained
gdb::optional implementation. This patch makes the following replacements:
- gdb::optional -> std::optional
- gdb::in_place -> std::in_place
- #include "gdbsupport/gdb_optional.h" -> #include <optional>
This change has mostly been done automatically. One exception is
gdbsupport/thread-pool.* which did not use the gdb:: prefix as it
already lives in the gdb namespace.
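For reference, here is a small standalone example of the standard
facilities involved. The helper names (lookup_setting, make_padding,
require_setting) and the setting name are hypothetical and not taken
from the GDB tree.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical lookup: return the value of NAME if it is known.
inline std::optional<std::string>
lookup_setting (const std::string &name)
{
  if (name == "example-setting")
    return std::string ("on");

  return std::nullopt;
}

// std::in_place constructs the contained std::string directly from the
// given constructor arguments (here: N copies of a space).
inline std::optional<std::string>
make_padding (std::size_t n)
{
  return std::optional<std::string> (std::in_place, n, ' ');
}

// Variant for callers that know the setting must exist.
inline std::string
require_setting (const std::string &name)
{
  std::optional<std::string> value = lookup_setting (name);
  assert (value.has_value ());
  return *value;
}
```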
Change-Id: I19a92fa03e89637bab136c72e34fd351524f65e9
Approved-By: Tom Tromey <tom@tromey.com>
Approved-By: Pedro Alves <pedro@palves.net>
|
|
The AMD_DBGAPI_SCOPED_DEBUG_START_END macro in gdb/amd-dbgapi-target.c
is incorrectly controlled by "set debug infrun", while it should be
controlled by "set debug amd-dbgapi" instead. This commit fixes it.
Change-Id: I8ec2b1a4b9980c2d565a8aafd060ed070eeb3b29
|
|
Make the inferior's gdbarch field private, and add getters and setters.
This helped me by allowing me to put breakpoints on set_arch to know when
the inferior's arch was set. A subsequent patch in this series also
adds more things in set_arch.
Change-Id: I0005bd1ef4cd6b612af501201cec44e457998eec
Reviewed-By: John Baldwin <jhb@FreeBSD.org>
Approved-By: Andrew Burgess <aburgess@redhat.com>
|
|
The amd-dbgapi library exposes a setting called "memory precision" for
AMD GPUs [1]. Here's a copy of the description of the setting:
The AMD GPU can overlap the execution of memory instructions with other
instructions. This can result in a wave stopping due to a memory violation
or hardware data watchpoint hit with a program counter beyond the
instruction that caused the wave to stop.
Some architectures allow the hardware to be configured to always wait for
memory operations to complete before continuing. This will result in the
wave stopping at the instruction immediately after the one that caused the
stop event. Enabling this mode can make execution of waves significantly
slower.
Expose this option through a new "amdgpu precise-memory" setting.
The precise memory setting is per inferior. The setting is transferred
from one inferior to another when using the clone-inferior command, or
when a new inferior is created following an exec or a fork.
It can be set before starting the inferior, in which case GDB will
attempt to apply what the user wants when attaching amd-dbgapi. If the
user has requested to enable precise memory, but it can't be enabled
(not all hardware supports it), GDB prints a warning.
If precise memory is disabled, GDB prints a warning when hitting a
memory exception (translated into GDB_SIGNAL_SEGV or GDB_SIGNAL_BUS),
saying that the stop location may not be precise.
Note that the precise memory setting also affects memory watchpoint
reporting, but the watchpoint support for AMD GPUs hasn't been
upstreamed to GDB yet. When we do upstream watchpoint support, GDB will
produce a similar warning message when stopping due to a watchpoint if
precise memory is disabled.
Add a handful of tests. Add a util proc
"hip_devices_support_precise_memory", which indicates if all devices
used for testing support that feature.
[1] https://github.com/ROCm-Developer-Tools/ROCdbgapi/blob/687374258a27b5aab1309a7e8ded719e2f1ed3b1/include/amd-dbgapi.h.in#L6300-L6317
Change-Id: Ife1a99c0e960513da375ced8f8afaf8e47a61b3f
Approved-By: Lancelot Six <lancelot.six@amd.com>
|
|
After commit 9d7d58e7262, the amdgpu target started printing
"thread exited" messages when pruning waves that had terminated.
...
[AMDGPU Wave ?:?:?:2045 (?,?,?)/? exited]
[AMDGPU Wave ?:?:?:2046 (?,?,?)/? exited]
[AMDGPU Wave ?:?:?:2047 (?,?,?)/? exited]
[AMDGPU Wave ?:?:?:2048 (?,?,?)/? exited]
...
The issue was that before commit 9d7d58e7262, delete_thread was silent
by default due to a bug that the commit fixed.
Replace the amdgpu target's call to delete_thread with a call to
delete_thread_silent.
Change-Id: Ie5d5a4c5be851f092d2315b2afa6a36a30a05245
Approved-By: Simon Marchi <simon.marchi@efficios.com>
|
|
Since b080fe54fb3 "gdb: add inferior-specific breakpoints", the
breakpoint class has an "inferior" member used to handle
inferior-specific breakpoints. This creates a compilation error
in amd_dbgapi_target_breakpoint::check_status which declares a local
variable "inferior *inf".
Fix this by using "struct inferior *inf" instead.
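The name clash can be reproduced in isolation. The following is a
reduced sketch with simplified stand-in types, not the real GDB
classes: once a class has a data member named "inferior", the plain
name refers to that member inside the class, and the elaborated type
specifier "struct inferior" is needed to name the type.

```cpp
#include <cassert>

// Simplified stand-in for GDB's inferior type.
struct inferior
{
  int num;
};

// Simplified stand-in for a breakpoint with an inferior-specific filter.
struct breakpoint
{
  // Data member with the same name as the type above.  Inside this
  // class, the plain name `inferior` now refers to this member.
  int inferior = -1;

  bool applies_to (struct inferior *inf) const
  {
    // Writing `inferior *candidate = inf;` here would not compile,
    // because `inferior` names the int member.  The elaborated type
    // specifier makes name lookup ignore non-type names.
    struct inferior *candidate = inf;
    assert (candidate != nullptr);

    return inferior == -1 || inferior == candidate->num;
  }
};
```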
Change-Id: Icc4dc1ba96c7d3ff9d33f9cb384ffcf64eba26fb
Approved-By: Pedro Alves <pedro@palves.net>
|
|
When debugging a multi-process application where a parent spawns
multiple child processes using the ROCm runtime, I see the following
assertion failure:
../../gdb/amd-dbgapi-target.c:1071: internal-error: process_one_event: Assertion `runtime_state == AMD_DBGAPI_RUNTIME_STATE_UNLOADED' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x556e9a318540 gdb_internal_backtrace_1
../../gdb/bt-utils.c:122
0x556e9a318540 _Z22gdb_internal_backtracev
../../gdb/bt-utils.c:168
0x556e9a730224 internal_vproblem
../../gdb/utils.c:396
0x556e9a7304e0 _Z15internal_verrorPKciS0_P13__va_list_tag
../../gdb/utils.c:476
0x556e9a87aeb4 _Z18internal_error_locPKciS0_z
../../gdbsupport/errors.cc:58
0x556e9a29f446 process_one_event
../../gdb/amd-dbgapi-target.c:1071
0x556e9a29f446 process_event_queue
../../gdb/amd-dbgapi-target.c:1156
0x556e9a29faf2 _ZN17amd_dbgapi_target4waitE6ptid_tP17target_waitstatus10enum_flagsI16target_wait_flagE
../../gdb/amd-dbgapi-target.c:1262
0x556e9a6b0965 _Z11target_wait6ptid_tP17target_waitstatus10enum_flagsI16target_wait_flagE
../../gdb/target.c:2586
0x556e9a4c221f do_target_wait_1
../../gdb/infrun.c:3876
0x556e9a4d8489 operator()
../../gdb/infrun.c:3935
0x556e9a4d8489 do_target_wait
../../gdb/infrun.c:3964
0x556e9a4d8489 _Z20fetch_inferior_eventv
../../gdb/infrun.c:4365
0x556e9a87b915 gdb_wait_for_event
../../gdbsupport/event-loop.cc:694
0x556e9a87c3a9 gdb_wait_for_event
../../gdbsupport/event-loop.cc:593
0x556e9a87c3a9 _Z16gdb_do_one_eventi
../../gdbsupport/event-loop.cc:217
0x556e9a521689 start_event_loop
../../gdb/main.c:412
0x556e9a521689 captured_command_loop
../../gdb/main.c:476
0x556e9a523c04 captured_main
../../gdb/main.c:1320
0x556e9a523c04 _Z8gdb_mainP18captured_main_args
../../gdb/main.c:1339
0x556e9a24b1bf main
../../gdb/gdb.c:32
---------------------
../../gdb/amd-dbgapi-target.c:1071: internal-error: process_one_event: Assertion `runtime_state == AMD_DBGAPI_RUNTIME_STATE_UNLOADED' failed.
A problem internal to GDB has been detected,
Before diving into why this error appears, let's explore how things are
expected to work in normal circumstances. When a process being debugged
starts using the ROCm runtime, the following happens:
- The runtime registers itself to the driver.
- The driver creates a "runtime loaded" event and notifies the debugger
that a new event is available by writing to a file descriptor which is
registered in GDB's main event loop.
- GDB core calls the callback associated with this file descriptor
(dbgapi_notifier_handler). Because the amd-dbgapi-target is not
pushed at this point, the handler pulls the "runtime loaded" event
from the driver (this is the only event which can be available at this
point) and eventually pushes the amd-dbgapi-target on the inferior's
target stack.
In a nutshell, this is the expected AMDGPU runtime activation process.
From there, when new events are available regarding the GPU threads, the
same file descriptor is written to. The callback sees that the
amd-dbgapi-target is pushed, so it marks the amd_dbgapi_async_event_handler.
This will later cause amd_dbgapi_target::wait to be called. The wait
method pulls all the available events from the driver and handles them.
The wait method returns the information conveyed by the first event; the
other events are cached for later calls of the wait method.
Note that because we are under the wait method, we know that the
amd-dbgapi-target is pushed on the inferior target stack. This implies
that the runtime activation event has been seen already. As a
consequence, we cannot receive another event indicating that the runtime
gets activated. This is what the failing assertion checks.
When we have multiple inferiors, however, there is a flaw in
what has been described above. If one inferior (let's call it inferior
1) already has the amd-dbgapi-target pushed to its target stack and
another inferior (inferior 2) activates the ROCm runtime, here is what
can happen:
- The driver creates the runtime activation for inferior 2 and writes to
the associated file descriptor.
- GDB has inferior 1 selected and calls target_wait for some reason.
- This prompts amd_dbgapi_target::wait to be called. The method pulls
all events from the driver, including the runtime activation event for
inferior 2, leading to the assertion failure.
The fix for this problem is simple: we need to
make sure that amd_dbgapi_target::wait only pulls events for the current
inferior from the driver. This is what this patch implements.
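The idea can be modeled with per-process queues. This is a toy model
under assumptions, not the actual change: driver_queues, event and
pull_events_for are hypothetical names, and the real code goes through
the amd-dbgapi library rather than a map of queues.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <vector>

enum class event_kind { runtime_loaded, wave_stop };

struct event
{
  int pid;
  event_kind kind;
};

// Hypothetical model of the driver side: one pending queue per process.
using driver_queues = std::map<int, std::deque<event>>;

// Drain only the queue that belongs to CURRENT_PID.  Events of other
// processes (for instance a pending "runtime loaded" event) stay
// queued until that process is the one being waited on.
inline std::vector<event>
pull_events_for (driver_queues &queues, int current_pid)
{
  std::vector<event> pulled;

  auto it = queues.find (current_pid);
  if (it == queues.end ())
    return pulled;

  for (const event &ev : it->second)
    {
      assert (ev.pid == current_pid);
      pulled.push_back (ev);
    }

  it->second.clear ();
  return pulled;
}
```

With inferior 1 selected, the runtime activation event of inferior 2
is left untouched, so the assertion in the wait path can no longer see
it.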
This patch also includes a testcase which could fail before this patch.
This patch has been tested on a system with multiple GPUs which had more
chances to reproduce the original bug. It has also been tested on top
of the downstream ROCgdb port which has more AMDGPU related tests. The
testcase has been tested with `make check check-read1 check-readmore`.
Approved-By: Pedro Alves <pedro@palves.net>
|
|
Current implementation of amd_dbgapi_target::detach (inferior *, int)
does the following:
remove_breakpoints_inf (current_inferior ());
detach_amd_dbgapi (inf);
beneath ()->detach (inf, from_tty);
I find using a mix of `current_inferior ()` and `inf` disturbing.
At this point, we know that both are the same (target_detach does assert
that `inf == current_inferior ()` before calling target_ops::detach).
To improve consistency, this patch replaces `current_inferior ()` with
`inf` in amd_dbgapi_target::detach.
Change-Id: I01b7ba2e661c25839438354b509d7abbddb7c5ed
Approved-By: Pedro Alves <pedro@palves.net>
|
|
Fix some more typos:
- distinquish -> distinguish
- actualy -> actually
- singe -> single
- frash -> frame
- chid -> child
- dissassembler -> disassembler
- uninitalized -> uninitialized
- precontidion -> precondition
- regsiters -> registers
- marge -> merge
- sate -> state
- garanteed -> guaranteed
- explictly -> explicitly
- prefices (nonstandard plural) -> prefixes
- bondary -> boundary
- formated -> formatted
- ithe -> the
- arrav -> array
- coresponding -> corresponding
- owend -> owned
- fials -> fails
- diasm -> disasm
- ture -> true
- tpye -> type
There's one code change, the name of macro SIG_CODE_BONDARY_FAULT changed to
SIG_CODE_BOUNDARY_FAULT.
Tested on x86_64-linux.
|
|
Fix a few typos:
- implemention -> implementation
- convertion(s) -> conversion(s)
- backlashes -> backslashes
- signoring -> ignoring
- (un)ambigious -> (un)ambiguous
- occured -> occurred
- hidding -> hiding
- temporarilly -> temporarily
- immediatelly -> immediately
- sillyness -> silliness
- similiar -> similar
- porkuser -> pokeuser
- thats -> that
- alway -> always
- supercede -> supersede
- accomodate -> accommodate
- aquire -> acquire
- priveleged -> privileged
- priviliged -> privileged
- priviledges -> privileges
- privilige -> privilege
- recieve -> receive
- (p)refered -> (p)referred
- succesfully -> successfully
- successfuly -> successfully
- responsability -> responsibility
- wether -> whether
- wich -> which
- disasbleable -> disableable
- descriminant -> discriminant
- construcstor -> constructor
- underlaying -> underlying
- underyling -> underlying
- structureal -> structural
- appearences -> appearances
- terciarily -> tertiarily
- resgisters -> registers
- reacheable -> reachable
- likelyhood -> likelihood
- intepreter -> interpreter
- disassemly -> disassembly
- covnersion -> conversion
- conviently -> conveniently
- atttribute -> attribute
- struction -> struct
- resonable -> reasonable
- popupated -> populated
- namespaxe -> namespace
- intialize -> initialize
- identifer(s) -> identifier(s)
- expection -> exception
- exectuted -> executed
- dungerous -> dangerous
- dissapear -> disappear
- completly -> completely
- (inter)changable -> (inter)changeable
- beakpoint -> breakpoint
- automativ -> automatic
- alocating -> allocating
- agressive -> aggressive
- writting -> writing
- reguires -> requires
- registed -> registered
- recuding -> reducing
- opeartor -> operator
- ommitted -> omitted
- modifing -> modifying
- intances -> instances
- imbedded -> embedded
- gdbaarch -> gdbarch
- exection -> execution
- direcive -> directive
- demanged -> demangled
- decidely -> decidedly
- argments -> arguments
- agrument -> argument
- amespace -> namespace
- targtet -> target
- supress(ed) -> suppress(ed)
- startum -> stratum
- squence -> sequence
- prompty -> prompt
- overlow -> overflow
- memember -> member
- languge -> language
- geneate -> generate
- funcion -> function
- exising -> existing
- dinking -> syncing
- destroh -> destroy
- clenaed -> cleaned
- changep -> changedp (name of variable)
- arround -> around
- aproach -> approach
- whould -> would
- symobl -> symbol
- recuse -> recurse
- outter -> outer
- freeds -> frees
- contex -> context
Tested on x86_64-linux.
Reviewed-By: Tom Tromey <tom@tromey.com>
|
|
Prior to this patch, it's not possible for GDB to debug GPU code in fork
children or after an exec. The amd-dbgapi target attaches to processes
when an inferior appears due to a "run" or "attach" command, but not
after a fork or exec. This patch adds support for that, such that it's
possible for an inferior to fork and for GDB to debug the GPU code in
the child.
To achieve that, use the inferior_forked and inferior_execd observers.
In the case of fork, we have nothing to do if `child_inf` is nullptr,
meaning that GDB won't debug the child. We also don't attach if the
inferior has vforked. We are already attached to the parent's address
space, which is shared with the child, so trying to attach would cause
problems. And anyway, the inferior can't do anything other than exec or
exit, it certainly won't start GPU kernels before exec'ing.
In the case of exec, we detach from the exec'ing inferior and attach to
the following inferior. This works regardless of whether they are the
same or not. If they are the same, meaning the execution continues in
the existing inferior, we need to do a detach/attach anyway, as
amd-dbgapi needs to be aware of the new address space created by the
exec.
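The decisions described above can be summarized in a small standalone
model. It is only a sketch: dbgapi_attachments, fork_kind,
should_attach_to_fork_child and handle_exec are hypothetical names,
and inferiors are reduced to plain integers.

```cpp
#include <cassert>
#include <set>

enum class fork_kind { regular, vforked };

// Attach to a fork child only if GDB follows it and it does not share
// the parent's address space.
inline bool
should_attach_to_fork_child (bool have_child_inf, fork_kind kind)
{
  if (!have_child_inf)
    return false;  // GDB won't debug the child

  if (kind == fork_kind::vforked)
    return false;  // already attached to the shared address space

  return true;
}

// Hypothetical bookkeeping of which inferiors are attached.
struct dbgapi_attachments
{
  std::set<int> attached;
  int attach_calls = 0;

  void attach (int inf)
  {
    bool inserted = attached.insert (inf).second;
    assert (inserted);
    (void) inserted;
    ++attach_calls;
  }

  void detach (int inf) { attached.erase (inf); }
};

// On exec, always detach and re-attach, even when execution continues
// in the same inferior, so that the new address space is picked up.
inline void
handle_exec (dbgapi_attachments &state, int execing_inf, int following_inf)
{
  state.detach (execing_inf);
  state.attach (following_inf);
}
```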
Note that we use observers and not target_ops::follow_{fork,exec} here.
When the amd-dbgapi target is compiled in, it will attach (in the
amd_dbgapi_process_attach sense, not the ptrace sense) to native
inferiors when they appear, but won't push itself on the inferior's
target stack just yet. It only pushes itself if the inferior
initializes the ROCm runtime. So, if a non-GPU-using inferior calls
fork, an amd_dbgapi_target::follow_fork method would not get called.
Same for exec. A previous version of the code had the amd-dbgapi target
pushed all the time, in which case we could use the target methods. But
we prefer having the target pushed only when necessary, as it's less
intrusive when doing native debugging that doesn't involve the GPU.
Change-Id: I5819c151c371120da8bab2fa9cbfa8769ba1d6f9
Reviewed-By: Pedro Alves <pedro@palves.net>
|
|
Make find_thread_ptid (the overload that takes a process_stratum_target)
a method of process_stratum_target.
Change-Id: Ib190a925a83c6b93e9c585dc7c6ab65efbdd8629
Reviewed-By: Tom Tromey <tom@tromey.com>
|
|
The copyright years in the ROCm files (e.g. solib-rocm.c) are wrong:
they end in 2022 instead of 2023. I suppose because I posted (or at
least prepared) the patches in 2022 but merged them in 2023, and forgot
to update the year. I found a bunch of other files that are in the same
situation. Fix them all up.
Change-Id: Ia55f5b563606c2ba6a89046f22bc0bf1c0ff2e10
Reviewed-By: Tom Tromey <tom@tromey.com>
|
|
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
#include <cstdio>

__global__ void
do_an_addition (int a, int b, int *out)
{
  *out = a + b;
}

int
main ()
{
  int *result_ptr, result;

  /* Allocate memory for the device to write the result to. */
  hipError_t error = hipMalloc (&result_ptr, sizeof (int));
  assert (error == hipSuccess);

  /* Run `do_an_addition` on one workgroup containing one work item. */
  do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);

  /* Copy result from device to host. Note that this acts as a synchronization
     point, waiting for the kernel dispatch to complete. */
  error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
  assert (error == hipSuccess);

  printf ("result is %d\n", result);
  assert (result == 3);

  return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
(`unknown arch 0xe0` is the amdgcn architecture, which my `file` doesn't know about.)
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as shown in the following diagram:
+---------------------------+     +-------------+     +------------------+
| GDB   | amd-dbgapi target | <-> |     AMD     |     |   Linux kernel   |
|       +-------------------+     |   Debugger  |     +--------+         |
|       | amdgcn gdbarch    | <-> |     API     | <=> | AMDGPU |         |
|       +-------------------+     |             |     | driver |         |
|       | solib-rocm        | <-> | (dbgapi.so) |     +--------+---------+
+---------------------------+     +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is the case, for example, for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve this, we therefore resort to the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and delegate to the target beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
  if (!ptid_is_gpu (regcache->ptid ()))
    {
      beneath ()->fetch_registers (regcache, regno);
      return;
    }

  // handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU, and even less
likely for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
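The ptid scheme described above can be illustrated with a minimal,
self-contained sketch. It is not the GDB implementation: toy_ptid is a
hypothetical stand-in for GDB's ptid_t, and make_gpu_ptid is a helper
name assumed here for illustration. Only the (pid, 1, wave id) pattern
and the ptid_is_gpu name come from the text.

```cpp
#include <cstdint>

// Hypothetical stand-in for GDB's ptid_t, reduced to the three fields
// that matter for this illustration.
struct toy_ptid
{
  int pid;
  long lwp;
  std::uint64_t tid;
};

// Build a GPU ptid following the (pid, 1, wave id) pattern.
// The helper name is an assumption made for this sketch.
constexpr toy_ptid
make_gpu_ptid (int pid, std::uint64_t wave_id)
{
  return toy_ptid {pid, 1, wave_id};
}

// A ptid is taken to be a GPU ptid when lwp == 1 but pid != 1, a
// combination the text argues a host thread cannot have on Linux.
constexpr bool
ptid_is_gpu (const toy_ptid &ptid)
{
  return ptid.pid != 1 && ptid.lwp == 1;
}

// A host thread: its lwp is a regular kernel thread id.
static_assert (!ptid_is_gpu (toy_ptid {3048475, 3048491, 0}),
               "host thread must not be classified as GPU");

// Wave 1 of the same inferior.
static_assert (ptid_is_gpu (make_gpu_ptid (3048475, 1)),
               "wave must be classified as GPU");

// An inferior whose pid is 1 (for instance the first process of a
// container's pid namespace): its waves are misclassified as host
// threads, which is the limitation described above.
static_assert (!ptid_is_gpu (make_gpu_ptid (1, 1)),
               "pid 1 defeats the scheme");
```

The last assertion shows why the container case is problematic: once
the inferior itself has pid 1, the predicate can no longer tell waves
apart from host threads.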
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform: in that case, it does
not intercept target calls at all.
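The push / unpush life cycle can be pictured with a toy model of a
target stack. This is only a sketch under simplifying assumptions: the
toy_* types, the reduced two-entry stratum enum and the
on_runtime_state_change function are all hypothetical and do not
mirror GDB's actual classes. Only the idea of one slot per stratum,
with arch_stratum sitting above process_stratum, comes from the text.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Simplified strata; only the ordering matters (higher is on top).
// GDB has more strata than the two modelled here.
enum toy_stratum { process_stratum = 0, arch_stratum = 1, num_strata = 2 };

struct toy_target
{
  toy_stratum stratum;
  std::string name;
};

// One slot per stratum, hence at most one target per stratum.
class toy_target_stack
{
public:
  void push (toy_target *t) { m_slots[t->stratum] = t; }

  void unpush (toy_target *t)
  {
    if (m_slots[t->stratum] == t)
      m_slots[t->stratum] = nullptr;
  }

  // Topmost pushed target, i.e. the first one to see a target call.
  toy_target *top () const
  {
    for (std::size_t i = num_strata; i-- > 0;)
      if (m_slots[i] != nullptr)
        return m_slots[i];
    return nullptr;
  }

private:
  std::array<toy_target *, num_strata> m_slots {};
};

// Hypothetical reaction to a runtime state change reported through
// the notifier fd: push when the ROCm runtime is enabled, unpush when
// it is disabled.
inline void
on_runtime_state_change (toy_target_stack &stack, toy_target &dbgapi,
                         bool runtime_enabled)
{
  if (runtime_enabled)
    stack.push (&dbgapi);
  else
    stack.unpush (&dbgapi);
}
```

In this model, as long as the runtime is not enabled, the native
target stays on top and sees every call directly, which corresponds to
the "minimal footprint" property described above.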
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
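The rules above can be summarized as a small decision function. This
is a sketch of the logic as described in this message, not the actual
configure script: every type and function name below is hypothetical,
and "unset" stands for --with-amd-dbgapi not being given on the
command line. The combination of an explicit amdgcn request with
--with-amd-dbgapi=no is not covered by the list, so the sketch simply
ignores the option in that case, which is an assumption.

```cpp
// Value of --with-amd-dbgapi; "unset" means the option was not given.
enum class with_amd_dbgapi { unset, yes, no };

enum class amdgcn_support { enabled, disabled, configure_fails };

struct configure_request
{
  bool explicit_amdgcn;  // --target=amdgcn-*-* or --enable-targets=amdgcn-*-*
  bool enable_all;       // --enable-targets=all
  with_amd_dbgapi with;  // --with-amd-dbgapi=...
};

// LIBRARY_FOUND is the result of the pkg-config probe.  It is ignored
// in the cases where the list says no probe is performed.
inline amdgcn_support
decide_amdgcn_support (const configure_request &req, bool library_found)
{
  if (req.explicit_amdgcn)
    {
      // Explicit request: probe, and fail if the library is missing.
      // Assumption: the --with-amd-dbgapi value is not consulted here.
      return (library_found
              ? amdgcn_support::enabled
              : amdgcn_support::configure_fails);
    }

  if (req.enable_all)
    {
      // --with-amd-dbgapi=no: do not probe, support is disabled.
      if (req.with == with_amd_dbgapi::no)
        return amdgcn_support::disabled;

      if (library_found)
        return amdgcn_support::enabled;

      // Library missing: an error only if the user insisted on it.
      return (req.with == with_amd_dbgapi::yes
              ? amdgcn_support::configure_fails
              : amdgcn_support::disabled);
    }

  // Otherwise: no probe, no amdgcn support.
  return amdgcn_support::disabled;
}
```

For instance, a request with only enable_all set and a missing
library yields amdgcn_support::disabled, while the same request with
--with-amd-dbgapi=yes yields amdgcn_support::configure_fails.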
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
|