|
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----ยด
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
|