riscv-gnu-toolchain/gdb.git - Unnamed repository; edit this file 'description' to name the repository.

diff options

author	Pedro Alves <pedro@palves.net>	2023-12-01 17:45:21 +0000
committer	Pedro Alves <pedro@palves.net>	2023-12-20 21:20:20 +0000
commit	33179c5b57c9fc89c061deb1d387f972b8604b49 (patch)
tree	8b68cd0f5c846575def1369d11b6b8f061062173 /binutils/ChangeLog-2018
parent	7d1fd6713582f2436a9a569ac7a8700c75e35a42 (diff)
download	gdb-33179c5b57c9fc89c061deb1d387f972b8604b49.zip gdb-33179c5b57c9fc89c061deb1d387f972b8604b49.tar.gz gdb-33179c5b57c9fc89c061deb1d387f972b8604b49.tar.bz2

Fix handling of vanishing threads that were stepping/stopping

Downstream, AMD is carrying a testcase (gdb.rocm/continue-over-kernel-exit.exp) that exposes a couple issues with the amd-dbgapi target's handling of exited threads. The test can't be added upstream yet, unfortunately, due to dependency on DWARF extensions that can't be upstreamed yet. However, it can be found on the mailing list on the same series as this patch. The test spawns a kernel with a number of waves. The waves do nothing but exit. There is a breakpoint on the s_endpgm instruction. Once that breakpoint is hit, the test issues a "continue" command. We should see one breakpoint hit per wave, and then the whole program exiting. We do see that, however we also see this: [New AMDGPU Wave ?:?:?:1 (?,?,?)/?] [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited] *repeat for other waves* ... [Thread 0x7ffff626f640 (LWP 3048491) exited] [Thread 0x7fffeb7ff640 (LWP 3048488) exited] [Inferior 1 (process 3048475) exited normally] That "New AMDGPU Wave" output comes from infrun.c itself adding the thread to the GDB thread list, because it got an event for a thread not on the thread list yet. The output shows "?"s instead of proper coordinates, because the event was a TARGET_WAITKIND_THREAD_EXITED, i.e., the wave was already gone when infrun.c added the thread to the thread list. That shouldn't ever happen for the amd-dbgapi target, threads should only ever be added by the backend. Note "New AMDGPU Wave ?:?:?:1" is for wave 1. What happened was that wave 1 terminated previously, and a previous call to amd_dbgapi_target::update_thread_list() noticed the wave had vanished and removed it from the GDB thread list. However, because the wave was stepping when it terminated (due to the displaced step over the s_endpgm) instruction, it is guaranteed that the amd-dbgapi library queues a WAVE_COMMAND_TERMINATED event for the exit. When we process that WAVE_COMMAND_TERMINATED event, in amd-dbgapi-target.c:process_one_event, we return it to the core as a TARGET_WAITKIND_THREAD_EXITED event: static void process_one_event (amd_dbgapi_event_id_t event_id, amd_dbgapi_event_kind_t event_kind) { ... if (status == AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID && event_kind == AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED) ws.set_thread_exited (0); ... } Recall the wave is already gone from the GDB thread list. So when GDB sees that TARGET_WAITKIND_THREAD_EXITED event for a thread it doesn't know about, it adds the thread to the thread list, resulting in that: [New AMDGPU Wave ?:?:?:1 (?,?,?)/?] and then, because it was a TARGET_WAITKIND_THREAD_EXITED event, GDB marks the thread exited right afterwards: [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited] The fix is to make amd_dbgapi_target::update_thread_list() _not_ delete vanishing waves iff they were stepping or in progress of being stopped. These two cases are the ones dbgapi guarantees will result in a WAVE_COMMAND_TERMINATED event if the wave terminates: /** * A command for a wave was not able to complete because the wave has * terminated. * * Commands that can result in this event are ::amd_dbgapi_wave_stop and * ::amd_dbgapi_wave_resume in single step mode. Since the wave terminated * before stopping, this event will be reported instead of * ::AMD_DBGAPI_EVENT_KIND_WAVE_STOP. * * The wave that terminated is available by the ::AMD_DBGAPI_EVENT_INFO_WAVE * query. However, the wave will be invalid since it has already terminated. * It is the client's responsibility to know what command was being performed * and was unable to complete due to the wave terminating. */ AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED = 2, As the comment says, it's GDB's responsability to know whether the wave was stepping or being stopped. Since we now have a wave_info map with one entry for each wave, that seems like the place to store that information. However, I still decided to put all the coordinate information in its own structure. I.e., basically renamed the existing wave_info to wave_coordinates, and then added a new wave_info structure that holds the new state, plus a wave_coordinates object. This seemed cleaner as there are places where we only need to instantiate a wave_coordinates object. There's an extra twist. The testcase also exercises stopping at a new kernel right after the first kernel fully exits. In that scenario, we were hitting this assertion after the first kernel fully exits and the hit of the breakpoint at the second kernel is handled: [amd-dbgapi] process_event_queue: Pulled event from dbgapi: event_id.handle = 26, event_kind = WAVE_STOP [amd-dbgapi-lib] suspending queue_3, queue_2, queue_1 (refresh wave list) ../../src/gdb/amd-dbgapi-target.c:1625: internal-error: amd_dbgapi_thread_deleted: Assertion `it != info->wave_info_map.end ()' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. This is the exact same problem as above, just a different manifestation. In this scenario, we end up in update_thread_list successfully deleting the exited thread (because it was no longer the current thread) that was incorrectly added by infrun.c. Because it was added by infrun.c and not by amd-dbgapi-target.c:add_gpu_thread, it doesn't have an entry in the wave_info map, so amd_dbgapi_thread_deleted trips on this assertion: gdb_assert (it != info->wave_info_map.end ()); here: ... -> stop_all_threads -> update_thread_list -> target_update_thread_list -> amd_dbgapi_target::update_thread_list -> thread_db_target::update_thread_list -> linux_nat_target::update_thread_list -> delete_exited_threads -> delete_thread -> delete_thread_1 -> gdb::observers::observable<thread_info*>::notify -> amd_dbgapi_thread_deleted -> internal_error_loc The testcase thus tries both running to exit after the first kernel exits, and running to a breakpoint in a second kernel after the first kernel exits. Approved-By: Lancelot Six <lancelot.six@amd.com> (amdgpu) Change-Id: I43a66f060c35aad1fe0d9ff022ce2afd0537f028

Diffstat (limited to 'binutils/ChangeLog-2018')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: