
Multi-inferior fixes #106

Draft
simark wants to merge 5 commits into ROCm:amd-staging from simark:multi-inferior-fixes

Conversation


@simark simark commented Apr 29, 2026

This is a small patch series that we have had internally for a while; it needs to be rebased and rechecked.

simark added 5 commits April 24, 2024 10:35
This function can't return a NULL pointer, so make it return a reference
instead.

Change-Id: I0970d6d0757181291b300bd840037a48330a7fbb
The `target_ops::async_wait_fd()` method allows targets to provide one
file descriptor that GDB's core can listen to in order to receive async
events.  The amd-dbgapi target needs to be able to report two: its own
fd and the underlying linux-nat target's fd.

To accommodate this need, this patch changes
`target_ops::async_wait_fd()` to return a vector of file descriptors.
After this patch, `wait_one` still calls `async_wait_fds` on the
process target, so the amd-dbgapi target's file descriptors are not
used just yet; that will be done in a follow-up patch.

Change-Id: I971be69c4737ea4e69d93d9608039223653d528c
The problem fixed by this patch was identified while investigating
SWDEV-440024.  The following command reproduces the problem, leading to
a hang during the `run` command:

    $ ./gdb -q -nx --data-directory=data-directory \
	    -ex "set breakpoint pending on" \
	    -ex "b bit_extract_kernel" \
	    -ex "target native" \
	    -ex "add-inferior" \
	    -ex "inferior 2" \
	    -ex "file samples/0_Intro/bit_extract/build/bit_extract" \
	    -ex run

The story is:

 - Inferior 1 is connected to the native Linux target but not executing
   anything (I believe it would be the same result if it was stopped
   somewhere while debugging a non-GPU program, but pushing the native
   target manually is just easier).

 - Inferior 2 runs to a breakpoint in GPU code (in my test, there are
   two waves, both hitting the breakpoint).

 - stop_all_threads gets called post-breakpoint hit.

 - A stop request gets sent for the wave that didn't hit the breakpoint.

 - `wait_one` starts by polling for events using inferior 1.  This hits
   the Linux native target, which replies TARGET_WAITKIND_NO_RESUMED.

 - `wait_one` disables async using `target_async (false)`, which has the
   effect of calling `target_ops::async (false)` on all inferiors
   sharing the native target, both inferiors in our case.

 - `wait_one` considers inferior 2, but skips over it since
   `target->is_async_p ()` returns false (where `target` is the Linux
   native target).

 - We are in an infinite loop inside `stop_all_threads`, because the
   amd-dbgapi target would like to return an event announcing the stop
   of the second wave, but it can't because `wait_one` will never
   poll it.

Fix it by making it possible for the Linux native target to be !async,
while the amd-dbgapi target, placed above, is async.

The changes are:

 - Make `amd_dbgapi_target` implement `target_ops::is_async_p`.  It
   returns true if itself is async or the beneath target is async.

 - In `wait_one`, when getting TARGET_WAITKIND_NO_RESUMED, call
   `inf->top_target ()->async (false)` instead of `target_async
   (false)`.  In the reproducer, this has the effect of only calling
   `async (false)` on the first inferior.  This leaves inferior 2 with
   the amd-dbgapi target async, and the Linux native target !async.

   One thing that `target_async` does is enable/disable infrun's async
   (calls `infrun_async`).  I don't think it's needed in the context of
   `wait_one` / `stop_all_threads`.

 - Not really related to the bug fix, but complementary to the previous point:
   change `reenable_target_async` to call `inf->top_target ()->async (true)`
   instead of `target_async (1)`.  When re-enabling async for the purpose of
   `stop_all_threads`, we don't need to enable infrun's own async (we don't
   consume pending / stored waitstatuses at this point).  But also, in
   `reenable_target_async`, we loop over all inferiors.  And then, in
   `target_async`, we loop over all inferiors sharing the current inferior's
   process target.  It's not the end of the world, but we loop over
   inferiors more often than we need to, and make more calls to
   `is_async` than necessary.

 - In `wait_one`, call `is_async_p` on the inferior's top target to know
   if we should skip over that inferior, instead of on the inferior's
   process target.  In our reproducer, it makes it so this call hits the
   amd-dbgapi target for inferior 2.  That returns true if the
   amd-dbgapi target still has an event to report, even if the Linux
   native target beneath has been set to !async.

 - In `wait_one` again, call `async_wait_fds` on the inferior's top target,
   instead of on the inferior's process target.  This makes it so that the
   returned vector of FDs contains amd-dbgapi's FDs, and `wait_one` can poll
   them.

Add a new test that does:

 - Set up an inferior connected to the native target, but not running
   anything.
 - Set up an inferior debugging a process using the GPU, stopped
   somewhere.
 - Try to run an inferior starting two waves hitting a breakpoint in GPU
   code.  After the first breakpoint hit report, resume execution to see
   the second breakpoint hit.

The test is a bit more complex than the reproducer described above; for
instance, we don't really need 3 inferiors to reproduce the problem.  But it
matches more closely the initial bug report, and it ends up testing more
scenarios, so I think it's worth it.  Since the outcome depends on the
order of the inferiors, the test tries all permutations of which
inferior gets which role.

Change-Id: I5f84904e695278e1d4bd085873ec058e4a639faf
Bug: SWDEV-440024
…n amd_dbgapi_target::wait

I saw this while working on a previous patch related to SWDEV-440024.
It isn't necessary for the fix, but the current code looks wrong to me.

The current inferior is not relevant to determine which event
`target_ops::wait` should return.  When called with ptid ==
minus_one_ptid, the target may return an event for any inferior it
controls.

When there isn't an event ready to return (`consume_one_event` returns
an event with ptid == minus_one_ptid), `amd_dbgapi_target::wait` calls
`process_event_queue` to pull events from amd-dbgapi, passing the
process id associated with the current inferior.  What if that process
doesn't have anything to report, but another process does?  We want the
amd-dbgapi target to return the event for that other process, if
`amd_dbgapi_target::wait` was called with `ptid == minus_one_ptid`.

My first idea was to pass `AMD_DBGAPI_PROCESS_NONE` to
`process_event_queue`.  This is wrong, because it risks pulling a
"runtime enabled" event, and `process_one_event` doesn't expect to
receive that (these events are handled through another path).

Instead, call `process_event_queue` for all processes matching `ptid`
that have the runtime enabled.

I noticed that `process_event_queue` is always called with a specific
process id, never `AMD_DBGAPI_PROCESS_NONE`, so simplify the assertion
there a little bit.

Change-Id: I90a5ef819f4ec3d4597972a9eac6f61ddb219c26
…reads

I saw this while working on a previous patch related to SWDEV-440024.
It isn't necessary for the fix of that bug, but it looked like a
potential issue.

Let's say we have inferior 1 with the linux-nat target, and inferior 2
with the amd-dbgapi and linux-nat targets.  The loop using
`all_non_exited_process_targets` + `switch_to_target_no_thread` will
only call `update_thread_list` on inferior 1, and therefore amd-dbgapi's
`update_thread_list` method will never get called.

Iterate over all inferiors and call `update_thread_list` on each, to
ensure that we end up calling it on all targets.

Change-Id: Ib8d816fd42a72ce47c47d508808c6a0868cc395d
@simark simark requested a review from a team as a code owner April 29, 2026 15:35
@simark simark marked this pull request as draft April 29, 2026 17:27