[watch/rebar 3/3] New gdb.rocm/watch-managed-device-host.exp test #82

Open

palves wants to merge 1 commit into spr/amd-staging/6dc0d30a from spr/amd-staging/d518a1da

Conversation

palves (Collaborator) commented on Apr 15, 2026

This adds a new testcase that exercises triggering watchpoints on
device and host writes to device/vram and host/system memory, using
hipMallocManaged. I.e.:

  • device write of vram
  • device write of system memory
  • host write of vram
  • host write of system memory

It tests all the scenarios above in combination with setting the
watchpoint from either device or host context.
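
To make the four scenarios concrete, here is a minimal sketch of the kind of program the test debugs (the structure, names, and CHECK macro are illustrative assumptions, not the actual testcase source):

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(cmd) \
  do { hipError_t err_ = (cmd); \
       if (err_ != hipSuccess) \
         { fprintf (stderr, "error: %s\n", hipGetErrorString (err_)); \
           abort (); } } while (0)

/* Device write of the managed allocation.  A watchpoint on *ptr, set
   from either host or device context, should trigger here.  */
__global__ void kernel_write (int *ptr)
{
  *ptr = 1;
}

int main ()
{
  int *managed_ptr;
  CHECK (hipMallocManaged ((void **) &managed_ptr, sizeof (int)));

  /* Device write (of vram or of system memory, depending on where the
     page is currently resident).  */
  kernel_write<<<1, 1>>> (managed_ptr);
  CHECK (hipDeviceSynchronize ());

  /* Host write; the same watchpoint should trigger here too.  */
  *managed_ptr = 2;

  CHECK (hipFree (managed_ptr));
  return 0;
}

For each combination, the .exp file would then presumably stop either in main (host context) or in kernel_write (device context), do "watch *managed_ptr", continue, and expect the watchpoint to report the device or host write.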

Unlike gdb.rocm/watch-gpu-global-from-host.exp, this one uses managed
memory, so it also works on non-ReBAR systems.

Tested on Linux gfx942, Linux gfx1030 with ReBAR off, and on Windows
gfx1201.

Change-Id: I4d167e42b0a712583d842045274d6db64ba7a7a0


Stack:

⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

palves requested a review from a team as a code owner on April 15, 2026 13:20
palves changed the title from "New gdb.rocm/watch-managed-device-host.exp test" to "[3/3] New gdb.rocm/watch-managed-device-host.exp test" on Apr 15, 2026
palves changed the title from "[3/3] New gdb.rocm/watch-managed-device-host.exp test" to "[watch/rebar 3/3] New gdb.rocm/watch-managed-device-host.exp test" on Apr 15, 2026
CHECK (hipDeviceSynchronize ());

/* Re-establish residency in case debugger or kernel access caused
   migration.  */
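
(For context: re-establishing residency of a hipMallocManaged buffer is typically done with a prefetch. Whether the test uses exactly this call, and the device_id variable, are assumptions on my part; just a sketch:)

CHECK (hipMemPrefetchAsync (managed_ptr, sizeof (int), device_id, 0));
CHECK (hipDeviceSynchronize ());

(Prefetching to the host instead would use hipCpuDeviceId as the device argument.)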
A reviewer (Collaborator) commented on the residency comment above:

Is this something you experienced?

The only migration I know of that could be caused by the debugger is CWSR related. In such a case, for non-ReBAR systems, wouldn't this mean that running a program under the debugger, rather than free-running it, could cause it to malfunction?

palves (Collaborator, Author) replied:

Alright, long answer, as it took some investigation to get there.

Is this something you experienced?

I only suspected it, but I've confirmed it now, both on my non-rebar system, and on gfx942 (discrete gpu). There are migrations, both triggered by the code itself, and by the debugger.

E.g., if I step over this line of host code, when the memory is resident in the GPU:

/* Trigger watchpoint from the host.  */
  *managed_ptr = 2;

while I record amdgpu driver migration-related events with:

sudo trace-cmd record -e amdgpu:amdgpu_bo_move -e amdgpu:amdgpu_vm_update_ptes -e amdgpu:amdgpu_vm_set_ptes

I see:

kworker/31:0-3847354 [031] 97467.661015: amdgpu_bo_move:       bo=0xffff98e541431c00, from=2, to=-1, size=4096
kworker/31:0-3847354 [031] 97467.661802: amdgpu_vm_update_ptes: pid:3938119 vm_ctx:0x9b1 start:0x07ffff7fbc end:0x07ffff7fbd, flags:0x3000000000067, incr:4096, dst: 3958366208
kworker/31:0-3847354 [031] 97467.661803: amdgpu_vm_set_ptes:   pe=83fecc3de0, addr=00ebefe000, incr=4096, flags=3000000000067, count=1, immediate=0

amdgpu_bo_move tells us that physical memory is being relocated.

amdgpu_vm_set_ptes / amdgpu_vm_update_ptes after a move indicate that the page table entries are being updated to tell the GPU that the pointer no longer lives in VRAM; it is now in system RAM.

The pointer's address is:

(gdb) p managed_ptr
$1 = (int *) 0x7ffff7fbc000

In the trace, the start address is 0x07ffff7fbc. That is the page address (4096-byte pages): the pointer value with the last three hex digits (the page offset) stripped off.
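
(Easy to double-check from within GDB; the value number below is just illustrative:)

(gdb) p/x 0x7ffff7fbc000 >> 12
$2 = 0x7ffff7fbc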

In the amdgpu_bo_move line, from=2 is VRAM, and to=-1 is system RAM. Size=4096 shows the driver moved exactly one 4KB page.

The lines following the move are the driver updating the GPU's PTEs. In:

amdgpu_vm_update_ptes: start:0x07ffff7fbc ... flags:0x3000000000067

... the 0x3 indicates that the page is now mapped in system memory and the GPU must access it via the PCIe bus (IOMMU).

Note: the "bo=0xffff98e541431c00" address in the amdgpu_bo_move line confused me at first. "bo" stands for "Buffer Object", and is the address of an internal structure that the driver uses.

So when the code executed *managed_ptr = 2 on the host, the CPU tried to write to a page that was currently "resident" in the GPU's VRAM. On my non-ReBAR system, the CPU can't write directly to VRAM in this managed context, so the kernel intercepted the fault, moved that specific 4KB page from the GPU to the host, and then let the CPU finish the write.

If I instead run to that same line, before the CPU writes to the pointer, and issue from GDB:

(gdb) print *managed_ptr

... while collecting trace-cmd logs, then I see the same move in the logs. I.e., just the read from GDB triggers a migration from the GPU to the host.

I believe this happens because we read GPU memory using /proc/pid/mem. I assume that goes via the CPU page table, and it triggers the same page faults and migration as a CPU access in the inferior does.
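
To illustrate, a minimal standalone sketch of what such a read boils down to (pid and address taken from the logs above; GDB's actual code path is of course more involved, and this needs ptrace-attach permissions):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main ()
{
  pid_t pid = 3938119;             /* the inferior's pid  */
  uint64_t addr = 0x7ffff7fbc000;  /* managed_ptr in the inferior  */

  char path[64];
  snprintf (path, sizeof (path), "/proc/%d/mem", (int) pid);
  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return 1;

  /* The idea (per the above): this pread goes through the inferior's
     CPU page tables, so if the page is currently resident in VRAM on a
     non-ReBAR system, it faults, the kernel migrates the page to
     system RAM, and only then does the read complete -- the same as a
     CPU access done by the inferior itself.  */
  int value;
  if (pread (fd, &value, sizeof (value), (off_t) addr)
      == (ssize_t) sizeof (value))
    printf ("*managed_ptr = %d\n", value);

  close (fd);
  return 0;
}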

GDB reads the watched expression's current value after the watchpoint trigger, which is what made me think that the debugger would cause migration, even without explicit "print *managed_ptr" or some such. I think I've now confirmed it.

If I test the other way around, i.e. run the testcase program with the "host" argument, so that the memory is resident on the host, and then step over the:

*ptr = 1;

... line in the kernel while collecting logs, then I see NO migration.

I guess this means that the GPU is always able to access CPU pages across the PCIe bus (unlike the CPU accessing the GPU, which needs to go through a BAR window), so that's what happens here. The kernel is not involved, so I see nothing in the trace-cmd logs. But I don't think this is guaranteed; the driver could well set things up to trigger a page fault and force a migration in some circumstances, and it may depend on hardware, driver version, etc.

In such a case, for non-ReBAR systems, wouldn't this mean that running a program under the debugger, rather than free-running it, could cause it to malfunction?

I don't think so. The migration is transparent, handled at a very low level, and userspace does not see any difference other than performance, due to pages migrating when they wouldn't without the debugger. I think it's similar in a way to the kernel deciding to swap some memory out to disk, and you then peeking at it with the debugger, causing it to be swapped back in. You just don't see any of that.

The reviewer (Collaborator) replied:
Thanks for the investigation. I'll get back to that review probably tomorrow.

palves force-pushed the spr/amd-staging/d518a1da branch from e8ca81f to ef6e485 on April 16, 2026 18:22
palves force-pushed the spr/amd-staging/6dc0d30a branch from b5d2bee to 15d8d9d on April 16, 2026 18:22