[watch/rebar 3/3] New gdb.rocm/watch-managed-device-host.exp test #82

Open

palves wants to merge 1 commit into spr/amd-staging/6dc0d30a from spr/amd-staging/d518a1da

Conversation

palves (Collaborator) commented on Apr 15, 2026

This adds a new testcase that exercises triggering watchpoints on
device and host writes to device/vram and host/system memory, using
hipMallocManaged. I.e.:

  • device write of vram
  • device write of system memory
  • host write of vram
  • host write of system memory

It tests all the scenarios above in combination with setting the
watchpoint from either device or host context.
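
To make the four scenarios concrete, here is a minimal sketch of the kind of program the test debugs (the structure, names, and CHECK macro are illustrative assumptions, not the actual testcase source):

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(cmd) \
  do { hipError_t err_ = (cmd); \
       if (err_ != hipSuccess) \
         { fprintf (stderr, "error: %s\n", hipGetErrorString (err_)); \
           abort (); } } while (0)

/* Device write of the managed allocation.  A watchpoint on *ptr, set
   from either host or device context, should trigger here.  */
__global__ void kernel_write (int *ptr)
{
  *ptr = 1;
}

int main ()
{
  int *managed_ptr;
  CHECK (hipMallocManaged ((void **) &managed_ptr, sizeof (int)));

  /* Device write (of vram or of system memory, depending on where the
     page is currently resident).  */
  kernel_write<<<1, 1>>> (managed_ptr);
  CHECK (hipDeviceSynchronize ());

  /* Host write; the same watchpoint should trigger here too.  */
  *managed_ptr = 2;

  CHECK (hipFree (managed_ptr));
  return 0;
}

For each combination, the .exp file would then presumably stop either in main (host context) or in kernel_write (device context), do "watch *managed_ptr", continue, and expect the watchpoint to report the device or host write.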

Unlike gdb.rocm/watch-gpu-global-from-host.exp, this one uses managed
memory, so it also works on non-ReBAR systems.

Tested on Linux gfx942, Linux gfx1030 with ReBAR off, and on Windows
gfx1201.

Change-Id: I4d167e42b0a712583d842045274d6db64ba7a7a0


Stack:

⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

palves requested a review from a team as a code owner on April 15, 2026 13:20
palves changed the title from "New gdb.rocm/watch-managed-device-host.exp test" to "[3/3] New gdb.rocm/watch-managed-device-host.exp test" on Apr 15, 2026
palves changed the title from "[3/3] New gdb.rocm/watch-managed-device-host.exp test" to "[watch/rebar 3/3] New gdb.rocm/watch-managed-device-host.exp test" on Apr 15, 2026
CHECK (hipDeviceSynchronize ());

/* Re-establish residency in case debugger or kernel access caused
   migration.  */
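
(For context: re-establishing residency of a hipMallocManaged buffer is typically done with a prefetch. Whether the test uses exactly this call, and the device_id variable, are assumptions on my part; just a sketch:)

CHECK (hipMemPrefetchAsync (managed_ptr, sizeof (int), device_id, 0));
CHECK (hipDeviceSynchronize ());

(Prefetching to the host instead would use hipCpuDeviceId as the device argument.)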
A reviewer (Collaborator) commented on the residency comment above:

Is this something you experienced?

The only migration I know of that could be caused by the debugger is CWSR related. In such a case, for non-ReBAR systems, wouldn't this mean that running a program under the debugger, rather than free-running it, could cause it to malfunction?

palves (Collaborator, Author) replied:

Alright, long answer, as it took some investigation to get there.

Is this something you experienced?

I only suspected it, but I've confirmed it now, both on my non-rebar system, and on gfx942 (discrete gpu). There are migrations, both triggered by the code itself, and by the debugger.

E.g., if I step over this line of host code, when the memory is resident in the GPU:

/* Trigger watchpoint from the host.  */
  *managed_ptr = 2;

while I record amdgpu driver migration-related events with:

sudo trace-cmd record -e amdgpu:amdgpu_bo_move -e amdgpu:amdgpu_vm_update_ptes -e amdgpu:amdgpu_vm_set_ptes

I see:

kworker/31:0-3847354 [031] 97467.661015: amdgpu_bo_move:       bo=0xffff98e541431c00, from=2, to=-1, size=4096
kworker/31:0-3847354 [031] 97467.661802: amdgpu_vm_update_ptes: pid:3938119 vm_ctx:0x9b1 start:0x07ffff7fbc end:0x07ffff7fbd, flags:0x3000000000067, incr:4096, dst: 3958366208
kworker/31:0-3847354 [031] 97467.661803: amdgpu_vm_set_ptes:   pe=83fecc3de0, addr=00ebefe000, incr=4096, flags=3000000000067, count=1, immediate=0

amdgpu_bo_move tells us that physical memory is being relocated.

amdgpu_vm_set_ptes / amdgpu_vm_update_ptes after a move indicate that the page table entries are being updated to tell the GPU that the pointer no longer lives in VRAM; it is now in system RAM.

The pointer's address is:

(gdb) p managed_ptr
$1 = (int *) 0x7ffff7fbc000

In the trace, the start address is 0x07ffff7fbc. That is the page address (4096-byte pages): the pointer value with the last three hex digits (the page offset) stripped off.
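
(Easy to double-check from within GDB; the value number below is just illustrative:)

(gdb) p/x 0x7ffff7fbc000 >> 12
$2 = 0x7ffff7fbc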

In the amdgpu_bo_move line, from=2 is VRAM, and to=-1 is system RAM. Size=4096 shows the driver moved exactly one 4KB page.

The lines following the move are the driver updating the GPU's PTEs. In:

amdgpu_vm_update_ptes: start:0x07ffff7fbc ... flags:0x3000000000067

... the 0x3 indicates that the page is now mapped in system memory and the GPU must access it via the PCIe bus (IOMMU).

Note: the "bo=0xffff98e541431c00" address in the amdgpu_bo_move line confused me at first. "bo" stands for "Buffer Object", and is the address of an internal structure that the driver uses.

So when the code executed *managed_ptr = 2 on the host, the CPU tried to write to a page that was currently "resident" in the GPU's VRAM. On my non-ReBAR system, the CPU can't write directly to VRAM in this managed context, so the kernel intercepted the fault, moved that specific 4KB page from the GPU to the host, and then let the CPU finish the write.

If I instead run to that same line, before the CPU writes to the pointer, and issue from GDB:

(gdb) print *managed_ptr

... while collecting trace-cmd logs, then I see the same move in the logs. I.e., just the read from GDB triggers a migration from the GPU to the host.

I believe this happens because we read GPU memory using /proc/pid/mem. I assume that goes via the CPU page table, and it triggers the same page faults and migration as a CPU access in the inferior does.
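
To illustrate, a minimal standalone sketch of what such a read boils down to (pid and address taken from the logs above; GDB's actual code path is of course more involved, and this needs ptrace-attach permissions):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main ()
{
  pid_t pid = 3938119;             /* the inferior's pid  */
  uint64_t addr = 0x7ffff7fbc000;  /* managed_ptr in the inferior  */

  char path[64];
  snprintf (path, sizeof (path), "/proc/%d/mem", (int) pid);
  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return 1;

  /* The idea (per the above): this pread goes through the inferior's
     CPU page tables, so if the page is currently resident in VRAM on a
     non-ReBAR system, it faults, the kernel migrates the page to
     system RAM, and only then does the read complete -- the same as a
     CPU access done by the inferior itself.  */
  int value;
  if (pread (fd, &value, sizeof (value), (off_t) addr)
      == (ssize_t) sizeof (value))
    printf ("*managed_ptr = %d\n", value);

  close (fd);
  return 0;
}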

GDB reads the watched expression's current value after the watchpoint trigger, which is what made me think that the debugger would cause migration, even without explicit "print *managed_ptr" or some such. I think I've now confirmed it.

If I test the other way around, i.e. run the testcase program with the "host" argument, so that the memory is resident on the host, and then step over the:

*ptr = 1;

... line in the kernel while collecting logs, then I see NO migration.

I guess this means that the GPU is always able to access CPU pages across the PCIe bus (unlike the CPU accessing the GPU, which needs to go through a BAR window), so that's what happens here. The kernel is not involved, so I see nothing in the trace-cmd logs. But I don't think this is guaranteed; the driver could well set things up to trigger a page fault and force a migration in some circumstances, and it may depend on hardware, driver version, etc.

In such a case, for non-ReBAR systems, wouldn't this mean that running a program under the debugger, rather than free-running it, could cause it to malfunction?

I don't think so. The migration is transparent, handled at a very low level, and userspace does not see any difference other than performance, due to pages migrating when they wouldn't without the debugger. I think it's similar in a way to the kernel deciding to swap some memory out to disk, and you then peeking at it with the debugger, causing it to be swapped back in. You just don't see any of that.

The reviewer (Collaborator) replied:
Thanks for the investigation. I'll get back to that review probably tomorrow.

palves force-pushed the spr/amd-staging/d518a1da branch from e8ca81f to ef6e485 on April 16, 2026 18:22
palves force-pushed the spr/amd-staging/6dc0d30a branch from b5d2bee to 15d8d9d on April 16, 2026 18:22