Skip to content

[watch/rebar 2/3] Ensure WAIT_MEM waits for trap#81

Open
palves wants to merge 1 commit intospr/amd-staging/c99ccb02from
spr/amd-staging/6dc0d30a
Open

[watch/rebar 2/3] Ensure WAIT_MEM waits for trap#81
palves wants to merge 1 commit intospr/amd-staging/c99ccb02from
spr/amd-staging/6dc0d30a

Conversation

@palves
Copy link
Copy Markdown
Collaborator

@palves palves commented Apr 15, 2026

The next patch adds a testcase that uses WAIT_MEM to wait for a
watchpoint hit with precise memory off, like so:

global void
kern (int *ptr)
{
*ptr = 1; // trigger watchpoint
WAIT_MEM;
}

On gfx1030 on Linux, I observe that the watchpoint would not trigger.
If I add several more WAIT_MEM calls, I see the watchpoint trigger a
few instructions further down.

The problem is that on GFX10, WAIT_MEM expands to a single:

s_waitcnt_vscnt null, 0

which only waits for vector stores to retire. In practice this is not
sufficient to ensure that a memory watchpoint is delivered before
execution continues. For example, in:

flat_store_dword v[0:1], v2
s_waitcnt_vscnt null, 0

ROCgdb still observes the watchpoint trap several instructions later:

s_waitcnt_vscnt null, 0
s_waitcnt_vscnt null, 0
s_waitcnt_vscnt null, 0 <== trap taken here

My understanding is that even when s_waitcnt_vscnt completes, it only
means the memory controller has acknowledged that the store is "done".
However, the signal that a watchpoint was hit has to travel from the
memory hierarchy back to the Wavefront Controller to trigger the
exception.

By the time the wavefront controller receives the "stop!" signal, the
sequencer has already fetched and potentially executed the next few
instructions.

Running the same testcase on Windows with a gfx1201, I observe the
exact same thing, we need a few more instructions for the trap to be
reported.

Fix this by adding a short sequence of s_nop instructions to delay
forward progress, giving the hardware time to deliver the watchpoint
trap before exiting WAIT_MEM.

While at it, on gfx10, also wait for vmcnt(0) and lgkmcnt(0), ensuring
that reads and scalar writes are waited for too, so that the gfx10
version has the same semantics as the other architecture
implementations. Otherwise, if WAIT_MEM is meant for being useful
with writes only, then I think it'd be better to rename
e.g. WAIT_MEM_STORE or something like that.

Change-Id: Ia119ceba6d9f1e2c8ec1eaf619cb066286ea6dd3


Stack:

⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

@palves palves requested a review from a team as a code owner April 15, 2026 13:20
@palves palves changed the title Ensure WAIT_MEM waits for trap [watch/rebar 2/3] Ensure WAIT_MEM waits for trap Apr 15, 2026
The next patch adds a testcase that uses WAIT_MEM to wait for a
watchpoint hit with precise memory off, like so:

   __global__ void
  kern (int *ptr)
  {
    *ptr = 1;  // trigger watchpoint
    WAIT_MEM;
  }

On gfx1030 on Linux, I observe that the watchpoint would not trigger.
If I add several more WAIT_MEM calls, I see the watchpoint trigger a
few instructions further down.

The problem is that on GFX10, WAIT_MEM expands to a single:

  s_waitcnt_vscnt null, 0

which only waits for vector stores to retire.  In practice this is not
sufficient to ensure that a memory watchpoint is delivered before
execution continues. For example, in:

  flat_store_dword v[0:1], v2
  s_waitcnt_vscnt null, 0

ROCgdb still observes the watchpoint trap several instructions later:

  s_waitcnt_vscnt null, 0
  s_waitcnt_vscnt null, 0
  s_waitcnt_vscnt null, 0   <== trap taken here

My understanding is that even when s_waitcnt_vscnt completes, it only
means the memory controller has acknowledged that the store is "done".
However, the signal that a watchpoint was hit has to travel from the
memory hierarchy back to the Wavefront Controller to trigger the
exception.

By the time the wavefront controller receives the "stop!" signal, the
sequencer has already fetched and potentially executed the next few
instructions.

Running the same testcase on Windows with a gfx1201, I observe the
exact same thing, we need a few more instructions for the trap to be
reported.

Fix this by adding a short sequence of s_nop instructions to delay
forward progress, giving the hardware time to deliver the watchpoint
trap before exiting WAIT_MEM.

While at it, on gfx10, also wait for vmcnt(0) and lgkmcnt(0), ensuring
that reads and scalar writes are waited for too, so that the gfx10
version has the same semantics as the other architecture
implementations.  Otherwise, if WAIT_MEM is meant for being useful
with writes only, then I think it'd be better to rename
e.g. WAIT_MEM_STORE or something like that.

Change-Id: Ia119ceba6d9f1e2c8ec1eaf619cb066286ea6dd3
commit-id:6dc0d30a
@palves palves force-pushed the spr/amd-staging/c99ccb02 branch from 7ddf723 to 241989a Compare April 16, 2026 18:22
@palves palves force-pushed the spr/amd-staging/6dc0d30a branch from b5d2bee to 15d8d9d Compare April 16, 2026 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant