Skip to content

New test gdb.rocm/write-all-registers.exp#91

Open
palves wants to merge 1 commit intoamd-stagingfrom
spr/amd-staging/23de25e0
Open

New test gdb.rocm/write-all-registers.exp#91
palves wants to merge 1 commit intoamd-stagingfrom
spr/amd-staging/23de25e0

Conversation

@palves
Copy link
Copy Markdown
Collaborator

@palves palves commented Apr 17, 2026

This adds a test that writes to all GPU registers, confirming that:

  • For registers with non-read-only bits, written values are really
    written and can be read back.
  • Writing to read-only registers is ignored.

Tested on Linux, on gfx90a, gfx942, gfx1030, and gfx1201.
Tested on Windows, on gfx1201.

Change-Id: I76599cb02bd0cdbe504398763d48a8112ad21bab

@palves palves requested a review from a team as a code owner April 17, 2026 20:48
@palves palves changed the title New test gdb.rocm/write-all-registers.cpp New test gdb.rocm/write-all-registers.exp Apr 17, 2026
@palves palves force-pushed the spr/amd-staging/23de25e0 branch 2 times, most recently from f6a5379 to 1a456cc Compare April 17, 2026 23:14
Copy link
Copy Markdown
Collaborator

@lumachad lumachad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Copy Markdown
Collaborator

@lancesix lancesix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, let me know what you think.

set reg_size [register_size $reg]

# Set register to all ones. Read-only bits will stay
# at zero.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit.

We have registers with read-only bits which do not have to be 0s. We can have 1s as well which are read-only.

set reg_size [register_size $reg]

# Set register to all ones. Read-only bits will stay
# at zero.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting, the read-only bits in "special" registers do not have to be 0s. Those can be ones as well.

For example, gfx12 has "STATUS.W64" to tell if a wave is in wave32 or wave64 mode. When using wave64, this bit is 1 and readonly.

That being said, I don't think we currently expose any register where this could be an issue, but we could in the future add some completely read-only (no bit can be written) registers, in which case we'll need to adjust this testcase accordingly.


edit: scratch_base / flat_scratch might be good candidates of registers to become "fully read-only".

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting, the read-only bits in "special" registers do not have to be 0s. Those can be ones as well.

For example, gfx12 has "STATUS.W64" to tell if a wave is in wave32 or wave64 mode. When using wave64, this bit is 1 and readonly.

Did we just find a bug then? With this:

--- c/gdb/testsuite/gdb.rocm/write-all-registers.cpp
+++ w/gdb/testsuite/gdb.rocm/write-all-registers.cpp
@@ -26,7 +26,7 @@ kern ()
 int
 main ()
 {
-  kern<<<1, 1>>> ();
+  kern<<<1, 64>>> ();
   CHECK (hipDeviceSynchronize ());
   return 0;
 }
diff --git c/gdb/testsuite/gdb.rocm/write-all-registers.exp w/gdb/testsuite/gdb.rocm/write-all-registers.exp
index 9e2fe7bb00c..aa4167615b9 100644
--- c/gdb/testsuite/gdb.rocm/write-all-registers.exp
+++ w/gdb/testsuite/gdb.rocm/write-all-registers.exp
@@ -22,7 +22,10 @@ require allow_hipcc_tests
 
 standard_testfile .cpp
 
-if {[build_executable "failed to prepare" $testfile $srcfile {debug hip}]} {
+set opts {debug hip}
+lappend opts additional_flags=-mwavefrontsize64
+
+if {[build_executable "failed to prepare" $testfile $srcfile $opts]} {
     return
 }

I see 64 lanes in one wave, and the WAVE64 bit is set, but I can clear it from gdb and it sticks:

(gdb) p $status
$1 = [ TRAP_EN VCCZ VALID WAVE64 ]
(gdb) p $status = 0
$2 = [ ]
(gdb) p $status
$3 = [ ]

So in wave 64 mode the testcase as is still passes (except for the buggy state_priv).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, yes.

The definite test should be to do:

(gdb) p $status = 0
...
(gdb) si
....
(gdb) p $status

But re-looking at the doc (sect 3.4.1), the $status register is fully read-only. We should not be able to update it directly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I issue a "si" after clearing the bit, the step succeeds normally, I still have 64 lanes, and the wave64 is back set to 1 again.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did anything change in the $status value? I expect it should be the same value (unless the single step changed something which impact status, like changing $vcc could change $status.vccz).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD_DBGAPI_REGISTER_PROPERTY_READONLY_BITS is set for that register.

However, in dbgapi, each architecture returns the read-only bits for each register with the register_read_only_mask virtual method. For the status register though, that is currently returning null. Sounds like that should return an actual mask for this register.

I'm a bit puzzled looking for where does the register end up with AMD_DBGAPI_REGISTER_PROPERTY_READONLY_BITS set. I don't see a breakpoint hit in gfx12_architecture_t::register_properties for that register. I'll look a little deeper.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the READONLY_BITS property for registers with non-nulltpr read_only_mask is done here:

amd_dbgapi_register_properties_t
amdgcn_architecture_t::register_properties (amdgpu_regnum_t regnum) const
{
  amd_dbgapi_register_properties_t properties
    = register_read_only_mask (regnum) != nullptr
        ? AMD_DBGAPI_REGISTER_PROPERTY_READONLY_BITS
        : AMD_DBGAPI_REGISTER_PROPERTY_NONE;

  ...
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was confusing me earlier is that there's pseudo_status and status, two different register numbers.
I have something that works now, but I'm now wondering whether pseudo_state_priv should be writable too.
From the same docs, state_priv can only be written from the trap handler (priv=1), so how come you wanted to be able to write to it from rocgdb, leading to that dbgapi patch? Can one use rocgdb to debug the trap handler?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the same docs, state_priv can only be written from the trap handler (priv=1), so how come you wanted to be able to write to it from rocgdb, leading to that dbgapi patch? Can one use rocgdb to debug the trap handler?

I've convinced myself that GDB+dbgapi should treat state_priv as read-only too.

See dbgapi patch at ROCm/rocm-systems#5211, and the comments/questions in patch 2/2 of that series. Your dbgapi patch is still useful if we have internal dbgapi writes to state_priv, though I didn't find any.

I've uploaded a new version of the testcase that assumes that dbgapi patch is applied. Since I'm ignoring writes to state_priv, the testcase no longer covers the bug you fixed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I agree for state_priv There is the HALT bit in this register that we might want to be able to control.

Shader code can do something s_sethalt 1 which would result in the wave being halted, and then being able to write to the state_priv.halt bit from the debugger would be the only way to resume execution.

I'll go and check the patch you have in dbgapi.

save_restore_register $reg {
gdb_test "p/x \$$reg = $all_ones_val" " = $all_ones_val"
# Read back the value, which should have writable bits
# set to one, read-only bits set to zero.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, this comment is probably inaccurate.

To really get the set of read-only registers, you'd need to

  • write all 1s, and collect a mask of bits staying at 0.
  • write all 0s and collect a mask of bits staying at 1.
  • OR those two masks.

This should give a mask of read-only bits, and only a very restricted set of regs (which we do not expose today) can have that mask equal to $all_ones_val.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now improved along those lines. Thanks, I had only originally considered the reserved (always zero) bits.

# Registers that have bits that are always zero.
proc is_special {reg} {
# The single list contains all the special registers from all
# supported architectures, sorted in alphabetical order.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good dynamic approach to discover such registers is probably to look at info registers system.

For example, on gfx1031, I have:

(gdb) info registers system
mode           0x3f0               [ FP_ROUND.32=NEAREST_EVEN FP_ROUND.64_16=NEAREST_EVEN FP_DENORM.32=FLUSH_NONE FP_DENORM.64_16=FLUSH_NONE DX10_CLAMP IEEE ]
trapsts        0x40000000          [ EXCP_CYCLE=0 EXCP_GROUP_MASK=0 DP_RATE=2 ]
flat_scratch   0x7ffee4a00000      140732734111744
status         0x98041             [ SCC SPI_PRIO=0 USER_PRIO=0 TRAP_EN TTRACE_SIMD_EN VALID PERF_EN ]

Note that we see here flat_scratch (known as `scratch_base for gfx12) which I think should probably be fully read-only. Where the scratch for a wave is is not something we should edit, it is a property of a wave we should not be able to touch.

pc is probably another register with read-only bits which is not listed here, but it is handled in a special way anyway.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now tweaked the dbgapi patch to consider flat_scratch / scratch_base read-only too, and updated the gdb patch. Juggling too many machines at once, missed that one... I haven't yet tested on gfx11.

@lancesix
Copy link
Copy Markdown
Collaborator

Also, I am surprised that the CI passes. There is a known issue for state_priv in gfx12xx (ROCm/rocm-systems#5149) which should have made the tests fail.

@lumachad
Copy link
Copy Markdown
Collaborator

lumachad commented Apr 20, 2026

Also, I am surprised that the CI passes. There is a known issue for state_priv in gfx12xx (ROCm/rocm-systems#5149) which should have made the tests fail.

We don't run rocgdb tests on gfx12xx at the moment. So that would need to be exercised manually.

@lumachad
Copy link
Copy Markdown
Collaborator

Just a FYI, I observed the number of tests differing between a run with O0 and another with O3. About half as many tests passing on O3 compared to O0. Not sure if that's by design or not.

@lancesix
Copy link
Copy Markdown
Collaborator

Just a FYI, I observed the number of tests differing between a run with O0 and another with O3. About half as many tests passing on O3 compared to O0. Not sure if that's by design or not.

We had work internally to enable the testsuite at 01/02/03, some of that implied disabling some tests. Not sure how much of that work landed (but we should eventually resurrect that).

@palves
Copy link
Copy Markdown
Collaborator Author

palves commented Apr 20, 2026

Also, I am surprised that the CI passes. There is a known issue for state_priv in gfx12xx (ROCm/rocm-systems#5149) which should have made the tests fail.

Just for the record, the test does fail for state_priv on gfx12 currently (both linux and windows the same way), that's the bug that triggered writing the testcase in the first place.

@palves palves force-pushed the spr/amd-staging/23de25e0 branch from 1a456cc to 9b0e45f Compare April 20, 2026 20:12
This adds a test that writes to all GPU registers, confirming that:

- For registers with non-read-only bits, written values are really
  written and can be read back.
- Writing to read-only registers is ignored.

Tested on Linux, on gfx90a, gfx942, gfx1030, and gfx1201.
Tested on Windows, on gfx1201.

Change-Id: I76599cb02bd0cdbe504398763d48a8112ad21bab
commit-id:23de25e0
@palves palves force-pushed the spr/amd-staging/23de25e0 branch from 9b0e45f to 7c4988a Compare April 20, 2026 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants