[Build] Fix flaky test_clock_accuracy by using LCG workload with in-loop timing#436
[Build] Fix flaky test_clock_accuracy by using LCG workload with in-loop timing#436hughperkins wants to merge 1 commit intomainfrom
Conversation
The test was flaky (~10% failure rate) due to two issues: - block_dim=1 caused each thread to run on a different SM with slightly different clock rates, breaking the proportionality check at high i - Cold instruction cache on first kernel execution inflated a[0] Replace the sin-based workload with an LCG, use parallel execution within a single warp, capture timing inside the loop via a data-dependent store to prevent compiler hoisting, and warm up the instruction cache by delaying the start timestamp to j=10.
|
Technically, Opus physically wrote the code, but I had to painfully spoon feed it pretty much every line in practice. (it wanted to serialize the loop, and I think it's more elegant to run it on a warp, considering it's a 32-size loop in the first place). Anyway... I have read every line added in this PR, and reviewed the lines. I take responsibilty for the lines added and removed in this PR, and won't blame any issues on Opus. |
| x = (1664527 * x + 1013904223) % 2147483647 | ||
| if j == 10: | ||
| start = qd.clock_counter() | ||
| if x > 10: |
There was a problem hiding this comment.
needs a comment explaining why we are doing this x comparison
| if j == 10: | ||
| start = qd.clock_counter() | ||
| if x > 10: | ||
| a[i] = qd.clock_counter() - start |
There was a problem hiding this comment.
needs a comment explaining why we write inside the loop, not after it.
| start = qd.i64(0) | ||
| for j in range((i + 1) * 50000): | ||
| x = (1664527 * x + 1013904223) % 2147483647 | ||
| if j == 10: |
There was a problem hiding this comment.
nees a comment explaining why we dont sart the loop till j=1, and why we do it inside the loop
| x = state[i] | ||
| start = qd.i64(0) | ||
| for j in range((i + 1) * 50000): | ||
| x = (1664527 * x + 1013904223) % 2147483647 |
There was a problem hiding this comment.
needs a comment saying that this is an LCG, and why we use it
| start = qd.clock_counter() | ||
| if x > 10: | ||
| a[i] = qd.clock_counter() - start | ||
| state[i] = x |
There was a problem hiding this comment.
needs a commment saying why we write the result back to state
| a[i] = qd.clock_counter() - start | ||
|
|
||
| foo() | ||
| x = state[i] |
There was a problem hiding this comment.
needs a comment saying why we read x from state initially
| def foo(): | ||
| qd.loop_config(block_dim=1) | ||
| def measure_sequence_timings(): | ||
| for i in range(32): |
There was a problem hiding this comment.
needs a comment saying the significant of shape 32
The test was flaky (~10% failure rate) due to two issues:
Replace the sin-based workload with an LCG, use parallel execution within a single warp, capture timing inside the loop via a data-dependent store to prevent compiler hoisting, and warm up the instruction cache by delaying the start timestamp to j=10.
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough