Remove bank conflicts in transpose scheduler #5909
base: main
9707e93 to a59914a
!test
Review updated until commit d6e28f5
Description
| Relevant files |
|---|
| Tests |
| Documentation |
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 PR contains tests |
| 🔒 No security concerns identified |
| ⚡ Recommended focus areas for review: Test Scope |
Test failures
- (Medium, 3) Shape mismatch in thunderfx higher-order inplace alias update test (nvFuser, CUDA)

| Test Name | A100 | GB200 | H100 | Source |
|---|---|---|---|---|
| thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32 | ❌ | ❌ | ❌ | |

- (Medium, 1) InstanceNorm numerical mismatch in Thunder vs Torch (nvFuser, test_ops.test_core_vs_torch_consistency)

| Test Name | A100 | Source |
|---|---|---|
| thunder.tests.test_ops.test_core_vs_torch_consistency_instance_norm_nvfuser_cuda_thunder.dtypes.float32 | ❌ | |

- (Medium, 1) Thunder vs. eager output mismatch in nanoGPT autograd test (test_networks)

| Test Name | H100 | Source |
|---|---|---|
| thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32 | ❌ | |
75fc873 to becd6fb
!test
Greptile Overview
Greptile Summary
This PR adds a manual scheduling test demonstrating bank-conflict-free transpose operations using XOR swizzle patterns. The implementation uses 32x32 tiling with shared memory swizzling to eliminate bank conflicts in GPU memory accesses. Key changes:
The PR description notes this is a preparatory change: the manual scheduling approach will be integrated into the auto-scheduler in future PRs. The test serves as a proof of concept showing the technique is effective on GB200 hardware.
Confidence Score: 4/5
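To make the bank-conflict claim concrete, here is a minimal host-side sketch that counts how many lanes of a warp land in the same shared-memory bank when reading a 32x32 float tile. It is an illustration only, not the test added by this PR: it assumes 32 banks of one 4-byte word each and a scalar (non-vectorized) XOR of the column index with the row index, and the names `plainOffset`, `xorOffset`, and `worstConflict` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kTile = 32;   // tile edge, matching the 32x32 tiling
constexpr int kBanks = 32;  // assumption: 32 banks, one 4-byte word each

// Word offset of element (row, col) in a row-major float tile.
int plainOffset(int row, int col) { return row * kTile + col; }

// Same tile, but the column index is XORed with the row index.
// XOR of two values below 32 stays below 32, so the offset stays in the tile.
int xorOffset(int row, int col) {
  assert(0 <= row && row < kTile && 0 <= col && col < kTile);
  return row * kTile + (col ^ row);
}

// Largest number of lanes that hit one bank when 32 lanes each read one word.
// byColumn == true:  lane i reads element (i, fixed)  (a tile column).
// byColumn == false: lane i reads element (fixed, i)  (a tile row).
int worstConflict(int (*offset)(int, int), bool byColumn) {
  int worst = 0;
  for (int fixed = 0; fixed < kTile; ++fixed) {
    std::array<int, kBanks> hits{};  // zero-initialized per access
    for (int lane = 0; lane < kTile; ++lane) {
      int word = byColumn ? offset(lane, fixed) : offset(fixed, lane);
      worst = std::max(worst, ++hits[word % kBanks]);
    }
  }
  return worst;
}
```

Under these assumptions, `worstConflict(plainOffset, true)` is 32 (every lane of a column read lands in one bank), while `worstConflict(xorOffset, true)` and `worstConflict(xorOffset, false)` are both 1, because for a fixed column `c` the map `r -> c ^ r` is a permutation of 0..31.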
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Test as SwizzleNoBankConflict Test
participant Fusion as Fusion IR
participant Input as Input Tensor (tv0)
participant InputCache as Shared Memory Cache
participant OutputCache as Output Cache
participant Output as Output Tensor (tv1)
participant BankConflict as Bank Conflict Analyzer
Test->>Fusion: Create fusion with transpose(tv0, 0, 1)
Test->>Input: Create input tensor [262144, 5120]
Test->>Fusion: Add input and output to fusion
Note over Test,Fusion: Step 1: Create caches
Test->>InputCache: tv0->cacheAfter()
Test->>OutputCache: tv1->cacheBefore()
Test->>InputCache: setMemoryType(Shared)
Note over Test,Fusion: Step 2: Tile both groups (32x32 tiles)
Test->>OutputCache: Split, reorder, merge (output layout)
Test->>Output: Split, reorder, merge (output layout)
Test->>Input: Split, reorder, merge (input layout)
Test->>InputCache: Split, reorder, merge (input layout)
Note over Test,InputCache: Step 3: Apply XOR swizzle to avoid bank conflicts
Test->>InputCache: split(3, vectorize_factor)
Test->>InputCache: split(2, vectorize_factor)
Test->>InputCache: swizzle(XOR, 2, 4)
Test->>InputCache: merge & parallelize (Unroll, TIDx, Vectorize)
Test->>InputCache: setAllocationDomain()
Note over Test,Output: Step 4: Schedule output tensors
Test->>OutputCache: Reorder, merge, split, parallelize
Test->>Output: Reorder, merge, split, parallelize with vectorization
Note over Test,Fusion: Step 5: Compile and verify
Test->>Fusion: inlineMost()
Test->>Fusion: compile()
Test->>BankConflict: getBankConflictInfo()
BankConflict-->>Test: empty() = true (no conflicts)
Test->>Fusion: run()
Test->>Test: testValidate()
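The data movement in the diagram (stage a tile in shared memory with a swizzle, then read it back in the transposed order) can be emulated on the CPU to check that swizzling does not change the result. This is a rough sketch under stated assumptions, not the nvFuser schedule itself: it uses a scalar `col ^ row` swizzle instead of the vectorized `swizzle(XOR, 2, 4)` step, requires both dimensions to be multiples of 32, and the function name `transposeViaSwizzledTile` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTileEdge = 32;  // 32x32 tiles, as in step 2 of the diagram

// Transposes a row-major rows x cols matrix through a swizzled staging tile.
// Assumption: rows and cols are multiples of kTileEdge.
std::vector<float> transposeViaSwizzledTile(const std::vector<float>& in,
                                            int rows, int cols) {
  assert(rows % kTileEdge == 0 && cols % kTileEdge == 0);
  assert(in.size() == static_cast<std::size_t>(rows) * cols);
  std::vector<float> out(in.size());
  float tile[kTileEdge * kTileEdge];  // stands in for the shared-memory cache

  for (int r0 = 0; r0 < rows; r0 += kTileEdge) {
    for (int c0 = 0; c0 < cols; c0 += kTileEdge) {
      // Stage: copy input rows into the tile, XOR-swizzling the column.
      for (int r = 0; r < kTileEdge; ++r)
        for (int c = 0; c < kTileEdge; ++c)
          tile[r * kTileEdge + (c ^ r)] =
              in[static_cast<std::size_t>(r0 + r) * cols + (c0 + c)];

      // Drain: walk tile columns (applying the same swizzle to find each
      // element) and write them out as rows of the transposed matrix.
      for (int c = 0; c < kTileEdge; ++c)
        for (int r = 0; r < kTileEdge; ++r)
          out[static_cast<std::size_t>(c0 + c) * rows + (r0 + r)] =
              tile[r * kTileEdge + (c ^ r)];
    }
  }
  return out;
}
```

Because the same swizzle is applied on the write and on the read, the output is identical to a plain transpose; only the physical position of each element inside the staging tile changes.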
2 files reviewed, no comments
Remove bank conflicts in transpose scheduler using XOR swizzle.
This PR only adds a manual schedule test to investigate the method; the technique will be added to the auto scheduler in follow-up PRs.
Note:

Also experimented with different tile shapes, vectorization factors, and shared-memory swizzle patterns to reduce or eliminate bank conflicts. Shared-memory swizzling provided consistently high performance and is portable across hardware, whereas 256-bit vectorization is only supported on certain architectures.
Perf on GB200, float4, transpose of 262144 x 5120