Conversation

@liqiangxl (Collaborator) commented Feb 2, 2026

Remove bank conflicts in the transpose scheduler using an XOR swizzle.
This PR only adds a manual-schedule test to investigate the method; it will be added to the auto scheduler in follow-up PRs.

Note:
Also experimented with different tile shapes, vectorization factors, and shared-memory swizzle patterns to reduce or eliminate bank conflicts. Shared-memory swizzling provided consistently high performance and is portable across hardware, whereas 256-bit vectorization is only supported on certain architectures. Perf on GB200, float4, transpose of 262144 x 5120:
(performance comparison image)

@liqiangxl liqiangxl force-pushed the llu/transpose_non_tma branch from 9707e93 to a59914a on February 2, 2026 19:55
@liqiangxl (Collaborator, Author)

!test

2 similar comments
@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test

@github-actions bot commented Feb 3, 2026

Review updated until commit d6e28f5

Description

  • Add comprehensive test case SwizzleNoBankConflict demonstrating bank conflict elimination in transpose operations

  • Implement XOR swizzle scheduling technique to avoid shared memory bank conflicts

  • Include bank conflict validation using getBankConflictInfo to verify zero conflicts

  • Add detailed documentation showing 32x32 transpose access patterns across global memory, shared memory, and registers

Changes walkthrough

Relevant files

Tests

tests/cpp/test_transpose.cpp — Add bank-conflict-free transpose test with XOR swizzle (+93/-0)

  • Added includes for bank conflict analysis and type utilities
  • Implemented new SwizzleNoBankConflict test case with manual transpose scheduling
  • Applied XOR swizzle on shared memory tensor to eliminate bank conflicts
  • Added validation to confirm zero bank conflicts in compiled kernel

Documentation

doc/dev/transpose_access_map.md — Document transpose access patterns with bank conflict analysis (+159/-0)

  • Created comprehensive documentation of 32x32 transpose access patterns
  • Visualized thread-to-bank mapping across global memory, shared memory, and registers
  • Demonstrated how XOR swizzle eliminates bank conflicts within warps
  • Showed coalesced global memory access patterns for input and output

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Test Scope

The test is hardcoded for a specific matrix size (262144 x 5120). While this demonstrates the technique effectively, consider whether additional test cases with different sizes would strengthen the validation, especially for edge cases or different aspect ratios.

    TEST_F(TransposeTest, SwizzleNoBankConflict) {
      auto fusion_ptr = std::make_unique<Fusion>();
      FusionGuard fg(fusion_ptr.get());
      Fusion& fusion = *fusion_ptr;
    
      auto dtype = DataType::Float;
      auto tv0 = makeContigConcreteTensor({262144, 5120}, dtype);
      fusion.addInput(tv0);
      auto tv1 = transpose(tv0, 0, 1);
      fusion.addOutput(tv1);
    
      auto options =
          at::TensorOptions().dtype(data_type_to_aten(dtype)).device(at::kCUDA, 0);
      at::Tensor input0 = at::randn({262144, 5120}, options);
    
      auto input_cache = tv0->cacheAfter();
      auto output_cache = tv1->cacheBefore();
      input_cache->setMemoryType(MemoryType::Shared);
    
      // Step-1, tiling and parallelizing non-tile dimensions
      int64_t tile_size1 = 32, tile_size2 = 32;
      // Group 1 (output-side layout [y, x]).
      for (auto tv : {output_cache, tv1}) {
        // [y, x] -> [y/tile_size2, tile_size2, x/tile_size1, tile_size1]
        tv->split(1, tile_size1);
        tv->split(0, tile_size2);
        // [x/tile_size1, y/tile_size2, tile_size1, tile_size2]
        tv->reorder({{0, 1}, {1, 3}, {2, 0}, {3, 2}});
        // [x/tile_size1 * y/tile_size2, tile_size1, tile_size2]
        tv->merge(0);
        tv->split(0, 1);
        tv->axis(1)->parallelize(ParallelType::Unswitch);
        tv->axis(0)->parallelize(ParallelType::BIDx);
      }
      // Group 2 (input-side layout [x, y]).
      for (auto tv : {tv0, input_cache}) {
        // [x, y] -> [x/tile_size1, tile_size1, y/tile_size2, tile_size2]
        tv->split(1, tile_size2);
        tv->split(0, tile_size1);
        // [x/tile_size1, y/tile_size2, tile_size1, tile_size2]
        tv->reorder({{1, 2}, {2, 1}});
        // [x/tile_size1 * y/tile_size2, tile_size1, tile_size2]
        tv->merge(0);
        tv->split(0, 1);
        tv->axis(1)->parallelize(ParallelType::Unswitch);
        tv->axis(0)->parallelize(ParallelType::BIDx);
      }
    
      // Step-2, schedule input shared cache to avoid bank conflict
      int64_t pos = 2;
      int64_t vectorize_factor = 16 / dataTypeSizeByte(dtype),
              threads_per_block = 128;
      // Schedule input shared cache.
      // [BIDx, Unswitch, tile_size1, tile_size2]
      input_cache->split(3, vectorize_factor);
      // [BIDx, Unswitch, tile_size1, tile_size2/vectorize_factor,
      // vectorize_factor]
      input_cache->split(2, vectorize_factor);
      // [BIDx, Unswitch, tile_size1/vectorize_factor, vectorize_factor,
      // tile_size2/vectorize_factor, vectorize_factor]
      input_cache->swizzle(SwizzleType::XOR, 2, 4);
      input_cache->merge(2);
      input_cache->merge(2);
      input_cache->split(2, threads_per_block);
      // [BIDx, Unswitch, Unroll, TIDx, Vectorize]
      input_cache->setAllocationDomain(input_cache->getLoopDomain(), true);
      input_cache->axis(2)->parallelize(ParallelType::Unroll);
      input_cache->axis(3)->parallelize(ParallelType::TIDx);
      input_cache->axis(4)->parallelize(ParallelType::Vectorize);
    
      // Step-3, schedule output cache
      for (auto tv : {output_cache, tv1}) {
        tv->reorder({{-2, -1}});
        // [..., tile2, tile1]
        tv->merge(pos);
        tv->split(pos, vectorize_factor);
        tv->split(pos, threads_per_block);
        tv->axis(2)->parallelize(ParallelType::Unroll);
        tv->axis(3)->parallelize(ParallelType::TIDx);
        if (tv == tv1) {
          tv->axis(4)->parallelize(ParallelType::Vectorize);
        }
      }
      inlineMost();
      KernelExecutor ke;
      ke.compile(&fusion, {input0});
      ASSERT_TRUE(getBankConflictInfo(ke.compiledKernel()->kernel()).empty());
      auto outputs = ke.run({input0});
      testValidate(&fusion, outputs, {input0}, __LINE__, __FILE__);
    }

Test failures

• (Medium, 3) Shape mismatch in thunderfx higher-order inplace alias update test (nvFuser, CUDA)
  Failing on: A100, GB200, H100
  Test: thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32
• (Medium, 1) InstanceNorm numerical mismatch in Thunder vs Torch (nvFuser, test_ops.test_core_vs_torch_consistency)
  Failing on: A100
  Test: thunder.tests.test_ops.test_core_vs_torch_consistency_instance_norm_nvfuser_cuda_thunder.dtypes.float32
• (Medium, 1) Thunder vs. eager output mismatch in nanoGPT autograd test (test_networks)
  Failing on: H100
  Test: thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32

@liqiangxl liqiangxl force-pushed the llu/transpose_non_tma branch from 75fc873 to becd6fb on February 3, 2026 17:40
@liqiangxl liqiangxl marked this pull request as ready for review February 3, 2026 17:41
@liqiangxl (Collaborator, Author)

!test

@greptile-apps bot commented Feb 3, 2026

Greptile Overview

Greptile Summary

This PR adds a manual scheduling test demonstrating bank-conflict-free transpose operations using XOR swizzle patterns. The implementation uses 32x32 tiling with shared-memory swizzling to eliminate bank conflicts in GPU memory accesses.

Key changes:

• Added new test SwizzleNoBankConflict that manually schedules a transpose operation with XOR swizzle
• Test validates that the scheduling produces zero bank conflicts using getBankConflictInfo()
• Added documentation file showing the thread access patterns for the 32x32 transpose
• The documentation visualizes input gmem reads, shared memory writes, shared memory reads, and output gmem writes to demonstrate coalesced access patterns

The PR description notes this is a preparatory change: the manual scheduling approach will be integrated into the auto-scheduler in future PRs. The test serves as a proof of concept showing the technique is effective on GB200 hardware.

Confidence Score: 4/5

• This PR is safe to merge: it only adds a test and documentation without modifying existing functionality
• The changes are low-risk as they only add new test code and documentation. The test follows established patterns in the codebase and includes proper validation. The score is 4 rather than 5 because manual scheduling tests can be sensitive to specific hardware configurations, and the complex scheduling logic should be verified on target hardware
• No files require special attention; both changes are straightforward additions

Important Files Changed

doc/dev/transpose_access_map.md: New documentation showing thread access patterns for bank-conflict-free transpose using XOR swizzle
tests/cpp/test_transpose.cpp: New test demonstrating bank-conflict-free transpose using XOR swizzle pattern with manual scheduling

Sequence Diagram

    sequenceDiagram
        participant Test as SwizzleNoBankConflict Test
        participant Fusion as Fusion IR
        participant Input as Input Tensor (tv0)
        participant InputCache as Shared Memory Cache
        participant OutputCache as Output Cache
        participant Output as Output Tensor (tv1)
        participant BankConflict as Bank Conflict Analyzer
        
        Test->>Fusion: Create fusion with transpose(tv0, 0, 1)
        Test->>Input: Create input tensor [262144, 5120]
        Test->>Fusion: Add input and output to fusion
        
        Note over Test,Fusion: Step 1: Create caches
        Test->>InputCache: tv0->cacheAfter()
        Test->>OutputCache: tv1->cacheBefore()
        Test->>InputCache: setMemoryType(Shared)
        
        Note over Test,Fusion: Step 2: Tile both groups (32x32 tiles)
        Test->>OutputCache: Split, reorder, merge (output layout)
        Test->>Output: Split, reorder, merge (output layout)
        Test->>Input: Split, reorder, merge (input layout)
        Test->>InputCache: Split, reorder, merge (input layout)
        
        Note over Test,InputCache: Step 3: Apply XOR swizzle to avoid bank conflicts
        Test->>InputCache: split(3, vectorize_factor)
        Test->>InputCache: split(2, vectorize_factor)
        Test->>InputCache: swizzle(XOR, 2, 4)
        Test->>InputCache: merge & parallelize (Unroll, TIDx, Vectorize)
        Test->>InputCache: setAllocationDomain()
        
        Note over Test,Output: Step 4: Schedule output tensors
        Test->>OutputCache: Reorder, merge, split, parallelize
        Test->>Output: Reorder, merge, split, parallelize with vectorization
        
        Note over Test,Fusion: Step 5: Compile and verify
        Test->>Fusion: inlineMost()
        Test->>Fusion: compile()
        Test->>BankConflict: getBankConflictInfo()
        BankConflict-->>Test: empty() = true (no conflicts)
        Test->>Fusion: run()
        Test->>Test: testValidate()
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@liqiangxl liqiangxl marked this pull request as draft February 3, 2026 22:22