Conversation

@liqiangxl (Collaborator) commented Feb 2, 2026

Remove bank conflicts in the transpose scheduler using an XOR swizzle.
This PR only adds a manual-schedule test to investigate the method; it will be added to the auto scheduler in follow-up PRs.

Note:
Also experimented with different tile shapes, vectorization factors, and shared-memory swizzle patterns to reduce or eliminate bank conflicts. Shared-memory swizzling provided consistently high performance and is portable across hardware, whereas 256-bit vectorization is only supported on certain architectures. Perf on GB200, float4, transpose of 262144 x 5120:
(performance comparison image)

@liqiangxl liqiangxl force-pushed the llu/transpose_non_tma branch from 9707e93 to a59914a on February 2, 2026 19:55
@liqiangxl (Collaborator, Author)

!test

2 similar comments
@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test

@github-actions bot commented Feb 3, 2026

Review updated until commit d6e28f5

Description

  • Add comprehensive test case SwizzleNoBankConflict demonstrating bank conflict elimination in transpose operations

  • Implement XOR swizzle scheduling technique to avoid shared memory bank conflicts

  • Include bank conflict validation using getBankConflictInfo to verify zero conflicts

  • Add detailed documentation showing 32x32 transpose access patterns across global memory, shared memory, and registers

Changes walkthrough

Relevant files

Tests

tests/cpp/test_transpose.cpp — Add bank-conflict-free transpose test with XOR swizzle (+93/-0)

  • Added includes for bank conflict analysis and type utilities
  • Implemented new SwizzleNoBankConflict test case with manual transpose scheduling
  • Applied XOR swizzle on shared memory tensor to eliminate bank conflicts
  • Added validation to confirm zero bank conflicts in compiled kernel

Documentation

doc/dev/transpose_access_map.md — Document transpose access patterns with bank conflict analysis (+159/-0)

  • Created comprehensive documentation of 32x32 transpose access patterns
  • Visualized thread-to-bank mapping across global memory, shared memory, and registers
  • Demonstrated how XOR swizzle eliminates bank conflicts within warps
  • Showed coalesced global memory access patterns for input and output

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Test Scope

The test is hardcoded for a specific matrix size (262144 x 5120). While this demonstrates the technique effectively, consider whether additional test cases with different sizes would strengthen the validation, especially for edge cases or different aspect ratios.

    TEST_F(TransposeTest, SwizzleNoBankConflict) {
      auto fusion_ptr = std::make_unique<Fusion>();
      FusionGuard fg(fusion_ptr.get());
      Fusion& fusion = *fusion_ptr;
    
      auto dtype = DataType::Float;
      auto tv0 = makeContigConcreteTensor({262144, 5120}, dtype);
      fusion.addInput(tv0);
      auto tv1 = transpose(tv0, 0, 1);
      fusion.addOutput(tv1);
    
      auto options =
          at::TensorOptions().dtype(data_type_to_aten(dtype)).device(at::kCUDA, 0);
      at::Tensor input0 = at::randn({262144, 5120}, options);
    
      auto input_cache = tv0->cacheAfter();
      auto output_cache = tv1->cacheBefore();
      input_cache->setMemoryType(MemoryType::Shared);
    
      // Step-1, tiling and parallelizing non-tile dimensions
      int64_t tile_size1 = 32, tile_size2 = 32;
      // Group 1 (output-side layout [y, x]).
      for (auto tv : {output_cache, tv1}) {
        // [y, x] -> [y/tile_size2, tile_size2, x/tile_size1, tile_size1]
        tv->split(1, tile_size1);
        tv->split(0, tile_size2);
        // [x/tile_size1, y/tile_size2, tile_size1, tile_size2]
        tv->reorder({{0, 1}, {1, 3}, {2, 0}, {3, 2}});
        // [x/tile_size1 * y/tile_size2, tile_size1, tile_size2]
        tv->merge(0);
        tv->split(0, 1);
        tv->axis(1)->parallelize(ParallelType::Unswitch);
        tv->axis(0)->parallelize(ParallelType::BIDx);
      }
      // Group 2 (input-side layout [x, y]).
      for (auto tv : {tv0, input_cache}) {
        // [x, y] -> [x/tile_size1, tile_size1, y/tile_size2, tile_size2]
        tv->split(1, tile_size2);
        tv->split(0, tile_size1);
        // [x/tile_size1, y/tile_size2, tile_size1, tile_size2]
        tv->reorder({{1, 2}, {2, 1}});
        // [x/tile_size1 * y/tile_size2, tile_size1, tile_size2]
        tv->merge(0);
        tv->split(0, 1);
        tv->axis(1)->parallelize(ParallelType::Unswitch);
        tv->axis(0)->parallelize(ParallelType::BIDx);
      }
    
      // Step-2, schedule input shared cache to avoid bank conflict
      int64_t pos = 2;
      int64_t vectorize_factor = 16 / dataTypeSizeByte(dtype),
              threads_per_block = 128;
      // Schedule input shared cache.
      // [BIDx, Unswitch, tile_size1, tile_size2]
      input_cache->split(3, vectorize_factor);
      // [BIDx, Unswitch, tile_size1, tile_size2/vectorize_factor,
      // vectorize_factor]
      input_cache->split(2, vectorize_factor);
      // [BIDx, Unswitch, tile_size1/vectorize_factor, vectorize_factor,
      // tile_size2/vectorize_factor, vectorize_factor]
      input_cache->swizzle(SwizzleType::XOR, 2, 4);
      input_cache->merge(2);
      input_cache->merge(2);
      input_cache->split(2, threads_per_block);
      // [BIDx, Unswitch, Unroll, TIDx, Vectorize]
      input_cache->setAllocationDomain(input_cache->getLoopDomain(), true);
      input_cache->axis(2)->parallelize(ParallelType::Unroll);
      input_cache->axis(3)->parallelize(ParallelType::TIDx);
      input_cache->axis(4)->parallelize(ParallelType::Vectorize);
    
      // Step-3, schedule output cache
      for (auto tv : {output_cache, tv1}) {
        tv->reorder({{-2, -1}});
        // [..., tile2, tile1]
        tv->merge(pos);
        tv->split(pos, vectorize_factor);
        tv->split(pos, threads_per_block);
        tv->axis(2)->parallelize(ParallelType::Unroll);
        tv->axis(3)->parallelize(ParallelType::TIDx);
        if (tv == tv1) {
          tv->axis(4)->parallelize(ParallelType::Vectorize);
        }
      }
      inlineMost();
      KernelExecutor ke;
      ke.compile(&fusion, {input0});
      ASSERT_TRUE(getBankConflictInfo(ke.compiledKernel()->kernel()).empty());
      auto outputs = ke.run({input0});
      testValidate(&fusion, outputs, {input0}, __LINE__, __FILE__);
    }

Test failures

• (Medium, 3) Shape mismatch in thunderfx higher-order inplace alias update test (nvFuser, CUDA)
  Failing on: A100, GB200, H100
  Test: thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32
• (Medium, 1) InstanceNorm numerical mismatch in Thunder vs Torch (nvFuser, test_ops.test_core_vs_torch_consistency)
  Failing on: A100
  Test: thunder.tests.test_ops.test_core_vs_torch_consistency_instance_norm_nvfuser_cuda_thunder.dtypes.float32
• (Medium, 1) Thunder vs. eager output mismatch in nanoGPT autograd test (test_networks)
  Failing on: H100
  Test: thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32

@liqiangxl liqiangxl force-pushed the llu/transpose_non_tma branch from 75fc873 to becd6fb on February 3, 2026 17:40
@liqiangxl liqiangxl marked this pull request as ready for review February 3, 2026 17:41
@liqiangxl (Collaborator, Author)

!test

@greptile-apps bot commented Feb 3, 2026

Greptile Overview

Greptile Summary

This PR adds a manual scheduling test demonstrating bank-conflict-free transpose operations using XOR swizzle patterns. The implementation uses 32x32 tiling with shared-memory swizzling to eliminate bank conflicts in GPU memory accesses.

Key changes:

• Added new test SwizzleNoBankConflict that manually schedules a transpose operation with XOR swizzle
• Test validates that the scheduling produces zero bank conflicts using getBankConflictInfo()
• Added documentation file showing the thread access patterns for the 32x32 transpose
• The documentation visualizes input gmem reads, shared memory writes, shared memory reads, and output gmem writes to demonstrate coalesced access patterns

The PR description notes this is a preparatory change: the manual scheduling approach will be integrated into the auto-scheduler in future PRs. The test serves as a proof of concept showing the technique is effective on GB200 hardware.

Confidence Score: 4/5

• This PR is safe to merge: it only adds a test and documentation without modifying existing functionality
• The changes are low-risk as they only add new test code and documentation. The test follows established patterns in the codebase and includes proper validation. The score is 4 rather than 5 because manual scheduling tests can be sensitive to specific hardware configurations, and the complex scheduling logic should be verified on target hardware
• No files require special attention; both changes are straightforward additions

Important Files Changed

doc/dev/transpose_access_map.md: New documentation showing thread access patterns for bank-conflict-free transpose using XOR swizzle
tests/cpp/test_transpose.cpp: New test demonstrating bank-conflict-free transpose using XOR swizzle pattern with manual scheduling

Sequence Diagram

    sequenceDiagram
        participant Test as SwizzleNoBankConflict Test
        participant Fusion as Fusion IR
        participant Input as Input Tensor (tv0)
        participant InputCache as Shared Memory Cache
        participant OutputCache as Output Cache
        participant Output as Output Tensor (tv1)
        participant BankConflict as Bank Conflict Analyzer
        
        Test->>Fusion: Create fusion with transpose(tv0, 0, 1)
        Test->>Input: Create input tensor [262144, 5120]
        Test->>Fusion: Add input and output to fusion
        
        Note over Test,Fusion: Step 1: Create caches
        Test->>InputCache: tv0->cacheAfter()
        Test->>OutputCache: tv1->cacheBefore()
        Test->>InputCache: setMemoryType(Shared)
        
        Note over Test,Fusion: Step 2: Tile both groups (32x32 tiles)
        Test->>OutputCache: Split, reorder, merge (output layout)
        Test->>Output: Split, reorder, merge (output layout)
        Test->>Input: Split, reorder, merge (input layout)
        Test->>InputCache: Split, reorder, merge (input layout)
        
        Note over Test,InputCache: Step 3: Apply XOR swizzle to avoid bank conflicts
        Test->>InputCache: split(3, vectorize_factor)
        Test->>InputCache: split(2, vectorize_factor)
        Test->>InputCache: swizzle(XOR, 2, 4)
        Test->>InputCache: merge & parallelize (Unroll, TIDx, Vectorize)
        Test->>InputCache: setAllocationDomain()
        
        Note over Test,Output: Step 4: Schedule output tensors
        Test->>OutputCache: Reorder, merge, split, parallelize
        Test->>Output: Reorder, merge, split, parallelize with vectorization
        
        Note over Test,Fusion: Step 5: Compile and verify
        Test->>Fusion: inlineMost()
        Test->>Fusion: compile()
        Test->>BankConflict: getBankConflictInfo()
        BankConflict-->>Test: empty() = true (no conflicts)
        Test->>Fusion: run()
        Test->>Test: testValidate()
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@liqiangxl liqiangxl marked this pull request as draft February 3, 2026 22:22