Skip to content

Conversation

@liqiangxl
Copy link
Collaborator

@liqiangxl liqiangxl commented Feb 3, 2026

Two extensions to transpose benchmark in benchmarks/python/test_transpose.py

(1) Adds coverage for copy vs. view transpose
Previously, we only exercised view transpose, which returns a non-contiguous tensor and the pointwise scheduler is used. As a result, the transpose scheduler was never actually used.
This PR adds .contiguous() to enforce a contiguous output layout, which triggers a copy-based transpose.
For manually defined fusions, a segment_set was added to the fusion to avoid the pre-segmentation pass (AllocationDomainPass) changing transpose output layout to ensure the copy transpose path is taken.

For view transpose, the output has a allocation domain of (iS11{i0}, iS10{i1}) which is same as input

T5_g_float[iS10{i1}, iS11{i0}]
 logical domain : (iS10{i1}, iS11{i0})
 allocation domain : (iS11{i0}, iS10{i1})
 contiguity: t t
 loop domain : (iS10{i1}, iS11{i0})

Final fusion is:

Segmented_Fusion{ 
groups: 
  pointwise{0, 1, 2, 3}
edges: 

group details:
g{(pointwise)
group id: 0
inputs:
  T0_g_float[iS0{i0}, iS1{i1}] float
  T1_g_float[iS12{i0}, iS13{i1}] float
outputs:
  T5_g_float[iS10{i1}, iS11{i0}] float


T2_l_float[iS4{i0}, iS5{i1}]
   = T0_g_float[iS0{i0}, iS1{i1}]
   + T1_g_float[iS12{i0}, iS13{i1}];
(0)
T3_l_float[iS7{i1}, iS6{i0}]
   = Set.Permute( T2_l_float[iS4{i0}, iS5{i1}], cache_op=Streaming )
(1)
T4_l_bool[iS8{i1}, iS9{i0}]
   = T3_l_float[iS7{i1}, iS6{i0}]
   > double(0);
(2)
T5_g_float[iS10{i1}, iS11{i0}]
   = where(T4_l_bool[iS8{i1}, iS9{i0}]
  , T3_l_float[iS7{i1}, iS6{i0}]
  , double(0));
(3)
}

} //Segmented_Fusion

For copy transpose, the output is T6, it has a transposed allocation domain : (iS12{i1}, iS13{i0}):

T6_g_float[iS12{i1}, iS13{i0}]
   = SegmenterSet( T5_l_float[iS10{i1}, iS11{i0}] )

T5_l_float[iS10{i1}, iS11{i0}]
 logical domain : (iS10{i1}, iS11{i0})
 contiguity: t t
 loop domain : (iS10{i1}, iS11{i0})
T6_g_float[iS12{i1}, iS13{i0}]
 logical domain : (iS12{i1}, iS13{i0})
 allocation domain : (iS12{i1}, iS13{i0})
 contiguity: t t
 loop domain : (iS12{i1}, iS13{i0})

Final fusion is:

Segmented_Fusion{ 
groups: 
  transpose{0, 1, 2, 3, 4}
edges: 

group details:
g{(transpose)
group id: 0
inputs:
  T0_g_float[iS0{i0}, iS1{i1}] float
  T1_g_float[iS14{i0}, iS15{i1}] float
outputs:
  T6_g_float[iS12{i1}, iS13{i0}] float


T2_l_float[iS4{i0}, iS5{i1}]
   = T0_g_float[iS0{i0}, iS1{i1}]
   + T1_g_float[iS14{i0}, iS15{i1}];
(0)
T3_g_float[iS7{i1}, iS6{i0}]
   = Set.Permute( T2_l_float[iS4{i0}, iS5{i1}], cache_op=Streaming )
(1)
T4_g_bool[iS8{i1}, iS9{i0}]
   = T3_g_float[iS7{i1}, iS6{i0}]
   > double(0);
(2)
T5_g_float[iS10{i1}, iS11{i0}]
   = where(T4_g_bool[iS8{i1}, iS9{i0}]
  , T3_g_float[iS7{i1}, iS6{i0}]
  , double(0));
(3)
T6_g_float[iS12{i1}, iS13{i0}]
   = SegmenterSet( T5_g_float[iS10{i1}, iS11{i0}] )
(4)
}

} //Segmented_Fusion

(2) Generalizes fusion input ranks to 2D
Previously, fusion inputs were limited to 3D shapes, with roughly 100 test cases per data type. This PR expands coverage to include 2D input shapes as well.

@github-actions
Copy link

github-actions bot commented Feb 3, 2026

Review updated until commit e1dcbe7

Description

  • Extends transpose benchmark to support both 2D and 3D tensor inputs

  • Adds copy vs view transpose testing using contiguous() and segment_set

  • Implements segment_set to enforce transpose scheduler over pointwise scheduler

  • Parameterizes tests to cover all axes combinations for both 2D and 3D cases

Changes walkthrough

Relevant files
Enhancement
test_transpose.py
Extend transpose benchmark with 2D inputs and copy transpose

benchmarks/python/test_transpose.py

  • Modified transpose_fusion to accept is_copy_transpose and rank
    parameters
  • Added segment_set logic to enforce copy transpose path when needed
  • Extended transpose_fwd_fn to handle both copy and view transpose cases
  • Updated test parameterization to include 2D/3D dims and copy/view
    transpose flags
  • Added _generate_transpose_params helper for comprehensive test
    coverage
  • +65/-22 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Function Signature Change

    The transpose_fusion function now requires two new mandatory parameters (is_copy_transpose and rank). While all current calls have been updated, this is a breaking API change that could affect external users of this function.

    def transpose_fusion(
        fd: FusionDefinition,
        dtype: DataType,
        is_copy_transpose: bool,
        axes: list,
        rank: int,
    ):
    Parameter Validation

    The new is_copy_transpose parameter is used to control both the segment_set logic and the contiguous() call in transpose_fwd_fn. Ensure this parameter is properly validated and handles edge cases correctly.

    if is_copy_transpose:
        T10 = fd.ops.segment_set(T9)
        fd.add_output(T10)
    else:
        fd.add_output(T9)

    Test failures

    • (Medium, 3) Shape mismatch in thunderfx higher_order_inplace_alias_update (thunder/tests/test_update_aliases)

      Test Name A100 GB200 H100 Source
      thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32

    @liqiangxl liqiangxl marked this pull request as ready for review February 3, 2026 15:26
    @liqiangxl liqiangxl requested review from Priya2698 and naoyam February 3, 2026 15:26
    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 3, 2026

    Greptile Overview

    Greptile Summary

    This PR extends the transpose benchmark to cover both copy and view transpose operations, and adds support for 2D input shapes.

    Key Changes:

    • Added is_copy_transpose parameter to distinguish between copy-based transpose (using .contiguous()) and view-based transpose (swapped strides only)
    • For copy transpose: added segment_set operation in fusion definition to prevent presegmentation passes from optimizing output to a view, ensuring the transpose scheduler is used instead of pointwise scheduler
    • Generalized input tensor rank from fixed 3D to parameterized 2D/3D via new rank parameter
    • Introduced _generate_transpose_params() helper to generate test parameters for both 2D (axes: (0,1)) and 3D (axes: (0,1), (0,2), (1,2)) cases
    • Updated both nvFuser and baseline benchmark tests to support the new parameters

    Impact:

    • Significantly expands test coverage by adding copy transpose path and 2D shapes
    • Ensures transpose scheduler is properly exercised during benchmarking
    • The changes maintain backward compatibility with existing 3D test cases

    Confidence Score: 4/5

    • This PR is safe to merge with minimal risk - changes are well-scoped to benchmark expansion
    • The implementation correctly extends the transpose benchmark with proper parameter handling for both copy and view transpose modes, and 2D/3D input shapes. The logic for conditionally applying segment_set and .contiguous() is sound. Previous review comments about formatting/spelling have been noted but don't affect functionality.
    • No files require special attention

    Important Files Changed

    Filename Overview
    benchmarks/python/test_transpose.py Extended transpose benchmark with 2D input support and copy vs. view transpose coverage; no critical issues found

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Runner
        participant Gen as _generate_transpose_params()
        participant Fusion as transpose_fusion()
        participant Forward as transpose_fwd_fn()
        participant NVF as nvFuser Runtime
        
        Note over Test,NVF: Test Execution Flow
        
        Test->>Gen: Generate test parameters
        Gen->>Gen: For dims in (2, 3)
        Gen->>Gen: Generate sizes via generate_input_sizes(dims)
        Gen->>Gen: Create axes_list based on dims
        Gen-->>Test: Return (size, axes, dims) tuples
        
        Test->>Test: Parametrize with dtype, is_copy_transpose
        Test->>Fusion: Define fusion (dtype, is_copy_transpose, axes, rank)
        
        Fusion->>Fusion: Define 2 input tensors with rank dimensions
        Fusion->>Fusion: Add tensors, permute, apply ReLU
        
        alt is_copy_transpose == True
            Fusion->>Fusion: Apply segment_set(T9) → T10
            Fusion->>NVF: add_output(T10)
            Note over Fusion,NVF: Forces copy transpose<br/>(transpose scheduler)
        else is_copy_transpose == False
            Fusion->>NVF: add_output(T9)
            Note over Fusion,NVF: View transpose<br/>(pointwise scheduler)
        end
        
        Test->>Forward: Validate with eager_output
        Forward->>Forward: torch.transpose(inputs[0] + inputs[1])
        Forward->>Forward: Apply ReLU
        
        alt is_copy_transpose == True
            Forward->>Forward: Apply .contiguous()
            Note over Forward: Materializes contiguous copy
        end
        
        Forward-->>Test: Return result
        Test->>NVF: Execute benchmark
        NVF-->>Test: Performance metrics
    
    Loading

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, 1 comment

    Edit Code Review Agent Settings | Greptile

    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    Copy link
    Collaborator

    @naoyam naoyam left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @pytest.mark.parametrize("size,axes,dims", _generate_transpose_params())
    @pytest.mark.parametrize("dtype", FLOAT_DTYPES)
    @pytest.mark.parametrize("axes", [(0, 1), (0, 2), (1, 2)])
    @pytest.mark.parametrize(
    Copy link
    Collaborator

    @Priya2698 Priya2698 Feb 3, 2026

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Do we need to benchmark view transpose? Should we remove it instead?

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    I don't know, it's not an expensive benchmark, so I just leave it as is in this PR.

    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Got it. Please work with @xwang233 for dashboard integration.


    @pytest.mark.parametrize("executor", DEFAULT_EXECUTORS)
    @pytest.mark.parametrize("size", generate_input_sizes(dims=3))
    @pytest.mark.parametrize("size,axes,dims", _generate_transpose_params())
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    IIRC, I used 3D inputs to match C++ benchmark. If 2D inputs are sufficient for benchmarking, we should remove the 3D benchmarking. This should also simplify the dashboard for this benchmark

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    should keep 3D for different axes

    Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, 2 comments

    Edit Code Review Agent Settings | Greptile

    Comment on lines 37 to 38
    # add segmenter set to avoid presegment passes setting the output as a view of the input without any data movement. It leads to pointwise instead of transpose scheduler.
    #we can also expose OptimizationPassGuard to python frontend and disable presegmentation passes to enforce output to be contiguous and then transpose scheduler will be used.
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Break these long comments into multiple lines for better readability

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    Copy link
    Collaborator

    @Priya2698 Priya2698 left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM. If you find that the view transpose variant is not meaningful anymore, please remove it in a future follow-up.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    3 participants