
Conversation

@liqiangxl
Collaborator

Restrict the transpose scheduler's domain so broadcast-only fusions go to the pointwise scheduler, to solve issue #5883 for better performance.

@liqiangxl
Collaborator Author

!test --diff

@github-actions

github-actions bot commented Jan 27, 2026

Review updated until commit 2be55a2

Description

  • Add domain restriction to transpose scheduler to prevent mis-scheduling broadcast operations

  • Check if grouped tensors are due to permutation vs broadcast using allocation domain mapping

  • Use pointwise scheduler for broadcast cases instead of transpose scheduler for better performance

  • Update tests to verify correct scheduler selection for broadcast scenarios

Changes walkthrough

Relevant files:

Bug fix: csrc/scheduler/tools/domain_map.cpp (+40/-1)
Add broadcast detection to transpose scheduler

  • Added #include <ranges> for range-based operations
  • Enhanced the hasAtLeastTwoValidGroups function with broadcast detection logic
  • Added a check to determine whether groups are due to permutation or broadcast
  • Return false for broadcast groups so the pointwise scheduler is used instead
  • Added validation ensuring all_mapped implies any_bcast

Tests: tests/cpp/test_transpose.cpp (+81/-13)
Update tests for broadcast scheduler selection

  • Added #include "type.h" for data type operations
  • Modified the FusionScheduleBroadcastOnly test to use FusionExecutorCache
  • Updated tests to verify the PointWise scheduler is selected for broadcast cases (see the sketch below)
  • Modified the FusionScheduleTransposeMissingDim test similarly
  • Added a new NoTransposeMaverick17B test for the specific broadcast scenario
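
To make the test-side changes concrete, here is a rough sketch of the scheduler-selection check. It is not the actual test code: the fusion construction, the makeContigTensor / makeContigConcreteTensor helpers, the tensor shapes, and the expectSchedulerIs() assertion helper are illustrative assumptions, while FusionExecutorCache, runFusionWithInputs, and getMostRecentKernelRuntime are the entry points the updated tests rely on.

```cpp
// Sketch only; see tests/cpp/test_transpose.cpp for the real tests.
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

// Broadcast-only pattern: tv0[i0, i1] and tv1[i2, b3], nothing is permuted.
// The size-1 axis of tv1 is intended to be a broadcast domain.
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = makeContigConcreteTensor({-1, 1});
fusion->addInput(tv0);
fusion->addInput(tv1);
fusion->addOutput(add(tv0, tv1));

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1024, 1024}, options);
at::Tensor t1 = at::randn({1024, 1}, options);

FusionExecutorCache executor_cache(std::move(fusion));
auto outputs = executor_cache.runFusionWithInputs({t0, t1});

// The broadcast-only fusion should now be handled by the pointwise scheduler
// rather than the transpose scheduler. expectSchedulerIs() is a hypothetical
// assertion helper standing in for whatever check the real tests use.
expectSchedulerIs(
    executor_cache.getMostRecentKernelRuntime(), SchedulerType::PointWise);
```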

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Logic validation

    The new logic correctly identifies broadcast cases vs permutation cases by checking if all allocation domains are mapped. The validation error message provides helpful debugging information. The approach is sound and addresses the performance issue mentioned in the PR description.

    Test coverage

    The tests have been properly updated to use FusionExecutorCache and verify that the pointwise scheduler is chosen for broadcast cases. The new test NoTransposeMaverick17B specifically validates the fix for the Maverick 17B case mentioned in the comments.

Test failures (partial, pipeline still running)

• (Medium, 1) Scalar numerical mismatch in Thunder nanoGPT autograd test (test_networks.py)
  Test (GB200): thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32
• (Medium, 1) Shape mismatch in Thunder in-place alias update (CUDA)
  Test (GB200): thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32

@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

liqiangxl marked this pull request as ready for review January 29, 2026 16:14

@greptile-apps
Contributor

greptile-apps bot commented Jan 29, 2026

    Greptile Overview

    Greptile Summary

    This PR adds an additional domain restriction to the transpose scheduler to improve performance for broadcast-only fusions. The key change prevents the transpose scheduler from being selected when two reference tensors have all their allocation domains mapped to each other, which indicates the grouping is due to broadcast rather than actual transposition.

    Key Changes:

    • Added <ranges> include to domain_map.cpp for C++20 ranges support
    • Enhanced hasAtLeastTwoValidGroups() to check if all allocation domains are mapped between references
    • When all domains are mapped, verifies at least one contains broadcast dimensions and returns false to use pointwise scheduler instead
    • Updated two existing tests (FusionScheduleBroadcastOnly, FusionScheduleTransposeMissingDim) to verify pointwise scheduler selection
    • Added new regression test NoTransposeMaverick17B for the specific issue case
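
As a rough illustration of the check described above (a sketch only: the function name isBroadcastOnlyGrouping, the positional pairing of allocation domains, and the PERMISSIVE mapping mode are assumptions, not the actual diff):

```cpp
// Illustration only; the real logic lives inside
// TransposeDomainMap::hasAtLeastTwoValidGroups in domain_map.cpp.
bool isBroadcastOnlyGrouping(
    TensorView* ref1,
    TensorView* ref2,
    const ComputeAtMap& ca_map) {
  const auto alloc1 = ref1->getMaybeAllocationDomain();
  const auto alloc2 = ref2->getMaybeAllocationDomain();
  if (alloc1.size() != alloc2.size()) {
    return false;
  }
  bool all_mapped = true;
  bool any_bcast = false;
  for (size_t i = 0; i < alloc1.size(); ++i) {
    // For a genuine permutation, at least one same-position pair is unmapped.
    all_mapped = all_mapped &&
        ca_map.areMapped(alloc1.at(i), alloc2.at(i), IdMappingMode::PERMISSIVE);
    any_bcast = any_bcast || alloc1.at(i)->isBroadcast() ||
        alloc2.at(i)->isBroadcast();
  }
  // If every allocation domain pair is mapped, the two groups only differ by
  // a broadcast, so the transpose scheduler should reject the fusion and let
  // the pointwise scheduler handle it.
  NVF_ERROR(
      !all_mapped || any_bcast,
      "all_mapped is expected to imply any_bcast");
  return all_mapped;
}
```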

    Performance Impact:
    The PR comments indicate the pointwise scheduler achieves 61% performance versus the transpose scheduler's 39% for broadcast-only cases, validating this optimization.

    Confidence Score: 4/5

    • This PR is safe to merge with minimal risk after verifying the NVF_ERROR assumption holds in production
    • The logic change is well-documented and addresses a specific performance issue. The defensive NVF_ERROR check validates an assumption but could potentially fail in edge cases not covered by tests. The test coverage is good with three updated/added tests.
    • Pay attention to csrc/scheduler/tools/domain_map.cpp - ensure the NVF_ERROR assumption validation doesn't trigger unexpectedly in production

Important Files Changed

• csrc/scheduler/tools/domain_map.cpp: Added broadcast detection logic to prevent the transpose scheduler from handling broadcast-only fusions, directing them to the pointwise scheduler instead.
• tests/cpp/test_transpose.cpp: Updated existing tests to verify pointwise scheduler selection and added a new regression test for the broadcast case.

    Sequence Diagram

    sequenceDiagram
        participant Scheduler as Scheduler Selection
        participant TDM as TransposeDomainMap
        participant CA as ComputeAtMap
        
        Scheduler->>TDM: hasAtLeastTwoValidGroups(fusion)
        TDM->>TDM: groupInputsOutputsByInnerDim()
        alt Less than 2 groups
            TDM-->>Scheduler: false (not transpose)
        else 2+ groups found
            TDM->>TDM: findReferenceFor(group1)
            TDM->>TDM: findReferenceFor(group2)
            alt No valid references
                TDM-->>Scheduler: false (not transpose)
            else Valid references found
                TDM->>TDM: getMappedAllocDimIn(ref1, innermost2)
                alt No mapped dimension
                    TDM-->>Scheduler: false (not transpose)
                else Mapped dimension exists
                    TDM->>TDM: getMaybeAllocationDomain() for both refs
                    TDM->>CA: Check if all domains mapped
                    CA-->>TDM: all_mapped result
                    alt all_mapped == true
                        TDM->>TDM: Check for broadcast dimensions
                        TDM->>TDM: Validate assumption (NVF_ERROR)
                        TDM-->>Scheduler: false (broadcast case, use pointwise)
                    else all_mapped == false
                        TDM-->>Scheduler: true (valid transpose)
                    end
                end
            end
        end
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

liqiangxl requested review from naoyam and rdspring1 January 29, 2026 18:03

    return false;
    }

    // For grouping caused by permutation, the corresponding loop domains should
Collaborator

Allocation domains instead of loop domains?

Collaborator Author

    yes, revised.

    }

    // For grouping caused by permutation, the corresponding loop domains should
    // not be all mapped to each other. If they are, it means the two groups are
Collaborator

Can you add an example pattern here as a comment?

Collaborator Author

    added.

      // For example, in TransposeTest.NoTransposeMaverick17B, two inputs
      // are tv0[i0, i1] and tv1[i2, b3] where i0/i2 and i1/b3 are mapped to each
      // other. However, tv0 and tv1 are in two different groups because of the
      // broadcast. In this case, we should use the pointwise scheduler instead of
      // the transpose scheduler.
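
For contrast, a genuine permutation, sketched below with the usual test helpers, keeps the transpose scheduler eligible because the allocation domains of the two references are not all mapped position-by-position. This construction is illustrative, not code from the PR.

```cpp
// Illustrative contrast: a real permutation, where the transpose scheduler
// should still be selected.
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

TensorView* tv0 = makeContigTensor(2);  // tv0[i0, i1]
fusion->addInput(tv0);
TensorView* tv1 = transpose(tv0, 0, 1); // tv1[i1, i0]: innermost dims differ
fusion->addOutput(tv1);
```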
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@liqiangxl
Collaborator Author

    !test

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

liqiangxl requested a review from naoyam February 3, 2026 19:25

@naoyam naoyam left a comment

    LGTM

    // are tv0[i0, i1] and tv1[i2, b3] where i0/i2 and i1/b3 are mapped to each
    // other. However, tv0 and tv1 are in two different groups because of the
    // broadcast. In this case, we should use the pointwise scheduler instead of
    // the transpose scheduler.
Collaborator

    I'm ok with this but feel like the original grouping should be changed instead.

IIUC, the grouping identifies whether there exist two differing groups. The added logic here basically overrides the grouping in the broadcast case. I think it would be clearer if the initial grouping itself considered this case and did not produce two groups.

Collaborator Author

    I see your point. This check is currently inside hasAtLeastTwoValidGroups(). The function filters out invalid groups. We could consider inlining this logic into groupInputsOutputsByInnerDim(), but keeping the functions separate is also reasonable.

liqiangxl merged commit fc60235 into main Feb 3, 2026
61 of 64 checks passed
liqiangxl deleted the llu/transpose1 branch February 3, 2026 22:07