
Conversation

@liqiangxl
Collaborator

Restrict the transpose scheduler's domain so broadcast-only fusions go to the pointwise scheduler, to solve issue #5883 for better performance.

@liqiangxl
Collaborator Author

!test --diff

@github-actions

github-actions bot commented Jan 27, 2026

Review updated until commit 2be55a2

Description

  • Add domain restriction to transpose scheduler to prevent mis-scheduling broadcast operations

  • Check if grouped tensors are due to permutation vs broadcast using allocation domain mapping

  • Use pointwise scheduler for broadcast cases instead of transpose scheduler for better performance

  • Update tests to verify correct scheduler selection for broadcast scenarios

Changes walkthrough

Relevant files:

Bug fix: csrc/scheduler/tools/domain_map.cpp (+40/-1)
Add broadcast detection to transpose scheduler

  • Added #include <ranges> for range-based operations
  • Enhanced the hasAtLeastTwoValidGroups function with broadcast detection logic
  • Added a check to determine whether groups are due to permutation or broadcast
  • Return false for broadcast groups so the pointwise scheduler is used instead
  • Added validation ensuring all_mapped implies any_bcast

Tests: tests/cpp/test_transpose.cpp (+81/-13)
Update tests for broadcast scheduler selection

  • Added #include "type.h" for data type operations
  • Modified the FusionScheduleBroadcastOnly test to use FusionExecutorCache
  • Updated tests to verify the PointWise scheduler is selected for broadcast cases (see the sketch below)
  • Modified the FusionScheduleTransposeMissingDim test similarly
  • Added a new NoTransposeMaverick17B test for the specific broadcast scenario
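
To make the test-side changes concrete, here is a rough sketch of the scheduler-selection check. It is not the actual test code: the fusion construction, the makeContigTensor / makeContigConcreteTensor helpers, the tensor shapes, and the expectSchedulerIs() assertion helper are illustrative assumptions, while FusionExecutorCache, runFusionWithInputs, and getMostRecentKernelRuntime are the entry points the updated tests rely on.

```cpp
// Sketch only; see tests/cpp/test_transpose.cpp for the real tests.
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

// Broadcast-only pattern: tv0[i0, i1] and tv1[i2, b3], nothing is permuted.
// The size-1 axis of tv1 is intended to be a broadcast domain.
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = makeContigConcreteTensor({-1, 1});
fusion->addInput(tv0);
fusion->addInput(tv1);
fusion->addOutput(add(tv0, tv1));

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1024, 1024}, options);
at::Tensor t1 = at::randn({1024, 1}, options);

FusionExecutorCache executor_cache(std::move(fusion));
auto outputs = executor_cache.runFusionWithInputs({t0, t1});

// The broadcast-only fusion should now be handled by the pointwise scheduler
// rather than the transpose scheduler. expectSchedulerIs() is a hypothetical
// assertion helper standing in for whatever check the real tests use.
expectSchedulerIs(
    executor_cache.getMostRecentKernelRuntime(), SchedulerType::PointWise);
```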

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Logic validation

    The new logic correctly identifies broadcast cases vs permutation cases by checking if all allocation domains are mapped. The validation error message provides helpful debugging information. The approach is sound and addresses the performance issue mentioned in the PR description.

    Test coverage

    The tests have been properly updated to use FusionExecutorCache and verify that the pointwise scheduler is chosen for broadcast cases. The new test NoTransposeMaverick17B specifically validates the fix for the Maverick 17B case mentioned in the comments.

Test failures (partial, pipeline still running)

• (Medium, 1) Scalar numerical mismatch in Thunder nanoGPT autograd test (test_networks.py)
  Test (GB200): thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32
• (Medium, 1) Shape mismatch in Thunder in-place alias update (CUDA)
  Test (GB200): thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32

@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

liqiangxl marked this pull request as ready for review January 29, 2026 16:14

@greptile-apps
Contributor

greptile-apps bot commented Jan 29, 2026

    Greptile Overview

    Greptile Summary

    This PR adds an additional domain restriction to the transpose scheduler to improve performance for broadcast-only fusions. The key change prevents the transpose scheduler from being selected when two reference tensors have all their allocation domains mapped to each other, which indicates the grouping is due to broadcast rather than actual transposition.

    Key Changes:

    • Added <ranges> include to domain_map.cpp for C++20 ranges support
    • Enhanced hasAtLeastTwoValidGroups() to check if all allocation domains are mapped between references
    • When all domains are mapped, verifies at least one contains broadcast dimensions and returns false to use pointwise scheduler instead
    • Updated two existing tests (FusionScheduleBroadcastOnly, FusionScheduleTransposeMissingDim) to verify pointwise scheduler selection
    • Added new regression test NoTransposeMaverick17B for the specific issue case
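
As a rough illustration of the check described above (a sketch only: the function name isBroadcastOnlyGrouping, the positional pairing of allocation domains, and the PERMISSIVE mapping mode are assumptions, not the actual diff):

```cpp
// Illustration only; the real logic lives inside
// TransposeDomainMap::hasAtLeastTwoValidGroups in domain_map.cpp.
bool isBroadcastOnlyGrouping(
    TensorView* ref1,
    TensorView* ref2,
    const ComputeAtMap& ca_map) {
  const auto alloc1 = ref1->getMaybeAllocationDomain();
  const auto alloc2 = ref2->getMaybeAllocationDomain();
  if (alloc1.size() != alloc2.size()) {
    return false;
  }
  bool all_mapped = true;
  bool any_bcast = false;
  for (size_t i = 0; i < alloc1.size(); ++i) {
    // For a genuine permutation, at least one same-position pair is unmapped.
    all_mapped = all_mapped &&
        ca_map.areMapped(alloc1.at(i), alloc2.at(i), IdMappingMode::PERMISSIVE);
    any_bcast = any_bcast || alloc1.at(i)->isBroadcast() ||
        alloc2.at(i)->isBroadcast();
  }
  // If every allocation domain pair is mapped, the two groups only differ by
  // a broadcast, so the transpose scheduler should reject the fusion and let
  // the pointwise scheduler handle it.
  NVF_ERROR(
      !all_mapped || any_bcast,
      "all_mapped is expected to imply any_bcast");
  return all_mapped;
}
```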

    Performance Impact:
    The PR comments indicate the pointwise scheduler achieves 61% performance versus the transpose scheduler's 39% for broadcast-only cases, validating this optimization.

    Confidence Score: 4/5

    • This PR is safe to merge with minimal risk after verifying the NVF_ERROR assumption holds in production
    • The logic change is well-documented and addresses a specific performance issue. The defensive NVF_ERROR check validates an assumption but could potentially fail in edge cases not covered by tests. The test coverage is good with three updated/added tests.
    • Pay attention to csrc/scheduler/tools/domain_map.cpp - ensure the NVF_ERROR assumption validation doesn't trigger unexpectedly in production

Important Files Changed

• csrc/scheduler/tools/domain_map.cpp: Added broadcast detection logic to prevent the transpose scheduler from handling broadcast-only fusions, directing them to the pointwise scheduler instead.
• tests/cpp/test_transpose.cpp: Updated existing tests to verify pointwise scheduler selection and added a new regression test for the broadcast case.

    Sequence Diagram

    sequenceDiagram
        participant Scheduler as Scheduler Selection
        participant TDM as TransposeDomainMap
        participant CA as ComputeAtMap
        
        Scheduler->>TDM: hasAtLeastTwoValidGroups(fusion)
        TDM->>TDM: groupInputsOutputsByInnerDim()
        alt Less than 2 groups
            TDM-->>Scheduler: false (not transpose)
        else 2+ groups found
            TDM->>TDM: findReferenceFor(group1)
            TDM->>TDM: findReferenceFor(group2)
            alt No valid references
                TDM-->>Scheduler: false (not transpose)
            else Valid references found
                TDM->>TDM: getMappedAllocDimIn(ref1, innermost2)
                alt No mapped dimension
                    TDM-->>Scheduler: false (not transpose)
                else Mapped dimension exists
                    TDM->>TDM: getMaybeAllocationDomain() for both refs
                    TDM->>CA: Check if all domains mapped
                    CA-->>TDM: all_mapped result
                    alt all_mapped == true
                        TDM->>TDM: Check for broadcast dimensions
                        TDM->>TDM: Validate assumption (NVF_ERROR)
                        TDM-->>Scheduler: false (broadcast case, use pointwise)
                    else all_mapped == false
                        TDM-->>Scheduler: true (valid transpose)
                    end
                end
            end
        end
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

liqiangxl requested review from naoyam and rdspring1 January 29, 2026 18:03

    return false;
    }

    // For grouping caused by permutation, the corresponding loop domains should
Collaborator

Allocation domains instead of loop domains?

Collaborator Author

    yes, revised.

    }

    // For grouping caused by permutation, the corresponding loop domains should
    // not be all mapped to each other. If they are, it means the two groups are
Collaborator

Can you add an example pattern here as a comment?

Collaborator Author

    added.

      // For example, in TransposeTest.NoTransposeMaverick17B, two inputs
      // are tv0[i0, i1] and tv1[i2, b3] where i0/i2 and i1/b3 are mapped to each
      // other. However, tv0 and tv1 are in two different groups because of the
      // broadcast. In this case, we should use the pointwise scheduler instead of
      // the transpose scheduler.
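
For contrast, a genuine permutation, sketched below with the usual test helpers, keeps the transpose scheduler eligible because the allocation domains of the two references are not all mapped position-by-position. This construction is illustrative, not code from the PR.

```cpp
// Illustrative contrast: a real permutation, where the transpose scheduler
// should still be selected.
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

TensorView* tv0 = makeContigTensor(2);  // tv0[i0, i1]
fusion->addInput(tv0);
TensorView* tv1 = transpose(tv0, 0, 1); // tv1[i1, i0]: innermost dims differ
fusion->addOutput(tv1);
```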
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@liqiangxl
Collaborator Author

    !test

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

liqiangxl requested a review from naoyam February 3, 2026 19:25

@naoyam naoyam left a comment

    LGTM

    // are tv0[i0, i1] and tv1[i2, b3] where i0/i2 and i1/b3 are mapped to each
    // other. However, tv0 and tv1 are in two different groups because of the
    // broadcast. In this case, we should use the pointwise scheduler instead of
    // the transpose scheduler.
Collaborator

    I'm ok with this but feel like the original grouping should be changed instead.

IIUC, the grouping identifies whether there exist two differing groups. The added logic here basically overrides the grouping in the broadcast case. I think it would be clearer if the initial grouping itself considered this case and did not produce two groups.

Collaborator Author

    I see your point. This check is currently inside hasAtLeastTwoValidGroups(). The function filters out invalid groups. We could consider inlining this logic into groupInputsOutputsByInnerDim(), but keeping the functions separate is also reasonable.

liqiangxl merged commit fc60235 into main Feb 3, 2026
61 of 64 checks passed
liqiangxl deleted the llu/transpose1 branch February 3, 2026 22:07