Layernorm regression: Token threading requires loop parallel store optimization #146

@maleadt

Description

#89 added an alias-aware token ordering pass that correctly separates memory operations by alias set, improving matmul (-15%) and batch matmul (-7%). However, it regresses the layernorm forward kernel by 1.67x (0.24 ms → 0.40 ms at 4096×4096).

Our pass architecture (alias analysis, LAST_OP/LAST_STORE tracking, eager join_tokens) matches cuTile Python's token_order.py exactly. The difference is one optimization Python has that we don't: loop parallel store.

When a TileStore in a for-loop uses the induction variable as its index (non-overlapping across iterations), the store can use the token from before the loop instead of a loop-carried token. This breaks the dependency chain through the loop. Once the store is parallelized, the token carries and joins for read-only alias sets become dead code, and DCE removes them.
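The rewrite can be sketched roughly as follows. This is a simplified model, not the real pass: the IR node classes (`TileStore`, `ForLoop`) and their fields are hypothetical stand-ins, and the actual legality check in `token_order.py` has to prove non-overlap rather than just pattern-match the index.

```python
# Hedged sketch of the loop-parallel-store rewrite. A store whose index is
# exactly the loop induction variable writes disjoint tiles on each
# iteration, so it may take the token from before the loop instead of the
# loop-carried token. IR node names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TileStore:
    array: str   # alias set the store writes to
    index: str   # SSA value used as the store index
    token: str   # token operand threading memory order

@dataclass
class ForLoop:
    induction_var: str
    body: list = field(default_factory=list)

def try_loop_parallel_store(loop: ForLoop, preloop_token: str) -> int:
    """Rewrite eligible stores to use the pre-loop token; return how many
    stores were rewritten. Breaking the token dependency here is what lets
    the loop-carried token chain die."""
    rewritten = 0
    for op in loop.body:
        if isinstance(op, TileStore) and op.index == loop.induction_var:
            op.token = preloop_token  # break the loop-carried dependency
            rewritten += 1
    return rewritten
```

Once the store no longer consumes the carried token, the carries and joins that fed it have no remaining users, which is what makes the subsequent DCE effective.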

Without this optimization, layernorm's normalize loop (loads X, W, B; stores Y — 4 non-aliasing arrays) generates 5 loop-carried tokens with join_tokens after every load. Python generates 0.

Benchmark impact (RTX 5080):

| Kernel | Before pass | After pass | Δ |
| --- | --- | --- | --- |
| Layernorm fwd | 0.24 ms | 0.40 ms | +67% (regression) |
| Matrix Multiply | 3.73 ms | 3.19 ms | -15% (improvement) |
| Batch MatMul | 0.61 ms | 0.57 ms | -7% (improvement) |

Fix: Port _try_loop_parallel_store from res/cutile-python/src/cuda/tile/_passes/token_order.py (lines 425–541) and add a DCE pass on the structured IR after token ordering. See PLAN2.md for details.
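The DCE pass mentioned above can be a simple fixed-point sweep over token-producing ops: anything without side effects whose result token is never consumed is deleted, and deletions cascade through chains of dead joins and carries. A minimal sketch, with a hypothetical op representation:

```python
# Hedged sketch of DCE over token ops on the structured IR. Op names and
# fields are hypothetical; the real IR differs.
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenOp:
    name: str                       # SSA name of the token this op produces
    operands: tuple                 # token names this op consumes
    has_side_effect: bool = False   # loads/stores are roots; joins/carries are not

def dce_tokens(ops):
    """Remove token ops whose results are never used, iterating to a fixed
    point so removals cascade through chains of dead joins/carries."""
    ops = list(ops)
    changed = True
    while changed:
        changed = False
        used = {t for op in ops for t in op.operands}
        keep = [op for op in ops if op.has_side_effect or op.name in used]
        if len(keep) != len(ops):
            ops, changed = keep, True
    return ops
```

In the layernorm case, once the store is parallelized, the joins and carries for the read-only alias sets form exactly such a dead chain and disappear.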
