Description
#89 added an alias-aware token ordering pass that correctly separates memory operations by alias set, improving matmul (-15%) and batch matmul (-7%). However, it regresses the layernorm forward kernel by 1.67x (0.24 ms → 0.40 ms at 4096×4096).
Our pass architecture (alias analysis, LAST_OP/LAST_STORE tracking, eager join_tokens) matches cuTile Python's token_order.py exactly. The difference is a single optimization that Python has and we lack: loop-parallel store.
When a TileStore in a for-loop uses the induction variable as its index (non-overlapping across iterations), the store can use the token from before the loop instead of a loop-carried token. This breaks the dependency chain through the loop. Once the store is parallelized, the token carries and joins for read-only alias sets become dead code, and DCE removes them.
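The idea can be sketched on a toy IR. This is a minimal illustration with hypothetical class and field names (`TileStore`, `ForLoop`, `token_in`, etc.), not the real structured IR, and it only handles the simplest legal case where the store index is exactly the induction variable; the actual `_try_loop_parallel_store` handles more general non-overlapping index patterns:

```python
from dataclasses import dataclass

# Toy stand-ins for IR nodes; names are illustrative, not the real API.

@dataclass
class TileStore:
    index: str      # symbolic index expression, e.g. "i"
    token_in: str   # token the store waits on

@dataclass
class ForLoop:
    induction_var: str
    pre_loop_token: str   # token available before the loop
    carried_token: str    # token threaded through each iteration
    stores: list

def try_loop_parallel_store(loop: ForLoop) -> int:
    """If a store's index is exactly the induction variable, each
    iteration writes a disjoint tile, so the store may take the
    pre-loop token instead of the loop-carried one. This breaks the
    cross-iteration dependency chain through the store."""
    rewritten = 0
    for store in loop.stores:
        if store.index == loop.induction_var and store.token_in == loop.carried_token:
            store.token_in = loop.pre_loop_token
            rewritten += 1
    return rewritten
```

With a loop whose only store indexes by `i`, the store's token operand is rewritten from the carried token to the pre-loop token; a store with a loop-invariant index is left alone because its writes could overlap across iterations.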
Without this optimization, layernorm's normalize loop (loads X, W, B; stores Y, i.e. four non-aliasing arrays) generates 5 loop-carried tokens with a join_tokens after every load; Python generates 0.
Benchmark impact (RTX 5080):
| Kernel | Before pass | After pass | Δ |
|---|---|---|---|
| Layernorm fwd | 0.24 ms | 0.40 ms | +67% (regression) |
| Matrix Multiply | 3.73 ms | 3.19 ms | -15% (improvement) |
| Batch MatMul | 0.61 ms | 0.57 ms | -7% (improvement) |
Fix: Port _try_loop_parallel_store from res/cutile-python/src/cuda/tile/_passes/token_order.py (lines 425–541) and add a DCE pass on the structured IR after token ordering. See PLAN2.md for details.
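The DCE half of the fix only needs to sweep token values that nothing consumes once the stores are parallelized. A minimal mark-and-sweep sketch over a toy token graph (the tuple encoding and function name here are illustrative assumptions, not the structured-IR API):

```python
def dce_tokens(ops, live_roots):
    """ops: list of (result, opname, operands) in SSA form.
    live_roots: token values still consumed after token ordering
    (e.g. tokens used by remaining stores or yielded from the region).
    Returns only the ops whose results are transitively reachable
    from the roots; unreferenced token carries/joins are dropped."""
    defs = {res: (res, name, operands) for res, name, operands in ops}
    live, work = set(), list(live_roots)
    while work:
        v = work.pop()
        if v in live or v not in defs:
            continue
        live.add(v)
        work.extend(defs[v][2])  # a live op keeps its operands live
    return [op for op in ops if op[0] in live]
```

Once the stores take the pre-loop token, the join_tokens results for the read-only alias sets have no remaining users, so they never enter the live set and the sweep removes them along with the carries that fed them.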