Add alias-aware token threading for memory operations. #89

Merged

maleadt merged 10 commits into JuliaGPU:main from shreyas-omkar:main on Mar 26, 2026
Conversation

@shreyas-omkar (Contributor) commented Feb 18, 2026

Feat #1

Introduce alias-analysis-based token threading:

  • Group pointers into alias sets.
  • Maintain per-alias-set token chains.
  • Thread tokens only between potentially aliasing operations.
  • Conservatively fall back to the global set for unknown pointers.
  • Preserve existing control flow token merging semantics.

Enables independent memory operations to execute without unnecessary serialization.
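A minimal sketch of the idea in Python (illustrative only — the actual implementation is Julia, and the names `AliasSets` and `thread_tokens` are invented here): pointers are grouped into alias sets, each set carries its own token chain, and unknown pointers fall back to one conservative global set.

```python
GLOBAL = "global"  # conservative fallback set for unknown pointers

class AliasSets:
    def __init__(self, groups):
        # groups: pointer -> alias-set id; anything not listed is
        # conservatively placed in the single GLOBAL set
        self.groups = groups

    def set_of(self, ptr):
        return self.groups.get(ptr, GLOBAL)

def thread_tokens(ops, alias_sets):
    """ops: list of (op_name, ptr).
    Returns (op_name, input_token, output_token) per op, threading
    tokens only along each alias set's own chain."""
    chains = {}   # alias-set id -> last token produced on that set
    out = []
    for i, (name, ptr) in enumerate(ops):
        s = alias_sets.set_of(ptr)
        tin = chains.get(s)        # None: no prior op on this set
        tout = f"t{i}"
        chains[s] = tout           # extend only this set's chain
        out.append((name, tin, tout))
    return out
```

With two non-aliasing arrays, neither load waits on the other, while a later store to the first array chains after its own set's last op — the independence the description claims.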

@shreyas-omkar shreyas-omkar marked this pull request as draft February 18, 2026 11:19
@shreyas-omkar shreyas-omkar force-pushed the main branch 2 times, most recently from 69e7601 to 432cec3 Compare February 19, 2026 19:53
@maleadt (Member) left a comment

Did you test this with a concrete example that would benefit from it?

@shreyas-omkar shreyas-omkar force-pushed the main branch 4 times, most recently from ba4f9a7 to d522a1d Compare March 21, 2026 14:09
@shreyas-omkar shreyas-omkar marked this pull request as ready for review March 21, 2026 14:12
shreyas-omkar and others added 9 commits March 26, 2026 06:47
Introduce alias-analysis-based token threading:

- Group pointers into alias sets.
- Maintain per-alias-set token chains.
- Thread tokens only between potentially aliasing operations.
- Conservatively fall back to the global set for unknown pointers.
- Preserve existing control-flow token merging semantics.

Enables independent memory operations to execute without unnecessary
serialization.

Move token threading from inline codegen to a `token_order_pass!` that
runs on StructuredIRCode before bytecode emission. The pass:
- Inserts MakeTokenNode at function entry
- Adds token arguments to memory operations (loads/stores/atomics)
- Inserts JoinTokensNode and TokenResultNode for token flow tracking
- Uses alias analysis to give independent arrays independent tokens

This decouples token ordering decisions from codegen, matching cuTile
Python's architecture (res/cutile-python/src/cuda/tile/_passes/token_order.py).
Control flow token threading (loops, branches) is still handled by codegen
conservatively; the pass only transforms straight-line code before the first
control flow op. Per-alias loop carries will be added in a follow-up.
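The straight-line transform can be sketched in Python (a toy, not the Julia pass: the IR tuples and op names here are invented, and only the node names mirror the commit message):

```python
MEMORY_OPS = {"load", "store", "atomic"}
CONTROL_OPS = {"loop", "if", "while"}

def token_order_pass(stmts):
    """stmts: list of (op, args). Inserts a MakeTokenNode at entry,
    appends the current token to each memory op, and records each
    op's result token; stops at the first control-flow op, which
    codegen still handles conservatively."""
    out = [("MakeTokenNode", ())]
    token, counter = "%tok0", 1
    for i, (op, args) in enumerate(stmts):
        if op in CONTROL_OPS:
            out.extend(stmts[i:])   # leave control flow untouched
            break
        if op in MEMORY_OPS:
            out.append((op, args + (token,)))  # appended token arg
            token = f"%tok{counter}"           # this op's result token
            counter += 1
            out.append(("TokenResultNode", (token,)))
        else:
            out.append((op, args))
    return out
```

The point of the sketch: ordering decisions live entirely in the rewritten IR, so codegen only has to read the token argument back out.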

Key changes:
- New: codegen/irutils.jl — SSAMap mutation helpers (insert_before!, etc.)
- New: codegen/passes/ directory for IR passes
- Moved: alias_analysis.jl, token_keys.jl, token_order.jl → passes/
- Simplified: memory.jl, views.jl, atomics.jl — read token from IR args
  via extract_token_arg!(), fall back to ctx.token inside control flow
- Simplified: control_flow.jl — removed token_map save/restore, kept
  single-token loop carry (conservative)
- Removed: ctx.token_map, ctx.global_token, ctx.alias_result from CGCtx

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The key insight: adding one line to `tile_type_for_julia!` to handle
TokenType eliminates all token-specific conditionals from codegen.
Loop emitters, getfield extraction, and type mapping become uniform
— tokens are just another type flowing through the same code paths.
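The shape of the "one-line fix" in a Python toy (names and types invented; the real function is the Julia `tile_type_for_julia!`): one extra entry in the type mapping means downstream helpers need no token special case.

```python
class Float32: pass
class TokenType: pass

# the single extra mapping entry: tokens are just another tile type
TILE_TYPES = {Float32: "f32", TokenType: "token"}

def tile_type_for_julia(t):
    return TILE_TYPES[t]

def loop_carry_types(carries):
    # uniform: no is_token_type branch needed anywhere downstream
    return [tile_type_for_julia(t) for t in carries]
```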

Changes:
- tile_type_for_julia! maps TokenType → Token(tt) (the 1-line fix)
- extract_tile_shape handles TokenType (returns ScalarShape)
- token_order_pass! now recurses into loops and branches:
  - Adds per-alias-set token carries (init_values, BlockArgs, terminators)
  - Updates SSAMap types via update_type! to include TokenType parameters
  - Inserts Core.getfield for token result extraction after loops/ifs
- control_flow.jl simplified: no is_token_type branches, trusts parent_result_type
- Terminators no longer manually append ctx.token — pass handles it
- ctx.token removed from CGCtx entirely

This is a WIP: 196/202 codegen tests pass. 6 integration tests with
complex loop patterns (spinloop, nested loops) have BoundsErrors from
update_type! producing incorrect parameter counts — to be fixed next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three fixes:
1. Build loop result types from block args (authoritative source) instead of
   from old_type.typ which may be Nothing for void loops — fixes BoundsError
   in emit_loop_getfield! for nested spinloop patterns.

2. Thread parent_loop_effects through IfOp branches so that ContinueOp/BreakOp
   inside IfOp (common for LoopOp→IfOp while-loop patterns) get their token
   exit values appended. This was the cause of the "continue op operand mismatch"
   errors for for-loops with memory ops in the body.

3. Add ForOp body fallback ContinueOp (matching LoopOp) for completeness.
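Fix 1 can be sketched as follows (a hypothetical Python rendering; the real code is the Julia `emit_loop_getfield!`): derive result types from the loop body's block arguments, which always exist and carry types, instead of indexing into a recorded type that may be absent for void loops.

```python
def loop_result_types(block_args):
    """block_args: list of (name, type) for the loop body's block
    arguments — the authoritative source, present even for "void"
    loops whose recorded result type is None."""
    return [t for (_name, t) in block_args]

def emit_loop_getfield(block_args, index):
    types = loop_result_types(block_args)
    # safe: index is bounded by the block-arg count, so no
    # BoundsError-style failure on nested void loops
    return ("getfield", index, types[index])
```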

All 202 codegen tests pass. GPU execution of spinlock (hang.jl) still has a
token carry issue — break/continue inside the inner loop carry initial block
args instead of the updated CAS result tokens. To be debugged next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two fixes for WhileOp token threading:

1. After transforming the before block (which contains the CAS), propagate
   the updated token_map to after_token_map. Previously, after_token_map
   used the initial BlockArgs, so the YieldOp (which becomes ContinueOp
   in codegen) carried stale tokens instead of the CAS result.

2. Extend ConditionOp.args with exit tokens (like ContinueOp/BreakOp).
   The codegen-generated BreakOp reads from cond_op.args, so token values
   must be present there for the break path to carry the right tokens.

Result: both break and continue in the spinlock loop now carry
%result_token_12 (CAS acquire token) instead of the initial block args.
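Fix 1, sketched in Python (invented names; the real pass mutates Julia IR): the map handed to the yield/condition must reflect the before block's own memory-op results, not the stale entry block args.

```python
def before_block_token_map(entry_token_map, block_ops):
    """entry_token_map: alias-set -> token from the WhileOp's block
    args. block_ops: (alias_set, result_token) for each memory op in
    the 'before' block (e.g. the CAS)."""
    token_map = dict(entry_token_map)
    for alias_set, result_token in block_ops:
        # each memory op advances its alias set's token chain
        token_map[alias_set] = result_token
    return token_map  # propagate this, not entry_token_map
```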

Status: 202/202 codegen, hang.jl passes, 1581/1585 full suite.
Remaining: 3 early-return device tests + layernorm example.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hoist_returns! replaces ReturnNode terminators in IfOp branches with
empty YieldOp(). If the token pass runs first and extends the IfOp with
token yields, hoist_returns! wipes them out — causing "then branch does
not yield anything" errors.

Fix: run hoist_returns! first so the token pass sees normalized YieldOps.

Full suite: 1586/1587 pass (layernorm dW mismatch is pre-existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The has_release_order() stub returned false, meaning release-ordered
atomics didn't join with all LAST_OP tokens. Per the Tile IR memory
model: "When you use a release operation, you need to token-order all
memory events that must stay before the release to the release itself."

Without this, the release atomic_xchg in spinlock patterns didn't
depend on the data store's token — the store's writes weren't
guaranteed visible before the lock release, causing data corruption
in the layernorm backward kernel.

Fix: extract memory_order from atomic call args in the IR and pass it
through to collect_join_tokens_ir, which already had the release join
logic (line 150-152) but was never triggered.
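The release-join rule quoted above can be sketched in Python (a toy, not the real `collect_join_tokens_ir`; the order and set names are illustrative): a release-flavoured atomic joins the last token of every alias set, while a relaxed one joins only its own chain.

```python
RELEASE_ORDERS = {"release", "acq_rel", "seq_cst"}

def collect_join_tokens(order, own_set, last_op_tokens):
    """last_op_tokens: alias-set -> last token produced on that set."""
    if order in RELEASE_ORDERS:
        # token-order ALL prior memory events before the release
        return sorted(set(last_op_tokens.values()))
    t = last_op_tokens.get(own_set)
    return [t] if t is not None else []
```

This is exactly the spinlock failure mode described above: without the release join, the unlock exchange never depended on the data store's token.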

All 1586 tests pass including layernorm backward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Based on code review findings:

- Move resolve_call() to irutils.jl (shared by alias_analysis + token_order)
- Use resolve_call() in alias_analysis instead of inline call/invoke normalization
- Extract insert_token_result_getfields!() helper — replaces 3 copy-pasted
  blocks (~25 lines each) in transform_loop!, WhileOp, and IfOp transforms
- Remove dead after_arg assignments in WhileOp (side-effect-only calls)
- Remove duplicate old_type = get(...) line in IfOp transform
- Remove const IRToken = Any alias (no type safety value)

Net: -48 lines, 3 copy-paste blocks → 1 shared function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. ACQUIRE_TOKEN_KEY was updated for all atomics instead of only
   acquire/acq_rel-ordered ones, over-constraining relaxed atomics.
2. has_acquire effect was set unconditionally for all atomics in
   compute_block_memory_effects!, causing unnecessary token carries.
3. ALIAS_UNIVERSE was treated as overlapping with nothing instead of
   everything, potentially missing token dependencies for unknown aliases.
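A toy Python rendering of fixes 1 and 3 (set and function names invented; the real constants live in the Julia alias-analysis pass):

```python
ALIAS_UNIVERSE = "universe"
ACQUIRE_ORDERS = {"acquire", "acq_rel", "seq_cst"}

def may_alias(a, b):
    # fix 3: an unknown (universe) pointer must be treated as
    # potentially overlapping EVERYTHING, never as overlapping nothing
    if a == ALIAS_UNIVERSE or b == ALIAS_UNIVERSE:
        return True
    return a == b

def updates_acquire_key(order):
    # fix 1: only acquire-flavoured atomics touch the acquire token
    # key; relaxed atomics must not be over-constrained
    return order in ACQUIRE_ORDERS
```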

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@maleadt (Member) commented Mar 26, 2026

Follow-up changes:

  • Extract token ordering from inline codegen into a separate IR pass (token_order_pass!). Codegen no longer maintains ctx.token, but emits what the pass wrote into the IR.
  • Introduce IR node types (MakeTokenNode, JoinTokensNode, TokenResultNode) so tokens are first-class in the structured IR.
  • Memory intrinsics receive their input token as an appended argument (from the pass) and store result tokens in ctx.result_tokens instead of mutating ctx.token.
  • Add per-alias token carries through loops and branches (init values, block args, terminators).
  • Support release/acquire memory ordering on atomics (release joins all LAST_OP tokens; acquire updates global ACQUIRE token).

Sadly doesn't improve FFT performance...

@maleadt maleadt merged commit b9a9b91 into JuliaGPU:main Mar 26, 2026
9 checks passed