Add alias-aware token threading for memory operations #89

Merged — maleadt merged 10 commits into JuliaGPU:main on Mar 26, 2026
Conversation
force-pushed from 69e7601 to 432cec3
maleadt (Member) reviewed on Mar 2, 2026:
Did you test this with a concrete example that would benefit from it?
force-pushed from ba4f9a7 to d522a1d
Introduce alias-analysis-based token threading:

- Group pointers into alias sets.
- Maintain per-alias-set token chains.
- Thread tokens only between potentially aliasing operations.
- Conservatively fall back to the global set for unknown pointers.
- Preserve existing control-flow token merging semantics.

Enables independent memory operations to execute without unnecessary serialization.
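As a rough illustration of the scheme this commit describes (Python for brevity; all names here are invented for the sketch and are not the package's actual code): each alias set carries its own token chain, and an op on an unknown pointer falls into a shared "universe" set that conservatively orders against everything.

```python
# Hypothetical sketch of per-alias-set token threading.
# Memory ops in different alias sets get independent tokens; unknown
# pointers fall back to a shared "universe" set, serializing conservatively.
GLOBAL = "universe"  # fallback alias set for unknown pointers

class TokenThreader:
    def __init__(self):
        self.counter = 0
        self.last_token = {}  # alias set -> most recent token id

    def fresh(self):
        self.counter += 1
        return f"%token_{self.counter}"

    def alias_set(self, ptr, known_roots):
        # Group by allocation root when known; otherwise be conservative.
        return ptr if ptr in known_roots else GLOBAL

    def thread(self, op, ptr, known_roots):
        key = self.alias_set(ptr, known_roots)
        # A universe-set op must order after every known set, and any op
        # must order after the universe set.
        if key == GLOBAL:
            deps = list(self.last_token.values())
        else:
            deps = [t for k, t in self.last_token.items()
                    if k in (key, GLOBAL)]
        out = self.fresh()
        self.last_token[key] = out
        return deps, out

threader = TokenThreader()
roots = {"A", "B"}
deps_a, _ = threader.thread("store", "A", roots)
deps_b, _ = threader.thread("store", "B", roots)
print(deps_a, deps_b)  # [] [] -- the two stores do not wait on each other
```

Stores to distinct known roots end up with empty dependency lists, which is exactly the "independent memory operations execute without unnecessary serialization" property claimed above.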
Move token threading from inline codegen to a `token_order_pass!` that runs on StructuredIRCode before bytecode emission. The pass:

- Inserts MakeTokenNode at function entry.
- Adds token arguments to memory operations (loads/stores/atomics).
- Inserts JoinTokensNode and TokenResultNode for token flow tracking.
- Uses alias analysis to give independent arrays independent tokens.

This decouples token-ordering decisions from codegen, matching cuTile Python's architecture (res/cutile-python/src/cuda/tile/_passes/token_order.py). Control-flow token threading (loops, branches) is still handled conservatively by codegen; the pass only transforms straight-line code before the first control-flow op. Per-alias loop carries will be added in a follow-up.

Key changes:

- New: codegen/irutils.jl — SSAMap mutation helpers (insert_before!, etc.)
- New: codegen/passes/ directory for IR passes
- Moved: alias_analysis.jl, token_keys.jl, token_order.jl → passes/
- Simplified: memory.jl, views.jl, atomics.jl — read the token from IR args via extract_token_arg!(), falling back to ctx.token inside control flow
- Simplified: control_flow.jl — removed token_map save/restore, kept the single-token loop carry (conservative)
- Removed: ctx.token_map, ctx.global_token, ctx.alias_result from CGCtx

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
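A minimal sketch of the pass shape described above, on a toy tuple-based IR (hypothetical op names; the real pass operates on StructuredIRCode, not this structure): a fresh token is made at entry, each memory op consumes the current token and defines the next, and threading stops at the first control-flow op, whose body is left to the conservative codegen path.

```python
# Toy token-ordering pass over straight-line IR.
# Instructions are (op, args, result) tuples; this is an invented IR,
# used only to illustrate the rewrite, not the package's real data model.
MEMORY_OPS = {"load", "store", "atomic"}
CONTROL_FLOW = {"loop", "if", "while"}

def token_order_pass(ir):
    out = [("make_token", [], "%tok0")]  # fresh token at function entry
    tok, n, seen_cf = "%tok0", 0, False
    for op, args, res in ir:
        if op in CONTROL_FLOW:
            seen_cf = True  # from here on, codegen threads conservatively
        if not seen_cf and op in MEMORY_OPS:
            n += 1
            new_tok = f"%tok{n}"
            # memory op takes the current token and yields the next one
            out.append((op, args + [tok], (res, new_tok)))
            tok = new_tok
        else:
            out.append((op, args, res))
    return out

ir = [("store", ["%p", "%c"], None),
      ("load", ["%q"], "%d"),
      ("loop", [], None),
      ("store", ["%p", "%e"], None)]
for instr in token_order_pass(ir):
    print(instr)
```

The second store, sitting after the `loop`, comes through untouched, mirroring the commit's note that the pass only transforms code before the first control-flow op.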
The key insight: adding one line to `tile_type_for_julia!` to handle TokenType eliminates all token-specific conditionals from codegen. Loop emitters, getfield extraction, and type mapping become uniform — tokens are just another type flowing through the same code paths.

Changes:

- tile_type_for_julia! maps TokenType → Token(tt) (the one-line fix)
- extract_tile_shape handles TokenType (returns ScalarShape)
- token_order_pass! now recurses into loops and branches:
  - Adds per-alias-set token carries (init_values, BlockArgs, terminators)
  - Updates SSAMap types via update_type! to include TokenType parameters
  - Inserts Core.getfield for token result extraction after loops/ifs
- control_flow.jl simplified: no is_token_type branches; trusts parent_result_type
- Terminators no longer manually append ctx.token — the pass handles it
- ctx.token removed from CGCtx entirely

This is a WIP: 196/202 codegen tests pass. The 6 failing integration tests with complex loop patterns (spinloop, nested loops) hit BoundsErrors from update_type! producing incorrect parameter counts — to be fixed next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes:

1. Build loop result types from block args (the authoritative source) instead of from old_type.typ, which may be Nothing for void loops — fixes a BoundsError in emit_loop_getfield! for nested spinloop patterns.
2. Thread parent_loop_effects through IfOp branches so that a ContinueOp/BreakOp inside an IfOp (common for LoopOp→IfOp while-loop patterns) gets its token exit values appended. This was the cause of the "continue op operand mismatch" errors for for-loops with memory ops in the body.
3. Add a ForOp body fallback ContinueOp (matching LoopOp) for completeness.

All 202 codegen tests pass. GPU execution of the spinlock (hang.jl) still has a token-carry issue — break/continue inside the inner loop carry the initial block args instead of the updated CAS result tokens. To be debugged next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes for WhileOp token threading:

1. After transforming the before block (which contains the CAS), propagate the updated token_map to after_token_map. Previously, after_token_map used the initial BlockArgs, so the YieldOp (which becomes a ContinueOp in codegen) carried stale tokens instead of the CAS result.
2. Extend ConditionOp.args with exit tokens (like ContinueOp/BreakOp). The codegen-generated BreakOp reads from cond_op.args, so the token values must be present there for the break path to carry the right tokens.

Result: both break and continue in the spinlock loop now carry %result_token_12 (the CAS acquire token) instead of the initial block args.

Status: 202/202 codegen, hang.jl passes, 1581/1585 full suite. Remaining: 3 early-return device tests + the layernorm example.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hoist_returns! replaces ReturnNode terminators in IfOp branches with an empty YieldOp(). If the token pass runs first and extends the IfOp with token yields, hoist_returns! wipes them out — causing "then branch does not yield anything" errors. Fix: run hoist_returns! first so the token pass sees normalized YieldOps.

Full suite: 1586/1587 pass (the layernorm dW mismatch is pre-existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The has_release_order() stub returned false, meaning release-ordered atomics didn't join with all LAST_OP tokens. Per the Tile IR memory model: "When you use a release operation, you need to token-order all memory events that must stay before the release to the release itself."

Without this, the release-ordered atomic_xchg in spinlock patterns didn't depend on the data store's token — the store's writes weren't guaranteed visible before the lock release, causing data corruption in the layernorm backward kernel.

Fix: extract memory_order from the atomic call args in the IR and pass it through to collect_join_tokens_ir, which already had the release-join logic (lines 150–152) but was never triggered.

All 1586 tests pass, including layernorm backward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
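The release-join rule can be sketched as follows (hypothetical helper and set names, not the signatures of collect_join_tokens_ir itself): a release-class atomic must join the last token of every alias set, while a relaxed op only joins its own set's token.

```python
# Sketch of the release-join rule: a release-ordered atomic orders all
# preceding memory events before itself, so it joins every alias set's
# last token; weaker orders only constrain the op's own alias set.
RELEASE_ORDERS = {"release", "acq_rel", "seq_cst"}

def collect_join_tokens(last_op_tokens, alias_set, memory_order):
    if memory_order in RELEASE_ORDERS:
        # Everything before the release must be token-ordered to it.
        return sorted(set(last_op_tokens.values()))
    # Otherwise only the op's own alias set constrains it.
    return [last_op_tokens[alias_set]] if alias_set in last_op_tokens else []

# Spinlock shape: a data store, then a release atomic_xchg on the lock.
last = {"data": "%tok_store", "lock": "%tok_cas"}
print(collect_join_tokens(last, "lock", "release"))
# -> ['%tok_cas', '%tok_store']
```

With the stub returning false, the release path above was never taken, which is precisely the missing store→release edge the commit describes.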
Based on code review findings:

- Move resolve_call() to irutils.jl (shared by alias_analysis + token_order)
- Use resolve_call() in alias_analysis instead of inline call/invoke normalization
- Extract an insert_token_result_getfields!() helper — replaces 3 copy-pasted blocks (~25 lines each) in the transform_loop!, WhileOp, and IfOp transforms
- Remove dead after_arg assignments in WhileOp (side-effect-only calls)
- Remove the duplicate old_type = get(...) line in the IfOp transform
- Remove the const IRToken = Any alias (no type-safety value)

Net: -48 lines; 3 copy-paste blocks → 1 shared function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three correctness fixes:

1. ACQUIRE_TOKEN_KEY was updated for all atomics instead of only acquire/acq_rel-ordered ones, over-constraining relaxed atomics.
2. The has_acquire effect was set unconditionally for all atomics in compute_block_memory_effects!, causing unnecessary token carries.
3. ALIAS_UNIVERSE was treated as overlapping with nothing instead of with everything, potentially missing token dependencies for unknown aliases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
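The first and third fixes reduce to two small predicates, sketched here with invented names (the real constants and effect computation live in the pass, not in these two functions): only acquire-class orders publish to the acquire token key, and the universe alias set overlaps every other set.

```python
# Sketch of the two ordering predicates fixed in this commit.
ACQUIRE_ORDERS = {"acquire", "acq_rel", "seq_cst"}
UNIVERSE = "universe"  # alias set for pointers we know nothing about

def updates_acquire_key(memory_order):
    # Fix 1: relaxed atomics must not publish an acquire token,
    # otherwise later ops are serialized against them for no reason.
    return memory_order in ACQUIRE_ORDERS

def may_overlap(a, b):
    # Fix 3: the universe set may alias anything; before the fix it
    # overlapped nothing, silently dropping token dependencies.
    return a == b or UNIVERSE in (a, b)

print(updates_acquire_key("relaxed"))        # False
print(may_overlap(UNIVERSE, "arrayA"))       # True
print(may_overlap("arrayA", "arrayB"))       # False
```

Treating the universe set as overlapping everything is the conservative direction: an unknown pointer gains extra ordering edges rather than losing required ones.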
maleadt (Member) commented:

Follow-up changes: sadly, this doesn't improve FFT performance...
This was referenced on Mar 26, 2026.
Feat #1 (Merged)