Add alias-aware token threading for memory operations. #89

Merged

maleadt merged 10 commits into JuliaGPU:main from shreyas-omkar:main on Mar 26, 2026
Conversation

@shreyas-omkar (Contributor) commented Feb 18, 2026

Feat #1

Introduce alias-analysis-based token threading:

  • Group pointers into alias sets.
  • Maintain per-alias-set token chains.
  • Thread tokens only between potentially aliasing operations.
  • Conservatively fall back to the global set for unknown pointers.
  • Preserve existing control flow token merging semantics.

Enables independent memory operations to execute without unnecessary serialization.
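A minimal sketch of the idea in Python (illustrative only — the actual implementation is Julia, and the names `AliasSets` and `thread_tokens` are invented here): pointers are grouped into alias sets, each set carries its own token chain, and unknown pointers fall back to one conservative global set.

```python
GLOBAL = "global"  # conservative fallback set for unknown pointers

class AliasSets:
    def __init__(self, groups):
        # groups: pointer -> alias-set id; anything not listed is
        # conservatively placed in the single GLOBAL set
        self.groups = groups

    def set_of(self, ptr):
        return self.groups.get(ptr, GLOBAL)

def thread_tokens(ops, alias_sets):
    """ops: list of (op_name, ptr).
    Returns (op_name, input_token, output_token) per op, threading
    tokens only along each alias set's own chain."""
    chains = {}   # alias-set id -> last token produced on that set
    out = []
    for i, (name, ptr) in enumerate(ops):
        s = alias_sets.set_of(ptr)
        tin = chains.get(s)        # None: no prior op on this set
        tout = f"t{i}"
        chains[s] = tout           # extend only this set's chain
        out.append((name, tin, tout))
    return out
```

With two non-aliasing arrays, neither load waits on the other, while a later store to the first array chains after its own set's last op — the independence the description claims.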

@shreyas-omkar shreyas-omkar marked this pull request as draft February 18, 2026 11:19
@shreyas-omkar shreyas-omkar force-pushed the main branch 2 times, most recently from 69e7601 to 432cec3 Compare February 19, 2026 19:53
@maleadt (Member) left a comment

Did you test this with a concrete example that would benefit from it?

@shreyas-omkar shreyas-omkar force-pushed the main branch 4 times, most recently from ba4f9a7 to d522a1d Compare March 21, 2026 14:09
@shreyas-omkar shreyas-omkar marked this pull request as ready for review March 21, 2026 14:12
shreyas-omkar and others added 9 commits March 26, 2026 06:47
Introduce alias-analysis-based token threading:

- Group pointers into alias sets.
- Maintain per-alias-set token chains.
- Thread tokens only between potentially aliasing operations.
- Conservatively fall back to the global set for unknown pointers.
- Preserve existing control-flow token merging semantics.

Enables independent memory operations to execute without unnecessary
serialization.

Move token threading from inline codegen to a `token_order_pass!` that
runs on StructuredIRCode before bytecode emission. The pass:
- Inserts MakeTokenNode at function entry
- Adds token arguments to memory operations (loads/stores/atomics)
- Inserts JoinTokensNode and TokenResultNode for token flow tracking
- Uses alias analysis to give independent arrays independent tokens

This decouples token ordering decisions from codegen, matching cuTile
Python's architecture (res/cutile-python/src/cuda/tile/_passes/token_order.py).
Control flow token threading (loops, branches) is still handled by codegen
conservatively; the pass only transforms straight-line code before the first
control flow op. Per-alias loop carries will be added in a follow-up.
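The straight-line transform can be sketched in Python (a toy, not the Julia pass: the IR tuples and op names here are invented, and only the node names mirror the commit message):

```python
MEMORY_OPS = {"load", "store", "atomic"}
CONTROL_OPS = {"loop", "if", "while"}

def token_order_pass(stmts):
    """stmts: list of (op, args). Inserts a MakeTokenNode at entry,
    appends the current token to each memory op, and records each
    op's result token; stops at the first control-flow op, which
    codegen still handles conservatively."""
    out = [("MakeTokenNode", ())]
    token, counter = "%tok0", 1
    for i, (op, args) in enumerate(stmts):
        if op in CONTROL_OPS:
            out.extend(stmts[i:])   # leave control flow untouched
            break
        if op in MEMORY_OPS:
            out.append((op, args + (token,)))  # appended token arg
            token = f"%tok{counter}"           # this op's result token
            counter += 1
            out.append(("TokenResultNode", (token,)))
        else:
            out.append((op, args))
    return out
```

The point of the sketch: ordering decisions live entirely in the rewritten IR, so codegen only has to read the token argument back out.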

Key changes:
- New: codegen/irutils.jl — SSAMap mutation helpers (insert_before!, etc.)
- New: codegen/passes/ directory for IR passes
- Moved: alias_analysis.jl, token_keys.jl, token_order.jl → passes/
- Simplified: memory.jl, views.jl, atomics.jl — read token from IR args
  via extract_token_arg!(), fall back to ctx.token inside control flow
- Simplified: control_flow.jl — removed token_map save/restore, kept
  single-token loop carry (conservative)
- Removed: ctx.token_map, ctx.global_token, ctx.alias_result from CGCtx

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The key insight: adding one line to `tile_type_for_julia!` to handle
TokenType eliminates all token-specific conditionals from codegen.
Loop emitters, getfield extraction, and type mapping become uniform
— tokens are just another type flowing through the same code paths.
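The shape of the "one-line fix" in a Python toy (names and types invented; the real function is the Julia `tile_type_for_julia!`): one extra entry in the type mapping means downstream helpers need no token special case.

```python
class Float32: pass
class TokenType: pass

# the single extra mapping entry: tokens are just another tile type
TILE_TYPES = {Float32: "f32", TokenType: "token"}

def tile_type_for_julia(t):
    return TILE_TYPES[t]

def loop_carry_types(carries):
    # uniform: no is_token_type branch needed anywhere downstream
    return [tile_type_for_julia(t) for t in carries]
```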

Changes:
- tile_type_for_julia! maps TokenType → Token(tt) (the 1-line fix)
- extract_tile_shape handles TokenType (returns ScalarShape)
- token_order_pass! now recurses into loops and branches:
  - Adds per-alias-set token carries (init_values, BlockArgs, terminators)
  - Updates SSAMap types via update_type! to include TokenType parameters
  - Inserts Core.getfield for token result extraction after loops/ifs
- control_flow.jl simplified: no is_token_type branches, trusts parent_result_type
- Terminators no longer manually append ctx.token — pass handles it
- ctx.token removed from CGCtx entirely

This is a WIP: 196/202 codegen tests pass. 6 integration tests with
complex loop patterns (spinloop, nested loops) have BoundsErrors from
update_type! producing incorrect parameter counts — to be fixed next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three fixes:
1. Build loop result types from block args (authoritative source) instead of
   from old_type.typ which may be Nothing for void loops — fixes BoundsError
   in emit_loop_getfield! for nested spinloop patterns.

2. Thread parent_loop_effects through IfOp branches so that ContinueOp/BreakOp
   inside IfOp (common for LoopOp→IfOp while-loop patterns) get their token
   exit values appended. This was the cause of the "continue op operand mismatch"
   errors for for-loops with memory ops in the body.

3. Add ForOp body fallback ContinueOp (matching LoopOp) for completeness.
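Fix 1 can be sketched as follows (a hypothetical Python rendering; the real code is the Julia `emit_loop_getfield!`): derive result types from the loop body's block arguments, which always exist and carry types, instead of indexing into a recorded type that may be absent for void loops.

```python
def loop_result_types(block_args):
    """block_args: list of (name, type) for the loop body's block
    arguments — the authoritative source, present even for "void"
    loops whose recorded result type is None."""
    return [t for (_name, t) in block_args]

def emit_loop_getfield(block_args, index):
    types = loop_result_types(block_args)
    # safe: index is bounded by the block-arg count, so no
    # BoundsError-style failure on nested void loops
    return ("getfield", index, types[index])
```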

All 202 codegen tests pass. GPU execution of spinlock (hang.jl) still has a
token carry issue — break/continue inside the inner loop carry initial block
args instead of the updated CAS result tokens. To be debugged next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two fixes for WhileOp token threading:

1. After transforming the before block (which contains the CAS), propagate
   the updated token_map to after_token_map. Previously, after_token_map
   used the initial BlockArgs, so the YieldOp (which becomes ContinueOp
   in codegen) carried stale tokens instead of the CAS result.

2. Extend ConditionOp.args with exit tokens (like ContinueOp/BreakOp).
   The codegen-generated BreakOp reads from cond_op.args, so token values
   must be present there for the break path to carry the right tokens.

Result: both break and continue in the spinlock loop now carry
%result_token_12 (CAS acquire token) instead of the initial block args.
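Fix 1, sketched in Python (invented names; the real pass mutates Julia IR): the map handed to the yield/condition must reflect the before block's own memory-op results, not the stale entry block args.

```python
def before_block_token_map(entry_token_map, block_ops):
    """entry_token_map: alias-set -> token from the WhileOp's block
    args. block_ops: (alias_set, result_token) for each memory op in
    the 'before' block (e.g. the CAS)."""
    token_map = dict(entry_token_map)
    for alias_set, result_token in block_ops:
        # each memory op advances its alias set's token chain
        token_map[alias_set] = result_token
    return token_map  # propagate this, not entry_token_map
```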

Status: 202/202 codegen, hang.jl passes, 1581/1585 full suite.
Remaining: 3 early-return device tests + layernorm example.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hoist_returns! replaces ReturnNode terminators in IfOp branches with
empty YieldOp(). If the token pass runs first and extends the IfOp with
token yields, hoist_returns! wipes them out — causing "then branch does
not yield anything" errors.

Fix: run hoist_returns! first so the token pass sees normalized YieldOps.

Full suite: 1586/1587 pass (layernorm dW mismatch is pre-existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The has_release_order() stub returned false, meaning release-ordered
atomics didn't join with all LAST_OP tokens. Per the Tile IR memory
model: "When you use a release operation, you need to token-order all
memory events that must stay before the release to the release itself."

Without this, the release atomic_xchg in spinlock patterns didn't
depend on the data store's token — the store's writes weren't
guaranteed visible before the lock release, causing data corruption
in the layernorm backward kernel.

Fix: extract memory_order from atomic call args in the IR and pass it
through to collect_join_tokens_ir, which already had the release join
logic (line 150-152) but was never triggered.
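The release-join rule quoted above can be sketched in Python (a toy, not the real `collect_join_tokens_ir`; the order and set names are illustrative): a release-flavoured atomic joins the last token of every alias set, while a relaxed one joins only its own chain.

```python
RELEASE_ORDERS = {"release", "acq_rel", "seq_cst"}

def collect_join_tokens(order, own_set, last_op_tokens):
    """last_op_tokens: alias-set -> last token produced on that set."""
    if order in RELEASE_ORDERS:
        # token-order ALL prior memory events before the release
        return sorted(set(last_op_tokens.values()))
    t = last_op_tokens.get(own_set)
    return [t] if t is not None else []
```

This is exactly the spinlock failure mode described above: without the release join, the unlock exchange never depended on the data store's token.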

All 1586 tests pass including layernorm backward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Based on code review findings:

- Move resolve_call() to irutils.jl (shared by alias_analysis + token_order)
- Use resolve_call() in alias_analysis instead of inline call/invoke normalization
- Extract insert_token_result_getfields!() helper — replaces 3 copy-pasted
  blocks (~25 lines each) in transform_loop!, WhileOp, and IfOp transforms
- Remove dead after_arg assignments in WhileOp (side-effect-only calls)
- Remove duplicate old_type = get(...) line in IfOp transform
- Remove const IRToken = Any alias (no type safety value)

Net: -48 lines, 3 copy-paste blocks → 1 shared function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. ACQUIRE_TOKEN_KEY was updated for all atomics instead of only
   acquire/acq_rel-ordered ones, over-constraining relaxed atomics.
2. has_acquire effect was set unconditionally for all atomics in
   compute_block_memory_effects!, causing unnecessary token carries.
3. ALIAS_UNIVERSE was treated as overlapping with nothing instead of
   everything, potentially missing token dependencies for unknown aliases.
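A toy Python rendering of fixes 1 and 3 (set and function names invented; the real constants live in the Julia alias-analysis pass):

```python
ALIAS_UNIVERSE = "universe"
ACQUIRE_ORDERS = {"acquire", "acq_rel", "seq_cst"}

def may_alias(a, b):
    # fix 3: an unknown (universe) pointer must be treated as
    # potentially overlapping EVERYTHING, never as overlapping nothing
    if a == ALIAS_UNIVERSE or b == ALIAS_UNIVERSE:
        return True
    return a == b

def updates_acquire_key(order):
    # fix 1: only acquire-flavoured atomics touch the acquire token
    # key; relaxed atomics must not be over-constrained
    return order in ACQUIRE_ORDERS
```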

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@maleadt (Member) commented Mar 26, 2026

Follow-up changes:

  • Extract token ordering from inline codegen into a separate IR pass (token_order_pass!). Codegen no longer maintains ctx.token, but emits what the pass wrote into the IR.
  • Introduce IR node types (MakeTokenNode, JoinTokensNode, TokenResultNode) so tokens are first-class in the structured IR.
  • Memory intrinsics receive their input token as an appended argument (from the pass) and store result tokens in ctx.result_tokens instead of mutating ctx.token.
  • Add per-alias token carries through loops and branches (init values, block args, terminators).
  • Support release/acquire memory ordering on atomics (release joins all LAST_OP tokens; acquire updates global ACQUIRE token).

Sadly doesn't improve FFT performance...

@maleadt maleadt merged commit b9a9b91 into JuliaGPU:main Mar 26, 2026
9 checks passed