[tx] Implement context parallelism in tx with ring attention using ppermute #1149

tanmaysachan wants to merge 8 commits into NovaSky-AI:main

Conversation
PR ready for review. One ugliness is the wrapping of prefill_and_decode in a shard_map in JaxBackendImpl. Do let me know if a cleaner design is preferred.
Code Review
The pull request introduces Context Parallelism (CP) to the tx framework, adding support for 4D mesh construction (fsdp, ep, tp, cp) and CP-aware attention mechanisms. The changes include modifications to model configurations, CLI arguments, attention layer logic (including a new _ring_attention function), KV-cache updates, and sequence length bucketing. New tests have been added to validate CP functionality, including forward/backward execution, sampling, and parity checks against CP=1. The changes are well-structured and address the stated objective of implementing context parallelism.
```python
# TODO: constraints for running ring attention
if cp > 1 and (is_causal or q.shape[1] == 1):
```
The `TODO: constraints for running ring attention` comment indicates incomplete functionality or known limitations. These constraints should either be addressed or documented clearly for future development and users. If there are specific conditions under which _ring_attention must not be used, they should be explicitly handled or raised as errors.
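If the intent is to fail fast rather than silently fall back, a guard along these lines could make the constraints explicit. This is a hedged sketch: the function name and the specific conditions are hypothetical, not part of tx.

```python
def validate_ring_attention_inputs(cp: int, is_causal: bool, q_len: int, seq_len: int) -> None:
    """Hypothetical precondition check for ring attention; tx's actual rules may differ."""
    if cp <= 1:
        return  # no context parallelism, nothing to constrain
    if seq_len % cp != 0:
        # ring attention rotates equal-sized KV blocks, so the sequence
        # must split evenly across cp shards
        raise ValueError(f"seq_len={seq_len} must divide evenly across cp={cp} shards")
    if q_len == 1:
        # single-token decode reads the KV cache instead of rotating blocks
        raise ValueError("single-token decode should take the cached-KV path, not ring attention")
```

A guard like this would run just before dispatching to _ring_attention, turning the TODO into an explicit contract.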
```python
positions = jnp.broadcast_to(jnp.arange(q.shape[1], dtype=jnp.int32), (q.shape[0], q.shape[1]))
result = dot_product_attention(q, k, v, mask, is_causal=is_causal, head_dim=head_dim, positions=positions)
```
The positions array is created using jnp.broadcast_to which might not be necessary if jnp.arange already produces the desired shape. Consider simplifying this to positions = jnp.arange(q.shape[1], dtype=jnp.int32)[None, :] if the batch dimension is always 1 for this test case, or ensuring broadcast_to is truly needed for varying batch sizes in other test scenarios.
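Since jnp follows NumPy broadcasting rules, the equivalence the review is pointing at can be checked with a plain NumPy sketch (shapes here are illustrative):

```python
import numpy as np

batch, seq = 2, 4
explicit = np.broadcast_to(np.arange(seq, dtype=np.int32), (batch, seq))  # shape (2, 4)
implicit = np.arange(seq, dtype=np.int32)[None, :]                        # shape (1, 4)

# In elementwise ops, the (1, seq) form broadcasts against (batch, seq)
# and behaves identically to the materialized array.
assert (explicit == implicit).all()
```

So the explicit broadcast_to only matters if downstream code requires the positions array itself to have the full (batch, seq) shape rather than relying on broadcasting.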
```python
kv_repeat = q.shape[2] // k.shape[2]
k_block = jnp.repeat(k, kv_repeat, axis=2)
v_block = jnp.repeat(v, kv_repeat, axis=2)
mask_block = attention_mask
```
The comment `GQA handling: expand KV heads to match query heads.` is slightly imprecise. jnp.repeat tiles each KV head so the number of KV heads matches the number of query heads, which is the standard way to broadcast KV to Q in GQA. The comment could state explicitly that the expansion exists for broadcasting purposes.
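The shape effect of the repeat can be illustrated in NumPy, which shares jnp.repeat's semantics. The dimension layout (batch, seq, heads, head_dim) and the sizes are assumptions for the sake of the example:

```python
import numpy as np

batch, seq, q_heads, kv_heads, head_dim = 1, 3, 8, 2, 4
k = np.arange(batch * seq * kv_heads * head_dim).reshape(batch, seq, kv_heads, head_dim)

kv_repeat = q_heads // kv_heads            # here: 4 query heads share each KV head
k_block = np.repeat(k, kv_repeat, axis=2)  # (1, 3, 2, 4) -> (1, 3, 8, 4)

# After the repeat, query head i lines up with original KV head i // kv_repeat.
assert k_block.shape == (batch, seq, q_heads, head_dim)
assert (k_block[:, :, 5] == k[:, :, 5 // kv_repeat]).all()
```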
```python
max_len = round_up_seq_len(max((len(seq) for seq in batch_prompts), default=0))
input_ids = pad_batch(batch_prompts, max_len, np.int32)
attention_mask = pad_batch([[1] * len(seq) for seq in batch_prompts], max_len, np.int32)
positions = np.arange(max_len, dtype=np.int32)[None, :]
```
The positions array is created using np.arange(max_len, dtype=np.int32)[None, :]. While this works for the current use case, it might be more robust to use default_positions(input_ids) from tx.layers.attention to ensure consistency with how positions are generated elsewhere, especially if default_positions handles CP-specific offsets.
```python
positions = default_positions(input_ids)
```
Same as the other one: we need to be under a shard_map for the cp axis_index to exist. This `positions` can also optionally be left as None, to be filled in by the generator.
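A NumPy emulation can show what a CP-aware position default would compute under shard_map. Here `axis_idx` stands in for `jax.lax.axis_index("cp")`, which only exists inside the shard_map; the helper name is hypothetical:

```python
import numpy as np

def cp_positions(axis_idx: int, local_len: int, batch: int) -> np.ndarray:
    # Each cp shard owns a contiguous slice of the global sequence, so its
    # local positions start at axis_idx * local_len.
    offset = axis_idx * local_len
    return np.broadcast_to(offset + np.arange(local_len, dtype=np.int32), (batch, local_len))

# shard 1 of cp=2 over a length-8 sequence owns global positions 4..7
assert cp_positions(1, 4, 2).tolist() == [[4, 5, 6, 7], [4, 5, 6, 7]]
```

This is exactly why calling such a helper outside shard_map fails: without the collective axis in scope there is no axis_idx to offset by.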
```python
cp = jax.sharding.get_abstract_mesh().shape.get("cp", 1)
local_capacity = k_cache.shape[1]
update_positions = positions[:, 0] % local_capacity
owners = positions[:, 0] // local_capacity
axis_idx = jax.lax.axis_index("cp") if cp > 1 else 0
should_update = owners == axis_idx
```
The update_at_pos function uses jnp.where(do_update, updated, cache_slice). This conditional update is correct for CP, but it introduces a potential performance overhead if do_update is frequently False for a large portion of the batch. While necessary for correctness, it's worth noting as a potential area for optimization if profiling reveals it to be a bottleneck, perhaps by using jax.lax.cond or jax.lax.select if the conditions are static enough.
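The owner/slot arithmetic in the hunk above can be sanity-checked in plain NumPy. The concrete values assume cp=2 over max_len=32, so each shard's KV cache holds 16 slots:

```python
import numpy as np

local_capacity = 16  # per-shard KV cache length (cp=2 over max_len=32)
positions = np.array([[19], [3]], dtype=np.int32)  # first decode position per batch row

update_positions = positions[:, 0] % local_capacity  # slot within the local cache
owners = positions[:, 0] // local_capacity           # which cp shard owns that slot

assert update_positions.tolist() == [3, 3]
assert owners.tolist() == [1, 0]  # position 19 lives on shard 1, position 3 on shard 0
```

Only the shard whose axis_index matches `owners` writes the new KV entry; the others keep their cache slice unchanged via the `jnp.where` the review mentions.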
```python
if positions is None:
    positions = jnp.arange(prompt_length, dtype=jnp.int32)[None, :]
```
The positions array is initialized with jnp.arange(prompt_length, dtype=jnp.int32)[None, :] if positions is None. To maintain consistency and leverage the CP-aware logic, it would be better to use default_positions(input_ids) from tx.layers.attention here, as it correctly handles CP shard offsets.
```python
if positions is None:
    positions = default_positions(input_ids)
```
The axis_index for cp might not exist here. default_positions is only used when we are under a shard_map.
🔴 Decode loop initialization uses local attention_mask sum instead of global under CP, producing wrong positions and logits on non-owning shards
When _prefill_and_decode runs inside shard_map with CP > 1, the attention_mask is sharded along the CP axis. Each shard only sees its local portion. The line last_token_idx = attention_mask.sum(axis=1) - 1 computes the LOCAL count of valid tokens in each shard's slice, not the global count. This causes two critical issues:
Root Cause and Impact
Consider a prompt of length 20, padded to max_len=32, with cp=2. Each CP shard gets 16 positions:

- Shard 0 (positions 0–15): `attention_mask.sum() = 16`, `last_token_idx = 15` → extracts logits from global position 15 (NOT the actual last valid token at position 19)
- Shard 1 (positions 16–31): `attention_mask.sum() = 4`, `last_token_idx = 3` → extracts logits from global position 19 (correct)
Shard 0 extracts logits from the wrong position, samples a potentially different token, and starts the decode loop at position 16 instead of 20. Meanwhile, shard 1 produces the correct output. Since the generated tokens output spec is P("fsdp", None) (replicated across CP) with check_vma=False, the inconsistency is silently ignored and the result is undefined — whichever shard's values happen to be read will determine the output.
For short prompts fitting entirely within shard 0 (as in the PR's parity tests with length-5 prompts padded to 32), shard 0 holds all valid tokens and coincidentally produces the correct result, masking this bug. But any prompt spanning multiple CP shards will produce incorrect generation.
Additionally, decode positions derived from last_token_idx (s.last_positions + 1) diverge across shards, causing wrong RoPE embeddings and KV cache writes on the non-owning shards.
(Refers to line 199)
Prompt for agents
In skyrl-tx/tx/utils/generator.py, the _prefill_and_decode function at line 199 computes last_token_idx = attention_mask.sum(axis=1) - 1. Under CP (context parallelism via shard_map), attention_mask is the local shard's slice, so this gives the local count rather than the global last-token index.
To fix this, _prefill_and_decode needs to derive the global last valid token position and ensure all CP shards agree on the same decode starting state (logits, positions, etc.). Possible approaches:
1. Before entering the decode loop, use a collective (e.g., jax.lax.pmax across the 'cp' axis) to find the global last token position, then broadcast the correct initial logits and positions to all shards.
2. Alternatively, restructure the CP sample path so that prefill runs under shard_map (with ring attention), but the decode loop runs outside shard_map with a single-device KV cache. The prefill outputs (last hidden state, KV cache) would be gathered before entering the decode loop.
The fix must ensure that (a) the correct logits from the global last valid token are used for the first sampled token, (b) all CP shards use the same decode positions for RoPE and KV cache updates, and (c) the generated tokens are consistent across CP shards to satisfy the replicated output spec.
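Approach 1 can be sketched outside JAX. The NumPy reduction below emulates a `jax.lax.pmax` over the cp axis: each shard proposes its candidate global last-token index (or -1 if it holds no valid tokens), and the maximum across shards is the answer. The function name and shapes are illustrative only:

```python
import numpy as np

def global_last_token_idx(local_masks: list[np.ndarray]) -> np.ndarray:
    # local_masks[i] is shard i's (batch, local_len) slice of attention_mask.
    local_len = local_masks[0].shape[1]
    candidates = []
    for axis_idx, mask in enumerate(local_masks):
        local_count = mask.sum(axis=1)
        # shard offset + local last valid slot; -1 if the shard has no valid tokens
        local_last = axis_idx * local_len + local_count - 1
        candidates.append(np.where(local_count > 0, local_last, -1))
    # stands in for jax.lax.pmax over the "cp" axis
    return np.maximum.reduce(candidates)

# prompt of length 20, max_len=32, cp=2: shard 0 sees 16 valid tokens, shard 1 sees 4
shard0 = np.ones((1, 16), dtype=np.int32)
shard1 = np.concatenate([np.ones((1, 4), np.int32), np.zeros((1, 12), np.int32)], axis=1)
assert global_last_token_idx([shard0, shard1]).tolist() == [19]
```

In the real kernel each shard would compute only its own candidate and the pmax collective would perform the reduction, leaving every shard with the same global index to seed the decode loop.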
Yeah, finding a way to do this cleanly.
Addresses #1056
This PR implements Context Parallelism (CP).
CP is now an axis alongside fsdp/ep/tp, with parity coverage against CP=1.
[AI wrote the description]
What’s included
Validation
Tested over multi-CPU configs.
Tasks