
feat: deepseek-v4 model support #698

Draft

wenxie-amd wants to merge 58 commits into main from dev/wenx/deepseek-v4

Conversation


@wenxie-amd wenxie-amd commented Apr 28, 2026

Summary

This PR brings DeepSeek-V4 training support into Primus on the Megatron backend.

It now spans the full bring-up arc (P0 – P10), the plan-2 lockdown (P12) that closes out plan-0 / plan-1, and the architecture-faithful rewrite plan for the remaining work (P13 – P21).

Plan timeline

| Plan | Phases | Window | Status |
|---|---|---|---|
| plan-0 (develop/plan-0/) | P0 – P7 | 2026-04-28 | done — initial bring-up, configs, dispatch, layer specs, HC + Hybrid Attn, MoE / activation / RoPE / MTP, single-node smoke (PP=2 EP=4) |
| plan-1 (develop/plan-1/) | P8 – P11 | 2026-04-29 → 2026-04-30 | partial — P8 / P9 / P10 done; P11 paused by the architecture review |
| plan-2 (develop/plan-2/) | P12 – P19 (+ deferred P20 / P21 / P22+) | 2026-05-01 (lockdown) → 2026-05-07 (P19 close-out) | wrapping up — P12 / P13 / P14 / P15 / P16 / P17 / P18 / P19 done; pre-training-first scope means P20 (perf / convergence gates), P21 (docs / handover), and P22+ (HF state-dict adapter) are all deferred follow-ups, gated by the next campaign that needs them |

Plan-2 reshuffle — 2026-05-01 (commit f548d8b2, docs-only)

Pre-training is the release path; HF-weight loading is not required for the release. Plan-2 phase shape after this reshuffle:

| Phase | New scope | Notes |
|---|---|---|
| P17 | Code cleanup (was: state-dict adapter) | retire _RMSNorm duplicates / dual_rope.py / csa_attention.py / hca_attention.py / legacy DeepseekV4MTPBlock / EP all_reduce fallback gate / _v4_token_ids residue / yaml comment fixes. New gate G14 (static dead-code audit). |
| P18 | Spec audit | unchanged; _v4_token_ids removal moved to P17 |
| P19 | Distributed re-validation | unchanged; G6 / G7 still here |
| P20 | Convergence + perf gates | HF numerical-alignment row removed; convergence baseline switched to Megatron-bridge |
| P21 | Docs + handover (slimmed; cleanup tasks moved to P17) | techblog / progress HTML / PPT / develop_deepseek-v4-in-primus.md only |
| P22+ | HF state-dict adapter + V4-Flash checkpoint load (deferred) | Activate when SFT / evaluation needs HF weights. Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section). G8 / G9 deferred from P17; HF-numerical-alignment portion of G12 also deferred here. |

Why plan-2

A code review of dev/wenx/deepseek-v4 against real DeepSeek-V4 (HF reference, NeMo port, official inference) and Megatron's spec + config + provider + submodule + build_module pattern surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW). Highlights:

  • Attention uses separate linear_k_proj / linear_v_proj; real V4 has a single-latent wkv (K = V = kv).
  • q_norm / kv_norm per-head RMSNorms are missing.
  • HashRouter outputs uniform 1/topk weights with no learnable gate.
  • clamped_swiglu clamps post-mul; real V4 clamps pre-mul on silu(gate) and up.
  • No state-dict adapter: official V4-Flash / V4-Pro HF safetensors cannot be loaded.
  • DeepseekV4Attention / DeepseekV4TransformerBlock / DeepseekV4HybridLayer / DeepseekV4MoE reinvent rather than subclass MLASelfAttention / TransformerBlock / TransformerLayer / MoELayer.

Plan-2 (develop/plan-2/) is the architecture-faithful rewrite. Full review in develop/plan-2/00-review-findings.md; rewrite map in 02-target-architecture.md; phase-by-phase plan in 03-phase-details.md; gates in 04-test-strategy.md.

Commit map

| commit | phase | scope |
|---|---|---|
| e194e039 | docs | architecture deep-dive + plan docs |
| d3383c02 | P1 | configs / yaml + tokenizer |
| 8ae10000 | P2 | model_type=deepseek_v4 dispatch |
| a5d2a561 | P3 | layer spec + block scaffolding |
| 3b7ad8c8 | P4 | HC + Hybrid Attention + dual-RoPE |
| 5e4008dc | P5 | V4 MoE + clamped SwiGLU + V4 MTP |
| 97b9720d | P6-P7 | PP/EP integration fixes + single-node run script + progress docs |
| df273a45 | P8(v2) | LanguageModule migration + DeepSeek runtime spec-tree main path |
| e5fec968 | P9(v2) | provider reuse integration + TE CUDA runtime validation/report |
| b38e83cf | P10(v2) | enforce MoE provider path and add V4 config schema |
| 752b7534 | P10(v2) | stabilize smoke runtime and add phase report |
| 636ab3de | P12(v3) | plan-2 lockdown + as-built techblog + roadmap visuals |
| cad0fb38 | P13(v3) | rebase V4 attention on MLASelfAttention (faithful dense path) |
| aa9929a0 | P13(v3) | fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy |
| 1a8bf32e | P14(v3) phase-1 | faithful pre-mul clamped SwiGLU + V4 routers (learnable gate weight; HF-aligned scoring) + G3 + G4 unit tests |
| 5fe8bc3c | P14(v3) phase-2 | DeepseekV4MoE -> MegatronModule + CPU local-experts path; v4_grouped_mlp_spec / v4_router_spec providers; G5 (1L MoE forward <= 1e-3 vs HF reference) |
| 25ccdb5e | P15(v3) | DeepseekV4HybridLayer -> TransformerLayer; DeepseekV4TransformerBlock -> TransformerBlock; HC x PP K-stream packing helpers; HyperHead only on post_process; token_ids forward kwarg replaces decoder._v4_token_ids stash; 16 unit tests |
| 6c5875d4 | P16(v3) | spec-based MTP via upstream MultiTokenPredictionBlock + process_mtp_loss; get_v4_mtp_block_spec helper; layer forward returns (hidden_states, None) for MTP-call compatibility; legacy DeepseekV4MTPBlock deprecated; 17 unit tests |
| f548d8b2 | docs | plan-2 reshuffle — defer HF state-dict adapter to P22+; repurpose P17 for code cleanup; add G14 gate; update roadmap / phase-details / test-strategy / status / README |
| e591b893 | P17(v3) | dead-code retirement (G14): delete legacy DeepseekV4MTPBlock + v4_use_custom_mtp_block / mtp_compress_ratios config fields; introduce shared LocalRMSNorm helper and dedup three _RMSNorm shadows (block.py / attention.py / compressor.py); fix inverted yaml comment (4=CSA / 128=HCA); refresh package __init__ surface; add tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (G14 audit). dual_rope.py is intentionally kept — load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent. |
| b5832672 | P18(v3) | spec-system audit (D1 / D2 / D4 / G1): build_context.resolve_v4_provider(config) caches the V4 provider on the config object (replaces three direct DeepSeekV4SpecProvider(...) call sites); new provider.v4_mlp_activation_func() returns None when use_te_activation_func=False (V4 default — clamped-SwiGLU eager path) and TEActivationOp otherwise; compress_ratios normalized to tuple[int, ...] in __post_init__ (so runtime never re-runs ast.literal_eval); new tests/unit_tests/configs/test_deepseek_v4_yaml.py (G1 schema gate) + tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (D1 / D2 / package-surface AST audits). |
| 83c33ad0 | P19(v3) | distributed re-validation (G10) — two primus-patches that close PP > 1 + VPP under V4: megatron.deepseek_v4.pp_tensor_shape (wraps both schedules.get_tensor_shapes for 1F1B and forward_backward_pipelining_with_interleaving for VPP, multiplies the seq dim by hc_mult so the PP wire carries V4's mHC [S*K, B, D] packing) and megatron.deepseek_v4.pp_token_pre_broadcast (pre-broadcasts all microbatch / chunk input_ids from PP rank 0 across the PP group upfront in a wrapper around get_forward_backward_func, so middle PP stages owning hash-routed MoE layers see real token IDs without deadlocking the interleaved-1F1B / VPP schedule). Drops the in-forward PP broadcast + VPP fail-fast assert from DeepseekV4Model, and stops pre-assigning self.mtp = None so Megatron's set_current_microbatch only iterates model.mtp.layers when MTP is live (matches upstream GPTModel). |
| dba27163 | plan-2 close-out | docs-only — mark the c10d::allreduce_ autograd warning as gone (verified absent in P19 smokes A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12); mark G11 (routing-snapshot diff = 0 across PP / EP changes) as deferred (snapshot dump tooling never landed; not on the pre-training release path); drop Phase 20 / 21 / 22+ sections from status.md (kept as documented intent in plan-2/03-phase-details.md); add deepseek-v4/develop/progress/plan-2-summary.md (stand-alone summary of the architecture-faithful rewrite from P12 → P19, including a per-phase outcome table, a P19 deep-dive, the test-gate ledger, the plan-1 → plan-2 architectural-shift table, and pointers to logs / profile traces); add P19 profile launchers (run_profile_ep8.sh for TP=1 PP=1 EP=8 and run_profile_pp2_ep4.sh for TP=1 PP=2 EP=4) plus deepseek-v4/download_ref.sh (idempotent helper that ensures git-lfs and clones the V4 reference assets — HF transformers, ROCm/TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, and the four DeepSeek-V4 model repos — at pinned commits with GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default). |

What landed in 97b9720d (P6/P7)

P6 integration

  • deepseek_v4_builders.py
    • Align model_provider with upstream Megatron signature (config, pg_collection).
  • deepseek_v4_block.py
    • Build only local PP layers via get_num_layers_to_build + get_transformer_layer_offset.
    • Add set_input_tensor support for non-first PP stages.
    • Normalize/parse compress_ratios more robustly.
    • Return viewless output via make_viewless_tensor for PP schedule compatibility.
  • v4_moe.py
    • Add EP-aware local expert sharding and EP all-reduce merge path.
  • deepseek_v4_model.py
    • Keep custom V4 MTP block behind v4_use_custom_mtp_block; default to native GPTModel MTP path for stable bring-up.
  • dual_rope.py, deepseek_v4_attention.py, attn_sink.py
    • Rename DualRoPE.apply -> apply_rope (avoid nn.Module.apply conflict).
    • Cast attention probs to value dtype before matmul to avoid bf16 mismatch.

P7 bring-up

  • Add run_deepseek_v4.sh (based on run_qwen.bak.sh) with fixed knobs:
    • MBS=1, GBS=16, TP=1, PP=2, EP=4
    • lightweight smoke overrides (num_layers=8, num_experts=8, mtp_num_layers=0)
  • Single-node run passed on:
    • host: uswslocpm2m-106-2371
    • container: dev_primus_wenx_691
    • command: TRAIN_ITERS=3 ./run_deepseek_v4.sh
    • result: reached iteration 3/3, torchrun exit code 0

What landed in df273a45 (P8 v2)

  • deepseek_v4_model.py
    • DeepseekV4Model now inherits from LanguageModule (no longer GPTModel).
    • Remove super_init_transformer_layer_spec path.
    • Build decoder directly from externally supplied DeepSeek runtime transformer_layer_spec.
  • deepseek_v4_layer_specs.py
    • Remove GPT placeholder-spec helpers.
    • Keep DeepSeek-native runtime spec tree only, with full layer/submodules topology.
  • deepseek_v4_builders.py
    • Resolve/pass runtime decoder spec only; remove GPT placeholder/super-init dependence.
  • deepseek-v4/develop/progress/status.md
    • Mark Phase 8(v2) tasks completed and sync notes with the finalized implementation.

Runtime verification:

  • On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
    • Instantiate DeepseekV4Model (LanguageModule-based) with runtime spec tree.
    • Forward pass succeeds with output shape (128, 2, 256).

What landed in e5fec968 (P9 v2)

  • core/extensions/transformer_engine_spec_provider.py
    • Add DeepSeekV4SpecProvider(PrimusTurboSpecProvider) as the V4 provider entry point.
    • Resolve runtime mode (local / te / turbo) and expose V4-specific provider helpers for norm/grouped-MLP selection.
  • deepseek_v4_layer_specs.py
    • Resolve provider once at spec-build time and route norm, attention projection specs, dense projection specs, and MoE grouped path payload through provider-aware ModuleSpec construction.
  • deepseek_v4_attention.py
    • Refactor attention projections to submodules + build_module via DeepseekV4AttentionSubmodules (q_a, q_b, k_proj, v_proj, o_proj) with local fallback.
  • deepseek_v4_block.py
    • Align dense MLP projection initialization with provider-selected linear modules.
    • Add explicit fail-fast guard: TE/Turbo provider mode requires CUDA hidden_states.
  • v4_moe.py
    • Integrate provider grouped-GEMM expert path with safe fallback to local clamped SwiGLU experts.
  • docs/status updates
    • Add deepseek-v4/develop/plan-1/03-phase9-provider-ab-report.md.
    • Update deepseek-v4/develop/progress/status.md with completed Phase 9(v2) items and English-only notes.

Runtime verification:

  • On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
    • local mode forward passes (Linear projections).
    • TE mode module-map build resolves to TELinear projections.
    • TE mode CUDA forward passes (decoder.cuda() + CUDA inputs).
    • TE/Turbo host-input path now fails fast with explicit runtime error instead of low-level GPU fault.

What landed in b38e83cf (P10)

  • core/transformer/moe/v4_moe.py
    • enforce SharedExpertMLP-only shared-expert path (remove local ClampedSwiGLUMLP fallback for shared experts).
    • wire clamped-SwiGLU behavior through SharedExpertMLP config path.
  • core/models/deepseek_v4/deepseek_v4_transformer_config.py
    • add DeepSeekV4TransformerConfig (inherits MLATransformerConfig) with DeepSeek-V4 specific fields used by V4 runtime modules.
    • align aliases/compat in __post_init__ (norm_epsilon, moe_intermediate_size, clamp sync, vocab/padded vocab sync).
  • deepseek_v4_builders.py
    • explicitly build V4 model config via core_transformer_config_from_args(..., config_class=DeepSeekV4TransformerConfig).
  • V4 modules/specs type wiring
    • update V4 builder/spec/model/attention/MoE module signatures and type hints to consume DeepSeekV4TransformerConfig.
  • model yaml
    • add activation_func_clamp_value to primus/configs/models/megatron/deepseek_v4_base.yaml with clamped-SwiGLU comment.
  • docs/progress
    • refresh deepseek-v4/develop/plan-1/* and deepseek-v4/develop/progress/status.md for Phase10 implementation notes.

Validation in this commit:

  • pre-commit hooks passed (isort/autoflake/black/yaml checks).
  • Python syntax compile checks passed for all touched DeepSeek-V4 runtime files.

What landed in 752b7534 (P10 runtime stabilization + report)

  • run_deepseek_v4.sh
    • add smoke-safe overrides for Phase 10 validation (seq_length/max_position_embeddings=128, index_topk=8).
    • set v4_grouped_experts_support_clamped_swiglu=True for grouped-expert clamped-SwiGLU runtime guard compliance.
    • disable overlap_grad_reduce and overlap_param_gather in smoke mode to avoid DDP bucket reset assertion between iterations.
  • primus/backends/megatron/core/transformer/hyper_connection.py
    • align F.linear weight dtype to activation dtype in HyperMixer and HyperHead to fix BF16 runtime mismatch.
  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • cast attention output back to activation dtype before TE output projection to satisfy TE dtype assertions.
  • deepseek-v4/develop/plan-1/04-phase10-moe-distributed-convergence-report.md
    • add formal Phase 10 report covering delivered architecture, runtime blocker/fix chain, and remaining tracked items.

Runtime verification in this update:

  • host: uswslocpm2m-106-2371
  • container: dev_primus_wenx_691
  • command: ./run_deepseek_v4.sh
  • result: reached iteration 10/10, and torchrun finished successfully (code 0).

What landed in 636ab3de (P12 — plan-2 lockdown)

Documentation-only commit; no runtime code changes.

Architecture review

  • Walked the branch e194e039..HEAD against:
    • deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/{config.json, inference/model.py}
    • HF Transformers PR 45616 / 45643 (deepseek-v4/transformers/.../deepseek_v4/)
    • NVIDIA NeMo AutoModel V4 port (deepseek-v4/NVIDIA-NeMo/Automodel/...)
  • Surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW), spanning architecture faithfulness, Megatron reuse / spec violations, distributed correctness, spec-system hygiene, code quality, and testing gaps.

Plan-2 documents (active plan of record)

  • deepseek-v4/develop/plan-2/README.md
  • deepseek-v4/develop/plan-2/00-review-findings.md — full severity-ranked findings ledger
  • deepseek-v4/develop/plan-2/01-roadmap.md — phases P12 → P21, dependency graph, milestones, top risks
  • deepseek-v4/develop/plan-2/02-target-architecture.md — module-by-module rewrite map (rebases on MLASelfAttention, TransformerLayer, TransformerBlock, MoELayer, MultiTokenPredictionBlock, (Yarn)RotaryEmbedding)
  • deepseek-v4/develop/plan-2/03-phase-details.md — granular tasks / exit criteria / risks per phase
  • deepseek-v4/develop/plan-2/04-test-strategy.md — L0..L3 test pyramid and release gates G1..G14 (G8 / G9 marked deferred → P22+ since the 2026-05-01 reshuffle)

Plan-1 phases 9 / 10 / 11 are paused — their tracking rows in status.md remain for history.

Tech blog closure

  • Added deepseek-v4/develop/techblog/02-plan-1-as-built-and-plan-2-pointer.md: closes plan-0 / plan-1 with an as-built note (what shipped, what fell short) and points readers at plan-2.
  • Updated deepseek-v4/develop/techblog/README.md with a banner declaring plan-2 the active plan of record.

Layout cleanup + visuals

  • Renamed develop/plan/ → develop/plan-0/ (the original bring-up plan; tracked as a rename).
  • Added develop/progress/timeline.html: standard system-fonts version of the project timeline; daily-column Gantt with a May 02 – 05 Holiday band; remaining nine phases (P13 – P21) packed into the May 06 – 09 working window.
  • Added develop/progress/build_roadmap_pptx.py (generator) + develop/progress/deepseek_v4_roadmap_v1.pptx (13-slide tech-style deck on a black background, 16:9). Slide 7 — 07 · Development Plan · DEVELOPMENT SCHEDULE — is the day-by-day plan with a 3-row layout (date chip / P0~P7-style phase chip / work-content card) plus a directional arrow with the holiday-gap marker.

Status tracker

  • develop/progress/status.md now has explicit Phase 12 → Phase 21 (v3) sections.
  • All P12 engineering items are checked off; only the stakeholder sign-off on plan-2 scope remains open.
  • The blockers/risks log carries one row per CRIT finding, each pointing at the plan-2 phase that resolves it.

Schedule

  • Block A (landed): 2026-04-28 → 2026-05-01 — plan-0 P0 – P7 + plan-1 P8 – P10 + plan-2 P12 lockdown.
  • Holiday: 2026-05-02 → 2026-05-05.
  • Block B (planned): 2026-05-06 → 2026-05-09 — plan-2 P13 – P21 across 4 working days (P13 + P14 / P15 + P16 / P17 + P18 / P19 + P20 + P21). Note: P17 scope changed to code cleanup per the 2026-05-01 reshuffle; HF state-dict adapter + V4-Flash numerical alignment is deferred to P22+ and not in this Block B window.

What landed in cad0fb38 + aa9929a0 (P13 — faithful attention)

Plan-2 P13 lands in two commits inside the May 06 budget. Both are scoped strictly to the dense / CSA / HCA attention path; faithful MoE / router / MTP are tracked in P14 / P15 / P16. (HF state-dict adapter — originally planned for P17 — has since been deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need it.)

cad0fb38 — V4-faithful attention rooted on MLASelfAttention (dense path)

Rewrite the dense (compress_ratio == 0) path of DeepSeek-V4 attention to be faithful to the released DeepSeek-V4-Flash checkpoint and rooted on Megatron's MLASelfAttention.

  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • New DeepseekV4Attention(MLASelfAttention) subclasses MLA for type identity but bypasses the parent __init__ chain because V4's KV layout differs from MLA's compressed-KV form.
    • Single-latent KV: one linear_kv projection (hidden -> head_dim) feeds both K and V, broadcast across all query heads.
    • Per-head q_rms: parameter-less RMS on head_dim after linear_q_up_proj and before partial RoPE (no q_rms.weight in the released checkpoint).
    • Grouped low-rank O: einsum-based linear_o_a per group + linear_o_b when o_lora_rank > 0. Falls back to MLA-style flat linear_proj when o_lora_rank == 0.
    • Learnable attn_sink: direct nn.Parameter on the attention (matches the released key layers.{i}.attn.attn_sink exactly), with inline softmax-with-sink in _attention_forward.
    • New DeepseekV4AttentionSubmodules dataclass with MLA-canonical names (linear_q_down_proj, linear_q_up_proj, q_layernorm, kv_layernorm) plus V4 extras (linear_kv, linear_o_a, linear_o_b, attn_sink).
    • _LegacyDeepseekV4Attention retained temporarily as the parent for CSAAttention / HCAAttention until the P13 follow-up commit folds the compressor / indexer into the new class.
  • primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
    • Added v4_q_layernorm(), v4_kv_layernorm(), v4_attention_sink() factory methods.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
    • Routes compress_ratio == 0 to the new class with V4-canonical submodules; legacy path retained for {4, 128}.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_transformer_config.py
    • Added o_groups: int = 8 and o_lora_rank: int = 0.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
    • State-dict-key contract; forward shape + finiteness; numerical equivalence vs an inline V4 reference (single-latent KV, partial interleaved RoPE, attn-sink as virtual key column, grouped low-rank O), with attn_sink enabled and disabled (≤ 1e-3); per-head q_rms is parameter-less; o_lora_rank == 0 fallback path; rejection paths.
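
For reference, a minimal sketch of the softmax-with-sink behavior exercised by the last test above (the sink acts as one virtual key column per head whose probability mass is then dropped); tensor layout and names are illustrative, not the exact Primus implementation:

```python
import torch

def softmax_with_sink(scores: torch.Tensor, attn_sink: torch.Tensor) -> torch.Tensor:
    """scores: [B, H, Sq, Sk] attention logits; attn_sink: [H] learnable sink logits."""
    b, h, sq, _ = scores.shape
    # Append one virtual key column per head holding the sink logit.
    sink_col = attn_sink.view(1, h, 1, 1).expand(b, h, sq, 1)
    logits = torch.cat([scores, sink_col], dim=-1)
    probs = torch.softmax(logits, dim=-1)
    # Drop the sink column: it only absorbs probability mass, so the
    # remaining attention rows intentionally sum to less than 1.
    return probs[..., :-1]
```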

aa9929a0 — Fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy

Closes P13 by folding the compressed-branch attention into the V4-faithful class as spec submodules, switching the TP-sensitive projections to ColumnParallel / RowParallel, and retiring the plan-1 legacy attention classes.

  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • DeepseekV4Attention.__init__ accepts compress_ratio in {0, 4, 128}. When compress_ratio > 0 it builds self.compressor from submodules.compressor; when compress_ratio == 4 it also builds self.indexer from submodules.indexer.
    • DeepseekV4AttentionSubmodules extended with compressor and indexer fields.
    • DeepseekV4Attention.forward now dispatches on self.compress_ratio:
      • 0 — dense / SWA over local KV.
      • 128 — HCA: compressed pool with compress-base partial RoPE on indices [0..P), broadcast to H heads, concat to local KV with a compressed-causal mask, joint softmax-with-sink shared across local + compressed branches.
      • 4 — CSA: per-query top-K from compressed pool via Indexer + overlap-mode Compressor, joint softmax-with-sink across local + sparse keys.
    • _LegacyDeepseekV4Attention and _LegacyDeepseekV4AttentionSubmodules removed.
  • primus/backends/megatron/core/transformer/{csa,hca}_attention.py deleted.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
    • _build_v4_attention_submodules now also builds compressor / indexer ModuleSpecs for compressed branches.
    • linear_q_up_proj switched to provider.column_parallel_linear() (gather_output=True); linear_o_b (grouped) and linear_proj (flat-O fallback) switched to provider.row_parallel_linear() (input_is_parallel=False). At tp > 1 the projection weights are sharded across TP ranks; at tp = 1 the result is bit-identical to the previous duplicated path. linear_q_down_proj, linear_kv, linear_o_a stay duplicated; full grouped-O TP plan is tracked in P14.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • _build_attention (no-spec fallback) now constructs DeepseekV4Attention for all branches; the new class builds its own Compressor / Indexer locally when no spec is provided.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
    • HCA forward shape + finiteness + numerical equivalence vs an inline reference (≤ 1e-3); CSA forward shape + finiteness; spec wiring contract tests for ColumnParallel / RowParallel and Compressor / Indexer presence; torchrun --nproc_per_node=2 parity scaffold (skipif single-rank).

Status

  • deepseek-v4/develop/progress/status.md: P13 fully checked off (including the items previously deferred to the follow-up commit). Items routed to P14 (full grouped-O TP plan) / P22+ — deferred (HF-reference numerical alignment via the state-dict adapter, originally P17) / P19 (full TP=2 sharding-parity bit-equality check) are noted as such on each row.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P13 first commit cad0fb38 (early start; the May 02 – 05 holiday remains).
  • Holiday: 2026-05-02 → 2026-05-05.
  • Block B (planned): 2026-05-06 → 2026-05-09 — P14 – P21 across 4 working days. P13 follow-up aa9929a0 is recorded under May 06 in the daily plan.

What landed in 1a8bf32e (P14 phase-1 — faithful pre-mul clamped SwiGLU + V4 routers)

P14 ships in two commits. This one lands the math + parameter-layout faithfulness so V4-Flash checkpoints will load through the future state-dict adapter (originally P17, now deferred to P22+ by the 2026-05-01 reshuffle) without remapping. The structural refactor (DeepseekV4MoE(MoELayer) subclassing, provider helpers, G5 1L MoE forward) is the P14 phase-2 follow-up.

Activation (G3)

  • primus/backends/megatron/core/transformer/clamped_swiglu.py
    • Replace post-multiplication clamp with V4 pre-multiplication semantics: SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha). New helpers clamped_swiglu_pre_mul(gate, up, alpha) (split inputs) and clamped_swiglu_pre_mul_fused(x, alpha) ([gate | up] last-dim concat for grouped-gemm experts).
    • ClampedSwiGLUMLP now uses separate w1 / w2 / w3 Linears so the released checkpoint (Expert(w1, w2, w3, swiglu_limit)) loads without remapping. Optional fused_gate_up=True fuses the gate / up GEMMs at forward time only; the saved / loaded state_dict keys remain w1.weight / w2.weight / w3.weight.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • _DenseSwiGLUMLP now applies the same pre-mul clamp on its dense head/tail layers; previously it computed vanilla SiLU(gate) * up and ignored swiglu_limit.
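
For clarity, a minimal sketch of the pre-multiplication clamp semantics described above, in both split and fused layouts; the helper names follow the commit text, but treat the exact signatures as assumptions:

```python
import torch
import torch.nn.functional as F

def clamped_swiglu_pre_mul(gate: torch.Tensor, up: torch.Tensor, alpha: float) -> torch.Tensor:
    # V4 semantics: clamp *before* the multiply. The gate is clamped from above only,
    # the up projection is clamped symmetrically to [-alpha, +alpha].
    return F.silu(gate.clamp(max=alpha)) * up.clamp(min=-alpha, max=alpha)

def clamped_swiglu_pre_mul_fused(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # Fused layout used by grouped-GEMM experts: [gate | up] concatenated on the last dim.
    gate, up = x.chunk(2, dim=-1)
    return clamped_swiglu_pre_mul(gate, up, alpha)
```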

Learned router (G4)

  • primus/backends/megatron/core/transformer/moe/v4_topk_router.py
    • Rename V4TopKRouter -> DeepseekV4LearnedRouter (back-compat alias retained).
    • Gate exposed as weight Parameter of shape [num_experts, hidden_size] — matches Megatron's TopKRouter.weight AND HF reference Gate.weight exactly (no gate.weight indirection).
    • expert_bias is selection-only: routing weights gather from the un-biased scores so probs gradient flows to weight, never to expert_bias.
    • Renormalization gated on score_function != "softmax" (HF parity; softmax probs already sum to 1).
    • topk_scaling_factor honors moe_router_topk_scaling_factor (HF route_scale).
    • Score functions: v4_score_fn covers softmax, sigmoid, sqrtsoftplus.
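
A small sketch of the selection-only expert_bias behavior described above, assuming a softmax score function and illustrative shapes: the bias shifts which experts win top-k, while the routing weights still gather from the un-biased scores, so gradients reach only the gate weight:

```python
import torch

def route(hidden: torch.Tensor, weight: torch.Tensor, expert_bias: torch.Tensor, topk: int):
    # hidden: [T, D]; weight: [E, D] gate parameter; expert_bias: [E] selection-only bias.
    scores = torch.softmax(hidden @ weight.t(), dim=-1)           # [T, E], differentiable
    _, indices = torch.topk(scores + expert_bias, topk, dim=-1)   # bias affects selection only
    probs = scores.gather(-1, indices)                             # weights come from un-biased scores
    return probs, indices                                          # softmax path skips renormalization
```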

Hash router (G4)

  • primus/backends/megatron/core/transformer/moe/v4_hash_router.py
    • Rename HashRouter -> DeepseekV4HashRouter (back-compat alias retained).
    • Add learnable weight Parameter same shape as the learned router; previously the hash router emitted uniform 1/topk weights, which broke gradient flow into the gate weights and silently differed from the released checkpoint.
    • tid2eid is now a frozen nn.Parameter(requires_grad=False, dtype=torch.int32) (matches HF reference layout — released checkpoint stores it as a parameter so state-dict round-trips preserve it without polluting the optimizer state).
    • forward(hidden, token_ids) gathers learned scores at the static expert ids prescribed by tid2eid[token_ids]; renorm + scale parity with the learned router.
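
A sketch of the hash-router forward described above, assuming a sigmoid score function, a [num_tokens, hidden] input, and a frozen tid2eid table holding topk static expert ids per vocabulary entry; not the exact module, just the routing math:

```python
import torch

def hash_route(hidden, token_ids, weight, tid2eid, topk: int):
    # hidden: [T, D]; token_ids: [T]; weight: [E, D] learnable gate; tid2eid: [V, topk] frozen int32 table.
    scores = torch.sigmoid(hidden @ weight.t())       # learned scores, gradient flows into weight
    indices = tid2eid[token_ids].long()               # static expert ids per token (no gradient)
    probs = scores.gather(-1, indices)                # gather learned scores at the hashed experts
    probs = probs / probs.sum(dim=-1, keepdim=True)   # renorm parity with the learned router
    return probs, indices
```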

MoE wiring

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • _route now passes (hidden, token_ids) to the hash router; both routers receive hidden_size / score_function / topk_scaling_factor at init.

Tests

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_clamped_swiglu.py — 7 tests cover pre-mul activation vs HF reference (≤ 1e-6 fp32, four alpha values), alpha = 0 disables clamp, fused-vs-split agreement, one-sided gate clamp behavior, w1 / w2 / w3 state-dict keys (no gate_up.weight leak), fused_gate_up forward equivalence, end-to-end ClampedSwiGLUMLP vs HF Expert.forward.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_routers.py — 13 tests:
    • Score function: parity vs inline reference for all three functions.
    • Learned router: HF agreement across (softmax × sigmoid × sqrtsoftplus) × (with / without expert_bias) ≤ 1e-6; back-compat alias; gradient flows to gate weight; expert_bias detached from probs graph; softmax skips renorm.
    • Hash router: HF agreement across the three score functions ≤ 1e-6; tid2eid is a frozen Parameter (requires_grad=False, dtype int32); state-dict keys; deterministic table across seeds; OOB / shape-mismatch error paths; gradient flows to weight while tid2eid.grad is None.

Status

  • deepseek-v4/develop/progress/status.md: P14 phase-1 tasks checked off with this commit hash (1a8bf32e); deferred items listed for the phase-2 follow-up; the "HashRouter has no learnable gate weight / clamped SwiGLU clamps post-mul" blocker is marked resolved.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P14 phase-1 commit 1a8bf32e (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P14 phase-2 + P15 – P21 across 4 working days. P13 follow-up aa9929a0 and P14 phase-1 1a8bf32e are recorded under May 01 / 06 in the daily plan.

What landed in 5fe8bc3c (P14 phase-2 — V4 MoE structural bring-up + G5)

Closes plan-2 P14 by bringing DeepseekV4MoE into Megatron's spec lifecycle, exposing a CPU-testable forward path so the MoE math is pinned against the released HF reference, and adding the V4 provider helpers that plan-2 §5 / §6 call for.

DeepseekV4MoE -> MegatronModule

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • Parent class switched from nn.Module to MegatronModule so it inherits the standard config plumbing and integrates with TransformerLayer.mlp via the spec lifecycle.
    • BaseMoELayer-compatible public surface: set_layer_number(layer_number) mirrors BaseMoELayer.set_layer_number; local_expert_indices is exposed as a list attribute.

CPU local-experts path

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • When pg_collection is None, __init__ skips the dispatcher / grouped-experts construction and instead builds:
      • local_experts: nn.ModuleList[ClampedSwiGLUMLP] — one ClampedSwiGLUMLP per local expert (mirrors HF reference Expert exactly: separate w1 / w2 / w3 Linears + V4 pre-multiplication clamp).
      • shared_expert: ClampedSwiGLUMLP — a single shared expert with the same activation.
    • _local_experts_forward runs a per-expert dispatch loop matching DeepSeek-V4-Flash/inference/model.py:MoE.forward exactly (for each routed expert, gather routed tokens, multiply by per-token routing weight, accumulate). Production path (pg_collection provided) continues to use the Megatron dispatcher + grouped experts unchanged.
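
A simplified stand-in for the per-expert dispatch loop described above (gather the tokens routed to each expert, scale by the per-token routing weight, accumulate, then add the shared expert); the real _local_experts_forward follows the HF reference more literally:

```python
import torch

def local_experts_forward(hidden, probs, indices, local_experts, shared_expert=None):
    # hidden: [T, D]; probs / indices: [T, topk] routing weights and expert ids.
    out = torch.zeros_like(hidden)
    for eid, expert in enumerate(local_experts):
        token_idx, slot = torch.where(indices == eid)       # tokens routed to this expert
        if token_idx.numel() == 0:
            continue
        routed = expert(hidden[token_idx])                   # ClampedSwiGLUMLP forward
        out.index_add_(0, token_idx, routed * probs[token_idx, slot].unsqueeze(-1))
    if shared_expert is not None:
        out = out + shared_expert(hidden)                    # shared expert sees every token
    return out
```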

Provider helpers (plan-2 P14 §5 / §6)

  • primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
    • DeepSeekV4SpecProvider.v4_grouped_mlp_spec(swiglu_limit, moe_use_grouped_gemm=True, ...) returns a ready-to-use ModuleSpec(grouped_module, MLPSubmodules) for the V4 MoE expert path. The pre-mul clamp itself is applied via config.activation_func_clamp_value — Megatron's eager glu() (mlp.py:312-321) already implements SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha), which is bit-equal to the HF reference math; the spec only commits to the right grouped module + the column / row-parallel linears.
    • DeepSeekV4SpecProvider.v4_router_spec(learned=True/False) returns a bare ModuleSpec for either DeepseekV4LearnedRouter or DeepseekV4HashRouter.

G5 numerical alignment

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_moe.py — 11 tests:
    • Construction sanity: parent class is MegatronModule; CPU path builds local_experts (ClampedSwiGLUMLP) + shared_expert; the token_dispatcher / grouped_experts attributes stay None; set_layer_number propagates.
    • Learned-router MoE forward vs inline HF reference on a 1L toy across (sqrtsoftplus, sigmoid, softmax) × (shared expert on / off) — ≤ 1e-3 fp32 CPU.
    • Hash-router MoE forward vs HF across the three score functions, with token_ids feeding tid2eid — ≤ 1e-3 fp32 CPU.
    • moe_router_topk_scaling_factor (HF route_scale) propagates to the output.
    • Backward populates grads on router.weight, on the shared expert, and on at least one routed expert's w1 / w2 / w3.
    • Hash layer raises a clear error when token_ids is missing.

Status

  • deepseek-v4/develop/progress/status.md — P14 phase-2 tasks ticked with this commit; the structural row records the MegatronModule-via-CPU-path approach and explicitly defers the TopKRouter-rooted aux-loss / z-loss path to P19 alongside the distributed re-validation matrix (rationale: upstream TopKRouter.__init__ registers CUDA buffers unconditionally, which is impractical for CPU-clean V4 routers; gating that on a device check is out-of-scope for this commit).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P14 phase-2 commit 5fe8bc3c (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P15 – P21 across 4 working days. P13 follow-up aa9929a0, P14 phase-1 1a8bf32e, and P14 phase-2 5fe8bc3c are recorded under May 01 in the daily plan.

What landed in 25ccdb5e (P15 — V4 layer / block subclass refactor + token-ids forward kwarg + HC × PP packing)

Closes plan-2 P15 except the distributed PP-equivalence gate (G6) which is tracked into P19. This commit brings V4's layer / block onto Megatron's TransformerLayer / TransformerBlock parents, drops the decoder._v4_token_ids attribute stash in favor of a real forward kwarg, gates HyperHead to the post_process stage, and extracts HC × PP K-stream packing helpers.

DeepseekV4HybridLayer -> TransformerLayer

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • Parent class switched from GraphableMegatronModule to TransformerLayer. TransformerLayer.__init__ is bypassed (V4's submodule contract differs — no cross-attention, no BDA, V4-specific attention signature); MegatronModule.__init__ is called directly.
    • DeepseekV4HybridLayerSubmodules now extends TransformerLayerSubmodules and uses upstream-canonical field names: input_layernorm / self_attention / pre_mlp_layernorm / mlp. The two V4-specific HC mixer hooks attn_hc / ffn_hc remain, both default to None for hc_mult == 1.
    • The layer's forward signature is now upstream-compatible: (hidden_states, attention_mask=None, *, position_ids=None, token_ids=None, **kwargs). attention_mask is accepted and ignored (V4 manages SWA / sink mask internally); position_ids is consumed from the caller (fallback to arange(S) for tiny smokes); **kwargs lets the layer plug into MultiTokenPredictionLayer (P16) without bespoke adapters.

DeepseekV4TransformerBlock -> TransformerBlock

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • Parent class switched from nn.Module to TransformerBlock (init bypass via MegatronModule for CPU instantiability; V4 has its own layer-spec / lift-lower pipeline). Type identity unlocks Megatron isinstance checks + sharded-state-dict integration.
    • HyperHead is built only on the post_process stage. Earlier PP stages forward the K-stream tensor via _lower_streams_out (no per-stage HyperHead), saving memory and removing a correctness drift risk.

HC × PP K-stream packing helpers

  • _lift_streams_in(hidden_states, pre_process, hc_mult) / _lower_streams_out(x, post_process, hc_mult) extracted as module-level helpers in deepseek_v4_block.py.
    • First PP stage: [S, B, D] -> [B, S, K, D] (broadcast across K).
    • Non-first PP stage: [S*K, B, D] -> [B, S, K, D] (unfold packed K).
    • Final stage: [B, S, D] -> [S, B, D] (post-HyperHead transpose).
    • Non-final stage: [B, S, K, D] -> [S*K, B, D] (pack K into seq for PP P2P).
    • Both helpers raise clear errors on shape mismatches.
  • The packing math is intentionally K-folded-into-seq (not the batch axis) so sequence-parallel chunking lines up cleanly; PP P2P doesn't need to know about K.
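
A minimal sketch of the lift/lower packing math listed above, using plain reshapes; the real helpers additionally raise the shape-mismatch errors and handle viewless tensors:

```python
import torch

def lift_streams_in(x: torch.Tensor, pre_process: bool, hc_mult: int) -> torch.Tensor:
    if pre_process:
        # First PP stage: [S, B, D] -> [B, S, K, D], broadcast across the K streams.
        s, b, d = x.shape
        return x.transpose(0, 1).unsqueeze(2).expand(b, s, hc_mult, d).contiguous()
    # Non-first stage: [S*K, B, D] -> [B, S, K, D], unfold the packed K dimension.
    sk, b, d = x.shape
    return x.view(sk // hc_mult, hc_mult, b, d).permute(2, 0, 1, 3).contiguous()

def lower_streams_out(x: torch.Tensor, post_process: bool, hc_mult: int) -> torch.Tensor:
    if post_process:
        # Final stage (post-HyperHead): [B, S, D] -> [S, B, D].
        return x.transpose(0, 1).contiguous()
    # Non-final stage: [B, S, K, D] -> [S*K, B, D], pack K into the seq dim for PP P2P.
    b, s, k, d = x.shape
    return x.permute(1, 2, 0, 3).reshape(s * k, b, d)
```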

Token-ids forward kwarg

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
    • DeepseekV4Model.forward no longer assigns decoder._v4_token_ids (and removes the try/finally cleanup). It now passes token_ids=input_ids and position_ids=position_ids directly to self.decoder(...).
    • The decoder block + each layer consume them as standard forward kwargs and propagate to mlp.forward -> hash_router.forward.
    • An AST-level audit (test_v4_block_pp.py::test_model_forward_does_not_set_decoder_v4_token_ids_attribute) prevents the attribute stash from regressing.

Spec wiring + MTP block update

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py renames the four core fields when constructing DeepseekV4HybridLayerSubmodules: attn_norm -> input_layernorm, attention -> self_attention, ffn_norm -> pre_mlp_layernorm, ffn -> mlp.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py switches the per-MTP-layer call to layer(stream, position_ids=..., token_ids=...) (kwarg, not positional) to match the new layer forward signature.

Tests (tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_block_pp.py, 16 tests)

  • Subclass identity: DeepseekV4HybridLayer is a TransformerLayer; DeepseekV4TransformerBlock is a TransformerBlock; DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules and exposes attn_hc / ffn_hc.
  • Lift / lower roundtrip: bit-exact across the four PP-stage permutations (pre_process × post_process), for both single-stream (hc_mult=1) and multi-stream (K=3, K=4).
  • Error paths: misaligned S*K on non-first stage; collapsed input on non-final lower; uncollapsed input on final lower.
  • Token-ids stash: AST audit confirms decoder._v4_token_ids is gone from the model source; token_ids=input_ids kwarg is present.
  • Forward signatures: block.forward exposes position_ids + token_ids kwargs; layer.forward accepts (hidden_states, attention_mask=None, position_ids, token_ids).

Status / blockers

  • deepseek-v4/develop/progress/status.md — Phase 15 tasks ticked except G6 (PP=1 vs PP=2 vs PP=4 equivalence on a 4L toy), which requires distributed init and is tracked into P19 distributed re-validation. The CPU-only sub-gate — _lift_streams_in after _lower_streams_out is bit-exact — is covered by the lift/lower roundtrip tests, which is the math contract a real PP run depends on.
  • Two blocker rows resolved:
    • "Custom V4 block / layer / MoE bypass TransformerBlock / TransformerLayer / MoELayer" — closed by P14 phase-2 + P15.
    • "Token-IDs propagation via decoder._v4_token_ids attribute" — closed by P15.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P15 commit 25ccdb5e (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P16 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, and P15 25ccdb5e are recorded under May 01 in the daily plan.

What landed in 6c5875d4 (P16 — spec-based MTP via MultiTokenPredictionBlock + process_mtp_loss)

Closes plan-2 P16 except the distributed MTP-loss ablation gate (G7), which is tracked into P19 alongside G6. This commit wires V4 onto Megatron's upstream MTP pipeline so the auxiliary multi-token-prediction loss flows through process_mtp_loss (per-depth shifted logits + MTPLossAutoScaler) instead of the standalone primus-owned MTP block. The legacy DeepseekV4MTPBlock remains behind the v4_use_custom_mtp_block config flag for back-compat with research checkpoints (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle) and now emits a DeprecationWarning on construction.

Spec helper (primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp_specs.py, new)

  • get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage) returns
    ModuleSpec(MultiTokenPredictionBlock, submodules=MultiTokenPredictionBlockSubmodules(layer_specs=[...]*mtp_num_layers)).
  • Each per-depth MultiTokenPredictionLayer spec pulls
    • enorm / hnorm / layer_norm from DeepSeekV4SpecProvider.v4_norm_module()
    • eh_proj from provider.column_parallel_linear()
    • mtp_model_layer from the V4 hybrid-layer spec passed in by the model — so each MTP depth shares HC, hash routing, and clamped-SwiGLU with the main decoder exactly.
  • Rejects mtp_num_layers < 1 with a clear ValueError.
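
To make the spec shape concrete, a schematic of what get_v4_mtp_block_spec returns: one per-depth layer spec repeated mtp_num_layers times, all sharing the V4 hybrid-layer spec. The dataclasses below are simplified stand-ins for Megatron's spec types, shown only to illustrate the structure:

```python
from dataclasses import dataclass, field
from typing import Any, List

# Simplified stand-ins for Megatron's ModuleSpec / MTP submodules types (structure only).
@dataclass
class ModuleSpec:
    module: Any
    submodules: Any = None
    params: dict = field(default_factory=dict)

@dataclass
class MTPBlockSubmodules:
    layer_specs: List[ModuleSpec]

def get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage=None) -> ModuleSpec:
    if config.mtp_num_layers < 1:
        raise ValueError("mtp_num_layers must be >= 1 to build an MTP block spec")
    per_depth = ModuleSpec(
        module="MultiTokenPredictionLayer",
        submodules={
            "enorm": "v4_norm_module",                  # from DeepSeekV4SpecProvider.v4_norm_module()
            "hnorm": "v4_norm_module",
            "eh_proj": "column_parallel_linear",        # from provider.column_parallel_linear()
            "mtp_model_layer": transformer_layer_spec,  # the V4 hybrid-layer spec, shared with the decoder
        },
    )
    # Every MTP depth reuses the same per-depth spec.
    return ModuleSpec(
        module="MultiTokenPredictionBlock",
        submodules=MTPBlockSubmodules(layer_specs=[per_depth] * config.mtp_num_layers),
    )
```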

DeepseekV4Model updates (deepseek_v4_model.py)

  • New default path: when mtp_num_layers > 0 and not v4_use_custom_mtp_block, __init__ builds self.mtp = MultiTokenPredictionBlock(spec=get_v4_mtp_block_spec(...)) on stages where mtp_on_this_rank() is True. mtp_on_this_rank is wrapped in try/except so CPU smokes (no parallel_state) do not crash; self.mtp_process is False and self.mtp is None on those paths.
  • Legacy DeepseekV4MTPBlock path stays available behind v4_use_custom_mtp_block; self.mtp_block is the legacy slot, self.mtp is the new spec-based slot. Both are None when MTP is disabled.
  • forward now mirrors GPTModel.forward: runs self.mtp(...) on stages with MTP layers (passing input_ids / position_ids / hidden_states / attention_mask / embedding / packed_seq_params), then on post_process with mtp_num_layers > 0 calls process_mtp_loss(...) which chunks the concatenated hidden states, computes the per-depth shifted MTP loss, and folds it into the gradient via MTPLossAutoScaler.
  • New forward kwargs: loss_mask (forwarded to process_mtp_loss) and packed_seq_params.

Layer / block forward contract

  • DeepseekV4HybridLayer.forward now returns (hidden_states, None) instead of just hidden_states. This matches upstream TransformerLayer (which returns (hidden_states, context)) and is required by MultiTokenPredictionLayer._proj_and_transformer_layer which unpacks hidden_states, _ = self.mtp_model_layer(...).
  • DeepseekV4TransformerBlock's per-layer iteration updates to x, _ = layer(...).
  • Legacy DeepseekV4MTPBlock likewise updates to unpack the tuple.

V4 attention spec advertises attn_mask_type

  • The V4 attention spec now declares params={"compress_ratio": ..., "attn_mask_type": AttnMaskType.causal}. MultiTokenPredictionLayer.__init__ validates the inner layer's self_attention.params['attn_mask_type'] against {padding, causal, no_mask, padding_causal}; without this the MTP block fails to construct. The value is functionally inert for V4 (which manages its own SWA / sink mask).
  • DeepseekV4Attention.__init__ accepts and ignores attn_mask_type plus a **kwargs catch-all so the spec lifecycle keeps working.

Legacy DeepseekV4MTPBlock (deepseek_v4_mtp.py)

  • Module docstring annotated as deprecated (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle).
  • Construction emits a DeprecationWarning pointing users at get_v4_mtp_block_spec. Code path unchanged otherwise.

Tests (tests/.../test_v4_mtp.py, ~17 tests)

  • get_v4_mtp_block_spec structural assertions: outer module is MultiTokenPredictionBlock; layer_specs length matches mtp_num_layers (parametrised 1/2/3); each per-depth spec is a MultiTokenPredictionLayer; the V4 inner layer is threaded through unchanged; norm + linear come from the V4 provider.
  • Rejects mtp_num_layers=0 with a clear ValueError.
  • DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules so MTP picks up the GPT path (not Mamba) in its inner-layer-submodules isinstance check.
  • DeepseekV4HybridLayer.forward returns (hidden_states, None) (source-level assertion on return x, None).
  • V4 attention spec advertises AttnMaskType.causal (source-level assertion).
  • Legacy DeepseekV4MTPBlock emits DeprecationWarning on construction.
  • AST audits on deepseek_v4_model.py: process_mtp_loss is called; upstream MTP machinery is imported; spec helper is invoked; v4_use_custom_mtp_block flag is preserved; the mtp_num_layers > 0 guard keeps the no-MTP path inert.

Status / blockers

  • deepseek-v4/develop/progress/status.md — Phase 16 tasks ticked except G7 (MTP loss appears in train log; mtp_num_layers=0 vs mtp_num_layers=1 ablation matches LM loss to 1e-6), which requires distributed init + MultiTokenPredictionBlock runtime (CP / SP plumbing); tracked into P19 distributed re-validation alongside G6.
  • Two new follow-on rows recorded for the cross-cutting layer-tuple return + attention attn_mask_type declarations (both required by upstream MTP wiring).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P16 commit 6c5875d4 (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P17 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, P15 25ccdb5e, and P16 6c5875d4 are recorded under May 01 in the daily plan.

What landed in e591b893 (P17 — code cleanup, gate G14)

P17 ships the dead-code retirement that was front-loaded from P21 in the 2026-05-01 reshuffle (f548d8b2). With pre-training as the release path, the HF state-dict adapter slot moved out (deferred to P22+) and the cleanup work moved up so P18's spec audit walks a clean tree.

Retired in this commit

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py — the legacy primus-owned DeepseekV4MTPBlock was deprecation-warned since P16 (6c5875d4); the spec-based path (get_v4_mtp_block_spec + upstream MultiTokenPredictionBlock + process_mtp_loss) is the only MTP route now.
  • DeepSeekV4TransformerConfig.v4_use_custom_mtp_block (legacy MTP gate) — removed.
  • DeepSeekV4TransformerConfig.mtp_compress_ratios (legacy-only field) — removed.
  • DeepseekV4Model.__init__ — single MTP branch on the spec path; the if v4_use_custom_mtp_block arm + self.mtp_block field are gone.

Dedup'd in this commit

  • primus/backends/megatron/core/transformer/local_rmsnorm.py (new) — one canonical LocalRMSNorm consumed by deepseek_v4_block.py (input_layernorm / pre_mlp_layernorm / final_layernorm fallback), deepseek_v4_attention.py (q_norm / kv_norm fallback closure), and compressor.py (kv_norm). The three pre-existing _RMSNorm definitions are deleted.
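
For reference, the standard RMSNorm formulation that a helper like LocalRMSNorm consolidates; the exact module lives in local_rmsnorm.py, this is just the textbook shape:

```python
import torch
import torch.nn as nn

class LocalRMSNorm(nn.Module):
    """RMSNorm: x * rsqrt(mean(x^2) + eps), scaled by a learned weight."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()
        x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (self.weight * x).to(dtype)
```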

YAML cleanup

  • deepseek_v4_flash.yaml — inverted comment fixed: 4 = CSA (overlap) and 128 = HCA (non-overlap) match DeepseekV4Attention.forward dispatch.
  • deepseek_v4_pro.yaml + deepseek_v4_base.yaml — same canonical comment block added so all three V4 yamls are self-documenting.

Audit gate G14

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (new):
    • retired files gone (deepseek_v4_mtp.py, csa_attention.py, hca_attention.py).
    • legacy import path raises ImportError; package __all__ no longer exposes DeepseekV4MTPBlock.
    • DeepSeekV4TransformerConfig no longer carries v4_use_custom_mtp_block / mtp_compress_ratios.
    • AST scan over every V4 source for runtime _v4_token_ids access (Attribute / Assign / Name) — docstring mentions are exempt.
    • AST scan over every V4 source for class _RMSNorm shadow definitions — none allowed.
    • parameterised yaml check that the canonical 4 = CSA / 128 = HCA mapping is documented.

Out of scope (kept, with notes in status.md)

  • primus/backends/megatron/core/transformer/dual_rope.py — load-bearing for V4's CSA / HCA dual-base partial RoPE; Megatron's RotaryEmbedding only supports a single base. Plan-2 was over-eager listing this for retirement; it stays.

What landed in b5832672 (P18 — spec-system audit, gate G1 + D1 / D2 / D4)

P18 closes the spec-system audit findings D1 / D2 / D4 from 00-review-findings.md. Walking a clean tree (after P17) makes the audits crisp.

Provider singleton (D1)

  • primus/backends/megatron/core/models/deepseek_v4/build_context.py (new): resolve_v4_provider(config) caches a single DeepSeekV4SpecProvider on the config object via a private attribute. Different configs get different providers; the cache is GC'd when the config is released.
  • All three direct DeepSeekV4SpecProvider(config=config) call sites migrated to the helper:
    • deepseek_v4_block.py (_build_projection + DeepseekV4MoE shared-expert wiring)
    • deepseek_v4_layer_specs.py
    • deepseek_v4_mtp_specs.py
  • AST audit (test_v4_p18_spec_audit.py::test_no_direct_DeepSeekV4SpecProvider_construction_outside_build_context) rejects future regressions; build_context.py is the only allowed instantiation site.
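
A sketch of the per-config provider cache described above, assuming a private attribute name; the constructor call mirrors the migrated call sites:

```python
from primus.backends.megatron.core.extensions.transformer_engine_spec_provider import (
    DeepSeekV4SpecProvider,
)

_CACHE_ATTR = "_v4_spec_provider"  # assumed name of the private cache attribute

def resolve_v4_provider(config) -> DeepSeekV4SpecProvider:
    """Return the provider cached on this config, constructing it exactly once per config."""
    provider = getattr(config, _CACHE_ATTR, None)
    if provider is None:
        provider = DeepSeekV4SpecProvider(config=config)  # the only allowed instantiation site
        setattr(config, _CACHE_ATTR, provider)
    return provider
```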

Activation-func consistency (D2)

  • New helper DeepSeekV4SpecProvider.v4_mlp_activation_func() returns:
    • None when config.use_te_activation_func is False — the V4 default; needed so Megatron MLP keeps the eager clamped-SwiGLU path (which applies activation_func_clamp_value).
    • TEActivationOp (the TE class, instantiated by Megatron MLP at build) when the user opts into TE.
  • Layer specs + DeepseekV4MoE shared-expert spec switched to the V4 helper. The base provider's activation_func() is unchanged (BackendSpecProvider contract still says "returns a type").

compress_ratios normalization (D4)

  • DeepSeekV4TransformerConfig.__post_init__ calls _normalize_compress_ratios_field on the raw value once, so downstream consumers see tuple[int, ...] (or None). The helper handles strings ("[0, 0, 4, 128, ...]") and real lists.
  • Runtime helpers (_parse_int_sequence / _normalize_compress_ratios in deepseek_v4_block.py) keep accepting both forms for back-compat, but always receive the normalized form on the live path.
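
A minimal sketch of the normalization step described above, accepting both the yaml string form and a real sequence and returning tuple[int, ...]:

```python
import ast
from typing import Optional, Sequence, Tuple, Union

def normalize_compress_ratios(value: Union[str, Sequence[int], None]) -> Optional[Tuple[int, ...]]:
    """Accept "[0, 0, 4, 128]"-style strings or real sequences; emit tuple[int, ...] or None."""
    if value is None:
        return None
    if isinstance(value, str):
        value = ast.literal_eval(value)  # parse the string form once, up front
    return tuple(int(v) for v in value)
```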

Schema gate G1

  • tests/unit_tests/configs/test_deepseek_v4_yaml.py (new): parameterises over deepseek_v4_{base,flash,pro}.yaml:
    • parse_yaml() succeeds; required fields present.
    • DeepSeekV4TransformerConfig builds from the parsed dict.
    • compress_ratios normalized to tuple[int, ...] with no value drift vs the raw schedule.
    • every compress_ratios entry is in {0, 4, 128} (canonical V4 branches).
    • retired P17 fields (v4_use_custom_mtp_block / mtp_compress_ratios) are gone from the dataclass and from each YAML.
    • V4-specific runtime fields (HC, sliding-window, sink, o_groups / o_lora_rank, MoE extras, swiglu_limit) all declared on the dataclass.
    • provider singleton: resolve_v4_provider(cfg_a) returns the same instance on repeated calls; different configs get different providers.
    • v4_mlp_activation_func contract verified for both branches of use_te_activation_func.

Spec audit (light-weight, AST-only)

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (new):
    • D1 / D2 audits described above.
    • package surface __init__.py __all__ does not re-export DeepseekV4MTPBlock (P17 cross-check).
    • spec builders do not eagerly construct TENorm / TE{Column,Row}ParallelLinear / TELinear / TEActivationOp inside __init__ — they emit ModuleSpec(module=...) references that runtime build_module resolves.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P17 + P18 commits e591b893 + b5832672 (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P19 – P21 across 4 working days. P17 + P18 are recorded under May 01 in the daily plan; P19 (distributed re-validation) is the first item in Block B.

What landed in 83c33ad0 (P19 — distributed re-validation) + dba27163 (plan-2 close-out)

P19 closes the distributed re-validation gate (G10) for the architecture-faithful V4 stack landed across P13 → P18. All four target smokes pass 10/10 iterations on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) are captured for the perf-baseline reference.

Smokes (10 iters each)

| smoke | parallelism | result | gating patch | log |
|---|---|---|---|---|
| A | TP=1 PP=1 EP=1 | 10/10 | none (HC stays in-stage; PP > 1 patches are no-ops) | deepseek-v4/develop/progress/p19/smokeA*.log |
| B | TP=1 PP=2 EP=4 | 10/10 | pp_tensor_shape | p19/smokeB*.log |
| C | TP=1 PP=4 EP=2 | 10/10 | pp_tensor_shape + pp_token_pre_broadcast | p19/smokeC_pp4_ep2_v2.log |
| D | TP=1 PP=2 EP=4 VPP=2 | 10/10 | pp_tensor_shape (also wraps the interleaved schedule) + pp_token_pre_broadcast (upfront) | p19/smokeD_pp2_ep4_vpp2_v2_run3.log |

Profile traces

torch.profiler chrome-trace JSONs (single active step, iter 6 → 7) under the same V4 smoke config:

  • output/amd/tas-mi355x-20260507/p19_profile_pp1_ep8/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=1 EP=8 (~99 MB).
  • output/amd/tas-mi355x-20260507/p19_profile_pp2_ep4/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=2 EP=4 (~105 MB).

Launchers: deepseek-v4/develop/progress/p19/run_profile_ep8.sh and run_profile_pp2_ep4.sh.

megatron.deepseek_v4.pp_tensor_shape (primus/backends/megatron/patches/deepseek_v4_pp_shape_patches.py)

Wraps two Megatron entry points in megatron.core.pipeline_parallel.schedules so V4's mHC K = hc_mult packing is reflected on the PP wire:

  1. get_tensor_shapes (used by 1F1B): seq dim multiplied by hc_mult so the receive buffer matches [S * K, B, D] instead of the stock [S, B, D].
  2. forward_backward_pipelining_with_interleaving (used by VPP): seq_length kwarg multiplied by hc_mult before the schedule's inline tensor_shape = [seq_length, mbs, hidden] runs.

Both wrappers gate on model_type == "deepseek_v4" + hc_mult > 1 + PP > 1 and are strict no-ops otherwise. Without (2) VPP allocates [S, B, D] recv buffers while the sender emits [S * K, B, D], and _lift_streams_in reshapes the truncated copy — surfaces as DeepseekV4HashRouter: hidden=32 vs token_ids=128.
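
A sketch of the 1F1B half of this patch, assuming each shape returned by schedules.get_tensor_shapes unpacks as (seq, micro_batch, hidden); the VPP half (scaling the interleaved schedule's seq_length kwarg) is omitted:

```python
import functools
from megatron.core.pipeline_parallel import schedules

def patch_pp_tensor_shapes(hc_mult: int, is_deepseek_v4: bool, pp_size: int) -> None:
    """Make 1F1B recv buffers expect V4's [S*K, B, D] packing instead of [S, B, D]."""
    if not (is_deepseek_v4 and hc_mult > 1 and pp_size > 1):
        return  # strict no-op outside the V4 multi-stream case

    original = schedules.get_tensor_shapes

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        shapes = original(*args, **kwargs)
        # Assumed: each shape unpacks as (seq, micro_batch, hidden); scale seq by K = hc_mult.
        return [(s * hc_mult, b, h) for (s, b, h) in shapes]

    schedules.get_tensor_shapes = wrapped
```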

megatron.deepseek_v4.pp_token_pre_broadcast (primus/backends/megatron/patches/deepseek_v4_get_batch_patches.py)

V4's hash-routed MoE layers (the first num_hash_layers) need raw input_ids on every PP stage that owns one, but pretrain_gpt.get_batch returns None on middle PP stages. Two earlier in-loop hooks both deadlocked under VPP — an in-DeepseekV4Model.forward broadcast and a per-call get_batch broadcast each raced the interleaved schedule's pre-warmup recv_forward.wait().

This patch wraps pp_module.get_forward_backward_func so each train_step first runs all num_microbatches × num_chunks PP dist.broadcast collectives upfront, before the schedule's first send / recv, and caches the resulting (tokens, labels, loss_mask, attention_mask, position_ids, packed_seq_params) tuples per (vp_stage, microbatch). A companion wrapper around pretrain_gpt.get_batch consumes the cache when active and falls back to the original implementation otherwise. Cache is reset in a finally after each schedule call. Cost ≈ mbs * seq * 8B per microbatch (~32 KiB / step on the smoke), dwarfed by the activation P2P.
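
A heavily simplified sketch of the pre-broadcast idea (broadcast every microbatch's tokens from PP rank 0 upfront, cache per (chunk, microbatch), and let the get_batch wrapper consume the cache); the group handles, shapes, and hook points into get_forward_backward_func are assumptions, not the actual patch:

```python
import torch
import torch.distributed as dist

_BATCH_CACHE = {}  # (chunk, microbatch) -> tokens; filled before the schedule's first send / recv

def pre_broadcast_tokens(token_batches, pp_group, pp_src_rank, num_chunks):
    """Run every PP broadcast upfront so middle stages never block inside the interleaved schedule."""
    _BATCH_CACHE.clear()
    for mb, tokens in enumerate(token_batches):  # real data on PP rank 0, same-shape buffers elsewhere
        for chunk in range(num_chunks):
            buf = tokens.clone() if dist.get_rank() == pp_src_rank else torch.empty_like(tokens)
            dist.broadcast(buf, src=pp_src_rank, group=pp_group)
            _BATCH_CACHE[(chunk, mb)] = buf

def cached_get_batch(chunk, microbatch, original_get_batch):
    """get_batch wrapper: serve from the pre-broadcast cache when active, else fall back."""
    key = (chunk, microbatch)
    return _BATCH_CACHE[key] if key in _BATCH_CACHE else original_get_batch()
```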

Model-side cleanup (deepseek_v4_model.py, deepseek_v4_layer_specs.py)

  • Drop the in-forward input_ids PP broadcast + VPP fail-fast assert from DeepseekV4Model; the pre-broadcast patch handles both 1F1B and VPP cleanly.
  • Stop pre-assigning self.mtp = None in __init__; Megatron's set_current_microbatch (in cuda_graphs.py) only iterates model.mtp.layers when MTP is actually live, which matches upstream GPTModel. Downstream MTP guards use getattr(self, "mtp", None).
  • Import DeepSeekV4SpecProvider in deepseek_v4_layer_specs.py so the type annotation resolves at module load (NameError surfaced once turbo path was off).

c10d::allreduce_ autograd warning gone

The historical UserWarning: An operator was called with autograd not registered for c10d::allreduce_ came from the early bring-up's "local shard + torch.distributed.all_reduce" path for MoE routed-output aggregation in v4_moe.py. P14 phase-2 migrated MoE to Megatron's token dispatchers (MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher); P17 deleted the v4_enable_ep_allreduce_fallback debug gate; and P19 confirms zero c10d::allreduce hits in stderr across all four smokes + the EP=8 / PP=2 EP=4 profile runs.

dba27163 plan-2 close-out (docs-only)

  • status.md — mark c10d::allreduce_ warning as gone (with the verification log paths); mark G11 as [-] deferred (snapshot dump tooling never landed); drop Phase 20 / 21 / 22+ sections (kept as documented intent in plan-2/03-phase-details.md); refresh the Blockers / Risks log entry for c10d to reference the actual P19 verification rather than "still tracked into P19".
  • deepseek-v4/develop/progress/plan-2-summary.md (new) — stand-alone summary of the plan-2 architecture-faithful rewrite (P12 → P19): per-phase outcome with key commits; P19 deep-dive (smokes / profile traces / patches / c10d verification); test-gate ledger (G1 / G3 / G4 / G5 / G6 / G7 / G11 / G14 + smokes); plan-1 → plan-2 architectural-shift table (attention, MoE, layer / block, MTP, token-IDs path, HC × PP, TP, spec hygiene); explicit deferred / out-of-scope list (G6 distributed, G7 MTP, G11, P20, P21, P22+).
  • P19 profile launchers — run_profile_ep8.sh (TP=1 PP=1 EP=8) and run_profile_pp2_ep4.sh (TP=1 PP=2 EP=4); both set PROFILE=True + disable_tensorboard=False so the existing torch_profiler_patches.py hook captures iter 6 → 7.
  • deepseek-v4/download_ref.sh — idempotent helper that ensures git-lfs and clones the V4 reference assets at pinned commits (HF transformers, ROCm TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, plus DeepSeek-V4-Pro / Flash / Flash-Base / Pro-Base) with GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P12 → P18 commits (636ab3de → b5832672).
  • Block B (delivered): 2026-05-07 — plan-2 P19 (83c33ad0) + plan-2 close-out (dba27163).
  • Deferred follow-ups: P20 (200-step Megatron-bridge convergence + TE on / off perf report + FP8 follow-up plan), P21 (techblog / progress timeline / PPT refresh), P22+ (HF state-dict adapter + V4-Flash safetensors round-trip / token-0 logits ≤ 1e-2 vs HF reference). All three are documented in plan-2/03-phase-details.md; they re-enter active work when the next campaign (release, downstream integration ask, SFT / eval) needs them.

Test plan

  • P1 – P5 unit / smoke coverage from previous commits.
  • P6 / P7 functional smoke: 1 node, 8 GPUs, PP=2, EP=4, BF16, 3 iters.
  • P8 (v2) runtime instantiate / forward validation in dev_primus_wenx_691.
  • P9 (v2) provider-mode A/B validation (local forward, TE build, TE CUDA forward, TE host-input guard).
  • P10 (v2) iteration 10/10 smoke with grouped-expert clamped-SwiGLU guard.
  • P12 plan-2 lockdown: docs-only commit; pre-commit hooks (isort / autoflake / black) pass.
  • P13 — V4-faithful attention class (dense + HCA + CSA in one MLASelfAttention-rooted module); inline-reference numerical alignment for dense (≤ 1e-3) and HCA (≤ 1e-3); CSA shape / finiteness; linear_q_up_proj / linear_o_b Column / Row parallel; pre-commit hooks (isort / autoflake / black) pass.
  • P13 — HF-reference numerical alignment within 1e-3 (CPU fp32) — release gate G2 / G3 — deferred to P22+ alongside the state-dict adapter (originally tracked into P17; reshuffled 2026-05-01).
  • P13 — TP=2 sharding-parity bit-equality vs duplicated baseline — scaffold landed (skipif single-rank); execution deferred to P19.
  • P14 phase-1 — pre-mul clamped SwiGLU activation + V4 routers (learned + hash); G3 (≤ 1e-6 fp32 vs HF reference) + G4 (identical (probs, indices) + gradient flow on gate weight) covered by test_clamped_swiglu.py + test_v4_routers.py; pre-commit hooks (isort / autoflake / black) pass.
  • P14 phase-2 — DeepseekV4MoE -> MegatronModule + provider v4_grouped_mlp_spec(swiglu_limit) / v4_router_spec(learned) + 1L MoE forward within 1e-3 of HF reference (gate G5) — covered by test_v4_moe.py; pre-commit hooks pass.
  • P15 — DeepseekV4HybridLayer -> TransformerLayer + DeepseekV4TransformerBlock -> TransformerBlock; HyperHead only on post_process; _lift_streams_in / _lower_streams_out packing helpers (CPU-only G6 sub-gate covered by test_v4_block_pp.py); token_ids forward-kwarg threading + decoder._v4_token_ids AST audit; pre-commit hooks pass.
  • P15 — distributed PP=1 / 2 / 4 equivalence on 4L V4 toy — gate G6 — deferred to P19.
  • P16 — spec-based MTP via upstream MultiTokenPredictionBlock + process_mtp_loss; get_v4_mtp_block_spec helper; layer forward returns (hidden_states, None); legacy DeepseekV4MTPBlock deprecated; pre-commit hooks pass.
  • P16 — distributed MTP loss appears in train log; mtp_num_layers=0 matches LM loss to 1e-6 — gate G7 — deferred to P19.
  • P17 — code cleanup: legacy DeepseekV4MTPBlock deleted; v4_use_custom_mtp_block / mtp_compress_ratios config fields removed; three _RMSNorm shadows replaced by shared LocalRMSNorm; yaml comment inversion fixed (4 = CSA / 128 = HCA); package surface refreshed; AST gate G14 green via test_v4_p17_dead_code.py (retired-files check, retired-config-fields check, _v4_token_ids AST scan, _RMSNorm shadow scan, yaml-comment dispatch). dual_rope.py intentionally kept (load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent — documented in status.md). Pre-commit hooks pass.
  • P18 — spec-system audit: provider singleton via build_context.resolve_v4_provider(config) (D1); provider.v4_mlp_activation_func() returns None when use_te_activation_func=False and TEActivationOp otherwise (D2); compress_ratios normalized to tuple[int, ...] in __post_init__ (D4); new tests/unit_tests/configs/test_deepseek_v4_yaml.py (G1 schema gate) + test_v4_p18_spec_audit.py (D1 / D2 / package surface / TE eager-construction AST audits). Pre-commit hooks pass.
  • P19 — distributed re-validation (G10): smokes A 1×8 PP=1 EP=1, B 1×8 PP=2 EP=4, C 1×8 PP=4 EP=2, D 1×8 PP=2 EP=4 VPP=2 all 10/10 iters on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) captured. Two primus-patches landed (pp_tensor_shape + pp_token_pre_broadcast); c10d::allreduce_ autograd warning verified absent in stderr across all smokes + profile runs.
  • [-] P19 — routing-snapshot diff = 0 across PP / EP changes — gate G11. deferred: snapshot dump tooling never landed; not on the pre-training release path. Runtime stability of the P15 / P19 patches is covered by the smokes above.
  • P20 — 200-step Megatron-bridge convergence (±0.05 loss) + TE on/off perf report + FP8 follow-up plan — gates G12 / G13. Deferred follow-up as of 2026-05-07; not on the pre-training release path. Re-enters active work when a release / perf campaign needs it.
  • 50-iter stability run + TP partitioning end-to-end coverage — superseded by the P19 smoke matrix (10 iters × 4 parallelism configurations); a longer stability sweep is bundled into the deferred P20 perf campaign.
  • P22+ (deferred follow-up) — V4-Flash safetensors round-trip + token-0 logits ≤ 1e-2 vs HF reference — gate G8 / G9. Not on the pre-training release path; activate when SFT / evaluation needs HF weights.

Known risk / follow-up

  • EP routed-output path (bring-up) used all_reduce and emitted a PyTorch autograd warning (c10d::allreduce_ kernel registration); functional for bring-up, gated behind the v4_enable_ep_allreduce_fallback debug toggle on the active path. resolved (P14 phase-2 / P17 audit / P19 runtime verification): the v4_enable_ep_allreduce_fallback flag was removed during the dispatcher migration in P14; the debug gate was deleted in P17 (e591b893); P19 smokes (A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12) confirm zero c10d::allreduce warnings in stderr — the EP routed-output reduction now flows entirely through Megatron's MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher.
  • HC × PP — HyperHead per-stage application destroys K-stream context. resolved (P15 + P19): DeepseekV4TransformerBlock packs [B, S, K, D] → [S*K, B, D] for PP P2P via _lower_streams_out (see the packing sketch after this list) and only applies HyperHead on the post_process stage. CPU-side bit-exact roundtrip covered by test_v4_block_pp.py. Runtime stability across PP > 1 verified by P19 smokes B / C / D with the pp_tensor_shape patch; distributed bit-equality across PP = 1 / 2 / 4 (G6) is a separate audit and is not on the pre-training release path (deferred follow-up).
  • decoder._v4_token_ids attribute stash — leaks state across PP and microbatches. resolved (P15): DeepseekV4Model.forward now passes token_ids=input_ids directly to the decoder; AST audit prevents regressions.
  • No state-dict adapter — V4-Flash safetensors cannot be loaded. Deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need HF weights. Plan-2 P22+ (when activated by an SFT / evaluation campaign) lands the adapter and adds the HF numerical-alignment gate (G8 / G9). Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section).
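
To make the HC × PP packing above concrete, here is a minimal roundtrip sketch in the spirit of _lower_streams_out / _lift_streams_in (shapes follow the item above; the exact permutation order and function bodies are assumptions, not the shipped helpers):

```python
import torch

def lower_streams_out(x: torch.Tensor) -> torch.Tensor:
    """Pack [B, S, K, D] multi-stream activations into the [S*K, B, D]
    layout pipeline P2P expects, without applying HyperHead."""
    B, S, K, D = x.shape
    return x.permute(1, 2, 0, 3).reshape(S * K, B, D)

def lift_streams_in(x: torch.Tensor, hc_mult: int) -> torch.Tensor:
    """Inverse of lower_streams_out: [S*K, B, D] -> [B, S, K, D]."""
    SK, B, D = x.shape
    S = SK // hc_mult
    return x.reshape(S, hc_mult, B, D).permute(2, 0, 1, 3)

# Bit-exact roundtrip (test_v4_block_pp.py performs a CPU-side check of this kind
# per the commit text above).
x = torch.randn(2, 5, 4, 8)
assert torch.equal(lift_streams_in(lower_streams_out(x), hc_mult=4), x)
```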

Initial design / planning materials for integrating DeepSeek-V4 training
support into Primus. Documentation only; no production code changes.

- techblog/: architecture deep dive (CSA / HCA / mHC / Hash routing /
  sqrtsoftplus / clamped SwiGLU / dual RoPE / Muon / MTP) plus 4 PNG
  diagrams rendered via Pillow (see render_diagrams.py).
- plan/: 8-phase roadmap, full code-landing list, per-phase task
  breakdown, and testing strategy.
- progress/status.md: 64-task checklist tracking phase progress.
- develop_deepseek-v4-in-primus.md: top-level goal and development
  cadence.

Made-with: Cursor
Phase 1 of the V4 development plan. Pure config; no Python code paths
exercised yet. Subsequent phases (P2..P4) wire dispatch and modules.

* primus/configs/models/megatron/deepseek_v4_base.yaml
  Extends llama_base, sets model_type=deepseek_v4 and registers V4-specific
  defaults (hc_mult, hybrid_attention_*, q_lora_rank, attn_sink, hash routing,
  swiglu_limit, dual-RoPE knobs, etc.).
* primus/configs/models/megatron/deepseek_v4_flash.yaml
  Hyperparams from DeepSeek-V4-Flash/config.json.
* primus/configs/models/megatron/deepseek_v4_pro.yaml
  Hyperparams from DeepSeek-V4-Pro/config.json.
* examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml
  Training scaffold; parallelism / perf knobs are conservative and will be
  retuned during the perf phase.
* primus/backends/megatron/training/tokenizer/tokenizer.py
  Add DeepSeekV4Tokenizer to CUSTOM_TOKENIZER_TYPES so _add_tokenizer_args
  accepts it.

Note: V4 fields do not need to be registered in Megatron's argparse —
Primus's merge_namespace mechanism (train_runtime.py:_initialize_trainer)
copies yaml-only fields onto backend_args after MegatronArgBuilder.update.
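
For illustration, the kind of namespace merge described in this note can be pictured as below (a toy sketch under assumed names; the real train_runtime.py logic differs):

```python
from argparse import Namespace

def merge_yaml_only_fields(backend_args: Namespace, yaml_cfg: dict) -> Namespace:
    # Copy fields that exist only in the yaml config (e.g. hc_mult,
    # compress_ratios, swiglu_limit) onto the backend argument namespace,
    # so they ride along without being registered in Megatron's argparse.
    for key, value in yaml_cfg.items():
        if not hasattr(backend_args, key):
            setattr(backend_args, key, value)
    return backend_args

args = merge_yaml_only_fields(Namespace(num_layers=8), {"hc_mult": 4, "num_layers": 8})
assert args.hc_mult == 4 and args.num_layers == 8
```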

Made-with: Cursor
Phase 2 of the V4 development plan. Wires the end-to-end dispatch from
yaml.model_type=deepseek_v4 to a primus-owned model_provider + builder,
without changing model behaviour yet. The model class is still a thin
GPTModel subclass; Phase 3 swaps the decoder for the V4 transformer block.

* primus/core/utils/import_utils.py
  Add a deepseek_v4 branch to get_model_provider() that imports
  primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders
  and returns partial(model_provider, deepseek_v4_builder); see the dispatch
  sketch after this file list.

* primus/backends/megatron/megatron_pretrain_trainer.py
  Add a model_type == "deepseek_v4" branch alongside gpt / mamba.
  V4 is a causal-LM with the same data shape as GPT, so we reuse
  pretrain_gpt's forward_step + train_valid_test_datasets_provider;
  only the model_provider itself is V4-specific.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py (new)
  Re-export DeepseekV4Model + deepseek_v4_builder + model_provider.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py (new)
  DeepseekV4Model: thin subclass of GPTModel. P3 will replace
  self.decoder with DeepseekV4TransformerBlock.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py (new)
  deepseek_v4_builder + model_provider. Uses GPT layer specs in P2;
  P3 will swap them for V4 specs.
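
The import_utils.py branch above amounts to a dynamic import plus functools.partial; a simplified sketch (module path and exported names taken from the bullets above, the surrounding function is illustrative):

```python
import importlib
from functools import partial

def get_model_provider(model_type: str):
    # Only the deepseek_v4 branch is sketched here; gpt / mamba branches elided.
    if model_type == "deepseek_v4":
        mod = importlib.import_module(
            "primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
        )
        # Bind the V4 builder so the trainer can later call the provider
        # with the usual (pre_process, post_process, ...) arguments.
        return partial(mod.model_provider, mod.deepseek_v4_builder)
    raise ValueError(f"unsupported model_type: {model_type}")
```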

Made-with: Cursor
Phase 3 of the V4 development plan. Lands the V4 layer-spec helpers and a
transparent V4 transformer-block subclass; attention / MLP behaviour still
matches GPT. Phase 4 will plug HC + hybrid attention into the block, and
Phase 5 will swap in V4 MoE / clamped SwiGLU through the spec-resolution
hooks added here.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py (new)
  Four V4 layer-spec helpers (layer / decoder_block / decoder_layer_specs /
  mtp_block) that delegate to the GPT helpers in P3, plus two resolution
  hooks (_resolve_attention_module_spec / _resolve_mlp_module_spec) that
  return None for now -- P4 / P5 fill these in.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py (new)
  DeepseekV4TransformerBlock: subclasses TransformerBlock and stashes V4
  config fields (hc_mult, compress_ratios, attn_sliding_window, attn_sink,
  q_lora_rank, index_*) onto self so P4 patches don't have to re-walk the
  config. Forward behaviour unchanged in P3.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Override __init__: after super().__init__() builds the stock decoder,
  swap self.decoder for DeepseekV4TransformerBlock (same call signature
  so GPTModel.forward keeps working).

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py
  _resolve_layer_spec / _resolve_mtp_block_spec now route through the
  V4 layer-spec helpers instead of the GPT helpers directly.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4TransformerBlock alongside the existing surface.

Made-with: Cursor
…dual-RoPE)

Phase 4 of the V4 development plan. Lands the full V4 transformer block:
mHC multi-stream residual, per-layer hybrid attention dispatch (Dense /
HCA / CSA), sliding-window mask, attention sink, dual-RoPE with YaRN. The
V4 block becomes a standalone nn.Module that bypasses Megatron's
TransformerBlock + ModuleSpec mechanism so the multi-stream HC loop is
expressed cleanly. P5 will swap the placeholder SwiGLU MLP for V4's MoE.

New modules under primus/backends/megatron/core/transformer/ ::

* hyper_connection.py
  HyperMixer (per-layer mHC mixer), HyperHead (final K->1 collapse),
  sinkhorn_normalize (doubly-stochastic projection). Linear weights /
  scales / biases held in fp32 for stability; fp32 sinkhorn iterates.
  Unit-tested: row/col errors ~1e-6, hc_mult=1 degenerate path exact.
  (sinkhorn_normalize and softmax_with_sink are sketched after this module list.)

* compressor.py
  V4 compressor for KV downsampling. ratio=4 overlap mode (CSA, coff=2),
  ratio=128 non-overlap mode (HCA, coff=1). Internal RMSNorm + learnable
  APE; RoPE applied externally.

* indexer.py
  Sparse top-K position selector for CSA. Internal mini-Compressor builds
  the score grid; causal mask + top-K (-1 fill for invalid positions);
  backward propagates to the indexer params.

* sliding_window_kv.py
  Causal SWA mask + per-query KV index helpers.

* attn_sink.py
  Per-head learnable sink scalar; softmax_with_sink ensures probs.sum() <=
  1 with the sink absorbing the residual mass. Backward propagates to the
  sink params.

* dual_rope.py
  Two RoPE bases (main + compress) with optional YaRN scaling. Partial
  interleaved RoPE: only ``rotary_dim`` of each head's channels rotated;
  remaining channels passed through unchanged.

* deepseek_v4_attention.py
  Shared base for V4 attention: QKV projection (optional Q LoRA),
  partial dual-RoPE, SWA mask, attention sink, output projection.
  ``_extra_kv`` hook lets HCA / CSA augment KV (full pool or sparse top-K).

* hca_attention.py
  Heavily-Compressed Attention. Subclasses DeepseekV4Attention; adds a
  non-overlap Compressor and concatenates the full compressed pool to
  the local KV (always visible).

* csa_attention.py
  Compressed-Sparse Attention. Subclasses DeepseekV4Attention; adds an
  overlap Compressor + Indexer; per-query attention is computed over the
  local SWA + the indexer's top-K compressed positions.
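
Two of the less standard helpers above, sketched in eager PyTorch (shapes, iteration counts, and defaults are assumptions for illustration, not the shipped modules):

```python
import torch

def sinkhorn_normalize(m: torch.Tensor, n_iters: int = 20, eps: float = 1e-6) -> torch.Tensor:
    """Project a non-negative square matrix toward doubly-stochastic form by
    alternating row / column normalization (fp32 iterates for stability)."""
    m = m.float().clamp_min(eps)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # columns sum to 1
    return m

def softmax_with_sink(scores: torch.Tensor, sink: torch.Tensor) -> torch.Tensor:
    """Softmax over keys with a per-head learnable sink logit appended, so the
    returned key probabilities sum to <= 1 and the sink absorbs the residual
    mass. scores: [B, H, Sq, Sk]; sink: [H]."""
    B, H, Sq, Sk = scores.shape
    sink_col = sink.view(1, H, 1, 1).expand(B, H, Sq, 1)
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    return probs[..., :Sk]  # drop the sink column; each row now sums to <= 1

w = sinkhorn_normalize(torch.rand(4, 4))  # rows and columns each ~sum to 1
```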

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  Rewritten as a standalone nn.Module. Holds the dual-RoPE for the whole
  stack, builds DeepseekV4HybridLayer per layer (Dense/HCA/CSA picked
  from compress_ratios), and runs the K-stream HC loop. Forward shape:
  [S, B, D] -> [B, S, D] -> [B, S, K, D] -> ... -> [B, S, D] -> [S, B, D].
  Smoke-tested: 8-layer mixed dense/CSA/HCA + hc_mult=4 forward / backward
  / causality OK.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
  Cleaned up to a placeholder spec. The V4 block is standalone and
  bypasses Megatron's spec mechanism; we still hand a valid GPT-shaped
  spec to GPTModel.__init__ until P6 refactors that allocation away.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Docstring rewritten for the P4 standalone-block layout; pg_collection
  switched to getattr(self, "pg_collection", None) for safety.

* deepseek-v4/develop/progress/status.md, plan/02-phase-details.md
  Track P1..P4 completion; add the argparse-not-needed note (Primus's
  merge_namespace covers V4 fields).

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 28, 2026 11:01

Copilot AI left a comment

Pull request overview

Adds a new model_type=deepseek_v4 to Primus’ Megatron backend, including V4 configs, model/provider dispatch, and an initial DeepSeek-V4 block implementation with HC + hybrid attention building blocks.

Changes:

  • Add DeepSeek-V4 model dispatch + builders and a Primus-owned V4 model package.
  • Introduce V4 config yamls (base/flash/pro) and a MI355X pretrain scaffold yaml.
  • Implement core V4 transformer components (HC, dual-RoPE, compressor, indexer, CSA/HCA attention, sliding-window helpers, attention sink).

Reviewed changes

Copilot reviewed 33 out of 37 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| primus/core/utils/import_utils.py | Adds deepseek_v4 branch to resolve the V4 model provider/builder. |
| primus/backends/megatron/megatron_pretrain_trainer.py | Dispatches model_type=deepseek_v4 while reusing GPT data/forward_step plumbing. |
| primus/backends/megatron/training/tokenizer/tokenizer.py | Allows selecting DeepSeekV4Tokenizer via HF tokenizer wrapper. |
| primus/configs/models/megatron/deepseek_v4_{base,flash,pro}.yaml | Adds V4 model configs and defaults. |
| examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml | Adds a training scaffold yaml for MI355X. |
| primus/backends/megatron/core/models/deepseek_v4/* | Adds V4 model/builders/spec placeholders and a standalone V4 block implementation. |
| primus/backends/megatron/core/transformer/* | Implements HC, dual-RoPE, compressor/indexer, CSA/HCA attention, SWA helpers, and attention sink. |
| deepseek-v4/develop/** | Adds development docs/diagrams and planning materials for the V4 integration. |

Comment on lines +34 to +38
# Per-layer compression schedule (from config.json:compress_ratios)
# 0 = uncompressed dense layer (full attention with SWA)
# 4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)
compress_ratios: "[0, 0, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"

Copilot AI Apr 28, 2026

compress_ratios is currently a quoted string, so YAML will parse it as str rather than a list of ints. DeepseekV4TransformerBlock.__init__ does list(compress_ratios) and checks len(...) == num_layers, so this will either explode into a list of characters or fail the length check at runtime. Define this as a real YAML list (no quotes) or normalize the string to List[int] before the block consumes it; also ensure the list length matches num_layers (43).
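
For illustration, one way to normalize such a quoted schedule before the block consumes it (a sketch only; not necessarily the normalization that later landed in the P18 __post_init__):

```python
import ast
from typing import Sequence, Union

def normalize_compress_ratios(value: Union[str, Sequence[int]], num_layers: int) -> tuple:
    # Accept either a real YAML list or the quoted "[0, 0, 4, 128, ...]" string.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    ratios = tuple(int(v) for v in value)
    if len(ratios) != num_layers:
        raise ValueError(f"compress_ratios has {len(ratios)} entries, expected {num_layers}")
    return ratios
```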

Comment on lines +34 to +37
# Per-layer compression schedule (from config.json:compress_ratios)
# 0 = uncompressed dense layer (full attention with SWA)
# 4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)

Copilot AI Apr 28, 2026

The per-layer schedule comments invert CSA vs HCA: per the V4 design and the rest of this PR, compress_ratio == 4 is CSA and compress_ratio == 128 is HCA. Please fix the comment mapping so it matches the implementation.

Comment on lines +3 to +5
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################

Copilot AI Apr 28, 2026

Typo in the referenced source path (DeeSeek-v4-Pro). If this path is meant to mirror the repo directory (DeepSeek-V4-Pro), please correct it to avoid confusion when cross-referencing configs.

Comment on lines +237 to +261
# Sliding-window mask.
window = self.attn_sliding_window
local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype) # [S, S]

# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
# so they can be cat'd along the Sk axis.
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)

# Concatenate sliding-window KV with extra KV (if any).
if extra_k is not None:
k_full = torch.cat([k_local_h, extra_k], dim=1) # [B, Sk_total, H, head_dim]
v_full = torch.cat([v_local_h, extra_v], dim=1)
full_mask = torch.cat([local_mask, extra_mask], dim=-1) # [Sq, Sk_total]
else:
k_full = k_local_h
v_full = v_local_h
full_mask = local_mask

# Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
q_bh = q.transpose(1, 2)
k_bh = k_full.transpose(1, 2)
v_bh = v_full.transpose(1, 2)

out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)

Copilot AI Apr 28, 2026

sliding_window_causal_mask creates a [S, S] mask, but the attention still computes q @ k^T over all S keys (k_local_h is length S). For realistic training lengths (e.g. 4096), this becomes quadratic memory/compute and is very likely to OOM, even though the model is conceptually sliding-window. Consider actually restricting K/V to the window (e.g. gather with sliding_window_kv_indices, unfold, or use a kernel/backend that supports causal sliding-window attention) so Sk_local is window rather than S.

Suggested change
# Sliding-window mask.
window = self.attn_sliding_window
local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype) # [S, S]
# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
# so they can be cat'd along the Sk axis.
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
# Concatenate sliding-window KV with extra KV (if any).
if extra_k is not None:
k_full = torch.cat([k_local_h, extra_k], dim=1) # [B, Sk_total, H, head_dim]
v_full = torch.cat([v_local_h, extra_v], dim=1)
full_mask = torch.cat([local_mask, extra_mask], dim=-1) # [Sq, Sk_total]
else:
k_full = k_local_h
v_full = v_local_h
full_mask = local_mask
# Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
q_bh = q.transpose(1, 2)
k_bh = k_full.transpose(1, 2)
v_bh = v_full.transpose(1, 2)
out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)
# Materialize only the causal sliding-window K/V for each query position
# so local attention scales with `window` rather than the full sequence `S`.
window = self.attn_sliding_window
window = min(window, S)
# Build per-query local indices: for query i attend to [i - window + 1, ..., i].
query_positions = torch.arange(S, device=device)
window_offsets = torch.arange(window, device=device)
local_indices = query_positions.unsqueeze(1) - (window - 1) + window_offsets.unsqueeze(0) # [S, window]
local_valid = local_indices >= 0
local_indices = local_indices.clamp_(min=0, max=S - 1)
# Gather local K/V windows: [B, S, H, D] -> [B, S, window, H, D].
gather_index = local_indices.view(1, S, window, 1, 1).expand(
B, S, window, self.num_heads, self.head_dim
)
k_local = torch.gather(
k_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
1,
gather_index,
)
v_local = torch.gather(
v_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
1,
gather_index,
)
# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim].
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
# Move heads dim before sequence for local attention:
# q: [B, S, H, D] -> [B, H, S, D]
# local k/v: [B, S, window, H, D] -> [B, H, S, window, D]
q_bh = q.transpose(1, 2)
k_local_bh = k_local.permute(0, 3, 1, 2, 4)
v_local_bh = v_local.permute(0, 3, 1, 2, 4)
scale = self.head_dim ** -0.5
local_scores = (q_bh.unsqueeze(-2) * k_local_bh).sum(dim=-1) * scale # [B, H, S, window]
local_scores = local_scores.masked_fill(
~local_valid.view(1, 1, S, window), torch.finfo(local_scores.dtype).min
)
if extra_k is not None:
extra_k_bh = extra_k.transpose(1, 2) # [B, H, S_extra, D]
extra_v_bh = extra_v.transpose(1, 2) # [B, H, S_extra, D]
extra_scores = torch.einsum("bhsd,bhkd->bhsk", q_bh, extra_k_bh) * scale
if extra_mask is not None:
if extra_mask.dtype == torch.bool:
extra_scores = extra_scores.masked_fill(
~extra_mask.view(1, 1, S, -1), torch.finfo(extra_scores.dtype).min
)
else:
extra_scores = extra_scores + extra_mask.view(1, 1, S, -1).to(extra_scores.dtype)
attn_scores = torch.cat([local_scores, extra_scores], dim=-1)
attn_probs = torch.softmax(attn_scores.float(), dim=-1).to(q_bh.dtype)
local_probs = attn_probs[..., :window]
extra_probs = attn_probs[..., window:]
out_local = (local_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
out_extra = torch.einsum("bhsk,bhkd->bhsd", extra_probs, extra_v_bh)
out_bh = out_local + out_extra
else:
attn_probs = torch.softmax(local_scores.float(), dim=-1).to(q_bh.dtype)
out_bh = (attn_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)

Comment on lines +92 to +94
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.


Copilot AI Apr 28, 2026

_gather_topk_kv is annotated as returning torch.Tensor, but it actually returns (gathered, valid). This will confuse type-checkers and readers; update the return annotation (and docstring if needed) to reflect the tuple return type.

Suggested change
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query``[B, S, K, head_dim]``.
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Gather ``[B, P, head_dim]`` along ``P`` per query.
Returns:
A tuple ``(gathered, valid)`` where:
- ``gathered`` has shape ``[B, S, K, head_dim]``.
- ``valid`` has shape ``[B, S, K]`` and marks non-masked indices.

gathered, valid = self._gather_topk_kv(pool_kv, topk_idxs) # [B, S, K, head_dim]

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]

Copilot AI Apr 28, 2026

This statement has no effect (gathered.shape[2] is computed and discarded). It looks like a leftover debug line; please remove it to keep the CSA path clean.

Suggested change
gathered.shape[2]

Comment on lines +10 to +30
num_layers: 61
hidden_size: 7168
num_attention_heads: 128
num_query_groups: 1
kv_channels: 512
qk_pos_emb_head_dim: 64
ffn_hidden_size: 18432
moe_ffn_hidden_size: 3072
moe_shared_expert_intermediate_size: 3072

q_lora_rank: 1536
o_lora_rank: 1024
o_groups: 16

num_experts: 384
moe_router_topk: 6
moe_router_topk_scaling_factor: 2.5

index_topk: 1024

compress_ratios: "[128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"

Copilot AI Apr 28, 2026

Same issue as Flash: compress_ratios is a quoted string, which will not deserialize to Sequence[int] and will break DeepseekV4TransformerBlock's len(compress_ratios) == num_layers check. Please make this a real YAML list (or add a normalization step) and verify the schedule length matches num_layers (61).

Comment on lines +4 to +7
# Reference:
# - deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/config.json
# - deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
# - deepseek-v4/develop/techblog/01-deepseek-v4-architecture-deep-dive.md

Copilot AI Apr 28, 2026

Typo in the reference path (DeeSeek-v4-Pro). Please correct the spelling/casing so the comment points at the actual directory name and is searchable.

# local v we have [B, H, Sk_local, head_dim] (independent of S),
# while sparse v depends on S. Build a "value tensor" with the
# same shape on both paths by broadcasting local v:
v.shape[2]

Copilot AI Apr 28, 2026

This statement has no effect (v.shape[2] is computed and discarded). Please remove it; it reads like a debug remnant and makes the attention path harder to audit.

Suggested change
v.shape[2]

Comment on lines +16 to +24
FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
FONT_BOLD = FONT_REG # we only have Regular; use it for both

OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)


def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)

Copilot AI Apr 28, 2026

FONT_REG is hard-coded to an absolute path under a specific user's home directory, which will fail for other developers/CI. Consider using a repo-relative font path, allowing an environment variable override, and/or falling back to a default font when the file isn't present.

Suggested change
FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
FONT_BOLD = FONT_REG # we only have Regular; use it for both
OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)
def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)
BASE_DIR = os.path.dirname(__file__)
FONT_CANDIDATES = (
os.environ.get("DIAGRAM_FONT"),
os.environ.get("FONT_REG"),
os.path.join(BASE_DIR, "NotoSansSC-Regular.otf"),
os.path.join(BASE_DIR, "fonts", "NotoSansSC-Regular.otf"),
)
def _resolve_font_path() -> str | None:
for path in FONT_CANDIDATES:
if path and os.path.isfile(path):
return path
return None
FONT_REG = _resolve_font_path()
FONT_BOLD = FONT_REG # we only have Regular; use it for both when available
OUT_DIR = os.path.join(BASE_DIR, "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)
def font(sz: int, bold: bool = False) -> ImageFont.ImageFont | ImageFont.FreeTypeFont:
font_path = FONT_BOLD if bold else FONT_REG
if font_path:
return ImageFont.truetype(font_path, sz)
return ImageFont.load_default()

…+ MTP

Phase 5 of the V4 development plan. Lands the FFN side of the V4 stack:
hash-routed and learned top-K MoE, clamped SwiGLU experts, and the V4
MTP head. The V4 block now plugs the V4 MoE in place of P4's placeholder
SwiGLU FFN; the V4 model instantiates a separate-HyperHead MTP block when
mtp_num_layers > 0. Layer-aware YaRN was already done in P4
(DualRoPE.get_rope picks main_rope vs compress_rope by compress_ratio).

New modules:

* primus/backends/megatron/core/transformer/clamped_swiglu.py
  clamped_swiglu(x, alpha=7.0): silu(gate)*up clamped to [-alpha, alpha].
  ClampedSwiGLUMLP wraps it as a fused gate_up + down two-linear MLP.
  Eager (Python) for v1; perf phase will register a fused kernel. (Sketched,
  together with the routers below, after this module list.)

* primus/backends/megatron/core/transformer/moe/v4_hash_router.py
  HashRouter: static [vocab_size, topk] tid2eid table from a fixed seed.
  Active for the first num_hash_layers V4 layers; gives each token a
  permanent expert assignment with uniform weight 1/topk. No learnable
  parameters; deterministic across PP / TP / EP ranks.

* primus/backends/megatron/core/transformer/moe/v4_topk_router.py
  V4TopKRouter: learned gate with score_function in
  {"sqrtsoftplus", "sigmoid", "softmax"}. Top-K with optional renorm
  and optional noaux_tc per-expert bias (selection-only; probs are
  read from the un-biased score).

* primus/backends/megatron/core/transformer/moe/v4_moe.py
  DeepseekV4MoE: per-layer router pick (hash vs learned) + N
  ClampedSwiGLUMLP routed experts + 1 shared expert. Pure-PyTorch
  per-expert dispatch; P6 swaps in Megatron's token-dispatcher /
  grouped-GEMM / EP path.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py
  DeepseekV4MTPBlock: mtp_num_layers V4 layers, each owning its own
  HyperHead (separate from the main decoder's). Shares the dual-RoPE
  with the main decoder. Loss-side wiring is deferred to P6; P5 just
  stands the module up so it can be unit-tested standalone.
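
Eager sketches of three of the FFN-side pieces above, as this commit describes them (the clamp placement follows this P5 text, i.e. post-mul, which the plan-2 review later changed to pre-mul; names, defaults, and the reading of "sqrtsoftplus" are illustrative assumptions, not the shipped modules):

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x: torch.Tensor, alpha: float = 7.0) -> torch.Tensor:
    # As described in this P5 commit: split the fused gate_up projection,
    # then clamp silu(gate) * up to [-alpha, alpha] (post-mul clamp).
    gate, up = x.chunk(2, dim=-1)
    return torch.clamp(F.silu(gate) * up, min=-alpha, max=alpha)

def build_hash_table(vocab_size: int, num_experts: int, topk: int, seed: int) -> torch.Tensor:
    # Static [vocab_size, topk] token-id -> expert-id table from a fixed seed,
    # so routing stays deterministic across PP / TP / EP ranks.
    gen = torch.Generator().manual_seed(seed)
    return torch.stack(
        [torch.randperm(num_experts, generator=gen)[:topk] for _ in range(vocab_size)]
    )

def sqrtsoftplus_topk(logits: torch.Tensor, topk: int):
    # One reading of the "sqrtsoftplus" score function: sqrt(softplus(logit)),
    # then top-K selection with renormalized probabilities.
    scores = F.softplus(logits).sqrt()
    vals, idx = scores.topk(topk, dim=-1)
    probs = vals / vals.sum(dim=-1, keepdim=True)
    return probs, idx
```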

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  DeepseekV4HybridLayer now picks MoE vs dense FFN based on
  num_routed_experts. forward() threads token_ids through to the MoE
  for hash-routed layers. The block-level forward picks token_ids up
  from a model-side stash (_v4_token_ids) so callers don't have to
  thread it explicitly through every layer of the call stack.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Builds DeepseekV4MTPBlock when mtp_num_layers > 0 (post-process
  rank only). forward() overridden to stash input_ids onto self.decoder
  before delegating to GPTModel.forward, so hash-routed MoE layers can
  consume them. Cross-PP propagation of input_ids is a P6 concern.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4MTPBlock alongside the existing surface.

Smoke-tested on dev-box PyTorch container (CPU, 7-test suite):
* clamped_swiglu: clamp tight; MLP forward+backward OK.
* HashRouter: per-token top-K distinct, deterministic across re-runs and
  re-instantiations w/ same seed, probs sum to 1.
* V4TopKRouter: top-K honored, renorm OK, backward OK for all three
  score functions (sqrtsoftplus, sigmoid, softmax).
* DeepseekV4MoE (learned & hash modes): forward + backward; same-token
  determinism for hash routing.
* DeepseekV4TransformerBlock with MoE FFN (4 layers, hc_mult=2, mixed
  dense + CSA): forward + backward; deterministic in eval mode.
* DeepseekV4MTPBlock (mtp_num_layers=2, hc_mult=2): forward + backward;
  per-MTP HyperHead state_dict separation verified.

Deferred to P6 (already noted in progress doc):
* Real Megatron-MoE / token-dispatcher / EP integration -- replaces the
  pure-PyTorch dispatch loop in DeepseekV4MoE.forward.
* MTP loss path wiring -- DeepseekV4Model.forward currently builds the
  MTP block but does not yet feed its outputs through lm_head + the
  auxiliary loss term.
* Numerical alignment vs reference inference/model.py (token-0 logits
  within 1e-2) -- needs reference checkpoint loading.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 28, 2026 11:30
@wenxie-amd wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from b8e47a3 to 5e4008d on April 28, 2026 11:30

Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 42 changed files in this pull request and generated no new comments.

Wire DeepSeek-V4 through Megatron P6 integration (PP local-layer build, EP expert sharding, and compatibility fixes) and add the P7 single-node launcher plus progress docs after passing PP=2/EP=4 smoke run.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 29, 2026 12:25
Add the plan-1 roadmap/detail/test documentation plus progress tracker entries, and update the development target doc with TransformerEngine and Primus-Turbo reference pointers.

Made-with: Cursor
@wenxie-amd wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from ecf8169 to 1030293 on April 29, 2026 12:28

Copilot AI left a comment

Pull request overview

Copilot reviewed 44 out of 48 changed files in this pull request and generated 8 comments.

Comment on lines +85 to +93
gen = torch.Generator(device="cpu").manual_seed(int(seed))
# For each token id, pick ``topk`` distinct expert ids deterministically.
# randperm(num_experts) is a stable, dense permutation; slicing the
# first ``topk`` rows gives uniform-without-replacement routing.
rows = []
for _ in range(vocab_size):
perm = torch.randperm(num_experts, generator=gen)[:topk]
rows.append(perm)
tid2eid = torch.stack(rows, dim=0).long() # [vocab_size, topk]

Copilot AI Apr 29, 2026

HashRouter.__init__ builds tid2eid by looping over every vocab_size entry and calling torch.randperm(num_experts) each time. For real V4 sizes (e.g., vocab≈129k, experts≈384), this will add significant startup time and CPU memory churn at model construction. Consider replacing this with a deterministic hash-based mapping (no table), or generating the table in larger vectorized blocks (and/or only for the subset of vocab used), so model init remains scalable.
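
A sketch of the vectorized alternative this comment suggests (illustrative only, not code from this PR): draw the noise for all vocab rows at once and argsort per row, which keeps determinism under a fixed seed and removes the Python loop, but note it yields a different (though equally valid) table than the per-row randperm loop.

```python
import torch

def build_tid2eid_vectorized(vocab_size: int, num_experts: int, topk: int, seed: int) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)
    # One [vocab_size, num_experts] noise draw; per-row argsort gives a
    # uniform-without-replacement expert permutation per token id.
    noise = torch.rand(vocab_size, num_experts, generator=gen)
    return noise.argsort(dim=-1)[:, :topk]
```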

Comment on lines +88 to +107
def _gather_topk_kv(
self,
pool: torch.Tensor, # [B, P, head_dim]
topk_idxs: torch.Tensor, # [B, S, K] (-1 for masked)
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.

Out-of-range / masked indices (``-1``) are clamped to ``0`` for the
gather, then *zero-masked* afterwards.
"""
B, S, K = topk_idxs.shape
P, Hd = pool.shape[1], pool.shape[2]
valid = topk_idxs >= 0 # [B, S, K]
safe_idx = topk_idxs.clamp(min=0)
# Expand idx to gather along P for each (B, S, K, Hd).
idx_expand = safe_idx.unsqueeze(-1).expand(B, S, K, Hd)
pool_expand = pool.unsqueeze(1).expand(B, S, P, Hd) # [B, S, P, Hd]
gathered = torch.gather(pool_expand, dim=2, index=idx_expand) # [B, S, K, Hd]
gathered = gathered * valid.unsqueeze(-1).to(gathered.dtype)
return gathered, valid

Copilot AI Apr 29, 2026

_gather_topk_kv is annotated as returning only a torch.Tensor, but it actually returns (gathered, valid). This mismatch can break type checking and mislead callers; update the return annotation (and docstring if desired) to Tuple[torch.Tensor, torch.Tensor].

in_dtype = x.dtype
x32 = x.float()
rsqrt = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return (x32 * rsqrt).to(in_dtype) * self.weight

Copilot AI Apr 29, 2026

The standalone RMSNorm implementation returns (…to(in_dtype) * self.weight). If self.weight remains fp32 (common in mixed-precision training), this multiplication will upcast the output back to fp32, potentially defeating BF16 activation flow and increasing memory/compute. Consider multiplying by self.weight.to(in_dtype) (or casting the final result back to in_dtype) so the output dtype stays consistent with the input activation dtype.

Suggested change
return (x32 * rsqrt).to(in_dtype) * self.weight
return (x32 * rsqrt).to(in_dtype) * self.weight.to(in_dtype)

Comment on lines +219 to +221
flat = hidden.reshape(-1, D) # [N, D]
flat.shape[0]


Copilot AI Apr 29, 2026

This flat.shape[0] statement is a no-op and appears to be leftover debug code. Please remove it to keep the forward path minimal and lint-clean.

Comment on lines +2 to +5
# DeepSeek-V4 Pro (large MoE variant).
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################

Copilot AI Apr 29, 2026

Typo in the source comment path: DeeSeek-v4-Pro should be DeepSeek-v4-Pro (consistent with the model naming elsewhere).

Comment on lines +47 to +52
def forward(self, x: torch.Tensor) -> torch.Tensor:
in_dtype = x.dtype
x32 = x.float()
rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return (x32 * rms).to(in_dtype) * self.weight


Copilot AI Apr 29, 2026

Same RMSNorm dtype issue here: (…to(in_dtype) * self.weight) can upcast the output back to fp32 if self.weight is fp32, which is likely under mixed precision. To keep the compressor output in the activation dtype, multiply by self.weight.to(in_dtype) or cast the final output back to in_dtype.

Comment on lines +223 to +224
v.shape[2]
v_local_per_q = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1) # [B, H, S, Sk_local, head_dim]

Copilot AI Apr 29, 2026

This v.shape[2] line is a no-op (likely leftover from debugging) and should be removed to avoid confusing readers and linters.

Comment on lines +46 to +56
class HashRouter(nn.Module):
"""Static hash-based MoE router.

Args:
num_experts: total number of routed experts.
topk: number of experts each token is routed to.
vocab_size: tokenizer vocabulary size; controls the table length.
seed: deterministic seed for the hash; same across all ranks.
dtype: dtype of the returned ``probs`` tensor; defaults to
``torch.float32``.


Copilot AI Apr 29, 2026

This PR introduces substantial new DeepSeek-V4 core modules (attention variants, compressor/indexer, routers, MoE, HC) but does not add unit tests covering their key invariants (e.g., HashRouter determinism, CSA/HCA causality masks, compressor/indexer shape/validity). The repo already has a Python unit test suite under tests/unit_tests/ (including Megatron transformer tests), so please add focused unit tests for these new modules to prevent regressions.

Remove GPT placeholder/super-init spec coupling so DeepSeek-V4 builds decoder directly from DeepSeek ModuleSpec submodule trees, and update Phase 8 progress records to match the finalized implementation and validation status.

Made-with: Cursor
Unify DeepSeek-V4 runtime module selection under DeepSeekV4SpecProvider and migrate attention/MLP/MoE construction to provider-driven ModuleSpec flows with safe local fallbacks. Document and validate the TE CUDA runtime contract, including an explicit fail-fast guard for non-CUDA TE/Turbo inputs and updated Phase 9 progress records in English.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 30, 2026 03:08

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 50 changed files in this pull request and generated 4 comments.

Comment on lines +164 to +167

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.

Copilot AI Apr 30, 2026

There are a couple of no-op statements (e.g., gathered.shape[2]) that have no effect and appear to be leftover debugging. Please remove them to keep the CSA path easier to read/maintain.

Comment on lines +176 to +179
batch, seq = input_ids.shape
position_ids = (
input_ids.new_arange(seq, dtype=input_ids.dtype).unsqueeze(0).expand(batch, -1)
)

Copilot AI Apr 30, 2026

input_ids.new_arange(...) is not a valid PyTorch Tensor API (and there is no local helper/monkeypatch in the repo), so this will raise AttributeError when position_ids is omitted. Use torch.arange(seq, device=input_ids.device, dtype=...) (or the existing Megatron helper used elsewhere) to build position ids.

Comment thread run_deepseek_v4.sh
Comment on lines +37 to +40
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null


Copilot AI Apr 30, 2026

FP8/FP8_RECIPE default to the literal string null, but the script still passes them via --fp8/--fp8_recipe. That makes args.fp8 truthy and can trigger FP8 validation paths (and failures) even when FP8 is intended to be disabled. Only include these CLI flags when PRECISION_TYPE=FP8, or ensure the disabled state is represented in a way the arg parser treats as false/None.

Comment on lines +321 to +324
B, S, D = hidden.shape
flat = hidden.reshape(-1, D) # [N, D]
flat.shape[0]

Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few no-op statements left in forward (e.g., flat.shape[0]) that don't affect execution and look like leftover debugging. Please remove them to avoid confusion and keep the forward path clean.

Copilot uses AI. Check for mistakes.
…chema

Align phase10 DeepSeek-V4 modules on explicit spec/provider contracts by enforcing SharedExpertMLP-only shared experts and introducing a dedicated DeepSeekV4TransformerConfig for V4-only runtime fields. Update builder/spec/docs so training resolves the new config type and tracks activation clamp through model config.

Made-with: Cursor
Fix HC/attention dtype mismatches and tune the DeepSeek-V4 smoke script defaults so the Phase 10 MI355X run completes reliably end-to-end. Add a dedicated Phase 10 convergence report documenting delivered scope, runtime blockers, and remaining tracked items.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 30, 2026 12:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 48 out of 52 changed files in this pull request and generated 5 comments.

Comment thread run_deepseek_v4.sh
Comment on lines +39 to +90
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null

if [ "$PRECISION_TYPE" = "FP8" ]; then
export FP8=${FP8:-hybrid}
export FP8_RECIPE=${FP8_RECIPE:-delayed}
fi

export EXP=${EXP:-examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml}
export BACKEND_PATH=${BACKEND_PATH:-"$(pwd)/third_party/Megatron-LM"}
export PRIMUS_TEAM=${PRIMUS_TEAM:-amd}
export PRIMUS_USER=${PRIMUS_USER:-tas-mi355x-$(date +%Y%m%d)}
export PRIMUS_EXP_NAME=${PRIMUS_EXP_NAME:-deepseek_v4_smoke_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_PP${PRIMUS_PP}_EP${PRIMUS_EP}}

if [ ! -d "$BACKEND_PATH" ] || [ -z "$(ls -A "$BACKEND_PATH" 2>/dev/null)" ]; then
echo "[ERROR] BACKEND_PATH does not exist or is empty: $BACKEND_PATH"
echo "Run: git submodule update --init --recursive"
exit 1
fi

mkdir -p "output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME"

./primus-cli direct \
-- train pretrain --config "$EXP" \
--backend_path "$BACKEND_PATH" \
--num_layers "$PRIMUS_TOTAL_LAYERS" \
--train_iters "$TRAIN_ITERS" \
--lr_warmup_iters 0 \
--lr_decay_iters "$TRAIN_ITERS" \
--micro_batch_size "$MBS" \
--global_batch_size "$GBS" \
--seq_length "$PRIMUS_SEQ_LENGTH" \
--max_position_embeddings "$PRIMUS_MAX_POSITION_EMBEDDINGS" \
--rope_type rope \
--tensor_model_parallel_size "$PRIMUS_TP" \
--pipeline_model_parallel_size "$PRIMUS_PP" \
--expert_model_parallel_size "$PRIMUS_EP" \
--num_experts "$PRIMUS_NUM_EXPERTS" \
--moe_router_topk "$PRIMUS_MOE_TOPK" \
--moe_router_enable_expert_bias "$PRIMUS_MOE_ENABLE_EXPERT_BIAS" \
--moe_ffn_hidden_size "$PRIMUS_MOE_FFN_HIDDEN_SIZE" \
--index_topk "$PRIMUS_INDEX_TOPK" \
--v4_grouped_experts_support_clamped_swiglu "$PRIMUS_V4_GROUPED_EXPERTS_SUPPORT_CLAMPED_SWIGLU" \
--compress_ratios "$PRIMUS_COMPRESS_RATIOS" \
--mtp_num_layers 0 \
--mock_data True \
--use_turbo_attention "$USE_TURBO_ATTENTION" \
--use_turbo_grouped_mlp "$TURBO_USE_GROUPED_MLP" \
--moe_use_legacy_grouped_gemm "$LEGACY_GG" \
--fp8 "$FP8" \
--fp8_recipe "$FP8_RECIPE" \

Copilot AI Apr 30, 2026

FP8/FP8_RECIPE are always passed to primus-cli (defaulting to the literal string null). Other run scripts in this repo gate --fp8 ... args behind an explicit FP8 enable flag; passing null may be rejected by argument parsing or select an unintended FP8 mode. Consider only adding --fp8/--fp8_recipe when PRECISION_TYPE=FP8 (or when a dedicated FP8=True flag is set), and omit them entirely otherwise.

Comment on lines +48 to +52
# Primus-owned: DeepSeek-V4 (Phase 2 stub; full V4 wiring lands in Phase 3+)
if model_type == "deepseek_v4":
deepseek_v4_module = importlib.import_module(
"primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
)

Copilot AI Apr 30, 2026

The comment "Phase 2 stub; full V4 wiring lands in Phase 3+" is now misleading since this PR imports the full DeepSeek-V4 builders/specs. Updating/removing it will avoid confusion when debugging model-type dispatch.

Comment on lines +172 to +185
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.
# This is per-query, shape [S, K]; we keep it on the module as a
# full [B, S, K] additive mask.
sparse_mask = torch.where(valid, 0.0, float("-inf")).to(dtype) # [B, S, K]
self._csa_state = {
"gathered": gathered, # [B, S, K, head_dim]
"sparse_mask": sparse_mask, # [B, S, K]
}

# Tell the parent: no cat-extension; we handle CSA inside
# ``_compute_attention_output``.
return None, None, None

Copilot AI Apr 30, 2026

CSAAttention stores per-forward tensors in self._csa_state and then reads them in _compute_attention_output. This is not safe under pipeline parallel schedules (multiple microbatches in flight) or activation checkpoint recomputation, because the module attribute can be overwritten before earlier microbatches/backward recomputes run, leading to wrong outputs/gradients. Refactor CSA to avoid mutable module-level forward state (e.g., compute the joint local+sparse attention fully inside forward, or thread the gathered KV/mask through the call stack without storing on self).

Comment on lines +172 to +174
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.

Copilot AI Apr 30, 2026

There are two no-op expression statements (gathered.shape[2] and later v.shape[2]) that have no effect and look like leftover debug code. They should be removed to avoid confusion (and to keep linters/type checkers from flagging them).

Comment on lines +187 to +198
decoder = getattr(self, "decoder", None)
if decoder is not None:
decoder._v4_token_ids = input_ids
try:
hidden_states = self.decoder(
hidden_states=decoder_input,
attention_mask=attention_mask,
**kwargs,
)
finally:
if decoder is not None:
decoder._v4_token_ids = None

Copilot AI Apr 30, 2026

DeepseekV4Model.forward stashes input_ids onto decoder._v4_token_ids and clears it immediately after the forward. This breaks any activation checkpoint/recompute that re-invokes decoder/layer forwards during backward (token_ids will be missing) and is also unsafe with pipeline schedules that can have multiple microbatches using the same module instance. Prefer passing token_ids=input_ids explicitly into self.decoder(...) (the decoder already accepts a token_ids kwarg) instead of relying on mutable module state.

Copilot AI review requested due to automatic review settings May 8, 2026 15:00

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 8, 2026 10:11
…) before dispatch + smoke

User directive: P27's first deliverable is now a release-tier
correctness gate that runs the existing G23 / G24 / G26 / G27
equivalence tests on production V4 dimensions (`head_dim=512`, real
`H`, real `swa_window`, real `K_topk`), so any kernel-numerics
regression at the head-dim that plan-4 exists to solve is caught
BEFORE the dispatch + smoke layers are added on top.

Plan-doc updates:
* `01-roadmap.md` — P27 deliverables now lead with the release-tier
  shape gate; Milestone M4 split into M4a (release-tier shape
  correctness) and M4b (smoke).
* `02-phase-details.md` — Phase 27 section rewritten to lead with
  task 1 "Release-tier shape gate (G28)" followed by dispatch
  precedence, run-script docs, dispatch unit test (G29), smoke run +
  smoke gate (G30), and the hand-off note. Design notes explain why
  G28 fits in P27 (not P25 / P26), why eager fp32 reference fits at
  calibrated `S ∈ {512, 1024}`, and why we don't target full
  `S=4096` in unit tests (smoke covers full `S`).
* `03-test-strategy.md` — gate matrix gains G28 (release-tier kernel
  correctness at production V4 dims); previous G28 / G29 renumbered
  to G29 / G30. GPU-toy harness paragraph documents the fast-tier
  vs release-tier split.
* `status.md` — Phase 27 task table re-ordered: G28 row first, then
  dispatch precedence, env-var plumbing, dispatch unit test (G29),
  smoke run, smoke gate (G30), hand-off note.

The actual kernel implementations (P25 / P26) and dispatch plumbing
(P22 / P25 / P26) are unchanged — this is a plan-doc + status-table
reorganisation only.

Co-authored-by: Cursor <cursoragent@cursor.com>
 wiring + EP8 smoke (closes plan-4)

Plan-4 P27 lands the three layered closing gates for the in-tree
Primus Triton V4 attention kernels (P25 dense / HCA, P26 CSA), and
appends the plan-4 hand-off summary.

* G28 release-tier shape gate (lands first per user directive — kernel
  numerics are locked at production V4 dims BEFORE the dispatch +
  smoke layers are stacked on top). Extends `_BASE_SHAPES` in the
  four P25 / P26 fwd/bwd test files with V4-Flash (`H=64,
  head_dim=512, swa_window=128, S=1024, K_topk=512`) and V4-Pro
  (`H=128, head_dim=512, swa_window=128, S=512, K_topk=512`)
  pytest.param entries marked `pytest.mark.slow`. New
  `tests/unit_tests/megatron/transformer/deepseek_v4/conftest.py`
  ships an autouse `torch.cuda.empty_cache()` fixture so the eager
  CSA reference's `[B, H, Sq, K, D]` einsum intermediate doesn't
  accumulate in PyTorch's caching allocator across consecutive
  release-tier tests. Root `tests/unit_tests/conftest.py` registers
  `pytest.mark.slow` and adds a `--run-slow` opt-in (also accepts
  `-m slow`). Release-tier bf16 tolerances bumped to absorb
  `head_dim=512` matmul noise + `tl.atomic_add` jitter on the
  backward (FWD bf16 atol=5e-2; BWD bf16 dq/dk/dv/dgathered atol=2e-1;
  dsink atol=5e-2). 80 / 80 release-tier tests pass on
  mi355-gpu-14 inside dev_primus_wenx_693 in 60.2 s
  (`pytest --run-slow -m slow`); fast-tier suite remains green.

* G29 dispatch precedence + startup log line. New
  `_log_kernel_choice` helper on `DeepseekV4Attention` emits one
  `[V4-attn] Layer N: cr=R, kernel = ...` info line per layer at
  rank 0 so smoke / training logs unambiguously show which kernel
  each layer is firing through. The class docstring grows a
  precedence table covering all three layer kinds plus the
  auto-disable rules for the two flags. New
  `test_v4_p27_dispatch_precedence.py` (16 tests) covers the
  dispatch path at runtime: 7 parametrised log-line tests across
  every (cr, flag) → expected-kernel combo + format / once-per-call
  / layer-number assertions; runtime-mock tests on
  `_attention_forward_via_v4_triton` and `_csa_forward` verifying
  the right kernel symbol is invoked with the right kwargs; two
  auto-disable runtime tests for the cross-layer-kind contracts.
  `run_deepseek_v4.sh` gains a soft `[WARN]` echo when either
  Triton flag is on and `PRIMUS_TP > 1` (kernels are MQA-centric and
  operate per-rank on the local H/TP head slice — TP > 1 should
  work but stays uncovered by plan-4 gates).

* G30 TP=1 PP=1 EP=8 10-iter smoke with both kernels engaged + Turbo
  DeepEP. New `progress/p27/run_smoke_v4_kernels_ep8_pp1.sh` script
  + matching `.gitignore` (excludes `*.log` / `log_*.txt` /
  `debug.log` / `*.tgz` / `*.json` per the plan-3 directive — smoke
  logs MUST NOT land in git). Smoke is green: 10 / 10 iters clean,
  lm_loss converges 11.85 → 11.65, grad norm steady, 0 nan
  iterations, all 8 layers emit the expected kernel-choice log
  line. Steady-state ~17.3 TFLOP/s/GPU (peak ~19.8) at ~500 ms /
  iter — at parity with the P23 Turbo-DeepEP-on-eager-attention
  baseline at the smoke's small seq length (the eager attention is
  matmul-cheap at S=128 and DeepEP dominates iter time; the Triton
  kernels' real win is on full V4-Flash production dims, planned
  as a plan-4 follow-up).

* P27 hand-off block appended to `plan-4/02-phase-details.md`
  recording: commit chain P24 → P27, fast-tier + release-tier
  test totals, G30 smoke perf delta vs. eager / DeepEP baselines,
  and the follow-up list (Megatron-side `layer_number` plumbing,
  full-S=4096 smoke, HCA LSE-merge for Turbo, CSA in-kernel
  gather, FP8, default-True flip).

Plan-4 ends. The two switches (`use_v4_triton_attention`,
`use_v4_triton_csa_attention`) ship at default `False` so this PR
is a pure safety-net add; the Triton path is opt-in via the
existing run-script env vars.

Co-authored-by: Cursor <cursoragent@cursor.com>
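
For orientation, a simplified sketch of the `--run-slow` opt-in, `slow`-marker registration, and autouse cache-clearing fixture described in the G28 bullet above; option handling and fixture names are illustrative, not the exact conftest contents:

```python
# Root conftest sketch: register the slow marker and a --run-slow opt-in that
# otherwise skips the release-tier (production-dim) tests.
import pytest
import torch


def pytest_addoption(parser):
    parser.addoption("--run-slow", action="store_true", default=False,
                     help="run release-tier tests marked pytest.mark.slow")


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: release-tier (production-dim) test")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        return
    skip_slow = pytest.mark.skip(reason="needs --run-slow")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)


# deepseek_v4 conftest sketch: free cached HBM between release-tier tests so
# large eager-reference intermediates don't pile up in the caching allocator.
@pytest.fixture(autouse=True)
def _empty_cuda_cache():
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```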
Copilot AI review requested due to automatic review settings May 9, 2026 00:40

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 8, 2026 19:44
…P27 SHA (e19663f)

Replaces the TBD-p27 / TBD-p27a placeholders in
`deepseek-v4/develop/progress/status.md` and the P27 hand-off
block in `deepseek-v4/develop/plan-4/02-phase-details.md` with the
actual commit SHA `e19663f7`.

Mirrors the P25 / P26 SHA-pin convention (`1ba38ba5` / `36dfca66`).
Plan-4 ends here.

Co-authored-by: Cursor <cursoragent@cursor.com>
…+ flip run_deepseek_v4.sh defaults to PP1EP8 + V4 Triton kernels on

Plan-5 picks up where plan-4 closed (in-tree V4 Triton kernels for
dense / HCA / CSA shipped behind use_v4_triton_attention /
use_v4_triton_csa_attention; +37 % vs P22 eager at smoke seq) and is
strictly scoped to taking V4-Flash EP=8 single-node training from its
plan-4 P27 G30 steady-state up the throughput curve at production-shape
sequence length, by attacking the bottlenecks visible in a real
torch.profiler trace.

Five phases (each P29..P32 task list is seeded now and refined in writing
against the P28 trace when its phase opens; any target under 10 % of step
time gets de-scoped on the spot):

  - P28 (kick-off) — run_deepseek_v4_flash_proxy.sh (V4-Flash widths,
    8 layers, all four perf knobs on: USE_V4_TRITON_ATTENTION,
    USE_V4_TRITON_CSA_ATTENTION, USE_TURBO_DEEPEP, TURBO_USE_GROUPED_MLP)
    calibrated for one MI355X node at EP=8; chrome-trace JSON for one
    steady iter; baseline analysis report (md + html) under
    develop/profile/profile-baseline-ep8-<date>.{md,html}. The
    report's ranked bottleneck list pins the X / Y / Z / W per-phase
    improvement budgets for everything that follows. Gates: G31 (smoke)
    + G31a (report).

  - P29 (small-op fusion) — seeded targets: (a) Q-projection chain,
    (b) KV-projection chain, (c) O-projection group, (d) Compressor +
    Indexer, (e) MoE router. Each behind its own use_v4_fused_* switch
    (default False); functional fusion (not module-level) so Megatron's
    spec walker stays untouched (sketched after this commit message).
    Gates: G32.{a..e} + G33
    (smoke + perf, >= +X % TFLOP/s/GPU vs P28 baseline).

  - P30 (V4 Triton attention perf) — per-shape autotune for FWD + BWD
    (BLOCK_M / BLOCK_N / num_warps / num_stages keyed on H, head_dim,
    swa_window; SMEM heuristic prunes > 160 KiB at compile time);
    persistent FWD kernel; HCA LSE-merge variant (was a plan-4
    follow-up; runs SWA + compressed-pool branches as two flash
    kernels and merges via online softmax — avoids the materialised
    additive bias). use_v4_attention_lse_merge switch (default False);
    G34 asserts FWD + BWD equivalence within bf16 budget.

  - P31 (V4 Triton CSA perf) — in-kernel topk_idxs gather (drops the
    ~64 GiB / microbatch wrapper-side materialisation at V4-Flash
    production dims; this is also the structural fix that eventually
    lets the proxy reach Sq=4096); K-tile prefetching.
    use_v4_csa_in_kernel_gather switch (default False); reuses
    plan-4 G26 / G27 release-tier with dgathered -> dpool assertion.

  - P32 (overlap + recompute) — re-enable --overlap_grad_reduce True
    --overlap_param_gather True (currently False; plan-4 G30 obsoletes
    the plan-2 stability hedge that turned them off); MoE
    shared-expert overlap investigation; recompute granularity
    tuning if P31's in-kernel gather frees enough HBM. Final EP=8
    trace at develop/profile/profile-final-*. Gate: G35 (smoke +
    cumulative perf, >= W % vs P28 baseline) + plan-4 ratchet
    (G23..G30) all green.

Ratchet — every plan-5 phase MUST keep plan-4 gates G23 / G24 / G25 /
G26 / G27 / G28 / G29 / G30 green. Banned-warning ratchet adds
"v4_fused_* compile error" and "DeepEP contract violation". Plan-5 is
measurement-driven: no per-phase budget number is committed in the
plan docs; P28's report owns picking and writing them.

Out of scope (plan-5): FP8 / FP4 / mxfp4 quantised forward,
convergence run, long-context (1M-token) bring-up, multi-node EP
scaling, HF state-dict adapter, V3 / V2 backports of plan-5 fusions.

run_deepseek_v4.sh — defaults flipped per the user directive that
preceded plan-5 planning so the V4-Flash production smoke runs
end-to-end without any env-var override:

  - PRIMUS_PP defaults 2 -> 1
  - PRIMUS_EP defaults 4 -> 8
  - USE_V4_TRITON_ATTENTION defaults False -> True
  - USE_V4_TRITON_CSA_ATTENTION defaults False -> True

All four are still env-var overridable; the existing TP > 1 soft
warning (plan-4 P27) still fires when the V4 Triton kernels are on at
TP > 1. plan-4 G30 evidence (10/10 iters clean, lm_loss 11.85 ->
11.65, throughput 17.3 TFLOP/s/GPU steady at PP=1 EP=8 with both V4
Triton kernels + Turbo DeepEP on) gates this default flip.

Documents:
  - deepseek-v4/develop/plan-5/README.md (overview, scope, phase map)
  - deepseek-v4/develop/plan-5/01-roadmap.md (phase overview, dep
    graph, milestones, top risks, out-of-scope)
  - deepseek-v4/develop/plan-5/02-phase-details.md (per-phase tasks,
    design notes, edge cases; hand-off note placeholder)
  - deepseek-v4/develop/plan-5/03-test-strategy.md (gate matrix
    G31..G35, plan-4 ratchet contract, banned-warning ratchet,
    perf-budget contract)
  - deepseek-v4/develop/progress/status.md (Phase 28..32 task tables
    with TBD-p2X commit cells)
Co-authored-by: Cursor <cursoragent@cursor.com>
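
The "functional fusion behind a default-off switch" pattern referenced in the P29 item above could look roughly like this; the function and flag names are hypothetical, not actual P29 code:

```python
import torch
import torch.nn.functional as F


def qkv_proj(x, wq, wk, wv, use_v4_fused_qkv=False):
    """Hypothetical Q/K/V projection chain; fusion stays functional, so the
    module tree (and Megatron's spec walker) is untouched."""
    if use_v4_fused_qkv:
        # One matmul against the concatenated weight, then split --
        # numerically identical to the three separate projections below.
        w = torch.cat([wq, wk, wv], dim=0)
        qkv = F.linear(x, w)
        return qkv.split([wq.shape[0], wk.shape[0], wv.shape[0]], dim=-1)
    return F.linear(x, wq), F.linear(x, wk), F.linear(x, wv)
```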
Copilot AI review requested due to automatic review settings May 9, 2026 01:39

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 4 commits May 8, 2026 21:40
…+ bottleneck report

Phase 28 ships the foundation that every other plan-5 phase reports
its delta against:

  * `run_deepseek_v4_flash_proxy.sh` — thin wrapper over `run_deepseek_v4.sh`
    that pins V4-Flash production widths (`H=64, head_dim=512,
    num_experts=256, moe_router_topk=6, moe_ffn_hidden_size=2048,
    index_topk=512`), 8 layers, `compress_ratios=[0,0,4,128,4,128,4,0]`
    (every layer kind exercised: 3 dense, 3 CSA, 2 HCA), `TP=1 PP=1 EP=8`,
    and all four perf knobs on (`USE_V4_TRITON_ATTENTION`,
    `USE_V4_TRITON_CSA_ATTENTION`, `USE_TURBO_DEEPEP`,
    `TURBO_USE_GROUPED_MLP`).

  * `progress/p28/run_baseline_trace_ep8.sh` — self-contained trace-capture
    script (mirrors plan-3 P23 / plan-4 P25 pattern; `run_deepseek_v4.sh`
    hard-codes `--disable_tensorboard True` which blocks the profiler's
    TB writer). Captures iter 6 -> 7 (one steady iter) at `Sq=4096` with
    `PROFILE=True --use_pytorch_profiler True`. `progress/p28/.gitignore`
    excludes the raw `*.log` / `*.json` / `*.tgz` outputs.

  * `develop/profile/_tools/render_baseline_report.py` — chrome-trace
    consumer that emits the markdown + HTML bottleneck-analysis report.
    Multi-stream-overlap-aware GPU-active math (interval-union sweep —
    single-stream `Sigma dur` over-counts on multi-stream HIP); top-1
    reduce-kernel signature isolated; module-level CPU op-time numbers
    carry an explicit "nests" caveat so readers do not misread bloated
    `Sigma event dur` totals. Tool is reused by P32 for `profile-final-*`.

  * `develop/profile/profile-baseline-ep8-20260508.{md,html}` — the P28
    report. Headline findings:

      - GPU active = 99.7 % (CPU-bound floor 0.3 %, multi-stream overlap
        factor 1.87x). The pre-trace hypothesis that small-kernel-launch
        tail is the bottleneck DOES NOT HOLD at V4-Flash production
        widths.
      - Top kernel by far is one specific `aten::sum` fp32 reduce
        (`reduce_kernel<512, 1, ReduceOp<float, sum_functor<float, float,
        float>>>`) at 7.61 s (87.3 % of step) over 717 launches x 10.62 ms
        each.
      - V4 Triton attention kernels are BWD-heavy: dense / HCA = 3.90 s
        (44.7 %), CSA = 4.19 s (48.1 %).
      - Comm time = 12.85 ms (0.1 %); HBM peak = 195 / 287 GiB ~ 68 %.
      - Per-phase de-scope decisions (data-driven, 10 % rule): P29 KEEP
        but RESCOPE (drop small-op fusion mandate, redirect to root-
        causing the 7.6 s `aten::sum`); P30 KEEP (BWD prioritised); P31
        KEEP but RESCOPE (HBM motivation gone, kept for BWD speed-up);
        P32 DE-SCOPED.
      - Combined target: plan-5 final >= 110 TFLOP/s/GPU steady at
        Sq=4096 EP=8 single-node (40 %+ over the 78 TFLOP/s/GPU baseline).

  * Calibration outcome: `Sq=4096` (production target) confirmed fitting
    on a single MI355X node at EP=8 (peak rocm HBM 195 GiB / 287 GiB ~
    68 %, 5/5 calibration iters clean, 10/10 baseline iters clean,
    lm_loss 11.16 -> 9.26, 0 NaN, banned-warning grep on plan-3 / plan-4
    ratchet patterns returns 0 for every term). No fall-back to
    Sq=2048 / 1024 / 512 needed; `Sq=4096` adopted as the proxy default.

  * `progress/status.md` Phase 28 row: all 6 P28 task cells checked, with
    `TBD-p28` SHA placeholders that will be SHA-pinned in a follow-up
    commit (mirrors the plan-4 P27 -> 03bacc2 pattern).

Closes plan-5 P28. P29 / P30 / P31 task lists open against this
baseline; P32 is de-scoped pending evidence.

Co-authored-by: Cursor <cursoragent@cursor.com>
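
The "interval-union sweep" mentioned for the GPU-active math boils down to unioning kernel intervals across streams before summing; a small illustrative helper (event fields follow the chrome-trace format, the function name is not from the repo):

```python
def gpu_active_us(kernel_events):
    """Union of [ts, ts + dur) kernel intervals across all streams.

    Summing per-event durations ('Sigma dur') over-counts whenever two
    streams run concurrently; the union counts overlapped time once.
    chrome-trace 'ts' / 'dur' fields are in microseconds.
    """
    intervals = sorted((e["ts"], e["ts"] + e["dur"]) for e in kernel_events)
    active, cur_start, cur_end = 0.0, None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:  # disjoint: close the current run
            if cur_end is not None:
                active += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                   # overlapping: extend the run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        active += cur_end - cur_start
    return active
```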
Plan-5 P28 closed in afd7ea5; replace the `TBD-p28` placeholders in
the Phase 28 task table with the actual commit SHA. No content change
beyond the SHA pin.

Mirrors the plan-4 P27 -> 03bacc2 pattern.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rn + project-wide rules doc

Plan-5 P29 (RESCOPED) — kill the dominant aten::sum fp32 reduce kernel that
the P28 baseline trace pinned at 7.61 s / 87.3 % of step time.

Forensic root cause (progress/p29/refinement.md + _forensics{,2,3}.py):
624 / 717 of all dominant `reduce_kernel<512, 1, ...>` launches (96 %
by count, 99.95 % by Σ kernel duration) come from
`hyper_connection.py:47 sinkhorn_normalize` — 39 reductions / call ×
8 layers × 2 (FWD + AOT-autograd BWD) = 624. Inputs are
`(1, 4096, 4, 4) → keepdim=True dim=-1` fp32. HIP's default
`reduce_kernel<512, 1, ...>` is sized for huge reductions; for our
4-elements-per-output shape it runs at ~250× over the memory-bound
floor (~12.5 % occupancy + 624 × 5 µs launch overhead).

Fix: a `torch.compile(fullgraph=True, dynamic=True)` build of
`sinkhorn_normalize`, cached on `(n_iters, eps, in_dtype)` (shape NOT
in key — `dynamic=True` ships ONE shape-generic Inductor kernel; that
also avoids Dynamo's `cache_size_limit=8` collision when closures
from the same factory share a `code` object). Algorithm is byte-
identical; only the kernel boundary moves. AOT autograd handles BWD.

Behind a default-off feature flag `use_v4_compiled_sinkhorn` plumbed
through `DeepSeekV4TransformerConfig` → V4 base + V4-Flash YAML →
`DeepseekV4HybridLayer` → `HyperMixer.__init__` →
`HyperMixer.compute_weights` → `sinkhorn_normalize(use_compiled=...)`.
`run_deepseek_v4.sh` exports `USE_V4_COMPILED_SINKHORN` (default
`False`); `run_deepseek_v4_flash_proxy.sh` flips its default to `True`
so plan-5 P30 / P31 measure against the post-P29 baseline.

Gates (all green):
* G32 — FWD + BWD parity (compiled vs eager); 10 / 10 tests pass; fast
  tier (B=2, S=64, K=4) atol=1e-5; release tier (B=1, S=4096, K=4)
  marked pytest.mark.slow; cache-hit assertion on second call;
  HyperMixer flag-propagation test included.
* G33a — 10-iter EP=8 proxy smoke; no NaN / Inf / banned warnings;
  `lm_loss[10] = 9.258` vs P28 baseline `9.258` (bit-for-bit); steady
  79.1 vs 77.5 TFLOP/s/GPU (+2.0 %).
* G33b — post-P29 chrome-trace + bottleneck report at
  `develop/profile/profile-after-p29-ep8-20260509.{md,html}`. Budget
  X1 (≥ 50 % drop in aten::sum kernel time) MET BY ~1000×: critical
  shape kernel time 7607.9 ms → 0.2 ms (−99.997 %), launches
  624 → 16. Multi-stream overlap factor collapsed 1.87× → 1.00× —
  explains why wall-time gain is only +2 % despite the kernel-time
  delta (the reduce was a parallel hitchhiker on stream-1; the V4
  Triton attention BWD on stream-0 was already wall-time gating).
  New top wall-time bottleneck: V4 Triton CSA BWD (4.03 s, 46.8 %)
  + V4 Triton dense BWD (3.18 s, 36.8 %) = 92.6 % of step. P30 / P31
  mandate confirmed unchanged.

De-scope decisions recorded at P29 close:
* Hand-Triton fall-back kernel — NOT NEEDED (X1 over-shot ~1000×).
* Global default flip — DEFERRED to P32 hand-off (G35) because the
  +2 % wall-time gain does not justify the cold-compile footgun for
  short-iter unit-test harnesses; proxy default is enough for plan-5
  P30 / P31 perf work.

Also lands `develop/rules/rule.md` — project-wide working rules doc
codifying the standing decisions accumulated across plan-2..plan-5
(review-before-commit, status-pin commit pattern, per-phase summary
file convention introduced at this phase, banned-warning ratchet,
dispatch precedence, DeepEP best-practice config, dtype contract,
TFLOPs counting rule, 10 % de-scope rule, etc.). README + status.md
+ plan-5/01-roadmap.md now point at it as the single source of
truth.

Co-authored-by: Cursor <cursoragent@cursor.com>
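
A minimal sketch of the flag-gated, cached `torch.compile` wrapper described above; the Sinkhorn body is a simplified stand-in, not the real `hyper_connection.py` implementation:

```python
import torch

_COMPILED = {}  # keyed on (n_iters, eps, dtype); shape deliberately NOT in the key


def _sinkhorn_normalize_eager(x, n_iters, eps):
    # Simplified stand-in: alternating keepdim row/column normalisation in fp32.
    x = x.float()
    for _ in range(n_iters):
        x = x / (x.sum(dim=-1, keepdim=True) + eps)
        x = x / (x.sum(dim=-2, keepdim=True) + eps)
    return x


def sinkhorn_normalize(x, n_iters=3, eps=1e-6, use_compiled=False):
    if not use_compiled:
        return _sinkhorn_normalize_eager(x, n_iters, eps)
    key = (n_iters, eps, x.dtype)
    fn = _COMPILED.get(key)
    if fn is None:
        # dynamic=True ships one shape-generic Inductor kernel, so the tiny
        # keepdim reductions stop hitting the oversized HIP reduce kernel and
        # Dynamo's cache is not churned per shape; AOT autograd covers BWD.
        fn = torch.compile(_sinkhorn_normalize_eager, fullgraph=True, dynamic=True)
        _COMPILED[key] = fn
    return fn(x, n_iters, eps)
```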
Pin the Plan-5 P29 tracker rows, post-P29 profile provenance, and
`progress/p29/p29-summary.md` commit chain to the feature commit
`1ea7e7a8`.

This is the standard docs-only status-pin commit that follows every
DeepSeek-V4 phase feature commit.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 07:40

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

# Load the captured P28 baseline chrome-trace JSON (the steady-iter trace the
# bottleneck report is rendered from).
import glob
import json
import os

trace_dir = "output/amd/tas-mi355x-20260509/p28_profile_baseline_pp1_ep8_seq4096/tensorboard"
fp = glob.glob(os.path.join(trace_dir, "*.pt.trace.json"))[0]
print(f"loading {fp} ...")
with open(fp) as f:
    data = json.load(f)
wenxie-amd and others added 3 commits May 9, 2026 03:58
…tiles

Optimize the in-tree V4 Triton attention path by routing dense and HCA layers through kernel-native SWA pruning, including an HCA split-mask mode that preserves the joint softmax while avoiding dead local-key tiles.

Co-authored-by: Cursor <cursoragent@cursor.com>
…nd sparse pool kernels

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 11:15

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 9, 2026 06:26
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 12:32

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 10, 2026 22:44
…D kernels

P32 closes the residual single-kernel attention bottlenecks pinned by the
post-P31b microbenchmark using `progress/p31/bench_csa_attention_ep8.py`
and a new `progress/p32/bench_v4_attention_ep8.py` (dense `cr=0` + HCA
`cr=128` modes; mirrors the CSA bench argparse + timing).

CSA FWD: 48.17 ms -> 3.16 ms (-93.4 %, 15.2x; target <=6 ms MET).
  Replace the monolithic `_v4_csa_attention_pool_fwd_kernel` with three
  kernels joined by an online-softmax LSE merge so the local SWA and
  sparse top-K branches no longer serialise through a single program:
  reuse the P30-pruned dense FWD for local, add
  `_v4_csa_attention_pool_sparse_fwd_kernel` for head-block sparse, and
  add `_v4_csa_attention_lse_merge_kernel` to combine the two.
  `PRIMUS_V4_CSA_FWD_FORCE_MONOLITHIC=1` keeps the legacy kernel.

V4 attention BWD: dense 17.27 ms -> 7.65 ms (-55.7 %, 2.26x; target
  <=15 ms MET); HCA 20.87 ms -> 11.91 ms (-42.9 %, 1.75x; target
  <=15 ms MET). Split `_v4_attention_bwd_kernel` into
  `_v4_attention_bwd_dq_kernel` (parallel over `m`) and
  `_v4_attention_bwd_dkv_kernel` (parallel over `n`) so dQ, dK, dV are
  written atomic-free. MHA fast path drops the kvgroup head loop when
  `HEAD_K == HEAD_Q`. `PRIMUS_V4_ATTN_BWD_FORCE_MONOLITHIC=1` keeps the
  legacy kernel.

CSA BWD: 35.43 ms -> 16.31 ms (-54.0 %, 2.17x; target <=15 ms missed by
  ~1.3 ms). Local SWA reuses the new split dq + dkv kernels with CSA's
  joint `lse / D`. Sparse pool branch defaults to a two-pass segmented
  reduction (`PRIMUS_V4_CSA_BWD_SEGREDUCE=1`): a new
  `_v4_csa_attention_pool_sparse_bwd_partial_kernel` writes per-visit
  dpool contributions to a compact `[B, M, K_topk, D]` partial buffer
  with `tl.store` (no atomics), then a new
  `_v4_csa_attention_pool_segreduce_kernel` folds them into `dpool_fp32`
  segment-by-segment via a sorted inverse index, also atomic-free. Sweep
  + ship `BLOCK_K_PARTIAL=16`, `partial warps=8`, `partial stages=2`,
  `segreduce BLOCK_D=512`, `BLOCK_I=64`, `warps=8`, `stages=3`. Fallback
  retains the legacy gather + dpool atomics path (sparse `BLOCK_K=32`,
  `num_warps=4` defaults after a sweep).

Tests: `pytest -x -q tests/.../deepseek_v4/{test_v4_p25_v4_attention_bwd
  ,test_v4_p26_v4_csa_attention_bwd,test_v4_p31_v4_csa_in_kernel_gather}.py`
  -> 51 passed, 48 skipped. The pre-existing, unrelated `test_v4_mtp`
  failure also reproduces on the `git stash` baseline.

Docs: `progress/status.md` P32 row checked through the kernel work
(EP8 trace + report + `proxy_ep8.md` left for follow-up);
`develop/perf/attention_perf.md` P32 row added with effective TFLOP/s
re-derived from the microbench wall times; full eight-section summary
in `progress/p32/p32-summary.md` per rule R2.1, including the negative
probes (dense-mask scatter, bf16 partial buffer, fused dpool matmuls,
multi-stream overlap).

Bench shape: `B=1, H=64, S=4096, D=512, P=1024, K_topk=512,
swa_window=128, bf16, sink=on` on `mi355-gpu-8` / `dev_primus_wenx_693`,
median of 60 iters after 20 warmup.

Co-authored-by: Cursor <cursoragent@cursor.com>
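
The online-softmax LSE merge that joins the local-SWA and sparse top-K branches can be written at the tensor level as follows; this is a reference formulation of the merge math, not the Triton kernel:

```python
import torch


def lse_merge(out_local, lse_local, out_sparse, lse_sparse):
    """Combine two attention branches computed with independent softmaxes.

    Each branch supplies its softmax-weighted value partial `out_*` of shape
    [..., D] and its per-query log-sum-exp `lse_*` of shape [...]; the merge
    reproduces the joint softmax over the union of both key sets.
    """
    lse = torch.logaddexp(lse_local, lse_sparse)
    w_local = torch.exp(lse_local - lse).unsqueeze(-1)
    w_sparse = torch.exp(lse_sparse - lse).unsqueeze(-1)
    return w_local * out_local + w_sparse * out_sparse, lse
```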
…sive BWD optimizations behind opt-in env vars

After landing the P32 split CSA FWD + atomic-free V4/CSA BWD kernels in
the prior commit, the EP8 proxy trace surfaced an HBM-contention story
that does not show up in standalone microbenchmarks:

- CSA FWD split (local SWA + sparse pool + LSE merge) wins both: bench
  48.17 -> 3.22 ms (-93.3%, 15.0x) AND proxy iter 10 963 -> 891 ms
  (-7.5%) / 711 -> 768 TFLOP/s/GPU (+8.1%). KEEP DEFAULT ON.

- V4 attention BWD split (dQ kernel + dK/dV kernel, atomic-free) wins
  the bench (dense 17.27 -> 7.65 ms, HCA 20.87 -> 11.91 ms; both clear
  <=15 ms target) but regresses EP8 proxy iter time by ~190 ms because
  the split design reads Q / K / V twice (2x HBM traffic per BWD step)
  and loses the bandwidth fight against concurrent MoE work. FLIP
  DEFAULT TO MONOLITHIC; opt in via PRIMUS_V4_ATTN_BWD_USE_SPLIT=1.

- CSA BWD segmented reduction (4 GiB partial buffer + sorted inverse
  index, atomic-free dpool) wins the bench (35.43 -> 16.31 ms, -54%)
  but regresses EP8 proxy iter time by ~40 ms for the same HBM-
  contention reason. FLIP DEFAULT TO gather + atomic_add dpool (the
  P31b path, now ~7.9% faster at 32.62 ms vs P31b 35.43 ms thanks to
  incidental Triton autotuner improvements). Opt in via
  PRIMUS_V4_CSA_BWD_SEGREDUCE=1.

EP8 proxy trace 1778476971738245137 (mi355-gpu-8, dev_primus_wenx_693):

  iter 10            963.0 -> 890.5 ms     (-7.5%)
  TFLOP/s/GPU        709.3 -> 768.4         (+8.1%)
  profiler steady    980.9 -> 899.99 ms    (-8.2%)
  GPU active         940.04 -> 859.54 ms   (-8.6%)
  CSA FWD trace      123.07 -> ~50.6 ms    (-59%)
  V4 attn BWD trace  259.74 -> 256.97 ms   (-1.1%, monolithic kept)
  CSA sparse BWD     80.81 -> 72.54 ms     (-10.2%)
  Attention family   ~493 -> ~410 ms       (-16.8%)

Tests: 114 passed, 88 skipped across test_v4_p25/p26/p27/p31 attention
suites on the shipped defaults.

Docs:
- profile/profile-after-p32-ep8-20260511.{md,html}: full P32 trace report.
- progress/p32/p32-summary.md: rewritten with shipped + opt-in numbers,
  proxy attribution, and HBM-contention rationale for the opt-in gates.
- perf/proxy_ep8.md: P32 row added (890.5 ms / 768.4 TFLOP/s/GPU, 9.92x
  vs P28 baseline 8837 ms).
- perf/attention_perf.md: P32 (shipped) and P32 (bench-opt opt-in) rows.
- progress/status.md: P32 rows checked through trace + report; opt-in
  rationale recorded for the BWD rows.
- progress/p32/_render_html.py: helper that renders the markdown profile
  report to HTML using the same style as the P28..P31b reports.
- progress/p32/run_baseline_trace_ep8_p32.sh: trace script (iter 6->7
  profiler window, same harness as P31b).

Co-authored-by: Cursor <cursoragent@cursor.com>
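
The opt-in wiring amounts to plain env-var gates around the kernel choice; a tiny hypothetical sketch (the helper name is illustrative, the env-var names are the ones this commit introduces):

```python
import os


def _env_flag(name, default="0"):
    # Hypothetical helper: the split / segreduce BWD paths stay opt-in because
    # they win standalone benches but lose to HBM contention in the EP8 proxy.
    return os.environ.get(name, default).lower() in ("1", "true", "yes")


USE_SPLIT_ATTN_BWD = _env_flag("PRIMUS_V4_ATTN_BWD_USE_SPLIT")    # default: monolithic
USE_CSA_BWD_SEGREDUCE = _env_flag("PRIMUS_V4_CSA_BWD_SEGREDUCE")  # default: gather + atomic_add
```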
Copilot AI review requested due to automatic review settings May 11, 2026 05:39

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.
