
feat: deepseek-v4 model support #698

Draft

wenxie-amd wants to merge 58 commits into main from dev/wenx/deepseek-v4

Conversation


@wenxie-amd wenxie-amd commented Apr 28, 2026

Summary

This PR brings DeepSeek-V4 training support into Primus on the Megatron backend.

It now spans the full bring-up arc (P0 – P10), the plan-2 lockdown (P12) that closes out plan-0 / plan-1, and the architecture-faithful rewrite plan for the remaining work (P13 – P21).

Plan timeline

| Plan | Phases | Window | Status |
|---|---|---|---|
| plan-0 (develop/plan-0/) | P0 – P7 | 2026-04-28 | done — initial bring-up, configs, dispatch, layer specs, HC + Hybrid Attn, MoE / activation / RoPE / MTP, single-node smoke (PP=2 EP=4) |
| plan-1 (develop/plan-1/) | P8 – P11 | 2026-04-29 → 2026-04-30 | partial — P8 / P9 / P10 done; P11 paused by the architecture review |
| plan-2 (develop/plan-2/) | P12 – P19 (+ deferred P20 / P21 / P22+) | 2026-05-01 (lockdown) → 2026-05-07 (P19 close-out) | wrapping up — P12 / P13 / P14 / P15 / P16 / P17 / P18 / P19 done; pre-training-first scope means P20 (perf / convergence gates), P21 (docs / handover), and P22+ (HF state-dict adapter) are all deferred follow-ups, gated by the next campaign that needs them |

Plan-2 reshuffle — 2026-05-01 (commit f548d8b2, docs-only)

Pre-training is the release path; HF-weight loading is not required for the release. Plan-2 phase shape after this reshuffle:

| Phase | New scope | Notes |
|---|---|---|
| P17 | Code cleanup (was: state-dict adapter) | retire _RMSNorm duplicates / dual_rope.py / csa_attention.py / hca_attention.py / legacy DeepseekV4MTPBlock / EP all_reduce fallback gate / _v4_token_ids residue / yaml comment fixes. New gate G14 (static dead-code audit). |
| P18 | Spec audit | unchanged; _v4_token_ids removal moved to P17 |
| P19 | Distributed re-validation | unchanged; G6 / G7 still here |
| P20 | Convergence + perf gates | HF numerical-alignment row removed; convergence baseline switched to Megatron-bridge |
| P21 | Docs + handover (slimmed; cleanup tasks moved to P17) | techblog / progress HTML / PPT / develop_deepseek-v4-in-primus.md only |
| P22+ | HF state-dict adapter + V4-Flash checkpoint load (deferred) | Activate when SFT / evaluation needs HF weights. Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section). G8 / G9 deferred from P17; HF-numerical-alignment portion of G12 also deferred here. |

Why plan-2

A code review of dev/wenx/deepseek-v4 against real DeepSeek-V4 (HF reference, NeMo port, official inference) and Megatron's spec + config + provider + submodule + build_module pattern surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW). Highlights:

  • Attention uses separate linear_k_proj / linear_v_proj; real V4 has a single-latent wkv (K = V = kv).
  • q_norm / kv_norm per-head RMSNorms are missing.
  • HashRouter outputs uniform 1/topk weights with no learnable gate.
  • clamped_swiglu clamps post-mul; real V4 clamps pre-mul on silu(gate) and up.
  • No state-dict adapter: official V4-Flash / V4-Pro HF safetensors cannot be loaded.
  • DeepseekV4Attention / DeepseekV4TransformerBlock / DeepseekV4HybridLayer / DeepseekV4MoE reinvent rather than subclass MLASelfAttention / TransformerBlock / TransformerLayer / MoELayer.

Plan-2 (develop/plan-2/) is the architecture-faithful rewrite. Full review in develop/plan-2/00-review-findings.md; rewrite map in 02-target-architecture.md; phase-by-phase plan in 03-phase-details.md; gates in 04-test-strategy.md.

Commit map

| commit | phase | scope |
|---|---|---|
| e194e039 | docs | architecture deep-dive + plan docs |
| d3383c02 | P1 | configs / yaml + tokenizer |
| 8ae10000 | P2 | model_type=deepseek_v4 dispatch |
| a5d2a561 | P3 | layer spec + block scaffolding |
| 3b7ad8c8 | P4 | HC + Hybrid Attention + dual-RoPE |
| 5e4008dc | P5 | V4 MoE + clamped SwiGLU + V4 MTP |
| 97b9720d | P6-P7 | PP/EP integration fixes + single-node run script + progress docs |
| df273a45 | P8(v2) | LanguageModule migration + DeepSeek runtime spec-tree main path |
| e5fec968 | P9(v2) | provider reuse integration + TE CUDA runtime validation/report |
| b38e83cf | P10(v2) | enforce MoE provider path and add V4 config schema |
| 752b7534 | P10(v2) | stabilize smoke runtime and add phase report |
| 636ab3de | P12(v3) | plan-2 lockdown + as-built techblog + roadmap visuals |
| cad0fb38 | P13(v3) | rebase V4 attention on MLASelfAttention (faithful dense path) |
| aa9929a0 | P13(v3) | fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy |
| 1a8bf32e | P14(v3) phase-1 | faithful pre-mul clamped SwiGLU + V4 routers (learnable gate weight; HF-aligned scoring) + G3 + G4 unit tests |
| 5fe8bc3c | P14(v3) phase-2 | DeepseekV4MoE -> MegatronModule + CPU local-experts path; v4_grouped_mlp_spec / v4_router_spec providers; G5 (1L MoE forward <= 1e-3 vs HF reference) |
| 25ccdb5e | P15(v3) | DeepseekV4HybridLayer -> TransformerLayer; DeepseekV4TransformerBlock -> TransformerBlock; HC x PP K-stream packing helpers; HyperHead only on post_process; token_ids forward kwarg replaces decoder._v4_token_ids stash; 16 unit tests |
| 6c5875d4 | P16(v3) | spec-based MTP via upstream MultiTokenPredictionBlock + process_mtp_loss; get_v4_mtp_block_spec helper; layer forward returns (hidden_states, None) for MTP-call compatibility; legacy DeepseekV4MTPBlock deprecated; 17 unit tests |
| f548d8b2 | docs | plan-2 reshuffle — defer HF state-dict adapter to P22+; repurpose P17 for code cleanup; add G14 gate; update roadmap / phase-details / test-strategy / status / README |
| e591b893 | P17(v3) | dead-code retirement (G14): delete legacy DeepseekV4MTPBlock + v4_use_custom_mtp_block / mtp_compress_ratios config fields; introduce shared LocalRMSNorm helper and dedup three _RMSNorm shadows (block.py / attention.py / compressor.py); fix inverted yaml comment (4=CSA / 128=HCA); refresh package __init__ surface; add tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (G14 audit). dual_rope.py is intentionally kept — load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent. |
| b5832672 | P18(v3) | spec-system audit (D1 / D2 / D4 / G1): build_context.resolve_v4_provider(config) caches the V4 provider on the config object (replaces three direct DeepSeekV4SpecProvider(...) call sites); new provider.v4_mlp_activation_func() returns None when use_te_activation_func=False (V4 default — clamped-SwiGLU eager path) and TEActivationOp otherwise; compress_ratios normalized to tuple[int, ...] in __post_init__ (so runtime never re-runs ast.literal_eval); new tests/unit_tests/configs/test_deepseek_v4_yaml.py (G1 schema gate) + tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (D1 / D2 / package-surface AST audits). |
| 83c33ad0 | P19(v3) | distributed re-validation (G10) — two primus-patches that close PP > 1 + VPP under V4: megatron.deepseek_v4.pp_tensor_shape (wraps both schedules.get_tensor_shapes for 1F1B and forward_backward_pipelining_with_interleaving for VPP, multiplies the seq dim by hc_mult so the PP wire carries V4's mHC [S*K, B, D] packing) and megatron.deepseek_v4.pp_token_pre_broadcast (pre-broadcasts all microbatch / chunk input_ids from PP rank 0 across the PP group upfront in a wrapper around get_forward_backward_func, so middle PP stages owning hash-routed MoE layers see real token IDs without deadlocking the interleaved-1F1B / VPP schedule). Drops the in-forward PP broadcast + VPP fail-fast assert from DeepseekV4Model, and stops pre-assigning self.mtp = None so Megatron's set_current_microbatch only iterates model.mtp.layers when MTP is live (matches upstream GPTModel). |
| dba27163 | plan-2 close-out | docs-only — mark the c10d::allreduce_ autograd warning as gone (verified absent in P19 smokes A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12); mark G11 (routing-snapshot diff = 0 across PP / EP changes) as deferred (snapshot dump tooling never landed; not on the pre-training release path); drop Phase 20 / 21 / 22+ sections from status.md (kept as documented intent in plan-2/03-phase-details.md); add deepseek-v4/develop/progress/plan-2-summary.md (stand-alone summary of the architecture-faithful rewrite from P12 → P19, including a per-phase outcome table, a P19 deep-dive, the test-gate ledger, the plan-1 → plan-2 architectural-shift table, and pointers to logs / profile traces); add P19 profile launchers (run_profile_ep8.sh for TP=1 PP=1 EP=8 and run_profile_pp2_ep4.sh for TP=1 PP=2 EP=4) plus deepseek-v4/download_ref.sh (idempotent helper that ensures git-lfs and clones the V4 reference assets — HF transformers, ROCm/TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, and the four DeepSeek-V4 model repos — at pinned commits with GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default). |

What landed in 97b9720d (P6/P7)

P6 integration

  • deepseek_v4_builders.py
    • Align model_provider with upstream Megatron signature (config, pg_collection).
  • deepseek_v4_block.py
    • Build only local PP layers via get_num_layers_to_build + get_transformer_layer_offset.
    • Add set_input_tensor support for non-first PP stages.
    • Normalize/parse compress_ratios more robustly.
    • Return viewless output via make_viewless_tensor for PP schedule compatibility.
  • v4_moe.py
    • Add EP-aware local expert sharding and EP all-reduce merge path.
  • deepseek_v4_model.py
    • Keep custom V4 MTP block behind v4_use_custom_mtp_block; default to native GPTModel MTP path for stable bring-up.
  • dual_rope.py, deepseek_v4_attention.py, attn_sink.py
    • Rename DualRoPE.apply -> apply_rope (avoid nn.Module.apply conflict).
    • Cast attention probs to value dtype before matmul to avoid bf16 mismatch.

P7 bring-up

  • Add run_deepseek_v4.sh (based on run_qwen.bak.sh) with fixed knobs:
    • MBS=1, GBS=16, TP=1, PP=2, EP=4
    • lightweight smoke overrides (num_layers=8, num_experts=8, mtp_num_layers=0)
  • Single-node run passed on:
    • host: uswslocpm2m-106-2371
    • container: dev_primus_wenx_691
    • command: TRAIN_ITERS=3 ./run_deepseek_v4.sh
    • result: reached iteration 3/3, torchrun exit code 0

What landed in df273a45 (P8 v2)

  • deepseek_v4_model.py
    • DeepseekV4Model now inherits from LanguageModule (no longer GPTModel).
    • Remove super_init_transformer_layer_spec path.
    • Build decoder directly from externally supplied DeepSeek runtime transformer_layer_spec.
  • deepseek_v4_layer_specs.py
    • Remove GPT placeholder-spec helpers.
    • Keep DeepSeek-native runtime spec tree only, with full layer/submodules topology.
  • deepseek_v4_builders.py
    • Resolve/pass runtime decoder spec only; remove GPT placeholder/super-init dependence.
  • deepseek-v4/develop/progress/status.md
    • Mark Phase 8(v2) tasks completed and sync notes with the finalized implementation.

Runtime verification:

  • On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
    • Instantiate DeepseekV4Model (LanguageModule-based) with runtime spec tree.
    • Forward pass succeeds with output shape (128, 2, 256).

What landed in e5fec968 (P9 v2)

  • core/extensions/transformer_engine_spec_provider.py
    • Add DeepSeekV4SpecProvider(PrimusTurboSpecProvider) as the V4 provider entry point.
    • Resolve runtime mode (local / te / turbo) and expose V4-specific provider helpers for norm/grouped-MLP selection.
  • deepseek_v4_layer_specs.py
    • Resolve provider once at spec-build time and route norm, attention projection specs, dense projection specs, and MoE grouped path payload through provider-aware ModuleSpec construction.
  • deepseek_v4_attention.py
    • Refactor attention projections to submodules + build_module via DeepseekV4AttentionSubmodules (q_a, q_b, k_proj, v_proj, o_proj) with local fallback.
  • deepseek_v4_block.py
    • Align dense MLP projection initialization with provider-selected linear modules.
    • Add explicit fail-fast guard: TE/Turbo provider mode requires CUDA hidden_states.
  • v4_moe.py
    • Integrate provider grouped-GEMM expert path with safe fallback to local clamped SwiGLU experts.
  • docs/status updates
    • Add deepseek-v4/develop/plan-1/03-phase9-provider-ab-report.md.
    • Update deepseek-v4/develop/progress/status.md with completed Phase 9(v2) items and English-only notes.

Runtime verification:

  • On host uswslocpm2m-106-2371, container dev_primus_wenx_691:
    • local mode forward passes (Linear projections).
    • TE mode module-map build resolves to TELinear projections.
    • TE mode CUDA forward passes (decoder.cuda() + CUDA inputs).
    • TE/Turbo host-input path now fails fast with explicit runtime error instead of low-level GPU fault.

What landed in b38e83cf (P10)

  • core/transformer/moe/v4_moe.py
    • enforce SharedExpertMLP-only shared-expert path (remove local ClampedSwiGLUMLP fallback for shared experts).
    • wire clamped-SwiGLU behavior through SharedExpertMLP config path.
  • core/models/deepseek_v4/deepseek_v4_transformer_config.py
    • add DeepSeekV4TransformerConfig (inherits MLATransformerConfig) with DeepSeek-V4 specific fields used by V4 runtime modules.
    • align aliases/compat in __post_init__ (norm_epsilon, moe_intermediate_size, clamp sync, vocab/padded vocab sync).
  • deepseek_v4_builders.py
    • explicitly build V4 model config via core_transformer_config_from_args(..., config_class=DeepSeekV4TransformerConfig).
  • V4 modules/specs type wiring
    • update V4 builder/spec/model/attention/MoE module signatures and type hints to consume DeepSeekV4TransformerConfig.
  • model yaml
    • add activation_func_clamp_value to primus/configs/models/megatron/deepseek_v4_base.yaml with clamped-SwiGLU comment.
  • docs/progress
    • refresh deepseek-v4/develop/plan-1/* and deepseek-v4/develop/progress/status.md for Phase10 implementation notes.

Validation in this commit:

  • pre-commit hooks passed (isort/autoflake/black/yaml checks).
  • Python syntax compile checks passed for all touched DeepSeek-V4 runtime files.

What landed in 752b7534 (P10 runtime stabilization + report)

  • run_deepseek_v4.sh
    • add smoke-safe overrides for Phase 10 validation (seq_length/max_position_embeddings=128, index_topk=8).
    • set v4_grouped_experts_support_clamped_swiglu=True for grouped-expert clamped-SwiGLU runtime guard compliance.
    • disable overlap_grad_reduce and overlap_param_gather in smoke mode to avoid DDP bucket reset assertion between iterations.
  • primus/backends/megatron/core/transformer/hyper_connection.py
    • align F.linear weight dtype to activation dtype in HyperMixer and HyperHead to fix BF16 runtime mismatch.
  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • cast attention output back to activation dtype before TE output projection to satisfy TE dtype assertions.
  • deepseek-v4/develop/plan-1/04-phase10-moe-distributed-convergence-report.md
    • add formal Phase 10 report covering delivered architecture, runtime blocker/fix chain, and remaining tracked items.

Runtime verification in this update:

  • host: uswslocpm2m-106-2371
  • container: dev_primus_wenx_691
  • command: ./run_deepseek_v4.sh
  • result: reached iteration 10/10, and torchrun finished successfully (code 0).

What landed in 636ab3de (P12 — plan-2 lockdown)

Documentation-only commit; no runtime code changes.

Architecture review

  • Walked the branch e194e039..HEAD against:
    • deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/{config.json, inference/model.py}
    • HF Transformers PR 45616 / 45643 (deepseek-v4/transformers/.../deepseek_v4/)
    • NVIDIA NeMo AutoModel V4 port (deepseek-v4/NVIDIA-NeMo/Automodel/...)
  • Surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW), spanning architecture faithfulness, Megatron reuse / spec violations, distributed correctness, spec-system hygiene, code quality, and testing gaps.

Plan-2 documents (active plan of record)

  • deepseek-v4/develop/plan-2/README.md
  • deepseek-v4/develop/plan-2/00-review-findings.md — full severity-ranked findings ledger
  • deepseek-v4/develop/plan-2/01-roadmap.md — phases P12 → P21, dependency graph, milestones, top risks
  • deepseek-v4/develop/plan-2/02-target-architecture.md — module-by-module rewrite map (rebases on MLASelfAttention, TransformerLayer, TransformerBlock, MoELayer, MultiTokenPredictionBlock, (Yarn)RotaryEmbedding)
  • deepseek-v4/develop/plan-2/03-phase-details.md — granular tasks / exit criteria / risks per phase
  • deepseek-v4/develop/plan-2/04-test-strategy.md — L0..L3 test pyramid and release gates G1..G14 (G8 / G9 marked deferred → P22+ since the 2026-05-01 reshuffle)

Plan-1 phases 9 / 10 / 11 are paused — their tracking rows in status.md remain for history.

Tech blog closure

  • Added deepseek-v4/develop/techblog/02-plan-1-as-built-and-plan-2-pointer.md: closes plan-0 / plan-1 with an as-built note (what shipped, what fell short) and points readers at plan-2.
  • Updated deepseek-v4/develop/techblog/README.md with a banner declaring plan-2 the active plan of record.

Layout cleanup + visuals

  • Renamed develop/plan/ → develop/plan-0/ (the original bring-up plan; tracked as a rename).
  • Added develop/progress/timeline.html: standard system-fonts version of the project timeline; daily-column Gantt with a May 02 – 05 Holiday band; remaining nine phases (P13 – P21) packed into the May 06 – 09 working window.
  • Added develop/progress/build_roadmap_pptx.py (generator) + develop/progress/deepseek_v4_roadmap_v1.pptx (13-slide tech-style deck on a black background, 16:9). Slide 7 — 07 · Development Plan · DEVELOPMENT SCHEDULE — is the day-by-day plan with a 3-row layout (date chip / P0~P7-style phase chip / work-content card) plus a directional arrow with the holiday-gap marker.

Status tracker

  • develop/progress/status.md now has explicit Phase 12 → Phase 21 (v3) sections.
  • All P12 engineering items are checked off; only the stakeholder sign-off on plan-2 scope remains open.
  • The blockers/risks log carries one row per CRIT finding, each pointing at the plan-2 phase that resolves it.

Schedule

  • Block A (landed): 2026-04-28 → 2026-05-01 — plan-0 P0 – P7 + plan-1 P8 – P10 + plan-2 P12 lockdown.
  • Holiday: 2026-05-02 → 2026-05-05.
  • Block B (planned): 2026-05-06 → 2026-05-09 — plan-2 P13 – P21 across 4 working days (P13 + P14 / P15 + P16 / P17 + P18 / P19 + P20 + P21). Note: P17 scope changed to code cleanup per the 2026-05-01 reshuffle; HF state-dict adapter + V4-Flash numerical alignment is deferred to P22+ and not in this Block B window.

What landed in cad0fb38 + aa9929a0 (P13 — faithful attention)

Plan-2 P13 lands in two commits inside the May 06 budget. Both are scoped strictly to the dense / CSA / HCA attention path; faithful MoE / router / MTP are tracked in P14 / P15 / P16. (HF state-dict adapter — originally planned for P17 — has since been deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need it.)

cad0fb38 — V4-faithful attention rooted on MLASelfAttention (dense path)

Rewrite the dense (compress_ratio == 0) path of DeepSeek-V4 attention to be faithful to the released DeepSeek-V4-Flash checkpoint and rooted on Megatron's MLASelfAttention.

  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • New DeepseekV4Attention(MLASelfAttention) subclasses MLA for type identity but bypasses the parent __init__ chain because V4's KV layout differs from MLA's compressed-KV form.
    • Single-latent KV: one linear_kv projection (hidden -> head_dim) feeds both K and V, broadcast across all query heads.
    • Per-head q_rms: parameter-less RMS on head_dim after linear_q_up_proj and before partial RoPE (no q_rms.weight in the released checkpoint).
    • Grouped low-rank O: einsum-based linear_o_a per group + linear_o_b when o_lora_rank > 0. Falls back to MLA-style flat linear_proj when o_lora_rank == 0.
    • Learnable attn_sink: direct nn.Parameter on the attention (matches the released key layers.{i}.attn.attn_sink exactly), with inline softmax-with-sink in _attention_forward.
    • New DeepseekV4AttentionSubmodules dataclass with MLA-canonical names (linear_q_down_proj, linear_q_up_proj, q_layernorm, kv_layernorm) plus V4 extras (linear_kv, linear_o_a, linear_o_b, attn_sink).
    • _LegacyDeepseekV4Attention retained temporarily as the parent for CSAAttention / HCAAttention until the P13 follow-up commit folds the compressor / indexer into the new class.
  • primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
    • Added v4_q_layernorm(), v4_kv_layernorm(), v4_attention_sink() factory methods.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
    • Routes compress_ratio == 0 to the new class with V4-canonical submodules; legacy path retained for {4, 128}.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_transformer_config.py
    • Added o_groups: int = 8 and o_lora_rank: int = 0.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
    • State-dict-key contract; forward shape + finiteness; numerical equivalence vs an inline V4 reference (single-latent KV, partial interleaved RoPE, attn-sink as virtual key column, grouped low-rank O), with attn_sink enabled and disabled (≤ 1e-3); per-head q_rms is parameter-less; o_lora_rank == 0 fallback path; rejection paths.
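
For reference, a minimal sketch of the softmax-with-sink behavior exercised by the last test above (the sink acts as one virtual key column per head whose probability mass is then dropped); tensor layout and names are illustrative, not the exact Primus implementation:

```python
import torch

def softmax_with_sink(scores: torch.Tensor, attn_sink: torch.Tensor) -> torch.Tensor:
    """scores: [B, H, Sq, Sk] attention logits; attn_sink: [H] learnable sink logits."""
    b, h, sq, _ = scores.shape
    # Append one virtual key column per head holding the sink logit.
    sink_col = attn_sink.view(1, h, 1, 1).expand(b, h, sq, 1)
    logits = torch.cat([scores, sink_col], dim=-1)
    probs = torch.softmax(logits, dim=-1)
    # Drop the sink column: it only absorbs probability mass, so the
    # remaining attention rows intentionally sum to less than 1.
    return probs[..., :-1]
```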

aa9929a0 — Fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy

Closes P13 by folding the compressed-branch attention into the V4-faithful class as spec submodules, switching the TP-sensitive projections to ColumnParallel / RowParallel, and retiring the plan-1 legacy attention classes.

  • primus/backends/megatron/core/transformer/deepseek_v4_attention.py
    • DeepseekV4Attention.__init__ accepts compress_ratio in {0, 4, 128}. When compress_ratio > 0 it builds self.compressor from submodules.compressor; when compress_ratio == 4 it also builds self.indexer from submodules.indexer.
    • DeepseekV4AttentionSubmodules extended with compressor and indexer fields.
    • DeepseekV4Attention.forward now dispatches on self.compress_ratio:
      • 0 — dense / SWA over local KV.
      • 128 — HCA: compressed pool with compress-base partial RoPE on indices [0..P), broadcast to H heads, concat to local KV with a compressed-causal mask, joint softmax-with-sink shared across local + compressed branches.
      • 4 — CSA: per-query top-K from compressed pool via Indexer + overlap-mode Compressor, joint softmax-with-sink across local + sparse keys.
    • _LegacyDeepseekV4Attention and _LegacyDeepseekV4AttentionSubmodules removed.
  • primus/backends/megatron/core/transformer/{csa,hca}_attention.py deleted.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
    • _build_v4_attention_submodules now also builds compressor / indexer ModuleSpecs for compressed branches.
    • linear_q_up_proj switched to provider.column_parallel_linear() (gather_output=True); linear_o_b (grouped) and linear_proj (flat-O fallback) switched to provider.row_parallel_linear() (input_is_parallel=False). At tp > 1 the projection weights are sharded across TP ranks; at tp = 1 the result is bit-identical to the previous duplicated path. linear_q_down_proj, linear_kv, linear_o_a stay duplicated; full grouped-O TP plan is tracked in P14.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • _build_attention (no-spec fallback) now constructs DeepseekV4Attention for all branches; the new class builds its own Compressor / Indexer locally when no spec is provided.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py
    • HCA forward shape + finiteness + numerical equivalence vs an inline reference (≤ 1e-3); CSA forward shape + finiteness; spec wiring contract tests for ColumnParallel / RowParallel and Compressor / Indexer presence; torchrun --nproc_per_node=2 parity scaffold (skipif single-rank).

Status

  • deepseek-v4/develop/progress/status.md: P13 fully checked off (including the items previously deferred to the follow-up commit). Items routed to P14 (full grouped-O TP plan) / P22+ — deferred (HF-reference numerical alignment via the state-dict adapter, originally P17) / P19 (full TP=2 sharding-parity bit-equality check) are noted as such on each row.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P13 first commit cad0fb38 (early start; the May 02 – 05 holiday remains).
  • Holiday: 2026-05-02 → 2026-05-05.
  • Block B (planned): 2026-05-06 → 2026-05-09 — P14 – P21 across 4 working days. P13 follow-up aa9929a0 is recorded under May 06 in the daily plan.

What landed in 1a8bf32e (P14 phase-1 — faithful pre-mul clamped SwiGLU + V4 routers)

P14 ships in two commits. This one lands the math + parameter-layout faithfulness so V4-Flash checkpoints will load through the future state-dict adapter (originally P17, now deferred to P22+ by the 2026-05-01 reshuffle) without remapping. The structural refactor (DeepseekV4MoE(MoELayer) subclassing, provider helpers, G5 1L MoE forward) is the P14 phase-2 follow-up.

Activation (G3)

  • primus/backends/megatron/core/transformer/clamped_swiglu.py
    • Replace post-multiplication clamp with V4 pre-multiplication semantics: SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha). New helpers clamped_swiglu_pre_mul(gate, up, alpha) (split inputs) and clamped_swiglu_pre_mul_fused(x, alpha) ([gate | up] last-dim concat for grouped-gemm experts).
    • ClampedSwiGLUMLP now uses separate w1 / w2 / w3 Linears so the released checkpoint (Expert(w1, w2, w3, swiglu_limit)) loads without remapping. Optional fused_gate_up=True fuses the gate / up GEMMs at forward time only; the saved / loaded state_dict keys remain w1.weight / w2.weight / w3.weight.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • _DenseSwiGLUMLP now applies the same pre-mul clamp on its dense head/tail layers; previously it computed vanilla SiLU(gate) * up and ignored swiglu_limit.
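
For clarity, a minimal sketch of the pre-multiplication clamp semantics described above, in both split and fused layouts; the helper names follow the commit text, but treat the exact signatures as assumptions:

```python
import torch
import torch.nn.functional as F

def clamped_swiglu_pre_mul(gate: torch.Tensor, up: torch.Tensor, alpha: float) -> torch.Tensor:
    # V4 semantics: clamp *before* the multiply. The gate is clamped from above only,
    # the up projection is clamped symmetrically to [-alpha, +alpha].
    return F.silu(gate.clamp(max=alpha)) * up.clamp(min=-alpha, max=alpha)

def clamped_swiglu_pre_mul_fused(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # Fused layout used by grouped-GEMM experts: [gate | up] concatenated on the last dim.
    gate, up = x.chunk(2, dim=-1)
    return clamped_swiglu_pre_mul(gate, up, alpha)
```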

Learned router (G4)

  • primus/backends/megatron/core/transformer/moe/v4_topk_router.py
    • Rename V4TopKRouter -> DeepseekV4LearnedRouter (back-compat alias retained).
    • Gate exposed as weight Parameter of shape [num_experts, hidden_size] — matches Megatron's TopKRouter.weight AND HF reference Gate.weight exactly (no gate.weight indirection).
    • expert_bias is selection-only: routing weights gather from the un-biased scores so probs gradient flows to weight, never to expert_bias.
    • Renormalization gated on score_function != "softmax" (HF parity; softmax probs already sum to 1).
    • topk_scaling_factor honors moe_router_topk_scaling_factor (HF route_scale).
    • Score functions: v4_score_fn covers softmax, sigmoid, sqrtsoftplus.
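
A small sketch of the selection-only expert_bias behavior described above, assuming a softmax score function and illustrative shapes: the bias shifts which experts win top-k, while the routing weights still gather from the un-biased scores, so gradients reach only the gate weight:

```python
import torch

def route(hidden: torch.Tensor, weight: torch.Tensor, expert_bias: torch.Tensor, topk: int):
    # hidden: [T, D]; weight: [E, D] gate parameter; expert_bias: [E] selection-only bias.
    scores = torch.softmax(hidden @ weight.t(), dim=-1)           # [T, E], differentiable
    _, indices = torch.topk(scores + expert_bias, topk, dim=-1)   # bias affects selection only
    probs = scores.gather(-1, indices)                             # weights come from un-biased scores
    return probs, indices                                          # softmax path skips renormalization
```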

Hash router (G4)

  • primus/backends/megatron/core/transformer/moe/v4_hash_router.py
    • Rename HashRouter -> DeepseekV4HashRouter (back-compat alias retained).
    • Add learnable weight Parameter same shape as the learned router; previously the hash router emitted uniform 1/topk weights, which broke gradient flow into the gate weights and silently differed from the released checkpoint.
    • tid2eid is now a frozen nn.Parameter(requires_grad=False, dtype=torch.int32) (matches HF reference layout — released checkpoint stores it as a parameter so state-dict round-trips preserve it without polluting the optimizer state).
    • forward(hidden, token_ids) gathers learned scores at the static expert ids prescribed by tid2eid[token_ids]; renorm + scale parity with the learned router.
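
A sketch of the hash-router forward described above, assuming a sigmoid score function, a [num_tokens, hidden] input, and a frozen tid2eid table holding topk static expert ids per vocabulary entry; not the exact module, just the routing math:

```python
import torch

def hash_route(hidden, token_ids, weight, tid2eid, topk: int):
    # hidden: [T, D]; token_ids: [T]; weight: [E, D] learnable gate; tid2eid: [V, topk] frozen int32 table.
    scores = torch.sigmoid(hidden @ weight.t())       # learned scores, gradient flows into weight
    indices = tid2eid[token_ids].long()               # static expert ids per token (no gradient)
    probs = scores.gather(-1, indices)                # gather learned scores at the hashed experts
    probs = probs / probs.sum(dim=-1, keepdim=True)   # renorm parity with the learned router
    return probs, indices
```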

MoE wiring

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • _route now passes (hidden, token_ids) to the hash router; both routers receive hidden_size / score_function / topk_scaling_factor at init.

Tests

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_clamped_swiglu.py — 7 tests cover pre-mul activation vs HF reference (≤ 1e-6 fp32, four alpha values), alpha = 0 disables clamp, fused-vs-split agreement, one-sided gate clamp behavior, w1 / w2 / w3 state-dict keys (no gate_up.weight leak), fused_gate_up forward equivalence, end-to-end ClampedSwiGLUMLP vs HF Expert.forward.
  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_routers.py — 13 tests:
    • Score function: parity vs inline reference for all three functions.
    • Learned router: HF agreement across (softmax × sigmoid × sqrtsoftplus) × (with / without expert_bias) ≤ 1e-6; back-compat alias; gradient flows to gate weight; expert_bias detached from probs graph; softmax skips renorm.
    • Hash router: HF agreement across the three score functions ≤ 1e-6; tid2eid is a frozen Parameter (requires_grad=False, dtype int32); state-dict keys; deterministic table across seeds; OOB / shape-mismatch error paths; gradient flows to weight while tid2eid.grad is None.

Status

  • deepseek-v4/develop/progress/status.md: P14 phase-1 tasks checked off with this commit hash (1a8bf32e); deferred items listed for the phase-2 follow-up; the "HashRouter has no learnable gate weight / clamped SwiGLU clamps post-mul" blocker is marked resolved.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P14 phase-1 commit 1a8bf32e (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P14 phase-2 + P15 – P21 across 4 working days. P13 follow-up aa9929a0 and P14 phase-1 1a8bf32e are recorded under May 01 / 06 in the daily plan.

What landed in 5fe8bc3c (P14 phase-2 — V4 MoE structural bring-up + G5)

Closes plan-2 P14 by bringing DeepseekV4MoE into Megatron's spec lifecycle, exposing a CPU-testable forward path so the MoE math is pinned against the released HF reference, and adding the V4 provider helpers that plan-2 §5 / §6 call for.

DeepseekV4MoE -> MegatronModule

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • Parent class switched from nn.Module to MegatronModule so it inherits the standard config plumbing and integrates with TransformerLayer.mlp via the spec lifecycle.
    • BaseMoELayer-compatible public surface: set_layer_number(layer_number) mirrors BaseMoELayer.set_layer_number; local_expert_indices is exposed as a list attribute.

CPU local-experts path

  • primus/backends/megatron/core/transformer/moe/v4_moe.py
    • When pg_collection is None, __init__ skips the dispatcher / grouped-experts construction and instead builds:
      • local_experts: nn.ModuleList[ClampedSwiGLUMLP] — one ClampedSwiGLUMLP per local expert (mirrors HF reference Expert exactly: separate w1 / w2 / w3 Linears + V4 pre-multiplication clamp).
      • shared_expert: ClampedSwiGLUMLP — a single shared expert with the same activation.
    • _local_experts_forward runs a per-expert dispatch loop matching DeepSeek-V4-Flash/inference/model.py:MoE.forward exactly (for each routed expert, gather routed tokens, multiply by per-token routing weight, accumulate). Production path (pg_collection provided) continues to use the Megatron dispatcher + grouped experts unchanged.
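
A simplified stand-in for the per-expert dispatch loop described above (gather the tokens routed to each expert, scale by the per-token routing weight, accumulate, then add the shared expert); the real _local_experts_forward follows the HF reference more literally:

```python
import torch

def local_experts_forward(hidden, probs, indices, local_experts, shared_expert=None):
    # hidden: [T, D]; probs / indices: [T, topk] routing weights and expert ids.
    out = torch.zeros_like(hidden)
    for eid, expert in enumerate(local_experts):
        token_idx, slot = torch.where(indices == eid)       # tokens routed to this expert
        if token_idx.numel() == 0:
            continue
        routed = expert(hidden[token_idx])                   # ClampedSwiGLUMLP forward
        out.index_add_(0, token_idx, routed * probs[token_idx, slot].unsqueeze(-1))
    if shared_expert is not None:
        out = out + shared_expert(hidden)                    # shared expert sees every token
    return out
```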

Provider helpers (plan-2 P14 §5 / §6)

  • primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py
    • DeepSeekV4SpecProvider.v4_grouped_mlp_spec(swiglu_limit, moe_use_grouped_gemm=True, ...) returns a ready-to-use ModuleSpec(grouped_module, MLPSubmodules) for the V4 MoE expert path. The pre-mul clamp itself is applied via config.activation_func_clamp_value — Megatron's eager glu() (mlp.py:312-321) already implements SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha), which is bit-equal to the HF reference math; the spec only commits to the right grouped module + the column / row-parallel linears.
    • DeepSeekV4SpecProvider.v4_router_spec(learned=True/False) returns a bare ModuleSpec for either DeepseekV4LearnedRouter or DeepseekV4HashRouter.

G5 numerical alignment

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_moe.py — 11 tests:
    • Construction sanity: parent class is MegatronModule; CPU path builds local_experts (ClampedSwiGLUMLP) + shared_expert; the token_dispatcher / grouped_experts attributes stay None; set_layer_number propagates.
    • Learned-router MoE forward vs inline HF reference on a 1L toy across (sqrtsoftplus, sigmoid, softmax) × (shared expert on / off) — ≤ 1e-3 fp32 CPU.
    • Hash-router MoE forward vs HF across the three score functions, with token_ids feeding tid2eid — ≤ 1e-3 fp32 CPU.
    • moe_router_topk_scaling_factor (HF route_scale) propagates to the output.
    • Backward populates grads on router.weight, on the shared expert, and on at least one routed expert's w1 / w2 / w3.
    • Hash layer raises a clear error when token_ids is missing.

Status

  • deepseek-v4/develop/progress/status.md — P14 phase-2 tasks ticked with this commit; the structural row records the MegatronModule-via-CPU-path approach and explicitly defers the TopKRouter-rooted aux-loss / z-loss path to P19 alongside the distributed re-validation matrix (rationale: upstream TopKRouter.__init__ registers CUDA buffers unconditionally, which is impractical for CPU-clean V4 routers; gating that on a device check is out-of-scope for this commit).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P14 phase-2 commit 5fe8bc3c (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P15 – P21 across 4 working days. P13 follow-up aa9929a0, P14 phase-1 1a8bf32e, and P14 phase-2 5fe8bc3c are recorded under May 01 in the daily plan.

What landed in 25ccdb5e (P15 — V4 layer / block subclass refactor + token-ids forward kwarg + HC × PP packing)

Closes plan-2 P15 except the distributed PP-equivalence gate (G6) which is tracked into P19. This commit brings V4's layer / block onto Megatron's TransformerLayer / TransformerBlock parents, drops the decoder._v4_token_ids attribute stash in favor of a real forward kwarg, gates HyperHead to the post_process stage, and extracts HC × PP K-stream packing helpers.

DeepseekV4HybridLayer -> TransformerLayer

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • Parent class switched from GraphableMegatronModule to TransformerLayer. TransformerLayer.__init__ is bypassed (V4's submodule contract differs — no cross-attention, no BDA, V4-specific attention signature); MegatronModule.__init__ is called directly.
    • DeepseekV4HybridLayerSubmodules now extends TransformerLayerSubmodules and uses upstream-canonical field names: input_layernorm / self_attention / pre_mlp_layernorm / mlp. The two V4-specific HC mixer hooks attn_hc / ffn_hc remain, both default to None for hc_mult == 1.
    • The layer's forward signature is now upstream-compatible: (hidden_states, attention_mask=None, *, position_ids=None, token_ids=None, **kwargs). attention_mask is accepted and ignored (V4 manages SWA / sink mask internally); position_ids is consumed from the caller (fallback to arange(S) for tiny smokes); **kwargs lets the layer plug into MultiTokenPredictionLayer (P16) without bespoke adapters.

DeepseekV4TransformerBlock -> TransformerBlock

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
    • Parent class switched from nn.Module to TransformerBlock (init bypass via MegatronModule for CPU instantiability; V4 has its own layer-spec / lift-lower pipeline). Type identity unlocks Megatron isinstance checks + sharded-state-dict integration.
    • HyperHead is built only on the post_process stage. Earlier PP stages forward the K-stream tensor via _lower_streams_out (no per-stage HyperHead), saving memory and removing a correctness drift risk.

HC × PP K-stream packing helpers

  • _lift_streams_in(hidden_states, pre_process, hc_mult) / _lower_streams_out(x, post_process, hc_mult) extracted as module-level helpers in deepseek_v4_block.py.
    • First PP stage: [S, B, D] -> [B, S, K, D] (broadcast across K).
    • Non-first PP stage: [S*K, B, D] -> [B, S, K, D] (unfold packed K).
    • Final stage: [B, S, D] -> [S, B, D] (post-HyperHead transpose).
    • Non-final stage: [B, S, K, D] -> [S*K, B, D] (pack K into seq for PP P2P).
    • Both helpers raise clear errors on shape mismatches.
  • The packing math is intentionally K-folded-into-seq (not the batch axis) so sequence-parallel chunking lines up cleanly; PP P2P doesn't need to know about K.
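
A minimal sketch of the lift/lower packing math listed above, using plain reshapes; the real helpers additionally raise the shape-mismatch errors and handle viewless tensors:

```python
import torch

def lift_streams_in(x: torch.Tensor, pre_process: bool, hc_mult: int) -> torch.Tensor:
    if pre_process:
        # First PP stage: [S, B, D] -> [B, S, K, D], broadcast across the K streams.
        s, b, d = x.shape
        return x.transpose(0, 1).unsqueeze(2).expand(b, s, hc_mult, d).contiguous()
    # Non-first stage: [S*K, B, D] -> [B, S, K, D], unfold the packed K dimension.
    sk, b, d = x.shape
    return x.view(sk // hc_mult, hc_mult, b, d).permute(2, 0, 1, 3).contiguous()

def lower_streams_out(x: torch.Tensor, post_process: bool, hc_mult: int) -> torch.Tensor:
    if post_process:
        # Final stage (post-HyperHead): [B, S, D] -> [S, B, D].
        return x.transpose(0, 1).contiguous()
    # Non-final stage: [B, S, K, D] -> [S*K, B, D], pack K into the seq dim for PP P2P.
    b, s, k, d = x.shape
    return x.permute(1, 2, 0, 3).reshape(s * k, b, d)
```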

Token-ids forward kwarg

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
    • DeepseekV4Model.forward no longer assigns decoder._v4_token_ids (and removes the try/finally cleanup). It now passes token_ids=input_ids and position_ids=position_ids directly to self.decoder(...).
    • The decoder block + each layer consume them as standard forward kwargs and propagate to mlp.forward -> hash_router.forward.
    • An AST-level audit (test_v4_block_pp.py::test_model_forward_does_not_set_decoder_v4_token_ids_attribute) prevents the attribute stash from regressing.

Spec wiring + MTP block update

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py renames the four core fields when constructing DeepseekV4HybridLayerSubmodules: attn_norm -> input_layernorm, attention -> self_attention, ffn_norm -> pre_mlp_layernorm, ffn -> mlp.
  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py switches the per-MTP-layer call to layer(stream, position_ids=..., token_ids=...) (kwarg, not positional) to match the new layer forward signature.

Tests (tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_block_pp.py, 16 tests)

  • Subclass identity: DeepseekV4HybridLayer is a TransformerLayer; DeepseekV4TransformerBlock is a TransformerBlock; DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules and exposes attn_hc / ffn_hc.
  • Lift / lower roundtrip: bit-exact across the four PP-stage permutations (pre_process × post_process), for both single-stream (hc_mult=1) and multi-stream (K=3, K=4).
  • Error paths: misaligned S*K on non-first stage; collapsed input on non-final lower; uncollapsed input on final lower.
  • Token-ids stash: AST audit confirms decoder._v4_token_ids is gone from the model source; token_ids=input_ids kwarg is present.
  • Forward signatures: block.forward exposes position_ids + token_ids kwargs; layer.forward accepts (hidden_states, attention_mask=None, position_ids, token_ids).

Status / blockers

  • deepseek-v4/develop/progress/status.md — Phase 15 tasks ticked except G6 (PP=1 vs PP=2 vs PP=4 equivalence on a 4L toy), which requires distributed init and is tracked into P19 distributed re-validation. The CPU-only sub-gate — _lift_streams_in after _lower_streams_out is bit-exact — is covered by the lift/lower roundtrip tests, which is the math contract a real PP run depends on.
  • Two blocker rows resolved:
    • "Custom V4 block / layer / MoE bypass TransformerBlock / TransformerLayer / MoELayer" — closed by P14 phase-2 + P15.
    • "Token-IDs propagation via decoder._v4_token_ids attribute" — closed by P15.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P15 commit 25ccdb5e (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P16 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, and P15 25ccdb5e are recorded under May 01 in the daily plan.

What landed in 6c5875d4 (P16 — spec-based MTP via MultiTokenPredictionBlock + process_mtp_loss)

Closes plan-2 P16 except the distributed MTP-loss ablation gate (G7), which is tracked into P19 alongside G6. This commit wires V4 onto Megatron's upstream MTP pipeline so the auxiliary multi-token-prediction loss flows through process_mtp_loss (per-depth shifted logits + MTPLossAutoScaler) instead of the standalone primus-owned MTP block. The legacy DeepseekV4MTPBlock remains behind the v4_use_custom_mtp_block config flag for back-compat with research checkpoints (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle) and now emits a DeprecationWarning on construction.

Spec helper (primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp_specs.py, new)

  • get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage) returns
    ModuleSpec(MultiTokenPredictionBlock, submodules=MultiTokenPredictionBlockSubmodules(layer_specs=[...]*mtp_num_layers)).
  • Each per-depth MultiTokenPredictionLayer spec pulls
    • enorm / hnorm / layer_norm from DeepSeekV4SpecProvider.v4_norm_module()
    • eh_proj from provider.column_parallel_linear()
    • mtp_model_layer from the V4 hybrid-layer spec passed in by the model — so each MTP depth shares HC, hash routing, and clamped-SwiGLU with the main decoder exactly.
  • Rejects mtp_num_layers < 1 with a clear ValueError.
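
To make the spec shape concrete, a schematic of what get_v4_mtp_block_spec returns: one per-depth layer spec repeated mtp_num_layers times, all sharing the V4 hybrid-layer spec. The dataclasses below are simplified stand-ins for Megatron's spec types, shown only to illustrate the structure:

```python
from dataclasses import dataclass, field
from typing import Any, List

# Simplified stand-ins for Megatron's ModuleSpec / MTP submodules types (structure only).
@dataclass
class ModuleSpec:
    module: Any
    submodules: Any = None
    params: dict = field(default_factory=dict)

@dataclass
class MTPBlockSubmodules:
    layer_specs: List[ModuleSpec]

def get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage=None) -> ModuleSpec:
    if config.mtp_num_layers < 1:
        raise ValueError("mtp_num_layers must be >= 1 to build an MTP block spec")
    per_depth = ModuleSpec(
        module="MultiTokenPredictionLayer",
        submodules={
            "enorm": "v4_norm_module",                  # from DeepSeekV4SpecProvider.v4_norm_module()
            "hnorm": "v4_norm_module",
            "eh_proj": "column_parallel_linear",        # from provider.column_parallel_linear()
            "mtp_model_layer": transformer_layer_spec,  # the V4 hybrid-layer spec, shared with the decoder
        },
    )
    # Every MTP depth reuses the same per-depth spec.
    return ModuleSpec(
        module="MultiTokenPredictionBlock",
        submodules=MTPBlockSubmodules(layer_specs=[per_depth] * config.mtp_num_layers),
    )
```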

DeepseekV4Model updates (deepseek_v4_model.py)

  • New default path: when mtp_num_layers > 0 and not v4_use_custom_mtp_block, __init__ builds self.mtp = MultiTokenPredictionBlock(spec=get_v4_mtp_block_spec(...)) on stages where mtp_on_this_rank() is True. mtp_on_this_rank is wrapped in try/except so CPU smokes (no parallel_state) do not crash; self.mtp_process is False and self.mtp is None on those paths.
  • Legacy DeepseekV4MTPBlock path stays available behind v4_use_custom_mtp_block; self.mtp_block is the legacy slot, self.mtp is the new spec-based slot. Both are None when MTP is disabled.
  • forward now mirrors GPTModel.forward: runs self.mtp(...) on stages with MTP layers (passing input_ids / position_ids / hidden_states / attention_mask / embedding / packed_seq_params), then on post_process with mtp_num_layers > 0 calls process_mtp_loss(...) which chunks the concatenated hidden states, computes the per-depth shifted MTP loss, and folds it into the gradient via MTPLossAutoScaler.
  • New forward kwargs: loss_mask (forwarded to process_mtp_loss) and packed_seq_params.

Layer / block forward contract

  • DeepseekV4HybridLayer.forward now returns (hidden_states, None) instead of just hidden_states. This matches upstream TransformerLayer (which returns (hidden_states, context)) and is required by MultiTokenPredictionLayer._proj_and_transformer_layer which unpacks hidden_states, _ = self.mtp_model_layer(...).
  • DeepseekV4TransformerBlock's per-layer iteration updates to x, _ = layer(...).
  • Legacy DeepseekV4MTPBlock likewise updates to unpack the tuple.

V4 attention spec advertises attn_mask_type

  • The V4 attention spec now declares params={"compress_ratio": ..., "attn_mask_type": AttnMaskType.causal}. MultiTokenPredictionLayer.__init__ validates the inner layer's self_attention.params['attn_mask_type'] against {padding, causal, no_mask, padding_causal}; without this the MTP block fails to construct. The value is functionally inert for V4 (which manages its own SWA / sink mask).
  • DeepseekV4Attention.__init__ accepts and ignores attn_mask_type plus a **kwargs catch-all so the spec lifecycle keeps working.

Legacy DeepseekV4MTPBlock (deepseek_v4_mtp.py)

  • Module docstring annotated as deprecated (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle).
  • Construction emits a DeprecationWarning pointing users at get_v4_mtp_block_spec. Code path unchanged otherwise.

Tests (tests/.../test_v4_mtp.py, ~17 tests)

  • get_v4_mtp_block_spec structural assertions: outer module is MultiTokenPredictionBlock; layer_specs length matches mtp_num_layers (parametrised 1/2/3); each per-depth spec is a MultiTokenPredictionLayer; the V4 inner layer is threaded through unchanged; norm + linear come from the V4 provider.
  • Rejects mtp_num_layers=0 with a clear ValueError.
  • DeepseekV4HybridLayerSubmodules extends TransformerLayerSubmodules so MTP picks up the GPT path (not Mamba) in its inner-layer-submodules isinstance check.
  • DeepseekV4HybridLayer.forward returns (hidden_states, None) (source-level assertion on return x, None).
  • V4 attention spec advertises AttnMaskType.causal (source-level assertion).
  • Legacy DeepseekV4MTPBlock emits DeprecationWarning on construction.
  • AST audits on deepseek_v4_model.py: process_mtp_loss is called; upstream MTP machinery is imported; spec helper is invoked; v4_use_custom_mtp_block flag is preserved; the mtp_num_layers > 0 guard keeps the no-MTP path inert.

Status / blockers

  • deepseek-v4/develop/progress/status.md — Phase 16 tasks ticked except G7 (MTP loss appears in train log; mtp_num_layers=0 vs mtp_num_layers=1 ablation matches LM loss to 1e-6), which requires distributed init + MultiTokenPredictionBlock runtime (CP / SP plumbing); tracked into P19 distributed re-validation alongside G6.
  • Two new follow-on rows recorded for the cross-cutting layer-tuple return + attention attn_mask_type declarations (both required by upstream MTP wiring).

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P16 commit 6c5875d4 (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P17 – P21 across 4 working days. P14 phase-1 1a8bf32e, P14 phase-2 5fe8bc3c, P15 25ccdb5e, and P16 6c5875d4 are recorded under May 01 in the daily plan.

What landed in e591b893 (P17 — code cleanup, gate G14)

P17 ships the dead-code retirement that was front-loaded from P21 in the 2026-05-01 reshuffle (f548d8b2). With pre-training as the release path, the HF state-dict adapter slot moved out (deferred to P22+) and the cleanup work moved up so P18's spec audit walks a clean tree.

Retired in this commit

  • primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py — the legacy primus-owned DeepseekV4MTPBlock was deprecation-warned since P16 (6c5875d4); the spec-based path (get_v4_mtp_block_spec + upstream MultiTokenPredictionBlock + process_mtp_loss) is the only MTP route now.
  • DeepSeekV4TransformerConfig.v4_use_custom_mtp_block (legacy MTP gate) — removed.
  • DeepSeekV4TransformerConfig.mtp_compress_ratios (legacy-only field) — removed.
  • DeepseekV4Model.__init__ — single MTP branch on the spec path; the if v4_use_custom_mtp_block arm + self.mtp_block field are gone.

Dedup'd in this commit

  • primus/backends/megatron/core/transformer/local_rmsnorm.py (new) — one canonical LocalRMSNorm consumed by deepseek_v4_block.py (input_layernorm / pre_mlp_layernorm / final_layernorm fallback), deepseek_v4_attention.py (q_norm / kv_norm fallback closure), and compressor.py (kv_norm). The three pre-existing _RMSNorm definitions are deleted.
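
For reference, the standard RMSNorm formulation that a helper like LocalRMSNorm consolidates; the exact module lives in local_rmsnorm.py, this is just the textbook shape:

```python
import torch
import torch.nn as nn

class LocalRMSNorm(nn.Module):
    """RMSNorm: x * rsqrt(mean(x^2) + eps), scaled by a learned weight."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()
        x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (self.weight * x).to(dtype)
```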

YAML cleanup

  • deepseek_v4_flash.yaml — inverted comment fixed: 4 = CSA (overlap) and 128 = HCA (non-overlap) match DeepseekV4Attention.forward dispatch.
  • deepseek_v4_pro.yaml + deepseek_v4_base.yaml — same canonical comment block added so all three V4 yamls are self-documenting.

Audit gate G14

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py (new):
    • retired files gone (deepseek_v4_mtp.py, csa_attention.py, hca_attention.py).
    • legacy import path raises ImportError; package __all__ no longer exposes DeepseekV4MTPBlock.
    • DeepSeekV4TransformerConfig no longer carries v4_use_custom_mtp_block / mtp_compress_ratios.
    • AST scan over every V4 source for runtime _v4_token_ids access (Attribute / Assign / Name) — docstring mentions are exempt.
    • AST scan over every V4 source for class _RMSNorm shadow definitions — none allowed.
    • parameterised yaml check that the canonical 4 = CSA / 128 = HCA mapping is documented.

Out of scope (kept, with notes in status.md)

  • primus/backends/megatron/core/transformer/dual_rope.py — load-bearing for V4's CSA / HCA dual-base partial RoPE; Megatron's RotaryEmbedding only supports a single base. Plan-2 was over-eager listing this for retirement; it stays.

What landed in b5832672 (P18 — spec-system audit, gate G1 + D1 / D2 / D4)

P18 closes the spec-system audit findings D1 / D2 / D4 from 00-review-findings.md. Walking a clean tree (after P17) makes the audits crisp.

Provider singleton (D1)

  • primus/backends/megatron/core/models/deepseek_v4/build_context.py (new): resolve_v4_provider(config) caches a single DeepSeekV4SpecProvider on the config object via a private attribute. Different configs get different providers; the cache is GC'd when the config is released.
  • All three direct DeepSeekV4SpecProvider(config=config) call sites migrated to the helper:
    • deepseek_v4_block.py (_build_projection + DeepseekV4MoE shared-expert wiring)
    • deepseek_v4_layer_specs.py
    • deepseek_v4_mtp_specs.py
  • AST audit (test_v4_p18_spec_audit.py::test_no_direct_DeepSeekV4SpecProvider_construction_outside_build_context) rejects future regressions; build_context.py is the only allowed instantiation site.
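
A sketch of the per-config provider cache described above, assuming a private attribute name; the constructor call mirrors the migrated call sites:

```python
from primus.backends.megatron.core.extensions.transformer_engine_spec_provider import (
    DeepSeekV4SpecProvider,
)

_CACHE_ATTR = "_v4_spec_provider"  # assumed name of the private cache attribute

def resolve_v4_provider(config) -> DeepSeekV4SpecProvider:
    """Return the provider cached on this config, constructing it exactly once per config."""
    provider = getattr(config, _CACHE_ATTR, None)
    if provider is None:
        provider = DeepSeekV4SpecProvider(config=config)  # the only allowed instantiation site
        setattr(config, _CACHE_ATTR, provider)
    return provider
```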

Activation-func consistency (D2)

  • New helper DeepSeekV4SpecProvider.v4_mlp_activation_func() returns:
    • None when config.use_te_activation_func is False — the V4 default; needed so Megatron MLP keeps the eager clamped-SwiGLU path (which applies activation_func_clamp_value).
    • TEActivationOp (the TE class, instantiated by Megatron MLP at build) when the user opts into TE.
  • Layer specs + DeepseekV4MoE shared-expert spec switched to the V4 helper. The base provider's activation_func() is unchanged (BackendSpecProvider contract still says "returns a type").

compress_ratios normalization (D4)

  • DeepSeekV4TransformerConfig.__post_init__ calls _normalize_compress_ratios_field on the raw value once, so downstream consumers see tuple[int, ...] (or None). The helper handles strings ("[0, 0, 4, 128, ...]") and real lists.
  • Runtime helpers (_parse_int_sequence / _normalize_compress_ratios in deepseek_v4_block.py) keep accepting both forms for back-compat, but always receive the normalized form on the live path.
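
A minimal sketch of the normalization step described above, accepting both the yaml string form and a real sequence and returning tuple[int, ...]:

```python
import ast
from typing import Optional, Sequence, Tuple, Union

def normalize_compress_ratios(value: Union[str, Sequence[int], None]) -> Optional[Tuple[int, ...]]:
    """Accept "[0, 0, 4, 128]"-style strings or real sequences; emit tuple[int, ...] or None."""
    if value is None:
        return None
    if isinstance(value, str):
        value = ast.literal_eval(value)  # parse the string form once, up front
    return tuple(int(v) for v in value)
```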

Schema gate G1

  • tests/unit_tests/configs/test_deepseek_v4_yaml.py (new): parameterises over deepseek_v4_{base,flash,pro}.yaml:
    • parse_yaml() succeeds; required fields present.
    • DeepSeekV4TransformerConfig builds from the parsed dict.
    • compress_ratios normalized to tuple[int, ...] with no value drift vs the raw schedule.
    • every compress_ratios entry is in {0, 4, 128} (canonical V4 branches).
    • retired P17 fields (v4_use_custom_mtp_block / mtp_compress_ratios) are gone from the dataclass and from each YAML.
    • V4-specific runtime fields (HC, sliding-window, sink, o_groups / o_lora_rank, MoE extras, swiglu_limit) all declared on the dataclass.
    • provider singleton: resolve_v4_provider(cfg_a) returns the same instance on repeated calls; different configs get different providers.
    • v4_mlp_activation_func contract verified for both branches of use_te_activation_func.

Spec audit (light-weight, AST-only)

  • tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py (new):
    • D1 / D2 audits described above.
    • package surface __init__.py __all__ does not re-export DeepseekV4MTPBlock (P17 cross-check).
    • spec builders do not eagerly construct TENorm / TE{Column,Row}ParallelLinear / TELinear / TEActivationOp inside __init__ — they emit ModuleSpec(module=...) references that runtime build_module resolves.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P17 + P18 commits e591b893 + b5832672 (continuing the early start; May 02 – 05 holiday remains).
  • Block B (planned): 2026-05-06 → 2026-05-09 — P19 – P21 across 4 working days. P17 + P18 are recorded under May 01 in the daily plan; P19 (distributed re-validation) is the first item in Block B.

What landed in 83c33ad0 (P19 — distributed re-validation) + dba27163 (plan-2 close-out)

P19 closes the distributed re-validation gate (G10) for the architecture-faithful V4 stack landed across P13 → P18. All four target smokes pass 10/10 iterations on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) are captured for the perf-baseline reference.

Smokes (10 iters each)

| smoke | parallelism | result | gating patch | log |
|---|---|---|---|---|
| A | TP=1 PP=1 EP=1 | 10/10 | none (HC stays in-stage; PP > 1 patches are no-ops) | deepseek-v4/develop/progress/p19/smokeA*.log |
| B | TP=1 PP=2 EP=4 | 10/10 | pp_tensor_shape | p19/smokeB*.log |
| C | TP=1 PP=4 EP=2 | 10/10 | pp_tensor_shape + pp_token_pre_broadcast | p19/smokeC_pp4_ep2_v2.log |
| D | TP=1 PP=2 EP=4 VPP=2 | 10/10 | pp_tensor_shape (also wraps the interleaved schedule) + pp_token_pre_broadcast (upfront) | p19/smokeD_pp2_ep4_vpp2_v2_run3.log |

Profile traces

torch.profiler chrome-trace JSONs (single active step, iter 6 → 7) under the same V4 smoke config:

  • output/amd/tas-mi355x-20260507/p19_profile_pp1_ep8/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=1 EP=8 (~99 MB).
  • output/amd/tas-mi355x-20260507/p19_profile_pp2_ep4/tensorboard/...rank[0].*.pt.trace.json — TP=1 PP=2 EP=4 (~105 MB).

Launchers: deepseek-v4/develop/progress/p19/run_profile_ep8.sh and run_profile_pp2_ep4.sh.

megatron.deepseek_v4.pp_tensor_shape (primus/backends/megatron/patches/deepseek_v4_pp_shape_patches.py)

Wraps two Megatron entry points in megatron.core.pipeline_parallel.schedules so V4's mHC K = hc_mult packing is reflected on the PP wire:

  1. get_tensor_shapes (used by 1F1B): seq dim multiplied by hc_mult so the receive buffer matches [S * K, B, D] instead of the stock [S, B, D].
  2. forward_backward_pipelining_with_interleaving (used by VPP): seq_length kwarg multiplied by hc_mult before the schedule's inline tensor_shape = [seq_length, mbs, hidden] runs.

Both wrappers gate on model_type == "deepseek_v4" + hc_mult > 1 + PP > 1 and are strict no-ops otherwise. Without (2) VPP allocates [S, B, D] recv buffers while the sender emits [S * K, B, D], and _lift_streams_in reshapes the truncated copy — surfaces as DeepseekV4HashRouter: hidden=32 vs token_ids=128.
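
A sketch of the 1F1B half of this patch, assuming each shape returned by schedules.get_tensor_shapes unpacks as (seq, micro_batch, hidden); the VPP half (scaling the interleaved schedule's seq_length kwarg) is omitted:

```python
import functools
from megatron.core.pipeline_parallel import schedules

def patch_pp_tensor_shapes(hc_mult: int, is_deepseek_v4: bool, pp_size: int) -> None:
    """Make 1F1B recv buffers expect V4's [S*K, B, D] packing instead of [S, B, D]."""
    if not (is_deepseek_v4 and hc_mult > 1 and pp_size > 1):
        return  # strict no-op outside the V4 multi-stream case

    original = schedules.get_tensor_shapes

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        shapes = original(*args, **kwargs)
        # Assumed: each shape unpacks as (seq, micro_batch, hidden); scale seq by K = hc_mult.
        return [(s * hc_mult, b, h) for (s, b, h) in shapes]

    schedules.get_tensor_shapes = wrapped
```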

megatron.deepseek_v4.pp_token_pre_broadcast (primus/backends/megatron/patches/deepseek_v4_get_batch_patches.py)

V4's hash-routed MoE layers (the first num_hash_layers) need raw input_ids on every PP stage that owns one, but pretrain_gpt.get_batch returns None on middle PP stages. Two earlier in-loop hooks both deadlocked under VPP — an in-DeepseekV4Model.forward broadcast and a per-call get_batch broadcast each raced the interleaved schedule's pre-warmup recv_forward.wait().

This patch wraps pp_module.get_forward_backward_func so each train_step first runs all num_microbatches × num_chunks PP dist.broadcast collectives upfront, before the schedule's first send / recv, and caches the resulting (tokens, labels, loss_mask, attention_mask, position_ids, packed_seq_params) tuples per (vp_stage, microbatch). A companion wrapper around pretrain_gpt.get_batch consumes the cache when active and falls back to the original implementation otherwise. Cache is reset in a finally after each schedule call. Cost ≈ mbs * seq * 8B per microbatch (~32 KiB / step on the smoke), dwarfed by the activation P2P.
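
A heavily simplified sketch of the pre-broadcast idea (broadcast every microbatch's tokens from PP rank 0 upfront, cache per (chunk, microbatch), and let the get_batch wrapper consume the cache); the group handles, shapes, and hook points into get_forward_backward_func are assumptions, not the actual patch:

```python
import torch
import torch.distributed as dist

_BATCH_CACHE = {}  # (chunk, microbatch) -> tokens; filled before the schedule's first send / recv

def pre_broadcast_tokens(token_batches, pp_group, pp_src_rank, num_chunks):
    """Run every PP broadcast upfront so middle stages never block inside the interleaved schedule."""
    _BATCH_CACHE.clear()
    for mb, tokens in enumerate(token_batches):  # real data on PP rank 0, same-shape buffers elsewhere
        for chunk in range(num_chunks):
            buf = tokens.clone() if dist.get_rank() == pp_src_rank else torch.empty_like(tokens)
            dist.broadcast(buf, src=pp_src_rank, group=pp_group)
            _BATCH_CACHE[(chunk, mb)] = buf

def cached_get_batch(chunk, microbatch, original_get_batch):
    """get_batch wrapper: serve from the pre-broadcast cache when active, else fall back."""
    key = (chunk, microbatch)
    return _BATCH_CACHE[key] if key in _BATCH_CACHE else original_get_batch()
```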

Model-side cleanup (deepseek_v4_model.py, deepseek_v4_layer_specs.py)

  • Drop the in-forward input_ids PP broadcast + VPP fail-fast assert from DeepseekV4Model; the pre-broadcast patch handles both 1F1B and VPP cleanly.
  • Stop pre-assigning self.mtp = None in __init__; Megatron's set_current_microbatch (in cuda_graphs.py) only iterates model.mtp.layers when MTP is actually live, which matches upstream GPTModel. Downstream MTP guards use getattr(self, "mtp", None).
  • Import DeepSeekV4SpecProvider in deepseek_v4_layer_specs.py so the type annotation resolves at module load (NameError surfaced once turbo path was off).

c10d::allreduce_ autograd warning gone

The historical UserWarning: An operator was called with autograd not registered for c10d::allreduce_ came from the early bring-up's "local shard + torch.distributed.all_reduce" path for MoE routed-output aggregation in v4_moe.py. P14 phase-2 migrated MoE to Megatron's token dispatchers (MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher); P17 deleted the v4_enable_ep_allreduce_fallback debug gate; and P19 confirms zero c10d::allreduce hits in stderr across all four smokes + the EP=8 / PP=2 EP=4 profile runs.

dba27163 plan-2 close-out (docs-only)

  • status.md — mark c10d::allreduce_ warning as gone (with the verification log paths); mark G11 as [-] deferred (snapshot dump tooling never landed); drop Phase 20 / 21 / 22+ sections (kept as documented intent in plan-2/03-phase-details.md); refresh the Blockers / Risks log entry for c10d to reference the actual P19 verification rather than "still tracked into P19".
  • deepseek-v4/develop/progress/plan-2-summary.md (new) — stand-alone summary of the plan-2 architecture-faithful rewrite (P12 → P19): per-phase outcome with key commits; P19 deep-dive (smokes / profile traces / patches / c10d verification); test-gate ledger (G1 / G3 / G4 / G5 / G6 / G7 / G11 / G14 + smokes); plan-1 → plan-2 architectural-shift table (attention, MoE, layer / block, MTP, token-IDs path, HC × PP, TP, spec hygiene); explicit deferred / out-of-scope list (G6 distributed, G7 MTP, G11, P20, P21, P22+).
  • P19 profile launchers — run_profile_ep8.sh (TP=1 PP=1 EP=8) and run_profile_pp2_ep4.sh (TP=1 PP=2 EP=4); both set PROFILE=True + disable_tensorboard=False so the existing torch_profiler_patches.py hook captures iter 6 → 7.
  • deepseek-v4/download_ref.sh — idempotent helper that ensures git-lfs and clones the V4 reference assets at pinned commits (HF transformers, ROCm TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, plus DeepSeek-V4-Pro / Flash / Flash-Base / Pro-Base) with GIT_LFS_SKIP_SMUDGE=1 so weights are not downloaded by default.

Schedule

  • Block A (extended): 2026-05-01 — plan-2 P12 → P18 commits (636ab3de → b5832672).
  • Block B (delivered): 2026-05-07 — plan-2 P19 (83c33ad0) + plan-2 close-out (dba27163).
  • Deferred follow-ups: P20 (200-step Megatron-bridge convergence + TE on / off perf report + FP8 follow-up plan), P21 (techblog / progress timeline / PPT refresh), P22+ (HF state-dict adapter + V4-Flash safetensors round-trip / token-0 logits ≤ 1e-2 vs HF reference). All three are documented in plan-2/03-phase-details.md; they re-enter active work when the next campaign (release, downstream integration ask, SFT / eval) needs them.

Test plan

  • P1 – P5 unit / smoke coverage from previous commits.
  • P6 / P7 functional smoke: 1 node, 8 GPUs, PP=2, EP=4, BF16, 3 iters.
  • P8 (v2) runtime instantiate / forward validation in dev_primus_wenx_691.
  • P9 (v2) provider-mode A/B validation (local forward, TE build, TE CUDA forward, TE host-input guard).
  • P10 (v2) iteration 10/10 smoke with grouped-expert clamped-SwiGLU guard.
  • P12 plan-2 lockdown: docs-only commit; pre-commit hooks (isort / autoflake / black) pass.
  • P13 — V4-faithful attention class (dense + HCA + CSA in one MLASelfAttention-rooted module); inline-reference numerical alignment for dense (≤ 1e-3) and HCA (≤ 1e-3); CSA shape / finiteness; linear_q_up_proj / linear_o_b Column / Row parallel; pre-commit hooks (isort / autoflake / black) pass.
  • P13 — HF-reference numerical alignment within 1e-3 (CPU fp32) — release gate G2 / G3 — deferred to P22+ alongside the state-dict adapter (originally tracked into P17; reshuffled 2026-05-01).
  • P13 — TP=2 sharding-parity bit-equality vs duplicated baseline — scaffold landed (skipif single-rank); execution deferred to P19.
  • P14 phase-1 — pre-mul clamped SwiGLU activation + V4 routers (learned + hash); G3 (≤ 1e-6 fp32 vs HF reference) + G4 (identical (probs, indices) + gradient flow on gate weight) covered by test_clamped_swiglu.py + test_v4_routers.py; pre-commit hooks (isort / autoflake / black) pass.
  • P14 phase-2 — DeepseekV4MoE -> MegatronModule + provider v4_grouped_mlp_spec(swiglu_limit) / v4_router_spec(learned) + 1L MoE forward within 1e-3 of HF reference (gate G5) — covered by test_v4_moe.py; pre-commit hooks pass.
  • P15 — DeepseekV4HybridLayer -> TransformerLayer + DeepseekV4TransformerBlock -> TransformerBlock; HyperHead only on post_process; _lift_streams_in / _lower_streams_out packing helpers (CPU-only G6 sub-gate covered by test_v4_block_pp.py); token_ids forward-kwarg threading + decoder._v4_token_ids AST audit; pre-commit hooks pass.
  • P15 — distributed PP=1 / 2 / 4 equivalence on 4L V4 toy — gate G6 — deferred to P19.
  • P16 — spec-based MTP via upstream MultiTokenPredictionBlock + process_mtp_loss; get_v4_mtp_block_spec helper; layer forward returns (hidden_states, None); legacy DeepseekV4MTPBlock deprecated; pre-commit hooks pass.
  • P16 — distributed MTP loss appears in train log; mtp_num_layers=0 matches LM loss to 1e-6 — gate G7 — deferred to P19.
  • P17 — code cleanup: legacy DeepseekV4MTPBlock deleted; v4_use_custom_mtp_block / mtp_compress_ratios config fields removed; three _RMSNorm shadows replaced by shared LocalRMSNorm; yaml comment inversion fixed (4 = CSA / 128 = HCA); package surface refreshed; AST gate G14 green via test_v4_p17_dead_code.py (retired-files check, retired-config-fields check, _v4_token_ids AST scan, _RMSNorm shadow scan, yaml-comment dispatch). dual_rope.py intentionally kept (load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent — documented in status.md). Pre-commit hooks pass.
  • P18 — spec-system audit: provider singleton via build_context.resolve_v4_provider(config) (D1); provider.v4_mlp_activation_func() returns None when use_te_activation_func=False and TEActivationOp otherwise (D2); compress_ratios normalized to tuple[int, ...] in __post_init__ (D4); new tests/unit_tests/configs/test_deepseek_v4_yaml.py (G1 schema gate) + test_v4_p18_spec_audit.py (D1 / D2 / package surface / TE eager-construction AST audits). Pre-commit hooks pass.
  • P19 — distributed re-validation (G10): smokes A 1×8 PP=1 EP=1, B 1×8 PP=2 EP=4, C 1×8 PP=4 EP=2, D 1×8 PP=2 EP=4 VPP=2 all 10/10 iters on mi355-gpu-12 (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / hc_mult=4); two torch.profiler chrome-trace JSONs (EP=8 and PP=2 EP=4) captured. Two primus-patches landed (pp_tensor_shape + pp_token_pre_broadcast); c10d::allreduce_ autograd warning verified absent in stderr across all smokes + profile runs.
  • [-] P19 — routing-snapshot diff = 0 across PP / EP changes — gate G11. deferred: snapshot dump tooling never landed; not on the pre-training release path. Runtime stability of the P15 / P19 patches is covered by the smokes above.
  • P20 — 200-step Megatron-bridge convergence (±0.05 loss) + TE on/off perf report + FP8 follow-up plan — gates G12 / G13. Deferred follow-up as of 2026-05-07; not on the pre-training release path. Re-enters active work when a release / perf campaign needs it.
  • 50-iter stability run + TP partitioning end-to-end coverage — superseded by the P19 smoke matrix (10 iters × 4 parallelism configurations); a longer stability sweep is bundled into the deferred P20 perf campaign.
  • P22+ (deferred follow-up) — V4-Flash safetensors round-trip + token-0 logits ≤ 1e-2 vs HF reference — gate G8 / G9. Not on the pre-training release path; activate when SFT / evaluation needs HF weights.

Known risk / follow-up

  • EP routed-output path (bring-up) used all_reduce and emitted a PyTorch autograd warning (c10d::allreduce_ kernel registration); functional for bring-up, gated behind the v4_enable_ep_allreduce_fallback debug toggle on the active path. resolved (P14 phase-2 / P17 audit / P19 runtime verification): the v4_enable_ep_allreduce_fallback flag was removed during the dispatcher migration in P14; the debug gate was deleted in P17 (e591b893); P19 smokes (A/B/C/D + EP=8 / PP=2 EP=4 profile runs on mi355-gpu-12) confirm zero c10d::allreduce warnings in stderr — the EP routed-output reduction now flows entirely through Megatron's MoEAlltoAllTokenDispatcher / MoEFlexTokenDispatcher.
  • HC × PP — HyperHead per-stage application destroys K-stream context. resolved (P15 + P19): DeepseekV4TransformerBlock packs [B, S, K, D] → [S*K, B, D] for PP P2P via _lower_streams_out (see the packing sketch after this list) and only applies HyperHead on the post_process stage. CPU-side bit-exact roundtrip covered by test_v4_block_pp.py. Runtime stability across PP > 1 verified by P19 smokes B / C / D with the pp_tensor_shape patch; distributed bit-equality across PP = 1 / 2 / 4 (G6) is a separate audit and is not on the pre-training release path (deferred follow-up).
  • decoder._v4_token_ids attribute stash — leaks state across PP and microbatches. resolved (P15): DeepseekV4Model.forward now passes token_ids=input_ids directly to the decoder; AST audit prevents regressions.
  • No state-dict adapter — V4-Flash safetensors cannot be loaded. Deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need HF weights. Plan-2 P22+ (when activated by an SFT / evaluation campaign) lands the adapter and adds the HF numerical-alignment gate (G8 / G9). Design notes preserved in 02-target-architecture.md §7 + 03-phase-details.md (P22+ section).
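
To make the HC × PP packing above concrete, here is a minimal roundtrip sketch in the spirit of _lower_streams_out / _lift_streams_in (shapes follow the item above; the exact permutation order and function bodies are assumptions, not the shipped helpers):

```python
import torch

def lower_streams_out(x: torch.Tensor) -> torch.Tensor:
    """Pack [B, S, K, D] multi-stream activations into the [S*K, B, D]
    layout pipeline P2P expects, without applying HyperHead."""
    B, S, K, D = x.shape
    return x.permute(1, 2, 0, 3).reshape(S * K, B, D)

def lift_streams_in(x: torch.Tensor, hc_mult: int) -> torch.Tensor:
    """Inverse of lower_streams_out: [S*K, B, D] -> [B, S, K, D]."""
    SK, B, D = x.shape
    S = SK // hc_mult
    return x.reshape(S, hc_mult, B, D).permute(2, 0, 1, 3)

# Bit-exact roundtrip (test_v4_block_pp.py performs a CPU-side check of this kind
# per the commit text above).
x = torch.randn(2, 5, 4, 8)
assert torch.equal(lift_streams_in(lower_streams_out(x), hc_mult=4), x)
```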

Initial design / planning materials for integrating DeepSeek-V4 training
support into Primus. Documentation only; no production code changes.

- techblog/: architecture deep dive (CSA / HCA / mHC / Hash routing /
  sqrtsoftplus / clamped SwiGLU / dual RoPE / Muon / MTP) plus 4 PNG
  diagrams rendered via Pillow (see render_diagrams.py).
- plan/: 8-phase roadmap, full code-landing list, per-phase task
  breakdown, and testing strategy.
- progress/status.md: 64-task checklist tracking phase progress.
- develop_deepseek-v4-in-primus.md: top-level goal and development
  cadence.

Made-with: Cursor
Phase 1 of the V4 development plan. Pure config; no Python code paths
exercised yet. Subsequent phases (P2..P4) wire dispatch and modules.

* primus/configs/models/megatron/deepseek_v4_base.yaml
  Extends llama_base, sets model_type=deepseek_v4 and registers V4-specific
  defaults (hc_mult, hybrid_attention_*, q_lora_rank, attn_sink, hash routing,
  swiglu_limit, dual-RoPE knobs, etc.).
* primus/configs/models/megatron/deepseek_v4_flash.yaml
  Hyperparams from DeepSeek-V4-Flash/config.json.
* primus/configs/models/megatron/deepseek_v4_pro.yaml
  Hyperparams from DeepSeek-V4-Pro/config.json.
* examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml
  Training scaffold; parallelism / perf knobs are conservative and will be
  retuned during the perf phase.
* primus/backends/megatron/training/tokenizer/tokenizer.py
  Add DeepSeekV4Tokenizer to CUSTOM_TOKENIZER_TYPES so _add_tokenizer_args
  accepts it.

Note: V4 fields do not need to be registered in Megatron's argparse —
Primus's merge_namespace mechanism (train_runtime.py:_initialize_trainer)
copies yaml-only fields onto backend_args after MegatronArgBuilder.update.
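
For illustration, the kind of namespace merge described in this note can be pictured as below (a toy sketch under assumed names; the real train_runtime.py logic differs):

```python
from argparse import Namespace

def merge_yaml_only_fields(backend_args: Namespace, yaml_cfg: dict) -> Namespace:
    # Copy fields that exist only in the yaml config (e.g. hc_mult,
    # compress_ratios, swiglu_limit) onto the backend argument namespace,
    # so they ride along without being registered in Megatron's argparse.
    for key, value in yaml_cfg.items():
        if not hasattr(backend_args, key):
            setattr(backend_args, key, value)
    return backend_args

args = merge_yaml_only_fields(Namespace(num_layers=8), {"hc_mult": 4, "num_layers": 8})
assert args.hc_mult == 4 and args.num_layers == 8
```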

Made-with: Cursor
Phase 2 of the V4 development plan. Wires the end-to-end dispatch from
yaml.model_type=deepseek_v4 to a primus-owned model_provider + builder,
without changing model behaviour yet. The model class is still a thin
GPTModel subclass; Phase 3 swaps the decoder for the V4 transformer block.

* primus/core/utils/import_utils.py
  Add a deepseek_v4 branch to get_model_provider() that imports
  primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders
  and returns partial(model_provider, deepseek_v4_builder); see the dispatch
  sketch after this file list.

* primus/backends/megatron/megatron_pretrain_trainer.py
  Add a model_type == "deepseek_v4" branch alongside gpt / mamba.
  V4 is a causal-LM with the same data shape as GPT, so we reuse
  pretrain_gpt's forward_step + train_valid_test_datasets_provider;
  only the model_provider itself is V4-specific.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py (new)
  Re-export DeepseekV4Model + deepseek_v4_builder + model_provider.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py (new)
  DeepseekV4Model: thin subclass of GPTModel. P3 will replace
  self.decoder with DeepseekV4TransformerBlock.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py (new)
  deepseek_v4_builder + model_provider. Uses GPT layer specs in P2;
  P3 will swap them for V4 specs.
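
The import_utils.py branch above amounts to a dynamic import plus functools.partial; a simplified sketch (module path and exported names taken from the bullets above, the surrounding function is illustrative):

```python
import importlib
from functools import partial

def get_model_provider(model_type: str):
    # Only the deepseek_v4 branch is sketched here; gpt / mamba branches elided.
    if model_type == "deepseek_v4":
        mod = importlib.import_module(
            "primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
        )
        # Bind the V4 builder so the trainer can later call the provider
        # with the usual (pre_process, post_process, ...) arguments.
        return partial(mod.model_provider, mod.deepseek_v4_builder)
    raise ValueError(f"unsupported model_type: {model_type}")
```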

Made-with: Cursor
Phase 3 of the V4 development plan. Lands the V4 layer-spec helpers and a
transparent V4 transformer-block subclass; attention / MLP behaviour still
matches GPT. Phase 4 will plug HC + hybrid attention into the block, and
Phase 5 will swap in V4 MoE / clamped SwiGLU through the spec-resolution
hooks added here.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py (new)
  Four V4 layer-spec helpers (layer / decoder_block / decoder_layer_specs /
  mtp_block) that delegate to the GPT helpers in P3, plus two resolution
  hooks (_resolve_attention_module_spec / _resolve_mlp_module_spec) that
  return None for now -- P4 / P5 fill these in.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py (new)
  DeepseekV4TransformerBlock: subclasses TransformerBlock and stashes V4
  config fields (hc_mult, compress_ratios, attn_sliding_window, attn_sink,
  q_lora_rank, index_*) onto self so P4 patches don't have to re-walk the
  config. Forward behaviour unchanged in P3.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Override __init__: after super().__init__() builds the stock decoder,
  swap self.decoder for DeepseekV4TransformerBlock (same call signature
  so GPTModel.forward keeps working).

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py
  _resolve_layer_spec / _resolve_mtp_block_spec now route through the
  V4 layer-spec helpers instead of the GPT helpers directly.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4TransformerBlock alongside the existing surface.

Made-with: Cursor
…dual-RoPE)

Phase 4 of the V4 development plan. Lands the full V4 transformer block:
mHC multi-stream residual, per-layer hybrid attention dispatch (Dense /
HCA / CSA), sliding-window mask, attention sink, dual-RoPE with YaRN. The
V4 block becomes a standalone nn.Module that bypasses Megatron's
TransformerBlock + ModuleSpec mechanism so the multi-stream HC loop is
expressed cleanly. P5 will swap the placeholder SwiGLU MLP for V4's MoE.

New modules under primus/backends/megatron/core/transformer/ ::

* hyper_connection.py
  HyperMixer (per-layer mHC mixer), HyperHead (final K->1 collapse),
  sinkhorn_normalize (doubly-stochastic projection). Linear weights /
  scales / biases held in fp32 for stability; fp32 sinkhorn iterates.
  Unit-tested: row/col errors ~1e-6, hc_mult=1 degenerate path exact.
  (sinkhorn_normalize and softmax_with_sink are sketched after this module list.)

* compressor.py
  V4 compressor for KV downsampling. ratio=4 overlap mode (CSA, coff=2),
  ratio=128 non-overlap mode (HCA, coff=1). Internal RMSNorm + learnable
  APE; RoPE applied externally.

* indexer.py
  Sparse top-K position selector for CSA. Internal mini-Compressor builds
  the score grid; causal mask + top-K (-1 fill for invalid positions);
  backward propagates to the indexer params.

* sliding_window_kv.py
  Causal SWA mask + per-query KV index helpers.

* attn_sink.py
  Per-head learnable sink scalar; softmax_with_sink ensures probs.sum() <=
  1 with the sink absorbing the residual mass. Backward propagates to the
  sink params.

* dual_rope.py
  Two RoPE bases (main + compress) with optional YaRN scaling. Partial
  interleaved RoPE: only ``rotary_dim`` of each head's channels rotated;
  remaining channels passed through unchanged.

* deepseek_v4_attention.py
  Shared base for V4 attention: QKV projection (optional Q LoRA),
  partial dual-RoPE, SWA mask, attention sink, output projection.
  ``_extra_kv`` hook lets HCA / CSA augment KV (full pool or sparse top-K).

* hca_attention.py
  Heavily-Compressed Attention. Subclasses DeepseekV4Attention; adds a
  non-overlap Compressor and concatenates the full compressed pool to
  the local KV (always visible).

* csa_attention.py
  Compressed-Sparse Attention. Subclasses DeepseekV4Attention; adds an
  overlap Compressor + Indexer; per-query attention is computed over the
  local SWA + the indexer's top-K compressed positions.
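
Two of the less standard helpers above, sketched in eager PyTorch (shapes, iteration counts, and defaults are assumptions for illustration, not the shipped modules):

```python
import torch

def sinkhorn_normalize(m: torch.Tensor, n_iters: int = 20, eps: float = 1e-6) -> torch.Tensor:
    """Project a non-negative square matrix toward doubly-stochastic form by
    alternating row / column normalization (fp32 iterates for stability)."""
    m = m.float().clamp_min(eps)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # columns sum to 1
    return m

def softmax_with_sink(scores: torch.Tensor, sink: torch.Tensor) -> torch.Tensor:
    """Softmax over keys with a per-head learnable sink logit appended, so the
    returned key probabilities sum to <= 1 and the sink absorbs the residual
    mass. scores: [B, H, Sq, Sk]; sink: [H]."""
    B, H, Sq, Sk = scores.shape
    sink_col = sink.view(1, H, 1, 1).expand(B, H, Sq, 1)
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    return probs[..., :Sk]  # drop the sink column; each row now sums to <= 1

w = sinkhorn_normalize(torch.rand(4, 4))  # rows and columns each ~sum to 1
```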

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  Rewritten as a standalone nn.Module. Holds the dual-RoPE for the whole
  stack, builds DeepseekV4HybridLayer per layer (Dense/HCA/CSA picked
  from compress_ratios), and runs the K-stream HC loop. Forward shape:
  [S, B, D] -> [B, S, D] -> [B, S, K, D] -> ... -> [B, S, D] -> [S, B, D].
  Smoke-tested: 8-layer mixed dense/CSA/HCA + hc_mult=4 forward / backward
  / causality OK.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py
  Cleaned up to a placeholder spec. The V4 block is standalone and
  bypasses Megatron's spec mechanism; we still hand a valid GPT-shaped
  spec to GPTModel.__init__ until P6 refactors that allocation away.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Docstring rewritten for the P4 standalone-block layout; pg_collection
  switched to getattr(self, "pg_collection", None) for safety.

* deepseek-v4/develop/progress/status.md, plan/02-phase-details.md
  Track P1..P4 completion; add the argparse-not-needed note (Primus's
  merge_namespace covers V4 fields).

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 28, 2026 11:01

Copilot AI left a comment

Pull request overview

Adds a new model_type=deepseek_v4 to Primus’ Megatron backend, including V4 configs, model/provider dispatch, and an initial DeepSeek-V4 block implementation with HC + hybrid attention building blocks.

Changes:

  • Add DeepSeek-V4 model dispatch + builders and a Primus-owned V4 model package.
  • Introduce V4 config yamls (base/flash/pro) and a MI355X pretrain scaffold yaml.
  • Implement core V4 transformer components (HC, dual-RoPE, compressor, indexer, CSA/HCA attention, sliding-window helpers, attention sink).

Reviewed changes

Copilot reviewed 33 out of 37 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| primus/core/utils/import_utils.py | Adds deepseek_v4 branch to resolve the V4 model provider/builder. |
| primus/backends/megatron/megatron_pretrain_trainer.py | Dispatches model_type=deepseek_v4 while reusing GPT data/forward_step plumbing. |
| primus/backends/megatron/training/tokenizer/tokenizer.py | Allows selecting DeepSeekV4Tokenizer via HF tokenizer wrapper. |
| primus/configs/models/megatron/deepseek_v4_{base,flash,pro}.yaml | Adds V4 model configs and defaults. |
| examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml | Adds a training scaffold yaml for MI355X. |
| primus/backends/megatron/core/models/deepseek_v4/* | Adds V4 model/builders/spec placeholders and a standalone V4 block implementation. |
| primus/backends/megatron/core/transformer/* | Implements HC, dual-RoPE, compressor/indexer, CSA/HCA attention, SWA helpers, and attention sink. |
| deepseek-v4/develop/** | Adds development docs/diagrams and planning materials for the V4 integration. |

Comment on lines +34 to +38
# Per-layer compression schedule (from config.json:compress_ratios)
# 0 = uncompressed dense layer (full attention with SWA)
# 4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)
compress_ratios: "[0, 0, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"

Copilot AI Apr 28, 2026

compress_ratios is currently a quoted string, so YAML will parse it as str rather than a list of ints. DeepseekV4TransformerBlock.__init__ does list(compress_ratios) and checks len(...) == num_layers, so this will either explode into a list of characters or fail the length check at runtime. Define this as a real YAML list (no quotes) or normalize the string to List[int] before the block consumes it; also ensure the list length matches num_layers (43).
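
For illustration, one way to normalize such a quoted schedule before the block consumes it (a sketch only; not necessarily the normalization that later landed in the P18 __post_init__):

```python
import ast
from typing import Sequence, Union

def normalize_compress_ratios(value: Union[str, Sequence[int]], num_layers: int) -> tuple:
    # Accept either a real YAML list or the quoted "[0, 0, 4, 128, ...]" string.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    ratios = tuple(int(v) for v in value)
    if len(ratios) != num_layers:
        raise ValueError(f"compress_ratios has {len(ratios)} entries, expected {num_layers}")
    return ratios
```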

Comment on lines +34 to +37
# Per-layer compression schedule (from config.json:compress_ratios)
# 0 = uncompressed dense layer (full attention with SWA)
# 4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)

Copilot AI Apr 28, 2026

The per-layer schedule comments invert CSA vs HCA: per the V4 design and the rest of this PR, compress_ratio == 4 is CSA and compress_ratio == 128 is HCA. Please fix the comment mapping so it matches the implementation.

Comment on lines +3 to +5
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################

Copilot AI Apr 28, 2026

Typo in the referenced source path (DeeSeek-v4-Pro). If this path is meant to mirror the repo directory (DeepSeek-V4-Pro), please correct it to avoid confusion when cross-referencing configs.

Comment on lines +237 to +261
# Sliding-window mask.
window = self.attn_sliding_window
local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype) # [S, S]

# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
# so they can be cat'd along the Sk axis.
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)

# Concatenate sliding-window KV with extra KV (if any).
if extra_k is not None:
k_full = torch.cat([k_local_h, extra_k], dim=1) # [B, Sk_total, H, head_dim]
v_full = torch.cat([v_local_h, extra_v], dim=1)
full_mask = torch.cat([local_mask, extra_mask], dim=-1) # [Sq, Sk_total]
else:
k_full = k_local_h
v_full = v_local_h
full_mask = local_mask

# Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
q_bh = q.transpose(1, 2)
k_bh = k_full.transpose(1, 2)
v_bh = v_full.transpose(1, 2)

out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)

Copilot AI Apr 28, 2026

sliding_window_causal_mask creates a [S, S] mask, but the attention still computes q @ k^T over all S keys (k_local_h is length S). For realistic training lengths (e.g. 4096), this becomes quadratic memory/compute and is very likely to OOM, even though the model is conceptually sliding-window. Consider actually restricting K/V to the window (e.g. gather with sliding_window_kv_indices, unfold, or use a kernel/backend that supports causal sliding-window attention) so Sk_local is window rather than S.

Suggested change
# Sliding-window mask.
window = self.attn_sliding_window
local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype) # [S, S]
# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
# so they can be cat'd along the Sk axis.
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
# Concatenate sliding-window KV with extra KV (if any).
if extra_k is not None:
k_full = torch.cat([k_local_h, extra_k], dim=1) # [B, Sk_total, H, head_dim]
v_full = torch.cat([v_local_h, extra_v], dim=1)
full_mask = torch.cat([local_mask, extra_mask], dim=-1) # [Sq, Sk_total]
else:
k_full = k_local_h
v_full = v_local_h
full_mask = local_mask
# Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
q_bh = q.transpose(1, 2)
k_bh = k_full.transpose(1, 2)
v_bh = v_full.transpose(1, 2)
out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)
# Materialize only the causal sliding-window K/V for each query position
# so local attention scales with `window` rather than the full sequence `S`.
window = self.attn_sliding_window
window = min(window, S)
# Build per-query local indices: for query i attend to [i - window + 1, ..., i].
query_positions = torch.arange(S, device=device)
window_offsets = torch.arange(window, device=device)
local_indices = query_positions.unsqueeze(1) - (window - 1) + window_offsets.unsqueeze(0) # [S, window]
local_valid = local_indices >= 0
local_indices = local_indices.clamp_(min=0, max=S - 1)
# Gather local K/V windows: [B, S, H, D] -> [B, S, window, H, D].
gather_index = local_indices.view(1, S, window, 1, 1).expand(
B, S, window, self.num_heads, self.head_dim
)
k_local = torch.gather(
k_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
1,
gather_index,
)
v_local = torch.gather(
v_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
1,
gather_index,
)
# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim].
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
# Move heads dim before sequence for local attention:
# q: [B, S, H, D] -> [B, H, S, D]
# local k/v: [B, S, window, H, D] -> [B, H, S, window, D]
q_bh = q.transpose(1, 2)
k_local_bh = k_local.permute(0, 3, 1, 2, 4)
v_local_bh = v_local.permute(0, 3, 1, 2, 4)
scale = self.head_dim ** -0.5
local_scores = (q_bh.unsqueeze(-2) * k_local_bh).sum(dim=-1) * scale # [B, H, S, window]
local_scores = local_scores.masked_fill(
~local_valid.view(1, 1, S, window), torch.finfo(local_scores.dtype).min
)
if extra_k is not None:
extra_k_bh = extra_k.transpose(1, 2) # [B, H, S_extra, D]
extra_v_bh = extra_v.transpose(1, 2) # [B, H, S_extra, D]
extra_scores = torch.einsum("bhsd,bhkd->bhsk", q_bh, extra_k_bh) * scale
if extra_mask is not None:
if extra_mask.dtype == torch.bool:
extra_scores = extra_scores.masked_fill(
~extra_mask.view(1, 1, S, -1), torch.finfo(extra_scores.dtype).min
)
else:
extra_scores = extra_scores + extra_mask.view(1, 1, S, -1).to(extra_scores.dtype)
attn_scores = torch.cat([local_scores, extra_scores], dim=-1)
attn_probs = torch.softmax(attn_scores.float(), dim=-1).to(q_bh.dtype)
local_probs = attn_probs[..., :window]
extra_probs = attn_probs[..., window:]
out_local = (local_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
out_extra = torch.einsum("bhsk,bhkd->bhsd", extra_probs, extra_v_bh)
out_bh = out_local + out_extra
else:
attn_probs = torch.softmax(local_scores.float(), dim=-1).to(q_bh.dtype)
out_bh = (attn_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)

Comment on lines +92 to +94
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.


Copilot AI Apr 28, 2026

_gather_topk_kv is annotated as returning torch.Tensor, but it actually returns (gathered, valid). This will confuse type-checkers and readers; update the return annotation (and docstring if needed) to reflect the tuple return type.

Suggested change
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query``[B, S, K, head_dim]``.
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Gather ``[B, P, head_dim]`` along ``P`` per query.
Returns:
A tuple ``(gathered, valid)`` where:
- ``gathered`` has shape ``[B, S, K, head_dim]``.
- ``valid`` has shape ``[B, S, K]`` and marks non-masked indices.

gathered, valid = self._gather_topk_kv(pool_kv, topk_idxs) # [B, S, K, head_dim]

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]

Copilot AI Apr 28, 2026

This statement has no effect (gathered.shape[2] is computed and discarded). It looks like a leftover debug line; please remove it to keep the CSA path clean.

Suggested change
gathered.shape[2]

Comment on lines +10 to +30
num_layers: 61
hidden_size: 7168
num_attention_heads: 128
num_query_groups: 1
kv_channels: 512
qk_pos_emb_head_dim: 64
ffn_hidden_size: 18432
moe_ffn_hidden_size: 3072
moe_shared_expert_intermediate_size: 3072

q_lora_rank: 1536
o_lora_rank: 1024
o_groups: 16

num_experts: 384
moe_router_topk: 6
moe_router_topk_scaling_factor: 2.5

index_topk: 1024

compress_ratios: "[128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"

Copilot AI Apr 28, 2026

Same issue as Flash: compress_ratios is a quoted string, which will not deserialize to Sequence[int] and will break DeepseekV4TransformerBlock's len(compress_ratios) == num_layers check. Please make this a real YAML list (or add a normalization step) and verify the schedule length matches num_layers (61).

Comment on lines +4 to +7
# Reference:
# - deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/config.json
# - deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
# - deepseek-v4/develop/techblog/01-deepseek-v4-architecture-deep-dive.md

Copilot AI Apr 28, 2026

Typo in the reference path (DeeSeek-v4-Pro). Please correct the spelling/casing so the comment points at the actual directory name and is searchable.

# local v we have [B, H, Sk_local, head_dim] (independent of S),
# while sparse v depends on S. Build a "value tensor" with the
# same shape on both paths by broadcasting local v:
v.shape[2]

Copilot AI Apr 28, 2026

This statement has no effect (v.shape[2] is computed and discarded). Please remove it; it reads like a debug remnant and makes the attention path harder to audit.

Suggested change
v.shape[2]

Comment on lines +16 to +24
FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
FONT_BOLD = FONT_REG # we only have Regular; use it for both

OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)


def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)

Copilot AI Apr 28, 2026

FONT_REG is hard-coded to an absolute path under a specific user's home directory, which will fail for other developers/CI. Consider using a repo-relative font path, allowing an environment variable override, and/or falling back to a default font when the file isn't present.

Suggested change
FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
FONT_BOLD = FONT_REG # we only have Regular; use it for both
OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)
def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)
BASE_DIR = os.path.dirname(__file__)
FONT_CANDIDATES = (
os.environ.get("DIAGRAM_FONT"),
os.environ.get("FONT_REG"),
os.path.join(BASE_DIR, "NotoSansSC-Regular.otf"),
os.path.join(BASE_DIR, "fonts", "NotoSansSC-Regular.otf"),
)
def _resolve_font_path() -> str | None:
for path in FONT_CANDIDATES:
if path and os.path.isfile(path):
return path
return None
FONT_REG = _resolve_font_path()
FONT_BOLD = FONT_REG # we only have Regular; use it for both when available
OUT_DIR = os.path.join(BASE_DIR, "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)
def font(sz: int, bold: bool = False) -> ImageFont.ImageFont | ImageFont.FreeTypeFont:
font_path = FONT_BOLD if bold else FONT_REG
if font_path:
return ImageFont.truetype(font_path, sz)
return ImageFont.load_default()

…+ MTP

Phase 5 of the V4 development plan. Lands the FFN side of the V4 stack:
hash-routed and learned top-K MoE, clamped SwiGLU experts, and the V4
MTP head. The V4 block now plugs the V4 MoE in place of P4's placeholder
SwiGLU FFN; the V4 model instantiates a separate-HyperHead MTP block when
mtp_num_layers > 0. Layer-aware YaRN was already done in P4
(DualRoPE.get_rope picks main_rope vs compress_rope by compress_ratio).

New modules:

* primus/backends/megatron/core/transformer/clamped_swiglu.py
  clamped_swiglu(x, alpha=7.0): silu(gate)*up clamped to [-alpha, alpha].
  ClampedSwiGLUMLP wraps it as a fused gate_up + down two-linear MLP.
  Eager (Python) for v1; perf phase will register a fused kernel. (Sketched,
  together with the routers below, after this module list.)

* primus/backends/megatron/core/transformer/moe/v4_hash_router.py
  HashRouter: static [vocab_size, topk] tid2eid table from a fixed seed.
  Active for the first num_hash_layers V4 layers; gives each token a
  permanent expert assignment with uniform weight 1/topk. No learnable
  parameters; deterministic across PP / TP / EP ranks.

* primus/backends/megatron/core/transformer/moe/v4_topk_router.py
  V4TopKRouter: learned gate with score_function in
  {"sqrtsoftplus", "sigmoid", "softmax"}. Top-K with optional renorm
  and optional noaux_tc per-expert bias (selection-only; probs are
  read from the un-biased score).

* primus/backends/megatron/core/transformer/moe/v4_moe.py
  DeepseekV4MoE: per-layer router pick (hash vs learned) + N
  ClampedSwiGLUMLP routed experts + 1 shared expert. Pure-PyTorch
  per-expert dispatch; P6 swaps in Megatron's token-dispatcher /
  grouped-GEMM / EP path.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py
  DeepseekV4MTPBlock: mtp_num_layers V4 layers, each owning its own
  HyperHead (separate from the main decoder's). Shares the dual-RoPE
  with the main decoder. Loss-side wiring is deferred to P6; P5 just
  stands the module up so it can be unit-tested standalone.
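
Eager sketches of three of the FFN-side pieces above, as this commit describes them (the clamp placement follows this P5 text, i.e. post-mul, which the plan-2 review later changed to pre-mul; names, defaults, and the reading of "sqrtsoftplus" are illustrative assumptions, not the shipped modules):

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x: torch.Tensor, alpha: float = 7.0) -> torch.Tensor:
    # As described in this P5 commit: split the fused gate_up projection,
    # then clamp silu(gate) * up to [-alpha, alpha] (post-mul clamp).
    gate, up = x.chunk(2, dim=-1)
    return torch.clamp(F.silu(gate) * up, min=-alpha, max=alpha)

def build_hash_table(vocab_size: int, num_experts: int, topk: int, seed: int) -> torch.Tensor:
    # Static [vocab_size, topk] token-id -> expert-id table from a fixed seed,
    # so routing stays deterministic across PP / TP / EP ranks.
    gen = torch.Generator().manual_seed(seed)
    return torch.stack(
        [torch.randperm(num_experts, generator=gen)[:topk] for _ in range(vocab_size)]
    )

def sqrtsoftplus_topk(logits: torch.Tensor, topk: int):
    # One reading of the "sqrtsoftplus" score function: sqrt(softplus(logit)),
    # then top-K selection with renormalized probabilities.
    scores = F.softplus(logits).sqrt()
    vals, idx = scores.topk(topk, dim=-1)
    probs = vals / vals.sum(dim=-1, keepdim=True)
    return probs, idx
```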

Updated:

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
  DeepseekV4HybridLayer now picks MoE vs dense FFN based on
  num_routed_experts. forward() threads token_ids through to the MoE
  for hash-routed layers. The block-level forward picks token_ids up
  from a model-side stash (_v4_token_ids) so callers don't have to
  thread it explicitly through every layer of the call stack.

* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
  Builds DeepseekV4MTPBlock when mtp_num_layers > 0 (post-process
  rank only). forward() overridden to stash input_ids onto self.decoder
  before delegating to GPTModel.forward, so hash-routed MoE layers can
  consume them. Cross-PP propagation of input_ids is a P6 concern.

* primus/backends/megatron/core/models/deepseek_v4/__init__.py
  Re-export DeepseekV4MTPBlock alongside the existing surface.

Smoke-tested on dev-box PyTorch container (CPU, 7-test suite):
* clamped_swiglu: clamp tight; MLP forward+backward OK.
* HashRouter: per-token top-K distinct, deterministic across re-runs and
  re-instantiations w/ same seed, probs sum to 1.
* V4TopKRouter: top-K honored, renorm OK, backward OK for all three
  score functions (sqrtsoftplus, sigmoid, softmax).
* DeepseekV4MoE (learned & hash modes): forward + backward; same-token
  determinism for hash routing.
* DeepseekV4TransformerBlock with MoE FFN (4 layers, hc_mult=2, mixed
  dense + CSA): forward + backward; deterministic in eval mode.
* DeepseekV4MTPBlock (mtp_num_layers=2, hc_mult=2): forward + backward;
  per-MTP HyperHead state_dict separation verified.

Deferred to P6 (already noted in progress doc):
* Real Megatron-MoE / token-dispatcher / EP integration -- replaces the
  pure-PyTorch dispatch loop in DeepseekV4MoE.forward.
* MTP loss path wiring -- DeepseekV4Model.forward currently builds the
  MTP block but does not yet feed its outputs through lm_head + the
  auxiliary loss term.
* Numerical alignment vs reference inference/model.py (token-0 logits
  within 1e-2) -- needs reference checkpoint loading.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 28, 2026 11:30
@wenxie-amd wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from b8e47a3 to 5e4008d on April 28, 2026 11:30

Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 42 changed files in this pull request and generated no new comments.

Wire DeepSeek-V4 through Megatron P6 integration (PP local-layer build, EP expert sharding, and compatibility fixes) and add the P7 single-node launcher plus progress docs after passing PP=2/EP=4 smoke run.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 29, 2026 12:25
Add the plan-1 roadmap/detail/test documentation plus progress tracker entries, and update the development target doc with TransformerEngine and Primus-Turbo reference pointers.

Made-with: Cursor
@wenxie-amd wenxie-amd force-pushed the dev/wenx/deepseek-v4 branch from ecf8169 to 1030293 on April 29, 2026 12:28

Copilot AI left a comment

Pull request overview

Copilot reviewed 44 out of 48 changed files in this pull request and generated 8 comments.

Comment on lines +85 to +93
gen = torch.Generator(device="cpu").manual_seed(int(seed))
# For each token id, pick ``topk`` distinct expert ids deterministically.
# randperm(num_experts) is a stable, dense permutation; slicing the
# first ``topk`` rows gives uniform-without-replacement routing.
rows = []
for _ in range(vocab_size):
perm = torch.randperm(num_experts, generator=gen)[:topk]
rows.append(perm)
tid2eid = torch.stack(rows, dim=0).long() # [vocab_size, topk]

Copilot AI Apr 29, 2026

HashRouter.__init__ builds tid2eid by looping over every vocab_size entry and calling torch.randperm(num_experts) each time. For real V4 sizes (e.g., vocab≈129k, experts≈384), this will add significant startup time and CPU memory churn at model construction. Consider replacing this with a deterministic hash-based mapping (no table), or generating the table in larger vectorized blocks (and/or only for the subset of vocab used), so model init remains scalable.
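
A sketch of the vectorized alternative this comment suggests (illustrative only, not code from this PR): draw the noise for all vocab rows at once and argsort per row, which keeps determinism under a fixed seed and removes the Python loop, but note it yields a different (though equally valid) table than the per-row randperm loop.

```python
import torch

def build_tid2eid_vectorized(vocab_size: int, num_experts: int, topk: int, seed: int) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)
    # One [vocab_size, num_experts] noise draw; per-row argsort gives a
    # uniform-without-replacement expert permutation per token id.
    noise = torch.rand(vocab_size, num_experts, generator=gen)
    return noise.argsort(dim=-1)[:, :topk]
```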

Comment on lines +88 to +107
def _gather_topk_kv(
self,
pool: torch.Tensor, # [B, P, head_dim]
topk_idxs: torch.Tensor, # [B, S, K] (-1 for masked)
) -> torch.Tensor:
"""Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.

Out-of-range / masked indices (``-1``) are clamped to ``0`` for the
gather, then *zero-masked* afterwards.
"""
B, S, K = topk_idxs.shape
P, Hd = pool.shape[1], pool.shape[2]
valid = topk_idxs >= 0 # [B, S, K]
safe_idx = topk_idxs.clamp(min=0)
# Expand idx to gather along P for each (B, S, K, Hd).
idx_expand = safe_idx.unsqueeze(-1).expand(B, S, K, Hd)
pool_expand = pool.unsqueeze(1).expand(B, S, P, Hd) # [B, S, P, Hd]
gathered = torch.gather(pool_expand, dim=2, index=idx_expand) # [B, S, K, Hd]
gathered = gathered * valid.unsqueeze(-1).to(gathered.dtype)
return gathered, valid

Copilot AI Apr 29, 2026

_gather_topk_kv is annotated as returning only a torch.Tensor, but it actually returns (gathered, valid). This mismatch can break type checking and mislead callers; update the return annotation (and docstring if desired) to Tuple[torch.Tensor, torch.Tensor].

in_dtype = x.dtype
x32 = x.float()
rsqrt = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return (x32 * rsqrt).to(in_dtype) * self.weight

Copilot AI Apr 29, 2026

The standalone RMSNorm implementation returns (…to(in_dtype) * self.weight). If self.weight remains fp32 (common in mixed-precision training), this multiplication will upcast the output back to fp32, potentially defeating BF16 activation flow and increasing memory/compute. Consider multiplying by self.weight.to(in_dtype) (or casting the final result back to in_dtype) so the output dtype stays consistent with the input activation dtype.

Suggested change
return (x32 * rsqrt).to(in_dtype) * self.weight
return (x32 * rsqrt).to(in_dtype) * self.weight.to(in_dtype)

Comment on lines +219 to +221
flat = hidden.reshape(-1, D) # [N, D]
flat.shape[0]


Copilot AI Apr 29, 2026

This flat.shape[0] statement is a no-op and appears to be leftover debug code. Please remove it to keep the forward path minimal and lint-clean.

Comment on lines +2 to +5
# DeepSeek-V4 Pro (large MoE variant).
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################

Copilot AI Apr 29, 2026

Typo in the source comment path: DeeSeek-v4-Pro should be DeepSeek-v4-Pro (consistent with the model naming elsewhere).

Comment on lines +47 to +52
def forward(self, x: torch.Tensor) -> torch.Tensor:
in_dtype = x.dtype
x32 = x.float()
rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return (x32 * rms).to(in_dtype) * self.weight


Copilot AI Apr 29, 2026

Same RMSNorm dtype issue here: (…to(in_dtype) * self.weight) can upcast the output back to fp32 if self.weight is fp32, which is likely under mixed precision. To keep the compressor output in the activation dtype, multiply by self.weight.to(in_dtype) or cast the final output back to in_dtype.

Comment on lines +223 to +224
v.shape[2]
v_local_per_q = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1) # [B, H, S, Sk_local, head_dim]

Copilot AI Apr 29, 2026

This v.shape[2] line is a no-op (likely leftover from debugging) and should be removed to avoid confusing readers and linters.

Comment on lines +46 to +56
class HashRouter(nn.Module):
"""Static hash-based MoE router.

Args:
num_experts: total number of routed experts.
topk: number of experts each token is routed to.
vocab_size: tokenizer vocabulary size; controls the table length.
seed: deterministic seed for the hash; same across all ranks.
dtype: dtype of the returned ``probs`` tensor; defaults to
``torch.float32``.


Copilot AI Apr 29, 2026

This PR introduces substantial new DeepSeek-V4 core modules (attention variants, compressor/indexer, routers, MoE, HC) but does not add unit tests covering their key invariants (e.g., HashRouter determinism, CSA/HCA causality masks, compressor/indexer shape/validity). The repo already has a Python unit test suite under tests/unit_tests/ (including Megatron transformer tests), so please add focused unit tests for these new modules to prevent regressions.

Remove GPT placeholder/super-init spec coupling so DeepSeek-V4 builds decoder directly from DeepSeek ModuleSpec submodule trees, and update Phase 8 progress records to match the finalized implementation and validation status.

Made-with: Cursor
Unify DeepSeek-V4 runtime module selection under DeepSeekV4SpecProvider and migrate attention/MLP/MoE construction to provider-driven ModuleSpec flows with safe local fallbacks. Document and validate the TE CUDA runtime contract, including an explicit fail-fast guard for non-CUDA TE/Turbo inputs and updated Phase 9 progress records in English.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 30, 2026 03:08

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 50 changed files in this pull request and generated 4 comments.

Comment on lines +164 to +167

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.

Copilot AI Apr 30, 2026

There are a couple of no-op statements (e.g., gathered.shape[2]) that have no effect and appear to be leftover debugging. Please remove them to keep the CSA path easier to read/maintain.

Comment on lines +176 to +179
batch, seq = input_ids.shape
position_ids = (
input_ids.new_arange(seq, dtype=input_ids.dtype).unsqueeze(0).expand(batch, -1)
)

Copilot AI Apr 30, 2026

input_ids.new_arange(...) is not a valid PyTorch Tensor API (and there is no local helper/monkeypatch in the repo), so this will raise AttributeError when position_ids is omitted. Use torch.arange(seq, device=input_ids.device, dtype=...) (or the existing Megatron helper used elsewhere) to build position ids.

Comment thread run_deepseek_v4.sh
Comment on lines +37 to +40
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null


Copilot AI Apr 30, 2026

FP8/FP8_RECIPE default to the literal string null, but the script still passes them via --fp8/--fp8_recipe. That makes args.fp8 truthy and can trigger FP8 validation paths (and failures) even when FP8 is intended to be disabled. Only include these CLI flags when PRECISION_TYPE=FP8, or ensure the disabled state is represented in a way the arg parser treats as false/None.

Comment on lines +321 to +324
B, S, D = hidden.shape
flat = hidden.reshape(-1, D) # [N, D]
flat.shape[0]

Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few no-op statements left in forward (e.g., flat.shape[0]) that don't affect execution and look like leftover debugging. Please remove them to avoid confusion and keep the forward path clean.

Copilot uses AI. Check for mistakes.
…chema

Align phase10 DeepSeek-V4 modules on explicit spec/provider contracts by enforcing SharedExpertMLP-only shared experts and introducing a dedicated DeepSeekV4TransformerConfig for V4-only runtime fields. Update builder/spec/docs so training resolves the new config type and tracks activation clamp through model config.

Made-with: Cursor
Fix HC/attention dtype mismatches and tune the DeepSeek-V4 smoke script defaults so the Phase 10 MI355X run completes reliably end-to-end. Add a dedicated Phase 10 convergence report documenting delivered scope, runtime blockers, and remaining tracked items.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 30, 2026 12:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 48 out of 52 changed files in this pull request and generated 5 comments.

Comment thread run_deepseek_v4.sh
Comment on lines +39 to +90
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null

if [ "$PRECISION_TYPE" = "FP8" ]; then
export FP8=${FP8:-hybrid}
export FP8_RECIPE=${FP8_RECIPE:-delayed}
fi

export EXP=${EXP:-examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml}
export BACKEND_PATH=${BACKEND_PATH:-"$(pwd)/third_party/Megatron-LM"}
export PRIMUS_TEAM=${PRIMUS_TEAM:-amd}
export PRIMUS_USER=${PRIMUS_USER:-tas-mi355x-$(date +%Y%m%d)}
export PRIMUS_EXP_NAME=${PRIMUS_EXP_NAME:-deepseek_v4_smoke_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_PP${PRIMUS_PP}_EP${PRIMUS_EP}}

if [ ! -d "$BACKEND_PATH" ] || [ -z "$(ls -A "$BACKEND_PATH" 2>/dev/null)" ]; then
echo "[ERROR] BACKEND_PATH does not exist or is empty: $BACKEND_PATH"
echo "Run: git submodule update --init --recursive"
exit 1
fi

mkdir -p "output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME"

./primus-cli direct \
-- train pretrain --config "$EXP" \
--backend_path "$BACKEND_PATH" \
--num_layers "$PRIMUS_TOTAL_LAYERS" \
--train_iters "$TRAIN_ITERS" \
--lr_warmup_iters 0 \
--lr_decay_iters "$TRAIN_ITERS" \
--micro_batch_size "$MBS" \
--global_batch_size "$GBS" \
--seq_length "$PRIMUS_SEQ_LENGTH" \
--max_position_embeddings "$PRIMUS_MAX_POSITION_EMBEDDINGS" \
--rope_type rope \
--tensor_model_parallel_size "$PRIMUS_TP" \
--pipeline_model_parallel_size "$PRIMUS_PP" \
--expert_model_parallel_size "$PRIMUS_EP" \
--num_experts "$PRIMUS_NUM_EXPERTS" \
--moe_router_topk "$PRIMUS_MOE_TOPK" \
--moe_router_enable_expert_bias "$PRIMUS_MOE_ENABLE_EXPERT_BIAS" \
--moe_ffn_hidden_size "$PRIMUS_MOE_FFN_HIDDEN_SIZE" \
--index_topk "$PRIMUS_INDEX_TOPK" \
--v4_grouped_experts_support_clamped_swiglu "$PRIMUS_V4_GROUPED_EXPERTS_SUPPORT_CLAMPED_SWIGLU" \
--compress_ratios "$PRIMUS_COMPRESS_RATIOS" \
--mtp_num_layers 0 \
--mock_data True \
--use_turbo_attention "$USE_TURBO_ATTENTION" \
--use_turbo_grouped_mlp "$TURBO_USE_GROUPED_MLP" \
--moe_use_legacy_grouped_gemm "$LEGACY_GG" \
--fp8 "$FP8" \
--fp8_recipe "$FP8_RECIPE" \

Copilot AI Apr 30, 2026

FP8/FP8_RECIPE are always passed to primus-cli (defaulting to the literal string null). Other run scripts in this repo gate --fp8 ... args behind an explicit FP8 enable flag; passing null may be rejected by argument parsing or select an unintended FP8 mode. Consider only adding --fp8/--fp8_recipe when PRECISION_TYPE=FP8 (or when a dedicated FP8=True flag is set), and omit them entirely otherwise.

Comment on lines +48 to +52
# Primus-owned: DeepSeek-V4 (Phase 2 stub; full V4 wiring lands in Phase 3+)
if model_type == "deepseek_v4":
deepseek_v4_module = importlib.import_module(
"primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
)

Copilot AI Apr 30, 2026

The comment "Phase 2 stub; full V4 wiring lands in Phase 3+" is now misleading since this PR imports the full DeepSeek-V4 builders/specs. Updating/removing it will avoid confusion when debugging model-type dispatch.

Comment on lines +172 to +185
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.
# This is per-query, shape [S, K]; we keep it on the module as a
# full [B, S, K] additive mask.
sparse_mask = torch.where(valid, 0.0, float("-inf")).to(dtype) # [B, S, K]
self._csa_state = {
"gathered": gathered, # [B, S, K, head_dim]
"sparse_mask": sparse_mask, # [B, S, K]
}

# Tell the parent: no cat-extension; we handle CSA inside
# ``_compute_attention_output``.
return None, None, None

Copilot AI Apr 30, 2026

CSAAttention stores per-forward tensors in self._csa_state and then reads them in _compute_attention_output. This is not safe under pipeline parallel schedules (multiple microbatches in flight) or activation checkpoint recomputation, because the module attribute can be overwritten before earlier microbatches/backward recomputes run, leading to wrong outputs/gradients. Refactor CSA to avoid mutable module-level forward state (e.g., compute the joint local+sparse attention fully inside forward, or thread the gathered KV/mask through the call stack without storing on self).

Comment on lines +172 to +174
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.

Copilot AI Apr 30, 2026

There are two no-op expression statements (gathered.shape[2] and later v.shape[2]) that have no effect and look like leftover debug code. They should be removed to avoid confusion (and to keep linters/type checkers from flagging them).

Comment on lines +187 to +198
decoder = getattr(self, "decoder", None)
if decoder is not None:
decoder._v4_token_ids = input_ids
try:
hidden_states = self.decoder(
hidden_states=decoder_input,
attention_mask=attention_mask,
**kwargs,
)
finally:
if decoder is not None:
decoder._v4_token_ids = None

Copilot AI Apr 30, 2026

DeepseekV4Model.forward stashes input_ids onto decoder._v4_token_ids and clears it immediately after the forward. This breaks any activation checkpoint/recompute that re-invokes decoder/layer forwards during backward (token_ids will be missing) and is also unsafe with pipeline schedules that can have multiple microbatches using the same module instance. Prefer passing token_ids=input_ids explicitly into self.decoder(...) (the decoder already accepts a token_ids kwarg) instead of relying on mutable module state.

Copilot AI review requested due to automatic review settings May 8, 2026 15:00

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 8, 2026 10:11
…) before dispatch + smoke

User directive: P27's first deliverable is now a release-tier
correctness gate that runs the existing G23 / G24 / G26 / G27
equivalence tests on production V4 dimensions (`head_dim=512`, real
`H`, real `swa_window`, real `K_topk`), so any kernel-numerics
regression at the head-dim that plan-4 exists to solve is caught
BEFORE the dispatch + smoke layers are added on top.

Plan-doc updates:
* `01-roadmap.md` — P27 deliverables now lead with the release-tier
  shape gate; Milestone M4 split into M4a (release-tier shape
  correctness) and M4b (smoke).
* `02-phase-details.md` — Phase 27 section rewritten to lead with
  task 1 "Release-tier shape gate (G28)" followed by dispatch
  precedence, run-script docs, dispatch unit test (G29), smoke run +
  smoke gate (G30), and the hand-off note. Design notes explain why
  G28 fits in P27 (not P25 / P26), why eager fp32 reference fits at
  calibrated `S ∈ {512, 1024}`, and why we don't target full
  `S=4096` in unit tests (smoke covers full `S`).
* `03-test-strategy.md` — gate matrix gains G28 (release-tier kernel
  correctness at production V4 dims); previous G28 / G29 renumbered
  to G29 / G30. GPU-toy harness paragraph documents the fast-tier
  vs release-tier split.
* `status.md` — Phase 27 task table re-ordered: G28 row first, then
  dispatch precedence, env-var plumbing, dispatch unit test (G29),
  smoke run, smoke gate (G30), hand-off note.

The actual kernel implementations (P25 / P26) and dispatch plumbing
(P22 / P25 / P26) are unchanged — this is a plan-doc + status-table
reorganisation only.

Co-authored-by: Cursor <cursoragent@cursor.com>
 wiring + EP8 smoke (closes plan-4)

Plan-4 P27 lands the three layered closing gates for the in-tree
Primus Triton V4 attention kernels (P25 dense / HCA, P26 CSA), and
appends the plan-4 hand-off summary.

* G28 release-tier shape gate (lands first per user directive — kernel
  numerics are locked at production V4 dims BEFORE the dispatch +
  smoke layers are stacked on top). Extends `_BASE_SHAPES` in the
  four P25 / P26 fwd/bwd test files with V4-Flash (`H=64,
  head_dim=512, swa_window=128, S=1024, K_topk=512`) and V4-Pro
  (`H=128, head_dim=512, swa_window=128, S=512, K_topk=512`)
  pytest.param entries marked `pytest.mark.slow`. New
  `tests/unit_tests/megatron/transformer/deepseek_v4/conftest.py`
  ships an autouse `torch.cuda.empty_cache()` fixture so the eager
  CSA reference's `[B, H, Sq, K, D]` einsum intermediate doesn't
  accumulate in PyTorch's caching allocator across consecutive
  release-tier tests. Root `tests/unit_tests/conftest.py` registers
  `pytest.mark.slow` and adds a `--run-slow` opt-in (also accepts
  `-m slow`). Release-tier bf16 tolerances bumped to absorb
  `head_dim=512` matmul noise + `tl.atomic_add` jitter on the
  backward (FWD bf16 atol=5e-2; BWD bf16 dq/dk/dv/dgathered atol=2e-1;
  dsink atol=5e-2). 80 / 80 release-tier tests pass on
  mi355-gpu-14 inside dev_primus_wenx_693 in 60.2 s
  (`pytest --run-slow -m slow`); fast-tier suite remains green.

* G29 dispatch precedence + startup log line. New
  `_log_kernel_choice` helper on `DeepseekV4Attention` emits one
  `[V4-attn] Layer N: cr=R, kernel = ...` info line per layer at
  rank 0 so smoke / training logs unambiguously show which kernel
  each layer is firing through. The class docstring grows a
  precedence table covering all three layer kinds plus the
  auto-disable rules for the two flags. New
  `test_v4_p27_dispatch_precedence.py` (16 tests) covers the
  dispatch path at runtime: 7 parametrised log-line tests across
  every (cr, flag) → expected-kernel combo + format / once-per-call
  / layer-number assertions; runtime-mock tests on
  `_attention_forward_via_v4_triton` and `_csa_forward` verifying
  the right kernel symbol is invoked with the right kwargs; two
  auto-disable runtime tests for the cross-layer-kind contracts.
  `run_deepseek_v4.sh` gains a soft `[WARN]` echo when either
  Triton flag is on and `PRIMUS_TP > 1` (kernels are MQA-centric and
  operate per-rank on the local H/TP head slice — TP > 1 should
  work but stays uncovered by plan-4 gates).

* G30 TP=1 PP=1 EP=8 10-iter smoke with both kernels engaged + Turbo
  DeepEP. New `progress/p27/run_smoke_v4_kernels_ep8_pp1.sh` script
  + matching `.gitignore` (excludes `*.log` / `log_*.txt` /
  `debug.log` / `*.tgz` / `*.json` per the plan-3 directive — smoke
  logs MUST NOT land in git). Smoke is green: 10 / 10 iters clean,
  lm_loss converges 11.85 → 11.65, grad norm steady, 0 nan
  iterations, all 8 layers emit the expected kernel-choice log
  line. Steady-state ~17.3 TFLOP/s/GPU (peak ~19.8) at ~500 ms /
  iter — at parity with the P23 Turbo-DeepEP-on-eager-attention
  baseline at the smoke's small seq length (the eager attention is
  matmul-cheap at S=128 and DeepEP dominates iter time; the Triton
  kernels' real win is on full V4-Flash production dims, planned
  as a plan-4 follow-up).

* P27 hand-off block appended to `plan-4/02-phase-details.md`
  recording: commit chain P24 → P27, fast-tier + release-tier
  test totals, G30 smoke perf delta vs. eager / DeepEP baselines,
  and the follow-up list (Megatron-side `layer_number` plumbing,
  full-S=4096 smoke, HCA LSE-merge for Turbo, CSA in-kernel
  gather, FP8, default-True flip).

Plan-4 ends. The two switches (`use_v4_triton_attention`,
`use_v4_triton_csa_attention`) ship at default `False` so this PR
is a pure safety-net add; the Triton path is opt-in via the
existing run-script env vars.

Co-authored-by: Cursor <cursoragent@cursor.com>
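
For orientation, a simplified sketch of the `--run-slow` opt-in, `slow`-marker registration, and autouse cache-clearing fixture described in the G28 bullet above; option handling and fixture names are illustrative, not the exact conftest contents:

```python
# Root conftest sketch: register the slow marker and a --run-slow opt-in that
# otherwise skips the release-tier (production-dim) tests.
import pytest
import torch


def pytest_addoption(parser):
    parser.addoption("--run-slow", action="store_true", default=False,
                     help="run release-tier tests marked pytest.mark.slow")


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: release-tier (production-dim) test")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        return
    skip_slow = pytest.mark.skip(reason="needs --run-slow")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)


# deepseek_v4 conftest sketch: free cached HBM between release-tier tests so
# large eager-reference intermediates don't pile up in the caching allocator.
@pytest.fixture(autouse=True)
def _empty_cuda_cache():
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```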
Copilot AI review requested due to automatic review settings May 9, 2026 00:40

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 8, 2026 19:44
…P27 SHA (e19663f)

Replaces the TBD-p27 / TBD-p27a placeholders in
`deepseek-v4/develop/progress/status.md` and the P27 hand-off
block in `deepseek-v4/develop/plan-4/02-phase-details.md` with the
actual commit SHA `e19663f7`.

Mirrors the P25 / P26 SHA-pin convention (`1ba38ba5` / `36dfca66`).
Plan-4 ends here.

Co-authored-by: Cursor <cursoragent@cursor.com>
…+ flip run_deepseek_v4.sh defaults to PP1EP8 + V4 Triton kernels on

Plan-5 picks up where plan-4 closed (in-tree V4 Triton kernels for
dense / HCA / CSA shipped behind use_v4_triton_attention /
use_v4_triton_csa_attention; +37 % vs P22 eager at smoke seq) and is
strictly scoped to taking V4-Flash EP=8 single-node training from its
plan-4 P27 G30 steady-state up the throughput curve at production-shape
sequence length, by attacking the bottlenecks visible in a real
torch.profiler trace.

Five phases (each P29..P32 task list is seeded now and refined in writing
against the P28 trace when its phase opens; any target under 10 % of step
time gets de-scoped on the spot):

  - P28 (kick-off) — run_deepseek_v4_flash_proxy.sh (V4-Flash widths,
    8 layers, all four perf knobs on: USE_V4_TRITON_ATTENTION,
    USE_V4_TRITON_CSA_ATTENTION, USE_TURBO_DEEPEP, TURBO_USE_GROUPED_MLP)
    calibrated for one MI355X node at EP=8; chrome-trace JSON for one
    steady iter; baseline analysis report (md + html) under
    develop/profile/profile-baseline-ep8-<date>.{md,html}. The
    report's ranked bottleneck list pins the X / Y / Z / W per-phase
    improvement budgets for everything that follows. Gates: G31 (smoke)
    + G31a (report).

  - P29 (small-op fusion) — seeded targets: (a) Q-projection chain,
    (b) KV-projection chain, (c) O-projection group, (d) Compressor +
    Indexer, (e) MoE router. Each behind its own use_v4_fused_* switch
    (default False); functional fusion (not module-level) so Megatron's
    spec walker stays untouched (sketched after this commit message).
    Gates: G32.{a..e} + G33
    (smoke + perf, >= +X % TFLOP/s/GPU vs P28 baseline).

  - P30 (V4 Triton attention perf) — per-shape autotune for FWD + BWD
    (BLOCK_M / BLOCK_N / num_warps / num_stages keyed on H, head_dim,
    swa_window; SMEM heuristic prunes > 160 KiB at compile time);
    persistent FWD kernel; HCA LSE-merge variant (was a plan-4
    follow-up; runs SWA + compressed-pool branches as two flash
    kernels and merges via online softmax — avoids the materialised
    additive bias). use_v4_attention_lse_merge switch (default False);
    G34 asserts FWD + BWD equivalence within bf16 budget.

  - P31 (V4 Triton CSA perf) — in-kernel topk_idxs gather (drops the
    ~64 GiB / microbatch wrapper-side materialisation at V4-Flash
    production dims; this is also the structural fix that eventually
    lets the proxy reach Sq=4096); K-tile prefetching.
    use_v4_csa_in_kernel_gather switch (default False); reuses
    plan-4 G26 / G27 release-tier with dgathered -> dpool assertion.

  - P32 (overlap + recompute) — re-enable --overlap_grad_reduce True
    --overlap_param_gather True (currently False; plan-4 G30 obsoletes
    the plan-2 stability hedge that turned them off); MoE
    shared-expert overlap investigation; recompute granularity
    tuning if P31's in-kernel gather frees enough HBM. Final EP=8
    trace at develop/profile/profile-final-*. Gate: G35 (smoke +
    cumulative perf, >= W % vs P28 baseline) + plan-4 ratchet
    (G23..G30) all green.

Ratchet — every plan-5 phase MUST keep plan-4 gates G23 / G24 / G25 /
G26 / G27 / G28 / G29 / G30 green. Banned-warning ratchet adds
"v4_fused_* compile error" and "DeepEP contract violation". Plan-5 is
measurement-driven: no per-phase budget number is committed in the
plan docs; P28's report owns picking and writing them.

Out of scope (plan-5): FP8 / FP4 / mxfp4 quantised forward,
convergence run, long-context (1M-token) bring-up, multi-node EP
scaling, HF state-dict adapter, V3 / V2 backports of plan-5 fusions.

run_deepseek_v4.sh — defaults flipped per the user directive that
preceded plan-5 planning so the V4-Flash production smoke runs
end-to-end without any env-var override:

  - PRIMUS_PP defaults 2 -> 1
  - PRIMUS_EP defaults 4 -> 8
  - USE_V4_TRITON_ATTENTION defaults False -> True
  - USE_V4_TRITON_CSA_ATTENTION defaults False -> True

All four are still env-var overridable; the existing TP > 1 soft
warning (plan-4 P27) still fires when the V4 Triton kernels are on at
TP > 1. plan-4 G30 evidence (10/10 iters clean, lm_loss 11.85 ->
11.65, throughput 17.3 TFLOP/s/GPU steady at PP=1 EP=8 with both V4
Triton kernels + Turbo DeepEP on) gates this default flip.

Documents:
  - deepseek-v4/develop/plan-5/README.md (overview, scope, phase map)
  - deepseek-v4/develop/plan-5/01-roadmap.md (phase overview, dep
    graph, milestones, top risks, out-of-scope)
  - deepseek-v4/develop/plan-5/02-phase-details.md (per-phase tasks,
    design notes, edge cases; hand-off note placeholder)
  - deepseek-v4/develop/plan-5/03-test-strategy.md (gate matrix
    G31..G35, plan-4 ratchet contract, banned-warning ratchet,
    perf-budget contract)
  - deepseek-v4/develop/progress/status.md (Phase 28..32 task tables
    with TBD-p2X commit cells)
Co-authored-by: Cursor <cursoragent@cursor.com>
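
The "functional fusion behind a default-off switch" pattern referenced in the P29 item above could look roughly like this; the function and flag names are hypothetical, not actual P29 code:

```python
import torch
import torch.nn.functional as F


def qkv_proj(x, wq, wk, wv, use_v4_fused_qkv=False):
    """Hypothetical Q/K/V projection chain; fusion stays functional, so the
    module tree (and Megatron's spec walker) is untouched."""
    if use_v4_fused_qkv:
        # One matmul against the concatenated weight, then split --
        # numerically identical to the three separate projections below.
        w = torch.cat([wq, wk, wv], dim=0)
        qkv = F.linear(x, w)
        return qkv.split([wq.shape[0], wk.shape[0], wv.shape[0]], dim=-1)
    return F.linear(x, wq), F.linear(x, wk), F.linear(x, wv)
```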
Copilot AI review requested due to automatic review settings May 9, 2026 01:39

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 4 commits May 8, 2026 21:40
…+ bottleneck report

Phase 28 ships the foundation that every other plan-5 phase reports
its delta against:

  * `run_deepseek_v4_flash_proxy.sh` — thin wrapper over `run_deepseek_v4.sh`
    that pins V4-Flash production widths (`H=64, head_dim=512,
    num_experts=256, moe_router_topk=6, moe_ffn_hidden_size=2048,
    index_topk=512`), 8 layers, `compress_ratios=[0,0,4,128,4,128,4,0]`
    (every layer kind exercised: 3 dense, 3 CSA, 2 HCA), `TP=1 PP=1 EP=8`,
    and all four perf knobs on (`USE_V4_TRITON_ATTENTION`,
    `USE_V4_TRITON_CSA_ATTENTION`, `USE_TURBO_DEEPEP`,
    `TURBO_USE_GROUPED_MLP`).

  * `progress/p28/run_baseline_trace_ep8.sh` — self-contained trace-capture
    script (mirrors plan-3 P23 / plan-4 P25 pattern; `run_deepseek_v4.sh`
    hard-codes `--disable_tensorboard True` which blocks the profiler's
    TB writer). Captures iter 6 -> 7 (one steady iter) at `Sq=4096` with
    `PROFILE=True --use_pytorch_profiler True`. `progress/p28/.gitignore`
    excludes the raw `*.log` / `*.json` / `*.tgz` outputs.

  * `develop/profile/_tools/render_baseline_report.py` — chrome-trace
    consumer that emits the markdown + HTML bottleneck-analysis report.
    Multi-stream-overlap-aware GPU-active math (interval-union sweep —
    single-stream `Sigma dur` over-counts on multi-stream HIP); top-1
    reduce-kernel signature isolated; module-level CPU op-time numbers
    carry an explicit "nests" caveat so readers do not misread bloated
    `Sigma event dur` totals. Tool is reused by P32 for `profile-final-*`.

  * `develop/profile/profile-baseline-ep8-20260508.{md,html}` — the P28
    report. Headline findings:

      - GPU active = 99.7 % (CPU-bound floor 0.3 %, multi-stream overlap
        factor 1.87x). The pre-trace hypothesis that small-kernel-launch
        tail is the bottleneck DOES NOT HOLD at V4-Flash production
        widths.
      - Top kernel by far is one specific `aten::sum` fp32 reduce
        (`reduce_kernel<512, 1, ReduceOp<float, sum_functor<float, float,
        float>>>`) at 7.61 s (87.3 % of step) over 717 launches x 10.62 ms
        each.
      - V4 Triton attention kernels are BWD-heavy: dense / HCA = 3.90 s
        (44.7 %), CSA = 4.19 s (48.1 %).
      - Comm time = 12.85 ms (0.1 %); HBM peak = 195 / 287 GiB ~ 68 %.
      - Per-phase de-scope decisions (data-driven, 10 % rule): P29 KEEP
        but RESCOPE (drop small-op fusion mandate, redirect to root-
        causing the 7.6 s `aten::sum`); P30 KEEP (BWD prioritised); P31
        KEEP but RESCOPE (HBM motivation gone, kept for BWD speed-up);
        P32 DE-SCOPED.
      - Combined target: plan-5 final >= 110 TFLOP/s/GPU steady at
        Sq=4096 EP=8 single-node (40 %+ over the 78 TFLOP/s/GPU baseline).

  * Calibration outcome: `Sq=4096` (production target) confirmed fitting
    on a single MI355X node at EP=8 (peak rocm HBM 195 GiB / 287 GiB ~
    68 %, 5/5 calibration iters clean, 10/10 baseline iters clean,
    lm_loss 11.16 -> 9.26, 0 NaN, banned-warning grep on plan-3 / plan-4
    ratchet patterns returns 0 for every term). No fall-back to
    Sq=2048 / 1024 / 512 needed; `Sq=4096` adopted as the proxy default.

  * `progress/status.md` Phase 28 row: all 6 P28 task cells checked, with
    `TBD-p28` SHA placeholders that will be SHA-pinned in a follow-up
    commit (mirrors the plan-4 P27 -> 03bacc2 pattern).

Closes plan-5 P28. P29 / P30 / P31 task lists open against this
baseline; P32 is de-scoped pending evidence.

Co-authored-by: Cursor <cursoragent@cursor.com>
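
The "interval-union sweep" mentioned for the GPU-active math boils down to unioning kernel intervals across streams before summing; a small illustrative helper (event fields follow the chrome-trace format, the function name is not from the repo):

```python
def gpu_active_us(kernel_events):
    """Union of [ts, ts + dur) kernel intervals across all streams.

    Summing per-event durations ('Sigma dur') over-counts whenever two
    streams run concurrently; the union counts overlapped time once.
    chrome-trace 'ts' / 'dur' fields are in microseconds.
    """
    intervals = sorted((e["ts"], e["ts"] + e["dur"]) for e in kernel_events)
    active, cur_start, cur_end = 0.0, None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:  # disjoint: close the current run
            if cur_end is not None:
                active += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                   # overlapping: extend the run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        active += cur_end - cur_start
    return active
```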
Plan-5 P28 closed in afd7ea5; replace the `TBD-p28` placeholders in
the Phase 28 task table with the actual commit SHA. No content change
beyond the SHA pin.

Mirrors the plan-4 P27 -> 03bacc2 pattern.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rn + project-wide rules doc

Plan-5 P29 (RESCOPED) — kill the dominant aten::sum fp32 reduce kernel that
the P28 baseline trace pinned at 7.61 s / 87.3 % of step time.

Forensic root cause (progress/p29/refinement.md + _forensics{,2,3}.py):
624 / 717 of all dominant `reduce_kernel<512, 1, ...>` launches (96 %
by count, 99.95 % by Σ kernel duration) come from
`hyper_connection.py:47 sinkhorn_normalize` — 39 reductions / call ×
8 layers × 2 (FWD + AOT-autograd BWD) = 624. Inputs are
`(1, 4096, 4, 4) → keepdim=True dim=-1` fp32. HIP's default
`reduce_kernel<512, 1, ...>` is sized for huge reductions; for our
4-elements-per-output shape it runs at ~250× over the memory-bound
floor (~12.5 % occupancy + 624 × 5 µs launch overhead).

Fix: a `torch.compile(fullgraph=True, dynamic=True)` build of
`sinkhorn_normalize`, cached on `(n_iters, eps, in_dtype)` (shape NOT
in key — `dynamic=True` ships ONE shape-generic Inductor kernel; that
also avoids Dynamo's `cache_size_limit=8` collision when closures
from the same factory share a `code` object). Algorithm is byte-
identical; only the kernel boundary moves. AOT autograd handles BWD.

Behind a default-off feature flag `use_v4_compiled_sinkhorn` plumbed
through `DeepSeekV4TransformerConfig` → V4 base + V4-Flash YAML →
`DeepseekV4HybridLayer` → `HyperMixer.__init__` →
`HyperMixer.compute_weights` → `sinkhorn_normalize(use_compiled=...)`.
`run_deepseek_v4.sh` exports `USE_V4_COMPILED_SINKHORN` (default
`False`); `run_deepseek_v4_flash_proxy.sh` flips its default to `True`
so plan-5 P30 / P31 measure against the post-P29 baseline.

Gates (all green):
* G32 — FWD + BWD parity (compiled vs eager); 10 / 10 tests pass; fast
  tier (B=2, S=64, K=4) atol=1e-5; release tier (B=1, S=4096, K=4)
  marked pytest.mark.slow; cache-hit assertion on second call;
  HyperMixer flag-propagation test included.
* G33a — 10-iter EP=8 proxy smoke; no NaN / Inf / banned warnings;
  `lm_loss[10] = 9.258` vs P28 baseline `9.258` (bit-for-bit); steady
  79.1 vs 77.5 TFLOP/s/GPU (+2.0 %).
* G33b — post-P29 chrome-trace + bottleneck report at
  `develop/profile/profile-after-p29-ep8-20260509.{md,html}`. Budget
  X1 (≥ 50 % drop in aten::sum kernel time) MET BY ~1000×: critical
  shape kernel time 7607.9 ms → 0.2 ms (−99.997 %), launches
  624 → 16. Multi-stream overlap factor collapsed 1.87× → 1.00× —
  explains why wall-time gain is only +2 % despite the kernel-time
  delta (the reduce was a parallel hitchhiker on stream-1; the V4
  Triton attention BWD on stream-0 was already wall-time gating).
  New top wall-time bottleneck: V4 Triton CSA BWD (4.03 s, 46.8 %)
  + V4 Triton dense BWD (3.18 s, 36.8 %) = 92.6 % of step. P30 / P31
  mandate confirmed unchanged.

De-scope decisions recorded at P29 close:
* Hand-Triton fall-back kernel — NOT NEEDED (X1 over-shot ~1000×).
* Global default flip — DEFERRED to P32 hand-off (G35) because the
  +2 % wall-time gain does not justify the cold-compile footgun for
  short-iter unit-test harnesses; proxy default is enough for plan-5
  P30 / P31 perf work.

Also lands `develop/rules/rule.md` — project-wide working rules doc
codifying the standing decisions accumulated across plan-2..plan-5
(review-before-commit, status-pin commit pattern, per-phase summary
file convention introduced at this phase, banned-warning ratchet,
dispatch precedence, DeepEP best-practice config, dtype contract,
TFLOPs counting rule, 10 % de-scope rule, etc.). README + status.md
+ plan-5/01-roadmap.md now point at it as the single source of
truth.

Co-authored-by: Cursor <cursoragent@cursor.com>
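
A minimal sketch of the flag-gated, cached `torch.compile` wrapper described above; the Sinkhorn body is a simplified stand-in, not the real `hyper_connection.py` implementation:

```python
import torch

_COMPILED = {}  # keyed on (n_iters, eps, dtype); shape deliberately NOT in the key


def _sinkhorn_normalize_eager(x, n_iters, eps):
    # Simplified stand-in: alternating keepdim row/column normalisation in fp32.
    x = x.float()
    for _ in range(n_iters):
        x = x / (x.sum(dim=-1, keepdim=True) + eps)
        x = x / (x.sum(dim=-2, keepdim=True) + eps)
    return x


def sinkhorn_normalize(x, n_iters=3, eps=1e-6, use_compiled=False):
    if not use_compiled:
        return _sinkhorn_normalize_eager(x, n_iters, eps)
    key = (n_iters, eps, x.dtype)
    fn = _COMPILED.get(key)
    if fn is None:
        # dynamic=True ships one shape-generic Inductor kernel, so the tiny
        # keepdim reductions stop hitting the oversized HIP reduce kernel and
        # Dynamo's cache is not churned per shape; AOT autograd covers BWD.
        fn = torch.compile(_sinkhorn_normalize_eager, fullgraph=True, dynamic=True)
        _COMPILED[key] = fn
    return fn(x, n_iters, eps)
```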
Pin the Plan-5 P29 tracker rows, post-P29 profile provenance, and
`progress/p29/p29-summary.md` commit chain to the feature commit
`1ea7e7a8`.

This is the standard docs-only status-pin commit that follows every
DeepSeek-V4 phase feature commit.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 07:40

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

# Load the captured P28 baseline chrome-trace JSON (the steady-iter trace the
# bottleneck report is rendered from).
import glob
import json
import os

trace_dir = "output/amd/tas-mi355x-20260509/p28_profile_baseline_pp1_ep8_seq4096/tensorboard"
fp = glob.glob(os.path.join(trace_dir, "*.pt.trace.json"))[0]
print(f"loading {fp} ...")
with open(fp) as f:
    data = json.load(f)
wenxie-amd and others added 3 commits May 9, 2026 03:58
…tiles

Optimize the in-tree V4 Triton attention path by routing dense and HCA layers through kernel-native SWA pruning, including an HCA split-mask mode that preserves the joint softmax while avoiding dead local-key tiles.

Co-authored-by: Cursor <cursoragent@cursor.com>
…nd sparse pool kernels

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 11:15

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 9, 2026 06:26
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 12:32

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

wenxie-amd and others added 2 commits May 10, 2026 22:44
…D kernels

P32 closes the residual single-kernel attention bottlenecks pinned by the
post-P31b microbenchmark using `progress/p31/bench_csa_attention_ep8.py`
and a new `progress/p32/bench_v4_attention_ep8.py` (dense `cr=0` + HCA
`cr=128` modes; mirrors the CSA bench argparse + timing).

CSA FWD: 48.17 ms -> 3.16 ms (-93.4 %, 15.2x; target <=6 ms MET).
  Replace the monolithic `_v4_csa_attention_pool_fwd_kernel` with three
  kernels joined by an online-softmax LSE merge so the local SWA and
  sparse top-K branches no longer serialise through a single program:
  reuse the P30-pruned dense FWD for local, add
  `_v4_csa_attention_pool_sparse_fwd_kernel` for head-block sparse, and
  add `_v4_csa_attention_lse_merge_kernel` to combine the two.
  `PRIMUS_V4_CSA_FWD_FORCE_MONOLITHIC=1` keeps the legacy kernel.

V4 attention BWD: dense 17.27 ms -> 7.65 ms (-55.7 %, 2.26x; target
  <=15 ms MET); HCA 20.87 ms -> 11.91 ms (-42.9 %, 1.75x; target
  <=15 ms MET). Split `_v4_attention_bwd_kernel` into
  `_v4_attention_bwd_dq_kernel` (parallel over `m`) and
  `_v4_attention_bwd_dkv_kernel` (parallel over `n`) so dQ, dK, dV are
  written atomic-free. MHA fast path drops the kvgroup head loop when
  `HEAD_K == HEAD_Q`. `PRIMUS_V4_ATTN_BWD_FORCE_MONOLITHIC=1` keeps the
  legacy kernel.

CSA BWD: 35.43 ms -> 16.31 ms (-54.0 %, 2.17x; target <=15 ms missed by
  ~1.3 ms). Local SWA reuses the new split dq + dkv kernels with CSA's
  joint `lse / D`. Sparse pool branch defaults to a two-pass segmented
  reduction (`PRIMUS_V4_CSA_BWD_SEGREDUCE=1`): a new
  `_v4_csa_attention_pool_sparse_bwd_partial_kernel` writes per-visit
  dpool contributions to a compact `[B, M, K_topk, D]` partial buffer
  with `tl.store` (no atomics), then a new
  `_v4_csa_attention_pool_segreduce_kernel` folds them into `dpool_fp32`
  segment-by-segment via a sorted inverse index, also atomic-free. Sweep
  + ship `BLOCK_K_PARTIAL=16`, `partial warps=8`, `partial stages=2`,
  `segreduce BLOCK_D=512`, `BLOCK_I=64`, `warps=8`, `stages=3`. Fallback
  retains the legacy gather + dpool atomics path (sparse `BLOCK_K=32`,
  `num_warps=4` defaults after a sweep).

Tests: `pytest -x -q tests/.../deepseek_v4/{test_v4_p25_v4_attention_bwd
  ,test_v4_p26_v4_csa_attention_bwd,test_v4_p31_v4_csa_in_kernel_gather}.py`
  -> 51 passed, 48 skipped. The pre-existing, unrelated `test_v4_mtp`
  failure also reproduces on the `git stash` baseline.

Docs: `progress/status.md` P32 row checked through the kernel work
(EP8 trace + report + `proxy_ep8.md` left for follow-up);
`develop/perf/attention_perf.md` P32 row added with effective TFLOP/s
re-derived from the microbench wall times; full eight-section summary
in `progress/p32/p32-summary.md` per rule R2.1, including the negative
probes (dense-mask scatter, bf16 partial buffer, fused dpool matmuls,
multi-stream overlap).

Bench shape: `B=1, H=64, S=4096, D=512, P=1024, K_topk=512,
swa_window=128, bf16, sink=on` on `mi355-gpu-8` / `dev_primus_wenx_693`,
median of 60 iters after 20 warmup.

Co-authored-by: Cursor <cursoragent@cursor.com>
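
The online-softmax LSE merge that joins the local-SWA and sparse top-K branches can be written at the tensor level as follows; this is a reference formulation of the merge math, not the Triton kernel:

```python
import torch


def lse_merge(out_local, lse_local, out_sparse, lse_sparse):
    """Combine two attention branches computed with independent softmaxes.

    Each branch supplies its softmax-weighted value partial `out_*` of shape
    [..., D] and its per-query log-sum-exp `lse_*` of shape [...]; the merge
    reproduces the joint softmax over the union of both key sets.
    """
    lse = torch.logaddexp(lse_local, lse_sparse)
    w_local = torch.exp(lse_local - lse).unsqueeze(-1)
    w_sparse = torch.exp(lse_sparse - lse).unsqueeze(-1)
    return w_local * out_local + w_sparse * out_sparse, lse
```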
…sive BWD optimizations behind opt-in env vars

After landing the P32 split CSA FWD + atomic-free V4/CSA BWD kernels in
the prior commit, the EP8 proxy trace surfaced an HBM-contention story
that does not show up in standalone microbenchmarks:

- CSA FWD split (local SWA + sparse pool + LSE merge) wins both: bench
  48.17 -> 3.22 ms (-93.3%, 15.0x) AND proxy iter 10 963 -> 891 ms
  (-7.5%) / 711 -> 768 TFLOP/s/GPU (+8.1%). KEEP DEFAULT ON.

- V4 attention BWD split (dQ kernel + dK/dV kernel, atomic-free) wins
  the bench (dense 17.27 -> 7.65 ms, HCA 20.87 -> 11.91 ms; both clear
  <=15 ms target) but regresses EP8 proxy iter time by ~190 ms because
  the split design reads Q / K / V twice (2x HBM traffic per BWD step)
  and loses the bandwidth fight against concurrent MoE work. FLIP
  DEFAULT TO MONOLITHIC; opt in via PRIMUS_V4_ATTN_BWD_USE_SPLIT=1.

- CSA BWD segmented reduction (4 GiB partial buffer + sorted inverse
  index, atomic-free dpool) wins the bench (35.43 -> 16.31 ms, -54%)
  but regresses EP8 proxy iter time by ~40 ms for the same HBM-
  contention reason. FLIP DEFAULT TO gather + atomic_add dpool (the
  P31b path, now ~7.9% faster at 32.62 ms vs P31b 35.43 ms thanks to
  incidental Triton autotuner improvements). Opt in via
  PRIMUS_V4_CSA_BWD_SEGREDUCE=1.

EP8 proxy trace 1778476971738245137 (mi355-gpu-8, dev_primus_wenx_693):

  iter 10            963.0 -> 890.5 ms     (-7.5%)
  TFLOP/s/GPU        709.3 -> 768.4         (+8.1%)
  profiler steady    980.9 -> 899.99 ms    (-8.2%)
  GPU active         940.04 -> 859.54 ms   (-8.6%)
  CSA FWD trace      123.07 -> ~50.6 ms    (-59%)
  V4 attn BWD trace  259.74 -> 256.97 ms   (-1.1%, monolithic kept)
  CSA sparse BWD     80.81 -> 72.54 ms     (-10.2%)
  Attention family   ~493 -> ~410 ms       (-16.8%)

Tests: 114 passed, 88 skipped across test_v4_p25/p26/p27/p31 attention
suites on the shipped defaults.

Docs:
- profile/profile-after-p32-ep8-20260511.{md,html}: full P32 trace report.
- progress/p32/p32-summary.md: rewritten with shipped + opt-in numbers,
  proxy attribution, and HBM-contention rationale for the opt-in gates.
- perf/proxy_ep8.md: P32 row added (890.5 ms / 768.4 TFLOP/s/GPU, 9.92x
  vs P28 baseline 8837 ms).
- perf/attention_perf.md: P32 (shipped) and P32 (bench-opt opt-in) rows.
- progress/status.md: P32 rows checked through trace + report; opt-in
  rationale recorded for the BWD rows.
- progress/p32/_render_html.py: helper that renders the markdown profile
  report to HTML using the same style as the P28..P31b reports.
- progress/p32/run_baseline_trace_ep8_p32.sh: trace script (iter 6->7
  profiler window, same harness as P31b).

Co-authored-by: Cursor <cursoragent@cursor.com>
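
The opt-in wiring amounts to plain env-var gates around the kernel choice; a tiny hypothetical sketch (the helper name is illustrative, the env-var names are the ones this commit introduces):

```python
import os


def _env_flag(name, default="0"):
    # Hypothetical helper: the split / segreduce BWD paths stay opt-in because
    # they win standalone benches but lose to HBM contention in the EP8 proxy.
    return os.environ.get(name, default).lower() in ("1", "true", "yes")


USE_SPLIT_ATTN_BWD = _env_flag("PRIMUS_V4_ATTN_BWD_USE_SPLIT")    # default: monolithic
USE_CSA_BWD_SEGREDUCE = _env_flag("PRIMUS_V4_CSA_BWD_SEGREDUCE")  # default: gather + atomic_add
```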
Copilot AI review requested due to automatic review settings May 11, 2026 05:39

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.
