feat: deepseek-v4 model support #698
Conversation
Initial design / planning materials for integrating DeepSeek-V4 training support into Primus. Documentation only; no production code changes.
- techblog/: architecture deep dive (CSA / HCA / mHC / Hash routing / sqrtsoftplus / clamped SwiGLU / dual RoPE / Muon / MTP) plus 4 PNG diagrams rendered via Pillow (see render_diagrams.py).
- plan/: 8-phase roadmap, full code-landing list, per-phase task breakdown, and testing strategy.
- progress/status.md: 64-task checklist tracking phase progress.
- develop_deepseek-v4-in-primus.md: top-level goal and development cadence.
Made-with: Cursor
Phase 1 of the V4 development plan. Pure config; no Python code paths exercised yet. Subsequent phases (P2..P4) wire dispatch and modules.
* primus/configs/models/megatron/deepseek_v4_base.yaml: Extends llama_base, sets model_type=deepseek_v4 and registers V4-specific defaults (hc_mult, hybrid_attention_*, q_lora_rank, attn_sink, hash routing, swiglu_limit, dual-RoPE knobs, etc.).
* primus/configs/models/megatron/deepseek_v4_flash.yaml: Hyperparams from DeepSeek-V4-Flash/config.json.
* primus/configs/models/megatron/deepseek_v4_pro.yaml: Hyperparams from DeepSeek-V4-Pro/config.json.
* examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml: Training scaffold; parallelism / perf knobs are conservative and will be retuned during the perf phase.
* primus/backends/megatron/training/tokenizer/tokenizer.py: Add DeepSeekV4Tokenizer to CUSTOM_TOKENIZER_TYPES so _add_tokenizer_args accepts it.
Note: V4 fields do not need to be registered in Megatron's argparse. Primus's merge_namespace mechanism (train_runtime.py:_initialize_trainer) copies yaml-only fields onto backend_args after MegatronArgBuilder.update.
Made-with: Cursor
Phase 2 of the V4 development plan. Wires the end-to-end dispatch from yaml.model_type=deepseek_v4 to a primus-owned model_provider + builder, without changing model behaviour yet. The model class is still a thin GPTModel subclass; Phase 3 swaps the decoder for the V4 transformer block.
* primus/core/utils/import_utils.py: Add a deepseek_v4 branch to get_model_provider() that imports primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders and returns partial(model_provider, deepseek_v4_builder).
* primus/backends/megatron/megatron_pretrain_trainer.py: Add a model_type == "deepseek_v4" branch alongside gpt / mamba. V4 is a causal-LM with the same data shape as GPT, so we reuse pretrain_gpt's forward_step + train_valid_test_datasets_provider; only the model_provider itself is V4-specific.
* primus/backends/megatron/core/models/deepseek_v4/__init__.py (new): Re-export DeepseekV4Model + deepseek_v4_builder + model_provider.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py (new): DeepseekV4Model, a thin subclass of GPTModel. P3 will replace self.decoder with DeepseekV4TransformerBlock.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py (new): deepseek_v4_builder + model_provider. Uses GPT layer specs in P2; P3 will swap them for V4 specs.
Made-with: Cursor
Phase 3 of the V4 development plan. Lands the V4 layer-spec helpers and a transparent V4 transformer-block subclass; attention / MLP behaviour still matches GPT. Phase 4 will plug HC + hybrid attention into the block, and Phase 5 will swap in V4 MoE / clamped SwiGLU through the spec-resolution hooks added here.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py (new): Four V4 layer-spec helpers (layer / decoder_block / decoder_layer_specs / mtp_block) that delegate to the GPT helpers in P3, plus two resolution hooks (_resolve_attention_module_spec / _resolve_mlp_module_spec) that return None for now -- P4 / P5 fill these in.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py (new): DeepseekV4TransformerBlock: subclasses TransformerBlock and stashes V4 config fields (hc_mult, compress_ratios, attn_sliding_window, attn_sink, q_lora_rank, index_*) onto self so P4 patches don't have to re-walk the config. Forward behaviour unchanged in P3.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py: Override __init__: after super().__init__() builds the stock decoder, swap self.decoder for DeepseekV4TransformerBlock (same call signature so GPTModel.forward keeps working).
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_builders.py: _resolve_layer_spec / _resolve_mtp_block_spec now route through the V4 layer-spec helpers instead of the GPT helpers directly.
* primus/backends/megatron/core/models/deepseek_v4/__init__.py: Re-export DeepseekV4TransformerBlock alongside the existing surface.
Made-with: Cursor
…dual-RoPE)
Phase 4 of the V4 development plan. Lands the full V4 transformer block: mHC multi-stream residual, per-layer hybrid attention dispatch (Dense / HCA / CSA), sliding-window mask, attention sink, dual-RoPE with YaRN. The V4 block becomes a standalone nn.Module that bypasses Megatron's TransformerBlock + ModuleSpec mechanism so the multi-stream HC loop is expressed cleanly. P5 will swap the placeholder SwiGLU MLP for V4's MoE.
New modules under primus/backends/megatron/core/transformer/:
* hyper_connection.py: HyperMixer (per-layer mHC mixer), HyperHead (final K->1 collapse), sinkhorn_normalize (doubly-stochastic projection). Linear weights / scales / biases held in fp32 for stability; fp32 sinkhorn iterates. Unit-tested: row/col errors ~1e-6, hc_mult=1 degenerate path exact.
* compressor.py: V4 compressor for KV downsampling. ratio=4 overlap mode (CSA, coff=2), ratio=128 non-overlap mode (HCA, coff=1). Internal RMSNorm + learnable APE; RoPE applied externally.
* indexer.py: Sparse top-K position selector for CSA. Internal mini-Compressor builds the score grid; causal mask + top-K (-1 fill for invalid positions); backward propagates to the indexer params.
* sliding_window_kv.py: Causal SWA mask + per-query KV index helpers.
* attn_sink.py: Per-head learnable sink scalar; softmax_with_sink ensures probs.sum() <= 1 with the sink absorbing the residual mass. Backward propagates to the sink params. (A minimal sketch follows this commit message.)
* dual_rope.py: Two RoPE bases (main + compress) with optional YaRN scaling. Partial interleaved RoPE: only ``rotary_dim`` of each head's channels rotated; remaining channels passed through unchanged.
* deepseek_v4_attention.py: Shared base for V4 attention: QKV projection (optional Q LoRA), partial dual-RoPE, SWA mask, attention sink, output projection. ``_extra_kv`` hook lets HCA / CSA augment KV (full pool or sparse top-K).
* hca_attention.py: Heavily-Compressed Attention. Subclasses DeepseekV4Attention; adds a non-overlap Compressor and concatenates the full compressed pool to the local KV (always visible).
* csa_attention.py: Compressed-Sparse Attention. Subclasses DeepseekV4Attention; adds an overlap Compressor + Indexer; per-query attention is computed over the local SWA + the indexer's top-K compressed positions.
Updated:
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py: Rewritten as a standalone nn.Module. Holds the dual-RoPE for the whole stack, builds DeepseekV4HybridLayer per layer (Dense/HCA/CSA picked from compress_ratios), and runs the K-stream HC loop. Forward shape: [S, B, D] -> [B, S, D] -> [B, S, K, D] -> ... -> [B, S, D] -> [S, B, D]. Smoke-tested: 8-layer mixed dense/CSA/HCA + hc_mult=4 forward / backward / causality OK.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py: Cleaned up to a placeholder spec. The V4 block is standalone and bypasses Megatron's spec mechanism; we still hand a valid GPT-shaped spec to GPTModel.__init__ until P6 refactors that allocation away.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py: Docstring rewritten for the P4 standalone-block layout; pg_collection switched to getattr(self, "pg_collection", None) for safety.
* deepseek-v4/develop/progress/status.md, plan/02-phase-details.md: Track P1..P4 completion; add the argparse-not-needed note (Primus's merge_namespace covers V4 fields).
Made-with: Cursor
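As an illustration of the attention-sink mechanism described in the attn_sink.py bullet, here is a minimal sketch. The function name matches the commit text; the exact shapes and how the sink parameter is stored are assumptions, not the PR's code.

```python
import torch


def softmax_with_sink(scores: torch.Tensor, sink: torch.Tensor) -> torch.Tensor:
    """Softmax over keys with a per-head sink logit absorbing residual mass.

    scores: [B, H, Sq, Sk] attention logits; sink: [H] learnable scalars.
    The returned probs satisfy probs.sum(-1) <= 1; the gap is the sink's
    share, and gradients flow to the sink parameter through the softmax.
    """
    B, H, Sq, Sk = scores.shape
    sink_logit = sink.view(1, H, 1, 1).expand(B, H, Sq, 1)
    joint = torch.softmax(torch.cat([scores, sink_logit], dim=-1), dim=-1)
    return joint[..., :Sk]  # drop the sink column; its mass is not re-normalized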
Pull request overview
Adds a new model_type=deepseek_v4 to Primus’ Megatron backend, including V4 configs, model/provider dispatch, and an initial DeepSeek-V4 block implementation with HC + hybrid attention building blocks.
Changes:
- Add DeepSeek-V4 model dispatch + builders and a Primus-owned V4 model package.
- Introduce V4 config yamls (base/flash/pro) and a MI355X pretrain scaffold yaml.
- Implement core V4 transformer components (HC, dual-RoPE, compressor, indexer, CSA/HCA attention, sliding-window helpers, attention sink).
Reviewed changes
Copilot reviewed 33 out of 37 changed files in this pull request and generated 10 comments.
Summary per file:
| File | Description |
|---|---|
| primus/core/utils/import_utils.py | Adds deepseek_v4 branch to resolve the V4 model provider/builder. |
| primus/backends/megatron/megatron_pretrain_trainer.py | Dispatches model_type=deepseek_v4 while reusing GPT data/forward_step plumbing. |
| primus/backends/megatron/training/tokenizer/tokenizer.py | Allows selecting DeepSeekV4Tokenizer via HF tokenizer wrapper. |
| primus/configs/models/megatron/deepseek_v4_{base,flash,pro}.yaml | Adds V4 model configs and defaults. |
| examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml | Adds a training scaffold yaml for MI355X. |
| primus/backends/megatron/core/models/deepseek_v4/* | Adds V4 model/builders/spec placeholders and a standalone V4 block implementation. |
| primus/backends/megatron/core/transformer/* | Implements HC, dual-RoPE, compressor/indexer, CSA/HCA attention, SWA helpers, and attention sink. |
| deepseek-v4/develop/** | Adds development docs/diagrams and planning materials for the V4 integration. |
```yaml
# Per-layer compression schedule (from config.json:compress_ratios)
#   0 = uncompressed dense layer (full attention with SWA)
#   4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)
compress_ratios: "[0, 0, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"
```
compress_ratios is currently a quoted string, so YAML will parse it as str rather than a list of ints. DeepseekV4TransformerBlock.__init__ does list(compress_ratios) and checks len(...) == num_layers, so this will either explode into a list of characters or fail the length check at runtime. Define this as a real YAML list (no quotes) or normalize the string to List[int] before the block consumes it; also ensure the list length matches num_layers (43).
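If the string form is kept for CLI compatibility, a hedged normalization sketch (the helper name is hypothetical, not part of this PR) that could run before the block consumes the field:

```python
import ast
from typing import List, Sequence, Union


def normalize_compress_ratios(value: Union[str, Sequence[int]], num_layers: int) -> List[int]:
    """Coerce a YAML string like "[0, 0, 4, 128]" (or a real list) to List[int]."""
    ratios = ast.literal_eval(value) if isinstance(value, str) else list(value)
    ratios = [int(r) for r in ratios]
    if len(ratios) != num_layers:
        raise ValueError(f"compress_ratios has {len(ratios)} entries, expected {num_layers}")
    return ratios
```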
```yaml
# Per-layer compression schedule (from config.json:compress_ratios)
#   0 = uncompressed dense layer (full attention with SWA)
#   4 = HCA branch (compress ratio 4)
# 128 = CSA branch (compress ratio 128)
```
The per-layer schedule comments invert CSA vs HCA: per the V4 design and the rest of this PR, compress_ratio == 4 is CSA and compress_ratio == 128 is HCA. Please fix the comment mapping so it matches the implementation.
```yaml
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################
```
Typo in the referenced source path (DeeSeek-v4-Pro). If this path is meant to mirror the repo directory (DeepSeek-V4-Pro), please correct it to avoid confusion when cross-referencing configs.
```python
# Sliding-window mask.
window = self.attn_sliding_window
local_mask = sliding_window_causal_mask(S, window, device=device, dtype=dtype)  # [S, S]

# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim]
# so they can be cat'd along the Sk axis.
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)

# Concatenate sliding-window KV with extra KV (if any).
if extra_k is not None:
    k_full = torch.cat([k_local_h, extra_k], dim=1)  # [B, Sk_total, H, head_dim]
    v_full = torch.cat([v_local_h, extra_v], dim=1)
    full_mask = torch.cat([local_mask, extra_mask], dim=-1)  # [Sq, Sk_total]
else:
    k_full = k_local_h
    v_full = v_local_h
    full_mask = local_mask

# Move heads dim before sequence: [B, S, H, head_dim] -> [B, H, S, head_dim]
q_bh = q.transpose(1, 2)
k_bh = k_full.transpose(1, 2)
v_bh = v_full.transpose(1, 2)

out_bh = self._compute_attention_output(q_bh, k_bh, v_bh, full_mask)
```
sliding_window_causal_mask creates a [S, S] mask, but the attention still computes q @ k^T over all S keys (k_local_h is length S). For realistic training lengths (e.g. 4096), this becomes quadratic memory/compute and is very likely to OOM, even though the model is conceptually sliding-window. Consider actually restricting K/V to the window (e.g. gather with sliding_window_kv_indices, unfold, or use a kernel/backend that supports causal sliding-window attention) so Sk_local is window rather than S.
Suggested change (replaces the quoted block above so that local attention materializes only `window` keys per query instead of all `S`):

```python
# Materialize only the causal sliding-window K/V for each query position
# so local attention scales with `window` rather than the full sequence `S`.
window = self.attn_sliding_window
window = min(window, S)
# Build per-query local indices: for query i attend to [i - window + 1, ..., i].
query_positions = torch.arange(S, device=device)
window_offsets = torch.arange(window, device=device)
local_indices = query_positions.unsqueeze(1) - (window - 1) + window_offsets.unsqueeze(0)  # [S, window]
local_valid = local_indices >= 0
local_indices = local_indices.clamp_(min=0, max=S - 1)
# Gather local K/V windows: [B, S, H, D] -> [B, S, window, H, D].
gather_index = local_indices.view(1, S, window, 1, 1).expand(
    B, S, window, self.num_heads, self.head_dim
)
k_local = torch.gather(
    k_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
    1,
    gather_index,
)
v_local = torch.gather(
    v_local_h.unsqueeze(2).expand(B, S, window, self.num_heads, self.head_dim),
    1,
    gather_index,
)
# Subclass hook: extra K/V (compressed pool, sparse top-K, etc.).
# Subclass should return tensors already broadcast to [B, S_extra, H, head_dim].
extra_k, extra_v, extra_mask = self._extra_kv(hidden, position_ids, q)
# Move heads dim before sequence for local attention:
#   q: [B, S, H, D] -> [B, H, S, D]
#   local k/v: [B, S, window, H, D] -> [B, H, S, window, D]
q_bh = q.transpose(1, 2)
k_local_bh = k_local.permute(0, 3, 1, 2, 4)
v_local_bh = v_local.permute(0, 3, 1, 2, 4)
scale = self.head_dim ** -0.5
local_scores = (q_bh.unsqueeze(-2) * k_local_bh).sum(dim=-1) * scale  # [B, H, S, window]
local_scores = local_scores.masked_fill(
    ~local_valid.view(1, 1, S, window), torch.finfo(local_scores.dtype).min
)
if extra_k is not None:
    extra_k_bh = extra_k.transpose(1, 2)  # [B, H, S_extra, D]
    extra_v_bh = extra_v.transpose(1, 2)  # [B, H, S_extra, D]
    extra_scores = torch.einsum("bhsd,bhkd->bhsk", q_bh, extra_k_bh) * scale
    if extra_mask is not None:
        if extra_mask.dtype == torch.bool:
            extra_scores = extra_scores.masked_fill(
                ~extra_mask.view(1, 1, S, -1), torch.finfo(extra_scores.dtype).min
            )
        else:
            extra_scores = extra_scores + extra_mask.view(1, 1, S, -1).to(extra_scores.dtype)
    attn_scores = torch.cat([local_scores, extra_scores], dim=-1)
    attn_probs = torch.softmax(attn_scores.float(), dim=-1).to(q_bh.dtype)
    local_probs = attn_probs[..., :window]
    extra_probs = attn_probs[..., window:]
    out_local = (local_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
    out_extra = torch.einsum("bhsk,bhkd->bhsd", extra_probs, extra_v_bh)
    out_bh = out_local + out_extra
else:
    attn_probs = torch.softmax(local_scores.float(), dim=-1).to(q_bh.dtype)
    out_bh = (attn_probs.unsqueeze(-1) * v_local_bh).sum(dim=-2)
```
```python
) -> torch.Tensor:
    """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.
```
_gather_topk_kv is annotated as returning torch.Tensor, but it actually returns (gathered, valid). This will confuse type-checkers and readers; update the return annotation (and docstring if needed) to reflect the tuple return type.
Suggested change:

```python
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Gather ``[B, P, head_dim]`` along ``P`` per query.

    Returns:
        A tuple ``(gathered, valid)`` where:
        - ``gathered`` has shape ``[B, S, K, head_dim]``.
        - ``valid`` has shape ``[B, S, K]`` and marks non-masked indices.
```
```python
gathered, valid = self._gather_topk_kv(pool_kv, topk_idxs)  # [B, S, K, head_dim]

# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
```
This statement has no effect (gathered.shape[2] is computed and discarded). It looks like a leftover debug line; please remove it to keep the CSA path clean.
Suggested change: delete the no-op `gathered.shape[2]` line.
```yaml
num_layers: 61
hidden_size: 7168
num_attention_heads: 128
num_query_groups: 1
kv_channels: 512
qk_pos_emb_head_dim: 64
ffn_hidden_size: 18432
moe_ffn_hidden_size: 3072
moe_shared_expert_intermediate_size: 3072

q_lora_rank: 1536
o_lora_rank: 1024
o_groups: 16

num_experts: 384
moe_router_topk: 6
moe_router_topk_scaling_factor: 2.5

index_topk: 1024

compress_ratios: "[128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]"
```
Same issue as Flash: compress_ratios is a quoted string, which will not deserialize to Sequence[int] and will break DeepseekV4TransformerBlock's len(compress_ratios) == num_layers check. Please make this a real YAML list (or add a normalization step) and verify the schedule length matches num_layers (61).
```yaml
# Reference:
# - deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/config.json
# - deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
# - deepseek-v4/develop/techblog/01-deepseek-v4-architecture-deep-dive.md
```
Typo in the reference path (DeeSeek-v4-Pro). Please correct the spelling/casing so the comment points at the actual directory name and is searchable.
```python
# local v we have [B, H, Sk_local, head_dim] (independent of S),
# while sparse v depends on S. Build a "value tensor" with the
# same shape on both paths by broadcasting local v:
v.shape[2]
```
This statement has no effect (v.shape[2] is computed and discarded). Please remove it; it reads like a debug remnant and makes the attention path harder to audit.
Suggested change: delete the no-op `v.shape[2]` line.
```python
FONT_REG = "/home/xiewen12/.local/share/fonts/NotoSansSC-Regular.otf"
FONT_BOLD = FONT_REG  # we only have Regular; use it for both

OUT_DIR = os.path.join(os.path.dirname(__file__), "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)


def font(sz: int, bold: bool = False) -> ImageFont.FreeTypeFont:
    return ImageFont.truetype(FONT_BOLD if bold else FONT_REG, sz)
```
FONT_REG is hard-coded to an absolute path under a specific user's home directory, which will fail for other developers/CI. Consider using a repo-relative font path, allowing an environment variable override, and/or falling back to a default font when the file isn't present.
Suggested change:

```python
BASE_DIR = os.path.dirname(__file__)
FONT_CANDIDATES = (
    os.environ.get("DIAGRAM_FONT"),
    os.environ.get("FONT_REG"),
    os.path.join(BASE_DIR, "NotoSansSC-Regular.otf"),
    os.path.join(BASE_DIR, "fonts", "NotoSansSC-Regular.otf"),
)


def _resolve_font_path() -> str | None:
    for path in FONT_CANDIDATES:
        if path and os.path.isfile(path):
            return path
    return None


FONT_REG = _resolve_font_path()
FONT_BOLD = FONT_REG  # we only have Regular; use it for both when available

OUT_DIR = os.path.join(BASE_DIR, "diagrams")
os.makedirs(OUT_DIR, exist_ok=True)


def font(sz: int, bold: bool = False) -> ImageFont.ImageFont | ImageFont.FreeTypeFont:
    font_path = FONT_BOLD if bold else FONT_REG
    if font_path:
        return ImageFont.truetype(font_path, sz)
    return ImageFont.load_default()
```
…+ MTP
Phase 5 of the V4 development plan. Lands the FFN side of the V4 stack:
hash-routed and learned top-K MoE, clamped SwiGLU experts, and the V4
MTP head. The V4 block now plugs the V4 MoE in place of P4's placeholder
SwiGLU FFN; the V4 model instantiates a separate-HyperHead MTP block when
mtp_num_layers > 0. Layer-aware YaRN was already done in P4
(DualRoPE.get_rope picks main_rope vs compress_rope by compress_ratio).
New modules:
* primus/backends/megatron/core/transformer/clamped_swiglu.py
clamped_swiglu(x, alpha=7.0): silu(gate)*up clamped to [-alpha, alpha].
ClampedSwiGLUMLP wraps it as a fused gate_up + down two-linear MLP.
Eager (Python) for v1; the perf phase will register a fused kernel.
(A minimal sketch of the clamp follows this commit message.)
* primus/backends/megatron/core/transformer/moe/v4_hash_router.py
HashRouter: static [vocab_size, topk] tid2eid table from a fixed seed.
Active for the first num_hash_layers V4 layers; gives each token a
permanent expert assignment with uniform weight 1/topk. No learnable
parameters; deterministic across PP / TP / EP ranks.
* primus/backends/megatron/core/transformer/moe/v4_topk_router.py
V4TopKRouter: learned gate with score_function in
{"sqrtsoftplus", "sigmoid", "softmax"}. Top-K with optional renorm
and optional noaux_tc per-expert bias (selection-only; probs are
read from the un-biased score).
* primus/backends/megatron/core/transformer/moe/v4_moe.py
DeepseekV4MoE: per-layer router pick (hash vs learned) + N
ClampedSwiGLUMLP routed experts + 1 shared expert. Pure-PyTorch
per-expert dispatch; P6 swaps in Megatron's token-dispatcher /
grouped-GEMM / EP path.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py
DeepseekV4MTPBlock: mtp_num_layers V4 layers, each owning its own
HyperHead (separate from the main decoder's). Shares the dual-RoPE
with the main decoder. Loss-side wiring is deferred to P6; P5 just
stands the module up so it can be unit-tested standalone.
Updated:
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py
DeepseekV4HybridLayer now picks MoE vs dense FFN based on
num_routed_experts. forward() threads token_ids through to the MoE
for hash-routed layers. The block-level forward picks token_ids up
from a model-side stash (_v4_token_ids) so callers don't have to
thread it explicitly through every layer of the call stack.
* primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py
Builds DeepseekV4MTPBlock when mtp_num_layers > 0 (post-process
rank only). forward() overridden to stash input_ids onto self.decoder
before delegating to GPTModel.forward, so hash-routed MoE layers can
consume them. Cross-PP propagation of input_ids is a P6 concern.
* primus/backends/megatron/core/models/deepseek_v4/__init__.py
Re-export DeepseekV4MTPBlock alongside the existing surface.
Smoke-tested on dev-box PyTorch container (CPU, 7-test suite):
* clamped_swiglu: clamp tight; MLP forward+backward OK.
* HashRouter: per-token top-K distinct, deterministic across re-runs and
re-instantiations w/ same seed, probs sum to 1.
* V4TopKRouter: top-K honored, renorm OK, backward OK for all three
score functions (sqrtsoftplus, sigmoid, softmax).
* DeepseekV4MoE (learned & hash modes): forward + backward; same-token
determinism for hash routing.
* DeepseekV4TransformerBlock with MoE FFN (4 layers, hc_mult=2, mixed
dense + CSA): forward + backward; deterministic in eval mode.
* DeepseekV4MTPBlock (mtp_num_layers=2, hc_mult=2): forward + backward;
per-MTP HyperHead state_dict separation verified.
Deferred to P6 (already noted in progress doc):
* Real Megatron-MoE / token-dispatcher / EP integration -- replaces the
pure-PyTorch dispatch loop in DeepseekV4MoE.forward.
* MTP loss path wiring -- DeepseekV4Model.forward currently builds the
MTP block but does not yet feed its outputs through lm_head + the
auxiliary loss term.
* Numerical alignment vs reference inference/model.py (token-0 logits
within 1e-2) -- needs reference checkpoint loading.
Made-with: Cursor
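For reference, a minimal sketch of the clamped SwiGLU described in the clamped_swiglu.py bullet above. It assumes the gate and up projections are packed along the last dimension; the packing layout and exact signature are assumptions, not the PR's code.

```python
import torch
import torch.nn.functional as F


def clamped_swiglu(x: torch.Tensor, alpha: float = 7.0) -> torch.Tensor:
    """SwiGLU with the activation output clamped to [-alpha, alpha].

    Assumes ``x`` packs gate and up projections along the last dim:
    ``x = [gate | up]`` with equal widths.
    """
    gate, up = x.chunk(2, dim=-1)
    return torch.clamp(F.silu(gate) * up, min=-alpha, max=alpha)
```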
Force-pushed b8e47a3 to 5e4008d.
Wire DeepSeek-V4 through Megatron P6 integration (PP local-layer build, EP expert sharding, and compatibility fixes), and add the P7 single-node launcher plus progress docs after a passing PP=2/EP=4 smoke run. Made-with: Cursor
Add the plan-1 roadmap/detail/test documentation plus progress tracker entries, and update the development target doc with TransformerEngine and Primus-Turbo reference pointers. Made-with: Cursor
Force-pushed ecf8169 to 1030293.
```python
gen = torch.Generator(device="cpu").manual_seed(int(seed))
# For each token id, pick ``topk`` distinct expert ids deterministically.
# randperm(num_experts) is a stable, dense permutation; slicing the
# first ``topk`` rows gives uniform-without-replacement routing.
rows = []
for _ in range(vocab_size):
    perm = torch.randperm(num_experts, generator=gen)[:topk]
    rows.append(perm)
tid2eid = torch.stack(rows, dim=0).long()  # [vocab_size, topk]
```
HashRouter.__init__ builds tid2eid by looping over every vocab_size entry and calling torch.randperm(num_experts) each time. For real V4 sizes (e.g., vocab≈129k, experts≈384), this will add significant startup time and CPU memory churn at model construction. Consider replacing this with a deterministic hash-based mapping (no table), or generating the table in larger vectorized blocks (and/or only for the subset of vocab used), so model init remains scalable.
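For illustration, a vectorized sketch that builds the same kind of deterministic table in one shot (seeded, uniform without replacement per token id). This is a hedged replacement idea, not the PR's code; for very large vocabularies the random matrix could also be generated in row blocks to bound peak memory.

```python
import torch


def build_tid2eid(vocab_size: int, num_experts: int, topk: int, seed: int) -> torch.Tensor:
    """Deterministic [vocab_size, topk] table of distinct expert ids per token id."""
    gen = torch.Generator(device="cpu").manual_seed(int(seed))
    # One seeded random matrix; argsort of each row of iid uniforms is a
    # uniform random permutation, so the first ``topk`` columns per row
    # are ``topk`` distinct expert ids drawn without replacement.
    scores = torch.rand(vocab_size, num_experts, generator=gen)
    return torch.argsort(scores, dim=-1)[:, :topk].long()
```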
```python
def _gather_topk_kv(
    self,
    pool: torch.Tensor,       # [B, P, head_dim]
    topk_idxs: torch.Tensor,  # [B, S, K] (-1 for masked)
) -> torch.Tensor:
    """Gather ``[B, P, head_dim]`` along ``P`` per query → ``[B, S, K, head_dim]``.

    Out-of-range / masked indices (``-1``) are clamped to ``0`` for the
    gather, then *zero-masked* afterwards.
    """
    B, S, K = topk_idxs.shape
    P, Hd = pool.shape[1], pool.shape[2]
    valid = topk_idxs >= 0  # [B, S, K]
    safe_idx = topk_idxs.clamp(min=0)
    # Expand idx to gather along P for each (B, S, K, Hd).
    idx_expand = safe_idx.unsqueeze(-1).expand(B, S, K, Hd)
    pool_expand = pool.unsqueeze(1).expand(B, S, P, Hd)  # [B, S, P, Hd]
    gathered = torch.gather(pool_expand, dim=2, index=idx_expand)  # [B, S, K, Hd]
    gathered = gathered * valid.unsqueeze(-1).to(gathered.dtype)
    return gathered, valid
```
_gather_topk_kv is annotated as returning only a torch.Tensor, but it actually returns (gathered, valid). This mismatch can break type checking and mislead callers; update the return annotation (and docstring if desired) to Tuple[torch.Tensor, torch.Tensor].
```python
in_dtype = x.dtype
x32 = x.float()
rsqrt = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return (x32 * rsqrt).to(in_dtype) * self.weight
```
The standalone RMSNorm implementation returns (…to(in_dtype) * self.weight). If self.weight remains fp32 (common in mixed-precision training), this multiplication will upcast the output back to fp32, potentially defeating BF16 activation flow and increasing memory/compute. Consider multiplying by self.weight.to(in_dtype) (or casting the final result back to in_dtype) so the output dtype stays consistent with the input activation dtype.
Suggested change:

```python
return (x32 * rsqrt).to(in_dtype) * self.weight.to(in_dtype)
```
```python
flat = hidden.reshape(-1, D)  # [N, D]
flat.shape[0]
```
This flat.shape[0] statement is a no-op and appears to be leftover debug code. Please remove it to keep the forward path minimal and lint-clean.
```yaml
# DeepSeek-V4 Pro (large MoE variant).
#
# Source: deepseek-v4/deepseek-ai/DeeSeek-v4-Pro/config.json
###############################################################################
```
Typo in the source comment path: DeeSeek-v4-Pro should be DeepSeek-v4-Pro (consistent with the model naming elsewhere).
```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    in_dtype = x.dtype
    x32 = x.float()
    rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
    return (x32 * rms).to(in_dtype) * self.weight
```
Same RMSNorm dtype issue here: (…to(in_dtype) * self.weight) can upcast the output back to fp32 if self.weight is fp32, which is likely under mixed precision. To keep the compressor output in the activation dtype, multiply by self.weight.to(in_dtype) or cast the final output back to in_dtype.
```python
v.shape[2]
v_local_per_q = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1)  # [B, H, S, Sk_local, head_dim]
```
This v.shape[2] line is a no-op (likely leftover from debugging) and should be removed to avoid confusing readers and linters.
```python
class HashRouter(nn.Module):
    """Static hash-based MoE router.

    Args:
        num_experts: total number of routed experts.
        topk: number of experts each token is routed to.
        vocab_size: tokenizer vocabulary size; controls the table length.
        seed: deterministic seed for the hash; same across all ranks.
        dtype: dtype of the returned ``probs`` tensor; defaults to
            ``torch.float32``.
    """
```
This PR introduces substantial new DeepSeek-V4 core modules (attention variants, compressor/indexer, routers, MoE, HC) but does not add unit tests covering their key invariants (e.g., HashRouter determinism, CSA/HCA causality masks, compressor/indexer shape/validity). The repo already has a Python unit test suite under tests/unit_tests/ (including Megatron transformer tests), so please add focused unit tests for these new modules to prevent regressions.
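As one concrete example of the requested coverage, a hedged sketch of a HashRouter determinism test. The import path and the `tid2eid` attribute come from this PR's description and quoted code; the constructor keyword names are assumptions and may need adjusting.

```python
import torch

from primus.backends.megatron.core.transformer.moe.v4_hash_router import HashRouter


def test_hash_router_is_deterministic_across_instances():
    token_ids = torch.arange(0, 1024)
    a = HashRouter(num_experts=64, topk=4, vocab_size=4096, seed=1234)
    b = HashRouter(num_experts=64, topk=4, vocab_size=4096, seed=1234)
    # Same seed -> identical static table, so identical routing.
    assert torch.equal(a.tid2eid[token_ids], b.tid2eid[token_ids])
    # Each token's top-K experts are distinct (sorted row has no repeats).
    eids = a.tid2eid[token_ids]
    assert (eids.sort(dim=-1).values.diff(dim=-1) != 0).all()
```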
Remove GPT placeholder/super-init spec coupling so DeepSeek-V4 builds its decoder directly from DeepSeek ModuleSpec submodule trees, and update Phase 8 progress records to match the finalized implementation and validation status. Made-with: Cursor
Unify DeepSeek-V4 runtime module selection under DeepSeekV4SpecProvider and migrate attention/MLP/MoE construction to provider-driven ModuleSpec flows with safe local fallbacks. Document and validate the TE CUDA runtime contract, including an explicit fail-fast guard for non-CUDA TE/Turbo inputs and updated Phase 9 progress records in English. Made-with: Cursor
```python
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.
```
There are a couple of no-op statements (e.g., gathered.shape[2]) that have no effect and appear to be leftover debugging. Please remove them to keep the CSA path easier to read/maintain.
```python
batch, seq = input_ids.shape
position_ids = (
    input_ids.new_arange(seq, dtype=input_ids.dtype).unsqueeze(0).expand(batch, -1)
)
```
input_ids.new_arange(...) is not a valid PyTorch Tensor API (and there is no local helper/monkeypatch in the repo), so this will raise AttributeError when position_ids is omitted. Use torch.arange(seq, device=input_ids.device, dtype=...) (or the existing Megatron helper used elsewhere) to build position ids.
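A minimal sketch of the suggested fix, using only standard PyTorch tensor APIs (the toy `input_ids` stands in for the real batch):

```python
import torch

input_ids = torch.zeros(2, 8, dtype=torch.long)  # toy stand-in for the real batch

batch, seq = input_ids.shape
# torch.arange is the supported API; Tensor.new_arange does not exist.
position_ids = (
    torch.arange(seq, device=input_ids.device, dtype=input_ids.dtype)
    .unsqueeze(0)
    .expand(batch, -1)
)
```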
```bash
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null
```
FP8/FP8_RECIPE default to the literal string null, but the script still passes them via --fp8/--fp8_recipe. That makes args.fp8 truthy and can trigger FP8 validation paths (and failures) even when FP8 is intended to be disabled. Only include these CLI flags when PRECISION_TYPE=FP8, or ensure the disabled state is represented in a way the arg parser treats as false/None.
```python
B, S, D = hidden.shape
flat = hidden.reshape(-1, D)  # [N, D]
flat.shape[0]
```
There are a few no-op statements left in forward (e.g., flat.shape[0]) that don't affect execution and look like leftover debugging. Please remove them to avoid confusion and keep the forward path clean.
…chema Align phase10 DeepSeek-V4 modules on explicit spec/provider contracts by enforcing SharedExpertMLP-only shared experts and introducing a dedicated DeepSeekV4TransformerConfig for V4-only runtime fields. Update builder/spec/docs so training resolves the new config type and tracks activation clamp through model config. Made-with: Cursor
Fix HC/attention dtype mismatches and tune the DeepSeek-V4 smoke script defaults so the Phase 10 MI355X run completes reliably end-to-end. Add a dedicated Phase 10 convergence report documenting delivered scope, runtime blockers, and remaining tracked items. Made-with: Cursor
```bash
export PRECISION_TYPE=${PRECISION_TYPE:-BF16}
export FP8=null
export FP8_RECIPE=null

if [ "$PRECISION_TYPE" = "FP8" ]; then
    export FP8=${FP8:-hybrid}
    export FP8_RECIPE=${FP8_RECIPE:-delayed}
fi

export EXP=${EXP:-examples/megatron/configs/MI355X/deepseek_v4_flash-BF16-pretrain.yaml}
export BACKEND_PATH=${BACKEND_PATH:-"$(pwd)/third_party/Megatron-LM"}
export PRIMUS_TEAM=${PRIMUS_TEAM:-amd}
export PRIMUS_USER=${PRIMUS_USER:-tas-mi355x-$(date +%Y%m%d)}
export PRIMUS_EXP_NAME=${PRIMUS_EXP_NAME:-deepseek_v4_smoke_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_PP${PRIMUS_PP}_EP${PRIMUS_EP}}

if [ ! -d "$BACKEND_PATH" ] || [ -z "$(ls -A "$BACKEND_PATH" 2>/dev/null)" ]; then
    echo "[ERROR] BACKEND_PATH does not exist or is empty: $BACKEND_PATH"
    echo "Run: git submodule update --init --recursive"
    exit 1
fi

mkdir -p "output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME"

./primus-cli direct \
    -- train pretrain --config "$EXP" \
    --backend_path "$BACKEND_PATH" \
    --num_layers "$PRIMUS_TOTAL_LAYERS" \
    --train_iters "$TRAIN_ITERS" \
    --lr_warmup_iters 0 \
    --lr_decay_iters "$TRAIN_ITERS" \
    --micro_batch_size "$MBS" \
    --global_batch_size "$GBS" \
    --seq_length "$PRIMUS_SEQ_LENGTH" \
    --max_position_embeddings "$PRIMUS_MAX_POSITION_EMBEDDINGS" \
    --rope_type rope \
    --tensor_model_parallel_size "$PRIMUS_TP" \
    --pipeline_model_parallel_size "$PRIMUS_PP" \
    --expert_model_parallel_size "$PRIMUS_EP" \
    --num_experts "$PRIMUS_NUM_EXPERTS" \
    --moe_router_topk "$PRIMUS_MOE_TOPK" \
    --moe_router_enable_expert_bias "$PRIMUS_MOE_ENABLE_EXPERT_BIAS" \
    --moe_ffn_hidden_size "$PRIMUS_MOE_FFN_HIDDEN_SIZE" \
    --index_topk "$PRIMUS_INDEX_TOPK" \
    --v4_grouped_experts_support_clamped_swiglu "$PRIMUS_V4_GROUPED_EXPERTS_SUPPORT_CLAMPED_SWIGLU" \
    --compress_ratios "$PRIMUS_COMPRESS_RATIOS" \
    --mtp_num_layers 0 \
    --mock_data True \
    --use_turbo_attention "$USE_TURBO_ATTENTION" \
    --use_turbo_grouped_mlp "$TURBO_USE_GROUPED_MLP" \
    --moe_use_legacy_grouped_gemm "$LEGACY_GG" \
    --fp8 "$FP8" \
    --fp8_recipe "$FP8_RECIPE" \
```
FP8/FP8_RECIPE are always passed to primus-cli (defaulting to the literal string null). Other run scripts in this repo gate --fp8 ... args behind an explicit FP8 enable flag; passing null may be rejected by argument parsing or select an unintended FP8 mode. Consider only adding --fp8/--fp8_recipe when PRECISION_TYPE=FP8 (or when a dedicated FP8=True flag is set), and omit them entirely otherwise.
```python
# Primus-owned: DeepSeek-V4 (Phase 2 stub; full V4 wiring lands in Phase 3+)
if model_type == "deepseek_v4":
    deepseek_v4_module = importlib.import_module(
        "primus.backends.megatron.core.models.deepseek_v4.deepseek_v4_builders"
    )
```
The comment "Phase 2 stub; full V4 wiring lands in Phase 3+" is now misleading since this PR imports the full DeepSeek-V4 builders/specs. Updating/removing it will avoid confusion when debugging model-type dispatch.
```python
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.
# This is per-query, shape [S, K]; we keep it on the module as a
# full [B, S, K] additive mask.
sparse_mask = torch.where(valid, 0.0, float("-inf")).to(dtype)  # [B, S, K]
self._csa_state = {
    "gathered": gathered,        # [B, S, K, head_dim]
    "sparse_mask": sparse_mask,  # [B, S, K]
}

# Tell the parent: no cat-extension; we handle CSA inside
# ``_compute_attention_output``.
return None, None, None
```
CSAAttention stores per-forward tensors in self._csa_state and then reads them in _compute_attention_output. This is not safe under pipeline parallel schedules (multiple microbatches in flight) or activation checkpoint recomputation, because the module attribute can be overwritten before earlier microbatches/backward recomputes run, leading to wrong outputs/gradients. Refactor CSA to avoid mutable module-level forward state (e.g., compute the joint local+sparse attention fully inside forward, or thread the gathered KV/mask through the call stack without storing on self).
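One way to remove the mutable module state, sketched on a toy module: compute the CSA extras in forward and pass them down as arguments, so concurrent microbatches and checkpoint recomputes each see their own tensors. Shapes and class names here are illustrative, not the PR's classes.

```python
import torch
from torch import nn


class ToyCSAAttention(nn.Module):
    """Toy illustration: per-call CSA tensors travel as arguments, not module state."""

    def _extra_kv(self, hidden: torch.Tensor):
        # Stand-in for the real compressed-pool gather; returns per-call tensors.
        gathered = hidden.unsqueeze(2)  # [B, S, K=1, D] toy "top-K" pool
        sparse_mask = torch.zeros(
            hidden.shape[0], hidden.shape[1], 1, device=hidden.device, dtype=hidden.dtype
        )  # [B, S, K] additive mask
        return gathered, sparse_mask

    def _compute_attention_output(self, q: torch.Tensor, csa_extras):
        # Extras arrive as arguments, so overlapping microbatches / recompute
        # replays can never clobber each other's state.
        gathered, sparse_mask = csa_extras
        scores = (q.unsqueeze(2) * gathered).sum(-1) + sparse_mask  # [B, S, K]
        probs = scores.softmax(dim=-1)
        return (probs.unsqueeze(-1) * gathered).sum(dim=2)  # [B, S, D]

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        csa_extras = self._extra_kv(hidden)
        return self._compute_attention_output(hidden, csa_extras)
```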
```python
# 5) Stash for ``_compute_attention_output`` to consume.
gathered.shape[2]
# Build mask for the compressed branch: ``-inf`` where invalid.
```
There are two no-op expression statements (gathered.shape[2] and later v.shape[2]) that have no effect and look like leftover debug code. They should be removed to avoid confusion (and to keep linters/type checkers from flagging them).
```python
decoder = getattr(self, "decoder", None)
if decoder is not None:
    decoder._v4_token_ids = input_ids
try:
    hidden_states = self.decoder(
        hidden_states=decoder_input,
        attention_mask=attention_mask,
        **kwargs,
    )
finally:
    if decoder is not None:
        decoder._v4_token_ids = None
```
DeepseekV4Model.forward stashes input_ids onto decoder._v4_token_ids and clears it immediately after the forward. This breaks any activation checkpoint/recompute that re-invokes decoder/layer forwards during backward (token_ids will be missing) and is also unsafe with pipeline schedules that can have multiple microbatches using the same module instance. Prefer passing token_ids=input_ids explicitly into self.decoder(...) (the decoder already accepts a token_ids kwarg) instead of relying on mutable module state.
…) before dispatch + smoke
User directive: P27's first deliverable is now a release-tier
correctness gate that runs the existing G23 / G24 / G26 / G27
equivalence tests on production V4 dimensions (`head_dim=512`, real
`H`, real `swa_window`, real `K_topk`), so any kernel-numerics
regression at the head-dim that plan-4 exists to solve is caught
BEFORE the dispatch + smoke layers are added on top.
Plan-doc updates:
* `01-roadmap.md` — P27 deliverables now lead with the release-tier
shape gate; Milestone M4 split into M4a (release-tier shape
correctness) and M4b (smoke).
* `02-phase-details.md` — Phase 27 section rewritten to lead with
task 1 "Release-tier shape gate (G28)" followed by dispatch
precedence, run-script docs, dispatch unit test (G29), smoke run +
smoke gate (G30), and the hand-off note. Design notes explain why
G28 fits in P27 (not P25 / P26), why eager fp32 reference fits at
calibrated `S ∈ {512, 1024}`, and why we don't target full
`S=4096` in unit tests (smoke covers full `S`).
* `03-test-strategy.md` — gate matrix gains G28 (release-tier kernel
correctness at production V4 dims); previous G28 / G29 renumbered
to G29 / G30. GPU-toy harness paragraph documents the fast-tier
vs release-tier split.
* `status.md` — Phase 27 task table re-ordered: G28 row first, then
dispatch precedence, env-var plumbing, dispatch unit test (G29),
smoke run, smoke gate (G30), hand-off note.
The actual kernel implementations (P25 / P26) and dispatch plumbing
(P22 / P25 / P26) are unchanged — this is a plan-doc + status-table
reorganisation only.
Co-authored-by: Cursor <cursoragent@cursor.com>
wiring + EP8 smoke (closes plan-4)
Plan-4 P27 lands the three layered closing gates for the in-tree Primus Triton V4 attention kernels (P25 dense / HCA, P26 CSA), and appends the plan-4 hand-off summary.
* G28 release-tier shape gate (lands first per user directive — kernel numerics are locked at production V4 dims BEFORE the dispatch + smoke layers are stacked on top). Extends `_BASE_SHAPES` in the four P25 / P26 fwd/bwd test files with V4-Flash (`H=64, head_dim=512, swa_window=128, S=1024, K_topk=512`) and V4-Pro (`H=128, head_dim=512, swa_window=128, S=512, K_topk=512`) pytest.param entries marked `pytest.mark.slow`. New `tests/unit_tests/megatron/transformer/deepseek_v4/conftest.py` ships an autouse `torch.cuda.empty_cache()` fixture so the eager CSA reference's `[B, H, Sq, K, D]` einsum intermediate doesn't accumulate in PyTorch's caching allocator across consecutive release-tier tests. Root `tests/unit_tests/conftest.py` registers `pytest.mark.slow` and adds a `--run-slow` opt-in (also accepts `-m slow`); a minimal sketch of this harness follows this commit message. Release-tier bf16 tolerances bumped to absorb `head_dim=512` matmul noise + `tl.atomic_add` jitter on the backward (FWD bf16 atol=5e-2; BWD bf16 dq/dk/dv/dgathered atol=2e-1; dsink atol=5e-2). 80 / 80 release-tier tests pass on mi355-gpu-14 inside dev_primus_wenx_693 in 60.2 s (`pytest --run-slow -m slow`); fast-tier suite remains green.
* G29 dispatch precedence + startup log line. New `_log_kernel_choice` helper on `DeepseekV4Attention` emits one `[V4-attn] Layer N: cr=R, kernel = ...` info line per layer at rank 0 so smoke / training logs unambiguously show which kernel each layer is firing through. The class docstring grows a precedence table covering all three layer kinds plus the auto-disable rules for the two flags. New `test_v4_p27_dispatch_precedence.py` (16 tests) covers the dispatch path at runtime: 7 parametrised log-line tests across every (cr, flag) → expected-kernel combo + format / once-per-call / layer-number assertions; runtime-mock tests on `_attention_forward_via_v4_triton` and `_csa_forward` verifying the right kernel symbol is invoked with the right kwargs; two auto-disable runtime tests for the cross-layer-kind contracts. `run_deepseek_v4.sh` gains a soft `[WARN]` echo when either Triton flag is on and `PRIMUS_TP > 1` (kernels are MQA-centric and operate per-rank on the local H/TP head slice — TP > 1 should work but stays uncovered by plan-4 gates).
* G30 TP=1 PP=1 EP=8 10-iter smoke with both kernels engaged + Turbo DeepEP. New `progress/p27/run_smoke_v4_kernels_ep8_pp1.sh` script + matching `.gitignore` (excludes `*.log` / `log_*.txt` / `debug.log` / `*.tgz` / `*.json` per the plan-3 directive — smoke logs MUST NOT land in git). Smoke is green: 10 / 10 iters clean, lm_loss converges 11.85 → 11.65, grad norm steady, 0 nan iterations, all 8 layers emit the expected kernel-choice log line. Steady-state ~17.3 TFLOP/s/GPU (peak ~19.8) at ~500 ms / iter — at parity with the P23 Turbo-DeepEP-on-eager-attention baseline at the smoke's small seq length (the eager attention is matmul-cheap at S=128 and DeepEP dominates iter time; the Triton kernels' real win is on full V4-Flash production dims, planned as a plan-4 follow-up).
* P27 hand-off block appended to `plan-4/02-phase-details.md` recording: commit chain P24 → P27, fast-tier + release-tier test totals, G30 smoke perf delta vs. eager / DeepEP baselines, and the follow-up list (Megatron-side `layer_number` plumbing, full-S=4096 smoke, HCA LSE-merge for Turbo, CSA in-kernel gather, FP8, default-True flip).
Plan-4 ends. The two switches (`use_v4_triton_attention`, `use_v4_triton_csa_attention`) ship at default `False` so this PR is a pure safety-net add; the Triton path is opt-in via the existing run-script env vars. Co-authored-by: Cursor <cursoragent@cursor.com>
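A minimal sketch of the opt-in slow-marker harness described above, assuming standard pytest hooks; option, marker, and fixture names follow the commit text, but the exact bodies are illustrative.

```python
# tests/unit_tests/conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-slow", action="store_true", default=False,
        help="run release-tier tests marked @pytest.mark.slow",
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: release-tier (production-shape) tests")


def pytest_collection_modifyitems(config, items):
    # Opt in via --run-slow; selecting `-m slow` explicitly also runs them.
    if config.getoption("--run-slow") or "slow" in (config.getoption("markexpr") or ""):
        return
    skip_slow = pytest.mark.skip(reason="release-tier test; pass --run-slow to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)


# tests/unit_tests/megatron/transformer/deepseek_v4/conftest.py (sketch)
import torch


@pytest.fixture(autouse=True)
def free_cached_blocks():
    yield
    if torch.cuda.is_available():
        # Drop the eager CSA reference's large einsum intermediates between tests.
        torch.cuda.empty_cache()
```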
…P27 SHA (e19663f)
Replaces the TBD-p27 / TBD-p27a placeholders in `deepseek-v4/develop/progress/status.md` and the P27 hand-off block in `deepseek-v4/develop/plan-4/02-phase-details.md` with the actual commit SHA `e19663f7`. Mirrors the P25 / P26 SHA-pin convention (`1ba38ba5` / `36dfca66`). Plan-4 ends here.
Co-authored-by: Cursor <cursoragent@cursor.com>
…+ flip run_deepseek_v4.sh defaults to PP1EP8 + V4 Triton kernels on
Plan-5 picks up where plan-4 closed (in-tree V4 Triton kernels for
dense / HCA / CSA shipped behind use_v4_triton_attention /
use_v4_triton_csa_attention; +37 % vs P22 eager at smoke seq) and is
strictly scoped to taking V4-Flash EP=8 single-node training from its
plan-4 P27 G30 steady-state up the throughput curve at production-shape
sequence length, by attacking the bottlenecks visible in a real
torch.profiler trace.
Five phases (every P29..P32 task list is seeded; refined in writing
against the P28 trace at phase open; targets < 10 % of step time get
de-scoped on the spot):
- P28 (kick-off) — run_deepseek_v4_flash_proxy.sh (V4-Flash widths,
8 layers, all four perf knobs on: USE_V4_TRITON_ATTENTION,
USE_V4_TRITON_CSA_ATTENTION, USE_TURBO_DEEPEP, TURBO_USE_GROUPED_MLP)
calibrated for one MI355X node at EP=8; chrome-trace JSON for one
steady iter; baseline analysis report (md + html) under
develop/profile/profile-baseline-ep8-<date>.{md,html}. The
report's ranked bottleneck list pins the X / Y / Z / W per-phase
improvement budgets for everything that follows. Gates: G31 (smoke)
+ G31a (report).
- P29 (small-op fusion) — seeded targets: (a) Q-projection chain,
(b) KV-projection chain, (c) O-projection group, (d) Compressor +
Indexer, (e) MoE router. Each behind own use_v4_fused_* switch
(default False); functional fusion (not module-level) so
Megatron's spec walker stays untouched. Gates: G32.{a..e} + G33
(smoke + perf, >= +X % TFLOP/s/GPU vs P28 baseline).
- P30 (V4 Triton attention perf) — per-shape autotune for FWD + BWD
(BLOCK_M / BLOCK_N / num_warps / num_stages keyed on H, head_dim,
swa_window; SMEM heuristic prunes > 160 KiB at compile time);
persistent FWD kernel; HCA LSE-merge variant (was a plan-4
follow-up; runs SWA + compressed-pool branches as two flash
kernels and merges via online softmax — avoids the materialised
additive bias). use_v4_attention_lse_merge switch (default False);
G34 asserts FWD + BWD equivalence within bf16 budget. (A minimal
LSE-merge sketch follows this commit message.)
- P31 (V4 Triton CSA perf) — in-kernel topk_idxs gather (drops the
~64 GiB / microbatch wrapper-side materialisation at V4-Flash
production dims; this is also the structural fix that eventually
lets the proxy reach Sq=4096); K-tile prefetching.
use_v4_csa_in_kernel_gather switch (default False); reuses
plan-4 G26 / G27 release-tier with dgathered -> dpool assertion.
- P32 (overlap + recompute) — re-enable --overlap_grad_reduce True
--overlap_param_gather True (currently False; plan-4 G30 obsoletes
the plan-2 stability hedge that turned them off); MoE
shared-expert overlap investigation; recompute granularity
tuning if P31's in-kernel gather frees enough HBM. Final EP=8
trace at develop/profile/profile-final-*. Gate: G35 (smoke +
cumulative perf, >= W % vs P28 baseline) + plan-4 ratchet
(G23..G30) all green.
Ratchet — every plan-5 phase MUST keep plan-4 gates G23 / G24 / G25 /
G26 / G27 / G28 / G29 / G30 green. Banned-warning ratchet adds
"v4_fused_* compile error" and "DeepEP contract violation". Plan-5 is
measurement-driven: no per-phase budget number is committed in the
plan docs; P28's report owns picking and writing them.
Out of scope (plan-5): FP8 / FP4 / mxfp4 quantised forward,
convergence run, long-context (1M-token) bring-up, multi-node EP
scaling, HF state-dict adapter, V3 / V2 backports of plan-5 fusions.
run_deepseek_v4.sh — defaults flipped per the user directive that
preceded plan-5 planning so the V4-Flash production smoke runs
end-to-end without any env-var override:
- PRIMUS_PP defaults 2 -> 1
- PRIMUS_EP defaults 4 -> 8
- USE_V4_TRITON_ATTENTION defaults False -> True
- USE_V4_TRITON_CSA_ATTENTION defaults False -> True
All four are still env-var overridable; the existing TP > 1 soft
warning (plan-4 P27) still fires when the V4 Triton kernels are on at
TP > 1. plan-4 G30 evidence (10/10 iters clean, lm_loss 11.85 ->
11.65, throughput 17.3 TFLOP/s/GPU steady at PP=1 EP=8 with both V4
Triton kernels + Turbo DeepEP on) gates this default flip.
Documents:
- deepseek-v4/develop/plan-5/README.md (overview, scope, phase map)
- deepseek-v4/develop/plan-5/01-roadmap.md (phase overview, dep
graph, milestones, top risks, out-of-scope)
- deepseek-v4/develop/plan-5/02-phase-details.md (per-phase tasks,
design notes, edge cases; hand-off note placeholder)
- deepseek-v4/develop/plan-5/03-test-strategy.md (gate matrix
G31..G35, plan-4 ratchet contract, banned-warning ratchet,
perf-budget contract)
- deepseek-v4/develop/progress/status.md (Phase 28..32 task tables
with TBD-p2X commit cells)
Co-authored-by: Cursor <cursoragent@cursor.com>
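To make the P30 "LSE merge" idea concrete, a hedged sketch of merging two attention branches computed separately, each returning its output and log-sum-exp over its own keys. This mirrors the online-softmax identity, not the PR's Triton code; shapes are assumptions.

```python
import torch


def merge_attention_branches(out_a, lse_a, out_b, lse_b):
    """Merge two softmax-attention partial results into one joint softmax.

    out_*: [B, H, S, D] branch outputs; lse_*: [B, H, S] log-sum-exp of the
    branch logits. Each branch is reweighted by its share of the joint
    partition function, exactly as if one softmax had seen all keys.
    """
    m = torch.maximum(lse_a, lse_b)          # stabilize the exponentials
    w_a = torch.exp(lse_a - m)
    w_b = torch.exp(lse_b - m)
    denom = (w_a + w_b).unsqueeze(-1)
    return (out_a * w_a.unsqueeze(-1) + out_b * w_b.unsqueeze(-1)) / denom
```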
…+ bottleneck report
Phase 28 ships the foundation that every other plan-5 phase reports
its delta against:
* `run_deepseek_v4_flash_proxy.sh` — thin wrapper over `run_deepseek_v4.sh`
that pins V4-Flash production widths (`H=64, head_dim=512,
num_experts=256, moe_router_topk=6, moe_ffn_hidden_size=2048,
index_topk=512`), 8 layers, `compress_ratios=[0,0,4,128,4,128,4,0]`
(every layer kind exercised: 3 dense, 3 CSA, 2 HCA), `TP=1 PP=1 EP=8`,
and all four perf knobs on (`USE_V4_TRITON_ATTENTION`,
`USE_V4_TRITON_CSA_ATTENTION`, `USE_TURBO_DEEPEP`,
`TURBO_USE_GROUPED_MLP`).
* `progress/p28/run_baseline_trace_ep8.sh` — self-contained trace-capture
script (mirrors plan-3 P23 / plan-4 P25 pattern; `run_deepseek_v4.sh`
hard-codes `--disable_tensorboard True` which blocks the profiler's
TB writer). Captures iter 6 -> 7 (one steady iter) at `Sq=4096` with
`PROFILE=True --use_pytorch_profiler True`. `progress/p28/.gitignore`
excludes the raw `*.log` / `*.json` / `*.tgz` outputs.
* `develop/profile/_tools/render_baseline_report.py` — chrome-trace
consumer that emits the markdown + HTML bottleneck-analysis report.
Multi-stream-overlap-aware GPU-active math (interval-union sweep —
single-stream `Sigma dur` over-counts on multi-stream HIP); top-1
reduce-kernel signature isolated; module-level CPU op-time numbers
carry an explicit "nests" caveat so readers do not misread bloated
`Sigma event dur` totals. Tool is reused by P32 for `profile-final-*`.
* `develop/profile/profile-baseline-ep8-20260508.{md,html}` — the P28
report. Headline findings:
- GPU active = 99.7 % (CPU-bound floor 0.3 %, multi-stream overlap
factor 1.87x). The pre-trace hypothesis that small-kernel-launch
tail is the bottleneck DOES NOT HOLD at V4-Flash production
widths.
- Top kernel by far is one specific `aten::sum` fp32 reduce
(`reduce_kernel<512, 1, ReduceOp<float, sum_functor<float, float,
float>>>`) at 7.61 s (87.3 % of step) over 717 launches x 10.62 ms
each.
- V4 Triton attention kernels are BWD-heavy: dense / HCA = 3.90 s
(44.7 %), CSA = 4.19 s (48.1 %).
- Comm time = 12.85 ms (0.1 %); HBM peak = 195 / 287 GiB ~ 68 %.
- Per-phase de-scope decisions (data-driven, 10 % rule): P29 KEEP
but RESCOPE (drop small-op fusion mandate, redirect to root-
causing the 7.6 s `aten::sum`); P30 KEEP (BWD prioritised); P31
KEEP but RESCOPE (HBM motivation gone, kept for BWD speed-up);
P32 DE-SCOPED.
- Combined target: plan-5 final >= 110 TFLOP/s/GPU steady at
Sq=4096 EP=8 single-node (40 %+ over the 78 TFLOP/s/GPU baseline).
* Calibration outcome: `Sq=4096` (production target) confirmed fitting
on a single MI355X node at EP=8 (peak rocm HBM 195 GiB / 287 GiB ~
68 %, 5/5 calibration iters clean, 10/10 baseline iters clean,
lm_loss 11.16 -> 9.26, 0 NaN, banned-warning grep on plan-3 / plan-4
ratchet patterns returns 0 for every term). No fall-back to
Sq=2048 / 1024 / 512 needed; `Sq=4096` adopted as the proxy default.
* `progress/status.md` Phase 28 row: all 6 P28 task cells checked, with
`TBD-p28` SHA placeholders that will be SHA-pinned in a follow-up
commit (mirrors the plan-4 P27 -> 03bacc2 pattern).
Closes plan-5 P28. P29 / P30 / P31 task lists open against this
baseline; P32 is de-scoped pending evidence.
Co-authored-by: Cursor <cursoragent@cursor.com>
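A hedged sketch of the interval-union sweep mentioned above: GPU-active time computed as the measure of the union of kernel intervals across all streams, so overlapping kernels are not double-counted. This is illustrative, not the `render_baseline_report.py` code.

```python
from typing import Iterable, List, Tuple


def gpu_active_us(intervals: Iterable[Tuple[float, float]]) -> float:
    """Union length of [start, end) kernel intervals across all streams.

    Summing durations over-counts whenever streams overlap; sweeping the
    sorted intervals and merging overlapping runs yields true GPU-active time.
    """
    ivs: List[Tuple[float, float]] = sorted(intervals)
    active = 0.0
    cur_start = cur_end = None
    for start, end in ivs:
        if cur_end is None or start > cur_end:   # disjoint: flush the previous run
            if cur_end is not None:
                active += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlap: extend the current run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        active += cur_end - cur_start
    return active
```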
…rn + project-wide rules doc
Plan-5 P29 (RESCOPED) — kill the dominant aten::sum fp32 reduce kernel that
the P28 baseline trace pinned at 7.61 s / 87.3 % of step time.
Forensic root cause (progress/p29/refinement.md + _forensics{,2,3}.py):
624 / 717 of all dominant `reduce_kernel<512, 1, ...>` launches (96 %
by count, 99.95 % by Σ kernel duration) come from
`hyper_connection.py:47 sinkhorn_normalize` — 39 reductions / call ×
8 layers × 2 (FWD + AOT-autograd BWD) = 624. Inputs are
`(1, 4096, 4, 4) → keepdim=True dim=-1` fp32. HIP's default
`reduce_kernel<512, 1, ...>` is sized for huge reductions; for our
4-elements-per-output shape it runs at ~250× over the memory-bound
floor (~12.5 % occupancy + 624 × 5 µs launch overhead).
Fix: a `torch.compile(fullgraph=True, dynamic=True)` build of
`sinkhorn_normalize`, cached on `(n_iters, eps, in_dtype)` (shape NOT
in key — `dynamic=True` ships ONE shape-generic Inductor kernel; that
also avoids Dynamo's `cache_size_limit=8` collision when closures
from the same factory share a `code` object). Algorithm is byte-
identical; only the kernel boundary moves. AOT autograd handles BWD.
(A minimal sketch of the cached compile wrapper follows this commit
message.)
Behind a default-off feature flag `use_v4_compiled_sinkhorn` plumbed
through `DeepSeekV4TransformerConfig` → V4 base + V4-Flash YAML →
`DeepseekV4HybridLayer` → `HyperMixer.__init__` →
`HyperMixer.compute_weights` → `sinkhorn_normalize(use_compiled=...)`.
`run_deepseek_v4.sh` exports `USE_V4_COMPILED_SINKHORN` (default
`False`); `run_deepseek_v4_flash_proxy.sh` flips its default to `True`
so plan-5 P30 / P31 measure against the post-P29 baseline.
Gates (all green):
* G32 — FWD + BWD parity (compiled vs eager); 10 / 10 tests pass; fast
tier (B=2, S=64, K=4) atol=1e-5; release tier (B=1, S=4096, K=4)
marked pytest.mark.slow; cache-hit assertion on second call;
HyperMixer flag-propagation test included.
* G33a — 10-iter EP=8 proxy smoke; no NaN / Inf / banned warnings;
`lm_loss[10] = 9.258` vs P28 baseline `9.258` (bit-for-bit); steady
79.1 vs 77.5 TFLOP/s/GPU (+2.0 %).
* G33b — post-P29 chrome-trace + bottleneck report at
`develop/profile/profile-after-p29-ep8-20260509.{md,html}`. Budget
X1 (≥ 50 % drop in aten::sum kernel time) MET BY ~1000×: critical
shape kernel time 7607.9 ms → 0.2 ms (−99.997 %), launches
624 → 16. Multi-stream overlap factor collapsed 1.87× → 1.00× —
explains why wall-time gain is only +2 % despite the kernel-time
delta (the reduce was a parallel hitchhiker on stream-1; the V4
Triton attention BWD on stream-0 was already wall-time gating).
New top wall-time bottleneck: V4 Triton CSA BWD (4.03 s, 46.8 %)
+ V4 Triton dense BWD (3.18 s, 36.8 %) = 92.6 % of step. P30 / P31
mandate confirmed unchanged.
De-scope decisions recorded at P29 close:
* Hand-Triton fall-back kernel — NOT NEEDED (X1 over-shot ~1000×).
* Global default flip — DEFERRED to P32 hand-off (G35) because the
+2 % wall-time gain does not justify the cold-compile footgun for
short-iter unit-test harnesses; proxy default is enough for plan-5
P30 / P31 perf work.
Also lands `develop/rules/rule.md` — project-wide working rules doc
codifying the standing decisions accumulated across plan-2..plan-5
(review-before-commit, status-pin commit pattern, per-phase summary
file convention introduced at this phase, banned-warning ratchet,
dispatch precedence, DeepEP best-practice config, dtype contract,
TFLOPs counting rule, 10 % de-scope rule, etc.). README + status.md
+ plan-5/01-roadmap.md now point at it as the single source of
truth.
Co-authored-by: Cursor <cursoragent@cursor.com>
Pin the Plan-5 P29 tracker rows, post-P29 profile provenance, and `progress/p29/p29-summary.md` commit chain to the feature commit `1ea7e7a8`. This is the standard docs-only status-pin commit that follows every DeepSeek-V4 phase feature commit. Co-authored-by: Cursor <cursoragent@cursor.com>
…tiles Optimize the in-tree V4 Triton attention path by routing dense and HCA layers through kernel-native SWA pruning, including an HCA split-mask mode that preserves the joint softmax while avoiding dead local-key tiles. Co-authored-by: Cursor <cursoragent@cursor.com>
…nd sparse pool kernels Co-authored-by: Cursor <cursoragent@cursor.com>
…D kernels
P32 closes the residual single-kernel attention bottlenecks pinned by the
post-P31b microbenchmark using `progress/p31/bench_csa_attention_ep8.py`
and a new `progress/p32/bench_v4_attention_ep8.py` (dense `cr=0` + HCA
`cr=128` modes; mirrors the CSA bench argparse + timing).
CSA FWD: 48.17 ms -> 3.16 ms (-93.4 %, 15.2x; target <=6 ms MET).
Replace the monolithic `_v4_csa_attention_pool_fwd_kernel` with three
kernels joined by an online-softmax LSE merge so the local SWA and
sparse top-K branches no longer serialise through a single program:
reuse the P30-pruned dense FWD for local, add
`_v4_csa_attention_pool_sparse_fwd_kernel` for head-block sparse, and
add `_v4_csa_attention_lse_merge_kernel` to combine the two.
`PRIMUS_V4_CSA_FWD_FORCE_MONOLITHIC=1` keeps the legacy kernel.
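The merge step relies on the standard online-softmax identity. A hedged
eager-PyTorch sketch of the math (tensor layouts are illustrative, not the
Triton kernel's signature):

```python
import torch

def lse_merge(o_local, lse_local, o_sparse, lse_sparse):
    # Each branch computes a softmax-averaged output over its own key set
    # plus that set's log-sum-exp. The joint softmax over the union of the
    # two disjoint key sets is a convex combination of the branch outputs,
    # weighted by each branch's share of the joint normalizer.
    lse = torch.logaddexp(lse_local, lse_sparse)        # joint log-normalizer
    w_local = torch.exp(lse_local - lse).unsqueeze(-1)  # branch-A weight
    w_sparse = torch.exp(lse_sparse - lse).unsqueeze(-1)
    return w_local * o_local + w_sparse * o_sparse, lse
```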
V4 attention BWD: dense 17.27 ms -> 7.65 ms (-55.7 %, 2.26x; target
<=15 ms MET); HCA 20.87 ms -> 11.91 ms (-42.9 %, 1.75x; target
<=15 ms MET). Split `_v4_attention_bwd_kernel` into
`_v4_attention_bwd_dq_kernel` (parallel over `m`) and
`_v4_attention_bwd_dkv_kernel` (parallel over `n`) so dQ, dK, dV are
written atomic-free. MHA fast path drops the kvgroup head loop when
`HEAD_K == HEAD_Q`. `PRIMUS_V4_ATTN_BWD_FORCE_MONOLITHIC=1` keeps the
legacy kernel.
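Why the split removes the atomics, in eager-math terms (a sketch of the
parallelization argument, not the kernel code): with scores S = Q Kᵀ, every
row of dQ depends only on a row-slice of dS, and every row of dK/dV only on
a column-slice, so each kernel owns disjoint output rows:

```python
import torch

Q, K = torch.randn(128, 64), torch.randn(256, 64)
dS = torch.randn(128, 256)   # upstream gradient of S = Q @ K.T

dQ = dS @ K      # parallel over m: output row m reads only dS[m, :]
dK = dS.T @ Q    # parallel over n: output row n reads only dS[:, n]
# One program per output row-block in each kernel means a single writer per
# location, so no atomic_add is needed (unlike a fused kernel that must
# accumulate dK/dV while sweeping over m).
```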
CSA BWD: 35.43 ms -> 16.31 ms (-54.0 %, 2.17x; target <=15 ms missed by
~1.3 ms). Local SWA reuses the new split dq + dkv kernels with CSA's
joint `lse / D`. Sparse pool branch defaults to a two-pass segmented
reduction (`PRIMUS_V4_CSA_BWD_SEGREDUCE=1`): a new
`_v4_csa_attention_pool_sparse_bwd_partial_kernel` writes per-visit
dpool contributions to a compact `[B, M, K_topk, D]` partial buffer
with `tl.store` (no atomics), then a new
`_v4_csa_attention_pool_segreduce_kernel` folds them into `dpool_fp32`
segment-by-segment via a sorted inverse index, also atomic-free. Sweep
+ ship `BLOCK_K_PARTIAL=16`, `partial warps=8`, `partial stages=2`,
`segreduce BLOCK_D=512`, `BLOCK_I=64`, `warps=8`, `stages=3`. Fallback
retains the legacy gather + dpool atomics path (sparse `BLOCK_K=32`,
`num_warps=4` defaults after a sweep).
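A hedged eager sketch of the two-pass idea (shapes and names illustrative;
the real kernels are Triton): pass one stores per-visit contributions
densely, pass two folds them into dpool after sorting the flat pool indices
so every segment is contiguous and has a single writer:

```python
import torch

B, M, K_topk, D, P = 1, 4096, 8, 64, 1024
partial = torch.randn(B * M * K_topk, D)        # per-visit dpool pieces
pool_idx = torch.randint(P, (B * M * K_topk,))  # pool slot each visit touched

order = torch.argsort(pool_idx)                 # sorted inverse index
sorted_idx, sorted_partial = pool_idx[order], partial[order]

dpool = torch.zeros(P, D)
# Eager stand-in for the segment-by-segment fold: after sorting, each pool
# slot's visits are contiguous, which is what lets the Triton kernel reduce
# every segment with plain tl.store (no atomics).
dpool.index_add_(0, sorted_idx, sorted_partial)
```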
Tests: `pytest -x -q tests/.../deepseek_v4/{test_v4_p25_v4_attention_bwd
,test_v4_p26_v4_csa_attention_bwd,test_v4_p31_v4_csa_in_kernel_gather}.py`
-> 51 passed, 48 skipped. Pre-existing unrelated `test_v4_mtp`
failure verified on `git stash` baseline.
Docs: `progress/status.md` P32 row checked through the kernel work
(EP8 trace + report + `proxy_ep8.md` left for follow-up);
`develop/perf/attention_perf.md` P32 row added with effective TFLOP/s
re-derived from the microbench wall times; full eight-section summary
in `progress/p32/p32-summary.md` per rule R2.1, including the negative
probes (dense-mask scatter, bf16 partial buffer, fused dpool matmuls,
multi-stream overlap).
Bench shape: `B=1, H=64, S=4096, D=512, P=1024, K_topk=512,
swa_window=128, bf16, sink=on` on `mi355-gpu-8` / `dev_primus_wenx_693`,
median of 60 iters after 20 warmup.
Co-authored-by: Cursor <cursoragent@cursor.com>
…sive BWD optimizations behind opt-in env vars
After landing the P32 split CSA FWD + atomic-free V4/CSA BWD kernels in
the prior commit, the EP8 proxy trace surfaced an HBM-contention story
that does not show up in standalone microbenchmarks:
- CSA FWD split (local SWA + sparse pool + LSE merge) wins both: bench
48.17 -> 3.22 ms (-93.3%, 15.0x) AND proxy iter 10 963 -> 891 ms
(-7.5%) / 711 -> 768 TFLOP/s/GPU (+8.1%). KEEP DEFAULT ON.
- V4 attention BWD split (dQ kernel + dK/dV kernel, atomic-free) wins
the bench (dense 17.27 -> 7.65 ms, HCA 20.87 -> 11.91 ms; both clear
<=15 ms target) but regresses EP8 proxy iter time by ~190 ms because
the split design reads Q / K / V twice (2x HBM traffic per BWD step)
and loses the bandwidth fight against concurrent MoE work. FLIP
DEFAULT TO MONOLITHIC; opt in via PRIMUS_V4_ATTN_BWD_USE_SPLIT=1.
- CSA BWD segmented reduction (4 GiB partial buffer + sorted inverse
index, atomic-free dpool) wins the bench (35.43 -> 16.31 ms, -54%)
but regresses EP8 proxy iter time by ~40 ms for the same HBM-
contention reason. FLIP DEFAULT TO gather + atomic_add dpool (the
P31b path, now ~7.9% faster at 32.62 ms vs P31b 35.43 ms thanks to
incidental Triton autotuner improvements). Opt in via
PRIMUS_V4_CSA_BWD_SEGREDUCE=1.
EP8 proxy trace 1778476971738245137 (mi355-gpu-8, dev_primus_wenx_693):
iter 10 963.0 -> 890.5 ms (-7.5%)
TFLOP/s/GPU 709.3 -> 768.4 (+8.1%)
profiler steady 980.9 -> 899.99 ms (-8.2%)
GPU active 940.04 -> 859.54 ms (-8.6%)
CSA FWD trace 123.07 -> ~50.6 ms (-59%)
V4 attn BWD trace 259.74 -> 256.97 ms (-1.1%, monolithic kept)
CSA sparse BWD 80.81 -> 72.54 ms (-10.2%)
Attention family ~493 -> ~410 ms (-16.8%)
Tests: 114 passed, 88 skipped across test_v4_p25/p26/p27/p31 attention
suites on the shipped defaults.
Docs:
- profile/profile-after-p32-ep8-20260511.{md,html}: full P32 trace report.
- progress/p32/p32-summary.md: rewritten with shipped + opt-in numbers,
proxy attribution, and HBM-contention rationale for the opt-in gates.
- perf/proxy_ep8.md: P32 row added (890.5 ms / 768.4 TFLOP/s/GPU, 9.92x
vs P28 baseline 8837 ms).
- perf/attention_perf.md: P32 (shipped) and P32 (bench-opt opt-in) rows.
- progress/status.md: P32 rows checked through trace + report; opt-in
rationale recorded for the BWD rows.
- progress/p32/_render_html.py: helper that renders the markdown profile
report to HTML using the same style as the P28..P31b reports.
- progress/p32/run_baseline_trace_ep8_p32.sh: trace script (iter 6->7
profiler window, same harness as P31b).
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
This PR brings DeepSeek-V4 training support into Primus on the Megatron backend.
It now spans the full bring-up arc (P0 – P10) and the plan-2 lockdown (P12) that closes out plan-0 / plan-1 with an architecture-faithful rewrite plan for the remaining work (P13 – P21).
Plan timeline
- `develop/plan-0/` — the original bring-up plan (smokes at PP=2 EP=4).
- `develop/plan-1/`
- `develop/plan-2/`
- Plan-2 reshuffle — 2026-05-01 (commit `f548d8b2`, docs-only). Pre-training is the release path; HF-weight loading is not required for the release. Plan-2 phase shape after this reshuffle:
  - P17 becomes the cleanup phase: `_RMSNorm` duplicates / `dual_rope.py` / `csa_attention.py` / `hca_attention.py` / legacy `DeepseekV4MTPBlock` / EP `all_reduce` fallback gate / `_v4_token_ids` residue / yaml comment fixes. New gate G14 (static dead-code audit). (`_v4_token_ids` removal moved to P17.)
  - `develop_deepseek-v4-in-primus.md` only.
  - P22+: tracked in `02-target-architecture.md` §7 + `03-phase-details.md` (P22+ section). G8 / G9 deferred from P17; the HF-numerical-alignment portion of G12 is also deferred here.

Why plan-2
A code review of `dev/wenx/deepseek-v4` against real DeepSeek-V4 (HF reference, NeMo port, official inference) and Megatron's `spec + config + provider + submodule + build_module` pattern surfaced 28 findings (10 CRIT / 11 HIGH / 6 MED / 5 LOW). Highlights:
- `linear_k_proj` / `linear_v_proj`; real V4 has a single-latent `wkv` (K = V = kv).
- `q_norm` / `kv_norm` per-head RMSNorms are missing.
- `HashRouter` outputs uniform `1/topk` weights with no learnable gate.
- `clamped_swiglu` clamps post-mul; real V4 clamps pre-mul on `silu(gate)` and `up`.
- `V4-Flash` / `V4-Pro` HF safetensors cannot be loaded.
- `DeepseekV4Attention` / `DeepseekV4TransformerBlock` / `DeepseekV4HybridLayer` / `DeepseekV4MoE` reinvent rather than subclass `MLASelfAttention` / `TransformerBlock` / `TransformerLayer` / `MoELayer`.

Plan-2 (`develop/plan-2/`) is the architecture-faithful rewrite. Full review in `develop/plan-2/00-review-findings.md`; rewrite map in `02-target-architecture.md`; phase-by-phase plan in `03-phase-details.md`; gates in `04-test-strategy.md`.

Commit map
- `e194e039`, `d3383c02`: earliest bring-up commits.
- `8ae10000`: `model_type=deepseek_v4` dispatch.
- `a5d2a561`, `3b7ad8c8`, `5e4008dc`: plan-0 / plan-1 bring-up.
- `97b9720d`: P6 / P7.
- `df273a45`: P8 v2.
- `e5fec968`: P9 v2.
- `b38e83cf`: P10.
- `752b7534`: P10 runtime stabilization + report.
- `636ab3de`: P12 plan-2 lockdown.
- `cad0fb38`: `MLASelfAttention` (faithful dense path).
- `aa9929a0`: P13 close (compressor / indexer fold + TP-shard projections).
- `1a8bf32e`: P14 phase-1 (pre-mul clamped SwiGLU + V4 routers).
- `5fe8bc3c`: `DeepseekV4MoE` -> `MegatronModule` + CPU local-experts path; `v4_grouped_mlp_spec` / `v4_router_spec` providers; G5 (1L MoE forward <= 1e-3 vs HF reference).
- `25ccdb5e`: `DeepseekV4HybridLayer` -> `TransformerLayer`; `DeepseekV4TransformerBlock` -> `TransformerBlock`; HC x PP K-stream packing helpers; `HyperHead` only on post_process; `token_ids` forward kwarg replaces the `decoder._v4_token_ids` stash; 16 unit tests.
- `6c5875d4`: `MultiTokenPredictionBlock` + `process_mtp_loss`; `get_v4_mtp_block_spec` helper; layer forward returns `(hidden_states, None)` for MTP-call compatibility; legacy `DeepseekV4MTPBlock` deprecated; 17 unit tests.
- `f548d8b2`: plan-2 reshuffle (docs-only).
- `e591b893`: P17. Retires legacy `DeepseekV4MTPBlock` + `v4_use_custom_mtp_block` / `mtp_compress_ratios` config fields; introduces a shared `LocalRMSNorm` helper and dedups three `_RMSNorm` shadows (block.py / attention.py / compressor.py); fixes the inverted yaml comment (4 = CSA / 128 = HCA); refreshes the package `__init__` surface; adds `tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py` (G14 audit). `dual_rope.py` is intentionally kept — load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent.
- `b5832672`: P18. `build_context.resolve_v4_provider(config)` caches the V4 provider on the config object (replaces three direct `DeepSeekV4SpecProvider(...)` call sites); new `provider.v4_mlp_activation_func()` returns `None` when `use_te_activation_func=False` (V4 default — clamped-SwiGLU eager path) and `TEActivationOp` otherwise; `compress_ratios` normalized to `tuple[int, ...]` in `__post_init__` (so runtime never re-runs `ast.literal_eval`); new `tests/unit_tests/configs/test_deepseek_v4_yaml.py` (G1 schema gate) + `tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py` (D1 / D2 / package-surface AST audits).
- `83c33ad0`: P19. `megatron.deepseek_v4.pp_tensor_shape` (wraps both `schedules.get_tensor_shapes` for 1F1B and `forward_backward_pipelining_with_interleaving` for VPP, multiplies the seq dim by `hc_mult` so the PP wire carries V4's mHC `[S*K, B, D]` packing) and `megatron.deepseek_v4.pp_token_pre_broadcast` (pre-broadcasts all microbatch / chunk `input_ids` from PP rank 0 across the PP group upfront in a wrapper around `get_forward_backward_func`, so middle PP stages owning hash-routed MoE layers see real token IDs without deadlocking the interleaved-1F1B / VPP schedule). Drops the in-forward PP broadcast + VPP fail-fast assert from `DeepseekV4Model`, and stops pre-assigning `self.mtp = None` so Megatron's `set_current_microbatch` only iterates `model.mtp.layers` when MTP is live (matches upstream `GPTModel`).
- `dba27163`: plan-2 close-out. Marks the `c10d::allreduce_` autograd warning as gone (verified absent in P19 smokes A/B/C/D + EP=8 / PP=2 EP=4 profile runs on `mi355-gpu-12`); marks G11 (routing-snapshot diff = 0 across PP / EP changes) as deferred (snapshot dump tooling never landed; not on the pre-training release path); drops Phase 20 / 21 / 22+ sections from `status.md` (kept as documented intent in `plan-2/03-phase-details.md`); adds `deepseek-v4/develop/progress/plan-2-summary.md` (stand-alone summary of the architecture-faithful rewrite from P12 → P19, including a per-phase outcome table, a P19 deep-dive, the test-gate ledger, the plan-1 → plan-2 architectural-shift table, and pointers to logs / profile traces); adds P19 profile launchers (`run_profile_ep8.sh` for TP=1 PP=1 EP=8 and `run_profile_pp2_ep4.sh` for TP=1 PP=2 EP=4) plus `deepseek-v4/download_ref.sh` (idempotent helper that ensures git-lfs and clones the V4 reference assets — HF transformers, ROCm/TransformerEngine, AMD-AGI/Primus-Turbo, NVIDIA-NeMo/Automodel, and the four DeepSeek-V4 model repos — at pinned commits with `GIT_LFS_SKIP_SMUDGE=1` so weights are not downloaded by default).

What landed in 97b9720d (P6/P7)
P6 integration
- `deepseek_v4_builders.py`: `model_provider` with the upstream Megatron signature (`config`, `pg_collection`).
- `deepseek_v4_block.py`: `get_num_layers_to_build` + `get_transformer_layer_offset`; `set_input_tensor` support for non-first PP stages; parses `compress_ratios` more robustly; `make_viewless_tensor` for PP schedule compatibility.
- `v4_moe.py`
- `deepseek_v4_model.py`: `v4_use_custom_mtp_block`; defaults to the native GPTModel MTP path for stable bring-up.
- `dual_rope.py`, `deepseek_v4_attention.py`, `attn_sink.py`: `DualRoPE.apply` -> `apply_rope` (avoids the `nn.Module.apply` conflict).

P7 bring-up
- `run_deepseek_v4.sh` (based on `run_qwen.bak.sh`) with fixed knobs: `MBS=1`, `GBS=16`, `TP=1`, `PP=2`, `EP=4` (`num_layers=8`, `num_experts=8`, `mtp_num_layers=0`), on `uswslocpm2m-106-2371` / `dev_primus_wenx_691`.
- `TRAIN_ITERS=3 ./run_deepseek_v4.sh` reaches iteration 3/3, torchrun exit code 0.

What landed in df273a45 (P8 v2)
- `deepseek_v4_model.py`: `DeepseekV4Model` now inherits from `LanguageModule` (no longer `GPTModel`); the super-`__init__` `transformer_layer_spec` path is gone; the model consumes `transformer_layer_spec` directly.
- `deepseek_v4_layer_specs.py`
- `deepseek_v4_builders.py`
- `deepseek-v4/develop/progress/status.md`

Runtime verification (`uswslocpm2m-106-2371`, container `dev_primus_wenx_691`):
- `DeepseekV4Model` (LanguageModule-based) with the runtime spec tree.
- Output tensor shape `(128, 2, 256)`.

What landed in e5fec968 (P9 v2)
- `core/extensions/transformer_engine_spec_provider.py`: `DeepSeekV4SpecProvider(PrimusTurboSpecProvider)` as the V4 provider entry point; selects the backend (`local` / `te` / `turbo`) and exposes V4-specific provider helpers for norm / grouped-MLP selection.
- `deepseek_v4_layer_specs.py`: `ModuleSpec` construction.
- `deepseek_v4_attention.py`: submodules + `build_module` via `DeepseekV4AttentionSubmodules` (`q_a`, `q_b`, `k_proj`, `v_proj`, `o_proj`) with local fallback.
- `deepseek_v4_block.py`: `hidden_states`.
- `v4_moe.py`
- `deepseek-v4/develop/plan-1/03-phase9-provider-ab-report.md`.
- `deepseek-v4/develop/progress/status.md` with completed Phase 9 (v2) items and English-only notes.

Runtime verification (`uswslocpm2m-106-2371`, container `dev_primus_wenx_691`):
- local path (plain `Linear` projections).
- TE path (`TELinear` projections).
- `decoder.cuda()` + CUDA inputs.

What landed in b38e83cf (P10)
- `core/transformer/moe/v4_moe.py` (`ClampedSwiGLUMLP` fallback for shared experts).
- `core/models/deepseek_v4/deepseek_v4_transformer_config.py`: `DeepSeekV4TransformerConfig` (inherits `MLATransformerConfig`) with DeepSeek-V4-specific fields used by V4 runtime modules; `__post_init__` (`norm_epsilon`, `moe_intermediate_size`, clamp sync, vocab / padded-vocab sync).
- `deepseek_v4_builders.py`: `core_transformer_config_from_args(..., config_class=DeepSeekV4TransformerConfig)`.
- `DeepSeekV4TransformerConfig.activation_func_clamp_value` added to `primus/configs/models/megatron/deepseek_v4_base.yaml` with a clamped-SwiGLU comment.
- `deepseek-v4/develop/plan-1/*` and `deepseek-v4/develop/progress/status.md` for Phase 10 implementation notes.

Validation in this commit:

What landed in 752b7534 (P10 runtime stabilization + report)
- `run_deepseek_v4.sh`: `seq_length` / `max_position_embeddings=128`, `index_topk=8`; `v4_grouped_experts_support_clamped_swiglu=True` for grouped-expert clamped-SwiGLU runtime guard compliance; `overlap_grad_reduce` and `overlap_param_gather` off in smoke mode to avoid a DDP bucket-reset assertion between iterations.
- `primus/backends/megatron/core/transformer/hyper_connection.py`: casts the `F.linear` weight dtype to the activation dtype in `HyperMixer` and `HyperHead` to fix a BF16 runtime mismatch.
- `primus/backends/megatron/core/transformer/deepseek_v4_attention.py`
- `deepseek-v4/develop/plan-1/04-phase10-moe-distributed-convergence-report.md`

Runtime verification in this update: on `uswslocpm2m-106-2371` / `dev_primus_wenx_691`, `./run_deepseek_v4.sh` reaches iteration 10/10, and torchrun finished successfully (code 0).

What landed in 636ab3de (P12 — plan-2 lockdown)
Documentation-only commit; no runtime code changes.

Architecture review
Reviewed `e194e039..HEAD` against:
- `deepseek-v4/deepseek-ai/DeepSeek-V4-Flash/{config.json, inference/model.py}`
- the HF port (`deepseek-v4/transformers/.../deepseek_v4/`)
- the NeMo port (`deepseek-v4/NVIDIA-NeMo/Automodel/...`)

Plan-2 documents (active plan of record)
- `deepseek-v4/develop/plan-2/README.md`
- `deepseek-v4/develop/plan-2/00-review-findings.md` — full severity-ranked findings ledger
- `deepseek-v4/develop/plan-2/01-roadmap.md` — phases P12 → P21, dependency graph, milestones, top risks
- `deepseek-v4/develop/plan-2/02-target-architecture.md` — module-by-module rewrite map (rebases on `MLASelfAttention`, `TransformerLayer`, `TransformerBlock`, `MoELayer`, `MultiTokenPredictionBlock`, `(Yarn)RotaryEmbedding`)
- `deepseek-v4/develop/plan-2/03-phase-details.md` — granular tasks / exit criteria / risks per phase
- `deepseek-v4/develop/plan-2/04-test-strategy.md` — L0..L3 test pyramid and release gates G1..G14 (G8 / G9 marked deferred → P22+ since the 2026-05-01 reshuffle)

Plan-1 phases 9 / 10 / 11 are paused — their tracking rows in `status.md` remain for history.

Tech blog closure
- `deepseek-v4/develop/techblog/02-plan-1-as-built-and-plan-2-pointer.md`: closes plan-0 / plan-1 with an as-built note (what shipped, what fell short) and points readers at plan-2.
- `deepseek-v4/develop/techblog/README.md` gains a banner declaring plan-2 the active plan of record.

Layout cleanup + visuals
- `develop/plan/` → `develop/plan-0/` (the original bring-up plan; tracked as a rename).
- `develop/progress/timeline.html`: standard system-fonts version of the project timeline; daily-column Gantt with a May 02 – 05 Holiday band; the remaining nine phases (P13 – P21) packed into the May 06 – 09 working window.
- `develop/progress/build_roadmap_pptx.py` (generator) + `develop/progress/deepseek_v4_roadmap_v1.pptx` (13-slide tech-style deck on a black background, 16:9). Slide 7 — "07 · Development Plan · DEVELOPMENT SCHEDULE" — is the day-by-day plan with a 3-row layout (date chip / P0~P7-style phase chip / work-content card) plus a directional arrow with the holiday-gap marker.

Status tracker
`develop/progress/status.md` now has explicit Phase 12 → Phase 21 (v3) sections.

Schedule

What landed in cad0fb38 + aa9929a0 (P13 — faithful attention)
Plan-2 P13 lands in two commits inside the May 06 budget. Both are scoped strictly to the dense / CSA / HCA attention path; faithful MoE / router / MTP are tracked in P14 / P15 / P16. (The HF state-dict adapter — originally planned for P17 — has since been deferred to P22+ by the 2026-05-01 reshuffle; pre-training does not need it.)

cad0fb38 — V4-faithful attention rooted on MLASelfAttention (dense path)
Rewrites the dense (`compress_ratio == 0`) path of DeepSeek-V4 attention to be faithful to the released `DeepSeek-V4-Flash` checkpoint and rooted on Megatron's `MLASelfAttention`.
- `primus/backends/megatron/core/transformer/deepseek_v4_attention.py`: `DeepseekV4Attention(MLASelfAttention)` subclasses MLA for type identity but bypasses the parent `__init__` chain because V4's KV layout differs from MLA's compressed-KV form.
- A single `linear_kv` projection (hidden -> head_dim) feeds both K and V, broadcast across all query heads.
- `q_rms`: parameter-less RMS on `head_dim` after `linear_q_up_proj` and before partial RoPE (no `q_rms.weight` in the released checkpoint).
- `linear_o_a` per group + `linear_o_b` when `o_lora_rank > 0`; falls back to the MLA-style flat `linear_proj` when `o_lora_rank == 0`.
- `attn_sink`: a direct `nn.Parameter` on the attention (matches the released key `layers.{i}.attn.attn_sink` exactly), with inline softmax-with-sink in `_attention_forward`.
- `DeepseekV4AttentionSubmodules` dataclass with MLA-canonical names (`linear_q_down_proj`, `linear_q_up_proj`, `q_layernorm`, `kv_layernorm`) plus V4 extras (`linear_kv`, `linear_o_a`, `linear_o_b`, `attn_sink`).
- `_LegacyDeepseekV4Attention` retained temporarily as the parent for `CSAAttention` / `HCAAttention` until the P13 follow-up commit folds the compressor / indexer into the new class.
- `primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py`: `v4_q_layernorm()`, `v4_kv_layernorm()`, `v4_attention_sink()` factory methods.
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py`: routes `compress_ratio == 0` to the new class with V4-canonical submodules; the legacy path is retained for `{4, 128}`.
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_transformer_config.py`: `o_groups: int = 8` and `o_lora_rank: int = 0`.
- `tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py`: `q_rms` is parameter-less; `o_lora_rank == 0` fallback path; rejection paths.

aa9929a0 — Fold Compressor / Indexer + TP-shard projections; retire CSA / HCA legacy
Closes P13 by folding the compressed-branch attention into the V4-faithful class as spec submodules, switching the TP-sensitive projections to ColumnParallel / RowParallel, and retiring the plan-1 legacy attention classes.
- `primus/backends/megatron/core/transformer/deepseek_v4_attention.py`: `DeepseekV4Attention.__init__` accepts `compress_ratio in {0, 4, 128}`. When `compress_ratio > 0` it builds `self.compressor` from `submodules.compressor`; when `compress_ratio == 4` it also builds `self.indexer` from `submodules.indexer`. `DeepseekV4AttentionSubmodules` is extended with `compressor` and `indexer` fields.
- `DeepseekV4Attention.forward` now dispatches on `self.compress_ratio`:
  - `0` — dense / SWA over local KV.
  - `128` — HCA: compressed pool with compress-base partial RoPE on indices `[0..P)`, broadcast to `H` heads, concatenated to local KV with a compressed-causal mask, joint softmax-with-sink shared across the local + compressed branches.
  - `4` — CSA: per-query top-K from the compressed pool via Indexer + overlap-mode Compressor, joint softmax-with-sink across local + sparse keys.
- `_LegacyDeepseekV4Attention` and `_LegacyDeepseekV4AttentionSubmodules` removed; `primus/backends/megatron/core/transformer/{csa,hca}_attention.py` deleted.
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py`: `_build_v4_attention_submodules` now also builds `compressor` / `indexer` `ModuleSpec`s for the compressed branches.
- `linear_q_up_proj` switched to `provider.column_parallel_linear()` (`gather_output=True`); `linear_o_b` (grouped) and `linear_proj` (flat-O fallback) switched to `provider.row_parallel_linear()` (`input_is_parallel=False`). At `tp > 1` the projection weights are sharded across TP ranks; at `tp = 1` the result is bit-identical to the previous duplicated path. `linear_q_down_proj`, `linear_kv`, `linear_o_a` stay duplicated; the full grouped-O TP plan is tracked in P14.
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py`: `_build_attention` (no-spec fallback) now constructs `DeepseekV4Attention` for all branches; the new class builds its own Compressor / Indexer locally when no spec is provided.
- `tests/unit_tests/megatron/transformer/deepseek_v4/test_deepseek_v4_attention.py`: `torchrun --nproc_per_node=2` parity scaffold (skipped if single-rank).
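All three branches share the inline softmax-with-sink. The math reduces to
appending one learnable logit per head to each score row and dropping its
column after normalization; a minimal sketch, assuming `[B, H, Sq, Sk]`
scores and a per-head sink (not the module's actual code):

```python
import torch

def softmax_with_sink(scores: torch.Tensor, attn_sink: torch.Tensor) -> torch.Tensor:
    # scores: [B, H, Sq, Sk]; attn_sink: [H], one learnable sink logit per head.
    B, H, Sq, _ = scores.shape
    sink = attn_sink.view(1, H, 1, 1).expand(B, H, Sq, 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[..., :-1]  # drop the sink column; each row now sums to < 1
```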
Status
`deepseek-v4/develop/progress/status.md`: P13 fully checked off (including the items previously deferred to the follow-up commit). Items routed to P14 (full grouped-O TP plan) / P22+ — deferred (HF-reference numerical alignment via the state-dict adapter, originally P17) / P19 (full TP=2 sharding-parity bit-equality check) are noted as such on each row.

Schedule
`cad0fb38` (early start; the May 02 – 05 holiday remains). `aa9929a0` is recorded under May 06 in the daily plan.

What landed in 1a8bf32e (P14 phase-1 — faithful pre-mul clamped SwiGLU + V4 routers)
P14 ships in two commits. This one lands the math + parameter-layout faithfulness so V4-Flash checkpoints will load through the future state-dict adapter (originally P17, now deferred to P22+ by the 2026-05-01 reshuffle) without remapping. The structural refactor (`DeepseekV4MoE(MoELayer)` subclassing, provider helpers, G5 1L MoE forward) is the P14 phase-2 follow-up.

Activation (G3)
- `primus/backends/megatron/core/transformer/clamped_swiglu.py`: `SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha)`. New helpers `clamped_swiglu_pre_mul(gate, up, alpha)` (split inputs) and `clamped_swiglu_pre_mul_fused(x, alpha)` (`[gate | up]` last-dim concat for grouped-gemm experts); see the sketch after this list.
- `ClampedSwiGLUMLP` now uses separate `w1` / `w2` / `w3` Linears so the released checkpoint (`Expert(w1, w2, w3, swiglu_limit)`) loads without remapping. Optional `fused_gate_up=True` fuses the gate / up GEMMs at forward time only; the saved / loaded `state_dict` keys remain `w1.weight` / `w2.weight` / `w3.weight`.
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py`: `_DenseSwiGLUMLP` now applies the same pre-mul clamp on its dense head / tail layers; previously it computed vanilla `SiLU(gate) * up` and ignored `swiglu_limit`.
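The pre-mul math is small enough to state directly; a sketch matching the
helper contracts above (the function bodies are my reading of the
description, not copied source):

```python
import torch
import torch.nn.functional as F

def clamped_swiglu_pre_mul(gate, up, alpha):
    # One-sided clamp on gate (max only), symmetric clamp on up, both
    # applied BEFORE the multiply -- the V4-faithful order.
    return F.silu(gate.clamp(max=alpha)) * up.clamp(min=-alpha, max=alpha)

def clamped_swiglu_pre_mul_fused(x, alpha):
    # [gate | up] concatenated on the last dim (grouped-gemm expert layout).
    gate, up = x.chunk(2, dim=-1)
    return clamped_swiglu_pre_mul(gate, up, alpha)
```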
Learned router (G4)
- `primus/backends/megatron/core/transformer/moe/v4_topk_router.py`: `V4TopKRouter` → `DeepseekV4LearnedRouter` (back-compat alias retained).
- `weight` Parameter of shape `[num_experts, hidden_size]` — matches Megatron's `TopKRouter.weight` AND the HF reference `Gate.weight` exactly (no `gate.weight` indirection).
- `expert_bias` is selection-only: routing weights gather from the un-biased scores, so the probs gradient flows to `weight`, never to `expert_bias`.
- Renorm applies only when `score_function != "softmax"` (HF parity; softmax probs already sum to 1).
- `topk_scaling_factor` honors `moe_router_topk_scaling_factor` (HF `route_scale`).
- `v4_score_fn` covers `softmax`, `sigmoid`, `sqrtsoftplus`.

Hash router (G4)
- `primus/backends/megatron/core/transformer/moe/v4_hash_router.py`: `HashRouter` → `DeepseekV4HashRouter` (back-compat alias retained).
- `weight` Parameter of the same shape as the learned router's; previously the hash router emitted uniform `1/topk` weights, which broke gradient flow into the gate weights and silently differed from the released checkpoint.
- `tid2eid` is now a frozen `nn.Parameter` (`requires_grad=False`, `dtype=torch.int32`), matching the HF reference layout — the released checkpoint stores it as a parameter, so state-dict round-trips preserve it without polluting the optimizer state.
- `forward(hidden, token_ids)` gathers learned scores at the static expert ids prescribed by `tid2eid[token_ids]`; renorm + scale parity with the learned router (see the sketch below).
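A hedged sketch of the hash-router forward contract. The table shape and the
sigmoid scoring are illustrative assumptions; the real class supports the
three score functions listed above:

```python
import torch
import torch.nn as nn

class HashRouterSketch(nn.Module):
    def __init__(self, num_experts, hidden_size, vocab_size, topk):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_experts, hidden_size))
        # Frozen token-id -> expert-id table; a Parameter with
        # requires_grad=False so it round-trips through state_dict without
        # entering the optimizer. Shape [vocab, topk] is an assumption.
        self.tid2eid = nn.Parameter(
            torch.randint(num_experts, (vocab_size, topk), dtype=torch.int32),
            requires_grad=False,
        )

    def forward(self, hidden, token_ids):
        scores = torch.sigmoid(hidden @ self.weight.t())  # learned gate scores
        eids = self.tid2eid[token_ids].long()             # static expert ids
        probs = torch.gather(scores, -1, eids)            # learned scores at those ids
        probs = probs / probs.sum(dim=-1, keepdim=True)   # renorm over the top-k
        return probs, eids  # gradients flow to weight; tid2eid stays frozen
```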
MoE wiring
- `primus/backends/megatron/core/transformer/moe/v4_moe.py`: `_route` now passes `(hidden, token_ids)` to the hash router; both routers receive `hidden_size` / `score_function` / `topk_scaling_factor` at init.

Tests
- `tests/unit_tests/megatron/transformer/deepseek_v4/test_clamped_swiglu.py` — 7 tests cover the pre-mul activation vs the HF reference (≤ 1e-6 fp32, four `alpha` values), `alpha = 0` disables the clamp, fused-vs-split agreement, one-sided gate-clamp behavior, `w1` / `w2` / `w3` state-dict keys (no `gate_up.weight` leak), `fused_gate_up` forward equivalence, and end-to-end `ClampedSwiGLUMLP` vs HF `Expert.forward`.
- `tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_routers.py` — 13 tests:
  - learned router: parity vs the HF reference (with `expert_bias`) ≤ 1e-6; back-compat alias; gradient flows to the gate `weight`; `expert_bias` detached from the probs graph; softmax skips renorm.
  - hash router: `tid2eid` is a frozen `Parameter` (`requires_grad=False`, dtype int32); state-dict keys; deterministic table across seeds; OOB / shape-mismatch error paths; gradient flows to `weight` while `tid2eid.grad is None`.

Status
`deepseek-v4/develop/progress/status.md`: P14 phase-1 tasks checked off with this commit hash (`1a8bf32e`); deferred items listed for the phase-2 follow-up; the "HashRouter has no learnable gate weight / clamped SwiGLU clamps post-mul" blocker is marked resolved.

Schedule
`1a8bf32e` (continuing the early start; the May 02 – 05 holiday remains). P13 `aa9929a0` and P14 phase-1 `1a8bf32e` are recorded under May 01 / 06 in the daily plan.

What landed in 5fe8bc3c (P14 phase-2 — V4 MoE structural bring-up + G5)
Closes plan-2 P14 by bringing `DeepseekV4MoE` into Megatron's spec lifecycle, exposing a CPU-testable forward path so the MoE math is pinned against the released HF reference, and adding the V4 provider helpers that plan-2 §5 / §6 call for.

DeepseekV4MoE → MegatronModule
- `primus/backends/megatron/core/transformer/moe/v4_moe.py`: re-rooted from `nn.Module` to `MegatronModule` so it inherits the standard config plumbing and integrates with `TransformerLayer.mlp` via the spec lifecycle.
- `BaseMoELayer`-compatible public surface: `set_layer_number(layer_number)` mirrors `BaseMoELayer.set_layer_number`; `local_expert_indices` is exposed as a list attribute.

CPU local-experts path
- `primus/backends/megatron/core/transformer/moe/v4_moe.py`: when `pg_collection is None`, `__init__` skips the dispatcher / grouped-experts construction and instead builds:
  - `local_experts: nn.ModuleList[ClampedSwiGLUMLP]` — one `ClampedSwiGLUMLP` per local expert (mirrors the HF reference `Expert` exactly: separate `w1` / `w2` / `w3` Linears + the V4 pre-multiplication clamp).
  - `shared_expert: ClampedSwiGLUMLP` — a single shared expert with the same activation.
- `_local_experts_forward` runs a per-expert dispatch loop matching `DeepSeek-V4-Flash/inference/model.py:MoE.forward` exactly (for each routed expert: gather the routed tokens, multiply by the per-token routing weight, accumulate; see the sketch below). The production path (`pg_collection` provided) continues to use the Megatron dispatcher + grouped experts unchanged.
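The per-expert dispatch loop is roughly the following. A sketch of the
HF-style pattern described above; argument shapes and the always-present
shared expert are assumptions:

```python
import torch

def local_experts_forward(hidden, probs, expert_ids, local_experts, shared_expert):
    # hidden: [T, D]; probs / expert_ids: [T, topk]
    out = shared_expert(hidden)  # shared expert assumed present here
    for eid, expert in enumerate(local_experts):
        tok, slot = torch.where(expert_ids == eid)  # tokens routed to expert eid
        if tok.numel() == 0:
            continue
        # Weight each routed token's expert output by its routing prob
        # and accumulate into the shared-expert output.
        out[tok] = out[tok] + probs[tok, slot].unsqueeze(-1) * expert(hidden[tok])
    return out
```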
Provider helpers (plan-2 P14 §5 / §6)
- `primus/backends/megatron/core/extensions/transformer_engine_spec_provider.py`: `DeepSeekV4SpecProvider.v4_grouped_mlp_spec(swiglu_limit, moe_use_grouped_gemm=True, ...)` returns a ready-to-use `ModuleSpec(grouped_module, MLPSubmodules)` for the V4 MoE expert path. The pre-mul clamp itself is applied via `config.activation_func_clamp_value` — Megatron's eager `glu()` (mlp.py:312-321) already implements `SiLU(clamp(gate, max=alpha)) * clamp(up, +/- alpha)`, which is bit-equal to the HF reference math; the spec only commits to the right grouped module + the column / row-parallel linears.
- `DeepSeekV4SpecProvider.v4_router_spec(learned=True/False)` returns a bare `ModuleSpec` for either `DeepseekV4LearnedRouter` or `DeepseekV4HashRouter`.

G5 numerical alignment
`tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_moe.py` — 11 tests:
- structural: `MegatronModule`; the CPU path builds `local_experts` (`ClampedSwiGLUMLP`) + `shared_expert`; the `token_dispatcher` / `grouped_experts` attributes stay `None`; `set_layer_number` propagates.
- numerical vs the HF reference: (sqrtsoftplus, sigmoid, softmax) × (shared expert on / off) — ≤ 1e-3 fp32 CPU.
- hash routing with `token_ids` feeding `tid2eid` — ≤ 1e-3 fp32 CPU.
- `moe_router_topk_scaling_factor` (HF `route_scale`) propagates to the output.
- gradients land on `router.weight`, on the shared expert, and on at least one routed expert's `w1` / `w2` / `w3`.
- error path when `token_ids` is missing.

Status
`deepseek-v4/develop/progress/status.md` — P14 phase-2 tasks ticked with this commit; the structural row records the `MegatronModule`-via-CPU-path approach and explicitly defers the `TopKRouter`-rooted aux-loss / z-loss path to P19 alongside the distributed re-validation matrix (rationale: upstream `TopKRouter.__init__` registers CUDA buffers unconditionally, which is impractical for CPU-clean V4 routers; gating that on a device check is out of scope for this commit).

Schedule
`5fe8bc3c` (continuing the early start; the May 02 – 05 holiday remains). P13 `aa9929a0`, P14 phase-1 `1a8bf32e`, and P14 phase-2 `5fe8bc3c` are recorded under May 01 in the daily plan.

What landed in 25ccdb5e (P15 — V4 layer / block subclass refactor + token-ids forward kwarg + HC × PP packing)
Closes plan-2 P15 except the distributed PP-equivalence gate (G6), which is tracked into P19. This commit brings V4's layer / block onto Megatron's `TransformerLayer` / `TransformerBlock` parents, drops the `decoder._v4_token_ids` attribute stash in favor of a real forward kwarg, gates `HyperHead` to the `post_process` stage, and extracts the HC × PP K-stream packing helpers.

DeepseekV4HybridLayer → TransformerLayer
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py`: re-rooted from `GraphableMegatronModule` to `TransformerLayer`. `TransformerLayer.__init__` is bypassed (V4's submodule contract differs — no cross-attention, no BDA, a V4-specific attention signature); `MegatronModule.__init__` is called directly.
- `DeepseekV4HybridLayerSubmodules` now extends `TransformerLayerSubmodules` and uses upstream-canonical field names: `input_layernorm` / `self_attention` / `pre_mlp_layernorm` / `mlp`. The two V4-specific HC mixer hooks `attn_hc` / `ffn_hc` remain; both default to `None` for `hc_mult == 1`.
- The `forward` signature is now upstream-compatible: `(hidden_states, attention_mask=None, *, position_ids=None, token_ids=None, **kwargs)`. `attention_mask` is accepted and ignored (V4 manages the SWA / sink mask internally); `position_ids` is consumed from the caller (fallback to `arange(S)` for tiny smokes); `**kwargs` lets the layer plug into `MultiTokenPredictionLayer` (P16) without bespoke adapters.

DeepseekV4TransformerBlock → TransformerBlock
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_block.py`: re-rooted from `nn.Module` to `TransformerBlock` (init bypass via `MegatronModule` for CPU instantiability; V4 has its own layer-spec / lift-lower pipeline). Type identity unlocks Megatron `isinstance` checks + sharded-state-dict integration.
- `HyperHead` is built only on the `post_process` stage. Earlier PP stages forward the K-stream tensor via `_lower_streams_out` (no per-stage `HyperHead`), saving memory and removing a correctness-drift risk.

HC × PP K-stream packing helpers
`_lift_streams_in(hidden_states, pre_process, hc_mult)` / `_lower_streams_out(x, post_process, hc_mult)` extracted as module-level helpers in `deepseek_v4_block.py` (sketched below):
- `[S, B, D] -> [B, S, K, D]` (broadcast across K).
- `[S*K, B, D] -> [B, S, K, D]` (unfold packed K).
- `[B, S, D] -> [S, B, D]` (post-`HyperHead` transpose).
- `[B, S, K, D] -> [S*K, B, D]` (pack K into seq for PP P2P).
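The two PP-wire contracts compose into a bit-exact roundtrip. A sketch of
the pure-reshape math, assuming the K streams are packed into the seq dim in
`[S, K]` order (function names echo the helpers but are not their code):

```python
import torch

def lower_streams_out(x: torch.Tensor, hc_mult: int) -> torch.Tensor:
    # [B, S, K, D] -> [S*K, B, D]: pack K into seq for PP P2P.
    B, S, K, D = x.shape
    assert K == hc_mult
    return x.permute(1, 2, 0, 3).reshape(S * K, B, D)

def lift_streams_in(x: torch.Tensor, hc_mult: int) -> torch.Tensor:
    # [S*K, B, D] -> [B, S, K, D]: unfold packed K on a non-first PP stage.
    SK, B, D = x.shape
    S = SK // hc_mult
    return x.reshape(S, hc_mult, B, D).permute(2, 0, 1, 3)

x = torch.randn(2, 16, 4, 8)
assert torch.equal(lift_streams_in(lower_streams_out(x, 4), 4), x)  # roundtrip
```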
Token-ids forward kwarg
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_model.py`: `DeepseekV4Model.forward` no longer assigns `decoder._v4_token_ids` (and removes the `try/finally` cleanup). It now passes `token_ids=input_ids` and `position_ids=position_ids` directly to `self.decoder(...)`, which threads them down to `mlp.forward -> hash_router.forward`.
- An AST audit (`test_v4_block_pp.py::test_model_forward_does_not_set_decoder_v4_token_ids_attribute`) prevents the attribute stash from regressing.

Spec wiring + MTP block update
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_layer_specs.py` renames the four core fields when constructing `DeepseekV4HybridLayerSubmodules`: `attn_norm` → `input_layernorm`, `attention` → `self_attention`, `ffn_norm` → `pre_mlp_layernorm`, `ffn` → `mlp`.
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py` switches the per-MTP-layer call to `layer(stream, position_ids=..., token_ids=...)` (kwarg, not positional) to match the new layer forward signature.

Tests (`tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_block_pp.py`, 16 tests)
- `DeepseekV4HybridLayer` is a `TransformerLayer`; `DeepseekV4TransformerBlock` is a `TransformerBlock`; `DeepseekV4HybridLayerSubmodules` extends `TransformerLayerSubmodules` and exposes `attn_hc` / `ffn_hc`.
- Lift / lower roundtrips across (`pre_process` × `post_process`), for both single-stream (`hc_mult=1`) and multi-stream (`K=3`, `K=4`).
- `S*K` on a non-first stage; collapsed input on a non-final lower; uncollapsed input on the final lower.
- `decoder._v4_token_ids` is gone from the model source; the `token_ids=input_ids` kwarg is present.
- Block forward accepts `position_ids` + `token_ids` kwargs; layer forward accepts `(hidden_states, attention_mask=None, position_ids, token_ids)`.

Status / blockers
`deepseek-v4/develop/progress/status.md` — Phase 15 tasks ticked except G6 (PP=1 vs PP=2 vs PP=4 equivalence on a 4L toy), which requires distributed init and is tracked into P19 distributed re-validation. The CPU-only sub-gate — `_lift_streams_in` after `_lower_streams_out` is bit-exact — is covered by the lift / lower roundtrip tests, which is the math contract a real PP run depends on.
- Blocker "reinvents rather than subclasses `TransformerBlock` / `TransformerLayer` / `MoELayer`" — closed by P14 phase-2 + P15.
- Blocker "`decoder._v4_token_ids` attribute" — closed by P15.

Schedule
`25ccdb5e` (continuing the early start; the May 02 – 05 holiday remains). P14 phase-1 `1a8bf32e`, P14 phase-2 `5fe8bc3c`, and P15 `25ccdb5e` are recorded under May 01 in the daily plan.

What landed in 6c5875d4 (P16 — spec-based MTP via MultiTokenPredictionBlock + process_mtp_loss)
Closes plan-2 P16 except the distributed MTP-loss ablation gate (G7), which is tracked into P19 alongside G6. This commit wires V4 onto Megatron's upstream MTP pipeline so the auxiliary multi-token-prediction loss flows through `process_mtp_loss` (per-depth shifted logits + `MTPLossAutoScaler`) instead of the standalone primus-owned MTP block. The legacy `DeepseekV4MTPBlock` remains behind the `v4_use_custom_mtp_block` config flag for back-compat with research checkpoints (planned removal: P17 — moved up from P21 by the 2026-05-01 reshuffle) and now emits a `DeprecationWarning` on construction.

Spec helper (`primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp_specs.py`, new)
- `get_v4_mtp_block_spec(config, *, transformer_layer_spec, vp_stage)` returns `ModuleSpec(MultiTokenPredictionBlock, submodules=MultiTokenPredictionBlockSubmodules(layer_specs=[...] * mtp_num_layers))`.
- The `MultiTokenPredictionLayer` spec pulls `enorm` / `hnorm` / `layer_norm` from `DeepSeekV4SpecProvider.v4_norm_module()`, `eh_proj` from `provider.column_parallel_linear()`, and `mtp_model_layer` from the V4 hybrid-layer spec passed in by the model — so each MTP depth shares HC, hash routing, and clamped-SwiGLU with the main decoder exactly.
- Rejects `mtp_num_layers < 1` with a clear `ValueError`.

DeepseekV4Model updates (`deepseek_v4_model.py`)
- When `mtp_num_layers > 0` and not `v4_use_custom_mtp_block`, `__init__` builds `self.mtp = MultiTokenPredictionBlock(spec=get_v4_mtp_block_spec(...))` on stages where `mtp_on_this_rank()` is True.
- `mtp_on_this_rank` is wrapped in try/except so CPU smokes (no `parallel_state`) do not crash; `self.mtp_process` is False and `self.mtp` is None on those paths.
- The legacy `DeepseekV4MTPBlock` path stays available behind `v4_use_custom_mtp_block`; `self.mtp_block` is the legacy slot, `self.mtp` is the new spec-based slot. Both are None when MTP is disabled.
- `forward` now mirrors `GPTModel.forward`: it runs `self.mtp(...)` on stages with MTP layers (passing `input_ids` / `position_ids` / `hidden_states` / `attention_mask` / `embedding` / `packed_seq_params`), then on `post_process` with `mtp_num_layers > 0` calls `process_mtp_loss(...)`, which chunks the concatenated hidden states, computes the per-depth shifted MTP loss, and folds it into the gradient via `MTPLossAutoScaler`. Also threads `loss_mask` (forwarded to `process_mtp_loss`) and `packed_seq_params`.

Layer / block forward contract
- `DeepseekV4HybridLayer.forward` now returns `(hidden_states, None)` instead of just `hidden_states`. This matches upstream `TransformerLayer` (which returns `(hidden_states, context)`) and is required by `MultiTokenPredictionLayer._proj_and_transformer_layer`, which unpacks `hidden_states, _ = self.mtp_model_layer(...)`.
- `DeepseekV4TransformerBlock`'s per-layer iteration updates to `x, _ = layer(...)`; `DeepseekV4MTPBlock` likewise updates to unpack the tuple.

V4 attention spec advertises attn_mask_type
- `params={"compress_ratio": ..., "attn_mask_type": AttnMaskType.causal}`. `MultiTokenPredictionLayer.__init__` validates the inner layer's `self_attention.params['attn_mask_type']` against `{padding, causal, no_mask, padding_causal}`; without this the MTP block fails to construct. The value is functionally inert for V4 (which manages its own SWA / sink mask).
- `DeepseekV4Attention.__init__` accepts and ignores `attn_mask_type` plus a `**kwargs` catch-all so the spec lifecycle keeps working.

Legacy DeepseekV4MTPBlock (`deepseek_v4_mtp.py`)
- Emits a `DeprecationWarning` pointing users at `get_v4_mtp_block_spec`. The code path is otherwise unchanged.

Tests (`tests/.../test_v4_mtp.py`, ~17 tests)
- `get_v4_mtp_block_spec` structural assertions: the outer module is `MultiTokenPredictionBlock`; `layer_specs` length matches `mtp_num_layers` (parametrised 1/2/3); each per-depth spec is a `MultiTokenPredictionLayer`; the V4 inner layer is threaded through unchanged; norm + linear come from the V4 provider.
- Rejects `mtp_num_layers=0` with a clear `ValueError`.
- `DeepseekV4HybridLayerSubmodules` extends `TransformerLayerSubmodules`, so MTP picks up the GPT path (not Mamba) in its inner-layer-submodules `isinstance` check.
- `DeepseekV4HybridLayer.forward` returns `(hidden_states, None)` (source-level assertion on `return x, None`).
- The spec advertises `AttnMaskType.causal` (source-level assertion).
- Legacy `DeepseekV4MTPBlock` emits a `DeprecationWarning` on construction.
- `deepseek_v4_model.py`: `process_mtp_loss` is called; the upstream MTP machinery is imported; the spec helper is invoked; the `v4_use_custom_mtp_block` flag is preserved; the `mtp_num_layers > 0` guard keeps the no-MTP path inert.

Status / blockers
`deepseek-v4/develop/progress/status.md` — Phase 16 tasks ticked except G7 (MTP loss appears in the train log; `mtp_num_layers=0` vs `mtp_num_layers=1` ablation matches LM loss to 1e-6), which requires distributed init + `MultiTokenPredictionBlock` runtime (CP / SP plumbing); tracked into P19 distributed re-validation alongside G6. The `(hidden_states, None)` layer contract and the `attn_mask_type` declarations are both required by the upstream MTP wiring.

Schedule
`6c5875d4` (continuing the early start; the May 02 – 05 holiday remains). P14 phase-1 `1a8bf32e`, P14 phase-2 `5fe8bc3c`, P15 `25ccdb5e`, and P16 `6c5875d4` are recorded under May 01 in the daily plan.

What landed in e591b893 (P17 — code cleanup, gate G14)
P17 ships the dead-code retirement that was front-loaded from P21 in the 2026-05-01 reshuffle (`f548d8b2`). With pre-training as the release path, the HF state-dict adapter slot moved out (deferred to P22+) and the cleanup work moved up so P18's spec audit walks a clean tree.

Retired in this commit
- `primus/backends/megatron/core/models/deepseek_v4/deepseek_v4_mtp.py` — the legacy primus-owned `DeepseekV4MTPBlock` was deprecation-warned since P16 (`6c5875d4`); the spec-based path (`get_v4_mtp_block_spec` + upstream `MultiTokenPredictionBlock` + `process_mtp_loss`) is the only MTP route now.
- `DeepSeekV4TransformerConfig.v4_use_custom_mtp_block` (legacy MTP gate) — removed.
- `DeepSeekV4TransformerConfig.mtp_compress_ratios` (legacy-only field) — removed.
- `DeepseekV4Model.__init__` — a single MTP branch on the spec path; the `if v4_use_custom_mtp_block` arm + the `self.mtp_block` field are gone.

Dedup'd in this commit
- `primus/backends/megatron/core/transformer/local_rmsnorm.py` (new) — one canonical `LocalRMSNorm` consumed by `deepseek_v4_block.py` (`input_layernorm` / `pre_mlp_layernorm` / `final_layernorm` fallback), `deepseek_v4_attention.py` (`q_norm` / `kv_norm` fallback closure), and `compressor.py` (`kv_norm`). The three pre-existing `_RMSNorm` definitions are deleted.

YAML cleanup
- `deepseek_v4_flash.yaml` — inverted comment fixed: `4 = CSA` (overlap) and `128 = HCA` (non-overlap) match the `DeepseekV4Attention.forward` dispatch.
- `deepseek_v4_pro.yaml` + `deepseek_v4_base.yaml` — the same canonical comment block added so all three V4 yamls are self-documenting.

Audit gate G14
`tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p17_dead_code.py` (new):
- retired files stay retired (`deepseek_v4_mtp.py`, `csa_attention.py`, `hca_attention.py`).
- importing the legacy block raises `ImportError`; the package `__all__` no longer exposes `DeepseekV4MTPBlock`.
- `DeepSeekV4TransformerConfig` no longer carries `v4_use_custom_mtp_block` / `mtp_compress_ratios`.
- AST scan for any `_v4_token_ids` access (Attribute / Assign / Name) — docstring mentions are exempt.
- AST scan for `class _RMSNorm` shadow definitions — none allowed.
- the `4 = CSA` / `128 = HCA` mapping is documented.

Out of scope (kept, with notes in `status.md`)
- `primus/backends/megatron/core/transformer/dual_rope.py` — load-bearing for V4's CSA / HCA dual-base partial RoPE; Megatron's `RotaryEmbedding` only supports a single base. Plan-2 was over-eager listing this for retirement; it stays.

What landed in b5832672 (P18 — spec-system audit, gate G1 + D1 / D2 / D4)
P18 closes the spec-system audit findings D1 / D2 / D4 from `00-review-findings.md`. Walking a clean tree (after P17) makes the audits crisp.

Provider singleton (D1)
- `primus/backends/megatron/core/models/deepseek_v4/build_context.py` (new): `resolve_v4_provider(config)` caches a single `DeepSeekV4SpecProvider` on the config object via a private attribute (see the sketch below). Different configs get different providers; the cache is GC'd when the config is released.
- All direct `DeepSeekV4SpecProvider(config=config)` call sites migrated to the helper: `deepseek_v4_block.py` (`_build_projection` + `DeepseekV4MoE` shared-expert wiring), `deepseek_v4_layer_specs.py`, `deepseek_v4_mtp_specs.py`.
- An AST audit (`test_v4_p18_spec_audit.py::test_no_direct_DeepSeekV4SpecProvider_construction_outside_build_context`) rejects future regressions; `build_context.py` is the only allowed instantiation site.
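The caching helper is essentially a one-attribute memo on the config object.
A sketch under that reading; the private attribute name is illustrative, and
the import path mirrors the file named above:

```python
from primus.backends.megatron.core.extensions.transformer_engine_spec_provider import (
    DeepSeekV4SpecProvider,
)

_CACHE_ATTR = "_v4_spec_provider"  # illustrative private attribute name

def resolve_v4_provider(config):
    provider = getattr(config, _CACHE_ATTR, None)
    if provider is None:
        # The only allowed construction site per the D1 audit.
        provider = DeepSeekV4SpecProvider(config=config)
        setattr(config, _CACHE_ATTR, provider)
    return provider  # same instance on repeated calls; GC'd with the config
```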
Activation-func consistency (D2)
- `DeepSeekV4SpecProvider.v4_mlp_activation_func()` returns `None` when `config.use_te_activation_func` is False (the V4 default; needed so the Megatron MLP keeps the eager clamped-SwiGLU path, which applies `activation_func_clamp_value`) and `TEActivationOp` (the TE class, instantiated by the Megatron MLP at build) when the user opts into TE.
- The `DeepseekV4MoE` shared-expert spec switched to the V4 helper. The base provider's `activation_func()` is unchanged (the `BackendSpecProvider` contract still says "returns a type").

compress_ratios normalization (D4)
- `DeepSeekV4TransformerConfig.__post_init__` calls `_normalize_compress_ratios_field` on the raw value once, so downstream consumers see `tuple[int, ...]` (or `None`). The helper handles strings (`"[0, 0, 4, 128, ...]"`) and real lists.
- The existing parsers (`_parse_int_sequence` / `_normalize_compress_ratios` in `deepseek_v4_block.py`) keep accepting both forms for back-compat, but always receive the normalized form on the live path.

Schema gate G1
`tests/unit_tests/configs/test_deepseek_v4_yaml.py` (new) parameterises over `deepseek_v4_{base,flash,pro}.yaml`:
- `parse_yaml()` succeeds; required fields present.
- `DeepSeekV4TransformerConfig` builds from the parsed dict.
- `compress_ratios` normalized to `tuple[int, ...]` with no value drift vs the raw schedule; every `compress_ratios` entry is in `{0, 4, 128}` (the canonical V4 branches).
- retired fields (`v4_use_custom_mtp_block` / `mtp_compress_ratios`) are gone from the dataclass and from each YAML.
- V4 extras (`o_groups` / `o_lora_rank`, MoE extras, `swiglu_limit`) are all declared on the dataclass.
- `resolve_v4_provider(cfg_a)` returns the same instance on repeated calls; different configs get different providers.
- the `v4_mlp_activation_func` contract is verified for both branches of `use_te_activation_func`.

Spec audit (light-weight, AST-only)
`tests/unit_tests/megatron/transformer/deepseek_v4/test_v4_p18_spec_audit.py` (new):
- the package `__init__.py` `__all__` does not re-export `DeepseekV4MTPBlock` (P17 cross-check).
- no eager construction of `TENorm` / `TE{Column,Row}ParallelLinear` / `TELinear` / `TEActivationOp` inside `__init__` — they emit `ModuleSpec(module=...)` references that runtime `build_module` resolves.

Schedule
`e591b893` + `b5832672` (continuing the early start; the May 02 – 05 holiday remains).

What landed in 83c33ad0 (P19 — distributed re-validation) + dba27163 (plan-2 close-out)
P19 closes the distributed re-validation gate (G10) for the architecture-faithful V4 stack landed across P13 → P18. All four target smokes pass 10/10 iterations on `mi355-gpu-12` (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / `hc_mult=4`); two `torch.profiler` chrome-trace JSONs (EP=8 and PP=2 EP=4) are captured for the perf-baseline reference.

Smokes (10 iters each)
- Smoke A: `deepseek-v4/develop/progress/p19/smokeA*.log` — no V4 PP patches exercised.
- Smoke B: `p19/smokeB*.log` — `pp_tensor_shape`.
- Smoke C (PP=4 EP=2): `p19/smokeC_pp4_ep2_v2.log` — `pp_tensor_shape` + `pp_token_pre_broadcast`.
- Smoke D (PP=2 EP=4 VPP=2): `p19/smokeD_pp2_ep4_vpp2_v2_run3.log` — `pp_tensor_shape` (also wraps the interleaved schedule) + `pp_token_pre_broadcast` (upfront).

Profile traces
`torch.profiler` chrome-trace JSONs (single active step, iter 6 → 7) under the same V4 smoke config:
- `output/amd/tas-mi355x-20260507/p19_profile_pp1_ep8/tensorboard/...rank[0].*.pt.trace.json` — TP=1 PP=1 EP=8 (~99 MB).
- `output/amd/tas-mi355x-20260507/p19_profile_pp2_ep4/tensorboard/...rank[0].*.pt.trace.json` — TP=1 PP=2 EP=4 (~105 MB).

Launchers: `deepseek-v4/develop/progress/p19/run_profile_ep8.sh` and `run_profile_pp2_ep4.sh`.

megatron.deepseek_v4.pp_tensor_shape (`primus/backends/megatron/patches/deepseek_v4_pp_shape_patches.py`)
Wraps two Megatron entry points in `megatron.core.pipeline_parallel.schedules` so V4's mHC `K = hc_mult` packing is reflected on the PP wire:
1. `get_tensor_shapes` (used by 1F1B): the seq dim is multiplied by `hc_mult` so the receive buffer matches `[S * K, B, D]` instead of the stock `[S, B, D]`.
2. `forward_backward_pipelining_with_interleaving` (used by VPP): the `seq_length` kwarg is multiplied by `hc_mult` before the schedule's inline `tensor_shape = [seq_length, mbs, hidden]` runs.

Both wrappers gate on `model_type == "deepseek_v4"` + `hc_mult > 1` + `PP > 1` and are strict no-ops otherwise. Without (2), VPP allocates `[S, B, D]` recv buffers while the sender emits `[S * K, B, D]`, and `_lift_streams_in` reshapes the truncated copy — surfacing as `DeepseekV4HashRouter: hidden=32 vs token_ids=128`.
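Mechanically, the 1F1B half of the patch is a monkey-patch over Megatron's
shape helper. A hedged sketch, assuming the stock list-of-`(seq, mbs,
hidden)` return and a statically known `hc_mult` (the real patch reads it
from the resolved V4 config and also wraps the VPP entry point):

```python
import megatron.core.pipeline_parallel.schedules as schedules

_orig_get_tensor_shapes = schedules.get_tensor_shapes

def _v4_get_tensor_shapes(*args, **kwargs):
    shapes = _orig_get_tensor_shapes(*args, **kwargs)
    # mHC packs K = hc_mult streams into the seq dim:
    # [S, B, D] -> [S*K, B, D] on the PP wire.
    hc_mult = 4  # illustrative; taken from the V4 config in the real patch
    return [(s * hc_mult, b, d) for (s, b, d) in shapes]

schedules.get_tensor_shapes = _v4_get_tensor_shapes
```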
num_hash_layers) need rawinput_idson every PP stage that owns one, butpretrain_gpt.get_batchreturnsNoneon middle PP stages. Two earlier in-loop hooks both deadlocked under VPP — an in-DeepseekV4Model.forwardbroadcast and a per-callget_batchbroadcast each raced the interleaved schedule's pre-warmuprecv_forward.wait().This patch wraps
pp_module.get_forward_backward_funcso eachtrain_stepfirst runs allnum_microbatches × num_chunksPPdist.broadcastcollectives upfront, before the schedule's first send / recv, and caches the resulting(tokens, labels, loss_mask, attention_mask, position_ids, packed_seq_params)tuples per(vp_stage, microbatch). A companion wrapper aroundpretrain_gpt.get_batchconsumes the cache when active and falls back to the original implementation otherwise. Cache is reset in afinallyafter each schedule call. Cost ≈mbs * seq * 8Bper microbatch (~32 KiB / step on the smoke), dwarfed by the activation P2P.Model-side cleanup (
deepseek_v4_model.py,deepseek_v4_layer_specs.py)forwardinput_idsPP broadcast + VPP fail-fast assert fromDeepseekV4Model; the pre-broadcast patch handles both 1F1B and VPP cleanly.self.mtp = Nonein__init__; Megatron'sset_current_microbatch(incuda_graphs.py) only iteratesmodel.mtp.layerswhen MTP is actually live, which matches upstreamGPTModel. Downstream MTP guards usegetattr(self, "mtp", None).DeepSeekV4SpecProviderindeepseek_v4_layer_specs.pyso the type annotation resolves at module load (NameError surfaced once turbo path was off).c10d::allreduce_autograd warning goneThe historical
UserWarning: An operator was called with autograd not registered for c10d::allreduce_came from the early bring-up's "local shard +torch.distributed.all_reduce" path for MoE routed-output aggregation inv4_moe.py. P14 phase-2 migrated MoE to Megatron's token dispatchers (MoEAlltoAllTokenDispatcher/MoEFlexTokenDispatcher); P17 deleted thev4_enable_ep_allreduce_fallbackdebug gate; and P19 confirms zeroc10d::allreducehits in stderr across all four smokes + the EP=8 / PP=2 EP=4 profile runs.dba27163plan-2 close-out (docs-only)status.md— markc10d::allreduce_warning as gone (with the verification log paths); mark G11 as[-]deferred (snapshot dump tooling never landed); drop Phase 20 / 21 / 22+ sections (kept as documented intent inplan-2/03-phase-details.md); refresh the Blockers / Risks log entry for c10d to reference the actual P19 verification rather than "still tracked into P19".deepseek-v4/develop/progress/plan-2-summary.md(new) — stand-alone summary of the plan-2 architecture-faithful rewrite (P12 → P19): per-phase outcome with key commits; P19 deep-dive (smokes / profile traces / patches / c10d verification); test-gate ledger (G1 / G3 / G4 / G5 / G6 / G7 / G11 / G14 + smokes); plan-1 → plan-2 architectural-shift table (attention, MoE, layer / block, MTP, token-IDs path, HC × PP, TP, spec hygiene); explicit deferred / out-of-scope list (G6 distributed, G7 MTP, G11, P20, P21, P22+).run_profile_ep8.sh(TP=1 PP=1 EP=8) andrun_profile_pp2_ep4.sh(TP=1 PP=2 EP=4); both setPROFILE=True+disable_tensorboard=Falseso the existingtorch_profiler_patches.pyhook captures iter 6 → 7.deepseek-v4/download_ref.sh— idempotent helper that ensuresgit-lfsand clones the V4 reference assets at pinned commits (HFtransformers, ROCmTransformerEngine,AMD-AGI/Primus-Turbo,NVIDIA-NeMo/Automodel, plusDeepSeek-V4-Pro/Flash/Flash-Base/Pro-Base) withGIT_LFS_SKIP_SMUDGE=1so weights are not downloaded by default.Schedule
- P12 → P18 (`636ab3de` → `b5832672`).
- P19 (`83c33ad0`) + plan-2 close-out (`dba27163`).
- Deferred phases remain documented in `plan-2/03-phase-details.md`; they re-enter active work when the next campaign (release, downstream integration ask, SFT / eval) needs them.

Test plan
- Bring-up smoke: `PP=2`, `EP=4`, BF16, 3 iters on `dev_primus_wenx_691`.
- P10: `iteration 10/10` smoke with the grouped-expert clamped-SwiGLU guard; pre-commit hooks (isort / autoflake / black) pass.
- P13: unit tests on the `MLASelfAttention`-rooted module; inline-reference numerical alignment for dense (≤ 1e-3) and HCA (≤ 1e-3); CSA shape / finiteness; `linear_q_up_proj` / `linear_o_b` Column / Row parallel; pre-commit hooks (isort / autoflake / black) pass.
- P14 phase-1: activation + router parity (gradients to `weight`) covered by `test_clamped_swiglu.py` + `test_v4_routers.py`; pre-commit hooks (isort / autoflake / black) pass.
- P14 phase-2: `DeepseekV4MoE` -> `MegatronModule` + provider `v4_grouped_mlp_spec(swiglu_limit)` / `v4_router_spec(learned)` + 1L MoE forward within 1e-3 of the HF reference (gate G5) — covered by `test_v4_moe.py`; pre-commit hooks pass.
- P15: `DeepseekV4HybridLayer` -> `TransformerLayer` + `DeepseekV4TransformerBlock` -> `TransformerBlock`; `HyperHead` only on `post_process`; `_lift_streams_in` / `_lower_streams_out` packing helpers (CPU-only G6 sub-gate covered by `test_v4_block_pp.py`); token_ids forward-kwarg threading + `decoder._v4_token_ids` AST audit; pre-commit hooks pass. `PP=1 / 2 / 4` equivalence on a 4L V4 toy — gate G6 — deferred to P19.
- P16: `MultiTokenPredictionBlock` + `process_mtp_loss`; `get_v4_mtp_block_spec` helper; layer forward returns `(hidden_states, None)`; legacy `DeepseekV4MTPBlock` deprecated; pre-commit hooks pass. `mtp_num_layers=0` matches LM loss to 1e-6 — gate G7 — deferred to P19.
- P17: legacy `DeepseekV4MTPBlock` deleted; `v4_use_custom_mtp_block` / `mtp_compress_ratios` config fields removed; three `_RMSNorm` shadows replaced by the shared `LocalRMSNorm`; yaml comment inversion fixed (`4 = CSA` / `128 = HCA`); package surface refreshed; AST gate G14 green via `test_v4_p17_dead_code.py` (retired-files check, retired-config-fields check, `_v4_token_ids` AST scan, `_RMSNorm` shadow scan, yaml-comment dispatch). `dual_rope.py` intentionally kept (load-bearing for V4's CSA / HCA dual-base RoPE; no Megatron equivalent — documented in `status.md`). Pre-commit hooks pass.
- P18: `build_context.resolve_v4_provider(config)` (D1); `provider.v4_mlp_activation_func()` returns `None` when `use_te_activation_func=False` and `TEActivationOp` otherwise (D2); `compress_ratios` normalized to `tuple[int, ...]` in `__post_init__` (D4); new `tests/unit_tests/configs/test_deepseek_v4_yaml.py` (G1 schema gate) + `test_v4_p18_spec_audit.py` (D1 / D2 / package surface / TE eager-construction AST audits). Pre-commit hooks pass.
- P19: smokes A (1×8, PP=1 EP=1), B (1×8, PP=2 EP=4), C (1×8, PP=4 EP=2), D (1×8, PP=2 EP=4 VPP=2) — all 10/10 iters on `mi355-gpu-12` (BF16, MBS=1 GBS=16, seq=128, 8 layers / 3 hash layers / `hc_mult=4`); two `torch.profiler` chrome-trace JSONs (EP=8 and PP=2 EP=4) captured. Two primus patches landed (`pp_tensor_shape` + `pp_token_pre_broadcast`); the `c10d::allreduce_` autograd warning verified absent in stderr across all smokes + profile runs.
- P19 — routing-snapshot diff = 0 across PP / EP changes — gate G11: deferred (snapshot dump tooling never landed; not on the pre-training release path). Runtime stability of the P15 / P19 patches is covered by the smokes above.
- P20 — 200-step Megatron-bridge convergence (±0.05 loss) + TE on/off perf report + FP8 follow-up plan — gates G12 / G13: deferred follow-up as of 2026-05-07; not on the pre-training release path; re-enters active work when a release / perf campaign needs it.
- 50-iter stability run + TP partitioning end-to-end coverage — superseded by the P19 smoke matrix (10 iters × 4 parallelism configurations); a longer stability sweep is bundled into the deferred P20 perf campaign.

Known risk / follow-up
- EP routed-output path: resolved (P14 phase-2 / P17 audit / P19 runtime verification). The bring-up path used `all_reduce` and emitted a PyTorch autograd warning (`c10d::allreduce_` kernel registration); it was functional for bring-up and gated behind the `v4_enable_ep_allreduce_fallback` debug toggle. That flag was removed during the dispatcher migration in P14; the debug gate was deleted in P17 (`e591b893`); P19 smokes (A/B/C/D + EP=8 / PP=2 EP=4 profile runs on `mi355-gpu-12`) confirm zero `c10d::allreduce` warnings in stderr — the EP routed-output reduction now flows entirely through Megatron's `MoEAlltoAllTokenDispatcher` / `MoEFlexTokenDispatcher`.
- HC × PP: resolved (P15 + P19). Risk: `HyperHead` per-stage application destroys K-stream context. `DeepseekV4TransformerBlock` packs `[B, S, K, D] → [S*K, B, D]` for PP P2P via `_lower_streams_out` and only applies `HyperHead` on the `post_process` stage. The CPU-side bit-exact roundtrip is covered by `test_v4_block_pp.py`. Runtime stability across PP > 1 is verified by P19 smokes B / C / D with the `pp_tensor_shape` patch; distributed bit-equality across PP = 1 / 2 / 4 (G6) is a separate audit and is not on the pre-training release path (deferred follow-up).
- Token-IDs plumbing: resolved (P15). Risk: the `decoder._v4_token_ids` attribute stash leaks state across PP and microbatches. `DeepseekV4Model.forward` now passes `token_ids=input_ids` directly to the decoder; an AST audit prevents regressions.
- HF-weight loading: deferred to P22+; tracked in `02-target-architecture.md` §7 + `03-phase-details.md` (P22+ section).