
fix(v4-ci): unblock nightly Docker test + V4-Pro accuracy step#703

Merged
valarLip merged 1 commit into main from fix/v4-pr1-ci-hotfix on May 6, 2026
Conversation

valarLip (Collaborator) commented May 6, 2026

Summary

Two hotfixes for CI regressions exposed after the V4-Pro PR1 merge (#650):

1. Nightly Docker Release — simple_inference long prompt overflows block_tables

The new arithmetic stress prompt (3000 terms ≈ 16k tokens) added in c3ec204 crashed the Docker test (`atom-release` runs Llama-3-8B-Instruct):

ValueError: could not broadcast input array from shape (1005,) into shape (512,)
  at atom/model_ops/attentions/backends.py:249 (prepare_block_tables)

Llama-3-8B's `max_model_len=8192`, so the `block_tables` buffer dim = 8192 / 16 = 512. The 16k-token seq passes the prefill scheduler (`max_num_batched_tokens=16384` ≥ 16016) but its KV needs ~1000 blocks > 512. Fix: shrink the prompt to 1500 terms (~7k tokens, 438 blocks → 74-block margin).
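
As a sanity check of the arithmetic, a minimal sketch (the `blocks_needed` helper is ours; `block_size=16` is implied by the 8192 / 16 = 512 figure above):

```python
import math

def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """KV blocks a sequence occupies under paged attention."""
    return math.ceil(num_tokens / block_size)

buffer_dim = 8192 // 16                  # block_tables second dim = 512
print(blocks_needed(16016), buffer_dim)  # 1001 > 512 -> broadcast error
print(blocks_needed(7005), buffer_dim)   # 438 <= 512 -> 74-block margin
```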

Locally reproduced with gpt-oss-120b -tp 1 --max-model-len 8192.

2. V4-Pro per-PR accuracy step — safetensors==0.7.0 missing F8_E8M0

The new V4-Pro accuracy entry (also c3ec204) failed at model load:

KeyError: 'F8_E8M0'
  at safetensors/torch.py:427 (_getdtype)

V4-Pro shards contain MX scale tensors with F8_E8M0 dtype. Both `torch.float8_e8m0fnu` and the safetensors-rust binary support it, but the Python `_TYPES` dict in `safetensors==0.7.0` (the latest release, pinned in CI) is missing the mapping. The mmap path (`safe_open` + `get_tensor`) goes through Rust and works fine, but the `ATOM_DISABLE_MMAP=true` path (set by `atom-test.yaml`'s container env) calls `safetensors.torch.load(bytes)`, which fails the dict lookup.

Fix: monkey-patch the missing mapping at loader import time. No-op when safetensors ships the entry natively.
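
A minimal sketch of the patch shape, assuming only that `safetensors.torch` keeps its private `_TYPES` name-to-dtype dict (the one `_getdtype` in the traceback above reads) and that the running torch build ships `float8_e8m0fnu`:

```python
import torch
import safetensors.torch as st

# Register F8_E8M0 -> torch.float8_e8m0fnu if this safetensors wheel
# (e.g. 0.7.0) lacks the Python-side mapping; no-op once upstream ships it.
if hasattr(torch, "float8_e8m0fnu"):
    st._TYPES.setdefault("F8_E8M0", torch.float8_e8m0fnu)
```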

End-to-end verified locally: previously crashed at shard 2/64; with the patch, all 10000 scanned tensors load (4951 are F8_E8M0).

Test plan

  • Local repro: gpt-oss-120b --max-model-len 8192 → simple_inference crashes on the old 3000-term prompt with the same (1005,) → (512,) error
  • Local F8_E8M0 repro: safetensors.torch.load(bytes) on a V4-Pro shard → KeyError: 'F8_E8M0'
  • Local F8_E8M0 fix verified: same call path now loads 2337 tensors per shard cleanly via the loader's disable_mmap=True branch
  • Wait for CI to re-run nightly Docker + V4-Pro accuracy after merge

Two regressions introduced/exposed by the V4-Pro per-PR CI entry
(commits c3ec204 onward):

1. simple_inference.py: shrink the new arithmetic stress prompt from
   3000 terms (~16k tokens) to 1500 terms (~7k tokens). At 16k tokens
   the seq's KV needs ~1000 blocks, but the `block_tables` buffer
   second-dim is `max_model_len / block_size`. Llama-3-8B-Instruct
   used by `Test Docker image` step has `max_model_len=8192` →
   buffer dim = 512, so prefill schedules but decode crashes in
   `prepare_block_tables` with
   `ValueError: could not broadcast input array from shape (1005,)
   into shape (512,)`. 1500 terms → 7005 tokens → 438 blocks, well
   under 512 with comfortable margin.

2. loader.py: monkey-patch `safetensors.torch._TYPES` to register
   `F8_E8M0 -> torch.float8_e8m0fnu`. The Python wrapper in
   `safetensors==0.7.0` is missing this entry even though both torch
   and safetensors-rust support it. The mmap path
   (`safe_open + get_tensor`) goes through Rust and works fine, but
   the `ATOM_DISABLE_MMAP=true` path (used by `atom-test.yaml` in
   CI) calls `safetensors.torch.load(bytes)` which does the dict
   lookup and raises `KeyError: 'F8_E8M0'` on V4-Pro shards
   containing MX scale tensors. Patch is no-op when safetensors
   ships the entry natively.
@valarLip valarLip merged commit 09b157e into main May 6, 2026
10 of 12 checks passed
@valarLip valarLip deleted the fix/v4-pr1-ci-hotfix branch May 6, 2026 17:51
valarLip added a commit that referenced this pull request May 7, 2026
…ill (#710)

Two V4-Pro CI failures surfaced after PR #703 unblocked the loader path:

1. Triton MoE OOM on AMD CDNA4 with triton 3.6+/3.7+
   - `triton_kernels.matmul_ogs_details.opt_flags_amd` has a CDNA4 special
     case `if cdna4 and block_m == 128: block_n = 512`, giving
     BLOCK_M*BLOCK_N = 64K FP32 acc entries. triton 3.6+ spills the
     accumulator to LDS more aggressively than 3.5, exceeding the
     MI355X 160 KiB LDS budget (observed 269 KiB).
   - Fix: wrap matmul_ogs calls with a CDNA4-only context manager that
     pins block_m=64 / block_n=256 (BLOCK_M*BLOCK_N = 16K, fits regs);
     see the tile-arithmetic sketch after this list. Tunable via
     `ATOM_TRITON_MOE_BLOCK_{M,N}` env vars.
   - Other GPU families and triton ≤3.5 paths are unaffected.

2. `cp_gather_indexer_k_quant_cache` raises HIP "invalid configuration argument"
   when `cu_committed_cpu[-1] == 0` (fresh prefill with a prompt shorter
   than the CSA `ratio`). The kernel grid is computed as
   `(num_tokens + BLOCK_Y_SIZE - 1) / BLOCK_Y_SIZE` so `num_tokens=0`
   makes grid.x=0 and the HIP launcher rejects it.
   - Fix: clamp `cu_committed_cpu[-1]` to ≥1 in the indexer-meta builder
     (see the sketch after this list).
     The dummy +1 row is gathered from the last seq's first cache block
     but never read downstream, because `fp8_mqa_logits` and
     `top_k_per_row_prefill` honor per-token `cu_starts`/`cu_ends`
     derived from `cu_committed_gpu[:-1]` and `n_committed_per_seq`,
     both of which remain 0. Output stays all -1 sentinels, matching the
     all-empty semantics. Pure host-side scalar arithmetic on a value
     already host-synced; no CG/torch.compile graph branch added.
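
For item 1 above, the tile arithmetic behind the pinned 64x256 choice, as a minimal sketch: the env-var names come from the message, but the helpers are illustrative (the actual fix wraps the matmul_ogs call sites in a context manager):

```python
import os

def fp32_acc_kib(block_m: int, block_n: int) -> float:
    """KiB held by a BLOCK_M x BLOCK_N FP32 accumulator tile."""
    return block_m * block_n * 4 / 1024

def resolve_moe_tiles() -> tuple[int, int]:
    """Pinned CDNA4 defaults, overridable via ATOM_TRITON_MOE_BLOCK_{M,N}."""
    return (int(os.environ.get("ATOM_TRITON_MOE_BLOCK_M", "64")),
            int(os.environ.get("ATOM_TRITON_MOE_BLOCK_N", "256")))

print(fp32_acc_kib(128, 512))  # 256.0 KiB -> spills past the 160 KiB LDS
print(fp32_acc_kib(64, 256))   # 64.0 KiB (16K entries) -> fits in registers
```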
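
For item 2 above, the clamp itself as a minimal sketch; only `cu_committed_cpu` comes from the message, the surrounding builder and function name are assumed:

```python
import torch

def committed_token_count(cu_committed_cpu: torch.Tensor) -> int:
    """Host-side scalar used to size the gather kernel grid."""
    num_tokens = int(cu_committed_cpu[-1].item())
    # ceil(0 / BLOCK_Y_SIZE) == 0 would give grid.x = 0, which HIP rejects;
    # the dummy row gathered when clamped to 1 is never read downstream
    # because cu_starts/cu_ends stay 0 and outputs remain -1 sentinels.
    return max(num_tokens, 1)
```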

Verified locally with triton 3.7.0+amd.rocm7.1.0:
- DeepSeek-V4-Pro server starts (no OOM)
- 1-token "Hi" curl returns successfully (was crashing pre-fix)
- GSM8K-50 fewshot=5 = 0.94 (matches pre-PR-703 baseline)
