cuda: DGX Spark / GB10 backend support — HBM-resident model by TrevorS · Pull Request #11 · TrevorS/ds4

TrevorS · 2026-05-24T18:08:36Z

Summary

Makes the 80 GB IQ2XXS DeepSeek V4 Flash checkpoint run well on NVIDIA DGX Spark (GB10, sm_121, 121 GiB UMA). On Spark the GPU can consume host-mmap'd weights via ATS, but at lower effective bandwidth than device-resident copies. This PR makes a budgeted set of hot tensor spans HBM-resident at startup, leaving the cold MoE routed experts ATS-mapped.

Standalone change: no MTP behavior, no kernel semantics change. Plain decode only. Adds GB10 to speed-bench/ next to m2_ultra.csv and m4_max.csv.

Speed — `ds4-bench` standard sweep (promessi_sposi.txt, gen=128, GB10)

| ctx | upstream/main (t/s) | this PR (t/s) | Δ (t/s) |
|---|---|---|---|
| 2048 | 13.85 | 14.24 | +0.39 |
| 8192 | 13.67 | 14.10 | +0.43 |
| 16384 | 13.54 | 13.97 | +0.43 |
| 18432 | 13.45 | 13.88 | +0.43 |

Steady ~+0.4 t/s from the small-N kernel fuses (Q_A+KV_A pairing, hc_expand epilogue parallelization, head_rms_norm+rope_tail fusion).
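
As a quick sanity check on the table, the deltas and relative gains can be recomputed from the two throughput columns. This is a minimal Python sketch and not part of the PR; the `rows` literal only restates the numbers reported above, and `deltas` is a hypothetical helper name.

```python
# Throughput in t/s from the sweep table: (ctx, upstream/main, this PR).
rows = [
    (2048, 13.85, 14.24),
    (8192, 13.67, 14.10),
    (16384, 13.54, 13.97),
    (18432, 13.45, 13.88),
]

def deltas(rows):
    """Return (ctx, absolute delta in t/s, relative gain in %) for each row."""
    out = []
    for ctx, base, pr in rows:
        delta = round(pr - base, 2)
        gain_pct = round(100.0 * (pr - base) / base, 1)
        out.append((ctx, delta, gain_pct))
    return out

for ctx, delta, gain_pct in deltas(rows):
    print(f"ctx={ctx:>6}  delta=+{delta:.2f} t/s  gain=+{gain_pct:.1f}%")
```

The absolute gain is flat across context sizes, which works out to roughly 3% at each measured point.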

UMA headroom (the bigger reason)

On current upstream/main, the startup cache copies the full 80.76 GiB model to device. On a 121 GiB Spark that leaves little headroom once KV cache and prefill activations grow: in my testing the standard ds4-bench 2048→65536 sweep did not complete past ~18k context on upstream.

This PR caps the cache at 24 GiB with a MoE filter, so only the hot spans (attention projections, shared experts, embedding, output head) become device-resident — ~8.2 GiB in practice — while cold routed experts stay ATS-mapped. The full 2048→65536 sweep completes with headroom to spare. (Worth confirming on your own Spark — UMA pressure depends on driver and allocator behavior.)
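
The cap-plus-filter policy can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the PR's code: the tensor names, sizes, `plan_hbm_cache`, and `is_routed_expert` are all hypothetical, and the real backend may order or split spans differently.

```python
GIB = 1024 ** 3

def plan_hbm_cache(tensors, budget_bytes, is_cold):
    """Pick spans to copy to device memory without exceeding the budget.

    tensors: iterable of (name, size_bytes), visited in order.
    is_cold: predicate for spans that should stay host-mapped.
    Returns (resident_names, resident_bytes).
    """
    resident, used = [], 0
    for name, size in tensors:
        if is_cold(name):
            continue  # cold span: leave it ATS-mapped
        if used + size > budget_bytes:
            continue  # would exceed the cap: leave it ATS-mapped as well
        resident.append(name)
        used += size
    return resident, used

# Hypothetical tensor names and sizes, for illustration only.
tensors = [
    ("token_embd", 1 * GIB),
    ("blk.0.attn_q_a", 2 * GIB),
    ("blk.0.ffn_shared_exp", 3 * GIB),
    ("blk.0.ffn_routed_exps", 70 * GIB),
    ("output", 2 * GIB),
]

def is_routed_expert(name):
    # Assumed naming convention; the real filter may key on something else.
    return "routed_exps" in name

resident, used = plan_hbm_cache(tensors, 24 * GIB, is_routed_expert)
print(resident, used / GIB)
```

With a filter like this, the budget acts as an upper bound rather than a target, which is consistent with the resident set landing well under the 24 GiB cap.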

What's in the PR

Startup HBM cache with a budget cap + MoE filter (hot spans device-resident, cold routed experts ATS-mapped).
cuda_model_range_ptr helper: single hash-keyed lookup for device-resident pointers on the hot path.
GPU argmax kernel (the prior fallback misused indexer scoring as argmax, double-paying dispatcher cost at N=1).
Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode.
Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes.
HBM cache extends to the MTP support model (no behavioral change without --mtp).
Drop cudaHostRegisterReadOnly (unsupported on GB10).
speed-bench/gb10.csv from the standard sweep.
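
To make the "single hash-keyed lookup" item concrete, here is a hedged Python sketch of the idea: one table probe returns the device copy when a span is cached and otherwise falls back to the host-mapped address. The `RangeCache` class, its key of `(offset, length)`, and the addresses are assumptions made for illustration; the actual `cuda_model_range_ptr` signature and key are not shown in this PR description.

```python
class RangeCache:
    """Maps (offset, length) spans of the model file to device addresses."""

    def __init__(self, host_base):
        self.host_base = host_base  # address of the mmap'd model (ATS-visible)
        self.device = {}            # (offset, length) -> device address

    def insert(self, offset, length, device_addr):
        """Record that a span has been copied to device memory."""
        self.device[(offset, length)] = device_addr

    def range_ptr(self, offset, length):
        """One dict probe: device copy if cached, else the host-mapped address."""
        hit = self.device.get((offset, length))
        return hit if hit is not None else self.host_base + offset


# Hypothetical addresses, for illustration only.
cache = RangeCache(host_base=0x7000_0000_0000)
cache.insert(4096, 1024, 0x2_0000_0000)
print(hex(cache.range_ptr(4096, 1024)))  # cached span: device address
print(hex(cache.range_ptr(8192, 512)))   # uncached span: host address
```

The point of the design is that callers on the decode path never branch on whether a span is resident; they always get a usable pointer back.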

Tested

make clean && make cuda-spark — clean
make cpu — clean
./ds4_test --long-context, --tool-call-quality, --server, --metal-kernels — OK
./ds4-bench 2048→65536 sweep — completes; speed-bench/gb10.csv
Two ds4_test checks fail identically on raw upstream/main (bfe070a) — not introduced here:
- --logprob-vectors short_code_completion — fixture's official continuation is one greedy token off
- --metal-tensor-equivalence — intrinsically flaky on GB10 (run-to-run non-determinism in the batched prefill path; raw upstream fails ~2/3 runs, this branch similar)

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Note

Foundation for a stacked follow-up (#12) that adds MTP combined-forward speculative decode. This PR stands alone — it's what makes Spark usable at high context whether or not you care about MTP.

DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual spot for CUDA inference: ATS (Address Translation Service) lets the GPU consume host-mmap'd weights directly, but at significantly lower effective bandwidth than HBM-resident copies. For an 80 GB IQ2XXS DeepSeek V4 Flash checkpoint, the difference is the model running versus the model being usable.

This commit adds:

- Startup HBM cache that copies hot tensor spans (attn projections, MoE shared experts, output projection) into device memory at engine init, capped by a configurable budget (defaults sized to leave headroom for KV cache and a second model load). Cold MoE routed experts stay ATS-mapped.
- Factored `cudaMalloc` cache helper (cuda_model_range_ptr) so the HBM-resident pointer lookup is a single hash-keyed read on the hot decode path.
- GPU argmax kernel; the prior fallback misused indexer scoring as an argmax, which double-paid the dispatcher cost on N=1 decode.
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path (one shared weight load per row, two outputs).
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes (n_hc parallel residual loads + writes vs n_hc^2 serial reads).
- HBM cache also populated for the MTP support model.
- Drop `cudaHostRegisterReadOnly` flag (unsupported on GB10).
- Drop `!mtp_ready` gate from accelerator_cache_model_tensors so the MTP support model gets the same HBM-cache treatment.

Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):

  Plain decode before: ~13.9 t/s (ATS-mapped weights, all paths)
  Plain decode after: ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)

Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the 2048..65536 sweep is preserved alongside the existing m2_ultra.csv and m4_max.csv.

Generated via:

    ./ds4-bench -m ds4flash.gguf \
        --prompt-file speed-bench/promessi_sposi.txt \
        --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
        --gen-tokens 128 --csv speed-bench/gb10.csv

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
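
For readers unfamiliar with how an argmax is typically parallelized, the following Python sketch emulates a two-stage reduction serially: each block finds its local maximum, then the partial results are reduced. It assumes the kernel picks the greedy token from a vector of logits; the function names, the block size, and the lowest-index tie-break are illustrative assumptions, not details taken from the commit.

```python
def block_argmax(logits, start, end):
    """Return (value, index) of the max in logits[start:end]; ties keep the lowest index."""
    best_val, best_idx = logits[start], start
    for i in range(start + 1, end):
        if logits[i] > best_val:
            best_val, best_idx = logits[i], i
    return best_val, best_idx

def argmax_two_stage(logits, block_size=256):
    """Stage 1: per-block partial maxima. Stage 2: reduce the partials."""
    if not logits:
        raise ValueError("logits must be non-empty")
    partials = [
        block_argmax(logits, s, min(s + block_size, len(logits)))
        for s in range(0, len(logits), block_size)
    ]
    best_val, best_idx = partials[0]
    for val, idx in partials[1:]:
        # Strict '>' keeps the earliest block on ties, so the lowest index wins.
        if val > best_val:
            best_val, best_idx = val, idx
    return best_idx

print(argmax_two_stage([0.1, 2.5, -1.0, 2.5, 0.3], block_size=2))
```

On a GPU the stage 1 blocks would run concurrently and stage 2 would be a short final reduction; here both stages run in a plain loop so the logic is easy to follow.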

TrevorS · 2026-05-24T21:08:43Z

Recreated as #13 after renaming the branch to gb10-hbm-resident-model (GitHub's branch rename closed this PR). Continue at #13.

TrevorS mentioned this pull request May 24, 2026

mtp: combined-forward speculative decode beats plain on GB10 (+2.4 t/s) (stacked on #11) #12

Closed

TrevorS force-pushed the clean-spark-backend branch from 2330d1d to 36c1735 Compare May 24, 2026 19:47

TrevorS changed the base branch from clean-base to main May 24, 2026 20:30

TrevorS closed this May 24, 2026

TrevorS deleted the clean-spark-backend branch May 24, 2026 21:06

TrevorS mentioned this pull request May 24, 2026

mtp: combined-forward speculative decode beats plain on GB10 (+2.4 t/s) (stacked on #13) #14

Open

TrevorS wants to merge 1 commit into main from clean-spark-backend
