cuda: DGX Spark / GB10 backend support — HBM-resident model #11
Closed
TrevorS wants to merge 1 commit into
DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual
spot for CUDA inference: ATS (Address Translation Service) lets the
GPU consume host-mmap'd weights directly, but at significantly lower
effective bandwidth than HBM-resident copies. For an 80 GB IQ2XXS
DeepSeek V4 Flash checkpoint, the difference is the model running
versus the model being usable.
This commit adds:
- Startup HBM cache that copies hot tensor spans (attn projections,
  MoE shared experts, output projection) into device memory at engine
  init, capped by a configurable budget (defaults sized to leave
  headroom for KV cache and a second model load). Cold MoE routed
  experts stay ATS-mapped.
- Factored `cudaMalloc` cache helper (cuda_model_range_ptr) so the
  HBM-resident pointer lookup is a single hash-keyed read on the
  hot decode path.
- GPU argmax kernel; the prior fallback misused indexer scoring as
  an argmax, which double-paid the dispatcher cost on N=1 decode.
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path
  (one shared weight load per row, two outputs).
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes
  (n_hc parallel residual loads + writes vs n_hc^2 serial reads).
- HBM cache also populated for the MTP support model.
- Drop `cudaHostRegisterReadOnly` flag (unsupported on GB10).
- Drop `!mtp_ready` gate from accelerator_cache_model_tensors so
  the MTP support model gets the same HBM-cache treatment.
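The pointer lookup in the list above can be pictured as a table keyed by a tensor span's offset and length in the mapped model file. The sketch below is illustrative Python, not the CUDA helper itself: `RangeCache`, `insert`, and `lookup` are hypothetical names, and plain integers stand in for device and host addresses. Spans that were never cached fall back to the host-mapped address, which corresponds to the ATS path.

```python
# Illustrative sketch only: RangeCache and its methods are hypothetical names,
# not the API of cuda_model_range_ptr. Integers stand in for raw pointers.

class RangeCache:
    """Maps (offset, nbytes) spans of the mapped model file to device addresses."""

    def __init__(self, host_base):
        self.host_base = host_base   # base address of the host mmap
        self.device_ptrs = {}        # (offset, nbytes) -> device address

    def insert(self, offset, nbytes, device_ptr):
        # Called once per hot span at engine init, after the copy to device.
        self.device_ptrs[(offset, nbytes)] = device_ptr

    def lookup(self, offset, nbytes):
        # Hot path: one hash-keyed read; a miss falls back to the host mapping.
        return self.device_ptrs.get((offset, nbytes), self.host_base + offset)


cache = RangeCache(host_base=0x7F00_0000_0000)
cache.insert(offset=4096, nbytes=1 << 20, device_ptr=0x2000_0000)

hot = cache.lookup(4096, 1 << 20)       # cached span: device address
cold = cache.lookup(1 << 30, 1 << 20)   # uncached span: host-mapped address
```

Keying on the exact span keeps the decode-path cost to one dictionary read; a real helper would also have to decide how to treat sub-ranges of a cached span, which this sketch leaves out.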
Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):
  Plain decode before: ~13.9 t/s  (ATS-mapped weights, all paths)
  Plain decode after:  ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)
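For scale, the two decode rates above work out to roughly a 16% throughput gain, or about 10 ms less per generated token. A small sketch of that arithmetic, treating the approximate figures as exact (the variable names are illustrative):

```python
before_tps = 13.9    # plain decode, ATS-mapped weights (figure quoted above)
after_tps = 16.13    # plain decode, device-resident hot spans + kernel fuses

speedup = after_tps / before_tps
gain_pct = (speedup - 1.0) * 100.0

# Per-token latency is the reciprocal of throughput.
ms_saved = 1000.0 / before_tps - 1000.0 / after_tps

print(f"{speedup:.2f}x, +{gain_pct:.1f}%, {ms_saved:.1f} ms/token saved")
```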
Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the
2048..65536 sweep is preserved alongside the existing m2_ultra.csv
and m4_max.csv. Generated via:
    ./ds4-bench -m ds4flash.gguf \
        --prompt-file speed-bench/promessi_sposi.txt \
        --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
        --gen-tokens 128 --csv speed-bench/gb10.csv
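Assuming `--ctx-start`, `--ctx-max`, and `--step-incr` describe an inclusive arithmetic progression of context sizes (an assumption about `ds4-bench`, not something this message states), the sweep covers 32 context sizes:

```python
# Assumed semantics: start, start + step, ..., up to and including max.
ctx_start, ctx_max, step_incr = 2048, 65536, 2048

ctx_sizes = list(range(ctx_start, ctx_max + 1, step_incr))

print(len(ctx_sizes), ctx_sizes[0], ctx_sizes[-1])
```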
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
This was referenced May 24, 2026
Summary
Makes the 80 GB IQ2XXS DeepSeek V4 Flash checkpoint run well on NVIDIA DGX Spark (GB10, sm_121, 121 GiB UMA). On Spark the GPU can consume host-mmap'd weights via ATS, but at lower effective bandwidth than device-resident copies. This PR makes a budgeted set of hot tensor spans HBM-resident at startup, leaving the cold MoE routed experts ATS-mapped.
Standalone change: no MTP behavior, no kernel semantics change. Plain decode only. Adds GB10 to `speed-bench/` next to `m2_ultra.csv` and `m4_max.csv`.
Speed: `ds4-bench` standard sweep (promessi_sposi.txt, gen=128, GB10)
Steady ~+0.4 t/s from the small-N kernel fuses (Q_A+KV_A pairing, hc_expand epilogue parallelization, head_rms_norm+rope_tail fusion).
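One way to read the Q_A+KV_A pairing, assuming both projections consume the same input vector, is that a fused loop can read each input element once and accumulate both outputs. The sketch below is plain Python over dense float weights rather than quantized blocks, and `matvec` and `paired_matvec` are hypothetical names; it illustrates the idea and is not necessarily how the kernel in this PR is organized.

```python
def matvec(w, x):
    """Reference: one matrix-vector product, one full pass over x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def paired_matvec(w_a, w_b, x):
    """Compute w_a @ x and w_b @ x in a single pass over x."""
    out_a = [0.0] * len(w_a)
    out_b = [0.0] * len(w_b)
    for j, xj in enumerate(x):            # each input element is read once
        for i, row in enumerate(w_a):
            out_a[i] += row[j] * xj
        for i, row in enumerate(w_b):
            out_b[i] += row[j] * xj
    return out_a, out_b


w_q = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # stand-in for the Q_A weights
w_kv = [[1.0, 0.0, -1.0]]                  # stand-in for the KV_A weights
x = [1.0, 2.0, 3.0]

q_out, kv_out = paired_matvec(w_q, w_kv, x)
```

At N=1 decode the per-launch overhead and the input read are a visible share of the cost, which is why sharing them across two small projections can pay off.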
UMA headroom (the bigger reason)
On current `upstream/main`, the startup cache copies the full 80.76 GiB model to device. On a 121 GiB Spark that leaves little headroom once KV cache and prefill activations grow: in my testing the standard `ds4-bench` 2048→65536 sweep did not complete past ~18k context on upstream.
This PR caps the cache at 24 GiB with a MoE filter, so only the hot spans (attention projections, shared experts, embedding, output head) become device-resident (~8.2 GiB in practice), while cold routed experts stay ATS-mapped. The full 2048→65536 sweep completes with headroom to spare. (Worth confirming on your own Spark: UMA pressure depends on driver and allocator behavior.)
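The capped cache with a MoE filter can be sketched as a greedy pass over tensor spans: skip routed experts, then admit spans until the byte budget is spent. This is an illustrative Python sketch; the tensor names, sizes, the `is_routed_expert` predicate, and the greedy admission order are assumptions made for the example and are not taken from the PR's code.

```python
GIB = 1 << 30


def select_hot_spans(spans, budget_bytes, is_routed_expert):
    """Return (names of spans to copy to device, bytes used) under the budget."""
    chosen, used = [], 0
    for name, nbytes in spans:
        if is_routed_expert(name):
            continue                  # cold: stays host-mapped (ATS)
        if used + nbytes > budget_bytes:
            continue                  # would exceed the cap: leave host-mapped
        chosen.append(name)
        used += nbytes
    return chosen, used


# Hypothetical tensor names and sizes, for illustration only.
spans = [
    ("blk.0.attn_q_a", 2 * GIB),
    ("blk.0.ffn_exps_routed", 60 * GIB),
    ("blk.0.ffn_shexp", 3 * GIB),
    ("output", 1 * GIB),
]

chosen, used = select_hot_spans(
    spans, budget_bytes=24 * GIB, is_routed_expert=lambda n: "exps_routed" in n
)
```

With the figures quoted above, a full device copy takes 80.76 of 121 GiB and leaves about 40.2 GiB, while an 8.2 GiB hot set leaves 112.8 GiB, before counting the page cache, KV cache, and activations.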
What's in the PR
- `cuda_model_range_ptr` helper: single hash-keyed lookup for device-resident pointers on the hot path.
- Pair-fused Q_A + KV_A matmuls in `qkv_rms_fused` decode.
- Parallelized `matmul_q8_0_hc_expand` epilogue across `n_hc` lanes.
- [...] `--mtp`).
- Drop `cudaHostRegisterReadOnly` (unsupported on GB10).
- `speed-bench/gb10.csv` from the standard sweep.
Tested
- `make clean && make cuda-spark`: clean
- `make cpu`: clean
- `./ds4_test --long-context`, `--tool-call-quality`, `--server`, `--metal-kernels`: OK
- `./ds4-bench` 2048→65536 sweep: completes; `speed-bench/gb10.csv`
[...] `ds4_test` checks fail identically on raw `upstream/main` (bfe070a) and are not introduced here:
- `--logprob-vectors short_code_completion`: fixture's official continuation is one greedy token off
- `--metal-tensor-equivalence`: intrinsically flaky on GB10 (run-to-run non-determinism in the batched prefill path; raw upstream fails ~2/3 runs, this branch similar)
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf`
Note
Foundation for a stacked follow-up (#12) that adds MTP combined-forward speculative decode. This PR stands alone — it's what makes Spark usable at high context whether or not you care about MTP.