49 benchmarked | 14 working | 29 pending | 17 failed — across 3 GPU architectures
- `README.md` is the canonical root document and the only root doc intended for publication.
- Local mirrors `CLAUDE.md`, `AGENTS.md`, and `AGENT.md` are symlinks to `README.md` and are git-ignored.
- Recreate local mirrors with `./scripts/sync_local_docs.sh`.
- Public attribution identity for this repo is `dev@primitivecontext.com`.
- Local git hooks (in `.githooks/`) enforce attribution and block public AI co-author tags.
- Work is executed by a large agent swarm (up to ~20 concurrent) in a full-freedom dev sandbox.
- Research tools are valuable but sometimes stale or contradictory; agent triage can diverge.
- Timeline estimates will be laughed at; work is performed immediately, with agents scaled as needed.
| Target | Chip | Products | Memory |
|---|---|---|---|
| SM100 | B200 | Datacenter Blackwell | TMEM + HBM |
| SM120 | GB202 | RTX 5090, RTX Pro 6000 | 32 / 96 GB GDDR7 |
| SM121 | GB10 | DGX Spark | 128 GB unified LPDDR5x |
| GPU | Arch | Count |
|---|---|---|
| RTX Pro 6000 (96 GB GDDR7) | SM120 | 46 deployments |
| DGX Spark (128 GB LPDDR5x) | SM121 | 41 deployments |
| RTX 4090 (24 GB GDDR6X) | SM89 | 22 deployments |
| Capability | SM100 | SM120/SM121 | Impact |
|---|---|---|---|
| Tensor Memory | 256 KB TMEM/SM | None | Operands must live in registers |
| TMA Multicast | Yes | Disabled | Cluster shape locked to 1×1×1 |
| Instruction Set | tcgen05.mma | mma.sync.aligned.block_scale | Different memory models, layouts incompatible |
| Operand Location | SMEM/TMEM | Registers | Register pressure is the bottleneck |
Bottom line: SM100 kernels will not run on SM120/SM121. Code compiled for `sm_100a` traps on consumer Blackwell.
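A minimal runtime guard illustrating the split (a sketch under the assumptions in the comments, not code from this repo): dispatch the TMEM/`tcgen05` path only when the device reports compute capability 10.x, and fall back to the register-operand path on 12.x.

```cuda
#include <cuda_runtime.h>

// Sketch: select a kernel path by compute capability so sm_100a code never reaches
// consumer Blackwell. B200 (SM100) reports major=10; RTX 5090 / RTX Pro 6000 (SM120)
// and GB10 / DGX Spark (SM121) report major=12.
enum class MmaPath { Tmem, Registers, Unsupported };

inline MmaPath select_mma_path(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return MmaPath::Unsupported;
    if (prop.major == 10) return MmaPath::Tmem;       // datacenter Blackwell: tcgen05 + TMEM
    if (prop.major == 12) return MmaPath::Registers;  // consumer Blackwell: register operands only
    return MmaPath::Unsupported;
}
```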
| Platform | SM120 NVFP4 Status | Limitation |
|---|---|---|
| vLLM | Partial | SM120 FP4 kernel detection fails (#31085), falls back to Marlin (40-50% perf loss) |
| TensorRT-LLM | Partial | NVFP4 KV cache support requested but not shipped (#10241) |
| SGLang | Blocked | All attention backends fail for MoE on SM120 (triton SMEM overflow) |
| Custom CUTLASS | Required | Only path to NVFP4 weights + NVFP4 KV cache on SM120/SM121 |
Bottom line: No off-the-shelf stack gives NVFP4 model weights + NVFP4 KV cache on consumer Blackwell.
| Format | Data Bits | Scale Type | Block Size | Use Case |
|---|---|---|---|---|
| NVFP4 | E2M1 (4-bit) | UE4M3 (unsigned) | 16 elements | Weights + KV cache (2× mem reduction) |
| MXFP4 | E2M1 (4-bit) | UE8M0 (power-of-2) | 32 elements | OCP-compliant, coarser scaling |
| FP8 | E4M3/E5M2 | N/A | N/A | Intermediate compute precision |
Bottom line: NVFP4 with UE4M3 scales gives finer granularity (16-element vs 32-element blocks) and 137× more dynamic range than signed E4M3.
- vLLM/TRT-LLM don't support NVFP4 KV cache on SM120
- SM120 lacks TMEM so SM100 kernels fail
- NVFP4 weights + NVFP4 KV cache = maximum memory efficiency
- MoE + GQA patterns need specialized attention
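The arithmetic behind the memory-efficiency point above, as a small self-checking sketch (an estimate only: it ignores the per-tensor FP32 scale and any layout padding): NVFP4 stores 4 data bits plus one 8-bit UE4M3 scale per 16 elements, i.e. 4.5 bits per element.

```cuda
#include <cstdio>

// Back-of-the-envelope storage cost per element for NVFP4 vs FP8 and BF16.
// Sketch only: ignores the per-tensor FP32 scale and alignment padding.
int main() {
    const double nvfp4 = 4.0 + 8.0 / 16.0;  // E2M1 data + one UE4M3 scale per 16 elems = 4.5 bits
    const double fp8   = 8.0;
    const double bf16  = 16.0;
    std::printf("NVFP4 vs FP8 : %.2fx smaller\n", fp8 / nvfp4);   // ~1.78x (the ~2x KV-cache claim)
    std::printf("NVFP4 vs BF16: %.2fx smaller\n", bf16 / nvfp4);  // ~3.56x for weights
    return 0;
}
```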
- Use `sm_120a` for RTX 5090 / RTX Pro 6000
- Use `sm_121a` for DGX Spark
- Base `sm_120` cannot compile block-scaled instructions
- PTX ISA 8.7 added `sm_120a`; PTX ISA 8.8 added `sm_121a`
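A hedged sketch of what this means in a build (the nvcc flag pairs below are assumed invocations; verify them against the toolkit in use). A `__CUDA_ARCH__` check only identifies compute capability 12.x; compiling the block-scaled MMA itself still requires the arch-specific `sm_120a`/`sm_121a` target.

```cuda
// Assumed nvcc invocations for the arch-specific targets named above:
//   nvcc ... -gencode arch=compute_120a,code=sm_120a   (RTX 5090 / RTX Pro 6000)
//   nvcc ... -gencode arch=compute_121a,code=sm_121a   (DGX Spark / GB10)
// Base sm_120 compiles this guard, but rejects the block-scaled mma instruction itself.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1200)
#define NVFP4_HAS_BLOCK_SCALE_MMA 1
#else
#define NVFP4_HAS_BLOCK_SCALE_MMA 0
#endif
```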
- Use `mma.sync.aligned.kind::mxf4nvf4.block_scale` (inline-PTX sketch below)
- All operands (A/B/ACC/SFA/SFB) reside in registers
- Cluster shape fixed to 1×1×1 (no multicast)
- TMA available for GMEM↔SMEM movement
- No TMEM access (SM100-only)
- TMA loads to shared memory, not tensor memory
- Register file is the primary operand store
- Manage register pressure carefully
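For concreteness, a heavily hedged inline-PTX sketch of one such MMA (fragment packing, scale-selector choice, and tile iteration are omitted; take the authoritative per-thread layouts from the PTX ISA m16n8k64 block-scale description or the CUTLASS SM120 examples, not from this snippet):

```cuda
#include <cstdint>

// Sketch: one m16n8k64 NVFP4 block-scaled MMA issued entirely from registers.
// a: 4x .b32 packed E2M1 (A fragment), b: 2x .b32 (B fragment), d: 4x f32 accumulators,
// sfa/sfb: .b32 UE4M3 block scales. Byte-id/thread-id selectors are hard-coded to 0
// purely for illustration.
__device__ __forceinline__ void nvfp4_mma_m16n8k64(float d[4],
                                                   const uint32_t a[4],
                                                   const uint32_t b[2],
                                                   uint32_t sfa, uint32_t sfb) {
#if __CUDA_ARCH__ >= 1200
    asm volatile(
        "mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X"
        ".f32.e2m1.e2m1.f32.ue4m3 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3}, "
        "{%10}, {0, 0}, {%11}, {0, 0};\n"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]), "r"(sfa), "r"(sfb));
#endif
}
```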
[NVFP4 weights] ──┐
├──► [FP8 dequant] ──► [FP8 compute] ──► [FP8 result]
[NVFP4 KV cache] ─┘ │
▼
[FP8→NVFP4 quant] ──► [NVFP4 KV append]
- NVFP4 → FP8 dequant before attention compute
- FP8 is mandatory intermediate precision
- New K/V quantized to NVFP4 before cache append
- SM120 uses emulation for scalar FP8; native FP8 is tensor core MMA only
- `__nv_cvt_float_to_fp8()` intrinsic produces the wrong exponent on SM120 (off by +6)
- Use software conversion: `float_to_fp8_e4m3_sw()` in `nvfp4_kv_dequantize.cu`
- Applies to all NVFP4→FP8 dequant paths
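For orientation, a minimal sketch of what a software E4M3 conversion has to do (an illustrative round-to-nearest-even, saturating converter; this is not the repo's `float_to_fp8_e4m3_sw()`):

```cuda
#include <stdint.h>
#include <math.h>

// Sketch: float -> FP8 E4M3 (bias 7, max finite 448, NaN = 0x7F) with
// round-to-nearest-even and saturation. Not the repo's implementation.
__host__ __device__ inline uint8_t float_to_e4m3_sketch(float x) {
    const uint8_t sign = (x < 0.0f) ? 0x80 : 0x00;
    float a = fabsf(x);
    if (a != a)     return sign | 0x7F;      // NaN
    if (a > 448.0f) return sign | 0x7E;      // saturate to max finite (1.75 * 2^8)
    int e = 0;
    frexpf(a, &e);                            // a = m * 2^e, m in [0.5, 1)
    int unbiased = e - 1;
    if (unbiased < -6) unbiased = -6;         // subnormals share the 2^-6 scale
    const float quantum = ldexpf(1.0f, unbiased - 3);  // spacing with 3 mantissa bits
    const float r = rintf(a / quantum) * quantum;      // round to nearest even
    if (r == 0.0f) return sign;               // underflow to zero
    frexpf(r, &e);
    const int exp2 = e - 1;
    if (exp2 < -6)                             // subnormal: r = mant/8 * 2^-6
        return sign | (uint8_t)rintf(ldexpf(r, 9));
    const uint8_t ebits = (uint8_t)(exp2 + 7);
    const uint8_t mbits = (uint8_t)rintf(ldexpf(r, 3 - exp2)) & 0x7;  // drop the implicit 1
    return sign | (uint8_t)(ebits << 3) | mbits;
}
```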
- NVFP4: 1 UE4M3 scale per 16 E2M1 values
- Scale layout matches SM100 (portable between archs)
- Two-level scaling: per-block E4M3 + per-tensor FP32
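A concrete reading of that layout as a dequantization sketch (helper names and the nibble packing order are assumptions for illustration, not the repo's kernels): each output value is E2M1(code) × UE4M3(block scale) × FP32 per-tensor scale.

```cuda
#include <stdint.h>
#include <math.h>

// Decode one E2M1 code (1 sign, 2 exponent, 1 mantissa bit).
// Magnitudes for codes 0..7: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
__device__ __forceinline__ float decode_e2m1(uint8_t nib) {
    const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const float m = lut[nib & 0x7];
    return (nib & 0x8) ? -m : m;
}

// Decode one UE4M3 block scale (E4M3 fields with the sign bit assumed zero; bias 7).
__device__ __forceinline__ float decode_ue4m3(uint8_t s) {
    const int e = (s >> 3) & 0xF, m = s & 0x7;
    return (e == 0) ? ldexpf((float)m, -9)                      // subnormal: m/8 * 2^-6
                    : ldexpf(1.0f + 0.125f * (float)m, e - 7);  // normal
}

// Dequantize one 16-element NVFP4 block (two E2M1 codes per byte; nibble order assumed).
__device__ void dequant_nvfp4_block(const uint8_t packed[8], uint8_t block_scale,
                                    float tensor_scale, float out[16]) {
    const float s = decode_ue4m3(block_scale) * tensor_scale;   // two-level scaling
    for (int i = 0; i < 16; ++i) {
        const uint8_t byte = packed[i >> 1];
        const uint8_t nib  = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0xF);
        out[i] = decode_e2m1(nib) * s;
    }
}
```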
- RTX 5090: No NVLink, PCIe only
- DGX Spark: 200 Gbps ConnectX-7 (Ethernet, no RDMA)
- PCIe x8 single-GPU: <2% perf impact
- PCIe x8 multi-GPU TP: 20-40% perf loss
- Optimize for single-GPU first
Full citations: verified_facts.json
Local documentation library at docs/ (~2.4 GB).
| Section | Path | Contents |
|---|---|---|
| GPU Whitepapers | reference/gpu/whitepapers/ | 9 PDFs — Blackwell RTX, Blackwell DC, Ada, Ampere, Turing, Volta, Pascal |
| GPU Datasheets | reference/gpu/datasheets/ | 17 PDFs — RTX 5090, RTX Pro 6000, DGX Spark, B200, A100, H100, H200, L4, T4, V100 |
| CUDA Toolkit | reference/cuda/pdf/ | 53 PDFs — Programming Guide, PTX ISA 9.1, Blackwell/Hopper/Ada tuning & compat guides, cuBLAS, cuSPARSE, NVVM IR, etc. |
| CUDA HTML Docs | reference/cuda/ | 89 subdirs — parallel-thread-execution, cuda-c-programming-guide, blackwell-tuning-guide, inline-ptx-assembly, etc. |
| cuDNN | reference/cudnn/ | api/, backend/, frontend/, developer-guide/, installation/, latest/ |
| CUTLASS | reference/cutlass/latest/ | Scraped HTML docs — overview, changelog, index |
| DGX | reference/dgx/ | 27 product manuals — DGX Spark (HTML + PDF), DGX B200/B300, DGX A100/H100/GB200 |
| NVIDIA Blog | reference/blog/blog/ | Saved posts — NVFP4 intro, NVFP4 training, Blackwell Ultra, MoE expert parallelism, PTQ optimization |
| Generation | Whitepaper |
|---|---|
| Blackwell | reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf |
| Ada | reference/gpu/whitepapers/ada-lovelace-architecture-whitepaper.pdf |
| Ampere | reference/gpu/whitepapers/nvidia-ampere-architecture-whitepaper.pdf |
| Turing | reference/gpu/whitepapers/nvidia-turing-architecture-whitepaper.pdf |
| Volta | reference/gpu/whitepapers/volta-architecture-whitepaper.pdf |
| Pascal | reference/gpu/whitepapers/pascal-architecture-whitepaper.pdf |
| Need | File |
|---|---|
| PTX ISA | reference/cuda/pdf/ptx_isa_9.1.pdf |
| Blackwell tuning | reference/cuda/pdf/Blackwell_Tuning_Guide.pdf |
| Blackwell compat | reference/cuda/pdf/Blackwell_Compatibility_Guide.pdf |
| CUDA C Programming | reference/cuda/pdf/CUDA_C_Programming_Guide.pdf |
| RTX Blackwell arch | reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf |
| Blackwell DC arch | reference/gpu/whitepapers/blackwell-architecture-hardwareand.pdf |
| Blackwell microbench | reference/gpu/datasheets/blackwell-microbenchmarks-arxiv.pdf |
| DGX Spark guide | reference/dgx/dgx-spark/ |
| NVFP4 blog | reference/blog/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ |
| NVFP4 training blog | reference/blog/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/ |
| Snapshot | Version | Purpose |
|---|---|---|
| 20260124_cutlass-v4.3.0/ | CUTLASS 4.3.0 | Kernel primitives, SM120 examples (77, 81, 82) |
| 20260125_vllm-v0.14.0rc1/ | vLLM 0.14.0-rc1 | Serving framework, quantization backends |
| 20260125_sglang-v0.5.8/ | SGLang 0.5.8 | Structured generation, attention backends |
| 20260125_flashinfer-v0.6.2/ | FlashInfer 0.6.2 | Attention kernel library |
| 20260125_flash-attention-v2.8.3/ | FlashAttention 2.8.3 | Reference attention implementation |
| 20260125_exo-explore-exo/ | Exo | Distributed inference |
| 20260122_dgx-spark-playbooks/ | — | DGX Spark deployment playbooks |
| 20251210_avarok-vllm-dgx-spark/ | — | vLLM DGX Spark adaptation |
| 20251201_dgx-spark-config-v1.0.1/ | — | DGX Spark system config |
| 20251124_rtx-ai-toolkit-archived/ | — | RTX AI Toolkit (archived) |
| 20251017_rtx-pro-6000-vs-dgx-spark/ | — | RTX Pro 6000 vs DGX Spark benchmarks/comparison |
| vggt/ | VGGT | Visual Geometry Grounded Transformer |
| Repo | Purpose |
|---|---|
| streaming-vlm/ | Streaming VLM inference (git repo with training + inference code) |
Sorted by single-stream throughput. TPS = tokens/second.
| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | TP | VRAM |
|---|---|---|---|---|---|---|---|---|
| 255 | 6,396 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | vllm | 1 | - |
| 198 | 8,463 | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | trt-llm | 1 | 6.71 |
| 173 | 6,545 | gpt-oss-120b-MXFP4 | 120B/5.1B | MoE | MXFP4 | vllm | 1 | 95 |
| 131 | 4,778 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 44.2 |
| 105 | 404 | MiniMax-M2.1-AWQ-4bit | 456B/45B | MoE | AWQ | vllm | 2 | - |
| 78.0 | 1,004 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | sglang | 1 | 82 |
| 75.2 | 1,355 | MiniMax-M2.1-NVFP4 | 456B/45B | MoE | NVFP4 | vllm | 2 | 122 |
| 61.5 | 1,630 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 109B/17B | MoE | NVFP4 | vllm | 1 | 89.7 |
| 58.3 | 1,304 | Qwen3-235B-A22B-NVFP4 | 235B/22B | MoE | NVFP4 | vllm | 2 | 95.5 |
| 43.6 | 2,510 | Nemotron-Nano-12B-v2-VL-NVFP4-QAD | 12B | VLM-Dense | NVFP4 | trt-llm | 1 | 9.87 |
| 31.0 | — | chandra | 8B | Dense | BF16 | vllm | 1 | |
| 26.9 | 1,812 | Nemotron-Super-49B-v1_5-NVFP4 | 49B | Dense | NVFP4 | vllm | 1 | 89.6 |
| 25.8 | 603 | Qwen3-VL-30B-A3B-Instruct-NVFP4 | 30B/3B | VLM-MoE | NVFP4 | vllm | 1 | 18.2 |
| 21.5 | 1,959 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 90.7 |
| 18.8 | 1,225 | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 1 | 89.7 |
| 14.0 | — | olmOCR-2-7B-1025-FP8 | 7B | Dense | FP8 | vllm | 1 | |
| 7.1 | 2,064 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | - |
| 6.1 | 304 | Qwen3-VL-235B-A22B-Instruct-NVFP4 | 235B/22B | VLM-MoE | NVFP4 | vllm | 2 | 94 |
| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM |
|---|---|---|---|---|---|---|---|
| 180 | — | Qwen3-0.6B-NVFP4 | 600M | Dense | NVFP4 | vllm | 0.60 |
| 99.7 | — | Qwen3-1.7B-NVFP4 | 2B | Dense | NVFP4 | vllm | 1.00 |
| 60.1 | — | Qwen3-30B-A3B-NVFP4 | 30B/3.3B | MoE | NVFP4 | tensorrt-llm | 16.85 |
| 58.0 | — | Phi-4-multimodal-instruct-NVFP4 | 14B | VLM-Dense | NVFP4 | tensorrt-llm | 19.40 |
| 55.6 | — | Qwen3-4B-Instruct-2507-NVFP4 | 4B | Dense | NVFP4 | vllm | 3.00 |
| 42.7 | 188 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | 74.89 |
| 41.3 | — | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | tensorrt-llm | 22.00 |
| 39.9 | — | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | |
| 38.6 | 1,128 | Qwen3-30B-A3B-FP4 | 30B/3B | MoE | NVFP4 | trt-llm | 19.2 |
| 28.5 | 28.5 | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | vllm | - |
| 28.2 | — | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | tensorrt-llm | 7.00 |
| 27.6 | 441 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | trt-llm | 44.2 |
| 27.3 | — | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | 9B | Dense | NVFP4 | vllm | 7.34 |
| 20.2 | — | Qwen3-14B-NVFP4 | 14B | Dense | NVFP4 | vllm | 9.00 |
| 19.6 | 19.6 | Phi-4-reasoning-plus-NVFP4 | 14B | Dense | NVFP4 | trt-llm | - |
| 19.4 | — | Phi-4-reasoning-plus-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 17.0 | 17.0 | Qwen3-14B-FP4 | 14.8B | Dense | NVFP4 | trt-llm | - |
| 17.0 | — | Qwen3-14B-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 16.3 | — | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 56B/17.0B | MoE | NVFP4 | tensorrt-llm | 114.23 |
| 10.2 | — | Qwen3-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 10.0 | — | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 8.2 | — | Qwen3-32B-NVFP4 | 33B | Dense | NVFP4 | tensorrt-llm | 105.00 |
| 4.8 | 398 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 44.3 |
| 4.8 | 75.8 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | |
| 4.6 | — | Llama-3.1-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 41.00 |
| 3.1 | — | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | tensorrt-llm | 60.00 |
| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM |
|---|---|---|---|---|---|---|---|
| 152 | 5,026 | Llama-3.2-3B-Instruct | 3B | Dense | FP8 | vllm | 5 |
| 88.3 | 3,566 | DeepSeek-R1-Distill-Qwen-7B | 7B | Dense | W8A8 | vllm | 10 |
| 81.0 | 197 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | ollama | 16 |
| 60.1 | 2,586 | Qwen2.5-7B-Instruct | 7B | Dense | BF16 | vllm | - |
| 48.0 | 1,174 | DeepSeek-R1-Distill-Qwen-14B | 14B | Dense | W8A8 | vllm | 14 |
| GPU | Model | Type | Quant | Platform | Notes |
|---|---|---|---|---|---|
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | 104 single, 1951 batched (ctx=512 par=64) |
| RTX Pro 6000 | DeepSeek-OCR | Dense | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| RTX 4090 | personaplex-7b-v1 | Dense | BF16 | pytorch | Moshi voice model |
| RTX 4090 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| DGX Spark | Nemotron-Super-49B-v1_5-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, 3 deployment attempts in logs, served requests successfully |
| DGX Spark | Llama-4-Scout-17B-16E-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, port 8042, 75.22 GiB model+CUDA |
| DGX Spark | Qwen3-32B-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, loaded from logs |
| GPU | Model | Quant | Platform | Notes |
|---|---|---|---|---|
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-Thinking-2507-AWQ | AWQ | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Thinking-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | gpt-oss-20B-NVFP4A16-BF16 | NVFP4 | vllm | |
| RTX Pro 6000 | gpt-oss-20b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | twitter-roberta-base-sentiment-latest | BF16 | transformers | Sentiment analysis, 1556 samples/sec |
| RTX Pro 6000 | bart-large-cnn | BF16 | transformers | Text summarization, 86 samples/sec |
| RTX Pro 6000 | bart-large-mnli | BF16 | transformers | Zero-shot classification, 415 samples/sec |
| RTX Pro 6000 | gpt-oss-120b | MXFP4 | vllm | Older directory naming, config.json present |
| RTX Pro 6000 | Qwen3-Omni-30B-A3B-Abliterated-Voice | BF16 | custom | Voice-enabled Qwen3 Omni, custom setup |
| RTX Pro 6000 | NVFP4-GEMM+Attention | NVFP4 | cutlass | Custom CUTLASS kernels: GEMM 38% peak, attention, KV quant/dequant, MoE GEMM. In development. |
| RTX Pro 6000 | NVFP4-KV-Cache | NVFP4 | trt-llm | TRT-LLM fork adding NVFP4 KV cache (issue #10241). MMHA integration complete, awaiting test. |
| RTX 4090 | Qwen2.5-32B-Instruct-AWQ | AWQ | vllm | |
| RTX 4090 | Qwen3-30B-A3B-GPTQ-Int4 | GPTQ | vllm | |
| RTX 4090 | Qwen3-Coder-30B-A3B-Instruct | GGUF | vllm | |
| RTX 4090 | personaplex-7b-v1 | BF16 | pytorch | Second instance of Moshi voice model |
| RTX 4090 | Qwen3-Embedding-0.6B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B-vllm-W8A8 | W8A8 | vllm | |
| RTX 4090 | Qwen3VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-8B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| DGX Spark | Qwen3-14B-FP4 | NVFP4 | trt-llm | |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| DGX Spark | gpt-oss-120b-MXFP4 | MXFP4 | vllm |
| GPU | Model | Quant | Platform | Reason |
|---|---|---|---|---|
| RTX Pro 6000 | personaplex-7b-v1 | BF16 | moshi | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-7B-Instruct-NVFP4 | NVFP4 | vllm | vLLM variant, TRT-LLM version works (8463 TPS) |
| RTX Pro 6000 | Qwen2.5-VL-32B-Instruct-AWQ | AWQ | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-72B-Instruct-AWQ | AWQ | vllm | OOM or incompatible |
| RTX Pro 6000 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | NVFP4 | trt-llm | Chunked attention kernel SM90-only |
| RTX Pro 6000 | sam2-hiera-large | BF16 | pytorch | SAM2 vision encoder |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | trt-llm | NotImplementedError: compressed-tensors nvfp4-pack-quantized format unsupported |
| DGX Spark | Qwen3-30B-A3B-NVFP4 | NVFP4 | vllm | MoE: 30.5B total, 3.3B active. Model loads but stuck in scheduler loop. GB10 sm_121 MoE scheduler incompatibility. |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | MoE: 80B total, 3.9B active. Gated DeltaNet hybrid attention not supported. No workaround for GB10. |
| DGX Spark | Mistral-Small-3.2-24B-Instruct-2506-NVFP4 | NVFP4 | vllm | Tokenizer error: MistralTokenizer has no convert_tokens_to_ids. Tekken tokenizer incompatible. |
| DGX Spark | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | NVFP4 | tensorrt-llm | NemotronHForCausalLM architecture not supported in TRT-LLM. Use vLLM instead (27.27 TPS). |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | Community quant uses compressed-tensors format, not official NVIDIA NVFP4. TRT-LLM cannot parse. |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | vllm | CUDA driver init error: Error 35. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | vllm | Mamba hybrid architecture requires HybridMambaAttentionDecoderLayer not supported. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | tensorrt-llm | Mamba hybrid architecture not supported by TRT-LLM NVFP4 backend. |
| TPS | Model | Quant | Scaling Efficiency | Knee (parallelism) | Knee TPS |
|---|---|---|---|---|---|
| 5,176 | gpt-oss-20b-MXFP4 | MXFP4 | 36.5% | 16 | 1,783 |
| 3,269 | Nemotron-Super-49B-NVFP4 | NVFP4 | 104.1% | 128 | 3,269 |
| 2,165 | gpt-oss-120b-MXFP4 | MXFP4 | 26.3% | 8 | 610 |
| 1,557 | Llama-4-Scout-17B-16E-NVFP4 | NVFP4 | 43.2% | 16 | 492 |
| Quant | Model | TPS | Relative |
|---|---|---|---|
| NVFP4 | MiniMax-M2.1-NVFP4 | 1,355 | 1.0x |
| AWQ | MiniMax-M2.1-AWQ-4bit | 405 | 0.3x |
| NVFP4 | Qwen3-Next-80B-NVFP4 | 4,778 | 1.0x |
| FP8 | Qwen3-Next-80B-FP8 | 781 | 0.16x |
User Models --> ModelOpt (NVFP4/FP8 quantization)
|
+-----------+-----------+
v v
vLLM TRT-LLM
(community) (NVIDIA)
| |
v v
FlashInfer TRT-LLM Plugins
(UW Academic) (C++ kernels)
| |
+-----------+-----------+
v
CUTLASS
(NVIDIA)
GEMM + NVFP4 primitives
(no attention/KV cache)
|
+---------+---------+
v v
SM120 SM121
RTX Pro 6000 DGX Spark
Use the `/deploy` skill for deploying containers. Quick ref: `./deploy <dir> [--dry-run|--generate|--down|--shared-tenancy]`
| SM | GPU | Stack | Reason |
|---|---|---|---|
| SM120 | RTX Pro 6000 | vLLM + FlashInfer | TRT-LLM missing chunked attention SM120 kernel |
| SM121 | DGX Spark | TRT-LLM | 1.1x faster than vLLM on the same model |
| SM89 | RTX 4090 | vLLM | Dense models only, 24 GB limit |
| Model | Platform | Issue | Workaround |
|---|---|---|---|
| Llama-4-Scout-17B-16E | TRT-LLM | Chunked attention kernel SM90-only | Use vLLM |
| Qwen3-Next-80B (Mamba hybrid) | TRT-LLM | Mamba hybrid overhead | enable_block_reuse=false |
| Qwen3-Next-80B-NVFP4 | SGLang | SMEM overflow on SM120 | Use vLLM |
| gpt-oss-120b-MXFP4+eagle3 | vLLM | Speculative decoding broken | Use without spec decode |
| Issue | Repo | Description | Status |
|---|---|---|---|
| #10241 | TensorRT-LLM | NVFP4 KV cache support request | Open |
| #5018 | TensorRT-LLM | SM120 (5090) NVFP4 support | Confirmed in 0.20.0rc3+ |
| #31085 | vLLM | SM120 FP4 kernel detection fails | Unknown |
| Project | URL |
|---|---|
| CUTLASS | github.com/NVIDIA/cutlass |
| TensorRT-LLM | github.com/NVIDIA/TensorRT-LLM |
| vLLM | github.com/vllm-project/vllm |
| FlashInfer | github.com/flashinfer-ai/flashinfer |
| FlashAttention | github.com/Dao-AILab/flash-attention |
| ModelOpt | github.com/NVIDIA/TensorRT-Model-Optimizer |