Blackwell LLM Deployment Lab

49 benchmarked | 14 working | 29 pending | 17 failed — across 3 GPU architectures

Canonical Docs and Public Attribution

  • README.md is the canonical root document and the only root doc intended for publication.
  • Local mirrors CLAUDE.md, AGENTS.md, and AGENT.md are symlinks to README.md and are git-ignored.
  • Recreate local mirrors with ./scripts/sync_local_docs.sh.
  • Public attribution identity for this repo is dev@primitivecontext.com.
  • Local git hooks (in .githooks/) enforce attribution and block public AI co-author tags.

Operational Context

  • Work is executed by a large agent swarm (up to ~20 concurrent) in a full-freedom dev sandbox.
  • Research tools are valuable but sometimes stale or contradictory; agent triage can diverge.
  • Timeline estimates will be laughed at. Work is performed right now, with agents scaled appropriately.

2026 Blackwell Hardware

| Target | Chip | Products | Memory |
|---|---|---|---|
| SM100 | B200 | Datacenter Blackwell | TMEM + HBM |
| SM120 | GB202 | RTX 5090, RTX Pro 6000 | 96 GB GDDR7 |
| SM121 | GB10 | DGX Spark | 128 GB unified LPDDR5x |

| GPU | Arch | Count |
|---|---|---|
| RTX Pro 6000 (96 GB GDDR7) | SM120 | 46 deployments |
| DGX Spark (128 GB LPDDR5x) | SM121 | 41 deployments |
| RTX 4090 (24 GB GDDR6X) | SM89 | 22 deployments |

Deconfliction (WHY)

Hardware: SM100 vs SM120/SM121

| Capability | SM100 | SM120/SM121 | Impact |
|---|---|---|---|
| Tensor Memory | 256 KB TMEM/SM | None | Operands must live in registers |
| TMA Multicast | Yes | Disabled | Cluster shape locked to 1×1×1 |
| Instruction Set | tcgen05.mma | mma.sync.aligned.block_scale | Different memory models, layouts incompatible |
| Operand Location | SMEM/TMEM | Registers | Register pressure is the bottleneck |

Bottom line: SM100 kernels will not run on SM120/SM121. Code compiled for sm_100a traps on consumer Blackwell.
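
A minimal runtime-dispatch sketch (not code from this repo): pick a kernel path from the device's reported compute capability, since an SM100 binary cannot be reused on SM120/SM121. The `select_path` name and the two path labels are illustrative placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only: B200 (SM100) reports compute capability 10.x,
// GB202 (SM120) reports 12.0, GB10 (SM121) reports 12.1.
enum class BlackwellPath { Sm100Tmem, Sm12xRegister, Unsupported };

BlackwellPath select_path(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return BlackwellPath::Unsupported;
    if (prop.major == 10) return BlackwellPath::Sm100Tmem;      // tcgen05 / TMEM kernels
    if (prop.major == 12) return BlackwellPath::Sm12xRegister;  // register-resident block-scale MMA
    return BlackwellPath::Unsupported;
}

int main() {
    switch (select_path(0)) {
        case BlackwellPath::Sm100Tmem:     std::puts("dispatch: SM100 (TMEM) kernels");    break;
        case BlackwellPath::Sm12xRegister: std::puts("dispatch: SM120/SM121 kernels");     break;
        default:                           std::puts("no supported Blackwell NVFP4 path"); break;
    }
    return 0;
}
```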

Platforms: vLLM vs TensorRT-LLM vs Custom CUTLASS

| Platform | SM120 NVFP4 Status | Limitation |
|---|---|---|
| vLLM | Partial | SM120 FP4 kernel detection fails (#31085), falls back to Marlin (40-50% perf loss) |
| TensorRT-LLM | Partial | NVFP4 KV cache support requested but not shipped (#10241) |
| SGLang | Blocked | All attention backends fail for MoE on SM120 (triton SMEM overflow) |
| Custom CUTLASS | Required | Only path to NVFP4 weights + NVFP4 KV cache on SM120/SM121 |

Bottom line: No off-the-shelf stack gives NVFP4 model weights + NVFP4 KV cache on consumer Blackwell.

Quantization: NVFP4 vs MXFP4 vs FP8

| Format | Data Bits | Scale Type | Block Size | Use Case |
|---|---|---|---|---|
| NVFP4 | E2M1 (4-bit) | UE4M3 (unsigned) | 16 elements | Weights + KV cache (2× mem reduction) |
| MXFP4 | E2M1 (4-bit) | UE8M0 (power-of-2) | 32 elements | OCP-compliant, coarser scaling |
| FP8 | E4M3/E5M2 | N/A | N/A | Intermediate compute precision |

Bottom line: NVFP4 with UE4M3 scales gives finer granularity (16 vs 32 blocks) and 137× more dynamic range than signed E4M3.
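
For reference, E2M1 encodes only sixteen codes (eight magnitudes plus a sign). A minimal decode sketch, assuming low-nibble-first packing (the packing order is an assumption for illustration, not taken from this repo's kernels):

```cuda
#include <cstdint>

// E2M1 = 1 sign, 2 exponent, 1 mantissa bit; magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
__host__ __device__ inline float e2m1_decode(unsigned code4) {
    const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    float mag = lut[code4 & 0x7];
    return (code4 & 0x8) ? -mag : mag;
}

// Two E2M1 values per byte; low nibble first is assumed here.
__host__ __device__ inline void e2m1_unpack2(uint8_t byte, float &lo, float &hi) {
    lo = e2m1_decode(byte & 0xF);
    hi = e2m1_decode(byte >> 4);
}
```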

Why Custom Kernel?

  1. vLLM/TRT-LLM don't support NVFP4 KV cache on SM120
  2. SM120 lacks TMEM so SM100 kernels fail
  3. NVFP4 weights + NVFP4 KV cache = maximum memory efficiency
  4. MoE + GQA patterns need specialized attention

Build Facts (HOW)

Targets

  • Use sm_120a for RTX 5090 / RTX Pro 6000
  • Use sm_121a for DGX Spark
  • Base sm_120 cannot compile block-scaled instructions
  • PTX ISA 8.7 added sm_120a, PTX ISA 8.8 added sm_121a
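
A minimal compile-time guard sketch for these targets. The kernel body is a placeholder; the block-scaled MMA itself additionally requires building with the `a` variant (`-arch=sm_120a` / `-arch=sm_121a`) and would be issued via inline PTX or CUTLASS.

```cuda
// Build with: nvcc -arch=sm_120a kernel.cu   (RTX 5090 / RTX Pro 6000)
//             nvcc -arch=sm_121a kernel.cu   (DGX Spark)
__global__ void nvfp4_gemm_kernel(/* operands elided */) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1200)
    // SM120/SM121 path: register-resident operands, block-scaled MMA goes here.
#else
    // No NVFP4 block-scale path on older architectures.
    __trap();
#endif
}
```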

Instruction Path

  • Use mma.sync.aligned.kind::mxf4nvf4.block_scale
  • All operands (A/B/ACC/SFA/SFB) reside in registers
  • Cluster shape fixed to 1×1×1 (no multicast)
  • TMA available for GMEM↔SMEM movement

Memory Model

  • No TMEM access (SM100-only)
  • TMA loads to shared memory, not tensor memory
  • Register file is the primary operand store
  • Manage register pressure carefully
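
Because operands are register-resident, occupancy is governed by the per-thread register budget rather than shared memory. One way to keep that budget explicit is `__launch_bounds__`; the numbers below are placeholders, not tuned values from this repo.

```cuda
constexpr int kThreadsPerBlock = 128;  // placeholder block size
constexpr int kMinBlocksPerSM  = 2;    // placeholder occupancy target

// Caps registers per thread so the compiler spills to local memory instead of
// silently reducing resident blocks; pair with -Xptxas -v to inspect usage.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
nvfp4_attention_kernel(/* operands elided */) {
    // A/B fragments, accumulators, and scale factors are register-resident on SM120/SM121.
}
```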

NVFP4 Pipeline

[NVFP4 weights] ──┐
                  ├──► [FP8 dequant] ──► [FP8 compute] ──► [FP8 result]
[NVFP4 KV cache] ─┘                                              │
                                                                 ▼
                                              [FP8→NVFP4 quant] ──► [NVFP4 KV append]
  • NVFP4 → FP8 dequant before attention compute
  • FP8 is mandatory intermediate precision
  • New K/V quantized to NVFP4 before cache append

FP8 Scalar Conversion

  • SM120 uses emulation for scalar FP8; native FP8 is tensor core MMA only
  • __nv_cvt_float_to_fp8() intrinsic produces wrong exponent on SM120 (off by +6)
  • Use software conversion: float_to_fp8_e4m3_sw() in nvfp4_kv_dequantize.cu
  • Applies to all NVFP4→FP8 dequant paths
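
A readable software-conversion sketch (round-to-nearest-even, saturating at 448, NaN encoded as 0x7F). This illustrates what a helper like `float_to_fp8_e4m3_sw()` could do; it is not the repo's exact implementation.

```cuda
#include <cstdint>
#include <math.h>

__host__ __device__ inline uint8_t float_to_e4m3_sw(float x) {
    uint8_t sign = 0;
    if (x < 0.0f) { sign = 0x80; x = -x; }            // -0.0 maps to +0 in this sketch
    if (x != x)       return sign | 0x7F;             // NaN (E4M3 has no Inf)
    if (x > 448.0f)   return sign | 0x7E;             // saturate to max finite, 448 = 1.75 * 2^8
    if (x < 0.015625f)                                // below smallest normal 2^-6: subnormal range
        return sign | (uint8_t)rintf(x * 512.0f);     // step 2^-9; rounding up to 8 gives the smallest normal
    int e;                                            // x = m * 2^e with m in [0.5, 1)
    float m = frexpf(x, &e);
    int E = e - 1;                                    // x = (2m) * 2^E with 2m in [1, 2)
    int mant = (int)rintf((2.0f * m - 1.0f) * 8.0f);  // 3-bit mantissa, round-to-nearest-even
    if (mant == 8) { mant = 0; ++E; }                 // mantissa overflowed into the exponent
    int code = ((E + 7) << 3) | mant;                 // exponent bias 7
    return sign | (uint8_t)(code > 0x7E ? 0x7E : code);
}
```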

Scale Factor Layout

  • NVFP4: 1 UE4M3 scale per 16 E2M1 values
  • Scale layout matches SM100 (portable between archs)
  • Two-level scaling: per-block E4M3 + per-tensor FP32
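
Putting the two levels together, a dequantization sketch for one 16-element block, shown producing float for clarity (the kernels dequantize to FP8 for tensor-core compute). It reuses `e2m1_decode` from the earlier sketch; the nibble order and 8-byte packing are illustrative assumptions, not this repo's kernel layout.

```cuda
#include <cstdint>
#include <math.h>

// UE4M3: unsigned, 4-bit exponent (bias 7), 3-bit mantissa, no sign bit.
__host__ __device__ inline float ue4m3_decode(uint8_t s) {
    int exp  = (s >> 3) & 0xF;
    int mant = s & 0x7;
    if (exp == 0) return ldexpf(mant / 8.0f, -6);        // subnormal block scales
    return ldexpf(1.0f + mant / 8.0f, exp - 7);
}

// real[i] ~= e2m1_decode(q[i]) * ue4m3_decode(block_scale) * tensor_scale
__host__ __device__ inline void dequant_nvfp4_block16(const uint8_t packed[8],
                                                      uint8_t block_scale,
                                                      float   tensor_scale,
                                                      float   out[16]) {
    const float s = ue4m3_decode(block_scale) * tensor_scale;
    for (int i = 0; i < 8; ++i) {
        out[2 * i]     = e2m1_decode(packed[i] & 0xF) * s;  // low nibble (assumed order)
        out[2 * i + 1] = e2m1_decode(packed[i] >> 4) * s;   // high nibble
    }
}
```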

Connectivity Constraints

  • RTX 5090: No NVLink, PCIe only
  • DGX Spark: 200 Gbps ConnectX-7 (Ethernet, no RDMA)
  • PCIe x8 single-GPU: <2% perf impact
  • PCIe x8 multi-GPU TP: 20-40% perf loss
  • Optimize for single-GPU first

Full citations: verified_facts.json


Documentation Resource Map

Local documentation library at docs/ (~2.4 GB).

Reference (docs/reference/ — 1.5 GB, static)

| Section | Path | Contents |
|---|---|---|
| GPU Whitepapers | reference/gpu/whitepapers/ | 9 PDFs — Blackwell RTX, Blackwell DC, Ada, Ampere, Turing, Volta, Pascal |
| GPU Datasheets | reference/gpu/datasheets/ | 17 PDFs — RTX 5090, RTX Pro 6000, DGX Spark, B200, A100, H100, H200, L4, T4, V100 |
| CUDA Toolkit | reference/cuda/pdf/ | 53 PDFs — Programming Guide, PTX ISA 9.1, Blackwell/Hopper/Ada tuning & compat guides, cuBLAS, cuSPARSE, NVVM IR, etc. |
| CUDA HTML Docs | reference/cuda/ | 89 subdirs — parallel-thread-execution, cuda-c-programming-guide, blackwell-tuning-guide, inline-ptx-assembly, etc. |
| cuDNN | reference/cudnn/ | api/, backend/, frontend/, developer-guide/, installation/, latest/ |
| CUTLASS | reference/cutlass/latest/ | Scraped HTML docs — overview, changelog, index |
| DGX | reference/dgx/ | 27 product manuals — DGX Spark (HTML + PDF), DGX B200/B300, DGX A100/H100/GB200 |
| NVIDIA Blog | reference/blog/blog/ | Saved posts — NVFP4 intro, NVFP4 training, Blackwell Ultra, MoE expert parallelism, PTQ optimization |

GPU Architecture Whitepapers

| Generation | Whitepaper |
|---|---|
| Blackwell | reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf |
| Ada | reference/gpu/whitepapers/ada-lovelace-architecture-whitepaper.pdf |
| Ampere | reference/gpu/whitepapers/nvidia-ampere-architecture-whitepaper.pdf |
| Turing | reference/gpu/whitepapers/nvidia-turing-architecture-whitepaper.pdf |
| Volta | reference/gpu/whitepapers/volta-architecture-whitepaper.pdf |
| Pascal | reference/gpu/whitepapers/pascal-architecture-whitepaper.pdf |

Key Files for SM12x Work

| Need | File |
|---|---|
| PTX ISA | reference/cuda/pdf/ptx_isa_9.1.pdf |
| Blackwell tuning | reference/cuda/pdf/Blackwell_Tuning_Guide.pdf |
| Blackwell compat | reference/cuda/pdf/Blackwell_Compatibility_Guide.pdf |
| CUDA C Programming | reference/cuda/pdf/CUDA_C_Programming_Guide.pdf |
| RTX Blackwell arch | reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf |
| Blackwell DC arch | reference/gpu/whitepapers/blackwell-architecture-hardwareand.pdf |
| Blackwell microbench | reference/gpu/datasheets/blackwell-microbenchmarks-arxiv.pdf |
| DGX Spark guide | reference/dgx/dgx-spark/ |
| NVFP4 blog | reference/blog/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ |
| NVFP4 training blog | reference/blog/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/ |

Source Snapshots (docs/source/ — 885 MB, dated git repos)

| Snapshot | Version | Purpose |
|---|---|---|
| 20260124_cutlass-v4.3.0/ | CUTLASS 4.3.0 | Kernel primitives, SM120 examples (77, 81, 82) |
| 20260125_vllm-v0.14.0rc1/ | vLLM 0.14.0-rc1 | Serving framework, quantization backends |
| 20260125_sglang-v0.5.8/ | SGLang 0.5.8 | Structured generation, attention backends |
| 20260125_flashinfer-v0.6.2/ | FlashInfer 0.6.2 | Attention kernel library |
| 20260125_flash-attention-v2.8.3/ | FlashAttention 2.8.3 | Reference attention implementation |
| 20260125_exo-explore-exo/ | Exo | Distributed inference |
| 20260122_dgx-spark-playbooks/ | | DGX Spark deployment playbooks |
| 20251210_avarok-vllm-dgx-spark/ | | vLLM DGX Spark adaptation |
| 20251201_dgx-spark-config-v1.0.1/ | | DGX Spark system config |
| 20251124_rtx-ai-toolkit-archived/ | | RTX AI Toolkit (archived) |
| 20251017_rtx-pro-6000-vs-dgx-spark/ | | RTX Pro 6000 vs DGX Spark benchmarks/comparison |
| vggt/ | VGGT | Visual Geometry Grounded Transformer |

Other (docs/sources/ — 30 MB)

| Repo | Purpose |
|---|---|
| streaming-vlm/ | Streaming VLM inference (git repo with training + inference code) |

SM12x Deployment Reference

Benchmarked Deployments

Sorted by single-stream throughput. TPS = tokens/second.

RTX Pro 6000 (SM120)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | TP | VRAM |
|---|---|---|---|---|---|---|---|---|
| 255 | 6,396 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | vllm | 1 | - |
| 198 | 8,463 | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | trt-llm | 1 | 6.71 |
| 173 | 6,545 | gpt-oss-120b-MXFP4 | 120B/5.1B | MoE | MXFP4 | vllm | 1 | 95 |
| 131 | 4,778 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 44.2 |
| 105 | 404 | MiniMax-M2.1-AWQ-4bit | 456B/45B | MoE | AWQ | vllm | 2 | - |
| 78.0 | 1,004 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | sglang | 1 | 82 |
| 75.2 | 1,355 | MiniMax-M2.1-NVFP4 | 456B/45B | MoE | NVFP4 | vllm | 2 | 122 |
| 61.5 | 1,630 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 109B/17B | MoE | NVFP4 | vllm | 1 | 89.7 |
| 58.3 | 1,304 | Qwen3-235B-A22B-NVFP4 | 235B/22B | MoE | NVFP4 | vllm | 2 | 95.5 |
| 43.6 | 2,510 | Nemotron-Nano-12B-v2-VL-NVFP4-QAD | 12B | VLM-Dense | NVFP4 | trt-llm | 1 | 9.87 |
| 31.0 | | chandra | 8B | Dense | BF16 | vllm | 1 | |
| 26.9 | 1,812 | Nemotron-Super-49B-v1_5-NVFP4 | 49B | Dense | NVFP4 | vllm | 1 | 89.6 |
| 25.8 | 603 | Qwen3-VL-30B-A3B-Instruct-NVFP4 | 30B/3B | VLM-MoE | NVFP4 | vllm | 1 | 18.2 |
| 21.5 | 1,959 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 90.7 |
| 18.8 | 1,225 | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 1 | 89.7 |
| 14.0 | | olmOCR-2-7B-1025-FP8 | 7B | Dense | FP8 | vllm | 1 | |
| 7.1 | 2,064 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | - |
| 6.1 | 304 | Qwen3-VL-235B-A22B-Instruct-NVFP4 | 235B/22B | VLM-MoE | NVFP4 | vllm | 2 | 94 |

DGX Spark (SM121)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM |
|---|---|---|---|---|---|---|---|
| 180 | | Qwen3-0.6B-NVFP4 | 600M | Dense | NVFP4 | vllm | 0.60 |
| 99.7 | | Qwen3-1.7B-NVFP4 | 2B | Dense | NVFP4 | vllm | 1.00 |
| 60.1 | | Qwen3-30B-A3B-NVFP4 | 30B/3.3B | MoE | NVFP4 | tensorrt-llm | 16.85 |
| 58.0 | | Phi-4-multimodal-instruct-NVFP4 | 14B | VLM-Dense | NVFP4 | tensorrt-llm | 19.40 |
| 55.6 | | Qwen3-4B-Instruct-2507-NVFP4 | 4B | Dense | NVFP4 | vllm | 3.00 |
| 42.7 | 188 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | 74.89 |
| 41.3 | | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | tensorrt-llm | 22.00 |
| 39.9 | | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | |
| 38.6 | 1,128 | Qwen3-30B-A3B-FP4 | 30B/3B | MoE | NVFP4 | trt-llm | 19.2 |
| 28.5 | 28.5 | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | vllm | - |
| 28.2 | | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | tensorrt-llm | 7.00 |
| 27.6 | 441 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | trt-llm | 44.2 |
| 27.3 | | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | 9B | Dense | NVFP4 | vllm | 7.34 |
| 20.2 | | Qwen3-14B-NVFP4 | 14B | Dense | NVFP4 | vllm | 9.00 |
| 19.6 | 19.6 | Phi-4-reasoning-plus-NVFP4 | 14B | Dense | NVFP4 | trt-llm | - |
| 19.4 | | Phi-4-reasoning-plus-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 17.0 | 17.0 | Qwen3-14B-FP4 | 14.8B | Dense | NVFP4 | trt-llm | - |
| 17.0 | | Qwen3-14B-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 16.3 | | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 56B/17.0B | MoE | NVFP4 | tensorrt-llm | 114.23 |
| 10.2 | | Qwen3-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 10.0 | | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 8.2 | | Qwen3-32B-NVFP4 | 33B | Dense | NVFP4 | tensorrt-llm | 105.00 |
| 4.8 | 398 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 44.3 |
| 4.8 | 75.8 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | |
| 4.6 | | Llama-3.1-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 41.00 |
| 3.1 | | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | tensorrt-llm | 60.00 |

RTX 4090 (SM89)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM |
|---|---|---|---|---|---|---|---|
| 152 | 5,026 | Llama-3.2-3B-Instruct | 3B | Dense | FP8 | vllm | 5 |
| 88.3 | 3,566 | DeepSeek-R1-Distill-Qwen-7B | 7B | Dense | W8A8 | vllm | 10 |
| 81.0 | 197 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | ollama | 16 |
| 60.1 | 2,586 | Qwen2.5-7B-Instruct | 7B | Dense | BF16 | vllm | - |
| 48.0 | 1,174 | DeepSeek-R1-Distill-Qwen-14B | 14B | Dense | W8A8 | vllm | 14 |

Working (not yet benchmarked)

| GPU | Model | Type | Quant | Platform | Notes |
|---|---|---|---|---|---|
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | 104 single, 1951 batched (ctx=512 par=64) |
| RTX Pro 6000 | DeepSeek-OCR | Dense | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| RTX 4090 | personaplex-7b-v1 | Dense | BF16 | pytorch | Moshi voice model |
| RTX 4090 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| DGX Spark | Nemotron-Super-49B-v1_5-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, 3 deployment attempts in logs, served requests successfully |
| DGX Spark | Llama-4-Scout-17B-16E-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, port 8042, 75.22 GiB model+CUDA |
| DGX Spark | Qwen3-32B-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, loaded from logs |

Pending

| GPU | Model | Quant | Platform | Notes |
|---|---|---|---|---|
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-Thinking-2507-AWQ | AWQ | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Thinking-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | gpt-oss-20B-NVFP4A16-BF16 | NVFP4 | vllm | |
| RTX Pro 6000 | gpt-oss-20b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | twitter-roberta-base-sentiment-latest | BF16 | transformers | Sentiment analysis, 1556 samples/sec |
| RTX Pro 6000 | bart-large-cnn | BF16 | transformers | Text summarization, 86 samples/sec |
| RTX Pro 6000 | bart-large-mnli | BF16 | transformers | Zero-shot classification, 415 samples/sec |
| RTX Pro 6000 | gpt-oss-120b | MXFP4 | vllm | Older directory naming, config.json present |
| RTX Pro 6000 | Qwen3-Omni-30B-A3B-Abliterated-Voice | BF16 | custom | Voice-enabled Qwen3 Omni, custom setup |
| RTX Pro 6000 | NVFP4-GEMM+Attention | NVFP4 | cutlass | Custom CUTLASS kernels: GEMM 38% peak, attention, KV quant/dequant, MoE GEMM. In development. |
| RTX Pro 6000 | NVFP4-KV-Cache | NVFP4 | trt-llm | TRT-LLM fork adding NVFP4 KV cache (issue #10241). MMHA integration complete, awaiting test. |
| RTX 4090 | Qwen2.5-32B-Instruct-AWQ | AWQ | vllm | |
| RTX 4090 | Qwen3-30B-A3B-GPTQ-Int4 | GPTQ | vllm | |
| RTX 4090 | Qwen3-Coder-30B-A3B-Instruct | GGUF | vllm | |
| RTX 4090 | personaplex-7b-v1 | BF16 | pytorch | Second instance of Moshi voice model |
| RTX 4090 | Qwen3-Embedding-0.6B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B-vllm-W8A8 | W8A8 | vllm | |
| RTX 4090 | Qwen3VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-8B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| DGX Spark | Qwen3-14B-FP4 | NVFP4 | trt-llm | |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| DGX Spark | gpt-oss-120b-MXFP4 | MXFP4 | vllm | |

Failed

| GPU | Model | Quant | Platform | Reason |
|---|---|---|---|---|
| RTX Pro 6000 | personaplex-7b-v1 | | moshi | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-7B-Instruct-NVFP4 | NVFP4 | vllm | vLLM variant; TRT-LLM version works (8463 TPS) |
| RTX Pro 6000 | Qwen2.5-VL-32B-Instruct-AWQ | AWQ | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-72B-Instruct-AWQ | AWQ | vllm | OOM or incompatible |
| RTX Pro 6000 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | NVFP4 | trt-llm | Chunked attention kernel SM90-only |
| RTX Pro 6000 | sam2-hiera-large | BF16 | pytorch | SAM2 vision encoder |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | trt-llm | NotImplementedError: compressed-tensors nvfp4-pack-quantized format unsupported |
| DGX Spark | Qwen3-30B-A3B-NVFP4 | NVFP4 | vllm | MoE: 30.5B total, 3.3B active. Model loads but gets stuck in the scheduler loop. GB10 sm_121 MoE scheduler incompatibility. |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | MoE: 80B total, 3.9B active. Gated DeltaNet hybrid attention not supported. No workaround for GB10. |
| DGX Spark | Mistral-Small-3.2-24B-Instruct-2506-NVFP4 | NVFP4 | vllm | Tokenizer error: MistralTokenizer has no convert_tokens_to_ids. Tekken tokenizer incompatible. |
| DGX Spark | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | NVFP4 | tensorrt-llm | NemotronHForCausalLM architecture not supported in TRT-LLM. Use vLLM instead (27.27 TPS). |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | Community quant uses compressed-tensors format, not official NVIDIA NVFP4. TRT-LLM cannot parse it. |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | vllm | CUDA driver init error: Error 35. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | vllm | Mamba hybrid architecture requires HybridMambaAttentionDecoderLayer, which is not supported. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | tensorrt-llm | Mamba hybrid architecture not supported by TRT-LLM NVFP4 backend. |

Scaling Efficiency (RTX Pro 6000)

| TPS | Model | Quant | Scaling Eff | Knee (par) | Knee TPS |
|---|---|---|---|---|---|
| 5,176 | gpt-oss-20b-MXFP4 | MXFP4 | 36.5% | 16 | 1,783 |
| 3,269 | Nemotron-Super-49B-NVFP4 | NVFP4 | 104.1% | 128 | 3,269 |
| 2,165 | gpt-oss-120b-MXFP4 | MXFP4 | 26.3% | 8 | 610 |
| 1,557 | Llama-4-Scout-17B-16E-NVFP4 | NVFP4 | 43.2% | 16 | 492 |

Quant Comparison (SM120, same model)

| Quant | Model | TPS | Relative |
|---|---|---|---|
| NVFP4 | MiniMax-M2.1-NVFP4 | 1,355 | 1.0x |
| AWQ | MiniMax-M2.1-AWQ-4bit | 405 | 0.3x |
| NVFP4 | Qwen3-Next-80B-NVFP4 | 4,778 | 1.0x |
| FP8 | Qwen3-Next-80B-FP8 | 781 | 0.16x |

Relative = TPS normalized to the NVFP4 run of the same model.

Inference Stack

User Models --> ModelOpt (NVFP4/FP8 quantization)
                    |
        +-----------+-----------+
        v                       v
      vLLM                  TRT-LLM
   (community)              (NVIDIA)
        |                       |
        v                       v
   FlashInfer            TRT-LLM Plugins
  (UW Academic)          (C++ kernels)
        |                       |
        +-----------+-----------+
                    v
                 CUTLASS
                (NVIDIA)
          GEMM + NVFP4 primitives
          (no attention/KV cache)
                    |
          +---------+---------+
          v                   v
        SM120               SM121
   RTX Pro 6000          DGX Spark

Deployment System

Use the /deploy skill for deploying containers. Quick ref: ./deploy <dir> [--dry-run|--generate|--down|--shared-tenancy]

Platform Recommendations

| SM | GPU | Stack | Reason |
|---|---|---|---|
| SM120 | RTX Pro 6000 | vLLM + FlashInfer | TRT-LLM missing chunked attention SM120 kernel |
| SM121 | DGX Spark | TRT-LLM | 1.1x faster than vLLM on same model |
| SM89 | RTX 4090 | vLLM | Dense models only, 24 GB limit |

Known Issues

| Model | Platform | Issue | Workaround |
|---|---|---|---|
| Llama-4-Scout-17B-16E | TRT-LLM | Chunked attention kernel SM90-only | Use vLLM |
| Qwen3-Next-80B (Mamba hybrid) | TRT-LLM | Mamba hybrid overhead | enable_block_reuse=false |
| Qwen3-Next-80B-NVFP4 | SGLang | SMEM overflow on SM120 | Use vLLM |
| gpt-oss-120b-MXFP4+eagle3 | vLLM | Speculative decoding broken | Use without spec decode |

GitHub Issues

| Issue | Repo | Description | Status |
|---|---|---|---|
| #10241 | TensorRT-LLM | NVFP4 KV cache support request | Open |
| #5018 | TensorRT-LLM | SM120 (5090) NVFP4 support | Confirmed in 0.20.0rc3+ |
| #31085 | vLLM | SM120 FP4 kernel detection fails | Unknown |

Projects

| Project | URL |
|---|---|
| CUTLASS | github.com/NVIDIA/cutlass |
| TensorRT-LLM | github.com/NVIDIA/TensorRT-LLM |
| vLLM | github.com/vllm-project/vllm |
| FlashInfer | github.com/flashinfer-ai/flashinfer |
| FlashAttention | github.com/Dao-AILab/flash-attention |
| ModelOpt | github.com/NVIDIA/TensorRT-Model-Optimizer |

About

Production LLM deployment specs for NVIDIA Blackwell GPUs (RTX Pro 6000, DGX Spark). Includes vLLM configurations, benchmarks, load balancer, and throughput calculators for NVFP4/FP8/MoE models.
