Blackwell LLM Deployment Lab

49 benchmarked | 14 working | 29 pending | 17 failed — across 3 GPU architectures

Canonical Docs and Public Attribution

  • README.md is the canonical root document and the only root doc intended for publication.
  • Local mirrors CLAUDE.md, AGENTS.md, and AGENT.md are symlinks to README.md and are git-ignored.
  • Recreate local mirrors with ./scripts/sync_local_docs.sh.
  • Public attribution identity for this repo is dev@primitivecontext.com.
  • Local git hooks (in .githooks/) enforce attribution and block public AI co-author tags.

Operational Context

  • Work is executed by a large agent swarm (up to ~20 concurrent) in a full-freedom dev sandbox.
  • Research tools are valuable but sometimes stale or contradictory; agent triage can diverge.
  • Timeline estimates will be laughed at. Work is performed right now, with agents scaled appropriately.

2026 Blackwell Hardware

| Target | Chip | Products | Memory |
|---|---|---|---|
| SM100 | B200 | Datacenter Blackwell | TMEM + HBM |
| SM120 | GB202 | RTX 5090, RTX Pro 6000 | 96 GB GDDR7 |
| SM121 | GB10 | DGX Spark | 128 GB unified LPDDR5x |

| GPU | Arch | Count |
|---|---|---|
| RTX Pro 6000 (96 GB GDDR7) | SM120 | 46 deployments |
| DGX Spark (128 GB LPDDR5x) | SM121 | 41 deployments |
| RTX 4090 (24 GB GDDR6X) | SM89 | 22 deployments |

Deconfliction (WHY)

Hardware: SM100 vs SM120/SM121

| Capability | SM100 | SM120/SM121 | Impact |
|---|---|---|---|
| Tensor Memory | 256 KB TMEM/SM | None | Operands must live in registers |
| TMA Multicast | Yes | Disabled | Cluster shape locked to 1×1×1 |
| Instruction Set | tcgen05.mma | mma.sync.aligned.block_scale | Different memory models, layouts incompatible |
| Operand Location | SMEM/TMEM | Registers | Register pressure is the bottleneck |

Bottom line: SM100 kernels will not run on SM120/SM121. Code compiled for sm_100a traps on consumer Blackwell.
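
A minimal runtime-dispatch sketch (not code from this repo): pick a kernel path from the device's reported compute capability, since an SM100 binary cannot be reused on SM120/SM121. The `select_path` name and the two path labels are illustrative placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only: B200 (SM100) reports compute capability 10.x,
// GB202 (SM120) reports 12.0, GB10 (SM121) reports 12.1.
enum class BlackwellPath { Sm100Tmem, Sm12xRegister, Unsupported };

BlackwellPath select_path(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return BlackwellPath::Unsupported;
    if (prop.major == 10) return BlackwellPath::Sm100Tmem;      // tcgen05 / TMEM kernels
    if (prop.major == 12) return BlackwellPath::Sm12xRegister;  // register-resident block-scale MMA
    return BlackwellPath::Unsupported;
}

int main() {
    switch (select_path(0)) {
        case BlackwellPath::Sm100Tmem:     std::puts("dispatch: SM100 (TMEM) kernels");    break;
        case BlackwellPath::Sm12xRegister: std::puts("dispatch: SM120/SM121 kernels");     break;
        default:                           std::puts("no supported Blackwell NVFP4 path"); break;
    }
    return 0;
}
```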

Platforms: vLLM vs TensorRT-LLM vs Custom CUTLASS

| Platform | SM120 NVFP4 Status | Limitation |
|---|---|---|
| vLLM | Partial | SM120 FP4 kernel detection fails (#31085), falls back to Marlin (40-50% perf loss) |
| TensorRT-LLM | Partial | NVFP4 KV cache support requested but not shipped (#10241) |
| SGLang | Blocked | All attention backends fail for MoE on SM120 (triton SMEM overflow) |
| Custom CUTLASS | Required | Only path to NVFP4 weights + NVFP4 KV cache on SM120/SM121 |

Bottom line: No off-the-shelf stack gives NVFP4 model weights + NVFP4 KV cache on consumer Blackwell.

Quantization: NVFP4 vs MXFP4 vs FP8

| Format | Data Bits | Scale Type | Block Size | Use Case |
|---|---|---|---|---|
| NVFP4 | E2M1 (4-bit) | UE4M3 (unsigned) | 16 elements | Weights + KV cache (2× mem reduction) |
| MXFP4 | E2M1 (4-bit) | UE8M0 (power-of-2) | 32 elements | OCP-compliant, coarser scaling |
| FP8 | E4M3/E5M2 | N/A | N/A | Intermediate compute precision |

Bottom line: NVFP4 with UE4M3 scales gives finer granularity (16 vs 32 blocks) and 137× more dynamic range than signed E4M3.
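
For reference, E2M1 encodes only sixteen codes (eight magnitudes plus a sign). A minimal decode sketch, assuming low-nibble-first packing (the packing order is an assumption for illustration, not taken from this repo's kernels):

```cuda
#include <cstdint>

// E2M1 = 1 sign, 2 exponent, 1 mantissa bit; magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
__host__ __device__ inline float e2m1_decode(unsigned code4) {
    const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    float mag = lut[code4 & 0x7];
    return (code4 & 0x8) ? -mag : mag;
}

// Two E2M1 values per byte; low nibble first is assumed here.
__host__ __device__ inline void e2m1_unpack2(uint8_t byte, float &lo, float &hi) {
    lo = e2m1_decode(byte & 0xF);
    hi = e2m1_decode(byte >> 4);
}
```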

Why Custom Kernel?

  1. vLLM/TRT-LLM don't support NVFP4 KV cache on SM120
  2. SM120 lacks TMEM so SM100 kernels fail
  3. NVFP4 weights + NVFP4 KV cache = maximum memory efficiency
  4. MoE + GQA patterns need specialized attention

Build Facts (HOW)

Targets

  • Use sm_120a for RTX 5090 / RTX Pro 6000
  • Use sm_121a for DGX Spark
  • Base sm_120 cannot compile block-scaled instructions
  • PTX ISA 8.7 added sm_120a, PTX ISA 8.8 added sm_121a
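
A minimal compile-time guard sketch for these targets. The kernel body is a placeholder; the block-scaled MMA itself additionally requires building with the `a` variant (`-arch=sm_120a` / `-arch=sm_121a`) and would be issued via inline PTX or CUTLASS.

```cuda
// Build with: nvcc -arch=sm_120a kernel.cu   (RTX 5090 / RTX Pro 6000)
//             nvcc -arch=sm_121a kernel.cu   (DGX Spark)
__global__ void nvfp4_gemm_kernel(/* operands elided */) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1200)
    // SM120/SM121 path: register-resident operands, block-scaled MMA goes here.
#else
    // No NVFP4 block-scale path on older architectures.
    __trap();
#endif
}
```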

Instruction Path

  • Use mma.sync.aligned.kind::mxf4nvf4.block_scale
  • All operands (A/B/ACC/SFA/SFB) reside in registers
  • Cluster shape fixed to 1×1×1 (no multicast)
  • TMA available for GMEM↔SMEM movement

Memory Model

  • No TMEM access (SM100-only)
  • TMA loads to shared memory, not tensor memory
  • Register file is the primary operand store
  • Manage register pressure carefully
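
Because operands are register-resident, occupancy is governed by the per-thread register budget rather than shared memory. One way to keep that budget explicit is `__launch_bounds__`; the numbers below are placeholders, not tuned values from this repo.

```cuda
constexpr int kThreadsPerBlock = 128;  // placeholder block size
constexpr int kMinBlocksPerSM  = 2;    // placeholder occupancy target

// Caps registers per thread so the compiler spills to local memory instead of
// silently reducing resident blocks; pair with -Xptxas -v to inspect usage.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
nvfp4_attention_kernel(/* operands elided */) {
    // A/B fragments, accumulators, and scale factors are register-resident on SM120/SM121.
}
```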

NVFP4 Pipeline

[NVFP4 weights] ──┐
                  ├──► [FP8 dequant] ──► [FP8 compute] ──► [FP8 result]
[NVFP4 KV cache] ─┘                                              │
                                                                 ▼
                                              [FP8→NVFP4 quant] ──► [NVFP4 KV append]
  • NVFP4 → FP8 dequant before attention compute
  • FP8 is mandatory intermediate precision
  • New K/V quantized to NVFP4 before cache append

FP8 Scalar Conversion

  • SM120 uses emulation for scalar FP8; native FP8 is tensor core MMA only
  • __nv_cvt_float_to_fp8() intrinsic produces wrong exponent on SM120 (off by +6)
  • Use software conversion: float_to_fp8_e4m3_sw() in nvfp4_kv_dequantize.cu
  • Applies to all NVFP4→FP8 dequant paths
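
A readable software-conversion sketch (round-to-nearest-even, saturating at 448, NaN encoded as 0x7F). This illustrates what a helper like `float_to_fp8_e4m3_sw()` could do; it is not the repo's exact implementation.

```cuda
#include <cstdint>
#include <math.h>

__host__ __device__ inline uint8_t float_to_e4m3_sw(float x) {
    uint8_t sign = 0;
    if (x < 0.0f) { sign = 0x80; x = -x; }            // -0.0 maps to +0 in this sketch
    if (x != x)       return sign | 0x7F;             // NaN (E4M3 has no Inf)
    if (x > 448.0f)   return sign | 0x7E;             // saturate to max finite, 448 = 1.75 * 2^8
    if (x < 0.015625f)                                // below smallest normal 2^-6: subnormal range
        return sign | (uint8_t)rintf(x * 512.0f);     // step 2^-9; rounding up to 8 gives the smallest normal
    int e;                                            // x = m * 2^e with m in [0.5, 1)
    float m = frexpf(x, &e);
    int E = e - 1;                                    // x = (2m) * 2^E with 2m in [1, 2)
    int mant = (int)rintf((2.0f * m - 1.0f) * 8.0f);  // 3-bit mantissa, round-to-nearest-even
    if (mant == 8) { mant = 0; ++E; }                 // mantissa overflowed into the exponent
    int code = ((E + 7) << 3) | mant;                 // exponent bias 7
    return sign | (uint8_t)(code > 0x7E ? 0x7E : code);
}
```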

Scale Factor Layout

  • NVFP4: 1 UE4M3 scale per 16 E2M1 values
  • Scale layout matches SM100 (portable between archs)
  • Two-level scaling: per-block E4M3 + per-tensor FP32
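
Putting the two levels together, a dequantization sketch for one 16-element block, shown producing float for clarity (the kernels dequantize to FP8 for tensor-core compute). It reuses `e2m1_decode` from the earlier sketch; the nibble order and 8-byte packing are illustrative assumptions, not this repo's kernel layout.

```cuda
#include <cstdint>
#include <math.h>

// UE4M3: unsigned, 4-bit exponent (bias 7), 3-bit mantissa, no sign bit.
__host__ __device__ inline float ue4m3_decode(uint8_t s) {
    int exp  = (s >> 3) & 0xF;
    int mant = s & 0x7;
    if (exp == 0) return ldexpf(mant / 8.0f, -6);        // subnormal block scales
    return ldexpf(1.0f + mant / 8.0f, exp - 7);
}

// real[i] ~= e2m1_decode(q[i]) * ue4m3_decode(block_scale) * tensor_scale
__host__ __device__ inline void dequant_nvfp4_block16(const uint8_t packed[8],
                                                      uint8_t block_scale,
                                                      float   tensor_scale,
                                                      float   out[16]) {
    const float s = ue4m3_decode(block_scale) * tensor_scale;
    for (int i = 0; i < 8; ++i) {
        out[2 * i]     = e2m1_decode(packed[i] & 0xF) * s;  // low nibble (assumed order)
        out[2 * i + 1] = e2m1_decode(packed[i] >> 4) * s;   // high nibble
    }
}
```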

Connectivity Constraints

  • RTX 5090: No NVLink, PCIe only
  • DGX Spark: 200 Gbps ConnectX-7 (Ethernet, no RDMA)
  • PCIe x8 single-GPU: <2% perf impact
  • PCIe x8 multi-GPU TP: 20-40% perf loss
  • Optimize for single-GPU first

Full citations: verified_facts.json


Documentation Resource Map

Local documentation library at docs/ (~2.4 GB).

Reference (docs/reference/ — 1.5 GB, static)

| Section | Path | Contents |
|---|---|---|
| GPU Whitepapers | reference/gpu/whitepapers/ | 9 PDFs — Blackwell RTX, Blackwell DC, Ada, Ampere, Turing, Volta, Pascal |
| GPU Datasheets | reference/gpu/datasheets/ | 17 PDFs — RTX 5090, RTX Pro 6000, DGX Spark, B200, A100, H100, H200, L4, T4, V100 |
| CUDA Toolkit | reference/cuda/pdf/ | 53 PDFs — Programming Guide, PTX ISA 9.1, Blackwell/Hopper/Ada tuning & compat guides, cuBLAS, cuSPARSE, NVVM IR, etc. |
| CUDA HTML Docs | reference/cuda/ | 89 subdirs — parallel-thread-execution, cuda-c-programming-guide, blackwell-tuning-guide, inline-ptx-assembly, etc. |
| cuDNN | reference/cudnn/ | api/, backend/, frontend/, developer-guide/, installation/, latest/ |
| CUTLASS | reference/cutlass/latest/ | Scraped HTML docs — overview, changelog, index |
| DGX | reference/dgx/ | 27 product manuals — DGX Spark (HTML + PDF), DGX B200/B300, DGX A100/H100/GB200 |
| NVIDIA Blog | reference/blog/blog/ | Saved posts — NVFP4 intro, NVFP4 training, Blackwell Ultra, MoE expert parallelism, PTQ optimization |

GPU Architecture Whitepapers

| Generation | Whitepaper |
|---|---|
| Blackwell | reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf |
| Ada | reference/gpu/whitepapers/ada-lovelace-architecture-whitepaper.pdf |
| Ampere | reference/gpu/whitepapers/nvidia-ampere-architecture-whitepaper.pdf |
| Turing | reference/gpu/whitepapers/nvidia-turing-architecture-whitepaper.pdf |
| Volta | reference/gpu/whitepapers/volta-architecture-whitepaper.pdf |
| Pascal | reference/gpu/whitepapers/pascal-architecture-whitepaper.pdf |

Key Files for SM12x Work

| Need | File |
|---|---|
| PTX ISA | reference/cuda/pdf/ptx_isa_9.1.pdf |
| Blackwell tuning | reference/cuda/pdf/Blackwell_Tuning_Guide.pdf |
| Blackwell compat | reference/cuda/pdf/Blackwell_Compatibility_Guide.pdf |
| CUDA C Programming | reference/cuda/pdf/CUDA_C_Programming_Guide.pdf |
| RTX Blackwell arch | reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf |
| Blackwell DC arch | reference/gpu/whitepapers/blackwell-architecture-hardwareand.pdf |
| Blackwell microbench | reference/gpu/datasheets/blackwell-microbenchmarks-arxiv.pdf |
| DGX Spark guide | reference/dgx/dgx-spark/ |
| NVFP4 blog | reference/blog/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ |
| NVFP4 training blog | reference/blog/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/ |

Source Snapshots (docs/source/ — 885 MB, dated git repos)

| Snapshot | Version | Purpose |
|---|---|---|
| 20260124_cutlass-v4.3.0/ | CUTLASS 4.3.0 | Kernel primitives, SM120 examples (77, 81, 82) |
| 20260125_vllm-v0.14.0rc1/ | vLLM 0.14.0-rc1 | Serving framework, quantization backends |
| 20260125_sglang-v0.5.8/ | SGLang 0.5.8 | Structured generation, attention backends |
| 20260125_flashinfer-v0.6.2/ | FlashInfer 0.6.2 | Attention kernel library |
| 20260125_flash-attention-v2.8.3/ | FlashAttention 2.8.3 | Reference attention implementation |
| 20260125_exo-explore-exo/ | Exo | Distributed inference |
| 20260122_dgx-spark-playbooks/ | | DGX Spark deployment playbooks |
| 20251210_avarok-vllm-dgx-spark/ | | vLLM DGX Spark adaptation |
| 20251201_dgx-spark-config-v1.0.1/ | | DGX Spark system config |
| 20251124_rtx-ai-toolkit-archived/ | | RTX AI Toolkit (archived) |
| 20251017_rtx-pro-6000-vs-dgx-spark/ | | RTX Pro 6000 vs DGX Spark benchmarks/comparison |
| vggt/ | VGGT | Visual Geometry Grounded Transformer |

Other (docs/sources/ — 30 MB)

| Repo | Purpose |
|---|---|
| streaming-vlm/ | Streaming VLM inference (git repo with training + inference code) |

SM12x Deployment Reference

Benchmarked Deployments

Sorted by single-stream throughput. TPS = tokens/second.

RTX Pro 6000 (SM120)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | TP | VRAM |
|---|---|---|---|---|---|---|---|---|
| 255 | 6,396 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | vllm | 1 | - |
| 198 | 8,463 | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | trt-llm | 1 | 6.71 |
| 173 | 6,545 | gpt-oss-120b-MXFP4 | 120B/5.1B | MoE | MXFP4 | vllm | 1 | 95 |
| 131 | 4,778 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 44.2 |
| 105 | 404 | MiniMax-M2.1-AWQ-4bit | 456B/45B | MoE | AWQ | vllm | 2 | - |
| 78.0 | 1,004 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | sglang | 1 | 82 |
| 75.2 | 1,355 | MiniMax-M2.1-NVFP4 | 456B/45B | MoE | NVFP4 | vllm | 2 | 122 |
| 61.5 | 1,630 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 109B/17B | MoE | NVFP4 | vllm | 1 | 89.7 |
| 58.3 | 1,304 | Qwen3-235B-A22B-NVFP4 | 235B/22B | MoE | NVFP4 | vllm | 2 | 95.5 |
| 43.6 | 2,510 | Nemotron-Nano-12B-v2-VL-NVFP4-QAD | 12B | VLM-Dense | NVFP4 | trt-llm | 1 | 9.87 |
| 31.0 | | chandra | 8B | Dense | BF16 | vllm | 1 | |
| 26.9 | 1,812 | Nemotron-Super-49B-v1_5-NVFP4 | 49B | Dense | NVFP4 | vllm | 1 | 89.6 |
| 25.8 | 603 | Qwen3-VL-30B-A3B-Instruct-NVFP4 | 30B/3B | VLM-MoE | NVFP4 | vllm | 1 | 18.2 |
| 21.5 | 1,959 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 90.7 |
| 18.8 | 1,225 | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 1 | 89.7 |
| 14.0 | | olmOCR-2-7B-1025-FP8 | 7B | Dense | FP8 | vllm | 1 | |
| 7.1 | 2,064 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | - |
| 6.1 | 304 | Qwen3-VL-235B-A22B-Instruct-NVFP4 | 235B/22B | VLM-MoE | NVFP4 | vllm | 2 | 94 |

DGX Spark (SM121)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM |
|---|---|---|---|---|---|---|---|
| 180 | | Qwen3-0.6B-NVFP4 | 600M | Dense | NVFP4 | vllm | 0.60 |
| 99.7 | | Qwen3-1.7B-NVFP4 | 2B | Dense | NVFP4 | vllm | 1.00 |
| 60.1 | | Qwen3-30B-A3B-NVFP4 | 30B/3.3B | MoE | NVFP4 | tensorrt-llm | 16.85 |
| 58.0 | | Phi-4-multimodal-instruct-NVFP4 | 14B | VLM-Dense | NVFP4 | tensorrt-llm | 19.40 |
| 55.6 | | Qwen3-4B-Instruct-2507-NVFP4 | 4B | Dense | NVFP4 | vllm | 3.00 |
| 42.7 | 188 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | 74.89 |
| 41.3 | | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | tensorrt-llm | 22.00 |
| 39.9 | | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | |
| 38.6 | 1,128 | Qwen3-30B-A3B-FP4 | 30B/3B | MoE | NVFP4 | trt-llm | 19.2 |
| 28.5 | 28.5 | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | vllm | - |
| 28.2 | | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | tensorrt-llm | 7.00 |
| 27.6 | 441 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | trt-llm | 44.2 |
| 27.3 | | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | 9B | Dense | NVFP4 | vllm | 7.34 |
| 20.2 | | Qwen3-14B-NVFP4 | 14B | Dense | NVFP4 | vllm | 9.00 |
| 19.6 | 19.6 | Phi-4-reasoning-plus-NVFP4 | 14B | Dense | NVFP4 | trt-llm | - |
| 19.4 | | Phi-4-reasoning-plus-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 17.0 | 17.0 | Qwen3-14B-FP4 | 14.8B | Dense | NVFP4 | trt-llm | - |
| 17.0 | | Qwen3-14B-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 16.3 | | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 56B/17.0B | MoE | NVFP4 | tensorrt-llm | 114.23 |
| 10.2 | | Qwen3-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 10.0 | | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 8.2 | | Qwen3-32B-NVFP4 | 33B | Dense | NVFP4 | tensorrt-llm | 105.00 |
| 4.8 | 398 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 44.3 |
| 4.8 | 75.8 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | |
| 4.6 | | Llama-3.1-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 41.00 |
| 3.1 | | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | tensorrt-llm | 60.00 |

RTX 4090 (SM89)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM |
|---|---|---|---|---|---|---|---|
| 152 | 5,026 | Llama-3.2-3B-Instruct | 3B | Dense | FP8 | vllm | 5 |
| 88.3 | 3,566 | DeepSeek-R1-Distill-Qwen-7B | 7B | Dense | W8A8 | vllm | 10 |
| 81.0 | 197 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | ollama | 16 |
| 60.1 | 2,586 | Qwen2.5-7B-Instruct | 7B | Dense | BF16 | vllm | - |
| 48.0 | 1,174 | DeepSeek-R1-Distill-Qwen-14B | 14B | Dense | W8A8 | vllm | 14 |

Working (not yet benchmarked)

| GPU | Model | Type | Quant | Platform | Notes |
|---|---|---|---|---|---|
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | 104 single, 1951 batched (ctx=512 par=64) |
| RTX Pro 6000 | DeepSeek-OCR | Dense | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| RTX 4090 | personaplex-7b-v1 | Dense | BF16 | pytorch | Moshi voice model |
| RTX 4090 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| DGX Spark | Nemotron-Super-49B-v1_5-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, 3 deployment attempts in logs, served requests successfully |
| DGX Spark | Llama-4-Scout-17B-16E-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, port 8042, 75.22 GiB model+CUDA |
| DGX Spark | Qwen3-32B-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, loaded from logs |

Pending

| GPU | Model | Quant | Platform | Notes |
|---|---|---|---|---|
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-Thinking-2507-AWQ | AWQ | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Thinking-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | gpt-oss-20B-NVFP4A16-BF16 | NVFP4 | vllm | |
| RTX Pro 6000 | gpt-oss-20b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | twitter-roberta-base-sentiment-latest | BF16 | transformers | Sentiment analysis, 1556 samples/sec |
| RTX Pro 6000 | bart-large-cnn | BF16 | transformers | Text summarization, 86 samples/sec |
| RTX Pro 6000 | bart-large-mnli | BF16 | transformers | Zero-shot classification, 415 samples/sec |
| RTX Pro 6000 | gpt-oss-120b | MXFP4 | vllm | Older directory naming, config.json present |
| RTX Pro 6000 | Qwen3-Omni-30B-A3B-Abliterated-Voice | BF16 | custom | Voice-enabled Qwen3 Omni, custom setup |
| RTX Pro 6000 | NVFP4-GEMM+Attention | NVFP4 | cutlass | Custom CUTLASS kernels: GEMM 38% peak, attention, KV quant/dequant, MoE GEMM. In development. |
| RTX Pro 6000 | NVFP4-KV-Cache | NVFP4 | trt-llm | TRT-LLM fork adding NVFP4 KV cache (issue #10241). MMHA integration complete, awaiting test. |
| RTX 4090 | Qwen2.5-32B-Instruct-AWQ | AWQ | vllm | |
| RTX 4090 | Qwen3-30B-A3B-GPTQ-Int4 | GPTQ | vllm | |
| RTX 4090 | Qwen3-Coder-30B-A3B-Instruct | GGUF | vllm | |
| RTX 4090 | personaplex-7b-v1 | BF16 | pytorch | Second instance of Moshi voice model |
| RTX 4090 | Qwen3-Embedding-0.6B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B-vllm-W8A8 | W8A8 | vllm | |
| RTX 4090 | Qwen3VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-8B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| DGX Spark | Qwen3-14B-FP4 | NVFP4 | trt-llm | |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| DGX Spark | gpt-oss-120b-MXFP4 | MXFP4 | vllm | |

Failed

| GPU | Model | Quant | Platform | Reason |
|---|---|---|---|---|
| RTX Pro 6000 | personaplex-7b-v1 | | moshi | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-7B-Instruct-NVFP4 | NVFP4 | vllm | vLLM variant; TRT-LLM version works (8463 TPS) |
| RTX Pro 6000 | Qwen2.5-VL-32B-Instruct-AWQ | AWQ | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-72B-Instruct-AWQ | AWQ | vllm | OOM or incompatible |
| RTX Pro 6000 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | NVFP4 | trt-llm | Chunked attention kernel SM90-only |
| RTX Pro 6000 | sam2-hiera-large | BF16 | pytorch | SAM2 vision encoder |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | trt-llm | NotImplementedError: compressed-tensors nvfp4-pack-quantized format unsupported |
| DGX Spark | Qwen3-30B-A3B-NVFP4 | NVFP4 | vllm | MoE: 30.5B total, 3.3B active. Model loads but gets stuck in the scheduler loop. GB10 sm_121 MoE scheduler incompatibility. |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | MoE: 80B total, 3.9B active. Gated DeltaNet hybrid attention not supported. No workaround for GB10. |
| DGX Spark | Mistral-Small-3.2-24B-Instruct-2506-NVFP4 | NVFP4 | vllm | Tokenizer error: MistralTokenizer has no convert_tokens_to_ids. Tekken tokenizer incompatible. |
| DGX Spark | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | NVFP4 | tensorrt-llm | NemotronHForCausalLM architecture not supported in TRT-LLM. Use vLLM instead (27.27 TPS). |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | Community quant uses compressed-tensors format, not official NVIDIA NVFP4. TRT-LLM cannot parse it. |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | vllm | CUDA driver init error: Error 35. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | vllm | Mamba hybrid architecture requires HybridMambaAttentionDecoderLayer, which is not supported. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | tensorrt-llm | Mamba hybrid architecture not supported by TRT-LLM NVFP4 backend. |

Scaling Efficiency (RTX Pro 6000)

| TPS | Model | Quant | Scaling Eff | Knee (par) | Knee TPS |
|---|---|---|---|---|---|
| 5,176 | gpt-oss-20b-MXFP4 | MXFP4 | 36.5% | 16 | 1,783 |
| 3,269 | Nemotron-Super-49B-NVFP4 | NVFP4 | 104.1% | 128 | 3,269 |
| 2,165 | gpt-oss-120b-MXFP4 | MXFP4 | 26.3% | 8 | 610 |
| 1,557 | Llama-4-Scout-17B-16E-NVFP4 | NVFP4 | 43.2% | 16 | 492 |

Quant Comparison (SM120, same model)

| Quant | Model | TPS | Relative |
|---|---|---|---|
| NVFP4 | MiniMax-M2.1-NVFP4 | 1,355 | 1.0x |
| AWQ | MiniMax-M2.1-AWQ-4bit | 405 | 0.3x |
| NVFP4 | Qwen3-Next-80B-NVFP4 | 4,778 | 1.0x |
| FP8 | Qwen3-Next-80B-FP8 | 781 | 0.16x |

Relative = TPS normalized to the NVFP4 run of the same model.

Inference Stack

User Models --> ModelOpt (NVFP4/FP8 quantization)
                    |
        +-----------+-----------+
        v                       v
      vLLM                  TRT-LLM
   (community)              (NVIDIA)
        |                       |
        v                       v
   FlashInfer            TRT-LLM Plugins
  (UW Academic)          (C++ kernels)
        |                       |
        +-----------+-----------+
                    v
                 CUTLASS
                (NVIDIA)
          GEMM + NVFP4 primitives
          (no attention/KV cache)
                    |
          +---------+---------+
          v                   v
        SM120               SM121
   RTX Pro 6000          DGX Spark

Deployment System

Use the /deploy skill for deploying containers. Quick ref: ./deploy <dir> [--dry-run|--generate|--down|--shared-tenancy]

Platform Recommendations

| SM | GPU | Stack | Reason |
|---|---|---|---|
| SM120 | RTX Pro 6000 | vLLM + FlashInfer | TRT-LLM missing chunked attention SM120 kernel |
| SM121 | DGX Spark | TRT-LLM | 1.1x faster than vLLM on same model |
| SM89 | RTX 4090 | vLLM | Dense models only, 24 GB limit |

Known Issues

| Model | Platform | Issue | Workaround |
|---|---|---|---|
| Llama-4-Scout-17B-16E | TRT-LLM | Chunked attention kernel SM90-only | Use vLLM |
| Qwen3-Next-80B (Mamba hybrid) | TRT-LLM | Mamba hybrid overhead | enable_block_reuse=false |
| Qwen3-Next-80B-NVFP4 | SGLang | SMEM overflow on SM120 | Use vLLM |
| gpt-oss-120b-MXFP4+eagle3 | vLLM | Speculative decoding broken | Use without spec decode |

GitHub Issues

| Issue | Repo | Description | Status |
|---|---|---|---|
| #10241 | TensorRT-LLM | NVFP4 KV cache support request | Open |
| #5018 | TensorRT-LLM | SM120 (5090) NVFP4 support | Confirmed in 0.20.0rc3+ |
| #31085 | vLLM | SM120 FP4 kernel detection fails | Unknown |

Projects

| Project | URL |
|---|---|
| CUTLASS | github.com/NVIDIA/cutlass |
| TensorRT-LLM | github.com/NVIDIA/TensorRT-LLM |
| vLLM | github.com/vllm-project/vllm |
| FlashInfer | github.com/flashinfer-ai/flashinfer |
| FlashAttention | github.com/Dao-AILab/flash-attention |
| ModelOpt | github.com/NVIDIA/TensorRT-Model-Optimizer |

About

Production LLM deployment specs for NVIDIA Blackwell GPUs (RTX Pro 6000, DGX Spark). Includes vLLM configurations, benchmarks, load balancer, and throughput calculators for NVFP4/FP8/MoE models.
