Releases: quantumaikr/quant.cpp
v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)
🆕 Per-channel outlier handling
Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.
Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.
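As a sketch of that split (illustrative only — `split_outliers`, `BLOCK_DIM`, and `K_OUTLIERS` are assumed names, not the actual quant.cpp API):

```c
#include <math.h>
#include <stdint.h>

#define BLOCK_DIM  128  /* channels per block (assumed) */
#define K_OUTLIERS 8    /* channels kept as exact FP16 */

/* Pick the K channels with the largest |x|, record their indices, and
 * return the max-abs of the remaining body — the codebook scale no
 * longer wastes resolution on the tails the outliers capture exactly. */
static float split_outliers(const float *x, uint8_t idx[K_OUTLIERS]) {
    int taken[BLOCK_DIM] = {0};
    for (int k = 0; k < K_OUTLIERS; k++) {      /* K passes; K is tiny */
        int best = -1;
        for (int i = 0; i < BLOCK_DIM; i++)
            if (!taken[i] && (best < 0 || fabsf(x[i]) > fabsf(x[best])))
                best = i;
        taken[best] = 1;
        idx[k] = (uint8_t)best;
    }
    float body_max = 0.0f;                      /* max-abs of body only */
    for (int i = 0; i < BLOCK_DIM; i++)
        if (!taken[i] && fabsf(x[i]) > body_max)
            body_max = fabsf(x[i]);
    return body_max;
}
```

At dequant time the K stored FP16 values simply overwrite the codebook reconstruction at their recorded indices.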
Karpathy-loop result: gap cut in half on Llama 3.2 3B
| Type | Bytes/block | PPL | Δ vs FP32 | Gap closed vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | 34% |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | 94% |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | 58% |
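For reference, every bytes-per-block figure in the table is consistent with one simple layout — 128-channel blocks, an 8-byte header, packed b-bit base codes, and for the `*o` types a 24-byte outlier payload (8 FP16 values plus 8 one-byte indices). This is back-of-envelope accounting derived from the table, not the documented on-disk format:

```c
/* bytes per 128-channel block: 8-byte header + packed b-bit codes,
 * plus 8 FP16 outlier values and 8 uint8 indices for the *o types */
static int block_bytes(int base_bits, int has_outliers) {
    int bytes = 8 + 128 * base_bits / 8;
    if (has_outliers)
        bytes += 8 * 2 + 8;   /* 24-byte outlier payload */
    return bytes;
}
```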
Honest disclosure: model-dependent
The outlier types are data-dependent. On SmolLM2 135M:
| Type | Bytes | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |
On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is strictly dominated by 5b (more bytes for worse PPL). The Pareto-optimal recommendations remain:
- `turbo_kv_4b` as the default (production)
- `turbo_kv_5b` for quality (production)
- `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)
Why ship them anyway?
- Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
- Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
- Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head
CLI
```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```
Tests
35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality remain unchanged and continue to gate any future regression.
Closes from issue #15
- ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent
Still open in #15:
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- Per-head rotation seeds
v0.6.1 — turbo_kv_5b near-lossless + regression tests
🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL
5-bit (32-level) Lloyd-Max-Gaussian codebook on RHT-rotated keys, following the same Variant F single-stage architecture as `turbo_kv_4b`. The new quality-maximizing option for users who can spare 22% more KV memory than 4b.
| Type | Bytes/block | Compression | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|---|
| FP32 baseline | 4 B/elem | 1× | 13.56 | — |
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.60 | +0.34% |
CLI: `./build/quant model.gguf -k turbo_kv_5b`
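The encode/decode core of such a scalar codebook can be sketched as below; the nearest-level search is generic, and the 32-entry table would hold the Lloyd-Max-Gaussian centroids (the uniform levels used in the test are a stand-in, not the real centroids):

```c
#include <math.h>
#include <stdint.h>

#define CB_LEVELS 32   /* 5-bit codebook */

/* Encode one rotated, scale-normalized value to its nearest codebook
 * level; decode is a plain table lookup. */
static uint8_t cb_encode(float x, const float cb[CB_LEVELS]) {
    uint8_t best = 0;
    for (int q = 1; q < CB_LEVELS; q++)
        if (fabsf(x - cb[q]) < fabsf(x - cb[best]))
            best = (uint8_t)q;
    return best;
}

static float cb_decode(uint8_t q, const float cb[CB_LEVELS]) {
    return cb[q];
}
```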
Regression tests
Three new deterministic tests in `test_turbo_kv.cpp` pin the Variant F quality thresholds so future Karpathy-loop iterations cannot regress past them without failing CI:
- `KV_4B_AttentionCosine` — `turbo_kv_4b` cosine ≥ 0.99 vs FP32 reference on synthetic data
- `KV_5B_AttentionCosine` — `turbo_kv_5b` cosine ≥ 0.999
- `KV_5B_BeatsKV_4B` — invariant: more bits must never be less accurate (5b cosine ≥ 4b cosine)
Tests use synthetic Gaussian-with-outliers vectors (~3% injected outliers at ±5× scale) and run in < 1 second. No model file needed.
Compatibility
- Block layout for `turbo_kv_3b`/`4b` is unchanged from v0.6.0 — only new `turbo_kv_5b` type added
- All 35 unit tests pass on macOS / Linux / Windows
Closes one item from issue #15
The 5-bit codebook variant follow-up from #15 is now shipped. Remaining items: per-channel outlier handling, Llama 3.1 8B + LongBench-E reproduction.
v0.6.0 — turbo_kv_4b champion, beats production baseline
🏆 Highlights
After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` ⭐ | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |
The story
The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed exactly zero to attention scores: removing it left the outputs byte-identical. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.
Full optimization history: bench/results/turboquant_reproduction.md
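The byte accounting behind the swap, assuming 128-channel blocks (an assumption, but one that matches the published 56/72/88-byte block sizes): the 1-bit QJL residual costs exactly the 16 bytes per block that widening the base codes from 3 to 4 bits consumes.

```c
/* per-block byte cost, assuming 128-channel blocks */
enum { CHANNELS = 128 };

static int qjl_sign_bytes(void)   { return CHANNELS * 1 / 8; }       /* 1-bit residual signs */
static int code_widen_bytes(void) { return CHANNELS * (4 - 3) / 8; } /* 3-bit -> 4-bit codes */
```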
Other changes
- CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
- @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
- Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `__builtin*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
- Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.
What's tracked for next release
See issue #15:
- Per-channel outlier handling (Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for ~5 bpc
Bug fixes
- `tq_qjl.c`: NaN guard requires `dim > 0`
- `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
- `tq_transformer.c`: NULL-check key/value cache calloc results
- `tq_ops.c`: Windows pthread_cond_wait must use SRW variant (CS variant on SRWLOCK = deadlock in test_ops thread pool)
Citations
If you use quant.cpp's KV compression in research, please cite:
v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.
7x KV Cache Compression
Same hardware, 7x longer context, zero quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
Windows (MSVC) Support
Compiles with Visual Studio 2019/2022. pthread shim, C11 atomics compat.
quant.h Synced
Single header now includes Gemma 4, Llama 3, IQ3_XXS support. `cc app.c -lm -lpthread` — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549ms → 257ms/token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0
v0.2.0
Full Changelog: v0.1.0...v0.2.0
v0.1.0 — Multi-Architecture Engine with KV Cache Compression
TurboQuant.cpp v0.1.0 — First Release
Multi-architecture LLM inference engine in pure C with KV cache compression.
Highlights
- 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- llama.cpp parity: 51 tok/s single-thread (vs 50.7 tok/s)
- Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
- TQM format: pre-quantized mmap binary, instant loading
- Zero dependencies: libc only, ~1MB binary
Supported Models
| Model | Speed (Q4, 6T) | Quality |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |
KV Cache Memory Savings
Gemma 3 4B at 32K context:
```
llama.cpp (FP16 KV): 4,352 MB
TurboQuant (Q4 KV):  1,156 MB   ← 3.8× compression
```
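Those totals follow from the standard KV-size formula below. The Gemma 3 4B config values (34 layers, 4 KV heads × 256 head_dim → 1024 KV channels) and the ~4.25 bits/elem for Q4 (4-bit codes plus per-block scale) are my assumptions, chosen because they reproduce the figures above exactly:

```c
/* KV cache bytes: keys and values, per layer, per token */
static long long kv_bytes(int layers, int kv_dim, double bytes_per_elem,
                          long long tokens) {
    return (long long)(2.0 * layers * kv_dim * bytes_per_elem * tokens);
}
```

0.53125 bytes/elem is 4.25 bits — e.g. 32-element blocks of 4-bit codes with a 1-byte scale.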
Quick Start
```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```
What's Inside
- 9,000+ lines of pure C — complete inference engine
- 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
References
- TurboQuant (ICLR 2026) — KV cache compression
- QJL (AAAI 2025) — 1-bit quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization