
Releases: quantumaikr/quant.cpp

v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)

07 Apr 23:17


🆕 Per-channel outlier handling

Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.

Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.

Karpathy-loop result: gap cut in half on Llama 3.2 3B

| Type | Bytes/block | PPL | Δ vs FP32 | vs 4b |
|---|---|---|---|---|
| FP32 | – | 13.56 | – | – |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | – |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | −34% gap |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | −94% gap |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | −58% gap |
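
For what it's worth, the byte budgets above are consistent with one simple decomposition (an inference from the numbers, not a documented layout): 128 b-bit codes per block, an 8-byte header, and for the outlier types an extra 8 × (2-byte FP16 value + 1-byte channel index) = 24 bytes:

```c
#include <assert.h>

/* Hypothetical block-size decomposition consistent with the release table:
 * 128 channels of b-bit codes, an 8-byte header, and an optional outlier
 * payload of 8 FP16 values plus 8 one-byte indices (24 bytes total). */
static int block_bytes(int bits, int has_outliers) {
    int dim = 128, header = 8;
    return dim * bits / 8 + header + (has_outliers ? 8 * (2 + 1) : 0);
}
```

This reproduces every published block size, including the 56-byte `turbo_kv_3b` from earlier releases.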

Honest disclosure: model-dependent

The outlier types are data-dependent. On SmolLM2 135M:

| Type | Bytes/block | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | – | 18.62 | – |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |

On a smaller-dimension model the 3-bit base in `turbo_kv_3bo` is too coarse even with outliers, and `turbo_kv_4bo` is Pareto-dominated by `turbo_kv_5b` (more bytes, higher PPL). The Pareto-optimal recommendations therefore remain:

  • `turbo_kv_4b` as the default (production)
  • `turbo_kv_5b` for quality (production)
  • `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)

Why ship them anyway?

  1. Validates the per-channel outlier technique — proves that local outlier handling closes a meaningful PPL gap on heavy-tailed models
  2. Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
  3. Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head

CLI

```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```

Tests

35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality remain unchanged and continue to gate any future regression.

Closed from issue #15

  • ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent

Still open in #15:

  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • Per-head rotation seeds

v0.6.1 — turbo_kv_5b near-lossless + regression tests

07 Apr 22:33


🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL

5-bit (32-level) Lloyd-Max-Gaussian codebook on RHT-rotated keys, following the same Variant F single-stage architecture as `turbo_kv_4b`. It is the new quality-maximizing option for users who can spare 22% more KV memory than `turbo_kv_4b`.

| Type | Bytes/block | Compression | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|---|
| FP32 baseline | 4 B/elem | – | 13.56 | – |
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.60 | +0.34% |

CLI: `./build/quant model.gguf -k turbo_kv_5b`

Regression tests

Three new deterministic tests in `test_turbo_kv.cpp` pin the Variant F quality thresholds so future Karpathy-loop iterations cannot regress past them without failing CI:

  • `KV_4B_AttentionCosine` — `turbo_kv_4b` cosine ≥ 0.99 vs FP32 reference on synthetic data
  • `KV_5B_AttentionCosine` — `turbo_kv_5b` cosine ≥ 0.999
  • `KV_5B_BeatsKV_4B` — invariant: more bits must give ≥ accuracy

Tests use synthetic Gaussian-with-outliers vectors (~3% injected outliers at ±5× scale) and run in < 1 second. No model file needed.

Compatibility

  • Block layout for `turbo_kv_3b`/`4b` is unchanged from v0.6.0 — only new `turbo_kv_5b` type added
  • All 35 unit tests pass on macOS / Linux / Windows

Closes one item from issue #15

The 5-bit codebook variant follow-up from #15 is now shipped. Remaining items: per-channel outlier handling, Llama 3.1 8B + LongBench-E reproduction.

v0.6.0 — turbo_kv_4b champion, beats production baseline

07 Apr 21:53


🏆 Highlights

After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), `turbo_kv_4b` is now the best 4-bit KV quantization in the project, beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | – |
| `turbo_kv_4b` | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp `q4_0` KV (rough) | 4 | ~14.99 | +10.6% |

The story

The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03, worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found that attention scores were byte-identical with the QJL stage removed: it contributed exactly nothing. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.

Full optimization history: bench/results/turboquant_reproduction.md

Other changes

  • CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
  • @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
  • Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `__builtin*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
  • Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
  • Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.

What's tracked for next release

See issue #15:

  • Per-channel outlier handling (Google paper's 32-channel split)
  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • 5-bit codebook variant for ~5 bpc

Bug fixes

  • `tq_qjl.c`: NaN guard requires `dim > 0`
  • `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
  • `tq_transformer.c`: NULL-check key/value cache calloc results
  • `tq_ops.c`: Windows pthread_cond_wait must use SRW variant (CS variant on SRWLOCK = deadlock in test_ops thread pool)

Citations

If you use quant.cpp's KV compression in research, please cite:

v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM

05 Apr 06:36


What's New

Gemma 4 26B-A4B MoE Support

Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.

7x KV Cache Compression

Same hardware, 7x longer context, zero quality loss.

| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |

New Models

  • Llama 3.2 3B Instruct — 17 tok/s, correct code generation
  • Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE

WASM Browser Demo

192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
Try it

Windows (MSVC) Support

Compiles with Visual Studio 2019/2022. pthread shim, C11 atomics compat.

quant.h Synced

Single header now includes Gemma 4, Llama 3, IQ3_XXS support. `cc app.c -lm -lpthread` — done.

Performance

  • Gemma 4 26B: 549ms → 257ms/token (-53%)
  • Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)

Bug Fixes

  • Gemma 4 NaN regression, Llama head_dim misdetection
  • TQ_STATIC_ASSERT in C mode, stack buffer overflow
  • Zero build warnings, 34/34 tests pass, score 99.2%

Full changelog: CHANGELOG.md

Full Changelog: v0.2.0...v0.5.0

v0.2.0

03 Apr 16:56


Full Changelog: v0.1.0...v0.2.0

v0.1.0 — Multi-Architecture Engine with KV Cache Compression

31 Mar 14:31


TurboQuant.cpp v0.1.0 — First Release

Multi-architecture LLM inference engine in pure C with KV cache compression.

Highlights

  • 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
  • 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
  • llama.cpp parity: 51 tok/s single-thread (vs 50.7 tok/s)
  • Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
  • Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
  • TQM format: pre-quantized mmap binary, instant loading
  • Zero dependencies: libc only, ~1MB binary

Supported Models

| Model | Speed (Q4, 6T) | Quality |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |

KV Cache Memory Savings

Gemma 3 4B at 32K context:

```
llama.cpp (FP16 KV):   4,352 MB
TurboQuant (Q4 KV):    1,156 MB   ← 3.8x compression
```

Quick Start

```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```

What's Inside

  • 9,000+ lines of pure C — complete inference engine
  • 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
  • Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
  • Q4 weight quantization with NEON 2-row batching + thread pool
  • Integer Q4×Q8 attention via ARM vdotq_s32
  • 20 test suites, 70+ tests
  • Python bindings (ctypes), llama.cpp/vLLM integration stubs

References

  • TurboQuant (ICLR 2026) — KV cache compression
  • QJL (AAAI 2025) — 1-bit quantized JL transform
  • PolarQuant (AISTATS 2026) — Polar coordinate quantization