Releases: quantumaikr/quant.cpp
v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)
🆕 Per-channel outlier handling
Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.
Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.
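As a sketch of that split (illustrative only — `split_outliers`, `BLOCK_DIM`, and `K_OUTLIERS` are assumed names, not the actual quant.cpp API):

```c
#include <math.h>
#include <stdint.h>

#define BLOCK_DIM  128  /* channels per block (assumed) */
#define K_OUTLIERS 8    /* channels kept as exact FP16 */

/* Pick the K channels with the largest |x|, record their indices, and
 * return the max-abs of the remaining body — the codebook scale no
 * longer wastes resolution on the tails the outliers capture exactly. */
static float split_outliers(const float *x, uint8_t idx[K_OUTLIERS]) {
    int taken[BLOCK_DIM] = {0};
    for (int k = 0; k < K_OUTLIERS; k++) {      /* K passes; K is tiny */
        int best = -1;
        for (int i = 0; i < BLOCK_DIM; i++)
            if (!taken[i] && (best < 0 || fabsf(x[i]) > fabsf(x[best])))
                best = i;
        taken[best] = 1;
        idx[k] = (uint8_t)best;
    }
    float body_max = 0.0f;                      /* max-abs of body only */
    for (int i = 0; i < BLOCK_DIM; i++)
        if (!taken[i] && fabsf(x[i]) > body_max)
            body_max = fabsf(x[i]);
    return body_max;
}
```

At dequant time the K stored FP16 values simply overwrite the codebook reconstruction at their recorded indices.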
Karpathy-loop result: gap cut in half on Llama 3.2 3B
| Type | Bytes/block | PPL | Δ vs FP32 | Gap closed vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | 34% |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | 94% |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | 58% |
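For reference, every bytes-per-block figure in the table is consistent with one simple layout — 128-channel blocks, an 8-byte header, packed b-bit base codes, and for the `*o` types a 24-byte outlier payload (8 FP16 values plus 8 one-byte indices). This is back-of-envelope accounting derived from the table, not the documented on-disk format:

```c
/* bytes per 128-channel block: 8-byte header + packed b-bit codes,
 * plus 8 FP16 outlier values and 8 uint8 indices for the *o types */
static int block_bytes(int base_bits, int has_outliers) {
    int bytes = 8 + 128 * base_bits / 8;
    if (has_outliers)
        bytes += 8 * 2 + 8;   /* 24-byte outlier payload */
    return bytes;
}
```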
Honest disclosure: model-dependent
The outlier types are data-dependent. On SmolLM2 135M:
| Type | Bytes | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |
On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is strictly dominated by 5b (more bytes for worse PPL). The Pareto-optimal recommendations remain:
- `turbo_kv_4b` as the default (production)
- `turbo_kv_5b` for quality (production)
- `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)
Why ship them anyway?
- Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
- Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
- Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head
CLI
```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```
Tests
35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality remain unchanged and continue to gate any future regression.
Closes from issue #15
- ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent
Still open in #15:
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- Per-head rotation seeds
v0.6.1 — turbo_kv_5b near-lossless + regression tests
🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL
5-bit (32-level) Lloyd-Max-Gaussian codebook on RHT-rotated keys, following the same Variant F single-stage architecture as `turbo_kv_4b`. The new quality-maximizing option for users who can spare 22% more KV memory than 4b.
| Type | Bytes/block | Compression | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|---|
| FP32 baseline | 4 B/elem | 1× | 13.56 | — |
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.60 | +0.34% |
CLI: `./build/quant model.gguf -k turbo_kv_5b`
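The encode/decode core of such a scalar codebook can be sketched as below; the nearest-level search is generic, and the 32-entry table would hold the Lloyd-Max-Gaussian centroids (the uniform levels used in the test are a stand-in, not the real centroids):

```c
#include <math.h>
#include <stdint.h>

#define CB_LEVELS 32   /* 5-bit codebook */

/* Encode one rotated, scale-normalized value to its nearest codebook
 * level; decode is a plain table lookup. */
static uint8_t cb_encode(float x, const float cb[CB_LEVELS]) {
    uint8_t best = 0;
    for (int q = 1; q < CB_LEVELS; q++)
        if (fabsf(x - cb[q]) < fabsf(x - cb[best]))
            best = (uint8_t)q;
    return best;
}

static float cb_decode(uint8_t q, const float cb[CB_LEVELS]) {
    return cb[q];
}
```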
Regression tests
Three new deterministic tests in `test_turbo_kv.cpp` pin the Variant F quality thresholds so future Karpathy-loop iterations cannot regress past them without failing CI:
- `KV_4B_AttentionCosine` — `turbo_kv_4b` cosine ≥ 0.99 vs FP32 reference on synthetic data
- `KV_5B_AttentionCosine` — `turbo_kv_5b` cosine ≥ 0.999
- `KV_5B_BeatsKV_4B` — invariant: more bits must never be less accurate (5b cosine ≥ 4b cosine)
Tests use synthetic Gaussian-with-outliers vectors (~3% injected outliers at ±5× scale) and run in < 1 second. No model file needed.
Compatibility
- Block layout for `turbo_kv_3b`/`4b` is unchanged from v0.6.0 — only new `turbo_kv_5b` type added
- All 35 unit tests pass on macOS / Linux / Windows
Closes one item from issue #15
The 5-bit codebook variant follow-up from #15 is now shipped. Remaining items: per-channel outlier handling, Llama 3.1 8B + LongBench-E reproduction.
v0.6.0 — turbo_kv_4b champion, beats production baseline
🏆 Highlights
After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` ⭐ | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |
The story
The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed exactly zero to attention scores: removing it left the outputs byte-identical. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.
Full optimization history: bench/results/turboquant_reproduction.md
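The byte accounting behind the swap, assuming 128-channel blocks (an assumption, but one that matches the published 56/72/88-byte block sizes): the 1-bit QJL residual costs exactly the 16 bytes per block that widening the base codes from 3 to 4 bits consumes.

```c
/* per-block byte cost, assuming 128-channel blocks */
enum { CHANNELS = 128 };

static int qjl_sign_bytes(void)   { return CHANNELS * 1 / 8; }       /* 1-bit residual signs */
static int code_widen_bytes(void) { return CHANNELS * (4 - 3) / 8; } /* 3-bit -> 4-bit codes */
```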
Other changes
- CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
- @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
- Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `__builtin*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
- Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.
What's tracked for next release
See issue #15:
- Per-channel outlier handling (Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for ~5 bpc
Bug fixes
- `tq_qjl.c`: NaN guard requires `dim > 0`
- `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
- `tq_transformer.c`: NULL-check key/value cache calloc results
- `tq_ops.c`: Windows pthread_cond_wait must use SRW variant (CS variant on SRWLOCK = deadlock in test_ops thread pool)
Citations
If you use quant.cpp's KV compression in research, please cite:
v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.
7x KV Cache Compression
Same hardware, 7x longer context, zero quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
Windows (MSVC) Support
Compiles with Visual Studio 2019/2022. pthread shim, C11 atomics compat.
quant.h Synced
Single header now includes Gemma 4, Llama 3, IQ3_XXS support. `cc app.c -lm -lpthread` — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549ms → 257ms/token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0
v0.2.0
Full Changelog: v0.1.0...v0.2.0
v0.1.0 — Multi-Architecture Engine with KV Cache Compression
TurboQuant.cpp v0.1.0 — First Release
Multi-architecture LLM inference engine in pure C with KV cache compression.
Highlights
- 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- llama.cpp parity: 51 tok/s single-thread (vs 50.7 tok/s)
- Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
- TQM format: pre-quantized mmap binary, instant loading
- Zero dependencies: libc only, ~1MB binary
Supported Models
| Model | Speed (Q4, 6T) | Quality |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |
KV Cache Memory Savings
Gemma 3 4B at 32K context:
```
llama.cpp (FP16 KV): 4,352 MB
TurboQuant (Q4 KV):  1,156 MB   ← 3.8× compression
```
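Those totals follow from the standard KV-size formula below. The Gemma 3 4B config values (34 layers, 4 KV heads × 256 head_dim → 1024 KV channels) and the ~4.25 bits/elem for Q4 (4-bit codes plus per-block scale) are my assumptions, chosen because they reproduce the figures above exactly:

```c
/* KV cache bytes: keys and values, per layer, per token */
static long long kv_bytes(int layers, int kv_dim, double bytes_per_elem,
                          long long tokens) {
    return (long long)(2.0 * layers * kv_dim * bytes_per_elem * tokens);
}
```

0.53125 bytes/elem is 4.25 bits — e.g. 32-element blocks of 4-bit codes with a 1-byte scale.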
Quick Start
```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```
What's Inside
- 9,000+ lines of pure C — complete inference engine
- 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
References
- TurboQuant (ICLR 2026) — KV cache compression
- QJL (AAAI 2025) — 1-bit quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization