v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)

@unamedkr unamedkr released this 07 Apr 23:17
· 415 commits to main since this release

🆕 Per-channel outlier handling

Two new research quant types add per-channel outlier handling (applied independently per block) on top of the Variant F base, validating the technique from Google's TurboQuant paper.

Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.
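The scheme above can be sketched in a few lines. This is an illustrative NumPy sketch, not the shipped codec: the real implementation packs codes, scale, indices, and FP16 outliers into the fixed 72/80/88/96-byte block layouts, while this version just returns the pieces. (The 24-byte difference between `turbo_kv_4b` at 72 B and `turbo_kv_4bo` at 96 B is consistent with 8 outliers × (2-byte FP16 value + 1-byte index), but that layout detail is an inference, not confirmed by the release.)

```python
import numpy as np

K = 8  # channels per block kept exact

def quantize_block(rotated, bits=4):
    """Quantize one rotated block, keeping the K largest-|x| channels exact."""
    idx = np.argsort(-np.abs(rotated))[:K]        # outlier channel indices
    outliers = rotated[idx].astype(np.float16)    # stored exactly as FP16
    body = rotated.copy()
    body[idx] = 0.0                               # exclude tails from scaling
    # max-abs scale computed from the body only, so the codebook's
    # resolution is spent on the bulk, not the tails
    scale = np.abs(body).max() / (2 ** (bits - 1) - 1)
    if scale > 0:
        codes = np.round(body / scale).astype(np.int8)
    else:
        codes = np.zeros_like(body, dtype=np.int8)
    return codes, scale, idx, outliers

def dequantize_block(codes, scale, idx, outliers):
    out = codes.astype(np.float32) * scale
    out[idx] = outliers.astype(np.float32)        # overwrite at dequant time
    return out
```

The body error stays bounded by half a quantization step, while the K heaviest channels round-trip at FP16 precision.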

Karpathy-loop result: gap cut in half on Llama 3.2 3B

| Type | Bytes/block | PPL | Δ vs FP32 | Gap vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | −34% gap |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | −94% gap |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | −58% gap |
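The "gap" column is just each type's PPL delta renormalized against the 4b default's 0.72-PPL gap to FP32. A quick sanity check (the table's PPLs are rounded to two decimals, so expect ±1 percentage point of drift):

```python
fp32 = 13.56
ppl = {"turbo_kv_4b": 14.28, "turbo_kv_3bo": 14.03,
       "turbo_kv_5b": 13.60, "turbo_kv_4bo": 13.86}

base_gap = ppl["turbo_kv_4b"] - fp32  # 0.72 PPL: the 4b default's gap to FP32
gap_closed = {k: 1 - (v - fp32) / base_gap for k, v in ppl.items()}

for name, frac in gap_closed.items():
    print(f"{name}: {frac:.0%} of the 4b gap closed")
```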

Honest disclosure: model-dependent

The outlier types are data-dependent. On SmolLM2 135M:

| Type | Bytes/block | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |

On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is dominated by 5b. The Pareto-optimal recommendations remain:

  • `turbo_kv_4b` as the default (production)
  • `turbo_kv_5b` for quality (production)
  • `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)

Why ship them anyway?

  1. Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
  2. Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
  3. Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head
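One possible shape for that future auto-selector, sketched below. Everything here is hypothetical: the function name, the excess-kurtosis criterion, and the threshold value are illustrative stand-ins, not part of the codebase. The idea is simply that 8 exact channels only pay for their extra 24 bytes when the rotated channel distribution is heavy-tailed (as on Llama 3.2 3B but not SmolLM2 135M).

```python
import numpy as np

def pick_kv_type(rotated_sample, kurtosis_threshold=3.0):
    """Hypothetical per-layer selector: prefer the outlier type only when
    the rotated activations are heavy-tailed. Threshold is a placeholder
    that would need calibration against measured PPL."""
    x = rotated_sample - rotated_sample.mean()
    excess_kurtosis = (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0
    return "turbo_kv_4bo" if excess_kurtosis > kurtosis_threshold else "turbo_kv_4b"
```

A real selector would more likely be driven by measured per-layer reconstruction quality (e.g. the existing cosine metrics) than by a distributional statistic alone.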

CLI

```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```

Tests

35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality are unchanged and continue to guard against future regressions.

Closed from issue #15:

  • ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent

Still open in #15:

  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • Per-head rotation seeds