v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)
🆕 Per-channel outlier handling
Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.
Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values, which overwrite the codebook reconstruction at dequant time. The remaining channels share a tighter codebook (max-abs computed over the body only), so the codebook doesn't waste resolution on tails the outliers already capture exactly.
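The quantize/dequant flow described above can be sketched as follows. The block size, struct layout, and names are illustrative assumptions only — the real `turbo_kv_*` formats pack 4-bit codes and FP16 outliers, while this sketch uses widened types for clarity:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of the per-block outlier scheme; not the real layout.
constexpr int kBlock = 32;   // channels per block (assumption)
constexpr int kOutliers = 8; // K = 8 exact channels, as in the release notes

struct Block {
    std::array<int8_t, kBlock> q{};        // 4-bit codes, stored widened here
    float scale = 0.f;                     // body max-abs / 7
    std::array<uint16_t, kOutliers> idx{}; // outlier channel indices
    std::array<float, kOutliers> exact{};  // FP16 in the real format
};

Block quantize(const float* rotated) {
    Block b;
    // 1. pick the K channels with the largest |rotated[i]|
    std::array<uint16_t, kBlock> order;
    for (uint16_t i = 0; i < kBlock; ++i) order[i] = i;
    std::partial_sort(order.begin(), order.begin() + kOutliers, order.end(),
        [&](uint16_t a, uint16_t c) {
            return std::fabs(rotated[a]) > std::fabs(rotated[c]);
        });
    std::array<bool, kBlock> is_out{};
    for (int k = 0; k < kOutliers; ++k) {
        b.idx[k] = order[k];
        b.exact[k] = rotated[order[k]];  // exact copy
        is_out[order[k]] = true;
    }
    // 2. body-only max-abs: outliers don't widen the codebook range
    float amax = 0.f;
    for (int i = 0; i < kBlock; ++i)
        if (!is_out[i]) amax = std::max(amax, std::fabs(rotated[i]));
    b.scale = amax / 7.f;  // symmetric 4-bit: codes in [-7, 7]
    // 3. quantize every channel (outlier slots get overwritten at dequant)
    for (int i = 0; i < kBlock; ++i) {
        float q = b.scale > 0 ? std::round(rotated[i] / b.scale) : 0.f;
        b.q[i] = (int8_t)std::clamp(q, -7.f, 7.f);
    }
    return b;
}

void dequantize(const Block& b, float* out) {
    for (int i = 0; i < kBlock; ++i) out[i] = b.q[i] * b.scale;
    for (int k = 0; k < kOutliers; ++k) out[b.idx[k]] = b.exact[k];  // exact overwrite
}
```

The key property is in step 2: because `amax` excludes the outlier channels, the quantization step for the body shrinks, which is where the PPL gain on heavy-tailed models comes from.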
Karpathy-loop result: PPL gap vs FP32 cut by more than half on Llama 3.2 3B
| Type | Bytes/block | PPL | Δ vs FP32 | Gap vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | −34% |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | −94% |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | −58% |
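The bytes/block column is arithmetically consistent with one plausible reading — each outlier variant is its base block plus K=8 outliers at 2 bytes of FP16 plus a 1-byte channel index each. This layout is an assumption, not confirmed by these notes:

```cpp
// Assumed (not confirmed) per-block overhead of the outlier variants:
// 8 outliers x (2-byte FP16 value + 1-byte channel index) = 24 bytes.
constexpr int kOutliers = 8;
constexpr int outlier_overhead = kOutliers * (2 + 1);        // 24 bytes

constexpr int base_4b = 72;                                  // turbo_kv_4b, from the table
constexpr int size_4bo = base_4b + outlier_overhead;         // 96, matches turbo_kv_4bo
constexpr int size_3bo = 80;                                 // turbo_kv_3bo, from the table
constexpr int implied_3b_base = size_3bo - outlier_overhead; // 56: implied 3-bit base,
                                                             // which does not ship on its own
```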
Honest disclosure: model-dependent
The outlier types are data-dependent. On SmolLM2 135M:
| Type | Bytes/block | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |
On a smaller-dimension model the 3-bit base in `3bo` is too coarse even with outlier handling, and `4bo` is Pareto-dominated by `5b` (more bytes for worse PPL). The Pareto-optimal recommendations are unchanged:
- `turbo_kv_4b` as the default (production)
- `turbo_kv_5b` for quality (production)
- `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `-k turbo_kv_3bo`)
Why ship them anyway?
- Validates the per-channel outlier technique — proves that local outlier handling closes a meaningful PPL gap on heavy-tailed models
- Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
- Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head
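Purely as a sketch of what per-layer/head auto-selection could look like — nothing here is implemented, and the function names and threshold are invented — a selector might measure tail heaviness (excess kurtosis) of the rotated activations and choose the outlier variant only for heavy-tailed heads:

```cpp
#include <cstddef>
#include <string>

// Illustrative only: a possible per-head selection rule for a future release.
// Excess kurtosis is 0 for a Gaussian; heavy tails push it well above 0.
double excess_kurtosis(const float* x, size_t n) {
    double mean = 0, var = 0, m4 = 0;
    for (size_t i = 0; i < n; ++i) mean += x[i];
    mean /= n;
    for (size_t i = 0; i < n; ++i) {
        double d = x[i] - mean;
        var += d * d;
        m4 += d * d * d * d;
    }
    var /= n;
    m4 /= n;
    return m4 / (var * var) - 3.0;
}

std::string pick_kv_type(const float* rotated, size_t n) {
    // 1.0 is a made-up tuning knob, not a calibrated value
    return excess_kurtosis(rotated, n) > 1.0 ? "turbo_kv_4bo" : "turbo_kv_4b";
}
```

The design intuition follows the results above: the outlier overhead only pays for itself when the rotated distribution actually has tails worth capturing exactly.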
CLI
```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```
Tests
35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality are unchanged and continue to gate against future regressions.
Addressed from issue #15:
- ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent
Still open in #15:
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- Per-head rotation seeds