v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)

@unamedkr unamedkr released this 07 Apr 23:17
· 415 commits to main since this release

🆕 Per-channel outlier handling

Two new research quant types add per-channel outlier handling (applied independently per block) on top of the Variant F base, validating the technique from Google's TurboQuant paper.

Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.
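The scheme above can be sketched in a few lines. This is an illustrative NumPy sketch, not the shipped codec: the real implementation packs codes, scale, indices, and FP16 outliers into the fixed 72/80/88/96-byte block layouts, while this version just returns the pieces. (The 24-byte difference between `turbo_kv_4b` at 72 B and `turbo_kv_4bo` at 96 B is consistent with 8 outliers × (2-byte FP16 value + 1-byte index), but that layout detail is an inference, not confirmed by the release.)

```python
import numpy as np

K = 8  # channels per block kept exact

def quantize_block(rotated, bits=4):
    """Quantize one rotated block, keeping the K largest-|x| channels exact."""
    idx = np.argsort(-np.abs(rotated))[:K]        # outlier channel indices
    outliers = rotated[idx].astype(np.float16)    # stored exactly as FP16
    body = rotated.copy()
    body[idx] = 0.0                               # exclude tails from scaling
    # max-abs scale computed from the body only, so the codebook's
    # resolution is spent on the bulk, not the tails
    scale = np.abs(body).max() / (2 ** (bits - 1) - 1)
    if scale > 0:
        codes = np.round(body / scale).astype(np.int8)
    else:
        codes = np.zeros_like(body, dtype=np.int8)
    return codes, scale, idx, outliers

def dequantize_block(codes, scale, idx, outliers):
    out = codes.astype(np.float32) * scale
    out[idx] = outliers.astype(np.float32)        # overwrite at dequant time
    return out
```

The body error stays bounded by half a quantization step, while the K heaviest channels round-trip at FP16 precision.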

Karpathy-loop result: gap cut in half on Llama 3.2 3B

| Type | Bytes/block | PPL | Δ vs FP32 | Gap vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | −34% gap |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | −94% gap |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | −58% gap |
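The "gap" column is just each type's PPL delta renormalized against the 4b default's 0.72-PPL gap to FP32. A quick sanity check (the table's PPLs are rounded to two decimals, so expect ±1 percentage point of drift):

```python
fp32 = 13.56
ppl = {"turbo_kv_4b": 14.28, "turbo_kv_3bo": 14.03,
       "turbo_kv_5b": 13.60, "turbo_kv_4bo": 13.86}

base_gap = ppl["turbo_kv_4b"] - fp32  # 0.72 PPL: the 4b default's gap to FP32
gap_closed = {k: 1 - (v - fp32) / base_gap for k, v in ppl.items()}

for name, frac in gap_closed.items():
    print(f"{name}: {frac:.0%} of the 4b gap closed")
```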

Honest disclosure: model-dependent

The outlier types are data-dependent. On SmolLM2 135M:

| Type | Bytes/block | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |

On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is dominated by 5b. The Pareto-optimal recommendations remain:

  • `turbo_kv_4b` as the default (production)
  • `turbo_kv_5b` for quality (production)
  • `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)

Why ship them anyway?

  1. Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
  2. Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
  3. Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head
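One possible shape for that future auto-selector, sketched below. Everything here is hypothetical: the function name, the excess-kurtosis criterion, and the threshold value are illustrative stand-ins, not part of the codebase. The idea is simply that 8 exact channels only pay for their extra 24 bytes when the rotated channel distribution is heavy-tailed (as on Llama 3.2 3B but not SmolLM2 135M).

```python
import numpy as np

def pick_kv_type(rotated_sample, kurtosis_threshold=3.0):
    """Hypothetical per-layer selector: prefer the outlier type only when
    the rotated activations are heavy-tailed. Threshold is a placeholder
    that would need calibration against measured PPL."""
    x = rotated_sample - rotated_sample.mean()
    excess_kurtosis = (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0
    return "turbo_kv_4bo" if excess_kurtosis > kurtosis_threshold else "turbo_kv_4b"
```

A real selector would more likely be driven by measured per-layer reconstruction quality (e.g. the existing cosine metrics) than by a distributional statistic alone.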

CLI

```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```

Tests

35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality are unchanged and continue to guard against future regressions.

Closed from issue #15:

  • ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent

Still open in #15:

  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • Per-head rotation seeds