Releases: quantumaikr/quant.cpp
v0.7.2 — turbo_kv_5b_fast: near-lossless at fp32 parity speed (1-byte layout)
New Pareto point: near-lossless quality + parity speed
`turbo_kv_5b_fast` is a new variant that uses the same Variant F algorithm as `turbo_kv_5b` (RHT + 32-level Lloyd-Max codebook) but stores each 5-bit index as a full byte instead of bit-packed. This wastes 3 bits per index but eliminates the scalar bit-extraction overhead that kept `turbo_kv_5b` at -8.8% vs fp32 in v0.7.1.
| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | 1× | 13.56 | — | 17.93 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.13 | +1.1% ✅ |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.65 | +0.7% | 16.93 | -5.6% |
| `turbo_kv_5b_fast` 🆕 | 136 | 3.76× | 13.65 | +0.7% | 17.53 | -2.2% near-parity |
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | -10.1% |
Why it works
The 5b inner loop bottleneck after Round 11 was the scalar bit-extraction needed to unpack 16 indices from 10 bytes (5 bits each, irregular byte boundaries). The SIMD lookup itself is 1 instruction (`vqtbl2q_s8`) but the unpack costs ~16 scalar shift+mask ops per iteration.
By storing 1 byte per index, the unpack becomes a single `vld1q_u8` (16 bytes load = 16 indices) and the inner loop becomes pure SIMD:
```c
for (d = 0; d + 15 < dim; d += 16) {
uint8x16_t indices = vld1q_u8(mi + d); // direct byte load
int8x16_t vals = vqtbl2q_s8(cb_vec, indices); // 32-entry table lookup
// ... int8 → fp32 → scale → fma
}
```
The cleanest implementation in the codebase. Same PPL as `turbo_kv_5b` (verified — 13.65 exactly, both share the same 32-level codebook).
Trade-off
`turbo_kv_5b_fast` is 1.55× larger per block than `turbo_kv_5b` (136 vs 88 bytes), so compression drops from 5.8× to 3.76×. In exchange you get fp32-parity speed for near-lossless quality.
Use case: "I want the +0.7% PPL of 5b but I can't accept the -5.6% speed gap, and I have memory to spare for the lower compression ratio."
Recommended Pareto choices:
| Goal | Use |
|---|---|
| Best speed × compression × decent quality | `turbo_kv_4b` (default) |
| Near-lossless quality, max compression | `turbo_kv_5b` |
| Near-lossless quality, parity speed | `turbo_kv_5b_fast` (this release) |
| Max compression, OK with quality cost | `turbo_kv_3b` (≥ 3B models only) |
Tests
35/35 unit tests pass. Same regression test thresholds (cosine ≥ 0.999) as `turbo_kv_5b`.
What's not in v0.7.2
- 3b/4bo/3bo NEON optimization — separate Pareto choices, lower priority
- AVX2 / WASM SIMD ports of the NEON tbl pattern
- Llama 3.1 8B paper baseline reproduction (memory-constrained)
Try it
```bash
./build/quant model.gguf -k turbo_kv_5b_fast -v fp16
```
v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)
Round 11: same primitive, different bit-packing
Round 10 (v0.7.0) achieved fp32 KV parity for `turbo_kv_4b` via NEON `vqtbl1q_s8` table lookup. Round 11 applies the same SIMD codebook lookup pattern to the remaining production variants. Result: large improvement for 5b and 3b but not full parity, because their bit-unaligned packing creates a new bottleneck in the unpack stage.
| Type | tok/s (3-run avg) | vs FP32 | PPL Δ | Compression |
|---|---|---|---|---|
| FP32 | 18.43 | baseline | — | 1× |
| `turbo_kv_4b` ⭐ default | 18.17 | −1.4% ✅ parity | +3.8% | 7.1× |
| `turbo_kv_5b` 🏆 quality | 16.80 | −8.8% | +0.7% | 5.8× |
| `turbo_kv_3b` | 16.57 | −10.1% | +13.3% | 9.1× |
5b made a +9% jump (from −14.5% to −8.8%). 3b improved +3 percentage points.
Why 4b reached parity but 5b/3b didn't
| Type | Bit packing | Unpack | Result |
|---|---|---|---|
| 4b | byte-aligned (2 nibbles per byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | parity ✅ |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + 16 scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + 16 scalar shifts | −8.8% |
For 3-bit and 5-bit, 16 indices straddle byte boundaries irregularly. We use the fastest scalar unpack we found, but it costs ~16 instructions per 16-element iteration. The SIMD lookup itself is 1 instruction. So the unpack dominates the runtime for 3b/5b.
Bonus insight: matmul already used the same pattern
While investigating other optimization axes, we discovered the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) already uses `vqtbl1q_s8` for codebook lookup — has been since v0.5. That's why fp32 and turbo_kv have identical matmul time (38.6 vs 38.9 ms in profile): they share the same NEON tbl matmul kernel.
The "breakthrough" of Round 10 was applying a primitive we'd already been using for matmul to the attention path. Profile-driven analysis would have spotted this in week 1.
What's not in v0.7.1
- 5b/3b at full parity. Closing the remaining gap needs either:
- Layout change: 1 byte per index, sacrificing compression (5b would go 5.8× → 3.6×)
- SIMD bit-extraction trick: `vshlq` + bit-mask patterns, complex
- Acceptance: ship 5b/3b at near-parity with honest disclosure (chosen for v0.7.1)
- `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
- AVX2 / WASM SIMD ports
Tests
35/35 unit tests pass. PPL unchanged across all variants from Round 10.
What you should use
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j
./build/quant model.gguf # turbo_kv_4b default — fp32 parity at 7.1× compression
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, slightly slower
./build/quant model.gguf -k turbo_kv_3b # max compression, +13% PPL trade
```
Cross-session lesson
Two sessions of 11 Karpathy rounds total. Key learnings now in persistent memory:
- Profile before optimizing — Round 10 found in 30s what 9 rounds of guessing missed
- SIMD table lookup pattern — vqtbl1q_s8 / vqtbl2q_s8 / vtbl1_s8 for small codebooks
- SIMD unpack constraint — byte-alignment matters as much as the primitive itself
- Ship honest — 5b/3b are not at parity, the README and CHANGELOG say so explicitly
v0.7.0 — Round 10 BREAKTHROUGH: turbo_kv_4b matches fp32 KV speed at 7.1× compression
🏆 The breakthrough we've been chasing for 3 sessions
After 10 rounds of Karpathy iteration, `turbo_kv_4b` now runs at fp32 KV parity on Llama 3.2 3B PPL eval — at the same time matching fp32's PPL closely (within 4%) and delivering 7.1× memory compression. This is the moment where the value proposition fundamentally changes.
| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | 1× | 13.56 | — | 17.9 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.7 | +4.5% ⬆ |
The Karpathy story
Rounds 1–9 had been doing local fusions to the inner loop without measuring where time was actually going. Then we ran the existing `--profile` flag at long context (PPL eval, seq_len ~950) and finally saw the truth:
```
matmul attention other total
fp32 38.6 ms 15.7 ms 1.4 ms 55.7 ms
turbo_kv_4b 38.9 ms 19.8 ms 1.8 ms 60.5 ms
delta +0.3 +4.1 +0.4 +4.8 ← entire gap is in attention
```
The matmul code path is identical between fp32 and turbo_kv (it's a Q/K/V projection over Q4 weights). The 8% speed gap was entirely in the attention dot-product loop.
Root cause: turbo_kv inner loop was scalar (LUT load + mul + add per element) while fp32 was 4-way NEON SIMD. About 2× more instructions per element. The dequant lookup had become compute-bound, not memory-bound — surprising because we'd assumed memory was the bottleneck.
The fix: NEON vqtbl1q_s8 (Round 10)
Apple Silicon NEON has `vqtbl1q_s8`, a single instruction that does a 16-byte table lookup with 16 lanes. Perfect for our 16-entry codebook.
```c
// One-time at startup: quantize 16 Lloyd-Max-Gaussian centroids to int8
static int8_t s_cb_i8[16];
for (int j = 0; j < 16; j++) {
s_cb_i8[j] = (int8_t)(cb[j] * (127.0f / 2.7326f)); // ~1% precision loss
}
int8x16_t cb_vec = vld1q_s8(s_cb_i8);
// Per attention call, per block:
for (d = 0; d + 31 < dim; d += 32) {
uint8x16_t bytes = vld1q_u8(mi + d/2); // 16 bytes = 32 nibbles
uint8x16_t low_nib = vandq_u8(bytes, vdupq_n_u8(0x0F));
uint8x16_t high_nib = vshrq_n_u8(bytes, 4);
int8x16_t low_vals = vqtbl1q_s8(cb_vec, low_nib); // 1 instruction, 16 gathers
int8x16_t high_vals = vqtbl1q_s8(cb_vec, high_nib);
int8x16x2_t inter = vzipq_s8(low_vals, high_vals); // interleave
// ... int8 → int16 → fp32 → multiply scale → vfmaq_f32
}
```
32 elements per iteration (vs 8 in the previous scalar version), with one `vqtbl1q_s8` per 16 lookups instead of 16 scalar L1 hits.
Cross-model verification
| Model | Speed gap (R9 → R10) | PPL (R10) |
|---|---|---|
| SmolLM2 135M | -14.5% → -3.1% | +5.7% |
| Llama 3.2 1B | -16.3% → -1.3% | +5.4% |
| Llama 3.2 3B | -8.4% → +4.5% ⬆ | +3.8% |
All three models show massive speed improvement. Llama 3.2 3B is now at parity. PPL also slightly improved on all three (the int8 discretization happens to align favorably with key statistics).
Honest framing change
| Before v0.7.0 | After v0.7.0 |
|---|---|
| "92% of fp32 speed at 7× compression" | "PARITY with fp32 speed at 7× compression" |
What you should use
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j
./build/quant model.gguf # turbo_kv_4b default (now fp32-parity)
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, still scalar
```
What's NOT in v0.7.0
The 5b/3b variants still use the previous scalar inner loop. Their numbers in the table are from Round 9. v0.7.1 will apply the same NEON tbl pattern to them (8-entry table for 3b, 32-entry split table for 5b).
Tests
35/35 unit tests pass. Regression tests pin attention cosine ≥ 0.99 (4b) — the int8 codebook precision loss is well within bounds.
The lesson
The user kept pushing: "답은 언제나 존재합니다. 그것을 찾아내는게 어려울 뿐입니다." (The answer always exists; finding it is the hard part.)
For 9 rounds we had been guessing at local optimizations. Round 10 was the result of:
- Stopping the guessing and running the existing `--profile` flag
- Reading the data: the entire gap was in attention, not matmul
- Web search for similar optimization patterns (NEON tbl, MLX implementations, sparse V)
- Choosing the right SIMD primitive (`vqtbl1q_s8`) for our specific 16-entry codebook
- Accepting the small precision loss (int8 vs fp32 LUT) because the regression tests guard quality
Three sessions of careful Karpathy discipline + one round of profile-driven analysis = the answer existed all along.
v0.6.5 — Re-baseline without Metal (3rd honest correction)
🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only
While exploring Option P3 (GPU compute graph for KV attention), we measured the existing Metal backend (`TQ_BUILD_METAL=ON`) and discovered it is net negative on every model size we tested.
The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)
| Build | KV type | tok/s |
|---|---|---|
| Metal ON | fp32 | 15.07 |
| Metal OFF | fp32 | 17.87 (+19%) |
| Metal ON | turbo_kv_4b | 14.17 |
| Metal OFF | turbo_kv_4b | 16.53 (+17%) |
| Metal ON | turbo_kv_5b | 13.43 |
| Metal OFF | turbo_kv_5b | 15.33 (+14%) |
Across model sizes:
| Model | Metal-OFF speedup |
|---|---|
| SmolLM2 135M | neutral |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B | +40% |
Even on the largest model we have access to, Metal is net negative. The per-matmul dispatch + commit + waitUntilCompleted pattern has overhead that exceeds the GPU compute benefit at batch-1 inference.
Impact on past benchmarks
The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. The bug was in our internal benchmark methodology: we built with `-DTQ_BUILD_METAL=ON` and reported numbers that were 14–22% slower than what users actually get.
This means our v0.6.0–v0.6.4 release notes UNDERSTATE the project's actual speed by 14–22% on Apple Silicon. v0.6.5 republishes the corrected numbers.
Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)
| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV | — | 1× | 18.13 | baseline | — |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 16.60 | −8.4% | +5.7% |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 15.43 | −14.9% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 15.77 | −13.0% | +13.3% |
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 15.20 | −16.2% | +2.5% |
| `uniform_4b` | 68 | 7.5× | 13.27 | −26.8% | +7.7% |
The relative gaps are essentially unchanged (turbo_kv_4b is still ~8% slower than fp32) — both paths got the same ~20% speedup from removing Metal overhead. Pareto rankings unchanged.
What we did NOT do
The original P3 plan was to add Metal kernels for the new turbo_kv_4b/5b attention path. We abandoned that plan after measuring the existing Metal backend is already net negative — adding more Metal kernels would compound the problem until the existing dispatch path is fixed. See issue #16 for the investigation plan.
The third honest correction
This is the third honest correction we've caught and fixed before it spread:
- v0.6.0: "lossless 7× compression" → measured "+6.3% PPL"
- v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
- v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (the user default)"
Each correction was caught by our validation discipline. Validation > marketing.
What you should use
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j
./build/quant model.gguf # turbo_kv_4b default
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality
```
Do not add `-DTQ_BUILD_METAL=ON` until issue #16 is resolved.
Tests
35/35 unit tests pass on macOS / Linux / Windows.
Filed
- Issue #16 — Metal backend currently slower than CPU-only on all tested models
v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims
⚠️ This release exists because validation matters
v0.6.3 shipped with the headline 'turbo_kv beats fp32 KV speed'. After running the comparison again with both paths NEON-optimized, that claim was wrong. v0.6.4 publishes the honest numbers.
Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)
| Type | Avg tok/s | vs FP32 | PPL | PPL Δ | Compression |
|---|---|---|---|---|---|
| FP32 KV (NEON) | 14.63 | baseline | 13.56 | — | 1× |
| `turbo_kv_4b` ⭐ default | 13.57 | −7.2% | 14.33 | +5.7% | 7.1× |
| `turbo_kv_3b` | 13.13 | −10.2% | 15.36 | +13.3% | 9.1× |
| `turbo_kv_5b` 🏆 quality | 12.90 | −11.8% | 13.65 | +0.7% | 5.8× |
The Round 5 optimization in v0.6.3 (transformer → traits->attention) was real and meaningful: turbo_kv_4b went from 6.9 → 13.57 tok/s (+97%). What was wrong was the comparison baseline: fp32 was unoptimized scalar.
What changed in v0.6.4
| File | Change |
|---|---|
| `tq_transformer.c` | NEON-optimized fp32 attention path. fp32 went 12.6 → 14.83 tok/s (+18%). |
| `README.md`, `README.ko.md` | Headline tables and ASCII charts updated with honest numbers and a Correction note. |
| `CHANGELOG.md` | v0.6.3 entry has a prominent Correction notice; v0.6.4 entry documents the validation pass. |
| v0.6.3 release notes | Updated with the same Correction notice. |
| `tq_transformer.c` | Round 8 prefetch attempt and Round 9 strided-attention concept reverted (no measurable benefit). |
What we learned
Validation is the most valuable step. It found the wrong claim before it spread to users.
The 9-round Karpathy loop was in good faith but the comparison baseline was unfair. Once we fixed the unfair baseline, the headline flipped from 'beats fp32' to 'within 8% of fp32 with 7× compression'. Both stories are interesting — but only one is true.
Pareto position (still strong)
`turbo_kv_4b` is still strictly better than `uniform_4b` on every axis:
| turbo_kv_4b | uniform_4b | |
|---|---|---|
| PPL on Llama 3.2 3B | 14.33 | 14.60 |
| Speed | 13.57 tok/s | 11.7 tok/s |
| Compression | 7.1× | 7.5× |
The compression edge for uniform_4b is marginal; turbo_kv_4b wins on the other two by clear margins.
Tests
35/35 unit tests pass on macOS / Linux / Windows. Regression tests pin cosine ≥ 0.99 (4b) and ≥ 0.999 (5b).
Closes
- Honest validation of v0.6.3 speed claims ✅
- Corrected README and release notes ✅
- Local optimum reached for the current attention path
- Future structural work (e.g., GPU dispatch, tensor graph IR) tracked separately
v0.6.3 — Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%
⚠️ Correction
The original release notes claimed 'turbo_kv beats fp32 KV speed'. That was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path had NEON. After adding NEON to fp32 (commit `4490c83`):
| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV (NEON) | — | 1× | 14.83 | baseline | — |
| `turbo_kv_4b` ⭐ | 72 | 7.1× | 13.67 | −7.8% | +5.7% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.13 | −11.5% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 13.4 | −9.6% | +13.3% |
The honest story: 9 rounds of Karpathy iteration closed the quant-KV speed gap from −45% to −8%, while the types compress 5.8–9.1×. We do not (yet) beat fp32 raw speed.
What actually changed in this release
Round 5 (the real bottleneck)
`tq_transformer.c`'s `use_quant_kv` path was calling `traits->dequantize` once per cached key per token, which internally ran `tq_rht_inverse()` (O(d log d)) per call — dominating the total cost at long context. Round 5 changes the inner loop to use the type's optimized `traits->attention` kernel, which:
- Pre-rotates the query ONCE per layer
- Does fused dequant + dot product per block in rotated space
- Skips per-position inverse RHT entirely
This took the quant path from 6.9 → 13.7 tok/s on Llama 3.2 3B (a real ~2× speedup). The fast path bypasses the slow path for the common case (no QK-norm-on-stored-keys, no high-res window, no sliding-window attention).
Round 6
Hoist LUT in `turbo_kv_4bo` and `turbo_kv_3bo` dequant functions to match the optimization patterns in 3b/4b/5b.
Karpathy round-by-round
| Round | What changed | turbo_kv_4b tok/s |
|---|---|---|
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression — revert |
| 3 | Apply Round 1 to 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | transformer uses traits->attention (no per-pos RHT inverse) | 13.5 ✅ |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.7 |
| 7 | NEON-optimize fp32 path (validation finding) | fp32: 12.6 → 14.8 |
Lessons
The validation step (running the same comparison after fixing the unfair baseline) flipped the headline. This is exactly what validation is for. We caught the wrong claim before it shipped to users.
The Karpathy loop's measure → modify → measure → revert if worse discipline kept us honest at each step. Round 2 looked clever but regressed. Round 5 was an unglamorous transformer-level structural change that nothing in the local optimizations could find — but the systematic measurement revealed it.
What you should use
- `./build/quant model.gguf` — defaults to `turbo_kv_4b` (best size × quality, 92% of fp32 speed)
- `-k turbo_kv_5b` — when you need near-lossless quality
- `-k turbo_kv_3b` — for maximum compression (9.1×) at ~13% PPL cost
- `-k fp32` — when you have memory to spare and want max speed
Tests
35/35 tests pass. Regression tests pin cosine ≥ 0.99 (4b/5b) and 5b ≥ 4b accuracy invariant.
v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)
🆕 Per-channel outlier handling
Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.
Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.
Karpathy-loop result: gap cut in half on Llama 3.2 3B
| Type | Bytes/block | PPL | Δ vs FP32 | vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | −34% gap |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | −94% gap |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | −58% gap |
Honest disclosure: model-dependent
The outlier types are data-dependent. On SmolLM2 135M:
| Type | Bytes | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |
On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is dominated by 5b. The Pareto-optimal recommendations remain:
- `turbo_kv_4b` as the default (production)
- `turbo_kv_5b` for quality (production)
- `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)
Why ship them anyway?
- Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
- Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
- Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head
CLI
```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```
Tests
35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality remain unchanged and continue to gate any future regression.
Closes from issue #15
- ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent
Still open in #15:
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- Per-head rotation seeds
v0.6.1 — turbo_kv_5b near-lossless + regression tests
🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL
5-bit (32-level) Lloyd-Max-Gaussian codebook on RHT-rotated keys, following the same Variant F single-stage architecture as `turbo_kv_4b`. The new quality-maximizing option for users who can spare 22% more KV memory than 4b.
| Type | Bytes/block | Compression | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|---|
| FP32 baseline | 4/elem | 1× | 13.56 | — |
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.60 | +0.34% |
CLI: `./build/quant model.gguf -k turbo_kv_5b`
Regression tests
Three new deterministic tests in `test_turbo_kv.cpp` pin the Variant F quality thresholds so future Karpathy-loop iterations cannot regress past them without failing CI:
- `KV_4B_AttentionCosine` — `turbo_kv_4b` cosine ≥ 0.99 vs FP32 reference on synthetic data
- `KV_5B_AttentionCosine` — `turbo_kv_5b` cosine ≥ 0.999
- `KV_5B_BeatsKV_4B` — invariant: more bits must give ≥ accuracy
Tests use synthetic Gaussian-with-outliers vectors (~3% injected outliers at ±5× scale) and run in < 1 second. No model file needed.
Compatibility
- Block layout for `turbo_kv_3b`/`4b` is unchanged from v0.6.0 — only new `turbo_kv_5b` type added
- All 35 unit tests pass on macOS / Linux / Windows
Closes one item from issue #15
The 5-bit codebook variant follow-up from #15 is now shipped. Remaining items: per-channel outlier handling, Llama 3.1 8B + LongBench-E reproduction.
v0.6.0 — turbo_kv_4b champion, beats production baseline
🏆 Highlights
After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` ⭐ | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |
The story
The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed byte-identical zero to attention scores. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit / 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.
Full optimization history: bench/results/turboquant_reproduction.md
Other changes
- CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
- @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
- Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `_builtin*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
- Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.
What's tracked for next release
See issue #15:
- Per-channel outlier handling (Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for ~5 bpc
Bug fixes
- `tq_qjl.c`: NaN guard requires `dim > 0`
- `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
- `tq_transformer.c`: NULL-check key/value cache calloc results
- `tq_ops.c`: Windows pthread_cond_wait must use SRW variant (CS variant on SRWLOCK = deadlock in test_ops thread pool)
Citations
If you use quant.cpp's KV compression in research, please cite:
v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.
7x KV Cache Compression
Same hardware, 7x longer context, zero quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
→ Try it
Windows (MSVC) Support
Compiles with Visual Studio 2019/2022. pthread shim, C11 atomics compat.
quant.h Synced
Single header now includes Gemma 4, Llama 3, IQ3_XXS support. cc app.c -lm -lpthread — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549ms → 257ms/token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0