
Releases: quantumaikr/quant.cpp

v0.7.2 — turbo_kv_5b_fast: near-lossless at fp32 parity speed (1-byte layout)

08 Apr 16:00


New Pareto point: near-lossless quality + parity speed

`turbo_kv_5b_fast` is a new variant that uses the same Variant F algorithm as `turbo_kv_5b` (RHT + 32-level Lloyd-Max codebook) but stores each 5-bit index as a full byte instead of bit-packed. This wastes 3 bits per index but eliminates the scalar bit-extraction overhead that kept `turbo_kv_5b` at -8.8% vs fp32 in v0.7.1.

| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | — | 13.56 | — | 17.93 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.13 | +1.1% |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.65 | +0.7% | 16.93 | −5.6% |
| `turbo_kv_5b_fast` 🆕 | 136 | 3.76× | 13.65 | +0.7% | 17.53 | −2.2% (near-parity) |
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | −10.1% |

Why it works

The 5b inner loop bottleneck after Round 11 was the scalar bit-extraction needed to unpack 16 indices from 10 bytes (5 bits each, irregular byte boundaries). The SIMD lookup itself is 1 instruction (`vqtbl2q_s8`) but the unpack costs ~16 scalar shift+mask ops per iteration.
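For readers without the codebase at hand, a portable sketch of such a scalar unpack (hypothetical `pack5_scalar`/`unpack5_scalar`, assuming LSB-first bit packing; not the project's actual code) shows where those shift+mask ops come from:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the scalar 5-bit unpack: 16 indices live in
 * 10 bytes, so each index needs its own bit-offset shift + mask — the
 * ~16 scalar ops per iteration that dominate the 5b inner loop.
 * Assumes LSB-first packing; layout is illustrative, not the real one. */
static void unpack5_scalar(const uint8_t packed[10], uint8_t idx[16]) {
    for (int i = 0; i < 16; i++) {
        int bit = i * 5;                       /* irregular byte boundary */
        uint32_t two = packed[bit / 8];        /* field may straddle 2 bytes */
        if (bit / 8 + 1 < 10)
            two |= (uint32_t)packed[bit / 8 + 1] << 8;
        idx[i] = (uint8_t)((two >> (bit % 8)) & 0x1F);
    }
}

static void pack5_scalar(const uint8_t idx[16], uint8_t packed[10]) {
    memset(packed, 0, 10);
    for (int i = 0; i < 16; i++) {
        int bit = i * 5, s = bit % 8;
        packed[bit / 8] |= (uint8_t)(idx[i] << s);
        if (s > 3)                             /* field spills into next byte */
            packed[bit / 8 + 1] |= (uint8_t)(idx[i] >> (8 - s));
    }
}
```

Per 16 indices that is one load, one shift, and one mask each — serial work the NEON lookup cannot hide.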

By storing 1 byte per index, the unpack becomes a single `vld1q_u8` (16 bytes load = 16 indices) and the inner loop becomes pure SIMD:

```c
for (int d = 0; d + 15 < dim; d += 16) {
    uint8x16_t indices = vld1q_u8(mi + d);            // direct byte load: 16 indices
    int8x16_t  vals    = vqtbl2q_s8(cb_vec, indices); // 32-entry table lookup
    // ... int8 → fp32 → scale → fma
}
```

This is the cleanest implementation in the codebase. Same PPL as `turbo_kv_5b` (verified — 13.65 exactly; both share the same 32-level codebook).

Trade-off

`turbo_kv_5b_fast` is 1.55× larger per block than `turbo_kv_5b` (136 vs 88 bytes), so compression drops from 5.8× to 3.76×. In exchange you get fp32-parity speed for near-lossless quality.
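The size arithmetic can be reproduced assuming a 128-element block with 8 bytes of per-block metadata (these layout parameters are inferred from the table, not stated explicitly):

```c
#include <assert.h>

/* Sketch: reproduce the bytes/block and compression columns, assuming a
 * 128-element block and 8 bytes of per-block metadata (scale etc.).
 * Both parameters are inferred from the tables, not confirmed layout. */
enum { BLOCK_ELEMS = 128, OVERHEAD_BYTES = 8 };

static int bytes_per_block(int bits_per_index, int one_byte_per_index) {
    int payload = one_byte_per_index ? BLOCK_ELEMS               /* _fast layout */
                                     : BLOCK_ELEMS * bits_per_index / 8;
    return payload + OVERHEAD_BYTES;
}

static double compression_vs_fp32(int bytes) {
    return (double)(BLOCK_ELEMS * 4) / bytes;   /* fp32 = 4 bytes/element */
}
```

Under those assumptions, 4b packed gives 72 bytes (7.1×), 5b packed 88 (5.8×), 3b packed 56 (9.1×), and the 1-byte-per-index 5b layout 136 (3.76×) — matching the table.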

Use case: "I want the +0.7% PPL of 5b but I can't accept the -5.6% speed gap, and I have memory to spare for the lower compression ratio."

Recommended Pareto choices:

| Goal | Use |
|---|---|
| Best speed × compression × decent quality | `turbo_kv_4b` (default) |
| Near-lossless quality, max compression | `turbo_kv_5b` |
| Near-lossless quality, parity speed | `turbo_kv_5b_fast` (this release) |
| Max compression, OK with quality cost | `turbo_kv_3b` (≥ 3B models only) |

Tests

35/35 unit tests pass. Same regression test thresholds (cosine ≥ 0.999) as `turbo_kv_5b`.

What's not in v0.7.2

  • 3b/4bo/3bo NEON optimization — separate Pareto choices, lower priority
  • AVX2 / WASM SIMD ports of the NEON tbl pattern
  • Llama 3.1 8B paper baseline reproduction (memory-constrained)

Try it

```bash
./build/quant model.gguf -k turbo_kv_5b_fast -v fp16
```

v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)

08 Apr 14:53


Round 11: same primitive, different bit-packing

Round 10 (v0.7.0) achieved fp32 KV parity for `turbo_kv_4b` via NEON `vqtbl1q_s8` table lookup. Round 11 applies the same SIMD codebook lookup pattern to the remaining production variants. Result: large improvement for 5b and 3b but not full parity, because their bit-unaligned packing creates a new bottleneck in the unpack stage.

| Type | tok/s (3-run avg) | vs FP32 | PPL Δ | Compression |
|---|---|---|---|---|
| FP32 | 18.43 | baseline | — | — |
| `turbo_kv_4b` ⭐ default | 18.17 | −1.4% ✅ parity | +3.8% | 7.1× |
| `turbo_kv_5b` 🏆 quality | 16.80 | −8.8% | +0.7% | 5.8× |
| `turbo_kv_3b` | 16.57 | −10.1% | +13.3% | 9.1× |

5b gained ~9% in tok/s, narrowing its gap from −14.5% to −8.8%; 3b's gap narrowed by about 3 percentage points.

Why 4b reached parity but 5b/3b didn't

| Type | Bit packing | Unpack | Result |
|---|---|---|---|
| 4b | byte-aligned (2 nibbles per byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | parity |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + 16 scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + 16 scalar shifts | −8.8% |

For 3-bit and 5-bit, 16 indices straddle byte boundaries irregularly. We use the fastest scalar unpack we found, but it costs ~16 instructions per 16-element iteration. The SIMD lookup itself is 1 instruction. So the unpack dominates the runtime for 3b/5b.

Bonus insight: matmul already used the same pattern

While investigating other optimization axes, we discovered the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) already uses `vqtbl1q_s8` for codebook lookup — and has since v0.5. That's why fp32 and turbo_kv have identical matmul time (38.6 vs 38.9 ms in the profile): they share the same NEON tbl matmul kernel.

The "breakthrough" of Round 10 was applying a primitive we'd already been using for matmul to the attention path. Profile-driven analysis would have spotted this in week 1.

What's not in v0.7.1

  • 5b/3b at full parity. Closing the remaining gap needs either:
    • Layout change: 1 byte per index, sacrificing compression (5b would go 5.8× → 3.6×)
    • SIMD bit-extraction trick: `vshlq` + bit-mask patterns, complex
    • Acceptance: ship 5b/3b at near-parity with honest disclosure (chosen for v0.7.1)
  • `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
  • AVX2 / WASM SIMD ports

Tests

35/35 unit tests pass. PPL unchanged across all variants from Round 10.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default — fp32 parity at 7.1× compression
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, slightly slower
./build/quant model.gguf -k turbo_kv_3b # max compression, +13% PPL trade
```

Cross-session lesson

Eleven Karpathy rounds across two sessions. Key learnings are now in persistent memory:

  1. Profile before optimizing — Round 10 found in 30s what 9 rounds of guessing missed
  2. SIMD table lookup pattern — vqtbl1q_s8 / vqtbl2q_s8 / vtbl1_s8 for small codebooks
  3. SIMD unpack constraint — byte-alignment matters as much as the primitive itself
  4. Ship honest — 5b/3b are not at parity, the README and CHANGELOG say so explicitly

v0.7.0 — Round 10 BREAKTHROUGH: turbo_kv_4b matches fp32 KV speed at 7.1× compression

08 Apr 13:03


🏆 The breakthrough we've been chasing for 3 sessions

After 10 rounds of Karpathy iteration, `turbo_kv_4b` now runs at fp32 KV parity on the Llama 3.2 3B PPL eval while staying within 4% of fp32's PPL and delivering 7.1× memory compression. This is the moment the value proposition fundamentally changes.

| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | — | 13.56 | — | 17.9 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.7 | +4.5% |

The Karpathy story

Rounds 1–9 applied local fusions to the inner loop without measuring where the time was actually going. Then we ran the existing `--profile` flag at long context (PPL eval, seq_len ~950) and finally saw the truth:

```
              matmul    attention   other    total
fp32          38.6 ms   15.7 ms     1.4 ms   55.7 ms
turbo_kv_4b   38.9 ms   19.8 ms     1.8 ms   60.5 ms
delta         +0.3      +4.1        +0.4     +4.8   ← entire gap is in attention
```

The matmul code path is identical between fp32 and turbo_kv (it's a Q/K/V projection over Q4 weights). The 8% speed gap was entirely in the attention dot-product loop.

Root cause: turbo_kv inner loop was scalar (LUT load + mul + add per element) while fp32 was 4-way NEON SIMD. About 2× more instructions per element. The dequant lookup had become compute-bound, not memory-bound — surprising because we'd assumed memory was the bottleneck.
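A portable sketch of that pre-Round-10 scalar pattern (names like `dot_scalar_lut`, `cb`, and the nibble layout are illustrative, not the codebase's identifiers) makes the per-element cost concrete:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the scalar inner loop described above: one table load, one
 * multiply, and one add per element — roughly 2× the instruction count
 * of a 4-way SIMD fp32 dot product. Illustrative names throughout. */
static float dot_scalar_lut(const uint8_t *nibbles, int n,
                            const float cb[16], float scale,
                            const float *query) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        uint8_t byte = nibbles[i / 2];
        uint8_t idx  = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
        acc += query[i] * (cb[idx] * scale);   /* LUT load + mul + add */
    }
    return acc;
}
```

Every element pays a dependent table load before it can multiply, which is exactly what the NEON table lookup in Round 10 amortizes across 16 lanes.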

The fix: NEON vqtbl1q_s8 (Round 10)

Apple Silicon NEON has `vqtbl1q_s8`, a single instruction that does a 16-byte table lookup with 16 lanes. Perfect for our 16-entry codebook.

```c
// One-time at startup: quantize the 16 Lloyd-Max-Gaussian centroids to int8
static int8_t s_cb_i8[16];
for (int j = 0; j < 16; j++) {
    s_cb_i8[j] = (int8_t)(cb[j] * (127.0f / 2.7326f)); // ~1% precision loss
}
int8x16_t cb_vec = vld1q_s8(s_cb_i8);

// Per attention call, per block:
for (int d = 0; d + 31 < dim; d += 32) {
    uint8x16_t bytes     = vld1q_u8(mi + d / 2);            // 16 bytes = 32 nibbles
    uint8x16_t low_nib   = vandq_u8(bytes, vdupq_n_u8(0x0F));
    uint8x16_t high_nib  = vshrq_n_u8(bytes, 4);
    int8x16_t  low_vals  = vqtbl1q_s8(cb_vec, low_nib);     // 1 instruction, 16 gathers
    int8x16_t  high_vals = vqtbl1q_s8(cb_vec, high_nib);
    int8x16x2_t inter    = vzipq_s8(low_vals, high_vals);   // interleave low/high lanes
    // ... int8 → int16 → fp32 → multiply scale → vfmaq_f32
}
```

32 elements per iteration (vs 8 in the previous scalar version), with one `vqtbl1q_s8` per 16 lookups instead of 16 scalar L1 hits.
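For readers off Apple Silicon, a scalar reference of what one `vqtbl1q_s8` computes (the hypothetical `tbl16_scalar` below is for illustration; the real instruction does all 16 lanes at once):

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch of vqtbl1q_s8 semantics: 16 parallel lookups of byte
 * indices into a 16-entry int8 table; out-of-range indices yield 0,
 * matching the NEON instruction's defined behavior. */
static void tbl16_scalar(const int8_t table[16], const uint8_t idx[16],
                         int8_t out[16]) {
    for (int lane = 0; lane < 16; lane++)
        out[lane] = (idx[lane] < 16) ? table[idx[lane]] : 0;
}
```

Collapsing those 16 dependent loads into one instruction is the whole trick; the nibble masks just feed it valid lane indices.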

Cross-model verification

| Model | Speed gap (R9 → R10) | PPL Δ (R10) |
|---|---|---|
| SmolLM2 135M | −14.5% → −3.1% | +5.7% |
| Llama 3.2 1B | −16.3% → −1.3% | +5.4% |
| Llama 3.2 3B | −8.4% → +4.5% | +3.8% |

All three models show massive speed improvement. Llama 3.2 3B is now at parity. PPL also slightly improved on all three (the int8 discretization happens to align favorably with key statistics).

Honest framing change

| Before v0.7.0 | After v0.7.0 |
|---|---|
| "92% of fp32 speed at 7× compression" | "PARITY with fp32 speed at 7× compression" |

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default (now fp32-parity)
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, still scalar
```

What's NOT in v0.7.0

The 5b/3b variants still use the previous scalar inner loop. Their numbers in the table are from Round 9. v0.7.1 will apply the same NEON tbl pattern to them (8-entry table for 3b, 32-entry split table for 5b).

Tests

35/35 unit tests pass. Regression tests pin attention cosine ≥ 0.99 (4b) — the int8 codebook precision loss is well within bounds.

The lesson

The user kept pushing: "The answer always exists; finding it is the hard part."

For 9 rounds we had been guessing at local optimizations. Round 10 was the result of:

  1. Stopping the guessing and running the existing `--profile` flag
  2. Reading the data: the entire gap was in attention, not matmul
  3. Web search for similar optimization patterns (NEON tbl, MLX implementations, sparse V)
  4. Choosing the right SIMD primitive (`vqtbl1q_s8`) for our specific 16-entry codebook
  5. Accepting the small precision loss (int8 vs fp32 LUT) because the regression tests guard quality

Three sessions of careful Karpathy discipline + one round of profile-driven analysis = the answer existed all along.

v0.6.5 — Re-baseline without Metal (3rd honest correction)

08 Apr 11:17


🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only

While exploring Option P3 (GPU compute graph for KV attention), we measured the existing Metal backend (`TQ_BUILD_METAL=ON`) and discovered it is net negative on every model size we tested.

The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)

| Build | KV type | tok/s |
|---|---|---|
| Metal ON | fp32 | 15.07 |
| Metal OFF | fp32 | 17.87 (+19%) |
| Metal ON | turbo_kv_4b | 14.17 |
| Metal OFF | turbo_kv_4b | 16.53 (+17%) |
| Metal ON | turbo_kv_5b | 13.43 |
| Metal OFF | turbo_kv_5b | 15.33 (+14%) |

Across model sizes:

| Model | Metal-OFF speedup |
|---|---|
| SmolLM2 135M | neutral |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B | +40% |

Even on the largest model we have access to, Metal is net negative. The per-matmul dispatch + commit + waitUntilCompleted pattern has overhead that exceeds the GPU compute benefit at batch-1 inference.

Impact on past benchmarks

The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. The bug was in our internal benchmark methodology: we built with `-DTQ_BUILD_METAL=ON` and reported numbers that were 14–22% slower than what users actually get.

This means our v0.6.0–v0.6.4 release notes UNDERSTATE the project's actual speed by 14–22% on Apple Silicon. v0.6.5 republishes the corrected numbers.

Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)

| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV | — | — | 18.13 | baseline | — |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 16.60 | −8.4% | +5.7% |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 15.43 | −14.9% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 15.77 | −13.0% | +13.3% |
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 15.20 | −16.2% | +2.5% |
| `uniform_4b` | 68 | 7.5× | 13.27 | −26.8% | +7.7% |

The relative gaps are essentially unchanged (turbo_kv_4b is still ~8% slower than fp32) — both paths got the same ~20% speedup from removing Metal overhead. Pareto rankings unchanged.

What we did NOT do

The original P3 plan was to add Metal kernels for the new turbo_kv_4b/5b attention path. We abandoned that plan after measuring the existing Metal backend is already net negative — adding more Metal kernels would compound the problem until the existing dispatch path is fixed. See issue #16 for the investigation plan.

The third honest correction

This is the third error we've caught and corrected before it spread:

  1. v0.6.0: "lossless 7× compression" → measured "+6.3% PPL"
  2. v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
  3. v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (the user default)"

Each correction was caught by our validation discipline. Validation > marketing.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality
```

Do not add `-DTQ_BUILD_METAL=ON` until issue #16 is resolved.

Tests

35/35 unit tests pass on macOS / Linux / Windows.

Filed

  • Issue #16 — Metal backend currently slower than CPU-only on all tested models

v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims

08 Apr 02:43


⚠️ This release exists because validation matters

v0.6.3 shipped with the headline 'turbo_kv beats fp32 KV speed'. After running the comparison again with both paths NEON-optimized, that claim was wrong. v0.6.4 publishes the honest numbers.

Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)

| Type | Avg tok/s | vs FP32 | PPL | PPL Δ | Compression |
|---|---|---|---|---|---|
| FP32 KV (NEON) | 14.63 | baseline | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 13.57 | −7.2% | 14.33 | +5.7% | 7.1× |
| `turbo_kv_3b` | 13.13 | −10.2% | 15.36 | +13.3% | 9.1× |
| `turbo_kv_5b` 🏆 quality | 12.90 | −11.8% | 13.65 | +0.7% | 5.8× |

The Round 5 optimization in v0.6.3 (transformer → traits->attention) was real and meaningful: turbo_kv_4b went from 6.9 → 13.57 tok/s (+97%). What was wrong was the comparison baseline: fp32 was unoptimized scalar.

What changed in v0.6.4

| File | Change |
|---|---|
| `tq_transformer.c` | NEON-optimized fp32 attention path. fp32 went 12.6 → 14.83 tok/s (+18%). |
| `README.md`, `README.ko.md` | Headline tables and ASCII charts updated with honest numbers and a Correction note. |
| `CHANGELOG.md` | v0.6.3 entry has a prominent Correction notice; v0.6.4 entry documents the validation pass. |
| v0.6.3 release notes | Updated with the same Correction notice. |
| `tq_transformer.c` | Round 8 prefetch attempt and Round 9 strided-attention concept reverted (no measurable benefit). |

What we learned

Validation is the most valuable step. It found the wrong claim before it spread to users.

The 9-round Karpathy loop was in good faith but the comparison baseline was unfair. Once we fixed the unfair baseline, the headline flipped from 'beats fp32' to 'within 8% of fp32 with 7× compression'. Both stories are interesting — but only one is true.

Pareto position (still strong)

`turbo_kv_4b` still dominates `uniform_4b` on quality and speed, and concedes only a sliver of compression:

| | turbo_kv_4b | uniform_4b |
|---|---|---|
| PPL on Llama 3.2 3B | 14.33 | 14.60 |
| Speed | 13.57 tok/s | 11.7 tok/s |
| Compression | 7.1× | 7.5× |

The compression edge for uniform_4b is marginal; turbo_kv_4b wins on the other two by clear margins.

Tests

35/35 unit tests pass on macOS / Linux / Windows. Regression tests pin cosine ≥ 0.99 (4b) and ≥ 0.999 (5b).

Closes

  • Honest validation of v0.6.3 speed claims ✅
  • Corrected README and release notes ✅
  • Local optimum reached for the current attention path
  • Future structural work (e.g., GPU dispatch, tensor graph IR) tracked separately

v0.6.3 — Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%

08 Apr 00:47


⚠️ Correction

The original release notes claimed 'turbo_kv beats fp32 KV speed'. That was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path had NEON. After adding NEON to fp32 (commit `4490c83`):

| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV (NEON) | — | — | 14.83 | baseline | — |
| `turbo_kv_4b` | 72 | 7.1× | 13.67 | −7.8% | +5.7% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.13 | −11.5% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 13.4 | −9.6% | +13.3% |

The honest story: 9 rounds of Karpathy iteration closed the quant-KV speed gap from −45% to −8%, while the types compress 5.8–9.1×. We do not (yet) beat fp32 raw speed.

What actually changed in this release

Round 5 (the real bottleneck)

`tq_transformer.c`'s `use_quant_kv` path was calling `traits->dequantize` once per cached key per token, which internally ran `tq_rht_inverse()` (O(d log d)) per call — dominating the total cost at long context. Round 5 changes the inner loop to use the type's optimized `traits->attention` kernel, which:

  1. Pre-rotates the query ONCE per layer
  2. Does fused dequant + dot product per block in rotated space
  3. Skips per-position inverse RHT entirely

This took the quant path from 6.9 → 13.7 tok/s on Llama 3.2 3B (a real ~2× speedup). The fast path bypasses the slow path for the common case (no QK-norm-on-stored-keys, no high-res window, no sliding-window attention).
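The reason pre-rotating the query once is legal: for any orthogonal rotation H, ⟨q, k⟩ = ⟨Hq, Hk⟩, so dot products against stored rotated keys need no per-position inverse. A minimal sketch of that invariance, with a plain unscaled Walsh-Hadamard transform standing in for the RHT (the WHT satisfies HᵀH = n·I, so dot products are preserved up to a factor of n; names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* In-place unscaled Walsh-Hadamard transform; n must be a power of two.
 * Stands in for the RHT: an orthogonal-up-to-scale rotation. */
static void fwht(float *v, int n) {
    for (int h = 1; h < n; h <<= 1)
        for (int i = 0; i < n; i += h << 1)
            for (int j = i; j < i + h; j++) {
                float a = v[j], b = v[j + h];
                v[j] = a + b;
                v[j + h] = a - b;
            }
}

static float dotf(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}
```

Because ⟨Hq, Hk⟩ / n equals ⟨q, k⟩, rotating the query once per layer and dotting in rotated space gives the same attention scores as dequantize-then-inverse-rotate, minus all the per-position O(d log d) work.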

Round 6

Hoist LUT in `turbo_kv_4bo` and `turbo_kv_3bo` dequant functions to match the optimization patterns in 3b/4b/5b.

Karpathy round-by-round

| Round | What changed | turbo_kv_4b tok/s |
|---|---|---|
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression — reverted |
| 3 | Apply Round 1 to 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | Transformer uses traits->attention (no per-position RHT inverse) | 13.5 |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.7 |
| 7 | NEON-optimize fp32 path (validation finding) | fp32: 12.6 → 14.8 |

Lessons

The validation step (running the same comparison after fixing the unfair baseline) flipped the headline. This is exactly what validation is for: we caught the wrong claim before it spread.

The Karpathy loop's measure → modify → measure → revert if worse discipline kept us honest at each step. Round 2 looked clever but regressed. Round 5 was an unglamorous transformer-level structural change that nothing in the local optimizations could find — but the systematic measurement revealed it.

What you should use

  • `./build/quant model.gguf` — defaults to `turbo_kv_4b` (best size × quality, 92% of fp32 speed)
  • `-k turbo_kv_5b` — when you need near-lossless quality
  • `-k turbo_kv_3b` — for maximum compression (9.1×) at ~13% PPL cost
  • `-k fp32` — when you have memory to spare and want max speed

Tests

35/35 tests pass. Regression tests pin cosine ≥ 0.99 (4b), ≥ 0.999 (5b), and the 5b ≥ 4b accuracy invariant.

v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)

07 Apr 23:17

Choose a tag to compare

🆕 Per-channel outlier handling

Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.

Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.

Karpathy-loop result: gap cut in half on Llama 3.2 3B

| Type | Bytes/block | PPL | Δ vs FP32 | vs 4b |
|---|---|---|---|---|
| FP32 | — | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 72 | 14.28 | +5.3% | — |
| `turbo_kv_3bo` 🧪 | 80 | 14.03 | +3.5% | −34% gap |
| `turbo_kv_5b` 🏆 quality | 88 | 13.60 | +0.34% | −94% gap |
| `turbo_kv_4bo` 🧪 | 96 | 13.86 | +2.2% | −58% gap |

Honest disclosure: model-dependent

The outlier types are data-dependent. On SmolLM2 135M:

| Type | Bytes/block | PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | — | 18.62 | — |
| `turbo_kv_4b` | 72 | 19.70 | +5.8% |
| `turbo_kv_3bo` | 80 | 20.45 | +9.8% (regression) |
| `turbo_kv_5b` | 88 | 18.94 | +1.7% |
| `turbo_kv_4bo` | 96 | 19.29 | +3.6% |

On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is dominated by 5b. The Pareto-optimal recommendations remain:

  • `turbo_kv_4b` as the default (production)
  • `turbo_kv_5b` for quality (production)
  • `turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)

Why ship them anyway?

  1. Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
  2. Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
  3. Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head

CLI

```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```

Tests

35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality remain unchanged and continue to gate any future regression.

Closes from issue #15

  • ✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent

Still open in #15:

  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • Per-head rotation seeds

v0.6.1 — turbo_kv_5b near-lossless + regression tests

07 Apr 22:33


🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL

5-bit (32-level) Lloyd-Max-Gaussian codebook on RHT-rotated keys, following the same Variant F single-stage architecture as `turbo_kv_4b`. The new quality-maximizing option for users who can spare 22% more KV memory than 4b.

| Type | Bytes/block | Compression | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|---|
| FP32 baseline | 4/elem | — | 13.56 | — |
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.60 | +0.34% |
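For context on how such a codebook is trained: Lloyd-Max alternates nearest-level assignment with per-level centroid updates. A minimal empirical sketch (`lloyd_max_1d` is a hypothetical name; the codebase presumably trains against the Gaussian density rather than samples):

```c
#include <assert.h>
#include <math.h>

/* Sketch of 1-D Lloyd-Max codebook training over samples: assign each
 * sample to its nearest level, then move each level to the mean of its
 * assigned samples. Supports up to 64 levels in this sketch. */
static void lloyd_max_1d(const float *x, int n, float *cb, int levels,
                         int iters) {
    for (int it = 0; it < iters; it++) {
        double sum[64] = {0};
        int cnt[64] = {0};
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int l = 1; l < levels; l++)
                if (fabsf(x[i] - cb[l]) < fabsf(x[i] - cb[best])) best = l;
            sum[best] += x[i];
            cnt[best]++;
        }
        for (int l = 0; l < levels; l++)
            if (cnt[l]) cb[l] = (float)(sum[l] / cnt[l]);
    }
}
```

With 32 levels fitted to the (post-RHT, approximately Gaussian) key distribution, quantization error concentrates where the density is thin, which is why 5 bits lands within a fraction of a percent of FP32 PPL.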

CLI: `./build/quant model.gguf -k turbo_kv_5b`

Regression tests

Three new deterministic tests in `test_turbo_kv.cpp` pin the Variant F quality thresholds so future Karpathy-loop iterations cannot regress past them without failing CI:

  • `KV_4B_AttentionCosine` — `turbo_kv_4b` cosine ≥ 0.99 vs FP32 reference on synthetic data
  • `KV_5B_AttentionCosine` — `turbo_kv_5b` cosine ≥ 0.999
  • `KV_5B_BeatsKV_4B` — invariant: more bits must give ≥ accuracy

Tests use synthetic Gaussian-with-outliers vectors (~3% injected outliers at ±5× scale) and run in < 1 second. No model file needed.

Compatibility

  • Block layout for `turbo_kv_3b`/`4b` is unchanged from v0.6.0 — only new `turbo_kv_5b` type added
  • All 35 unit tests pass on macOS / Linux / Windows

Closes one item from issue #15

The 5-bit codebook variant follow-up from #15 is now shipped. Remaining items: per-channel outlier handling, Llama 3.1 8B + LongBench-E reproduction.

v0.6.0 — turbo_kv_4b champion, beats production baseline

07 Apr 21:53


🏆 Highlights

After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |

The story

The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed exactly zero to attention scores (outputs were byte-identical with it removed). We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.

Full optimization history: bench/results/turboquant_reproduction.md

Other changes

  • CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
  • @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
  • Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `_builtin*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
  • Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
  • Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.

What's tracked for next release

See issue #15:

  • Per-channel outlier handling (Google paper's 32-channel split)
  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • 5-bit codebook variant for ~5 bpc

Bug fixes

  • `tq_qjl.c`: NaN guard requires `dim > 0`
  • `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
  • `tq_transformer.c`: NULL-check key/value cache calloc results
  • `tq_ops.c`: Windows pthread_cond_wait must use SRW variant (CS variant on SRWLOCK = deadlock in test_ops thread pool)

Citations

If you use quant.cpp's KV compression in research, please cite:

v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM

05 Apr 06:36


What's New

Gemma 4 26B-A4B MoE Support

Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.

7x KV Cache Compression

Same hardware, 7x longer context, zero quality loss.

| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |

New Models

  • Llama 3.2 3B Instruct — 17 tok/s, correct code generation
  • Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE

WASM Browser Demo

192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
Try it

Windows (MSVC) Support

Compiles with Visual Studio 2019/2022. pthread shim, C11 atomics compat.

quant.h Synced

Single header now includes Gemma 4, Llama 3, IQ3_XXS support. `cc app.c -lm -lpthread` — done.

Documentation

Performance

  • Gemma 4 26B: 549ms → 257ms/token (-53%)
  • Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)

Bug Fixes

  • Gemma 4 NaN regression, Llama head_dim misdetection
  • TQ_STATIC_ASSERT in C mode, stack buffer overflow
  • Zero build warnings, 34/34 tests pass, score 99.2%

Full changelog: CHANGELOG.md

Full Changelog: v0.2.0...v0.5.0