Releases · quantumaikr/quant.cpp

08 Apr 16:00

unamedkr

v0.7.2

c8a76ef

v0.7.2 — turbo_kv_5b_fast: near-lossless at fp32 parity speed (1-byte layout) Latest

Latest

New Pareto point: near-lossless quality + parity speed

`turbo_kv_5b_fast` is a new variant that uses the same Variant F algorithm as `turbo_kv_5b` (RHT + 32-level Lloyd-Max codebook) but stores each 5-bit index as a full byte instead of bit-packed. This wastes 3 bits per index but eliminates the scalar bit-extraction overhead that kept `turbo_kv_5b` at -8.8% vs fp32 in v0.7.1.

Type	Bytes/block	Compression	PPL	Δ vs FP32	tok/s	vs FP32 speed
FP32 KV	—	1×	13.56	—	17.93	baseline
`turbo_kv_4b` ⭐ default	72	7.1×	14.08	+3.8%	18.13	+1.1% ✅
`turbo_kv_5b` 🏆 quality	88	5.8×	13.65	+0.7%	16.93	-5.6%
`turbo_kv_5b_fast` 🆕	136	3.76×	13.65	+0.7%	17.53	-2.2% near-parity
`turbo_kv_3b`	56	9.1×	15.36	+13.3%	16.57	-10.1%

Why it works

The 5b inner loop bottleneck after Round 11 was the scalar bit-extraction needed to unpack 16 indices from 10 bytes (5 bits each, irregular byte boundaries). The SIMD lookup itself is 1 instruction (`vqtbl2q_s8`) but the unpack costs ~16 scalar shift+mask ops per iteration.

By storing 1 byte per index, the unpack becomes a single `vld1q_u8` (16 bytes load = 16 indices) and the inner loop becomes pure SIMD:

```c
for (d = 0; d + 15 < dim; d += 16) {
uint8x16_t indices = vld1q_u8(mi + d); // direct byte load
int8x16_t vals = vqtbl2q_s8(cb_vec, indices); // 32-entry table lookup
// ... int8 → fp32 → scale → fma
}
```

The cleanest implementation in the codebase. Same PPL as `turbo_kv_5b` (verified — 13.65 exactly, both share the same 32-level codebook).

Trade-off

`turbo_kv_5b_fast` is 1.55× larger per block than `turbo_kv_5b` (136 vs 88 bytes), so compression drops from 5.8× to 3.76×. In exchange you get fp32-parity speed for near-lossless quality.

Use case: "I want the +0.7% PPL of 5b but I can't accept the -5.6% speed gap, and I have memory to spare for the lower compression ratio."

Recommended Pareto choices:

Goal	Use
Best speed × compression × decent quality	`turbo_kv_4b` (default)
Near-lossless quality, max compression	`turbo_kv_5b`
Near-lossless quality, parity speed	`turbo_kv_5b_fast` (this release)
Max compression, OK with quality cost	`turbo_kv_3b` (≥ 3B models only)

Tests

35/35 unit tests pass. Same regression test thresholds (cosine ≥ 0.999) as `turbo_kv_5b`.

What's not in v0.7.2

3b/4bo/3bo NEON optimization — separate Pareto choices, lower priority
AVX2 / WASM SIMD ports of the NEON tbl pattern
Llama 3.1 8B paper baseline reproduction (memory-constrained)

Try it

```bash
./build/quant model.gguf -k turbo_kv_5b_fast -v fp16
```

Assets 2

08 Apr 14:53

unamedkr

v0.7.1

e6ac01c

v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)

Round 11: same primitive, different bit-packing

Round 10 (v0.7.0) achieved fp32 KV parity for `turbo_kv_4b` via NEON `vqtbl1q_s8` table lookup. Round 11 applies the same SIMD codebook lookup pattern to the remaining production variants. Result: large improvement for 5b and 3b but not full parity, because their bit-unaligned packing creates a new bottleneck in the unpack stage.

Type	tok/s (3-run avg)	vs FP32	PPL Δ	Compression
FP32	18.43	baseline	—	1×
`turbo_kv_4b` ⭐ default	18.17	−1.4% ✅ parity	+3.8%	7.1×
`turbo_kv_5b` 🏆 quality	16.80	−8.8%	+0.7%	5.8×
`turbo_kv_3b`	16.57	−10.1%	+13.3%	9.1×

5b made a +9% jump (from −14.5% to −8.8%). 3b improved +3 percentage points.

Why 4b reached parity but 5b/3b didn't

Type	Bit packing	Unpack	Result
4b	byte-aligned (2 nibbles per byte)	pure SIMD `vandq_u8` + `vshrq_n_u8`	parity ✅
3b	bit-aligned (irregular 3-bit fields)	uint64 read + 16 scalar shifts	−10.1%
5b	bit-aligned (irregular 5-bit fields)	uint64 read + 16 scalar shifts	−8.8%

For 3-bit and 5-bit, 16 indices straddle byte boundaries irregularly. We use the fastest scalar unpack we found, but it costs ~16 instructions per 16-element iteration. The SIMD lookup itself is 1 instruction. So the unpack dominates the runtime for 3b/5b.

Bonus insight: matmul already used the same pattern

While investigating other optimization axes, we discovered the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) already uses `vqtbl1q_s8` for codebook lookup — has been since v0.5. That's why fp32 and turbo_kv have identical matmul time (38.6 vs 38.9 ms in profile): they share the same NEON tbl matmul kernel.

The "breakthrough" of Round 10 was applying a primitive we'd already been using for matmul to the attention path. Profile-driven analysis would have spotted this in week 1.

What's not in v0.7.1

5b/3b at full parity. Closing the remaining gap needs either:
- Layout change: 1 byte per index, sacrificing compression (5b would go 5.8× → 3.6×)
- SIMD bit-extraction trick: `vshlq` + bit-mask patterns, complex
- Acceptance: ship 5b/3b at near-parity with honest disclosure (chosen for v0.7.1)
`turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
AVX2 / WASM SIMD ports

Tests

35/35 unit tests pass. PPL unchanged across all variants from Round 10.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default — fp32 parity at 7.1× compression
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, slightly slower
./build/quant model.gguf -k turbo_kv_3b # max compression, +13% PPL trade
```

Cross-session lesson

Two sessions of 11 Karpathy rounds total. Key learnings now in persistent memory:

Profile before optimizing — Round 10 found in 30s what 9 rounds of guessing missed
SIMD table lookup pattern — vqtbl1q_s8 / vqtbl2q_s8 / vtbl1_s8 for small codebooks
SIMD unpack constraint — byte-alignment matters as much as the primitive itself
Ship honest — 5b/3b are not at parity, the README and CHANGELOG say so explicitly

Assets 2

08 Apr 13:03

unamedkr

v0.7.0

6c3c60e

v0.7.0 — Round 10 BREAKTHROUGH: turbo_kv_4b matches fp32 KV speed at 7.1× compression

🏆 The breakthrough we've been chasing for 3 sessions

After 10 rounds of Karpathy iteration, `turbo_kv_4b` now runs at fp32 KV parity on Llama 3.2 3B PPL eval — at the same time matching fp32's PPL closely (within 4%) and delivering 7.1× memory compression. This is the moment where the value proposition fundamentally changes.

Type	Bytes/block	Compression	PPL	Δ vs FP32	tok/s	vs FP32 speed
FP32 KV	—	1×	13.56	—	17.9	baseline
`turbo_kv_4b` ⭐ default	72	7.1×	14.08	+3.8%	18.7	+4.5% ⬆

The Karpathy story

Rounds 1–9 had been doing local fusions to the inner loop without measuring where time was actually going. Then we ran the existing `--profile` flag at long context (PPL eval, seq_len ~950) and finally saw the truth:

```
matmul attention other total
fp32 38.6 ms 15.7 ms 1.4 ms 55.7 ms
turbo_kv_4b 38.9 ms 19.8 ms 1.8 ms 60.5 ms
delta +0.3 +4.1 +0.4 +4.8 ← entire gap is in attention
```

The matmul code path is identical between fp32 and turbo_kv (it's a Q/K/V projection over Q4 weights). The 8% speed gap was entirely in the attention dot-product loop.

Root cause: turbo_kv inner loop was scalar (LUT load + mul + add per element) while fp32 was 4-way NEON SIMD. About 2× more instructions per element. The dequant lookup had become compute-bound, not memory-bound — surprising because we'd assumed memory was the bottleneck.

The fix: NEON vqtbl1q_s8 (Round 10)

Apple Silicon NEON has `vqtbl1q_s8`, a single instruction that does a 16-byte table lookup with 16 lanes. Perfect for our 16-entry codebook.

```c
// One-time at startup: quantize 16 Lloyd-Max-Gaussian centroids to int8
static int8_t s_cb_i8[16];
for (int j = 0; j < 16; j++) {
s_cb_i8[j] = (int8_t)(cb[j] * (127.0f / 2.7326f)); // ~1% precision loss
}
int8x16_t cb_vec = vld1q_s8(s_cb_i8);

// Per attention call, per block:
for (d = 0; d + 31 < dim; d += 32) {
uint8x16_t bytes = vld1q_u8(mi + d/2); // 16 bytes = 32 nibbles
uint8x16_t low_nib = vandq_u8(bytes, vdupq_n_u8(0x0F));
uint8x16_t high_nib = vshrq_n_u8(bytes, 4);
int8x16_t low_vals = vqtbl1q_s8(cb_vec, low_nib); // 1 instruction, 16 gathers
int8x16_t high_vals = vqtbl1q_s8(cb_vec, high_nib);
int8x16x2_t inter = vzipq_s8(low_vals, high_vals); // interleave
// ... int8 → int16 → fp32 → multiply scale → vfmaq_f32
}
```

32 elements per iteration (vs 8 in the previous scalar version), with one `vqtbl1q_s8` per 16 lookups instead of 16 scalar L1 hits.

Cross-model verification

Model	Speed gap (R9 → R10)	PPL (R10)
SmolLM2 135M	-14.5% → -3.1%	+5.7%
Llama 3.2 1B	-16.3% → -1.3%	+5.4%
Llama 3.2 3B	-8.4% → +4.5% ⬆	+3.8%

All three models show massive speed improvement. Llama 3.2 3B is now at parity. PPL also slightly improved on all three (the int8 discretization happens to align favorably with key statistics).

Honest framing change

Before v0.7.0	After v0.7.0
"92% of fp32 speed at 7× compression"	"PARITY with fp32 speed at 7× compression"

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default (now fp32-parity)
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, still scalar
```

What's NOT in v0.7.0

The 5b/3b variants still use the previous scalar inner loop. Their numbers in the table are from Round 9. v0.7.1 will apply the same NEON tbl pattern to them (8-entry table for 3b, 32-entry split table for 5b).

Tests

35/35 unit tests pass. Regression tests pin attention cosine ≥ 0.99 (4b) — the int8 codebook precision loss is well within bounds.

The lesson

The user kept pushing: "답은 언제나 존재합니다. 그것을 찾아내는게 어려울 뿐입니다." (The answer always exists; finding it is the hard part.)

For 9 rounds we had been guessing at local optimizations. Round 10 was the result of:

Stopping the guessing and running the existing `--profile` flag
Reading the data: the entire gap was in attention, not matmul
Web search for similar optimization patterns (NEON tbl, MLX implementations, sparse V)
Choosing the right SIMD primitive (`vqtbl1q_s8`) for our specific 16-entry codebook
Accepting the small precision loss (int8 vs fp32 LUT) because the regression tests guard quality

Three sessions of careful Karpathy discipline + one round of profile-driven analysis = the answer existed all along.

Assets 2

08 Apr 11:17

unamedkr

v0.6.5

b7b20b8

v0.6.5 — Re-baseline without Metal (3rd honest correction)

🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only

While exploring Option P3 (GPU compute graph for KV attention), we measured the existing Metal backend (`TQ_BUILD_METAL=ON`) and discovered it is net negative on every model size we tested.

The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)

Build	KV type	tok/s
Metal ON	fp32	15.07
Metal OFF	fp32	17.87 (+19%)
Metal ON	turbo_kv_4b	14.17
Metal OFF	turbo_kv_4b	16.53 (+17%)
Metal ON	turbo_kv_5b	13.43
Metal OFF	turbo_kv_5b	15.33 (+14%)

Across model sizes:

Model	Metal-OFF speedup
SmolLM2 135M	neutral
Llama 3.2 1B	+13–17%
Llama 3.2 3B	+14–22%
Gemma 4 26B	+40%

Even on the largest model we have access to, Metal is net negative. The per-matmul dispatch + commit + waitUntilCompleted pattern has overhead that exceeds the GPU compute benefit at batch-1 inference.

Impact on past benchmarks

The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. The bug was in our internal benchmark methodology: we built with `-DTQ_BUILD_METAL=ON` and reported numbers that were 14–22% slower than what users actually get.

This means our v0.6.0–v0.6.4 release notes UNDERSTATE the project's actual speed by 14–22% on Apple Silicon. v0.6.5 republishes the corrected numbers.

Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)

Type	Bytes/block	Compression	tok/s	vs FP32	PPL Δ
FP32 KV	—	1×	18.13	baseline	—
`turbo_kv_4b` ⭐ default	72	7.1×	16.60	−8.4%	+5.7%
`turbo_kv_5b` 🏆 quality	88	5.8×	15.43	−14.9%	+0.7%
`turbo_kv_3b`	56	9.1×	15.77	−13.0%	+13.3%
`turbo_kv_4bo` 🧪	96	5.3×	15.20	−16.2%	+2.5%
`uniform_4b`	68	7.5×	13.27	−26.8%	+7.7%

The relative gaps are essentially unchanged (turbo_kv_4b is still ~8% slower than fp32) — both paths got the same ~20% speedup from removing Metal overhead. Pareto rankings unchanged.

What we did NOT do

The original P3 plan was to add Metal kernels for the new turbo_kv_4b/5b attention path. We abandoned that plan after measuring the existing Metal backend is already net negative — adding more Metal kernels would compound the problem until the existing dispatch path is fixed. See issue #16 for the investigation plan.

The third honest correction

This is the third honest correction we've caught and fixed before it spread:

v0.6.0: "lossless 7× compression" → measured "+6.3% PPL"
v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (the user default)"

Each correction was caught by our validation discipline. Validation > marketing.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality
```

Do not add `-DTQ_BUILD_METAL=ON` until issue #16 is resolved.

Tests

35/35 unit tests pass on macOS / Linux / Windows.

Filed

Issue #16 — Metal backend currently slower than CPU-only on all tested models

Assets 2

08 Apr 02:43

unamedkr

v0.6.4

b78ae1c

v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims

⚠️ This release exists because validation matters

v0.6.3 shipped with the headline 'turbo_kv beats fp32 KV speed'. After running the comparison again with both paths NEON-optimized, that claim was wrong. v0.6.4 publishes the honest numbers.

Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)

Type	Avg tok/s	vs FP32	PPL	PPL Δ	Compression
FP32 KV (NEON)	14.63	baseline	13.56	—	1×
`turbo_kv_4b` ⭐ default	13.57	−7.2%	14.33	+5.7%	7.1×
`turbo_kv_3b`	13.13	−10.2%	15.36	+13.3%	9.1×
`turbo_kv_5b` 🏆 quality	12.90	−11.8%	13.65	+0.7%	5.8×

The Round 5 optimization in v0.6.3 (transformer → traits->attention) was real and meaningful: turbo_kv_4b went from 6.9 → 13.57 tok/s (+97%). What was wrong was the comparison baseline: fp32 was unoptimized scalar.

What changed in v0.6.4

File	Change
`tq_transformer.c`	NEON-optimized fp32 attention path. fp32 went 12.6 → 14.83 tok/s (+18%).
`README.md`, `README.ko.md`	Headline tables and ASCII charts updated with honest numbers and a Correction note.
`CHANGELOG.md`	v0.6.3 entry has a prominent Correction notice; v0.6.4 entry documents the validation pass.
v0.6.3 release notes	Updated with the same Correction notice.
`tq_transformer.c`	Round 8 prefetch attempt and Round 9 strided-attention concept reverted (no measurable benefit).

What we learned

Validation is the most valuable step. It found the wrong claim before it spread to users.

The 9-round Karpathy loop was in good faith but the comparison baseline was unfair. Once we fixed the unfair baseline, the headline flipped from 'beats fp32' to 'within 8% of fp32 with 7× compression'. Both stories are interesting — but only one is true.

Pareto position (still strong)

`turbo_kv_4b` is still strictly better than `uniform_4b` on every axis:

	turbo_kv_4b	uniform_4b
PPL on Llama 3.2 3B	14.33	14.60
Speed	13.57 tok/s	11.7 tok/s
Compression	7.1×	7.5×

The compression edge for uniform_4b is marginal; turbo_kv_4b wins on the other two by clear margins.

Tests

35/35 unit tests pass on macOS / Linux / Windows. Regression tests pin cosine ≥ 0.99 (4b) and ≥ 0.999 (5b).

Closes

Honest validation of v0.6.3 speed claims ✅
Corrected README and release notes ✅
Local optimum reached for the current attention path
Future structural work (e.g., GPU dispatch, tensor graph IR) tracked separately

Assets 2

08 Apr 00:47

unamedkr

v0.6.3

c58d4d7

v0.6.3 — Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%

⚠️ Correction

The original release notes claimed 'turbo_kv beats fp32 KV speed'. That was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path had NEON. After adding NEON to fp32 (commit `4490c83`):

Type	Bytes/block	Compression	tok/s	vs FP32	PPL Δ
FP32 KV (NEON)	—	1×	14.83	baseline	—
`turbo_kv_4b` ⭐	72	7.1×	13.67	−7.8%	+5.7%
`turbo_kv_5b` 🏆	88	5.8×	13.13	−11.5%	+0.7%
`turbo_kv_3b`	56	9.1×	13.4	−9.6%	+13.3%

The honest story: 9 rounds of Karpathy iteration closed the quant-KV speed gap from −45% to −8%, while the types compress 5.8–9.1×. We do not (yet) beat fp32 raw speed.

What actually changed in this release

Round 5 (the real bottleneck)

`tq_transformer.c`'s `use_quant_kv` path was calling `traits->dequantize` once per cached key per token, which internally ran `tq_rht_inverse()` (O(d log d)) per call — dominating the total cost at long context. Round 5 changes the inner loop to use the type's optimized `traits->attention` kernel, which:

Pre-rotates the query ONCE per layer
Does fused dequant + dot product per block in rotated space
Skips per-position inverse RHT entirely

This took the quant path from 6.9 → 13.7 tok/s on Llama 3.2 3B (a real ~2× speedup). The fast path bypasses the slow path for the common case (no QK-norm-on-stored-keys, no high-res window, no sliding-window attention).

Round 6

Hoist LUT in `turbo_kv_4bo` and `turbo_kv_3bo` dequant functions to match the optimization patterns in 3b/4b/5b.

Karpathy round-by-round

Round	What changed	turbo_kv_4b tok/s
0	Baseline (per-position dequant + inline dot)	6.9
1	Single-pass dequant with hoisted LUT	7.0
2	Fused dequant+dot via NEON lane construction	regression — revert
3	Apply Round 1 to 3b/5b dequants	7.0
4	Pure scalar fused with 4 accumulators	7.0
5	transformer uses traits->attention (no per-pos RHT inverse)	13.5 ✅
6	Hoist LUT in 4bo/3bo dequants	13.7
7	NEON-optimize fp32 path (validation finding)	fp32: 12.6 → 14.8

Lessons

The validation step (running the same comparison after fixing the unfair baseline) flipped the headline. This is exactly what validation is for. We caught the wrong claim before it shipped to users.

The Karpathy loop's measure → modify → measure → revert if worse discipline kept us honest at each step. Round 2 looked clever but regressed. Round 5 was an unglamorous transformer-level structural change that nothing in the local optimizations could find — but the systematic measurement revealed it.

What you should use

`./build/quant model.gguf` — defaults to `turbo_kv_4b` (best size × quality, 92% of fp32 speed)
`-k turbo_kv_5b` — when you need near-lossless quality
`-k turbo_kv_3b` — for maximum compression (9.1×) at ~13% PPL cost
`-k fp32` — when you have memory to spare and want max speed

Tests

35/35 tests pass. Regression tests pin cosine ≥ 0.99 (4b/5b) and 5b ≥ 4b accuracy invariant.

Assets 2

07 Apr 23:17

unamedkr

v0.6.2

0f1d1a6

v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)

🆕 Per-channel outlier handling

Two new research types add per-block outlier handling on top of the Variant F base, validating the technique from the Google TurboQuant paper.

Each block stores the K=8 channels with the largest `|rotated[i]|` as exact FP16 values that overwrite the codebook reconstruction at dequant time. The non-outlier channels share a tighter codebook (max-abs computed from the body only), so the codebook doesn't waste resolution on the tails the outliers already capture exactly.

Karpathy-loop result: gap cut in half on Llama 3.2 3B

Type	Bytes/block	PPL	Δ vs FP32	vs 4b
FP32	—	13.56	—	—
`turbo_kv_4b` ⭐ default	72	14.28	+5.3%	—
`turbo_kv_3bo` 🧪	80	14.03	+3.5%	−34% gap
`turbo_kv_5b` 🏆 quality	88	13.60	+0.34%	−94% gap
`turbo_kv_4bo` 🧪	96	13.86	+2.2%	−58% gap

Honest disclosure: model-dependent

The outlier types are data-dependent. On SmolLM2 135M:

Type	Bytes	PPL	Δ vs FP32
FP32	—	18.62	—
`turbo_kv_4b`	72	19.70	+5.8%
`turbo_kv_3bo`	80	20.45	+9.8% (regression)
`turbo_kv_5b`	88	18.94	+1.7%
`turbo_kv_4bo`	96	19.29	+3.6%

On a smaller-dimension model the 3-bit base in 3bo is too coarse even with outliers, and 4bo is dominated by 5b. The Pareto-optimal recommendations remain:

`turbo_kv_4b` as the default (production)
`turbo_kv_5b` for quality (production)
`turbo_kv_4bo` / `turbo_kv_3bo` as research types (selectable via `-k turbo_kv_4bo` / `turbo_kv_3bo`)

Why ship them anyway?

Validates the per-channel outlier technique — proves that local outlier handling closes meaningful PPL gap on heavy-tailed models
Data point for Issue #15 and the ongoing TurboQuant paper reproduction work
Foundation for per-model auto-selection — a future release could pick 4b vs 4bo per layer/head

CLI

```bash
./build/quant model.gguf -k turbo_kv_4bo # research, 96B blocks
./build/quant model.gguf -k turbo_kv_3bo # research, 80B blocks
./build/quant model.gguf -k turbo_kv_5b # production quality, 88B blocks
./build/quant model.gguf # default = turbo_kv_4b, 72B blocks
```

Tests

35/35 unit tests pass. The existing regression tests on `turbo_kv_4b` and `turbo_kv_5b` cosine quality remain unchanged and continue to gate any future regression.

Closes from issue #15

✅ Per-channel outlier handling (Google paper's 32-channel split) — explored, model-dependent

Still open in #15:

Paper-faithful Llama 3.1 8B + LongBench-E reproduction
Per-head rotation seeds

Assets 2

07 Apr 22:33

unamedkr

v0.6.1

475872c

v0.6.1 — turbo_kv_5b near-lossless + regression tests

🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL

5-bit (32-level) Lloyd-Max-Gaussian codebook on RHT-rotated keys, following the same Variant F single-stage architecture as `turbo_kv_4b`. The new quality-maximizing option for users who can spare 22% more KV memory than 4b.

Type	Bytes/block	Compression	Llama 3.2 3B PPL	Δ vs FP32
FP32 baseline	4/elem	1×	13.56	—
`turbo_kv_3b`	56	9.1×	15.39	+13.5%
`turbo_kv_4b` ⭐ default	72	7.1×	14.28	+5.3%
`turbo_kv_5b` 🏆	88	5.8×	13.60	+0.34%

CLI: `./build/quant model.gguf -k turbo_kv_5b`

Regression tests

Three new deterministic tests in `test_turbo_kv.cpp` pin the Variant F quality thresholds so future Karpathy-loop iterations cannot regress past them without failing CI:

`KV_4B_AttentionCosine` — `turbo_kv_4b` cosine ≥ 0.99 vs FP32 reference on synthetic data
`KV_5B_AttentionCosine` — `turbo_kv_5b` cosine ≥ 0.999
`KV_5B_BeatsKV_4B` — invariant: more bits must give ≥ accuracy

Tests use synthetic Gaussian-with-outliers vectors (~3% injected outliers at ±5× scale) and run in < 1 second. No model file needed.

Compatibility

Block layout for `turbo_kv_3b`/`4b` is unchanged from v0.6.0 — only new `turbo_kv_5b` type added
All 35 unit tests pass on macOS / Linux / Windows

Closes one item from issue #15

The 5-bit codebook variant follow-up from #15 is now shipped. Remaining items: per-channel outlier handling, Llama 3.1 8B + LongBench-E reproduction.

Assets 2

07 Apr 21:53

unamedkr

v0.6.0

5f6387b

v0.6.0 — turbo_kv_4b champion, beats production baseline

🏆 Highlights

After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.

KV type	Bits/elem	Llama 3.2 3B PPL	Δ vs FP32
FP32 baseline	32	13.56	—
`turbo_kv_4b` ⭐	4	14.28	+5.3%
`uniform_4b`	4	14.41	+6.3%
`turbo_kv_3b`	3	15.39	+13.5%
llama.cpp q4_0 KV (rough)	4	~14.99	+10.6%

The story

The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed byte-identical zero to attention scores. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit / 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.

Full optimization history: bench/results/turboquant_reproduction.md

Other changes

CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
@quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `_builtin*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.

What's tracked for next release

See issue #15:

Per-channel outlier handling (Google paper's 32-channel split)
Paper-faithful Llama 3.1 8B + LongBench-E reproduction
5-bit codebook variant for ~5 bpc

Bug fixes

`tq_qjl.c`: NaN guard requires `dim > 0`
`tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
`tq_transformer.c`: NULL-check key/value cache calloc results
`tq_ops.c`: Windows pthread_cond_wait must use SRW variant (CS variant on SRWLOCK = deadlock in test_ops thread pool)

Citations

If you use quant.cpp's KV compression in research, please cite:

Assets 2

05 Apr 06:36

unamedkr

v0.5.0

d78b3f8

v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM

What's New

Gemma 4 26B-A4B MoE Support

Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.

7x KV Cache Compression

Same hardware, 7x longer context, zero quality loss.

Model	FP16 KV	quant.cpp KV	Gain
Llama 3.2 3B (16GB Mac)	50K tokens	350K tokens	6.9x
Gemma 4 26B (16GB Mac)	4K tokens	30K tokens	6.9x

New Models

Llama 3.2 3B Instruct — 17 tok/s, correct code generation
Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE

WASM Browser Demo

192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
→ Try it

Windows (MSVC) Support

Compiles with Visual Studio 2019/2022. pthread shim, C11 atomics compat.

quant.h Synced

Single header now includes Gemma 4, Llama 3, IQ3_XXS support. cc app.c -lm -lpthread — done.

Documentation

API Reference — 730 lines, all platforms
Custom Quantization Guide — add your own KV type in 3 functions
ROADMAP — project direction

Performance

Gemma 4 26B: 549ms → 257ms/token (-53%)
Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)

Bug Fixes

Gemma 4 NaN regression, Llama head_dim misdetection
TQ_STATIC_ASSERT in C mode, stack buffer overflow
Zero build warnings, 34/34 tests pass, score 99.2%

Full changelog: CHANGELOG.md

Full Changelog: v0.2.0...v0.5.0

Assets 4

Releases: quantumaikr/quant.cpp

v0.7.2 — turbo_kv_5b_fast: near-lossless at fp32 parity speed (1-byte layout)

New Pareto point: near-lossless quality + parity speed

Why it works

Trade-off

Tests

What's not in v0.7.2

Try it

Uh oh!

v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)

Round 11: same primitive, different bit-packing

Why 4b reached parity but 5b/3b didn't

Bonus insight: matmul already used the same pattern

What's not in v0.7.1

Tests

What you should use

Cross-session lesson

Uh oh!

v0.7.0 — Round 10 BREAKTHROUGH: turbo_kv_4b matches fp32 KV speed at 7.1× compression

🏆 The breakthrough we've been chasing for 3 sessions

The Karpathy story

The fix: NEON vqtbl1q_s8 (Round 10)

Cross-model verification

Honest framing change

What you should use

What's NOT in v0.7.0

Tests

The lesson

Uh oh!

v0.6.5 — Re-baseline without Metal (3rd honest correction)

🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only

The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)

Impact on past benchmarks

Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)

What we did NOT do

The third honest correction

What you should use

Tests

Filed

Uh oh!

v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims

⚠️ This release exists because validation matters

Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)

What changed in v0.6.4

What we learned

Pareto position (still strong)

Tests

Closes

Uh oh!

v0.6.3 — Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%

⚠️ Correction

What actually changed in this release

Round 5 (the real bottleneck)

Round 6

Karpathy round-by-round

Lessons

What you should use

Tests

Uh oh!

v0.6.2 — Per-channel outlier handling (turbo_kv_4bo / turbo_kv_3bo)

🆕 Per-channel outlier handling

Karpathy-loop result: gap cut in half on Llama 3.2 3B

Honest disclosure: model-dependent

Why ship them anyway?

CLI

Tests

Closes from issue #15

Uh oh!

v0.6.1 — turbo_kv_5b near-lossless + regression tests

🆕 turbo_kv_5b — near-lossless KV at +0.34% PPL

Regression tests

Compatibility

Closes one item from issue #15

Uh oh!

v0.6.0 — turbo_kv_4b champion, beats production baseline

🏆 Highlights

The story

Other changes

What's tracked for next release

Bug fixes