## New Pareto point: near-lossless quality + parity speed
`turbo_kv_5b_fast` is a new variant that uses the same Variant F algorithm as `turbo_kv_5b` (RHT + 32-level Lloyd-Max codebook) but stores each 5-bit index as a full byte instead of bit-packing it. This wastes 3 bits per index but eliminates the scalar bit-extraction overhead that kept `turbo_kv_5b` at -8.8% vs fp32 in v0.7.1.
| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | 1× | 13.56 | — | 17.93 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.13 | +1.1% ✅ |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.65 | +0.7% | 16.93 | -5.6% |
| `turbo_kv_5b_fast` 🆕 | 136 | 3.76× | 13.65 | +0.7% | 17.53 | -2.2% near-parity |
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | -10.1% |
### Why it works
After Round 11, the remaining bottleneck in the `turbo_kv_5b` inner loop was the scalar bit extraction needed to unpack 16 indices from 10 bytes (5 bits each, so most indices straddle a byte boundary). The SIMD lookup itself is a single instruction (`vqtbl2q_s8`), but the unpack costs ~16 scalar shift+mask ops per iteration.
By storing 1 byte per index, the unpack becomes a single `vld1q_u8` (16 bytes load = 16 indices) and the inner loop becomes pure SIMD:
```c
for (int d = 0; d + 15 < dim; d += 16) {
    uint8x16_t indices = vld1q_u8(mi + d);            // direct byte load: 16 bytes = 16 indices
    int8x16_t  vals    = vqtbl2q_s8(cb_vec, indices); // 32-entry table lookup
    // ... int8 → fp32 → scale → fma
}
```
This is the cleanest kernel in the codebase. PPL matches `turbo_kv_5b` exactly (13.65 for both, verified): the two variants share the same 32-level codebook and differ only in index storage.
### Trade-off
`turbo_kv_5b_fast` is 1.55× larger per block than `turbo_kv_5b` (136 vs 88 bytes), so compression drops from 5.8× to 3.76×. In exchange you get fp32-parity speed for near-lossless quality.
Use case: "I want the +0.7% PPL of 5b but I can't accept the -5.6% speed gap, and I have memory to spare for the lower compression ratio."
Recommended Pareto choices:
| Goal | Use |
|---|---|
| Best speed × compression × decent quality | `turbo_kv_4b` (default) |
| Near-lossless quality, max compression | `turbo_kv_5b` |
| Near-lossless quality, parity speed | `turbo_kv_5b_fast` (this release) |
| Max compression, OK with quality cost | `turbo_kv_3b` (≥ 3B models only) |
### Tests
35/35 unit tests pass. Same regression test thresholds (cosine ≥ 0.999) as `turbo_kv_5b`.
### What's not in v0.7.2
- 3b/4bo/3bo NEON optimization — separate Pareto choices, lower priority
- AVX2 / WASM SIMD ports of the NEON tbl pattern
- Llama 3.1 8B paper baseline reproduction (memory-constrained)
### Try it
```bash
./build/quant model.gguf -k turbo_kv_5b_fast -v fp16
```