## New Pareto point: near-lossless quality + parity speed
`turbo_kv_5b_fast` is a new variant that uses the same Variant F algorithm as `turbo_kv_5b` (RHT + 32-level Lloyd-Max codebook) but stores each 5-bit index as a full byte instead of bit-packing it. This wastes 3 bits per index but eliminates the scalar bit-extraction overhead that kept `turbo_kv_5b` at -8.8% vs fp32 in v0.7.1.
| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | 1× | 13.56 | — | 17.93 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.13 | +1.1% ✅ |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.65 | +0.7% | 16.93 | -5.6% |
| `turbo_kv_5b_fast` 🆕 | 136 | 3.76× | 13.65 | +0.7% | 17.53 | -2.2% near-parity |
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | -10.1% |
### Why it works
After Round 11, the remaining bottleneck in the `turbo_kv_5b` inner loop was the scalar bit extraction needed to unpack 16 indices from 10 bytes (5 bits each, so most indices straddle a byte boundary). The SIMD lookup itself is a single instruction (`vqtbl2q_s8`), but the unpack costs ~16 scalar shift+mask ops per iteration.
By storing 1 byte per index, the unpack becomes a single `vld1q_u8` (16 bytes load = 16 indices) and the inner loop becomes pure SIMD:
```c
for (int d = 0; d + 15 < dim; d += 16) {
    uint8x16_t indices = vld1q_u8(mi + d);            // direct byte load: 16 bytes = 16 indices
    int8x16_t  vals    = vqtbl2q_s8(cb_vec, indices); // 32-entry table lookup
    // ... int8 → fp32 → scale → fma
}
```
This is the cleanest kernel in the codebase. PPL matches `turbo_kv_5b` exactly (13.65 for both, verified): the two variants share the same 32-level codebook and differ only in index storage.
### Trade-off
`turbo_kv_5b_fast` is 1.55× larger per block than `turbo_kv_5b` (136 vs 88 bytes), so compression drops from 5.8× to 3.76×. In exchange you get fp32-parity speed for near-lossless quality.
Use case: "I want the +0.7% PPL of 5b but I can't accept the -5.6% speed gap, and I have memory to spare for the lower compression ratio."
Recommended Pareto choices:
| Goal | Use |
|---|---|
| Best speed × compression × decent quality | `turbo_kv_4b` (default) |
| Near-lossless quality, max compression | `turbo_kv_5b` |
| Near-lossless quality, parity speed | `turbo_kv_5b_fast` (this release) |
| Max compression, OK with quality cost | `turbo_kv_3b` (≥ 3B models only) |
### Tests
35/35 unit tests pass. Same regression test thresholds (cosine ≥ 0.999) as `turbo_kv_5b`.
### What's not in v0.7.2
- 3b/4bo/3bo NEON optimization — separate Pareto choices, lower priority
- AVX2 / WASM SIMD ports of the NEON tbl pattern
- Llama 3.1 8B paper baseline reproduction (memory-constrained)
### Try it
```bash
./build/quant model.gguf -k turbo_kv_5b_fast -v fp16
```