arXiv draft: add Round 10 to derivation history and update Section 4.2
Section 3.2 Karpathy table now has the 10-round history with the
Round 10 NEON tbl breakthrough callout. Section 4.2 results table
updated with the post-Round 10 numbers (turbo_kv_4b at fp32 parity
+4.5%, PPL +3.8%). Highlights that Round 6 (drop QJL + larger
codebook) and Round 10 (NEON tbl) carried 95% of the value.
Two rounds carried 95% of the value:
**Round 6** dropped the QJL residual stage entirely. Ablation showed the QJL correction term contributed *byte-identical zero* to the final attention scores in our regime. Rather than debug the QJL stage, we removed it and reinvested the freed 16 bytes per block into a finer Lloyd-Max codebook (3-bit → 4-bit, 8 → 16 levels). Combined with max-abs scaling instead of theoretical √d, this single change took `turbo_kv_4b` PPL from 16.03 to 14.28 — a structural simplification, not a tuning win.
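The Round 6 mechanics can be sketched as follows. This is a minimal scalar illustration of block-wise max-abs scaling plus nearest-centroid search against a 16-entry codebook; the centroid values below are illustrative placeholders, not the project's trained Lloyd-Max table, and the function names are ours.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define CODEBOOK_SIZE 16

/* Placeholder 16-level codebook (stand-in for the trained Lloyd-Max
 * Gaussian centroids; actual values are not given in the text). */
static const float codebook[CODEBOOK_SIZE] = {
    -1.00f, -0.75f, -0.55f, -0.40f, -0.28f, -0.18f, -0.10f, -0.03f,
     0.03f,  0.10f,  0.18f,  0.28f,  0.40f,  0.55f,  0.75f,  1.00f
};

/* Quantize one block: max-abs scaling (not the theoretical sqrt(d)),
 * then nearest-centroid search producing a 4-bit index per element. */
static void quantize_block(const float *x, size_t n,
                           unsigned char *idx, float *scale) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > m) m = a;
    }
    *scale = (m > 0.0f) ? m : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = x[i] / *scale;
        size_t best = 0;
        float bestd = fabsf(v - codebook[0]);
        for (size_t k = 1; k < CODEBOOK_SIZE; k++) {
            float d = fabsf(v - codebook[k]);
            if (d < bestd) { bestd = d; best = k; }
        }
        idx[i] = (unsigned char)best;  /* 4-bit index, 1 per byte here */
    }
}

static void dequantize_block(const unsigned char *idx, size_t n,
                             float scale, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = codebook[idx[i]] * scale;
}
```

In the real kernel two 4-bit indices would be packed per byte; the sketch keeps one per byte for clarity.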
**Round 10** is the more recent and arguably more important breakthrough. Profile data at long context (PPL eval, seq_len ~950) revealed the entire 8% speed gap between `turbo_kv_4b` and fp32 was in the attention dot-product loop — matmul code was identical between the two paths. The previous 9 rounds had been iterating local fusions to the inner loop without measuring where time was actually going. The diff was simple: turbo_kv was scalar (per-element LUT load + mul + add) while fp32 was 4-way NEON SIMD. About 2× more instructions per element. The dequant path was **compute-bound, not memory-bound** — surprising for what looked like a memory-bandwidth-light kernel.
The fix uses the AArch64 NEON `vqtbl1q_s8` instruction (available on Apple Silicon), which performs 16 parallel byte-table lookups in a single instruction. We quantize the 16 Lloyd-Max-Gaussian centroids to int8 (~1% precision loss, well below the regression threshold of cosine ≥ 0.99) and store them in a single NEON register. The inner loop processes 32 elements per iteration (16 bytes of `mse_indices` = 32 nibbles) with one `vqtbl1q_s8` per 16 lookups. Result: turbo_kv_4b at fp32 parity (+0.8% on Llama 3.2 3B 3-run average, +4.5% on a single representative run), with PPL also slightly improved (14.33 → 14.08) because the int8 discretization happens to align favorably with key statistics.
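A portable scalar reference for what the table-lookup path computes: unpack two 4-bit indices per byte, index an int8 centroid table, and rescale. On the NEON path the 16-entry `lut` lives in one register and `vqtbl1q_s8` does 16 of these lookups per instruction; the names `packed`, `lut`, and the low-nibble-first ordering are our assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference of the Round 10 dequant inner loop.
 * packed: 4-bit codebook indices, two per byte (low nibble first,
 *         an assumed layout); lut: int8-quantized centroids;
 * scale:  per-block dequant scale. Each packed byte yields 2 floats. */
static void dequant_block_ref(const uint8_t *packed, size_t nbytes,
                              const int8_t lut[16], float scale,
                              float *out) {
    for (size_t i = 0; i < nbytes; i++) {
        uint8_t b = packed[i];
        out[2 * i]     = (float)lut[b & 0x0F] * scale;  /* low nibble  */
        out[2 * i + 1] = (float)lut[b >> 4]   * scale;  /* high nibble */
    }
}
```

The vectorized version replaces the per-element `lut[...]` loads with one `vqtbl1q_s8` per 16 indices, which is exactly the scalar-vs-SIMD gap the profile exposed.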
The honest framing change: from "92% of fp32 KV speed at 7× compression" to **"PARITY with fp32 KV speed at 7× compression"**. The lesson: the answer existed; nine rounds of guessing missed what `--profile` would have revealed in 30 seconds.
### 3.3 Validation: the fp32 baseline correction
**Quality metric**: Forward-pass perplexity via `--ppl` flag (teacher-forced)

**Speed metric**: Tokens per second on the same PPL eval (representative of attention-heavy workloads)
The headline result is `turbo_kv_4b` matching fp32 KV speed at 7.1× memory compression with a 3.8% PPL trade-off. This is the first KV quantization in the project that doesn't lose throughput vs uncompressed KV. The higher-bit variant (5b) preserves quality nearly perfectly (+0.7% PPL); the lower-bit variant (3b) trades quality for compression (9.1× at +13.3% PPL). Both are still on the pre-Round-10 scalar path and will receive the same NEON treatment in v0.7.1.
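One layout that reproduces both compression figures, under assumptions the text does not state: a 32-element block stored as n-bit indices plus a 2-byte scale, against 4 bytes per element for fp32. Block size and scale width here are our guesses for illustration only.

```c
/* Bytes-per-block arithmetic under an ASSUMED layout:
 * fp32:   block_elems * 4 bytes
 * packed: block_elems * index_bits / 8 + scale_bytes
 * With 32 elements and a 2-byte scale this gives
 * 128/18 = 7.1x at 4 bits and 128/14 = 9.1x at 3 bits,
 * matching the reported ratios. */
static double compression_ratio(int block_elems, int index_bits,
                                int scale_bytes) {
    double fp32_bytes   = 4.0 * block_elems;
    double packed_bytes = block_elems * index_bits / 8.0 + scale_bytes;
    return fp32_bytes / packed_bytes;
}
```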
These numbers are with CMake default `TQ_BUILD_METAL=OFF`. The Metal backend is currently a net negative on Apple Silicon at batch-1 inference (per-matmul dispatch overhead exceeds GPU compute benefit) and is disabled by default. See Section 5.4 for the investigation.