Commit efbc023

arXiv draft: re-baseline Section 4.2 + add Section 5.4 (Metal investigation)

Section 4.2 results table updated with the corrected CPU-only baseline (FP32 18.13 tok/s, turbo_kv_4b 16.60). The Karpathy-loop discipline section (was 5.4) is now Section 5.5; the new Section 5.4 documents the Metal investigation that triggered the v0.6.5 re-baseline.

1 parent b7b20b8 commit efbc023

1 file changed: docs/papers/quant_cpp_arxiv_draft.md

Lines changed: 29 additions & 9 deletions
@@ -122,19 +122,20 @@ This validation step is now part of our standard process: **after any claimed pe
 - **Quality metric**: Forward-pass perplexity via `--ppl` flag (teacher-forced)
 - **Speed metric**: Tokens per second on the same PPL eval (representative of attention-heavy workloads)
 
-### 4.2 Llama 3.2 3B Instruct results
+### 4.2 Llama 3.2 3B Instruct results (CPU-only, CMake default)
 
 | KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
 |:----------|------------:|------------:|----:|----------:|------:|--------------:|
-| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
-| `turbo_kv_5b` (quality) | 88 | 5.8× | **13.65** | **+0.7%** | 13.13 | −11.5% |
-| `turbo_kv_4bo` (research) | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
-| `turbo_kv_4b` (default) | 72 | 7.1× | 14.33 | +5.7% | 13.67 | **−7.8%** |
-| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
-| `turbo_kv_3bo` (research) | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
-| `uniform_4b` (legacy) | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
+| FP32 reference ||| 13.56 || **18.13** | baseline |
+| `turbo_kv_5b` (quality) | 88 | 5.8× | **13.65** | **+0.7%** | 15.43 | −14.9% |
+| `turbo_kv_4bo` (research) | 96 | 5.3× | 13.90 | +2.5% | 15.20 | −16.2% |
+| `turbo_kv_4b` (default) | 72 | 7.1× | 14.33 | +5.7% | **16.60** | **−8.4%** |
+| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 15.77 | −13.0% |
+| `uniform_4b` (legacy) | 68 | 7.5× | 14.60 | +7.7% | 13.27 | −26.8% |
 | llama.cpp `q4_0` KV (lit. survey) | ~70 | ~7.3× | ~14.99 | +10.6% |||
 
+These numbers are with the CMake default `TQ_BUILD_METAL=OFF`. The Metal backend is currently a net negative on Apple Silicon at batch-1 inference (per-matmul dispatch overhead exceeds the GPU compute benefit) and is disabled by default. See Section 5.4 for the investigation.
+
 The Pareto-optimal recommendations are:
 
 - **`turbo_kv_4b`** (default): 7.1× compression, +5.7% PPL, 92% of FP32 KV speed
@@ -200,7 +201,26 @@ Against the published TurboQuant (which we cannot directly run for comparison),
 
 A central design constraint of quant.cpp is single-header portability. The 192 KB WebAssembly binary, the iOS / Android / MSVC support, and the absence of any framework dependency are deliberate choices that exclude many research-grade techniques (e.g., learned codebooks, per-token routing) that would require runtime infrastructure beyond `libc + libm + pthreads`. Variant F was selected partly because it fits into 64 bytes of inline state per 128-element block with no auxiliary tables.
 
-### 5.4 What we learned about Karpathy-loop discipline
+### 5.4 Metal backend investigation: dispatch overhead at batch-1
+
+We initially planned to add Metal compute kernels for the Variant F attention path, hoping to push beyond the CPU NEON ceiling. While benchmarking the existing Metal matmul backend (in the codebase since v0.5) with `TQ_BUILD_METAL=ON`, we discovered that **enabling Metal makes inference 13–40% slower** on every model we tested except the smallest, up to and including the largest model we have access to (Gemma 4 26B-A4B).
+
+| Model | Metal-OFF speedup vs Metal-ON |
+|---|---|
+| SmolLM2 135M | neutral (within noise) |
+| Llama 3.2 1B | +13–17% |
+| Llama 3.2 3B | +14–22% |
+| Gemma 4 26B-A4B | **+40%** |
+
+The current Metal path uses per-matmul dispatch with `commit + waitUntilCompleted` at flush points. At batch-1 inference the per-op dispatch overhead exceeds the GPU compute benefit; this is the same issue that killed earlier attempts at a full GPU compute graph.
+
+The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast CPU path. But our internal benchmarks for v0.6.0–v0.6.4 used `-DTQ_BUILD_METAL=ON` and were therefore 14–22% slower than what users actually got. v0.6.5 republished the corrected numbers (this section reflects the corrected baseline).
+
+The lesson: **always benchmark with the exact build flags a user gets from `cmake -B build`, not the flags in your dev environment**. A parallel `build_default/` directory built without overrides is the canonical comparison.
+
+We did not pursue adding Metal kernels for turbo_kv attention because the existing Metal path needs to be fixed (or removed) first; adding more Metal kernels would compound the problem. Issue #16 in the project tracker documents the investigation plan: profile the source of the dispatch overhead, find a model-size threshold above which Metal wins, or remove the Metal path entirely.
+
+### 5.5 What we learned about Karpathy-loop discipline
 
 Two lessons stand out:
 