Section 4.2 results table updated with the corrected CPU-only baseline
(FP32 18.13 tok/s, turbo_kv_4b 16.60). The former Section 5.4 is now
Section 5.5; the new Section 5.4 documents the Metal investigation
that triggered the v0.6.5 re-baseline.
These numbers are with the CMake default `TQ_BUILD_METAL=OFF`. The Metal backend is currently a net negative on Apple Silicon at batch-1 inference (per-matmul dispatch overhead exceeds the GPU compute benefit) and is disabled by default. See Section 5.4 for the investigation.
A central design constraint of quant.cpp is single-header portability. The 192 KB WebAssembly binary, the iOS / Android / MSVC support, and the absence of any framework dependency are deliberate choices that exclude many research-grade techniques (e.g., learned codebooks, per-token routing) that would require runtime infrastructure beyond `libc + libm + pthreads`. Variant F was selected partly because it fits into 64 bytes of inline state per 128-element block with no auxiliary tables.
### 5.4 Metal backend investigation: dispatch overhead at batch-1
We initially planned to add Metal compute kernels for the Variant F attention path, hoping to push beyond the CPU NEON ceiling. While benchmarking the existing Metal matmul backend (which has been in the codebase since v0.5) with `TQ_BUILD_METAL=ON`, we discovered that **enabling Metal makes inference 13–40% slower** on every model size we tested, including the largest model we have access to (Gemma 4 26B-A4B).
| Model | Metal-OFF speedup vs Metal-ON |
|---|---|
| SmolLM2 135M | neutral (within noise) |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B-A4B | **+40%** |
The current Metal path uses per-matmul dispatch with `commit + waitUntilCompleted` at flush points. The per-op dispatch overhead exceeds the GPU compute benefit at batch-1 inference. This is the same issue that killed earlier attempts at a full GPU compute graph.
The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast CPU path. But our internal benchmarks for v0.6.0–v0.6.4 used `-DTQ_BUILD_METAL=ON` and were therefore 14–22% slower than what users actually got. v0.6.5 republished the corrected numbers (this section reflects the corrected baseline).
The lesson: **always benchmark with the exact build flags a user gets from `cmake -B build`, not the flags in your dev environment**. A parallel `build_default/` directory built without overrides is the canonical comparison.
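One way to keep that comparison honest is to maintain the defaults-only build alongside the dev build and benchmark both. A sketch (directory names illustrative):

```shell
# Side-by-side builds: defaults-only vs dev overrides (names illustrative).
cmake -B build_default                    # exactly what a user gets: all defaults
cmake -B build_dev -DTQ_BUILD_METAL=ON    # dev configuration under test
cmake --build build_default && cmake --build build_dev
# Publish numbers only from build_default; treat build_dev as experimental.
```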
We did not pursue adding Metal kernels for turbo_kv attention because the existing Metal path needs to be fixed (or removed) first; adding more Metal kernels would compound the problem. Issue #16 in the project tracker documents the investigation plan: profile the dispatch overhead source, find a model-size threshold above which Metal wins, or remove the Metal path entirely.
### 5.5 What we learned about Karpathy-loop discipline