Commit 4490c83
HONEST: NEON-optimize fp32 KV attention path too (was scalar)
Validation revealed the previous v0.6.3 'turbo_kv beats fp32 KV speed'
claim was an artifact: the fp32 attention path used a pure scalar
inner loop while the quant path used NEON. After adding NEON to the
fp32 path:
Llama 3.2 3B PPL eval, 3 runs each:
Type           Before (scalar fp32)   After (NEON fp32)   vs FP32
-------------  ---------------------  ------------------  --------
fp32           12.6 tok/s             14.8 tok/s          baseline
turbo_kv_4b    13.7 tok/s             13.7 tok/s          -7.4%
turbo_kv_5b    13.2 tok/s             13.2 tok/s          -10.8%
turbo_kv_3b    13.4 tok/s             13.4 tok/s          -9.5%
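The shape of the fix is the standard one: replace the fp32 attention inner loop's scalar dot product with 4-wide NEON fused multiply-adds plus a scalar tail. A minimal sketch, assuming a hypothetical function name and signature (not the actual symbols in this diff), guarded so non-ARM builds keep the old scalar loop:

```c
#include <stddef.h>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Illustrative stand-in for the fp32 attention inner-loop dot product. */
static float attn_dot_f32(const float *q, const float *k, size_t n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* NEON path: accumulate 4 lanes per iteration with fused multiply-add. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(q + i), vld1q_f32(k + i));
    float sum = vaddvq_f32(acc);   /* horizontal add of the 4 lanes */
    for (; i < n; i++)             /* scalar tail for n % 4 leftovers */
        sum += q[i] * k[i];
    return sum;
#else
    /* Portable scalar fallback: what the fp32 path did before this commit. */
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += q[i] * k[i];
    return sum;
#endif
}
```

This mirrors what the quant path was already doing, which is why the earlier comparison was apples-to-oranges.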
The Round 5 optimization (transformer → traits->attention) is still a
real ~2× speedup of the quant path (6.9 → 13.7 tok/s), and the speed
gap to fp32 KV is closed from -45% to -7%. But the headline is no
longer 'beats fp32' — it's 'within 8% of fp32 with 7× compression'.
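For clarity, the "vs FP32" column and both gap figures use the same relative-speed formula (a quick sanity check, not project code):

```c
/* Relative decode-speed gap in percent; negative means slower than baseline. */
static double gap_pct(double candidate_toks, double baseline_toks) {
    return (candidate_toks / baseline_toks - 1.0) * 100.0;
}
```

`gap_pct(13.7, 14.8)` reproduces the -7.4% in the table against the NEON fp32 baseline; `gap_pct(6.9, 12.6)` reproduces the old roughly -45% gap against the scalar fp32 baseline.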
This is what the validation step is for. Better to discover and fix
the unfair comparison BEFORE publishing.
35/35 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent c58d4d7 · commit 4490c83
1 file changed (+15, -1)