
Commit 4490c83

unamedkr and claude committed
HONEST: NEON-optimize fp32 KV attention path too (was scalar)
Validation revealed that the previous v0.6.3 "turbo_kv beats fp32 KV speed" claim was an artifact: the fp32 attention path used a pure scalar inner loop while the quant path used NEON. After adding NEON to the fp32 path (Llama 3.2 3B PPL eval, 3 runs each):

Type            Before (scalar fp32)   After (NEON fp32)   vs FP32
--------------  ---------------------  ------------------  --------
fp32            12.6 tok/s             14.8 tok/s          baseline
turbo_kv_4b     13.7 tok/s             13.7 tok/s          -7.4%
turbo_kv_5b     13.2 tok/s             13.2 tok/s          -10.8%
turbo_kv_3b     13.4 tok/s             13.4 tok/s          -9.5%

The Round 5 optimization (transformer → traits->attention) is still a real ~2× speedup of the quant path (6.9 → 13.7 tok/s), and the speed gap to fp32 KV is closed from -45% to -7%. But the headline is no longer "beats fp32"; it is "within 8% of fp32 with 7× compression".

This is what the validation step is for: better to discover and fix the unfair comparison BEFORE publishing.

35/35 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c58d4d7 commit 4490c83

File tree

1 file changed: +15, -1 lines changed


src/engine/tq_transformer.c

Lines changed: 15 additions & 1 deletion
@@ -1719,7 +1719,7 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
             }
         }
     } else {
-        /* FP32 attention scores (no quantization) */
+        /* FP32 attention scores (no quantization) — NEON-optimized */
         float inv_scale = 1.0f / sqrtf(attn_scale_dim);
         /* Set positions outside sliding window to -inf */
         for (int t = 0; t < attn_start; t++) {
@@ -1728,9 +1728,23 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         for (int t = attn_start; t < seq_len; t++) {
             const float* kt = key_cache_layer + (size_t)t * cache_kv_dim + kv_h * head_dim;
             float score = 0.0f;
+#ifdef __ARM_NEON
+            float32x4_t vsum = vdupq_n_f32(0.0f);
+            int d = 0;
+            for (; d + 4 <= head_dim; d += 4) {
+                float32x4_t vq = vld1q_f32(qh + d);
+                float32x4_t vk = vld1q_f32(kt + d);
+                vsum = vfmaq_f32(vsum, vq, vk);
+            }
+            score = vaddvq_f32(vsum);
+            for (; d < head_dim; d++) {
+                score += qh[d] * kt[d];
+            }
+#else
             for (int d = 0; d < head_dim; d++) {
                 score += qh[d] * kt[d];
             }
+#endif
             atth[t] = score * inv_scale;
         }
     }

0 commit comments
