
Commit 4490c83

unamedkr and claude committed
HONEST: NEON-optimize fp32 KV attention path too (was scalar)
Validation revealed that the previous v0.6.3 "turbo_kv beats fp32 KV speed" claim was an artifact: the fp32 attention path used a pure scalar inner loop while the quant path used NEON. After adding NEON to the fp32 path (Llama 3.2 3B PPL eval, 3 runs each):

Type            Before (scalar fp32)   After (NEON fp32)   vs FP32
--------------  ---------------------  ------------------  --------
fp32            12.6 tok/s             14.8 tok/s          baseline
turbo_kv_4b     13.7 tok/s             13.7 tok/s          -7.4%
turbo_kv_5b     13.2 tok/s             13.2 tok/s          -10.8%
turbo_kv_3b     13.4 tok/s             13.4 tok/s          -9.5%

The Round 5 optimization (transformer → traits->attention) is still a real ~2× speedup of the quant path (6.9 → 13.7 tok/s), and the speed gap to fp32 KV is closed from -45% to -7%. But the headline is no longer "beats fp32"; it is "within 8% of fp32 with 7× compression".

This is what the validation step is for: better to discover and fix the unfair comparison BEFORE publishing.

35/35 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c58d4d7 commit 4490c83

File tree

1 file changed: +15, -1 lines changed


src/engine/tq_transformer.c

Lines changed: 15 additions & 1 deletion
@@ -1719,7 +1719,7 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
             }
         }
     } else {
-        /* FP32 attention scores (no quantization) */
+        /* FP32 attention scores (no quantization) — NEON-optimized */
         float inv_scale = 1.0f / sqrtf(attn_scale_dim);
         /* Set positions outside sliding window to -inf */
         for (int t = 0; t < attn_start; t++) {
@@ -1728,9 +1728,23 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         for (int t = attn_start; t < seq_len; t++) {
             const float* kt = key_cache_layer + (size_t)t * cache_kv_dim + kv_h * head_dim;
             float score = 0.0f;
+#ifdef __ARM_NEON
+            float32x4_t vsum = vdupq_n_f32(0.0f);
+            int d = 0;
+            for (; d + 4 <= head_dim; d += 4) {
+                float32x4_t vq = vld1q_f32(qh + d);
+                float32x4_t vk = vld1q_f32(kt + d);
+                vsum = vfmaq_f32(vsum, vq, vk);
+            }
+            score = vaddvq_f32(vsum);
+            for (; d < head_dim; d++) {
+                score += qh[d] * kt[d];
+            }
+#else
             for (int d = 0; d < head_dim; d++) {
                 score += qh[d] * kt[d];
             }
+#endif
             atth[t] = score * inv_scale;
         }
     }

0 commit comments
