Commit 2537a12
PERF BREAKTHROUGH: Round 10 — NEON tbl achieves fp32 PARITY at 7× compression
The previous 9 Karpathy rounds optimized the wrong thing (small per-block
local fusions) while the real bottleneck was the scalar inner loop. Profile
data showed that at long context (PPL eval, seq_len 950) attention took
19.8 ms for turbo_kv_4b vs 15.7 ms for fp32, a 4.1 ms gap that accounts
for essentially the entire ~7% speed deficit.
Root cause: the turbo_kv inner loop was scalar (one LUT load, multiply,
and add per element) while the fp32 path ran 4-wide NEON SIMD, so roughly
2× the instructions per element. The codebook-lookup path is light on
memory bandwidth, which made it compute-bound in practice.
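For contrast, the pre-Round-10 scalar shape looked roughly like this (a sketch, not the actual source; `scalar_dot_4bit`, the fp32 `codebook`, and the packed low/high-nibble layout are assumptions for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar 4-bit attention dot product: one table load, one multiply,
 * and one add PER ELEMENT, while the fp32 path runs the same loop as
 * 4-wide NEON FMAs -- hence roughly 2x the instructions per element. */
static float scalar_dot_4bit(const uint8_t *packed, size_t n,
                             const float codebook[16],
                             float per_block_scale, const float *q_rot) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        uint8_t b   = packed[i >> 1];                      /* 2 elems/byte */
        uint8_t idx = (i & 1) ? (uint8_t)(b >> 4)
                              : (uint8_t)(b & 0x0F);       /* nibble pick  */
        acc += codebook[idx] * per_block_scale * q_rot[i]; /* LUT+mul+add  */
    }
    return acc;
}
```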
Round 10 fix: NEON 16-entry table lookup via vqtbl1q_s8.
Algorithm:
1. Quantize the 16 Lloyd-Max-Gaussian centroids to int8 once at
startup (precision loss ~1% — well below regression threshold).
2. Per-block: compute per_block_scale = (range / 127) / inv_std.
3. Inner loop processes 32 elements per iteration:
- Load 16 bytes (= 32 nibbles = 32 elements) of mse_indices
- Split low/high nibbles via vandq_u8 + vshrq_n_u8
- vqtbl1q_s8 for the centroid gather (1 instruction, 16 lanes)
- Interleave + int8→int16→fp32 conversion
- Multiply by per_block_scale
- vfmaq_f32 against q_rot
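The steps above can be sketched as follows (AArch64 NEON with a portable scalar fallback; `dequant_block32` and the argument names are illustrative, and the final vfmaq_f32 against q_rot is omitted to keep the sketch short):

```c
#include <stdint.h>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Dequantize one 32-element block of packed 4-bit indices through a
 * 16-entry int8 codebook, then scale to fp32. The NEON branch follows
 * the steps above: nibble split (vandq_u8 / vshrq_n_u8), one
 * vqtbl1q_s8 gather per half, interleave, widen int8->int16->int32->
 * fp32, multiply by per_block_scale. */
static void dequant_block32(const uint8_t packed[16],
                            const int8_t codebook[16],
                            float per_block_scale, float out[32]) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    uint8x16_t bytes = vld1q_u8(packed);
    uint8x16_t lo  = vandq_u8(bytes, vdupq_n_u8(0x0F)); /* even elements */
    uint8x16_t hi  = vshrq_n_u8(bytes, 4);              /* odd elements  */
    int8x16_t  tbl = vld1q_s8(codebook);
    int8x16_t  clo = vqtbl1q_s8(tbl, lo);               /* 16-lane gather */
    int8x16_t  chi = vqtbl1q_s8(tbl, hi);
    int8x16x2_t z  = vzipq_s8(clo, chi);                /* element order  */
    for (int half = 0; half < 2; half++) {
        int16x8_t w0 = vmovl_s8(vget_low_s8(z.val[half]));
        int16x8_t w1 = vmovl_s8(vget_high_s8(z.val[half]));
        float32x4_t f0 = vcvtq_f32_s32(vmovl_s16(vget_low_s16(w0)));
        float32x4_t f1 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(w0)));
        float32x4_t f2 = vcvtq_f32_s32(vmovl_s16(vget_low_s16(w1)));
        float32x4_t f3 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(w1)));
        vst1q_f32(out + 16 * half +  0, vmulq_n_f32(f0, per_block_scale));
        vst1q_f32(out + 16 * half +  4, vmulq_n_f32(f1, per_block_scale));
        vst1q_f32(out + 16 * half +  8, vmulq_n_f32(f2, per_block_scale));
        vst1q_f32(out + 16 * half + 12, vmulq_n_f32(f3, per_block_scale));
    }
#else
    /* Scalar fallback: identical output, one lookup per element. */
    for (int i = 0; i < 32; i++) {
        uint8_t b   = packed[i >> 1];
        uint8_t idx = (i & 1) ? (uint8_t)(b >> 4) : (uint8_t)(b & 0x0F);
        out[i] = (float)codebook[idx] * per_block_scale;
    }
#endif
}
```

vqtbl1q_s8 is A64-only, which is why the codebook must fit in 16 int8 entries: one 128-bit register holds the whole table, and the gather costs a single instruction per 16 lanes.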
Result on Llama 3.2 3B PPL eval (3 runs each, no Metal):
Type            Round 9     Round 10    Δ
--------------  ----------  ----------  ---------
fp32            17.87 t/s   18.03 t/s   +0.9%
turbo_kv_4b     16.53 t/s   18.17 t/s   +9.9%
Speed gap       -8.4%       +0.8%       PARITY ✅
Cross-model:
Model          Speed gap (R9 → R10)   PPL gap (R9 → R10)
-------------  ---------------------  -------------------
SmolLM2 135M   -14.5% → -3.1%         +5.8% → +5.7%
Llama 3.2 1B   -16.3% → -1.3%         +7.3% → +5.4%
Llama 3.2 3B    -8.4% → +0.8% ✅      +5.7% → +3.8%
PPL also IMPROVED on all three models (the int8 discretization may happen
to align favorably with the key statistics, or it may be regression to
the mean; both paths produce slightly better numbers this round).
Same value proposition but stronger:
- Compression: 7.1× (unchanged)
- PPL impact: +3.8 to +5.7% (better than R9)
- Speed vs fp32: PARITY (was -8% in R9)
The honest framing changes from "92% of fp32 speed at 7× compression"
to "AT fp32 speed at 7× compression with ~4% PPL trade-off".
35/35 tests pass. Regression tests (cosine ≥ 0.99) pass — the int8
codebook precision loss is well within bounds.
This is the answer the user was right to push for ("답은 언제나 존재한다",
"an answer always exists").
Profile-driven analysis found the actual bottleneck (scalar vs SIMD)
that 9 rounds of guessing missed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>