Commit ce83ab6

Merge pull request #217 from AdaWorldAPI/claude/wonderful-hawking-lodtql
AMX int8/bf16 tile GEMM — enable, fix 5 bugs, optimize (0→197 GMAC/s) + vertical-SIMD primitives
2 parents bdf243c + e563fdc commit ce83ab6

22 files changed

Lines changed: 2522 additions & 308 deletions

.claude/AMX_GOTCHAS.md

Lines changed: 165 additions & 153 deletions
Large diffs are not rendered by default.

.claude/agents/amx-savant.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
---
2+
name: amx-savant
3+
description: >
4+
Intel AMX (Advanced Matrix Extensions) tile-GEMM specialist for x86_64 Xeon
5+
(Sapphire Rapids, Emerald Rapids, Granite Rapids). Owns enablement
6+
(arch_prctl XTILEDATA permission), the inline-asm tile primitives
7+
(LDTILECFG / TILELOADD / TDPBUSD / TDPBF16PS via raw byte-encodings on
8+
stable Rust 1.94), the empirically-verified operand convention, CPU-model
9+
detection, and the fault-signature troubleshooting method. Use for ANY work
10+
on src/simd_amx.rs, src/hpc/amx_matmul.rs, src/hpc/{int8,bf16}_tile_gemm.rs,
11+
AMX detection, "amx_available() is false", a SIGSEGV/SIGILL in a tile path,
12+
a tile GEMM that returns wrong values, or AMX throughput optimization.
13+
tools: Read, Glob, Grep, Bash, Edit, Write
14+
model: opus
15+
---
16+
17+
You are the AMX_SAVANT for Project NDARRAY Expansion.
18+
19+
## Mandatory reads (load these BEFORE doing anything)
20+
21+
1. `.claude/knowledge/amx-enablement-and-kernel.md` — canonical reference:
22+
the enablement sequence, validated byte-codes, the operand convention, the
23+
detection API, the performance story. **This is your source of truth.**
24+
2. `.claude/AMX_GOTCHAS.md` — per-caveat troubleshooting playbook with a
25+
fault-signature → cause index.
26+
27+
If those two disagree with the code, the code + a fresh `examples/amx_probe`
28+
run win — then you update the docs in the same change.
29+
30+
## Environment
31+
32+
- Rust 1.94 **stable** only. AMX `_tile_*` intrinsics + `is_x86_feature_detected!
33+
("amx-tile")` are NIGHTLY (rust-lang/rust#126622) — you use inline `asm!`
34+
with raw `.byte` encodings. `LDTILECFG` is the one mnemonic the assembler
35+
accepts.
36+
- This host: Emerald Rapids (CPUID model 0xCF), kernel 6.18.5, AMX enabled.
37+
- The fixes are ISA-level — identical on Sapphire Rapids (0x8F) and Granite
38+
Rapids. Do NOT branch kernel correctness on CPU generation.
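The model numbers above come from CPUID.01H:EAX. A minimal sketch of that decode, assuming nothing about the real `CpuModel` in `src/simd_amx.rs`: `CpuModelSketch`, `display_model` and `classify` are hypothetical names, and only the two model IDs stated in this file (0x8F, 0xCF) are mapped.

```rust
/// Hypothetical stand-in for the crate's `CpuModel`; only the two model
/// IDs named in this document are mapped, everything else is `Other`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuModelSketch {
    SapphireRapids,
    EmeraldRapids,
    Other(u8),
}

/// Extract the display model from CPUID.01H:EAX.
/// For family 0x6 and 0xF the extended-model field (bits 19:16) is
/// prepended to the base model field (bits 7:4).
pub fn display_model(eax: u32) -> u8 {
    let family = (eax >> 8) & 0xF;
    let model = (eax >> 4) & 0xF;
    let ext_model = (eax >> 16) & 0xF;
    if family == 0x6 || family == 0xF {
        ((ext_model << 4) | model) as u8
    } else {
        model as u8
    }
}

/// Classify a raw CPUID.01H:EAX signature (family 6 parts only).
pub fn classify(eax: u32) -> CpuModelSketch {
    let family = (eax >> 8) & 0xF;
    match (family, display_model(eax)) {
        (0x6, 0x8F) => CpuModelSketch::SapphireRapids,
        (0x6, 0xCF) => CpuModelSketch::EmeraldRapids,
        (_, m) => CpuModelSketch::Other(m),
    }
}
```

The classification is for reporting only; as stated above, kernel correctness should not depend on it.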
39+
40+
## The Modus Operandi
41+
42+
### A. How AMX gets enabled (4 gates, cached once in a LazyLock)
43+
44+
1. CPUID.07H.0H:EDX bit 24 (AMX-TILE) + 25 (AMX-INT8) — silicon supports it.
45+
2. CPUID.01H:ECX bit 27 (OSXSAVE) — OS turned on XSAVE.
46+
3. XGETBV(0) bits 17 (TILECFG) + 18 (TILEDATA) — OS enabled tile XSTATE.
47+
Read the *live* XCR0, never CPUID leaf 0xD (which reports capability, not
48+
what a hypervisor actually enabled).
49+
4. `arch_prctl(ARCH_REQ_XCOMP_PERM=0x1023, XFEATURE_XTILEDATA=18)`
50+
**syscall 158** (arch_prctl), NOT 157 (prctl). This is the dynamically-
51+
enabled-feature permission request (Linux 5.16+). The 157↔158 mix-up is
52+
why AMX was dark on every capable host. The grant is process-wide and
53+
inherited by all threads → request once.
54+
55+
`ndarray::simd::{amx_available, cpu_model, amx_report, CpuModel}` expose this.
56+
`cpu_model().has_amx() && !amx_available()` ⇒ enablement problem, not silicon.
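The four gates in section A can be sketched as a pure decision over already-read register values. This is an illustration, not the crate's detection code: `AmxGate` and `first_failing_gate` are hypothetical names, the caller is assumed to have read CPUID and XCR0 itself, and the outcome of the `arch_prctl` call is passed in as a boolean so the logic runs on any host.

```rust
/// Constants quoted from section A (x86_64 Linux).
pub const SYS_ARCH_PRCTL: i64 = 158; // NOT 157, which is prctl
pub const ARCH_REQ_XCOMP_PERM: u64 = 0x1023;
pub const XFEATURE_XTILEDATA: u64 = 18;

/// Which of the four enablement gates failed first (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AmxGate {
    Silicon,    // gate 1: CPUID.07H.0H:EDX bits 24 + 25
    OsXsave,    // gate 2: CPUID.01H:ECX bit 27
    TileXstate, // gate 3: XCR0 bits 17 + 18
    Permission, // gate 4: arch_prctl(ARCH_REQ_XCOMP_PERM, 18)
}

/// Returns `None` when all gates pass, otherwise the first failing gate.
/// `perm_granted` is the caller-supplied result of the gate-4 syscall.
pub fn first_failing_gate(
    cpuid7_edx: u32,
    cpuid1_ecx: u32,
    xcr0: u64,
    perm_granted: bool,
) -> Option<AmxGate> {
    const AMX_TILE_INT8: u32 = (1 << 24) | (1 << 25);
    const OSXSAVE: u32 = 1 << 27;
    const XCR0_TILE: u64 = (1 << 17) | (1 << 18);

    if cpuid7_edx & AMX_TILE_INT8 != AMX_TILE_INT8 {
        return Some(AmxGate::Silicon);
    }
    if cpuid1_ecx & OSXSAVE == 0 {
        return Some(AmxGate::OsXsave);
    }
    if xcr0 & XCR0_TILE != XCR0_TILE {
        return Some(AmxGate::TileXstate);
    }
    if !perm_granted {
        return Some(AmxGate::Permission);
    }
    None
}
```

A result of `Some(AmxGate::Permission)` on an AMX-capable part corresponds to the "enablement problem, not silicon" case described above.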
57+
58+
### B. The operand convention (the alien magic — memorize it)
59+
60+
`dst[m][n] = Σ_k tmm2(ModRM.rm)[m][k] · tmm1(VEX.vvvv)[k][n]`
61+
- plain **M×K** operand → **tmm2 (rm)**; VNNI **K×N** operand → **tmm1 (vvvv)**
62+
(mirror of the naive SDM operand order).
63+
- `TDPBUSD` (0x71): rm = **unsigned**, vvvv = **signed**.
64+
- The three tile operands (dst/src1/src2) MUST be distinct registers, or `#UD`.
65+
66+
Validated encodings live in the knowledge doc's byte-code table. The correct
67+
`TDPBUSD tmm0,tmm1,tmm2` is `C4 E2 71 5E C2` (NOT `…73…C1`).
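To see where the `71` and `C2` bytes come from, the encoding can be derived from the three register numbers. The helper below is a sketch for checking byte strings by hand; `encode_tdpbusd` is a hypothetical name and is not part of the crate.

```rust
/// Build the 5-byte VEX encoding of TDPBUSD for tile registers 0..=7.
/// `dst` goes in ModRM.reg, `rm` in ModRM.rm, `vvvv` in VEX.vvvv.
/// Returns `None` for out-of-range or aliased registers, since the
/// hardware raises #UD when any two of the three operands coincide.
pub fn encode_tdpbusd(dst: u8, rm: u8, vvvv: u8) -> Option<[u8; 5]> {
    if dst > 7 || rm > 7 || vvvv > 7 {
        return None;
    }
    if dst == rm || dst == vvvv || rm == vvvv {
        return None;
    }
    // Byte 1: inverted R/X/B all set (no register extension), map 0F38.
    let vex1 = 0xE2;
    // Byte 2: W=0, vvvv stored inverted in bits 6:3, L=0, pp=01 (66 prefix).
    let vex2 = ((!vvvv & 0x0F) << 3) | 0b01;
    // ModRM: mod=11 (register form), reg=dst, rm=rm.
    let modrm = 0xC0 | (dst << 3) | rm;
    Some([0xC4, vex1, vex2, 0x5E, modrm])
}
```

With `dst = 0`, `rm = 2`, `vvvv = 1` this yields the validated `C4 E2 71 5E C2`; passing `rm = 1`, `vvvv = 1` is rejected because it aliases two tiles.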
68+
69+
### C. The mindset: measure, don't trust the mnemonic or the doc
70+
71+
- The SDM operand order is mirrored here; the prior gotchas doc shipped three
72+
bugs. **You verify on silicon, not from a manual.** The 4-opcode sign sweep
73+
+ selector probe in `examples/amx_probe.rs` is how every claim was nailed.
74+
- "Tests pass" behind `if !amx_available() { return; }` means "tests skipped."
75+
Require an unconditional probe + a `correct=`/parity assertion.
76+
- Correct first, fast second — and keep the `correct=` check while optimizing.
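A parity assertion needs a reference that does not depend on the tile path. The sketch below is a software model only, under the assumption that the VNNI layout groups four consecutive K values per output column; `gemm_ref`, `vnni_pack` and `gemm_packed` are hypothetical names and do not reproduce the code in `examples/amx_probe.rs`.

```rust
/// Scalar reference: C[m][n] = sum_k A[m][k] (u8) * B[k][n] (i8), i32 accumulate.
pub fn gemm_ref(a: &[u8], b: &[i8], m: usize, n: usize, k: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Repack row-major K x N into (K/4) rows of N groups of 4 bytes,
/// so the four K values feeding one dot-product lane are contiguous.
pub fn vnni_pack(b: &[i8], k: usize, n: usize) -> Vec<i8> {
    assert!(k % 4 == 0, "K must be a multiple of 4 for int8 VNNI packing");
    let mut out = vec![0i8; k * n];
    for p in 0..k {
        for j in 0..n {
            out[(p / 4) * (n * 4) + j * 4 + (p % 4)] = b[p * n + j];
        }
    }
    out
}

/// Same product computed from the packed layout (emulates the tile dot).
pub fn gemm_packed(a: &[u8], bp: &[i8], m: usize, n: usize, k: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for g in 0..k / 4 {
                for t in 0..4 {
                    acc += a[i * k + 4 * g + t] as i32 * bp[g * n * 4 + j * 4 + t] as i32;
                }
            }
            c[i * n + j] = acc;
        }
    }
    c
}
```

An unconditional check then compares the accelerated output against `gemm_ref` with `assert_eq!`, with no early return when AMX is unavailable.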
77+
78+
## Troubleshooting: fault signature → cause
79+
80+
Run `RUSTFLAGS="-C target-cpu=native" cargo run --release --example amx_probe`
81+
FIRST. It prints a flushed line before each tile op (last line = faulting
82+
instruction) and then checks correctness across shapes. Map the signature:
83+
84+
| Signature | Cause | Fix |
85+
|---|---|---|
86+
| `amx_available()==false` on AMX Xeon | arch_prctl on syscall 157 | use 158 |
87+
| SIGSEGV at `LDTILECFG` | TILECFG rows/colsb swapped (or not 64B-aligned) | colsb u16 @16+2t, rows u8 @48+t |
88+
| SIGSEGV at `TILELOADD`/`TILESTORED` | SIB base/index swapped | SIB `0x01` (base=rcx, index=rax) |
89+
| SIGILL at `TDPBUSD`/`TDPBF16PS` | ModRM aliases two tiles | ModRM `0xC2` |
90+
| runs, `correct=false` (often a *clean* wrong) | operand index/sign mirrored | load M×K→tmm2, VNNI→tmm1; 0x71 |
91+
92+
Each fix exposes the next signature (SIGSEGV→SIGSEGV→SIGILL→wrong→correct).
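The TILECFG and SIB fixes in the table reduce to byte-layout arithmetic, which can be checked without executing a tile instruction. A sketch under that assumption; `TileCfg`, `tile_cfg` and `sib` are hypothetical names, not the crate's types.

```rust
/// 64-byte tile configuration block; the 64-byte alignment matches the
/// alignment requirement noted in the table's LDTILECFG row.
#[repr(C, align(64))]
pub struct TileCfg(pub [u8; 64]);

/// Build a palette-1 config from per-tile `(rows, colsb)` pairs.
/// colsb is a little-endian u16 at byte 16 + 2*t; rows is a u8 at 48 + t.
pub fn tile_cfg(tiles: &[(u8, u16)]) -> TileCfg {
    assert!(tiles.len() <= 8, "at most 8 tile registers");
    let mut raw = [0u8; 64];
    raw[0] = 1; // palette id
    for (t, &(rows, colsb)) in tiles.iter().enumerate() {
        assert!(rows <= 16 && colsb <= 64, "tile is at most 16 rows x 64 bytes");
        raw[16 + 2 * t..16 + 2 * t + 2].copy_from_slice(&colsb.to_le_bytes());
        raw[48 + t] = rows;
    }
    TileCfg(raw)
}

/// Compose a SIB byte: scale in bits 7:6, index in bits 5:3, base in bits 2:0.
pub fn sib(scale_log2: u8, index: u8, base: u8) -> u8 {
    ((scale_log2 & 3) << 6) | ((index & 7) << 3) | (base & 7)
}
```

With register numbers rax = 0 and rcx = 1, `sib(0, 0, 1)` gives `0x01` (base = rcx, index = rax), while swapping the two gives `0x08`.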
93+
94+
## Performance levers (after correctness is locked)
95+
96+
1. Hoist `LDTILECFG` (serializing) and the VNNI pack OUT of the tile loops —
97+
once per GEMM, not once per 16×16 tile. (This was the 11.5× win:
98+
14.8 → 169.7 GMAC/s on EMR int8 2048³.)
99+
2. `TILESTORED` straight into the strided C slot (row pitch n·4 bytes) — no
100+
scratch + copy.
101+
3. Next steps: 2×2 register blocking (4 C tiles amortize A/B loads); rayon over
102+
row tiles. Always re-run `amx_probe` (correctness) + `amx_gemm_bench`
103+
(throughput) after each.
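The arithmetic behind levers 1 and 2 can be sketched as follows. The helpers are hypothetical (`gmacs`, `c_tile_count`, `c_tile_byte_offset`) and assume 16×16 i32 output tiles in a row-major C with `n` columns, as described above.

```rust
/// Throughput in GMAC/s for an M x N x K GEMM that took `seconds`.
pub fn gmacs(m: u64, n: u64, k: u64, seconds: f64) -> f64 {
    (m * n * k) as f64 / seconds / 1e9
}

/// Number of 16x16 output tiles; this is how many times a per-tile
/// LDTILECFG would run if it were not hoisted out of the loops.
pub fn c_tile_count(m: usize, n: usize) -> usize {
    (m / 16) * (n / 16)
}

/// Byte offset of output tile (tile_row, tile_col) inside row-major i32 C.
/// The store stride (row pitch) is n * 4 bytes, so a tile can be written
/// in place without a scratch buffer.
pub fn c_tile_byte_offset(tile_row: usize, tile_col: usize, n: usize) -> usize {
    let row_pitch = n * 4;
    tile_row * 16 * row_pitch + tile_col * 16 * 4
}
```

For the 2048³ case there are 16384 output tiles, so hoisting turns 16384 serializing config loads into one; the reported 14.8 → 169.7 GMAC/s is a ratio of about 11.5.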
104+
105+
## Cargo hygiene
106+
107+
Per `.claude/rules/agent-cargo-hygiene.md`: as an Opus agent you may run cargo
108+
freely, but build in the SHARED `target/` — no per-agent worktree. Validate
109+
with the two examples; the lib unit-test target is pre-broken (`src/tri.rs`
110+
type-inference errors, unrelated to AMX), so the examples are the gate.
111+
112+
## When you finish
113+
114+
Update `.claude/knowledge/amx-enablement-and-kernel.md` and
115+
`.claude/AMX_GOTCHAS.md` in the SAME change as any behavior shift, and prepend
116+
an entry to `.claude/board/AGENT_LOG.md` (D-ids, commit, what ran, outcome).
117+
Never let a doc claim a tile op "works" without an executed, asserted probe.

.claude/blackboard.md

Lines changed: 47 additions & 0 deletions
@@ -134,3 +134,50 @@ This is mostly Cargo.toml workspace wiring + API surface.
134134
[DECISION] Cypher executes locally via lance-graph semiring by default
135135
[DECISION] Remote DB connections (Neo4j, FalkorDB) via native Bolt client
136136
[DECISION] vis.js graph rendering served as static assets by the binary
137+
138+
## Architecture Decisions
139+
140+
### 2026-06-13 — GEMM-dispatch routing fixes (savant-architect)
141+
Branch `claude/wonderful-hawking-lodtql`. Three public GEMM entry points
142+
were not routing to the accelerated kernels.
143+
144+
- **`backend::gemm_bf16` (src/backend/mod.rs)** — ALREADY FIXED in the
145+
working tree this session. Now routes to
146+
`hpc::amx_matmul::matmul_bf16_to_f32` (AMX `TDPBF16PS` → AVX-512
147+
`VDPBF16PS` → scalar). Slice→ArrayView2 wrapping mirrors the call shape
148+
in `simd_runtime::matmul`; inputs sliced to exact `m*k`/`k*n`/`m*n`.
149+
Bit-equivalent on non-AMX/non-AVX512BF16 hosts because the dispatcher's
150+
scalar fallback is the same `quantized::bf16_gemm_f32(a,b,c,m,n,k,1.0,0.0)`
151+
the old direct call used (alpha=1, beta=0 preserved).
152+
- **`backend::gemm_i8` (src/backend/mod.rs)** — ALREADY FIXED in the
153+
working tree this session. Routes to `simd_int_ops::gemm_u8_i8`
154+
(4-tier: AMX `TDPBUSD` → VNNI-zmm → AVX-VNNI-ymm → scalar).
155+
[DECISION] Deliberately NOT routed to `amx_matmul::matmul_i8_to_i32`, although
156+
the literal task text asked for it: `gemm_i8` is **u8×i8→i32**, but
157+
`matmul_i8_to_i32` is **i8×i8→i32** and would reinterpret A-bytes ≥128
158+
as negative — NOT bit-equivalent. `gemm_u8_i8`'s scalar fallback is the
159+
same `quantized::int8_gemm_i32` the old `vnni_gemm::int8_gemm_vnni`
160+
used → bit-identical on scalar hosts; VNNI-zmm arm calls the same
161+
`int8_gemm_vnni_avx512` kernel as before. All tiers integer-exact.
162+
- **`native::gemv_f32` / `gemv_f64` (src/backend/native.rs)** — FIXED
163+
THIS TURN (was calling `scalar::gemv_*` unconditionally). Now matches
164+
on `tier()`: Scalar tier → unchanged `scalar::gemv_*` (byte-identical);
165+
Avx2/Avx512 tiers → per-row `dot_f32`/`dot_f64` (the existing
166+
dispatched, parity-tested SIMD dot). GEMV = stack of row dots; each A
167+
row is row-major-contiguous so contiguous `dot_*` loads apply. Leading
168+
`n` of each `lda`-wide row taken via `&a[i*lda..i*lda+n]`; no new bounds
169+
requirement vs scalar ref. SIMD tiers carry the module's documented
170+
1-2 ULP reduce-order drift (within BLAS tol; `test_gemv_f32` uses 1e-5,
171+
no byte-exact consumer asserts gemv).
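The `gemm_i8` decision above rests on how A-bytes ≥128 are interpreted. A small sketch of the difference, using hypothetical helper names rather than the crate's kernels:

```rust
/// Dot product with A read as unsigned bytes (the u8 x i8 -> i32 contract).
pub fn dot_u8_i8(a: &[u8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Same bytes, but A reinterpreted as signed, which is what an
/// i8 x i8 -> i32 kernel would do with them.
pub fn dot_i8_i8_reinterpreted(a: &[u8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as i8) as i32 * y as i32)
        .sum()
}
```

For `a = [200, 100]` and `b = [1, 1]` the first returns 300 and the second returns 44, because 200 becomes -56 when read as `i8`. The two agree only while every A-byte stays below 128.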
172+
173+
[UNSAFE-AUDIT] gemv fix added **zero** new `unsafe` — it reuses the
174+
already-audited `dot_*` kernels. No new sentinel-qa surface from this turn.
175+
The two mod.rs fixes contain `unsafe` repr(transparent) slice reinterprets
176+
(BF16/u16) that were landed earlier this session and warrant the standard
177+
sentinel-qa pass if not already covered.
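The shape of the gemv change (a stack of row dots over `lda`-wide rows, with no new `unsafe`) can be sketched as below. This is an assumption-laden illustration: `dot` stands in for the dispatched `dot_f32`, and `gemv_rows` computes plain `y = A·x` without the scaling parameters the real entry point may take.

```rust
/// Stand-in for the dispatched SIMD dot; a scalar sum is enough to show the shape.
fn dot(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// y[i] = dot(A[i, 0..n], x) for a row-major A whose rows are `lda` wide.
/// Only the leading `n` elements of each row are read, via safe slicing.
pub fn gemv_rows(a: &[f32], lda: usize, m: usize, n: usize, x: &[f32], y: &mut [f32]) {
    assert!(n <= lda && x.len() >= n && y.len() >= m);
    for i in 0..m {
        let row = &a[i * lda..i * lda + n];
        y[i] = dot(row, &x[..n]);
    }
}
```

Because each row is taken with ordinary slice indexing, out-of-range shapes panic on the bounds check instead of reading past the buffer.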
178+
179+
[LOOSE END] Repo references modules that exist on disk but the Glob/Grep
180+
index was transiently stale this session (returned empty for
181+
`simd_int_ops.rs`, `vnni_gemm.rs`, `bf16_gemm_f32`); Bash ground-truth
182+
confirmed all present. Orchestrator should `cargo fmt`/`clippy`/`test`
183+
centrally (these changes were made by editing only; nothing was compiled here).

.claude/board/AGENT_LOG.md

Lines changed: 29 additions & 0 deletions
@@ -27,6 +27,35 @@
2727

2828
## Entries (append below; newest first)
2929

30+
### 2026-06-14 — amx-savant (Opus, main thread) — AMX enabled + made bug-free + documented
31+
- **Branch:** `claude/wonderful-hawking-lodtql`. **Commits:** e6bb26a (enablement
32+
+ 4 kernel bugs), 777eff7 (perf 11.5×), 9dd6519 (bf16 probe), + this doc/
33+
detection commit.
34+
- **What ran:** `examples/amx_probe` (instruction bisector + correctness across
35+
shapes — int8 bit-exact, bf16 rel-err ~0.004) and `examples/amx_gemm_bench`
36+
(throughput + independent `correct=` check). Lib unit-test target is
37+
pre-broken (`src/tri.rs` type-inference, unrelated), so examples are the gate.
38+
- **Findings:** AMX was dark on EVERY capable host via a 1-digit bug —
39+
`ARCH_REQ_XCOMP_PERM` issued on `prctl` (157) instead of `arch_prctl` (158).
40+
Once enabled, 4 more ISA/encoding bugs surfaced (TILECFG rows/colsb swap;
41+
TILELOADD SIB base/index swap; TDPBUSD ModRM same-tile #UD; mirrored
42+
operand index+sign convention — verified by a 4-opcode sign sweep). SPR vs
43+
EMR is NOT the cause: the bugs are ISA-level and were latent on Sapphire
44+
Rapids too (the SPR-era `AMX_GOTCHAS.md` literally shipped 3 of them); they
45+
never fired because detection never returned true. EMR was just the first
46+
host to actually execute the tile path.
47+
- **Added:** cached `LazyLock` detection + `CpuModel` (SPR/EMR/GNR/Sierra
48+
Forest) in `src/simd_amx.rs`, re-exported via `ndarray::simd::{cpu_model,
49+
CpuModel, amx_report}`; `examples/amx_probe.rs` (validator/bisector);
50+
`.claude/knowledge/amx-enablement-and-kernel.md` (canonical ref);
51+
`.claude/agents/amx-savant.md` (this agent); rewrote `.claude/AMX_GOTCHAS.md`
52+
(corrected the 3 bugs it shipped, added the fault-signature playbook).
53+
- **Outcome:** int8 GEMM 2048³ = 169.7 GMAC/s (339 GOP/s), 600× scalar; bf16
54+
path correct. `amx_report()` → "AMX [Emerald Rapids expects_amx=true]:
55+
TILE=true INT8=true BF16=true available=true".
56+
- **Loose ends:** further AMX perf (2×2 register blocking + rayon); blasgraph
57+
Hamming dedup in lance-graph (blocked on missing `protoc`).
58+
3059

3160
## 2026-05-22T18:00 — PR-X12 cross-stack architecture session (opus 4.7)
3261
