Commit 42da9ad

docs(board): append round-2 fleet entries (12 agents + backfills)
Documents the 12-agent CCA2A round-2 fleet that delivered the actual Bevy plugin (AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK).

Agent breakdown:

Code-producing (6):
  #1 plugin-core: bevy/examples/ndarray_graph_plugin.rs (274 lines)
  #2 plugin-palette: bevy/examples/ndarray_graph_palette.rs (100 lines)
  #3 plugin-ci: bevy/.github/workflows/ndarray-smoke.yml
  #4 plugin-readme: bevy/examples/README_NDARRAY_PLUGIN.md
  #5 plugin-tests: bevy/examples/ndarray_graph_plugin_tests.rs (308 lines)
  #6 simd-caps-amx: THIS REPO (commits e64daa6 + c66a878 above)

Audit (6, all read-only):
  #7 audit-frustum (still running at time of fleet wrap)
  #8 audit-skin: NOT-WORTH (GPU-side WGSL; CPU stages 14us, GPU floor 0.5-2ms)
  #9 audit-mesh: setup-once paths only (asset-import speed, not frame-time)
  #10 audit-color: 0/10 candidates worth converting (atmosphere/SSAO GPU-only)
  #11 audit-cosmetic: 8 confirmed cosmetic SIMD wrappers; U8x32 keystone gap
  #12 audit-amx-routing: 7/8 sites foldable to simd_caps; 1 prctl per-thread hazard

Patterns observed:
  - Bevy upstream paths (skin/atmosphere/light_probe) are GPU-offloaded on GPU-equipped hosts; the plugin we built is a CPU-only path that works identically on GPU-less serverless (Railway / HuggingFace / Cloudflare)
  - AMX prctl is per-thread on Linux; a future rayon+AMX path needs an init-each-worker shim (NOT a current bug; integrate_simd_par doesn't touch AMX)
  - The cosmetic-SIMD sweep depends on completing the U8x32 polyfill in simd_avx2.rs (currently absent), which is the real keystone work

Companion: AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK ca4a973 (the actual Bevy plugin shipped in parallel with these ndarray fields).
1 parent c66a878 commit 42da9ad

1 file changed

Lines changed: 207 additions & 0 deletions

.claude/board/AGENT_LOG.md

@@ -971,3 +971,210 @@ shim.

**Tests:** 4 new tests (plus existing 4 updated), all 8 pass, 0 warnings.
**Test command:** `rustup run 1.94.1 cargo test --features rayon --lib hpc::simd_caps`

## 2026-05-13T10:00 — agent #11 audit-cosmetic (sonnet)

**Files:** `src/hpc/byte_scan.rs`, `src/hpc/palette_codec.rs`, `src/hpc/aabb.rs`
**Verdict:** All three files confirmed COSMETIC-SIMD (with one PARTIAL-REAL exception). No file is clean.

---

### Cosmetic-SIMD Enumeration Table

| File | Line | Function | `#[target_feature]` | `_mm*` intrinsics? | Body has polyfill calls? | Classification |
|------|------|----------|---------------------|--------------------|--------------------------|----------------|
| `byte_scan.rs` | 22 | `byte_find_all_avx2` | `avx2` | NO | NO: scalar `haystack[i+j] == needle` loops | COSMETIC |
| `byte_scan.rs` | 86 | `byte_count_avx2` | `avx2` | NO | NO: scalar `haystack[i+j] == needle` loops | COSMETIC |
| `byte_scan.rs` | 52 | `byte_find_all_avx512` | `avx512bw` | NO (`_mm*` absent) | YES: uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask` | REAL (polyfill-backed) |
| `byte_scan.rs` | 115 | `byte_count_avx512` | `avx512bw` | NO (`_mm*` absent) | YES: uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask`, `.count_ones()` | REAL (polyfill-backed) |
| `palette_codec.rs` | 303 | `unpack_generic_avx512` | `avx512f` | NO | NO: scalar nested loop (`word >> bit_offset & mask_val`) | COSMETIC |
| `palette_codec.rs` | 335 | `pack_generic_avx512` | `avx512f` | NO | NO: scalar for loop, verbatim copy of `pack_indices` | COSMETIC |
| `palette_codec.rs` | 353 | `unpack_4bit_avx2` | `avx2` | NO | NO: scalar nibble-split loop over `bytes[i..i+32]` | COSMETIC |
| `palette_codec.rs` | 501 | `bedrock_reorder_xzy_avx512` | `avx512f` | NO | NO: scalar triple-nested loop with `get_unchecked` | COSMETIC |
| `aabb.rs` | 241 | `aabb_intersect_batch_sse41` | `sse4.1` | NO | NO: scalar per-candidate `if` chain, identical to `aabb_intersect_batch_scalar` | COSMETIC |
| `aabb.rs` | 174 | `aabb_intersect_batch_avx512` | `avx512f` | NO | YES: uses `F32x16::from_array`, `F32x16::splat`, `F32x16::simd_le`, `F32x16::simd_ge`, `F32Mask16.0 &` | REAL (polyfill-backed) |
| `aabb.rs` | 329 | `ray_aabb_slab_test_avx512` | `avx512f` | NO | YES: uses `F32x16::splat`, arithmetic ops, `simd_min`, `simd_max`, `simd_le`, `simd_ge`, `to_array` | REAL (polyfill-backed) |
| `aabb.rs` | 464 | `aabb_expand_batch_sse2` | `sse2` | NO | NO: scalar per-AABB field update, identical to `aabb_expand_batch_scalar` | COSMETIC |

**Summary: 8 COSMETIC, 4 REAL (polyfill-backed, no raw `_mm*`).** The 8 are counted by source inspection; one of them, `aabb_expand_batch_sse2`, is reclassified REAL-AUTOVEC by the assembly check.

---

### AUTOVEC CHECK (empirical, via `rustc 1.94.1 --emit asm`)

Built a minimal replica of each cosmetic function with `#[no_mangle] extern "C"` to prevent dead-code elimination. Assembly analyzed for `ymm*`/`zmm*`/`xmm*`/`vp*`/`vcmp*` instructions:

**`byte_find_all_avx2` (avx2 hint, scalar 32-byte loop):**
Assembly: pure scalar integer ops (`cmpb`, `jne`, `movb`, `incq`). Zero YMM/XMM registers. LLVM did NOT autovectorize the append-to-Vec loop. **COSMETIC: not autovec'd.**

**`aabb_intersect_batch_sse41` (sse4.1 hint, scalar per-candidate chain):**
Assembly: `movss`/`ucomiss`/`jb`/`setae`, i.e. scalar FP comparisons and branches. Zero packed SSE4.1 instructions (`blendvps`, `cmpps` absent). **COSMETIC: not autovec'd.**

**`pack_generic_avx512` (avx512f hint, scalar bit-packing loop):**
Assembly: contains `vmovups %zmm0` for the memset/zeroing prelude (LLVM auto-vectorized the zero-init with AVX-512 stores), but the main bit-packing loop is scalar shift+OR. The `%zmm0` instruction comes from the `vec![0u64; n_words]` zero-fill, not the index-packing loop body. **Zeroing autovec'd; bit-pack loop COSMETIC.**

**`aabb_expand_batch_sse2` (sse2 hint, scalar per-AABB update):**
Assembly: uses `movups`/`subps`/`addps`/`shufps` on `%xmm` registers. LLVM vectorized the 6-float struct update into XMM-register arithmetic. The SSE2 feature hint is what permits `addps`/`subps` on targets whose baseline lacks SSE2; on `x86_64`, SSE2 is already part of the baseline, so the attribute is redundant there. **Mark as REAL-AUTOVEC.**

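A replica of the kind described above might look like the following sketch. The function names and signatures are hypothetical (the log does not show the real ones), and the sketch keeps the functions `pub` rather than using `#[no_mangle] extern "C"` as the audit did. Compiling it as a library with `rustc --emit asm -C opt-level=3 --crate-type lib` and searching the output for `ymm` registers is one way to repeat the check.

```rust
/// Plain scalar reference: positions of `needle` in `haystack`.
pub fn byte_find_all_scalar(haystack: &[u8], needle: u8) -> Vec<usize> {
    let mut out = Vec::new();
    for (i, &b) in haystack.iter().enumerate() {
        if b == needle {
            out.push(i);
        }
    }
    out
}

/// Replica of the audited pattern: an AVX2 attribute on a body that
/// contains no intrinsics and no polyfill calls, only a scalar loop
/// over 32-byte chunks. (Hypothetical name and signature.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
pub unsafe fn byte_find_all_avx2_replica(haystack: &[u8], needle: u8) -> Vec<usize> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + 32 <= haystack.len() {
        for j in 0..32 {
            if haystack[i + j] == needle {
                out.push(i + j);
            }
        }
        i += 32;
    }
    // Scalar tail for the last partial chunk.
    while i < haystack.len() {
        if haystack[i] == needle {
            out.push(i);
        }
        i += 1;
    }
    out
}

/// Runtime dispatch: only call the AVX2-annotated replica when the CPU
/// reports AVX2; otherwise fall through to the scalar reference.
pub fn byte_find_all(haystack: &[u8], needle: u8) -> Vec<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was checked on the line above.
            return unsafe { byte_find_all_avx2_replica(haystack, needle) };
        }
    }
    byte_find_all_scalar(haystack, needle)
}
```

Both paths return the same positions, which is the point of the COSMETIC label: the attribute changes what the compiler may emit, not what the loop computes.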
---

### Replacement Plan (Cosmetic Functions Only)

#### `byte_scan.rs`: `byte_find_all_avx2` (line 22) and `byte_count_avx2` (line 86)

**Problem:** `#[target_feature(enable = "avx2")]` on pure scalar 32-byte loop.
**No `U8x32` exists** in `crate::simd` (confirmed: searched entire `src/`; zero results).
**Correct polyfill replacement:** None available at AVX2 tier. Two options:
1. **Delete** both functions and fall through to scalar path (honest: no speedup anyway).
2. **Add `U8x32` to `simd_avx2.rs`** with `splat`, `from_slice`, and `cmpeq_mask -> u32` methods, then replace scalar loops with `U8x32::splat(needle)` + `cmpeq_mask` + `trailing_zeros` bitmask scatter.

**Polyfill gap:** `U8x32::cmpeq_mask` does **not exist** in `simd_avx2.rs`. The file contains zero `U8x*` types. The AVX2 tier must add this type before any real replacement is feasible.

**Methods needed in `simd_avx2.rs`:**
- `U8x32::splat(v: u8) -> U8x32`
- `U8x32::from_slice(s: &[u8]) -> U8x32`
- `U8x32::cmpeq_mask(self, other: U8x32) -> u32`, which maps to `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8`

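As a rough illustration of option 2, the sketch below shows the scalar-fallback shape such a type could take, together with the `trailing_zeros` bitmask scatter. `U8x32` does not exist in the crate today, so everything here is hypothetical, including the array-backed representation and the name `byte_find_all_u8x32`.

```rust
/// Hypothetical array-backed stand-in for a 256-bit byte vector.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U8x32(pub [u8; 32]);

impl U8x32 {
    /// Broadcast one byte to all 32 lanes.
    pub fn splat(v: u8) -> Self {
        U8x32([v; 32])
    }

    /// Load the first 32 bytes of `s` (panics if `s` is shorter).
    pub fn from_slice(s: &[u8]) -> Self {
        let mut lanes = [0u8; 32];
        lanes.copy_from_slice(&s[..32]);
        U8x32(lanes)
    }

    /// Bit `i` of the result is set when lane `i` is equal in both vectors.
    pub fn cmpeq_mask(self, other: U8x32) -> u32 {
        let mut mask = 0u32;
        for i in 0..32 {
            if self.0[i] == other.0[i] {
                mask |= 1u32 << i;
            }
        }
        mask
    }
}

/// Find all positions of `needle`, 32 bytes at a time.
pub fn byte_find_all_u8x32(haystack: &[u8], needle: u8) -> Vec<usize> {
    let needle_v = U8x32::splat(needle);
    let mut out = Vec::new();
    let mut base = 0;
    while base + 32 <= haystack.len() {
        let mut mask = U8x32::from_slice(&haystack[base..]).cmpeq_mask(needle_v);
        // Bitmask scatter: peel off the lowest set bit until none remain.
        while mask != 0 {
            out.push(base + mask.trailing_zeros() as usize);
            mask &= mask - 1;
        }
        base += 32;
    }
    // Scalar tail for the last partial chunk.
    for i in base..haystack.len() {
        if haystack[i] == needle {
            out.push(i);
        }
    }
    out
}
```

An AVX2-backed version would keep the same three method signatures and swap the bodies for the intrinsics named in the list above.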
#### `palette_codec.rs`: `unpack_generic_avx512` (line 303) and `pack_generic_avx512` (line 335)

**Problem:** Both are verbatim scalar copies of `unpack_indices`/`pack_indices` wearing `avx512f` decoration.
**Real replacement requires:** gather/scatter ops, i.e. `U8x64` scatter via `U16x32` widening + `U16x32::shr_epi16` + `pack_saturate_u8`. No single polyfill maps cleanly to variable-width bit unpacking.
**Honest replacement plan:** Delete both functions. Document `pack_indices`/`unpack_indices` as the canonical path. Add a `// NOTE: real SIMD unpack requires shr_epi16+pack_saturate_u8 per bit-width; not yet implemented.` comment in `pack_indices_simd` / `unpack_indices_simd`.

**Polyfill gap:** `U16x32::shr_epi16(shift: u32)` exists (line ~1244 in simd_avx512.rs region), but the **scalar fallback in `simd.rs`** lacks it. The AVX-512 path can be implemented; a scalar polyfill for the `simd.rs::scalar` module would need:
- `U16x32::shr_epi16(self, shift: u32) -> U16x32` (scalar: element-wise `>> shift`)

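For readers unfamiliar with the scalar path being kept, here is a minimal sketch of variable-width index packing into `u64` words built around the `word >> bit_offset & mask_val` pattern quoted in the table. It assumes indices never straddle a word boundary; the log does not say which layout `pack_indices` actually uses, so the `_sketch` names and signatures are illustrative only.

```rust
/// Pack `indices` at `bits` bits each into u64 words.
/// Assumption: an index never straddles two words, so each word holds
/// floor(64 / bits) indices and any leftover high bits stay zero.
pub fn pack_indices_sketch(indices: &[u16], bits: u32) -> Vec<u64> {
    assert!((1..=16).contains(&bits), "bits must be in 1..=16");
    let per_word = (64 / bits) as usize;
    let n_words = (indices.len() + per_word - 1) / per_word;
    let mask_val = (1u64 << bits) - 1;
    let mut words = vec![0u64; n_words];
    for (i, &idx) in indices.iter().enumerate() {
        let bit_offset = (i % per_word) as u32 * bits;
        words[i / per_word] |= (idx as u64 & mask_val) << bit_offset;
    }
    words
}

/// Inverse of `pack_indices_sketch`: read `count` indices back out.
pub fn unpack_indices_sketch(words: &[u64], bits: u32, count: usize) -> Vec<u16> {
    assert!((1..=16).contains(&bits), "bits must be in 1..=16");
    let per_word = (64 / bits) as usize;
    let mask_val = (1u64 << bits) - 1;
    (0..count)
        .map(|i| {
            let bit_offset = (i % per_word) as u32 * bits;
            ((words[i / per_word] >> bit_offset) & mask_val) as u16
        })
        .collect()
}
```

The per-element shift amount depends on both the position and the bit width, which is why no single fixed-shift vector primitive covers the generic case.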
#### `palette_codec.rs`: `unpack_4bit_avx2` (line 353)

**Problem:** Nibble-split loop over 32-byte chunks, zero `_mm256_*` intrinsics.
**Correct polyfill:** Real 4-bit unpack uses `U8x32::unpacklo_epi8` + `U8x32::and` + `U8x32::srli_epi16`. Neither `unpacklo_epi8` nor `srli_epi16` exists on the AVX2 tier.
**Methods needed in `simd_avx2.rs`:**
- `U8x32::unpacklo_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpacklo_epi8`)
- `U8x32::unpackhi_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpackhi_epi8`)
- `U8x32::srli_epi16(self, imm: i32) -> U8x32` (maps to `_mm256_srli_epi16`)
- `U8x32::and(self, mask: U8x32) -> U8x32` (maps to `_mm256_and_si256`), to mask out each nibble

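The scalar nibble split that the cosmetic function performs can be sketched as below. The nibble order (low nibble first) is an assumption, since the log does not state which order `unpack_4bit_avx2` emits, and the function name is hypothetical.

```rust
/// Split each byte into two 4-bit values.
/// Assumption: the low nibble is emitted before the high nibble.
pub fn unpack_4bit_sketch(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(b & 0x0F); // low nibble
        out.push(b >> 4); // high nibble
    }
    out
}
```

A vector version would do the same mask and shift on 32 bytes at once and then interleave the two results, which is where the unpack, shift and `and` methods listed above come in.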
#### `palette_codec.rs`: `bedrock_reorder_xzy_avx512` (line 501)

**Problem:** Scalar triple-loop permutation using `get_unchecked`, zero SIMD.
**Correct polyfill:** Real AVX-512 version would use `U16x32::gather` with computed indices. No gather primitive exists in `crate::simd` for `u16`.
**Honest replacement plan:** Delete the function; route `bedrock_reorder_xzy` directly to the scalar path. Add comment: `// AVX-512 gather on u16 requires widening to u32; not yet in polyfill.`
**Methods needed (if implemented):**
- `U32x16::gather_u16(base: *const u16, vindex: U32x16) -> U32x16`: not present; would wrap `_mm512_i32gather_epi32` with 2-byte scale.

#### `aabb.rs`: `aabb_intersect_batch_sse41` (line 241)

**Problem:** Scalar per-candidate loop. The AUTOVEC check confirmed zero SSE4.1 instructions emitted.
**The `aabb_expand_batch_sse2` function IS REAL-AUTOVEC** (its assembly contains `addps`/`subps`); the SSE4.1 hint on the intersection function does NOT produce `blendvps` or `cmpps`.
**Correct polyfill:** Use `F32x4` (SSE2-width) comparison. No `F32x4` type exists in `crate::simd`. Alternatively, use `F32x8` (AVX2) for 2-candidate-at-once processing, or simply rename to `aabb_intersect_batch_scalar_hint` and document the annotation as a scheduling hint only.

**Methods needed in `simd_avx2.rs` (for real SSE4.1 replacement):**
- `F32x4::from_array([f32; 4]) -> F32x4` (type does not exist)
- OR accept that 1-candidate-at-a-time is scalar-only and rename the function honestly.

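For reference, the two scalar shapes discussed for `aabb.rs` can be sketched as follows. The `Aabb` layout (two `[f32; 3]` corners, six floats in total) follows the description in this entry, but the signatures, the `_sketch` names and the uniform `margin` parameter are assumptions.

```rust
/// Axis-aligned bounding box as six floats (assumed layout).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Aabb {
    pub min: [f32; 3],
    pub max: [f32; 3],
}

/// One result per candidate: true when it overlaps `query` on all three
/// axes. This is the branchy per-candidate chain that did not vectorize.
pub fn aabb_intersect_batch_sketch(query: &Aabb, candidates: &[Aabb]) -> Vec<bool> {
    candidates
        .iter()
        .map(|c| (0..3).all(|a| query.min[a] <= c.max[a] && query.max[a] >= c.min[a]))
        .collect()
}

/// Grow every box by `margin` on each side. This is the straight-line
/// subtract/add update that the assembly check found vectorized.
pub fn aabb_expand_batch_sketch(aabbs: &mut [Aabb], margin: f32) {
    for b in aabbs {
        for a in 0..3 {
            b.min[a] -= margin;
            b.max[a] += margin;
        }
    }
}
```

The contrast is visible in the source: the expand loop has no data-dependent branches, while the intersect loop short-circuits per axis.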
---

### Polyfill Methods Needed in `simd_avx2.rs` (and scalar fallback)

To make the above replacements fully feasible, these methods must be added:

| Method | Location | Wraps (AVX2) | Scalar fallback |
|--------|----------|--------------|-----------------|
| `U8x32::splat(v: u8)` | `simd_avx2.rs` | `_mm256_set1_epi8` | element-wise fill |
| `U8x32::from_slice(s: &[u8])` | `simd_avx2.rs` | `_mm256_loadu_si256` | copy 32 bytes |
| `U8x32::cmpeq_mask(self, other: U8x32) -> u32` | `simd_avx2.rs` | `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8` | element-wise `==` as bitmask |
| `U8x32::unpacklo_epi8(self, other: U8x32)` | `simd_avx2.rs` | `_mm256_unpacklo_epi8` | interleave lo halves |
| `U8x32::unpackhi_epi8(self, other: U8x32)` | `simd_avx2.rs` | `_mm256_unpackhi_epi8` | interleave hi halves |
| `U8x32::and(self, mask: U8x32)` | `simd_avx2.rs` | `_mm256_and_si256` | element-wise `&` |
| `U8x32::srli_epi16(self, imm: i32)` | `simd_avx2.rs` | `_mm256_srli_epi16` | element-wise `>> imm` |
| `U16x32::shr_epi16(self, shift: u32)` | scalar in `simd.rs` | already in `simd_avx512.rs:~1275` | element-wise `>> shift` |

The `U8x32` type itself (the 256-bit byte vector) is entirely absent from `simd_avx2.rs`, so all 7 `U8x32` methods above require first creating the type. This is the foundational gap for the AVX2-tier byte scan and nibble unpack paths.

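To make the "Wraps (AVX2)" column concrete, the sketch below shows how the `cmpeq_mask` row could map onto the listed intrinsics, with a scalar fallback behind runtime detection. The free-function form and the names `cmpeq_mask_avx2`, `cmpeq_mask_scalar` and `cmpeq_mask32` are hypothetical; only the `std::arch` intrinsics themselves are real.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
    __m256i, _mm256_cmpeq_epi8, _mm256_loadu_si256, _mm256_movemask_epi8, _mm256_set1_epi8,
};

/// AVX2 path: compare 32 bytes against `needle`, one result bit per lane.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn cmpeq_mask_avx2(chunk: &[u8; 32], needle: u8) -> u32 {
    unsafe {
        // Unaligned load of the 32-byte chunk.
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // Broadcast the needle to all 32 lanes.
        let n = _mm256_set1_epi8(needle as i8);
        // 0xFF per equal lane, then collect each lane's top bit.
        _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, n)) as u32
    }
}

/// Scalar fallback with the same bit layout (bit i <-> byte i).
fn cmpeq_mask_scalar(chunk: &[u8; 32], needle: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        if b == needle {
            mask |= 1u32 << i;
        }
    }
    mask
}

/// Dispatch on runtime CPU features.
pub fn cmpeq_mask32(chunk: &[u8; 32], needle: u8) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was checked on the line above.
            return unsafe { cmpeq_mask_avx2(chunk, needle) };
        }
    }
    cmpeq_mask_scalar(chunk, needle)
}
```

Keeping the bit layout identical in both paths is what lets callers use the same `trailing_zeros` loop regardless of which tier ran.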
---

### Key Finding: `aabb_expand_batch_sse2` is REAL-AUTOVEC

This function was previously listed as cosmetic by earlier agents. ASM confirms otherwise: the SSE2-annotated `[f32; 3] min/max subtract+add` loop compiles to `movups`/`subps`/`addps`/`shufps` on XMM registers. The claim that the same code compiles to scalar without the annotation holds only on targets whose baseline lacks SSE2; on `x86_64`, SSE2 is already in the baseline feature set, so the vectorization there does not depend on the attribute. Either way, this one function in `aabb.rs` is genuinely vectorized rather than cosmetic. Do not remove it.

## 2026-05-13T19:35 — agent #10 audit-color (sonnet) [backfilled by main]

**Files:** bevy_pbr/atmosphere/{resources,environment}.rs +
  light_probe/generate.rs + ssao/mod.rs + bevy_image/{image,ktx2}.rs
**Verdict:** **0 of 10 sites worth converting.** All NOT-WORTH.

Root causes:
1. All atmosphere / light-probe / SSAO f16 textures are GPU-only. The CPU only
   sets the wgpu `TextureFormat` descriptor. GPU compute shaders fill them.
2. `Image::convert` does NOT support `Rgba16Float` as a target (returns
   `None` at image.rs:1550). No bulk f32→f16 path exists today.
3. `set_color_at` / `get_color_at` are single-pixel-per-call APIs. Only
   caller is `bevy_sprite/picking_backend.rs` (1 px per pointer event).
4. KTX2 copies half-float bytes verbatim, with no decode loop.

The "500-20000× BF16 batch" claim from ndarray's `f32_to_bf16_batch_rne`
docs is real but unreachable in Bevy as-shipped. The Bevy CPU never
touches f16/bf16 data in bulk.

**Latent opportunity (not in codebase today):** if `Image::convert` were
extended to support `Rgba16Float` as a destination, a bulk
Rgba8Unorm → Rgba16Float path would touch W·H·4 f32→f16 values (33M at
4K), a genuine `cast_f32_to_f16_batch` candidate. Would have to ship the
Image::convert extension AND the SIMD path together.

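The "33M at 4K" figure and the kind of per-element work a bulk conversion involves can be sketched as below. The element count follows directly from W·H·4 with a 3840×2160 frame. The converter shown is a plain scalar f32→bf16 round-to-nearest-even, included only to illustrate the batch shape; it is not ndarray's `f32_to_bf16_batch_rne`, and the latent opportunity described above is f32→f16, which needs a different bit layout. All names are hypothetical.

```rust
/// Number of channel values in an RGBA image of the given size.
pub fn rgba_element_count(width: usize, height: usize) -> usize {
    width * height * 4
}

/// Scalar f32 -> bf16 with round-to-nearest-even.
/// bf16 keeps the top 16 bits of the f32 encoding.
pub fn f32_to_bf16_rne_sketch(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN by forcing a mantissa bit on.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit, so exact ties round to even.
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// Batch form: one output value per input value.
pub fn f32_to_bf16_batch_sketch(src: &[f32]) -> Vec<u16> {
    src.iter().map(|&x| f32_to_bf16_rne_sketch(x)).collect()
}
```

At 4K, `rgba_element_count(3840, 2160)` is 33,177,600 values per image, which is the scale at which a batched SIMD path would start to matter.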
## 2026-05-13T19:45 — agent #5 plugin-tests (sonnet) [backfilled by main]

**Files:** `bevy/examples/ndarray_graph_plugin_tests.rs` (308 lines) +
  Cargo.toml `[[example]]` entry
**Status:** ALL 5 TESTS PASS (dual mode: `cargo run` exits nonzero on
  failure, `cargo test --example` also works)

Tests:
1. plugin_initializes_global_renderer_resource: `GraphRenderer` resource
   present after plugin build; `GLOBAL_RENDERER.tick_count() == 0`
2. startup_seeds_nodes_and_edges: front.len=2, edges.len=1 after first
   app.update()
3. tick_advances_position_via_integrate_simd: position 10.0 → 10.016666
   (= 1.0 * DT_60 + 10.0, exact). Confirms F32x16::mul_add polyfill ran
4. compose_neo4j_emits_pixels_to_framebuffer: 106 non-zero bytes in
   128×128 buffer (threshold=50)
5. polyfill_runtime_tier_matches_expectation: confirms avx512f=true
   AND avx2=true on Sapphire Rapids; PREFERRED_F32_LANES=8 (the smoke
   test's catch: compile-time AVX2 path on AVX-512 hardware)

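Tests 3 and 5 rest on two small facts that the sketch below tries to make concrete: a fused multiply-add of velocity 1.0 over one 60 Hz step moves 10.0 to about 10.016666, and a lane count chosen at compile time can disagree with what the CPU reports at run time. `DT_60 = 1/60`, the 16/8/4 lane mapping and all names here are assumptions, not the plugin's actual definitions.

```rust
/// Assumed fixed timestep for a 60 Hz tick.
pub const DT_60: f32 = 1.0 / 60.0;

/// One integration step: position + velocity * dt, as a fused multiply-add.
pub fn integrate_step(position: f32, velocity: f32, dt: f32) -> f32 {
    velocity.mul_add(dt, position)
}

/// Lane count fixed when the crate is compiled (assumed mapping).
/// It depends on the target features the build enabled, not on the
/// machine the binary later runs on.
pub const PREFERRED_F32_LANES_SKETCH: usize = if cfg!(target_feature = "avx512f") {
    16
} else if cfg!(target_feature = "avx2") {
    8
} else {
    4
};

/// What the running CPU actually supports.
pub fn runtime_has_avx512f() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return true;
        }
    }
    false
}

/// True when the hardware offers AVX-512 but the build settled on fewer
/// than 16 lanes, which is the mismatch test 5 is described as catching.
pub fn compiled_below_runtime_tier() -> bool {
    runtime_has_avx512f() && PREFERRED_F32_LANES_SKETCH < 16
}
```

On a build compiled for AVX2 and run on AVX-512 hardware, `compiled_below_runtime_tier()` would return true under these assumptions.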
**Duplication risk:** test file defines `NdarrayGraphPlugin` + `GraphRenderer`
INLINE because agent #5 ran in parallel with agent #1 and couldn't import.
Main thread will consolidate after fleet completion: either (a) test file
imports from agent #1's plugin file, or (b) move the plugin types into a
shared `examples/ndarray_graph_lib.rs` module that both import.


## 2026-05-13T19:55 — agent #8 audit-skin (sonnet) [backfilled by main]

**File:** `bevy_pbr/src/render/skin.rs` (515 lines)
**Verdict:** **NOT-WORTH**

Bevy's skinning is GPU-side WGSL. `skin.rs` is a CPU staging step that
computes one final `Mat4` per joint and writes it to a wgpu buffer for
upload. Four candidate hot paths:

1. `extract_joints_for_skin` (L399-413): per-frame joint matrix update.
   ECS change-detection gate at L406 → irregular skip pattern. Can't
   batch for GEMM. M=N=K=4 GEMM is overhead-dominated anyway.
2. `add_skin` (L452-474): initial population on visibility change.
   Contiguous loop, no skip, so it is the ONLY uninterrupted math path. But
   it fires ~0 times/sec in stable scenes. Cold path.
3. `prepare_skins` (L176-244): pure DMA via `bytemuck::must_cast_slice`.
   No arithmetic.
4. Per-vertex weighted blend: **not in this file**. GPU-side WGSL.

Numbers: MAX_JOINTS=256, full-rig scalar cost ~16 µs/mesh/frame. AVX-512
at 8× would save 14 µs/mesh/frame. GPU skinning noise floor is 0.5-2 ms.
SIMD savings disappear below GPU baseline.

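The arithmetic behind these numbers is short enough to write out. The sketch below only restates the figures quoted above (16 µs scalar cost, an assumed ideal 8× speedup, and a 0.5 to 2 ms GPU floor); the function names are hypothetical.

```rust
/// Time saved when a `scalar_us` cost is sped up by `speedup`.
pub fn simd_saving_us(scalar_us: f64, speedup: f64) -> f64 {
    scalar_us - scalar_us / speedup
}

/// The saving expressed as a fraction of a GPU floor given in milliseconds.
pub fn saving_share_of_floor(saving_us: f64, floor_ms: f64) -> f64 {
    saving_us / (floor_ms * 1000.0)
}
```

With these inputs, `simd_saving_us(16.0, 8.0)` is 14 µs, which is 2.8% of a 0.5 ms floor and 0.7% of a 2 ms floor, so the saving sits inside the GPU-side noise.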
**ndarray API surface needed: NONE.** Skin is not a SIMD-polyfill
integration candidate. The performance levers are GPU shader
optimization + wgpu buffer bandwidth, both outside ndarray's scope.

