Documents the 12-agent CCA2A round-2 fleet that delivered the actual
Bevy plugin (AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK).
Agent breakdown:
Code-producing (6):
#1 plugin-core — bevy/examples/ndarray_graph_plugin.rs (274 lines)
#2 plugin-palette — bevy/examples/ndarray_graph_palette.rs (100 lines)
#3 plugin-ci — bevy/.github/workflows/ndarray-smoke.yml
#4 plugin-readme — bevy/examples/README_NDARRAY_PLUGIN.md
#5 plugin-tests — bevy/examples/ndarray_graph_plugin_tests.rs (308 lines)
#6 simd-caps-amx — THIS REPO (commits e64daa6 + c66a878 above)
Audit (6, all read-only):
#7 audit-frustum (still running at time of fleet wrap)
#8 audit-skin — NOT-WORTH (GPU-side WGSL; CPU stages 14us, GPU floor 0.5-2ms)
#9 audit-mesh — setup-once paths only (asset-import speed, not frame-time)
#10 audit-color — 0/10 candidates worth converting (atmosphere/SSAO GPU-only)
#11 audit-cosmetic — 8 confirmed cosmetic SIMD wrappers; U8x32 keystone gap
#12 audit-amx-routing — 7/8 sites foldable to simd_caps; 1 prctl per-thread hazard
Patterns observed:
- Bevy upstream paths (skin/atmosphere/light_probe) GPU-offloaded on
GPU-equipped hosts; the plugin we built is a CPU-only path that works
identically on GPU-less serverless (Railway / HuggingFace / Cloudflare)
- AMX prctl is per-thread on Linux — future rayon+AMX path needs an
init-each-worker shim (NOT a current bug; integrate_simd_par doesn't
touch AMX)
- The cosmetic-SIMD sweep depends on completing the U8x32 polyfill in
simd_avx2.rs (currently absent), which is the real keystone work
Companion: AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK ca4a973
(the actual Bevy plugin shipped in parallel with these ndarray fields).
| `byte_scan.rs` | 22 | `byte_find_all_avx2` | `avx2` | NO | NO: scalar `haystack[i+j] == needle` loops | COSMETIC |
| `byte_scan.rs` | 86 | `byte_count_avx2` | `avx2` | NO | NO: scalar `haystack[i+j] == needle` loops | COSMETIC |
| `byte_scan.rs` | 52 | `byte_find_all_avx512` | `avx512bw` | NO (`_mm*` absent) | YES: uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask` | REAL (polyfill-backed) |
| `byte_scan.rs` | 115 | `byte_count_avx512` | `avx512bw` | NO (`_mm*` absent) | YES: uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask`, `.count_ones()` | REAL (polyfill-backed) |
| `palette_codec.rs` | 303 | `unpack_generic_avx512` | `avx512f` | NO | NO: scalar nested loop (`word >> bit_offset & mask_val`) | COSMETIC |
| `palette_codec.rs` | 335 | `pack_generic_avx512` | `avx512f` | NO | NO: scalar for loop, verbatim copy of `pack_indices` | COSMETIC |
| `palette_codec.rs` | 353 | `unpack_4bit_avx2` | `avx2` | NO | NO: scalar nibble-split loop over `bytes[i..i+32]` | COSMETIC |
| `palette_codec.rs` | 501 | `bedrock_reorder_xzy_avx512` | `avx512f` | NO | NO: scalar triple-nested loop with `get_unchecked` | COSMETIC |
| `aabb.rs` | 241 | `aabb_intersect_batch_sse41` | `sse4.1` | NO | NO: scalar per-candidate `if` chain, identical to `aabb_intersect_batch_scalar` | COSMETIC |
| `aabb.rs` | 174 | `aabb_intersect_batch_avx512` | `avx512f` | NO | YES: uses `F32x16::from_array`, `F32x16::splat`, `F32x16::simd_le`, `F32x16::simd_ge`, `F32Mask16.0 &` | REAL (polyfill-backed) |
| `aabb.rs` | 329 | `ray_aabb_slab_test_avx512` | `avx512f` | NO | YES: uses `F32x16::splat`, arithmetic ops, `simd_min`, `simd_max`, `simd_le`, `simd_ge`, `to_array` | REAL (polyfill-backed) |
| `aabb.rs` | 464 | `aabb_expand_batch_sse2` | `sse2` | NO | NO: scalar per-AABB field update, identical to `aabb_expand_batch_scalar` | COSMETIC |

**Summary: 8 COSMETIC, 4 REAL (polyfill-backed, no raw `_mm*`)**

---

### AUTOVEC CHECK (empirical, via `rustc 1.94.1 --emit asm`)

Built a minimal replica of each cosmetic function with `#[no_mangle] extern "C"` to prevent dead-code elimination. Assembly analyzed for `ymm*`/`zmm*`/`xmm*`/`vp*`/`vcmp*` instructions:

Assembly: pure scalar integer ops (`cmpb`, `jne`, `movb`, `incq`). Zero YMM/XMM registers. LLVM did NOT autovectorize the append-to-Vec loop. **COSMETIC: not autovec'd.**

Assembly: contains `vmovups %zmm0` for the memset/zeroing prelude (LLVM auto-vectorized the zero-init with an AVX-512 store), but the main bit-packing loop is scalar shift+OR. The `%zmm0` instruction comes from the `vec![0u64; n_words]` zero-fill, not from the index-packing loop body. **Zeroing autovec'd; bit-pack loop COSMETIC.**

Assembly: uses `movups`/`subps`/`addps`/`shufps` on `%xmm` registers: **REAL-AUTOVEC.** LLVM vectorized the 6-float struct update into XMM-register arithmetic. The SSE2 feature hint IS doing useful work here: without it, LLVM would not be permitted to use `addps`/`subps` on this loop. **Mark as REAL-AUTOVEC.**

---

### Replacement Plan (Cosmetic Functions Only)

#### `byte_scan.rs`: `byte_find_all_avx2` (line 22) and `byte_count_avx2` (line 86)

**Problem:** `#[target_feature(enable = "avx2")]` on a pure scalar 32-byte loop.

**No `U8x32` exists** in `crate::simd` (confirmed: searched entire `src/`; zero results).

**Correct polyfill replacement:** None available at AVX2 tier. Two options:

1. **Delete** both functions and fall through to the scalar path (honest: no speedup anyway).
2. **Add `U8x32` to `simd_avx2.rs`** with `splat`, `from_slice`, `cmpeq_mask → u32` methods, then replace the scalar loops with `U8x32::splat(needle)` + `cmpeq_mask` + `trailing_zeros` bitmask scatter.

**Polyfill gap:** `U8x32::cmpeq_mask` does **not exist** in `simd_avx2.rs`. The file contains zero `U8x*` types. The AVX2 tier must add this type before any real replacement is feasible.
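
As a rough illustration of option 2, the sketch below shows what a minimal `U8x32` with `splat`, `from_slice`, and `cmpeq_mask` might look like, together with the `trailing_zeros` bitmask scatter. Everything here is an assumption for illustration: the type does not exist in `crate::simd`, the module name `avx2_sketch` and the dispatcher `byte_find_all` are hypothetical, and the real polyfill may choose different signatures. Only the `std::arch` intrinsics are existing API.

```rust
/// Scalar reference: positions of every byte equal to `needle`.
pub fn byte_find_all_scalar(haystack: &[u8], needle: u8) -> Vec<usize> {
    haystack
        .iter()
        .enumerate()
        .filter(|&(_, &b)| b == needle)
        .map(|(i, _)| i)
        .collect()
}

#[cfg(target_arch = "x86_64")]
mod avx2_sketch {
    use std::arch::x86_64::*;

    /// Hypothetical 256-bit byte vector (32 x u8); not part of `crate::simd` today.
    #[derive(Clone, Copy)]
    pub struct U8x32(__m256i);

    impl U8x32 {
        #[target_feature(enable = "avx2")]
        pub unsafe fn splat(b: u8) -> Self {
            Self(_mm256_set1_epi8(b as i8))
        }

        #[target_feature(enable = "avx2")]
        pub unsafe fn from_slice(s: &[u8]) -> Self {
            assert!(s.len() >= 32);
            // Unaligned load of the first 32 bytes of `s`.
            Self(_mm256_loadu_si256(s.as_ptr() as *const __m256i))
        }

        /// Bit `i` of the result is set when byte lane `i` compares equal.
        #[target_feature(enable = "avx2")]
        pub unsafe fn cmpeq_mask(self, other: Self) -> u32 {
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(self.0, other.0)) as u32
        }
    }

    #[target_feature(enable = "avx2")]
    pub unsafe fn byte_find_all(haystack: &[u8], needle: u8) -> Vec<usize> {
        let mut out = Vec::new();
        let pattern = U8x32::splat(needle);
        let mut i = 0;
        while i + 32 <= haystack.len() {
            let mut mask = U8x32::from_slice(&haystack[i..i + 32]).cmpeq_mask(pattern);
            // Bitmask scatter: emit one index per set bit, lowest bit first.
            while mask != 0 {
                out.push(i + mask.trailing_zeros() as usize);
                mask &= mask - 1; // clear the lowest set bit
            }
            i += 32;
        }
        // Scalar tail for the last (< 32) bytes.
        for (j, &b) in haystack[i..].iter().enumerate() {
            if b == needle {
                out.push(i + j);
            }
        }
        out
    }
}

/// Hypothetical dispatcher: AVX2 when available at runtime, scalar otherwise.
pub fn byte_find_all(haystack: &[u8], needle: u8) -> Vec<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            return unsafe { avx2_sketch::byte_find_all(haystack, needle) };
        }
    }
    byte_find_all_scalar(haystack, needle)
}
```

A counting variant would follow the same shape, replacing the inner scatter loop with `mask.count_ones()`.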
#### `palette_codec.rs` — `unpack_generic_avx512` (line 303) and `pack_generic_avx512` (line 335)

**Problem:** Both are verbatim scalar copies of `unpack_indices`/`pack_indices` wearing `avx512f` decoration.

**Real replacement requires:** gather/scatter ops, i.e. `U8x64` scatter via `U16x32` widening + `U16x32::shr_epi16` + `pack_saturate_u8`. No single polyfill maps cleanly to variable-width bit unpacking.

**Honest replacement plan:** Delete both functions. Document `pack_indices`/`unpack_indices` as the canonical path. Add a `// NOTE: real SIMD unpack requires shr_epi16+pack_saturate_u8 per bit-width; not yet implemented.` comment in `pack_indices_simd` / `unpack_indices_simd`.
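
For orientation, here is a minimal sketch of what a scalar variable-width pack/unpack pair of this kind typically looks like. It is not the code in `palette_codec.rs`: the signatures are hypothetical, and the layout (indices packed low-to-high into `u64` words, never straddling a word boundary) is an assumption chosen to match the `word >> bit_offset & mask_val` shape noted in the audit table.

```rust
/// Pack `indices` into u64 words using `bits` bits per index.
/// Assumed layout: floor(64 / bits) indices per word, no index spans two words.
pub fn pack_indices(indices: &[u32], bits: u32) -> Vec<u64> {
    assert!((1..=32).contains(&bits));
    let per_word = (64 / bits) as usize;
    let mask_val = (1u64 << bits) - 1;
    let n_words = (indices.len() + per_word - 1) / per_word;
    let mut words = vec![0u64; n_words];
    for (i, &idx) in indices.iter().enumerate() {
        let bit_offset = (i % per_word) as u32 * bits;
        words[i / per_word] |= (idx as u64 & mask_val) << bit_offset;
    }
    words
}

/// Inverse of `pack_indices`: read back `count` indices of `bits` bits each.
pub fn unpack_indices(words: &[u64], bits: u32, count: usize) -> Vec<u32> {
    assert!((1..=32).contains(&bits));
    let per_word = (64 / bits) as usize;
    let mask_val = (1u64 << bits) - 1;
    let mut out = Vec::with_capacity(count);
    'words: for &word in words {
        for slot in 0..per_word {
            if out.len() == count {
                break 'words;
            }
            let bit_offset = slot as u32 * bits;
            out.push(((word >> bit_offset) & mask_val) as u32);
        }
    }
    out
}
```

Because `bit_offset` and `mask_val` depend on the runtime bit width, a vectorized version needs a per-width shift and narrowing step, which is why the plan above defers it.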

**Polyfill gap:** `U16x32::shr_epi16(shift: u32)` exists (line ~1244 in simd_avx512.rs region), but the **scalar fallback in `simd.rs`** lacks it. The AVX-512 path can be implemented; a scalar polyfill for the `simd.rs::scalar` module would need:
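
A scalar fallback of this kind could look roughly like the sketch below. The struct layout (a plain `[u16; 32]` wrapper) and the `splat` helper are assumptions, not the actual `simd.rs` definitions; the shift semantics follow the hardware 16-bit logical shift, where a count of 16 or more yields zero in every lane.

```rust
/// Hypothetical scalar stand-in for a 32-lane u16 vector.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U16x32(pub [u16; 32]);

impl U16x32 {
    pub fn splat(v: u16) -> Self {
        Self([v; 32])
    }

    /// Element-wise logical right shift by `shift` bits.
    /// Counts of 16 or more produce all-zero lanes instead of panicking.
    pub fn shr_epi16(self, shift: u32) -> Self {
        let mut out = [0u16; 32];
        if shift < 16 {
            for (o, v) in out.iter_mut().zip(self.0.iter()) {
                *o = *v >> shift;
            }
        }
        Self(out)
    }
}
```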
**Problem:** Nibble-split loop over 32-byte chunks, zero `_mm256_*` intrinsics.

**Correct polyfill:** Real 4-bit unpack uses `U8x32::unpacklo_epi8` + `U8x32::and` + `U8x32::srli_epi16`. Neither `unpacklo_epi8` nor `srli_epi16` exists on the AVX2 tier.

**Methods needed in `simd_avx2.rs`:**

- `U8x32::unpacklo_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpacklo_epi8`)
- `U8x32::unpackhi_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpackhi_epi8`)
- `U8x32::srli_epi16(self, imm: i32) -> U8x32` (maps to `_mm256_srli_epi16`)
- Or equivalently: `U8x32::and(self, mask: U8x32) -> U8x32` (maps to `_mm256_and_si256`)
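
The sketch below shows how those four intrinsics might combine into a real 4-bit unpack, written directly against `std::arch` because the `U8x32` wrappers do not exist yet. Two things are assumptions: the output order (low nibble first, then high nibble) and the function names. Note that `_mm256_unpacklo_epi8`/`_mm256_unpackhi_epi8` interleave within each 128-bit lane, so a cross-lane `_mm256_permute2x128_si256` is needed to restore byte order; a polyfill would likely need that method as well.

```rust
/// Scalar reference. Assumed order: low nibble first, then high nibble.
pub fn unpack_4bit_scalar(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(b & 0x0F);
        out.push(b >> 4);
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn unpack_4bit_avx2(bytes: &[u8]) -> Vec<u8> {
    use std::arch::x86_64::*;

    let mut out = vec![0u8; bytes.len() * 2];
    let nibble_mask = _mm256_set1_epi8(0x0F);
    let mut i = 0;
    while i + 32 <= bytes.len() {
        let v = _mm256_loadu_si256(bytes.as_ptr().add(i) as *const __m256i);
        // Low nibbles; high nibbles via a 16-bit shift plus mask
        // (the mask removes bits shifted in from the neighbouring byte).
        let lo = _mm256_and_si256(v, nibble_mask);
        let hi = _mm256_and_si256(_mm256_srli_epi16::<4>(v), nibble_mask);
        // Interleave lo/hi per 128-bit lane:
        // a = [bytes 0..8 | bytes 16..24], b = [bytes 8..16 | bytes 24..32]
        let a = _mm256_unpacklo_epi8(lo, hi);
        let b = _mm256_unpackhi_epi8(lo, hi);
        // Cross-lane fix-up so the output is in input-byte order.
        let first = _mm256_permute2x128_si256::<0x20>(a, b);
        let second = _mm256_permute2x128_si256::<0x31>(a, b);
        let dst = out.as_mut_ptr().add(2 * i) as *mut __m256i;
        _mm256_storeu_si256(dst, first);
        _mm256_storeu_si256(dst.add(1), second);
        i += 32;
    }
    // Scalar tail for the remaining (< 32) bytes.
    for (j, &b) in bytes[i..].iter().enumerate() {
        out[2 * (i + j)] = b & 0x0F;
        out[2 * (i + j) + 1] = b >> 4;
    }
    out
}

/// Hypothetical dispatcher: AVX2 when available at runtime, scalar otherwise.
pub fn unpack_4bit(bytes: &[u8]) -> Vec<u8> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            return unsafe { unpack_4bit_avx2(bytes) };
        }
    }
    unpack_4bit_scalar(bytes)
}
```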
**Problem:** Scalar triple-loop permutation using `get_unchecked`, zero SIMD.

**Correct polyfill:** Real AVX-512 version would use `U16x32::gather` with computed indices. No gather primitive exists in `crate::simd` for `u16`.

**Honest replacement plan:** Delete the function; route `bedrock_reorder_xzy` directly to the scalar path. Add comment: `// AVX-512 gather on u16 requires widening to u32; not yet in polyfill.`

**Methods needed (if implemented):**

- `U32x16::gather_u16(base: *const u16, vindex: U32x16) -> U32x16`: not present; would wrap `_mm512_i32gather_epi32` with 2-byte scale.
**Problem:** Scalar per-candidate loop. The autovec check confirmed that zero SSE4.1 instructions are emitted.

**The `aabb_expand_batch_sse2` function IS REAL-AUTOVEC** (SSE2 feature hint causes `addps`/`subps` emission); the SSE4.1 hint on the intersection function does NOT produce `blendvps` or `cmpps`.

**Correct polyfill:** Use `F32x4` (SSE2-width) comparison. No `F32x4` type exists in `crate::simd`. Alternatively, use `F32x8` (AVX2) for 2-candidate-at-once processing, or simply rename to `aabb_intersect_batch_scalar_hint` and document the annotation as a scheduling hint only.

**Methods needed in `simd_avx2.rs` (for real SSE4.1 replacement):**

- `F32x4::from_array([f32; 4]) -> F32x4` (type does not exist)
- OR accept that 1-candidate-at-a-time is scalar-only and rename the function honestly.
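
To make the first option concrete, the following sketch shows one possible shape for an `F32x4` comparison and a single-candidate overlap test built on it. The `F32x4` type, its `simd_le_mask` method, the `Aabb` layout, and the function names are all hypothetical; only the SSE intrinsics are existing API. The three axes occupy lanes 0 to 2 and lane 3 is padding that the final mask ignores.

```rust
#[derive(Clone, Copy)]
pub struct Aabb {
    pub min: [f32; 3],
    pub max: [f32; 3],
}

/// Scalar reference: boxes overlap when they overlap on every axis.
pub fn intersects_scalar(a: &Aabb, b: &Aabb) -> bool {
    (0..3).all(|k| a.min[k] <= b.max[k] && a.max[k] >= b.min[k])
}

#[cfg(target_arch = "x86_64")]
mod sse_sketch {
    use std::arch::x86_64::*;

    /// Hypothetical 128-bit float vector; no such type exists in `crate::simd`.
    #[derive(Clone, Copy)]
    pub struct F32x4(__m128);

    impl F32x4 {
        pub fn from_array(a: [f32; 4]) -> Self {
            // SAFETY: SSE is part of the x86_64 baseline; the load is unaligned.
            Self(unsafe { _mm_loadu_ps(a.as_ptr()) })
        }

        /// Bit `i` of the result is set when `self[i] <= other[i]`.
        pub fn simd_le_mask(self, other: Self) -> u8 {
            // SAFETY: SSE is part of the x86_64 baseline.
            unsafe { _mm_movemask_ps(_mm_cmple_ps(self.0, other.0)) as u8 }
        }
    }
}

#[cfg(target_arch = "x86_64")]
pub fn intersects(a: &Aabb, b: &Aabb) -> bool {
    use sse_sketch::F32x4;
    let pad = |v: [f32; 3]| F32x4::from_array([v[0], v[1], v[2], 0.0]);
    // a.min <= b.max and b.min <= a.max, evaluated on all three axes at once.
    let m = pad(a.min).simd_le_mask(pad(b.max)) & pad(b.min).simd_le_mask(pad(a.max));
    (m & 0b0111) == 0b0111
}

#[cfg(not(target_arch = "x86_64"))]
pub fn intersects(a: &Aabb, b: &Aabb) -> bool {
    intersects_scalar(a, b)
}
```

With one candidate per call, only three of four lanes do useful work, which supports the alternative of renaming the function honestly unless candidates are batched across lanes.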

---

### Polyfill Methods Needed in `simd_avx2.rs` (and scalar fallback)

To make the above replacements fully feasible, these methods must be added:

| `U16x32::shr_epi16(self, shift: u32)` | scalar in `simd.rs` | already in `simd_avx512.rs:~1275` | element-wise `>> shift` |

The `U8x32` type itself (the 256-bit byte vector) is entirely absent from `simd_avx2.rs`; all 7 methods above require first creating the type. This is the foundational gap for the AVX2-tier byte scan and nibble unpack paths.

---

### Key Finding: `aabb_expand_batch_sse2` is REAL-AUTOVEC

This function was previously listed as cosmetic by earlier agents. ASM confirms otherwise: the SSE2 feature annotation on the `[f32; 3] min/max subtract+add` loop causes LLVM to emit `movups`/`subps`/`addps`/`shufps` on XMM registers. Without the annotation, the same code compiles to scalar. Note that SSE2 is already part of the x86_64 baseline, so this difference is only expected on targets that do not enable SSE2 by default (for example some 32-bit x86 targets); re-check the assembly on the actual build target. This one function in `aabb.rs` is a legitimate use of `#[target_feature]` as an LLVM autovectorization hint. Do not remove it.
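
For readers who want to repeat the check, a minimal replica of this pattern might look like the sketch below. The `Aabb` layout, the function names, and the `amount` parameter are assumptions rather than the actual `aabb.rs` code. Compiling such a file with optimizations and `--emit asm`, then searching the output for `addps`/`subps`, is one way to see whether the loop was vectorized.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
pub struct Aabb {
    pub min: [f32; 3],
    pub max: [f32; 3],
}

/// Scalar-source loop whose only SIMD ingredient is the feature annotation.
/// Any packed instructions in the output come from LLVM autovectorization.
#[cfg_attr(target_arch = "x86_64", target_feature(enable = "sse2"))]
pub unsafe fn aabb_expand_batch_sse2(boxes: &mut [Aabb], amount: f32) {
    for b in boxes.iter_mut() {
        for k in 0..3 {
            b.min[k] -= amount;
            b.max[k] += amount;
        }
    }
}

/// Safe wrapper (hypothetical name).
pub fn aabb_expand_batch(boxes: &mut [Aabb], amount: f32) {
    // SAFETY: on x86_64, SSE2 is part of the baseline feature set; on other
    // architectures the attribute above is not applied at all.
    unsafe { aabb_expand_batch_sse2(boxes, amount) }
}
```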
## 2026-05-13T19:35 — agent #10 audit-color (sonnet) [backfilled by main]