feat(simd): add ndarray::simd::bf16_tile_gemm_16x16 polyfill primitive by AdaWorldAPI · Pull Request #222 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-06-20T16:56:49Z

A 16x16 BF16 tile GEMM (C[16,16] += A[16,K]·B[K,16], K multiple of 32)
built purely from the SIMD polyfill: BF16->f32 decode + F32x16::mul_add.
The F32x16 wrapper owns the per-arch dispatch (AVX-512 VFMADD231PS where
available -> AVX2 pair -> NEON -> scalar), so the kernel rides AMX/AVX-512
hosts automatically. No hpc reference, no AMX intrinsic, no external BLAS.
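
For orientation, the computation the tile kernel performs can be sketched as a plain scalar loop. This is only an illustrative reference, not the PR's code: the names `bf16_to_f32` and `bf16_tile_gemm_16x16_ref` are hypothetical, BF16 values are assumed to be passed as raw `u16` bit patterns, and A and C are assumed row-major (the reviewed snippet indexes B as `kk * 16 + j`, so B is taken as row-major K x 16).

```rust
/// Decode one BF16 value (given as its raw 16 bits) to f32.
/// BF16 is the upper half of an IEEE-754 f32, so decoding is a 16-bit left shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Scalar sketch of C[16,16] += A[16,K] * B[K,16] with f32 accumulation.
/// Hypothetical signature; row-major layout for A and C is an assumption.
fn bf16_tile_gemm_16x16_ref(a: &[u16], b: &[u16], c: &mut [f32; 256], k: usize) {
    assert_eq!(k % 32, 0, "K must be a multiple of 32");
    assert_eq!(a.len(), 16 * k);
    assert_eq!(b.len(), k * 16);
    for i in 0..16 {
        for j in 0..16 {
            // Start from the existing C value so the call accumulates (+=).
            let mut acc = c[i * 16 + j];
            for kk in 0..k {
                acc += bf16_to_f32(a[i * k + kk]) * bf16_to_f32(b[kk * 16 + j]);
            }
            c[i * 16 + j] = acc;
        }
    }
}
```

A SIMD version would replace the innermost loop with 16-lane fused multiply-adds; the scalar form is mainly useful as a parity baseline.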

Lives in src/simd_ops.rs, re-exported via ndarray::simd. Parity test vs an
f32-accumulated scalar reference + a += accumulation test + doctest all
pass on AVX-512; clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GJ4NVBSjq1w5h7RmTbVafb

coderabbitai · 2026-06-20T16:56:59Z

Warning

Review limit reached

@AdaWorldAPI, we couldn't start this review because you've reached your PR review rate limit.

📥 Commits

Reviewing files that changed from the base of the PR and between 2d5c9bb and afb53c2.

📒 Files selected for processing (2)

src/simd.rs
src/simd_ops.rs

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: afb53c28c8

chatgpt-codex-connector · 2026-06-20T17:00:44Z

+            let mut col = vec![0.0f32; k];
+            for (kk, slot) in col.iter_mut().enumerate() {
+                *slot = b_f32[kk * 16 + j];
+            }


Reuse B column buffers outside the row loop

For each output element (i, j), this allocates and fills a k-element Vec, so one 16×16 tile does 256 heap allocations and repeats the same B-column gather once for every row. In hot tiled GEMM use, especially for small or moderate k, that allocator and memory traffic can dominate the SIMD FMA work; pretranspose/gather the 16 B columns once per call or at least once per j and reuse them across all 16 rows.
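
One way to act on this suggestion, sketched under assumptions: gather all 16 columns of B into a single contiguous buffer once per call, then slice into it from the row loop. The function names below are hypothetical, the inputs are taken to be already-decoded f32 slices (as `b_f32` is in the snippet), and the inner dot product is a plain iterator sum standing in for whatever SIMD kernel the PR actually uses.

```rust
/// Gather the 16 columns of a row-major K x 16 matrix into one buffer.
/// Column j occupies cols[j * k .. (j + 1) * k]. Hypothetical helper.
fn gather_b_columns(b_f32: &[f32], k: usize) -> Vec<f32> {
    assert_eq!(b_f32.len(), k * 16);
    let mut cols = vec![0.0f32; 16 * k];
    for kk in 0..k {
        for j in 0..16 {
            cols[j * k + kk] = b_f32[kk * 16 + j];
        }
    }
    cols
}

/// C[16,16] += A[16,K] * B[K,16] with one heap allocation per call
/// instead of one per output element. A is assumed row-major.
fn tile_gemm_hoisted(a_f32: &[f32], b_f32: &[f32], c: &mut [f32; 256], k: usize) {
    assert_eq!(a_f32.len(), 16 * k);
    // The gather happens once here, outside both the i and j loops.
    let cols = gather_b_columns(b_f32, k);
    for i in 0..16 {
        let a_row = &a_f32[i * k..(i + 1) * k];
        for j in 0..16 {
            let b_col = &cols[j * k..(j + 1) * k];
            // Plain dot product; a SIMD kernel would replace this line.
            let dot: f32 = a_row.iter().zip(b_col).map(|(x, y)| x * y).sum();
            c[i * 16 + j] += dot;
        }
    }
}
```

This drops the allocation count from 256 to 1 per tile; a caller that processes many tiles against the same B could go further and keep the gathered buffer across calls.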

refactor(hpc): bf16_tile_gemm fallback delegates to the polyfill (dedup of #222)

PR #222 added ndarray::simd::bf16_tile_gemm_16x16 by copying the F32x16 kernel out of hpc::bf16_tile_gemm::fallback_path, leaving the same kernel in two places. Collapse it: the polyfill fn is the single source of truth; the hpc AMX wrapper's fallback now calls crate::simd::bf16_tile_gemm_16x16, with the AMX TDPBF16PS tile path still layered on top. Drops the now-unused F32x16 / bf16_to_f32_batch import.

Both suites pass (hpc fallback + simd_ops parity); clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GJ4NVBSjq1w5h7RmTbVafb

chatgpt-codex-connector Bot reviewed Jun 20, 2026

AdaWorldAPI merged commit 7a8b793 into master Jun 21, 2026
18 checks passed

AdaWorldAPI mentioned this pull request Jun 21, 2026

refactor(hpc): bf16_tile_gemm fallback delegates to the polyfill (dedup of #222) #223

Merged

AdaWorldAPI added a commit that referenced this pull request Jun 21, 2026

Merge pull request #223 from AdaWorldAPI/claude/charming-johnson-ufstpw

f22a28b

refactor(hpc): bf16_tile_gemm fallback delegates to the polyfill (dedup of #222)

AdaWorldAPI merged 1 commit into master from claude/charming-johnson-ufstpw
