DFN3 WASM optimization: tract 0.21 -> 0.22.1 + WASM SIMD kernel kit by czoli1976 · Pull Request #1 · czoli1976/DeepFilterNet

czoli1976 · 2026-04-28T06:22:08Z

Summary

libDF-side integration for Vonage's DFN3 WASM kernel investigation.
Captures the dependency upgrade and [patch.crates-io] wiring needed
to land the WASM SIMD kernel kit being developed at
czoli1976/tract@add-wasm-f32-full-kernel-kit.

Internal documentation PR — not for upstream contribution to
Rikorose/DeepFilterNet at this time (the kernel work needs to land
upstream in sonos/tract first; that's tracked separately at
sonos/tract#2161).

End-to-end impact

| Layer | RTF | Per-frame |
|---|---|---|
| Mezon baseline (tract 0.21.4 npm, no `+simd128`) | 0.1290 | 1.290 ms |
| This branch (tract 0.22.1 + 6-kernel WASM SIMD kit + `+simd128`) | 0.0516 | 0.516 ms |

Net: -60% RTF, 2.5x faster on the 27.29s benchmark clip. Audio output
bit-identical to scalar reference.
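
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes a 10 ms frame, which is inferred from the table (1.290 ms / 0.1290) rather than stated in this PR; the constant and function names are illustrative only.

```rust
/// Frame duration in milliseconds.
/// Assumption: inferred from the table above, not stated in the PR.
const FRAME_MS: f64 = 10.0;

/// Mean processing time per frame implied by a real-time factor.
fn per_frame_ms(rtf: f64) -> f64 {
    rtf * FRAME_MS
}

/// Relative RTF reduction in percent (positive means faster).
fn rtf_reduction_pct(baseline: f64, optimized: f64) -> f64 {
    (baseline - optimized) / baseline * 100.0
}

/// Speedup factor of the optimized build over the baseline.
fn speedup(baseline: f64, optimized: f64) -> f64 {
    baseline / optimized
}
```

With the table's values, `rtf_reduction_pct(0.1290, 0.0516)` and `speedup(0.1290, 0.0516)` reproduce the quoted -60% and 2.5x up to floating-point rounding.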

Changes

| File | What | Why |
|---|---|---|
| `libDF/Cargo.toml` | tract `^0.21.4` → `=0.22.1` (4 crates) | Required for the newer tract-linalg architecture the kernel kit targets |
| `libDF/src/tract.rs` | 3 × `m.symbol_table.sym("S")` → `m.symbols.sym("S")` | API rename in tract 0.22.1 |
| `libDF/src/{tract,transforms,wasm,wav_utils}.rs` + `bin/enhance_wav.rs` | `use ndarray::` → `use tract_core::ndarray::` | tract 0.22.1 vendors ndarray; bare `ndarray` import no longer resolves |
| `Cargo.toml` (workspace) | `[patch.crates-io]` override pointing tract-linalg at local fork | Routes the WASM SIMD kernel kit into libDF builds |
| `Cargo.lock` | regenerated | Reflects the deps changes |

Discovery 1 — production builds need `RUSTFLAGS=+simd128`

Critical foundational find: production builds (Mezon's wasm-pack,
libDF's build_wasm_package.sh) never set
`RUSTFLAGS="-C target-feature=+simd128"`. As a result,
tract-linalg/src/wasm.rs (gated by
`#[cfg(all(target_family = "wasm", target_feature = "simd128"))]`)
was being excluded from the build entirely — every WASM build was
running the scalar generic_f32_4x4 fallback.

Fixing this alone (one line in build_wasm_package.sh) yielded a 16%
RTF reduction. The kernel kit work then builds on top of that.

This isn't a code change in this PR — it's a build-script change that
must be applied to the consumer (the Vonage NS library, downstream of
libDF). Documented here so the integration story is self-contained.
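
As a rough illustration of why the missing flag went unnoticed, the sketch below evaluates the same `cfg` predicate at compile time. The constant and the helper are hypothetical and not part of libDF or tract; only the predicate itself is taken from the gate quoted above.

```rust
/// True only when this crate is compiled for a wasm target with simd128 enabled,
/// i.e. the same predicate that gates tract-linalg/src/wasm.rs.
const SIMD128_COMPILED_IN: bool =
    cfg!(all(target_family = "wasm", target_feature = "simd128"));

/// Hypothetical helper: names the kernel family a build is able to reach.
/// Without the flag the gated module is never compiled, and nothing fails loudly.
fn reachable_kernel_path() -> &'static str {
    if SIMD128_COMPILED_IN {
        "wasm SIMD kernels"
    } else {
        "scalar generic_f32_4x4 fallback"
    }
}
```

Logging a value like this once at start-up in the consumer would make a build that lacks `RUSTFLAGS="-C target-feature=+simd128"` visible instead of silently slower.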

Caveat — `[patch.crates-io]` uses an absolute local path

The [patch.crates-io] override in workspace Cargo.toml currently
points at:

```toml
tract-linalg = { path = "/Users/CZoli/Desktop/tract-fork/tract-linalg-0.22.1" }
```

This is non-reproducible for anyone else. To make it portable, the
patch could be repointed at a git revision on
czoli1976/tract once the kernel
kit is stabilized:

```toml
tract-linalg = { git = "https://github.com/czoli1976/tract", branch = "add-wasm-f32-full-kernel-kit" }
```

Doing that now would mix concerns; deferred until kernel kit lands
upstream or is given a stable internal version.

Tests

`cargo test -p df --no-default-features --features tract,wav-utils`
passes on macOS native. WASM tests live in the patched tract-linalg
fork (1508/1508 pass on wasm32-wasip1 + wasmtime + simd128, see
czoli1976/tract#2).

… via patched fork

Captures the libDF-side changes that land Vonage's DFN3 WASM kernel investigation (RTF 0.1290 -> 0.0516, -60% / 2.5x faster, audio bit-identical).

Changes:

1. tract bump 0.21.4 -> 0.22.1 in libDF/Cargo.toml. Required for the newer tract-linalg architecture that the kernel kit targets.
2. ndarray re-import via tract_core. tract 0.22.1 vendors ndarray under tract_core::ndarray; the older bare `use ndarray::prelude::*;` no longer resolves cleanly under the bumped dep. Updated:
   - libDF/src/bin/enhance_wav.rs
   - libDF/src/tract.rs
   - libDF/src/transforms.rs
   - libDF/src/wasm.rs
   - libDF/src/wav_utils.rs
3. m.symbol_table.sym('S') -> m.symbols.sym('S') in libDF/src/tract.rs (3 sites: encoder + erb_decoder + df_decoder). API rename in tract 0.22.1.
4. Workspace [patch.crates-io] override pointing tract-linalg at a local fork that adds six WASM SIMD kernels (4x4 existing + 4x1, 8x1, 16x1, 8x4, 8x8 new) plus a per-M dispatcher in Ops::mmv_f32. Source for the kernels: czoli1976/tract@add-wasm-f32-full-kernel-kit.
5. Cargo.lock updated to reflect deps changes.

Production builds must set RUSTFLAGS="-C target-feature=+simd128" (Discovery #1 from the investigation: simd128 was never on in production builds, which kept tract-linalg/src/wasm.rs cfg-gated out and forced the scalar generic_f32_4x4 path, costing 16% RTF).

See WASM_SIMD_KERNEL_INVESTIGATION.md in the dfn3-wasm-opt-v2.1 worktree for the full investigation log + measurements.

+ describe full kernel kit

Two changes in the [patch.crates-io] override:

1. Source swap: absolute local path -> git source on czoli1976/tract. Makes the build reproducible for anyone with access to the fork; it was previously pinned to /Users/CZoli/Desktop/tract-fork/tract-linalg-0.22.1, which doesn't exist outside the original investigation environment.
2. Comment updated to describe the full WASM SIMD kernel kit (4x4 existing + 4x1, 8x1, 16x1, 8x4, 8x8 new + per-M dispatcher). The previous comment only mentioned the initial 4x1 kernel and was stale relative to what's actually being patched in.

Resulting patch:

    tract-linalg = { git = "https://github.com/czoli1976/tract", branch = "add-wasm-f32-full-kernel-kit" }

Cargo.lock will regenerate on next build (the registry source descriptor for tract-linalg changes from path-based to git-based).

…pstream-side after A/B)

Mirrors czoli1976/tract@9c45f8d, which dropped wasm_f32_8x4 after a controlled A/B confirmed it's structurally dead code for DFN3 (every MM op has N >= 8, so the strategizer always picks 8x8 over 8x4; mean A/B delta was -1.22%, within thermal noise).

Final kit shipped via [patch.crates-io]: wasm_f32_4x4 (existing) + wasm_f32_4x1 / 8x1 / 16x1 (new GEMV) + wasm_f32_8x8 (new MM) + per-M dispatcher in Ops::mmv_f32.

…of branch)

Pins the [patch.crates-io] override to a specific commit SHA rather than a branch, so libDF builds are reproducible while the upstream PR is under review. Any further commits we push to the branch (in response to review feedback) will not auto-propagate to libDF builds; we'll bump the rev deliberately when we want to absorb them.

Pinned commit (b82d1f0): full kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher), no 8x4, with module-level `#![allow(unsafe_op_in_unsafe_fn)]`.

Picks up the upstream review fix (inner unsafe { } blocks per kali's request, plus cargo fmt). Rev d925624 is the new HEAD of czoli1976:add-wasm-f32-full-kernel-kit; the previous b82d1f0 was force-pushed away (rejected lint-allow approach).

…tch.crates-io])

Replaces the broken `[patch.crates-io] tract-linalg = { git = ..., rev = "d925624" }` directive, which silently fell back to stock crates.io tract-linalg-0.22.1: the git ref was at version 0.23.0-pre, a version mismatch with libDF's =0.22.1 constraint, so cargo ignored the patch.

Replacement: vendor a self-contained single-crate copy of sonos/tract@v0.22.1's `linalg/` subcrate at `vendor/tract-linalg-0.22.1/`, with the kernel kit cherry-picks applied (sonos/tract#2164's 4 commits: 4x1 GEMV, kit extension, 8x4 drop, review-feedback inner-unsafe-blocks). The vendor crate is at version 0.22.1 (matches libDF's =0.22.1 requirement) and is single-crate: its tract-data dep comes from crates.io, avoiding the multi-workspace tract-data version conflict that caused the previous git+branch attempt to fail with 103 type-mismatch errors.

Validated:

- Cargo.lock now has ONE tract-linalg entry (path-resolved, 0.22.1)
- WASM size: 10,007,237 bytes raw (was 9,971,007 with the stock kernel-kit-free fallback; the +36 KB confirms kernel kit symbols are linked)
- Bench: RTF 0.0516, per-frame mean 0.5157 ms (matches prior kernel-kit best; was 0.108 with the broken patch falling back to stock 4x4-only)

Drop this vendor dir entirely once tract publishes a release including the merged kernel kit (sonos/tract#2164 was merged 2026-04-28; awaiting 0.22.x cherry-pick or 0.23.x cut).
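
The failure mode described in this commit can be pictured with a deliberately simplified model of an exact version requirement. This is not cargo's resolver and does not use any cargo or semver API; `satisfies_exact` is a hypothetical helper that only handles `=x.y.z` strings.

```rust
/// Hypothetical, simplified model of an exact (`=x.y.z`) version requirement.
/// Real resolution in cargo is more involved; this only shows why a patch
/// source at a different version cannot stand in for the pinned one.
fn satisfies_exact(requirement: &str, candidate: &str) -> bool {
    match requirement.strip_prefix('=') {
        Some(exact) => exact.trim() == candidate.trim(),
        // This sketch models only `=` requirements.
        None => false,
    }
}
```

Under this model `satisfies_exact("=0.22.1", "0.23.0-pre")` is false while `satisfies_exact("=0.22.1", "0.22.1")` is true, which matches the observation that the vendored 0.22.1 copy is picked up and the 0.23.0-pre git ref was not.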

Hot loop on the per-frame ERB feature path: dot-product over a band of Complex32 against itself (or a reference). The wasm32 build with `+simd128` was leaving this loop scalar: `wasm-objdump` shows zero v128 ops for the function body in the production build.

Replace the inner accumulator with a 4-wide f32x4 reduction using `core::arch::wasm32` intrinsics. Output is bit-exact identical (FNV-1a 20ea4579c427f925 unchanged across Chromium / WebKit / Firefox, single-threaded and 4-thread).

Same-machine focused bench, Chromium, 5-run alternated, 300 iter × 20 frames per measurement (t-test):

- vanilla_mono control: 3.755 -> 3.750 ms (no change, sanity)
- my_mt_1t: 3.748 -> 3.723 ms (-0.67%, t=2.22)
- my_mt_4t: 4.679 -> 4.646 ms (-0.71%, t=2.45)

Native builds use the existing scalar reduction via cfg gating; no behaviour change off wasm32.
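
A minimal sketch of the pattern this commit describes: a cfg-gated f32x4 reduction with a scalar fallback. `dot_f32` is a hypothetical stand-in over plain f32 slices, not libDF's actual function. Lane-wise accumulation reorders the additions, so this sketch makes no claim about bit-identity with a scalar loop on arbitrary inputs.

```rust
/// wasm32 + simd128 path: four partial sums held in one v128 accumulator.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;
    assert_eq!(a.len(), b.len(), "length mismatch");
    let blocks = a.len() / 4;
    let mut acc = f32x4_splat(0.0);
    for i in 0..blocks {
        // SAFETY: i * 4 + 3 < a.len() == b.len(), so both 16-byte loads stay
        // in bounds; v128_load does not require an aligned address.
        let (va, vb) = unsafe {
            (
                v128_load(a.as_ptr().add(i * 4) as *const v128),
                v128_load(b.as_ptr().add(i * 4) as *const v128),
            )
        };
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    // Horizontal sum of the four lanes, then the scalar tail.
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);
    for i in blocks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

/// Every other target: plain scalar reduction.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Keeping both variants behind the same signature means callers need no cfg of their own, which is the property the commit relies on for native builds.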

Adds f32x4 vectorization for three more hot DSP functions in the df_process_frame inference path, on top of the compute_band_corr work in this PR's first commit:

* band_mean_norm_erb (called from feat_erb per frame): per-bin IIR mean-norm. State is per-bin (no recurrence between bins) so straightforward 4-wide SIMD over all ERB bins.
* apply_band_gain (called from apply_mask post-network): Complex32 x f32 scalar mul-in-place per ERB band. Reinterprets &mut [Complex32] as &mut [f32] of length 2N (Complex32 is #[repr(C)] {re, im}, identical layout). 4-wide SIMD multiplies. Also redirects DFState::apply_mask to call apply_band_gain (the Complex32 specialisation) instead of the generic apply_interp_band_gain<T>, since the existing apply_band_gain function is already structurally identical.
* apply_window_in_place (called from frame_synthesis per frame): f32 mul-in-place. Signature changed from generic IntoIterator<Item=&'a f32> to &[f32] (the sole caller already passes &state.window which IS a slice). 4-wide SIMD multiplies.

Each function keeps the original scalar implementation as the non-wasm32 fallback via #[cfg(not(target_arch = "wasm32"))].

Bit-identical output verified: FNV-1a hash of df_process_frame output stream over 3000 random frames matches the Rikorose main baseline exactly across all 3 independent bench runs on Node v20.11.1 / V8.

Wasm size delta vs baseline: +835 bytes total (compute_band_corr +699; the 3 new helpers add net +136 bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
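
The slice reinterpretation mentioned for apply_band_gain can be sketched as follows. The local `Complex32` is a stand-in with the layout the commit describes (the real type comes from a dependency), and `as_f32_slice_mut` and `scale_band` are hypothetical helpers that apply one gain to one band in portable blocks of four rather than with wasm intrinsics.

```rust
/// Stand-in for the complex type: #[repr(C)] with two f32 fields, no padding.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex32 {
    re: f32,
    im: f32,
}

/// View N complex values as 2N floats laid out [re, im, re, im, ...].
fn as_f32_slice_mut(xs: &mut [Complex32]) -> &mut [f32] {
    // SAFETY: Complex32 is #[repr(C)] with exactly two f32 fields, so it has
    // size 8, alignment 4 and no padding; N values cover the same bytes as
    // 2N f32 values, and the exclusive borrow of `xs` is carried over.
    unsafe { std::slice::from_raw_parts_mut(xs.as_mut_ptr() as *mut f32, xs.len() * 2) }
}

/// Multiply every component in the band by `gain`, four floats per step.
fn scale_band(xs: &mut [Complex32], gain: f32) {
    let flat = as_f32_slice_mut(xs);
    let mut blocks = flat.chunks_exact_mut(4);
    for block in &mut blocks {
        for v in block.iter_mut() {
            *v *= gain;
        }
    }
    // Scalar tail for the 0..=3 floats left over.
    for v in blocks.into_remainder() {
        *v *= gain;
    }
}
```

Because the gain is real, scaling `re` and `im` independently is exactly a complex-by-scalar multiply, so the flat view needs no knowledge of which lane holds which component.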

Adds f32x4 SIMD for the two band-unit-norm functions in feat_cplx / feat_cplx_t (called per frame inside df_process_frame). The trick is de-interleaving &mut [Complex32]'s [re,im,re,im,...] layout so we can compute the per-bin norm (sqrt(re^2 + im^2)) lane-wise.

Strategy: load 4 Complex32 (8 f32) as 2 v128s, use i32x4_shuffle to build pure-real and pure-imag vectors, compute norm in 4-wide SIMD, update state, then divide xs by sqrt(state).

* band_unit_norm (xs: &mut [Complex32]): re-interleaves the per-bin sqrt(state) divisor via two i32x4_shuffles to match the [re,im,re,im] xs layout, then divides 4 Complex32 (8 f32) at a time.
* band_unit_norm_t (xs: &[Complex32], out: &mut [f32]): same norm computation but writes to o_re / o_im split halves of out (CONTIGUOUS), so no re-interleave step is needed for the divide.

Used (re*re + im*im).sqrt() instead of Complex32::norm()'s libm hypot. For DFN3's audio-spectrum magnitudes (no overflow/underflow regime), both produce identical bits, verified by FNV-1a hash of df_process_frame output stream over N=3000 deterministic random frames matching baseline exactly across 5 independent runs on Node v20.11.1 / V8.

Wasm size delta: +678 bytes vs the 4-function bundle commit. Total over no-SIMD baseline: +1513 bytes for all 6 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
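
The de-interleave and re-interleave steps are easier to follow in a portable model. The sketch below emulates a two-input, four-lane shuffle on plain arrays instead of calling the wasm intrinsic; the lane index patterns are an assumption about how such a shuffle would be used here, not a copy of the commit's code, and all function names are hypothetical.

```rust
/// Portable model of a two-input 4-lane shuffle:
/// indices 0..=3 select from `a`, indices 4..=7 select from `b`.
fn shuffle4(a: [f32; 4], b: [f32; 4], idx: [usize; 4]) -> [f32; 4] {
    let both = [a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]];
    [both[idx[0]], both[idx[1]], both[idx[2]], both[idx[3]]]
}

/// Norms of four complex bins stored as [re0, im0, re1, im1, re2, im2, re3, im3].
fn norms4(interleaved: [f32; 8]) -> [f32; 4] {
    let lo = [interleaved[0], interleaved[1], interleaved[2], interleaved[3]];
    let hi = [interleaved[4], interleaved[5], interleaved[6], interleaved[7]];
    // Even lanes hold the real parts, odd lanes the imaginary parts.
    let re = shuffle4(lo, hi, [0, 2, 4, 6]);
    let im = shuffle4(lo, hi, [1, 3, 5, 7]);
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = (re[i] * re[i] + im[i] * im[i]).sqrt();
    }
    out
}

/// Spread a per-bin divisor [d0, d1, d2, d3] back over the interleaved layout:
/// [d0, d0, d1, d1] and [d2, d2, d3, d3].
fn duplicate_lanes(d: [f32; 4]) -> ([f32; 4], [f32; 4]) {
    (shuffle4(d, d, [0, 0, 1, 1]), shuffle4(d, d, [2, 2, 3, 3]))
}
```

The second helper corresponds to the extra step band_unit_norm needs and band_unit_norm_t avoids: when the output is already split into real and imaginary halves, the per-bin divisor lines up with the data without duplication.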

Three more loops in frame_synthesis emit scalar code on wasm32 despite +simd128 (unlike the frame_analysis windowing loops which LLVM auto-vec'd; something about the nested zip().zip() iterator pattern in frame_synthesis vs the izip!() pattern in frame_analysis defeats auto-vectorization).

Three changes:

* out[i] = x_first[i] + synthesis_mem[i] (overlap-add to output): new f32_add_to(a, b, out) helper, three-slice element-wise add via 4-wide v128 + f32x4_add.
* s_first[i] += xs_first[i] (overlap-add for next frame, in-place): new f32_add_inplace(xs, ys) helper, two-slice element-wise in-place add.
* s_second[i] = xs_second[i] (override left-shifted buffer): replaced the explicit loop with copy_from_slice; the compiler likely emitted memcpy already, but the stdlib idiom is clearer and lets the optimiser pick the best implementation.

Bit-identical output verified: FNV-1a hash 53ae8dfc3595faf0 unchanged across N=3000 deterministic frames over 6 independent bench runs.

Speed: median bundle_synth vs the previous 6-function bundle is -1.2% RTF; mean over 6 iters is -3.1%. Several runs showed -5% to -11% additional gain (those runs had background CPU activity that hit the previous bundle harder). Real direction, modest absolute gain, no quality cost.

Wasm size delta: -24 bytes vs previous bundle (copy_from_slice emits less code than the explicit loop). Net total: +1489 bytes over the no-SIMD baseline for all 8 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
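
For orientation, here is a portable sketch of the two helpers named in this commit. The names and argument order follow the commit message; the bodies are an assumption that processes blocks of four floats with a scalar tail and uses no wasm intrinsics, so they show the shape of the loops rather than the shipped code.

```rust
/// out[i] = a[i] + b[i], in blocks of four with a scalar tail.
fn f32_add_to(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len(), "length mismatch");
    let split = out.len() - out.len() % 4;
    let blocks = out[..split]
        .chunks_exact_mut(4)
        .zip(a[..split].chunks_exact(4))
        .zip(b[..split].chunks_exact(4));
    for ((o, x), y) in blocks {
        for lane in 0..4 {
            o[lane] = x[lane] + y[lane];
        }
    }
    for i in split..out.len() {
        out[i] = a[i] + b[i];
    }
}

/// xs[i] += ys[i], in blocks of four with a scalar tail.
fn f32_add_inplace(xs: &mut [f32], ys: &[f32]) {
    assert_eq!(xs.len(), ys.len(), "length mismatch");
    let split = xs.len() - xs.len() % 4;
    for (x, y) in xs[..split].chunks_exact_mut(4).zip(ys[..split].chunks_exact(4)) {
        for lane in 0..4 {
            x[lane] += y[lane];
        }
    }
    for i in split..xs.len() {
        xs[i] += ys[i];
    }
}
```

Element-wise addition involves no reordering across elements, so a blocked version like this yields the same bits as the scalar loop, which is consistent with the unchanged hash reported above. The third change needs no helper: `dst.copy_from_slice(src)` is the standard-library call for equal-length slices.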

czoli1976 added 2 commits April 28, 2026 07:20

czoli1976 mentioned this pull request Apr 28, 2026

linalg/wasm: WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher) czoli1976/tract#2

Open

czoli1976 and others added 8 commits April 28, 2026 09:15

czoli1976 wants to merge 10 commits into main from dfn3-wasm-opt-tract-022-kernel-kit
