DFN3 WASM optimization: tract 0.21 -> 0.22.1 + WASM SIMD kernel kit #1

Open
czoli1976 wants to merge 10 commits into main from dfn3-wasm-opt-tract-022-kernel-kit

Summary

libDF-side integration for Vonage's DFN3 WASM kernel investigation.
Captures the dependency upgrade and [patch.crates-io] wiring needed
to land the WASM SIMD kernel kit being developed at
czoli1976/tract@add-wasm-f32-full-kernel-kit.

Internal documentation PR — not for upstream contribution to
Rikorose/DeepFilterNet at this time (the kernel work needs to land
upstream in sonos/tract first; that's tracked separately at
sonos/tract#2161).

End-to-end impact

| Layer | RTF | per-frame |
|---|---|---|
| Mezon baseline (tract 0.21.4 npm, no +simd128) | 0.1290 | 1.290 ms |
| This branch (tract 0.22.1 + 6-kernel WASM SIMD kit + +simd128) | 0.0516 | 0.516 ms |

Net: -60% RTF, 2.5x faster on the 27.29s benchmark clip. Audio output
bit-identical to scalar reference.

Changes

| File | What | Why |
|---|---|---|
| libDF/Cargo.toml | tract `^0.21.4` -> `=0.22.1` (4 crates) | Required for the newer tract-linalg architecture the kernel kit targets |
| libDF/src/tract.rs | 3 × `m.symbol_table.sym("S")` -> `m.symbols.sym("S")` | API rename in tract 0.22.1 |
| libDF/src/{tract,transforms,wasm,wav_utils}.rs + bin/enhance_wav.rs | `use ndarray::*` -> `use tract_core::ndarray::*` | tract 0.22.1 vendors ndarray; bare ndarray import no longer resolves |
| Cargo.toml (workspace) | `[patch.crates-io]` override pointing tract-linalg at local fork | Routes the WASM SIMD kernel kit into libDF builds |
| Cargo.lock | regenerated | Reflects the dependency changes |

Discovery 1 — production builds need RUSTFLAGS=+simd128

Critical foundational find: production builds (Mezon's wasm-pack,
libDF's build_wasm_package.sh) never set
`RUSTFLAGS="-C target-feature=+simd128"`. As a result,
tract-linalg/src/wasm.rs (gated by
`#[cfg(all(target_family = "wasm", target_feature = "simd128"))]`)
was being excluded from the build entirely — every WASM build was
running the scalar generic_f32_4x4 fallback.

Fixing this alone (one line in build_wasm_package.sh) yielded a 16%
RTF reduction. The kernel kit work then builds on top of that.

This isn't a code change in this PR — it's a build-script change that
must be applied to the consumer (the Vonage NS library, downstream of
libDF). Documented here so the integration story is self-contained.
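
A minimal sketch of why the missing flag matters, using the same cfg predicate quoted above. The function name and the returned labels are hypothetical, chosen only for illustration; the point is that without `+simd128` the first definition is never compiled, so only the fallback exists in the binary.

```rust
/// Reports which path this build was compiled with.
/// The predicate mirrors the gate on tract-linalg/src/wasm.rs quoted above;
/// the function name and labels are illustrative, not tract API.
#[cfg(all(target_family = "wasm", target_feature = "simd128"))]
pub fn selected_kernel_path() -> &'static str {
    "wasm-simd128"
}

/// Compiled whenever the target is not wasm or simd128 is not enabled,
/// e.g. a wasm build without RUSTFLAGS="-C target-feature=+simd128".
#[cfg(not(all(target_family = "wasm", target_feature = "simd128")))]
pub fn selected_kernel_path() -> &'static str {
    "scalar-fallback"
}
```

Because the selection happens at compile time, no runtime check can recover the SIMD path; the flag has to be present when the consumer builds.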

Caveat — [patch.crates-io] uses an absolute local path

The [patch.crates-io] override in workspace Cargo.toml currently
points at:

```toml
tract-linalg = { path = "/Users/CZoli/Desktop/tract-fork/tract-linalg-0.22.1" }
```

This is non-reproducible for anyone else. To make it portable, the
patch could be repointed at a git revision on
czoli1976/tract once the kernel
kit is stabilized:

```toml
tract-linalg = { git = "https://github.com/czoli1976/tract", branch = "add-wasm-f32-full-kernel-kit" }
```

Doing that now would mix concerns; deferred until kernel kit lands
upstream or is given a stable internal version.

Tests

cargo test -p df --no-default-features --features tract,wav-utils
passes on macOS native. WASM tests live in the patched tract-linalg
fork (1508/1508 pass on wasm32-wasip1 + wasmtime + simd128, see
czoli1976/tract#2).

See also

… via patched fork

Captures the libDF-side changes that land Vonage's DFN3 WASM kernel
investigation (RTF 0.1290 -> 0.0516, -60% / 2.5x faster, audio
bit-identical).

Changes:

1. tract bump 0.21.4 -> 0.22.1 in libDF/Cargo.toml. Required for the
   newer tract-linalg architecture that the kernel kit targets.

2. ndarray re-import via tract_core. tract 0.22.1 vendors ndarray
   under tract_core::ndarray; the older bare `use ndarray::prelude::*;`
   no longer resolves cleanly under the bumped dep. Updated:
     - libDF/src/bin/enhance_wav.rs
     - libDF/src/tract.rs
     - libDF/src/transforms.rs
     - libDF/src/wasm.rs
     - libDF/src/wav_utils.rs

3. m.symbol_table.sym("S") -> m.symbols.sym("S") in libDF/src/tract.rs
   (3 sites: encoder + erb_decoder + df_decoder). API rename in
   tract 0.22.1.

4. Workspace [patch.crates-io] override pointing tract-linalg at a
   local fork that adds six WASM SIMD kernels (4x4 existing + 4x1,
   8x1, 16x1, 8x4, 8x8 new) plus a per-M dispatcher in Ops::mmv_f32.
   Source for the kernels: czoli1976/tract@add-wasm-f32-full-kernel-kit.

5. Cargo.lock updated to reflect deps changes.

Production builds must set RUSTFLAGS="-C target-feature=+simd128"
(Discovery #1 from the investigation: simd128 was never on in
production builds, which kept tract-linalg/src/wasm.rs cfg-gated
out and forced the scalar generic_f32_4x4 path, costing 16% RTF).

See WASM_SIMD_KERNEL_INVESTIGATION.md in the dfn3-wasm-opt-v2.1
worktree for the full investigation log + measurements.
                       + describe full kernel kit

Two changes in the [patch.crates-io] override:

1. Source swap: absolute local path -> git source on czoli1976/tract.
   Makes the build reproducible for anyone with access to the fork;
   was previously pinned to /Users/CZoli/Desktop/tract-fork/tract-linalg-0.22.1
   which doesn't exist outside the original investigation environment.

2. Comment updated to describe the full WASM SIMD kernel kit (4x4
   existing + 4x1, 8x1, 16x1, 8x4, 8x8 new + per-M dispatcher). The
   previous comment only mentioned the initial 4x1 kernel and was
   stale relative to what's actually being patched in.

Resulting patch:
  tract-linalg = { git = "https://github.com/czoli1976/tract",
                   branch = "add-wasm-f32-full-kernel-kit" }

Cargo.lock will regenerate on next build (the registry source
descriptor for tract-linalg changes from path-based to git-based).
czoli1976 and others added 8 commits April 28, 2026 09:15
…pstream-side after A/B)

Mirrors czoli1976/tract@9c45f8d which dropped wasm_f32_8x4 after a
controlled A/B confirmed it's structurally dead code for DFN3 (every
MM op has N >= 8, strategizer always picks 8x8 over 8x4; mean A/B
delta was -1.22% within thermal noise).

Final kit shipped via [patch.crates-io]:
  wasm_f32_4x4 (existing) + wasm_f32_4x1 / 8x1 / 16x1 (new GEMV)
  + wasm_f32_8x8 (new MM) + per-M dispatcher in Ops::mmv_f32.
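
For readers unfamiliar with the idea, a per-M dispatcher for matrix-vector products could look roughly like the sketch below. The enum, the function name, and the thresholds are all assumptions for illustration; the real selection logic lives in the fork's Ops::mmv_f32 and may differ.

```rust
/// Hypothetical labels for the three GEMV kernels named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GemvKernel {
    F32x4x1,
    F32x8x1,
    F32x16x1,
}

/// Picks the tallest tile that the output dimension M can fill.
/// The thresholds (8 and 16) are assumed, not taken from the fork.
pub fn pick_gemv_kernel(m: usize) -> GemvKernel {
    if m >= 16 {
        GemvKernel::F32x16x1
    } else if m >= 8 {
        GemvKernel::F32x8x1
    } else {
        GemvKernel::F32x4x1
    }
}
```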
…of branch)

Pins the [patch.crates-io] override to a specific commit SHA rather than
a branch, so libDF builds are reproducible while the upstream PR is
under review. Any further commits we push to the branch (in response to
review feedback) will not auto-propagate to libDF builds — we'll bump
the rev deliberately when we want to absorb them.

Pinned commit (d925624): full kernel kit (4x1, 8x1, 16x1, 8x8 + per-M
dispatcher), no 8x4. Picks up the upstream review fix (inner unsafe { }
blocks per kali's request, plus cargo fmt). Rev d925624 is the new HEAD of
czoli1976:add-wasm-f32-full-kernel-kit; the previous b82d1f0, which used a
module-level #![allow(unsafe_op_in_unsafe_fn)], was force-pushed away
(rejected lint-allow approach).
…tch.crates-io])

Replaces the broken `[patch.crates-io] tract-linalg = { git = ..., rev = "d925624" }`
directive (which silently fell back to stock crates.io tract-linalg-0.22.1
because the git ref was at version 0.23.0-pre — version mismatch with
libDF's =0.22.1 constraint, so cargo ignored the patch).

Replacement: vendor a self-contained single-crate copy of
sonos/tract@v0.22.1's `linalg/` subcrate at `vendor/tract-linalg-0.22.1/`,
with the kernel kit cherry-picks applied (sonos/tract#2164's 4 commits:
4x1 GEMV, kit extension, 8x4 drop, review-feedback inner-unsafe-blocks).

The vendor crate is at version 0.22.1 (matches libDF's =0.22.1 requirement)
and is single-crate (its tract-data dep comes from crates.io, avoiding
the multi-workspace tract-data version conflict that caused the previous
git+branch attempt to fail with 103 type-mismatch errors).
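
The version mismatch can be illustrated with a deliberately simplified model of an exact (`=`) requirement check. This is not cargo's resolver; `parse_version` and `satisfies_exact` are hypothetical helpers that only handle the `MAJOR.MINOR.PATCH[-PRE]` shape seen in this PR.

```rust
/// Parses "MAJOR.MINOR.PATCH[-PRE]" into comparable parts.
/// Returns None for anything that does not have exactly three numeric fields.
fn parse_version(v: &str) -> Option<(u64, u64, u64, Option<&str>)> {
    let (core, pre) = match v.split_once('-') {
        Some((c, p)) => (c, Some(p)),
        None => (v, None),
    };
    let mut it = core.split('.');
    let major: u64 = it.next()?.parse().ok()?;
    let minor: u64 = it.next()?.parse().ok()?;
    let patch: u64 = it.next()?.parse().ok()?;
    if it.next().is_some() {
        return None;
    }
    Some((major, minor, patch, pre))
}

/// True if `candidate` satisfies an exact requirement such as "=0.22.1".
/// Simplified: every field, including the pre-release tag, must match.
pub fn satisfies_exact(req: &str, candidate: &str) -> bool {
    let Some(wanted) = req.strip_prefix('=') else {
        return false;
    };
    match (parse_version(wanted.trim()), parse_version(candidate)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}
```

Under this model the vendored 0.22.1 crate satisfies libDF's `=0.22.1` requirement, while the fork's 0.23.0-pre does not, which matches the fallback described above.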

Validated:
- Cargo.lock now has ONE tract-linalg entry (path-resolved, 0.22.1)
- WASM size: 10,007,237 bytes raw (was 9,971,007 with stock kernel-kit-free
  fallback — +36 KB confirms kernel kit symbols are linked)
- Bench: RTF 0.0516, per-frame mean 0.5157 ms (matches prior kernel-kit
  best; was 0.108 with broken patch falling back to stock 4x4-only)

Drop this vendor dir entirely once tract publishes a release including
the merged kernel kit (sonos/tract#2164 was merged 2026-04-28; awaiting
0.22.x cherry-pick or 0.23.x cut).
Hot loop on the per-frame ERB feature path: dot-product over a band
of Complex32 against itself (or a reference). The wasm32 build with
`+simd128` was leaving this loop scalar — `wasm-objdump` shows zero
v128 ops for the function body in the production build.

Replace the inner accumulator with a 4-wide f32x4 reduction using
`core::arch::wasm32` intrinsics. Output is bit-identical
(FNV-1a 20ea4579c427f925 unchanged across Chromium / WebKit /
Firefox, single-threaded and 4-thread).

Same-machine focused bench, Chromium, 5-run alternated, 300 iter
× 20 frames per measurement (t-test):
  vanilla_mono control: 3.755 -> 3.750 ms (no change, sanity)
  my_mt_1t:             3.748 -> 3.723 ms (-0.67%, t=2.22)
  my_mt_4t:             4.679 -> 4.646 ms (-0.71%, t=2.45)

Native builds use the existing scalar reduction via cfg gating;
no behaviour change off wasm32.
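
A rough sketch of the pattern, not the actual compute_band_corr code: it sums squares over a plain f32 buffer (standing in for interleaved re/im values) with a 4-lane accumulator on wasm32 and a scalar loop elsewhere. The function name `band_power` is hypothetical. Lane-wise accumulation reorders the additions, so bit-identical output is something to verify per workload (as the hash check above does) rather than assume.

```rust
/// Sum of squares over `xs`, 4 lanes at a time when simd128 is available.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub fn band_power(xs: &[f32]) -> f32 {
    use core::arch::wasm32::*;
    let mut acc = f32x4_splat(0.0);
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        // SAFETY: `c` holds exactly 4 f32 (16 bytes); v128_load accepts
        // unaligned addresses.
        let v = unsafe { v128_load(c.as_ptr() as *const v128) };
        acc = f32x4_add(acc, f32x4_mul(v, v));
    }
    // Horizontal reduction of the four partial sums.
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);
    // Scalar tail for lengths that are not a multiple of 4.
    for &x in chunks.remainder() {
        sum += x * x;
    }
    sum
}

/// Scalar fallback used on every other target.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
pub fn band_power(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x * x).sum()
}
```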
Adds f32x4 vectorization for three more hot DSP functions in the
df_process_frame inference path, on top of the compute_band_corr
work in this PR's first commit:

  * band_mean_norm_erb (called from feat_erb per frame): per-bin
    IIR mean-norm. State is per-bin (no recurrence between bins) so
    straightforward 4-wide SIMD over all ERB bins.

  * apply_band_gain (called from apply_mask post-network): Complex32
    x f32 scalar mul-in-place per ERB band. Reinterprets
    &mut [Complex32] as &mut [f32] of length 2N (Complex32 is
    #[repr(C)] {re, im}, identical layout). 4-wide SIMD multiplies.
    Also redirects DFState::apply_mask to call apply_band_gain (the
    Complex32 specialisation) instead of the generic
    apply_interp_band_gain<T>, since the existing apply_band_gain
    function is already structurally identical.

  * apply_window_in_place (called from frame_synthesis per frame):
    f32 mul-in-place. Signature changed from generic
    IntoIterator<Item=&'a f32> to &[f32] (the sole caller already
    passes &state.window which IS a slice). 4-wide SIMD multiplies.

Each function keeps the original scalar implementation as the
non-wasm32 fallback via #[cfg(not(target_arch = "wasm32"))].
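
The slice reinterpretation used by apply_band_gain can be sketched as follows. `Complex32` here is a local stand-in with the layout the commit describes (the real type comes from the num-complex crate), and `as_f32_mut` / `scale_band` are hypothetical names; the multiply is shown scalar for brevity.

```rust
/// Local stand-in with the #[repr(C)] {re, im} layout described above.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Complex32 {
    pub re: f32,
    pub im: f32,
}

/// Views N complex values as 2N interleaved f32 values [re, im, re, im, ...].
fn as_f32_mut(xs: &mut [Complex32]) -> &mut [f32] {
    // SAFETY: Complex32 is #[repr(C)] with two f32 fields and no padding,
    // so N elements occupy exactly 2N contiguous, 4-byte-aligned f32 values.
    // The returned slice borrows `xs` mutably for its whole lifetime.
    unsafe { core::slice::from_raw_parts_mut(xs.as_mut_ptr() as *mut f32, xs.len() * 2) }
}

/// Multiplies every complex value in the band by a real gain, in place.
pub fn scale_band(xs: &mut [Complex32], gain: f32) {
    for v in as_f32_mut(xs) {
        *v *= gain;
    }
}
```

Once the data is a flat f32 slice, a real gain touches re and im identically, which is what makes a plain 4-wide multiply applicable.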

Bit-identical output verified: FNV-1a hash of df_process_frame
output stream over 3000 random frames matches the Rikorose main
baseline exactly across all 3 independent bench runs on
Node v20.11.1 / V8.
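
A sketch of how such a hash check can be built. The 64-bit FNV-1a variant is assumed from the 16-hex-digit hashes quoted in this PR, and hashing each sample's little-endian bit pattern is likewise an assumption about the bench harness; the function names are hypothetical.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3; // FNV-1a 64-bit prime

/// Folds `bytes` into a running FNV-1a state: XOR the byte, then multiply.
pub fn fnv1a64_update(mut h: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Hashes the exact bit pattern of every sample, so any single-bit
/// difference in the output stream changes the result.
pub fn hash_samples(samples: &[f32]) -> u64 {
    samples.iter().fold(FNV_OFFSET, |h, s| {
        fnv1a64_update(h, &s.to_bits().to_le_bytes())
    })
}
```

Hashing bit patterns rather than comparing with a tolerance is what lets the check distinguish values such as 0.0 and -0.0.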

Wasm size delta vs baseline: +835 bytes total (compute_band_corr
+699; the 3 new helpers add net +136 bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds f32x4 SIMD for the two band-unit-norm functions in feat_cplx /
feat_cplx_t (called per frame inside df_process_frame).

The trick is de-interleaving &mut [Complex32]'s [re,im,re,im,...]
layout so we can compute the per-bin norm (sqrt(re^2 + im^2))
lane-wise. Strategy: load 4 Complex32 (8 f32) as 2 v128s, use
i32x4_shuffle to build pure-real and pure-imag vectors, compute
norm in 4-wide SIMD, update state, then divide xs by sqrt(state).

  * band_unit_norm (xs: &mut [Complex32]) — re-interleaves the
    per-bin sqrt(state) divisor via two i32x4_shuffles to match
    the [re,im,re,im] xs layout, then divides 4 Complex32 (8 f32)
    at a time.

  * band_unit_norm_t (xs: &[Complex32], out: &mut [f32]) — same
    norm computation but writes to o_re / o_im split halves of
    out (CONTIGUOUS), so no re-interleave step is needed for the
    divide.
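
The de-interleave step can be sketched in isolation as below. `norms4` is a hypothetical helper that takes 4 complex values as 8 interleaved f32 and returns their magnitudes; it omits the state update and the divide, and includes a scalar path so the same function exists off wasm32.

```rust
/// Magnitudes of 4 complex values stored as [re0, im0, re1, im1, ...].
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub fn norms4(xs: &[f32; 8]) -> [f32; 4] {
    use core::arch::wasm32::*;
    // SAFETY: `xs` is 32 bytes, so both 16-byte loads stay in bounds;
    // v128_load has no alignment requirement.
    let (a, b) = unsafe {
        let p = xs.as_ptr() as *const v128;
        (v128_load(p), v128_load(p.add(1)))
    };
    // Lane indices 0..=3 select from `a`, 4..=7 from `b`.
    let re = i32x4_shuffle::<0, 2, 4, 6>(a, b); // [re0, re1, re2, re3]
    let im = i32x4_shuffle::<1, 3, 5, 7>(a, b); // [im0, im1, im2, im3]
    let n = f32x4_sqrt(f32x4_add(f32x4_mul(re, re), f32x4_mul(im, im)));
    let mut out = [0.0f32; 4];
    // SAFETY: `out` is 16 bytes; v128_store has no alignment requirement.
    unsafe { v128_store(out.as_mut_ptr() as *mut v128, n) };
    out
}

/// Scalar fallback computing the same sqrt(re^2 + im^2) per value.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
pub fn norms4(xs: &[f32; 8]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for (o, c) in out.iter_mut().zip(xs.chunks_exact(2)) {
        *o = (c[0] * c[0] + c[1] * c[1]).sqrt();
    }
    out
}
```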

Used (re*re + im*im).sqrt() instead of Complex32::norm()'s libm
hypot. For DFN3's audio-spectrum magnitudes (no overflow/underflow
regime), both produce identical bits — verified by FNV-1a hash of
df_process_frame output stream over N=3000 deterministic random
frames matching baseline exactly across 5 independent runs on
Node v20.11.1 / V8.

Wasm size delta: +678 bytes vs the 4-function bundle commit.
Total over no-SIMD baseline: +1513 bytes for all 6 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three more loops in frame_synthesis emit scalar code on wasm32
despite +simd128 (unlike the frame_analysis windowing loops which
LLVM auto-vec'd; something about the nested zip().zip() iterator
pattern in frame_synthesis vs the izip!() pattern in frame_analysis
defeats auto-vectorization).

Three changes:

  * out[i] = x_first[i] + synthesis_mem[i] (overlap-add to output)
    — new f32_add_to(a, b, out) helper, three-slice element-wise
    add via 4-wide v128 + f32x4_add.

  * s_first[i] += xs_first[i] (overlap-add for next frame, in-place)
    — new f32_add_inplace(xs, ys) helper, two-slice element-wise
    in-place add.

  * s_second[i] = xs_second[i] (override left-shifted buffer)
    — replaced the explicit loop with copy_from_slice; the compiler
    likely emitted memcpy already, but the stdlib idiom is clearer
    and lets the optimiser pick the best implementation.
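
The intended semantics of the two helpers can be written as scalar reference versions, shown below as a sketch. The signatures follow the names given above, but the bodies and the length checks are assumptions; the wasm32 versions in this commit would process 4 lanes per step instead.

```rust
/// out[i] = a[i] + b[i] for three equally sized slices.
pub fn f32_add_to(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// xs[i] += ys[i] for two equally sized slices.
pub fn f32_add_inplace(xs: &mut [f32], ys: &[f32]) {
    assert_eq!(xs.len(), ys.len());
    for (x, &y) in xs.iter_mut().zip(ys) {
        *x += y;
    }
}
```

The third change needs no helper: `dst.copy_from_slice(src)` from the standard library panics on a length mismatch and otherwise copies the whole slice in one call.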

Bit-identical output verified: FNV-1a hash 53ae8dfc3595faf0
unchanged across N=3000 deterministic frames over 6 independent
bench runs.

Speed: median bundle_synth vs the previous 6-function bundle is
-1.2% RTF; mean over 6 iters is -3.1%. Several runs showed -5% to
-11% additional gain (those runs had background CPU activity that
hit the previous bundle harder). Real direction, modest absolute
gain, no quality cost.

Wasm size delta: -24 bytes vs previous bundle (copy_from_slice
emits less code than the explicit loop). Net total: +1489 bytes
over the no-SIMD baseline for all 8 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>