DFN3 WASM optimization: tract 0.21 -> 0.22.1 + WASM SIMD kernel kit#1
… via patched fork
Captures the libDF-side changes that land Vonage's DFN3 WASM kernel
investigation (RTF 0.1290 -> 0.0516, -60% / 2.5x faster, audio
bit-identical).
Changes:
1. tract bump 0.21.4 -> 0.22.1 in libDF/Cargo.toml. Required for the
newer tract-linalg architecture that the kernel kit targets.
2. ndarray re-import via tract_core. tract 0.22.1 re-exports ndarray
as tract_core::ndarray; the older bare `use ndarray::prelude::*;`
no longer resolves cleanly under the bumped dep. Updated:
- libDF/src/bin/enhance_wav.rs
- libDF/src/tract.rs
- libDF/src/transforms.rs
- libDF/src/wasm.rs
- libDF/src/wav_utils.rs
3. m.symbol_table.sym("S") -> m.symbols.sym("S") in libDF/src/tract.rs
(3 sites: encoder + erb_decoder + df_decoder). API rename in
tract 0.22.1.
4. Workspace [patch.crates-io] override pointing tract-linalg at a
local fork that provides six WASM SIMD kernels (the existing 4x4 plus
new 4x1, 8x1, 16x1, 8x4, and 8x8) and a per-M dispatcher in Ops::mmv_f32.
Source for the kernels: czoli1976/tract@add-wasm-f32-full-kernel-kit.
5. Cargo.lock updated to reflect the dependency changes.
Production builds must set RUSTFLAGS="-C target-feature=+simd128"
(Discovery #1 from the investigation: simd128 was never on in
production builds, which kept tract-linalg/src/wasm.rs cfg-gated
out and forced the scalar generic_f32_4x4 path, costing 16% RTF).
See WASM_SIMD_KERNEL_INVESTIGATION.md in the dfn3-wasm-opt-v2.1
worktree for the full investigation log + measurements.
+ describe full kernel kit
Two changes in the [patch.crates-io] override:
1. Source swap: absolute local path -> git source on czoli1976/tract.
Makes the build reproducible for anyone with access to the fork;
was previously pinned to /Users/CZoli/Desktop/tract-fork/tract-linalg-0.22.1
which doesn't exist outside the original investigation environment.
2. Comment updated to describe the full WASM SIMD kernel kit (4x4
existing + 4x1, 8x1, 16x1, 8x4, 8x8 new + per-M dispatcher). The
previous comment only mentioned the initial 4x1 kernel and was
stale relative to what's actually being patched in.
Resulting patch:
tract-linalg = { git = "https://github.com/czoli1976/tract",
branch = "add-wasm-f32-full-kernel-kit" }
Cargo.lock will regenerate on next build (the registry source
descriptor for tract-linalg changes from path-based to git-based).
…pstream-side after A/B)
Mirrors czoli1976/tract@9c45f8d, which dropped wasm_f32_8x4 after a
controlled A/B test confirmed it is structurally dead code for DFN3:
every MM op has N >= 8, so the strategizer always picks 8x8 over 8x4,
and the mean A/B delta of -1.22% was within thermal noise.
Final kit shipped via [patch.crates-io]:
- wasm_f32_4x4 (existing)
- wasm_f32_4x1 / 8x1 / 16x1 (new GEMV)
- wasm_f32_8x8 (new MM)
- per-M dispatcher in Ops::mmv_f32
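To illustrate what a per-M dispatcher over the GEMV kernels could look like, here is a minimal sketch. The enum, the function names, and the selection thresholds are all hypothetical; the rule actually used in the fork's `Ops::mmv_f32` is not stated in this commit message and may differ.

```rust
/// Hypothetical labels for the GEMV kernels named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GemvKernel {
    Wasm4x1,
    Wasm8x1,
    Wasm16x1,
}

impl GemvKernel {
    /// Number of output rows produced by one tile of this kernel.
    fn tile_rows(self) -> usize {
        match self {
            GemvKernel::Wasm4x1 => 4,
            GemvKernel::Wasm8x1 => 8,
            GemvKernel::Wasm16x1 => 16,
        }
    }
}

/// Illustrative rule only (an assumption, not the fork's logic): pick the
/// tallest tile that M fills at least once, and fall back to the smallest
/// tile when M is unknown.
fn pick_gemv_kernel(m: Option<usize>) -> GemvKernel {
    match m {
        Some(m) if m >= 16 => GemvKernel::Wasm16x1,
        Some(m) if m >= 8 => GemvKernel::Wasm8x1,
        _ => GemvKernel::Wasm4x1,
    }
}

/// Rows of padding needed to round M up to a whole number of tiles.
fn padded_rows(kernel: GemvKernel, m: usize) -> usize {
    let t = kernel.tile_rows();
    (m + t - 1) / t * t - m
}
```

With this rule, `pick_gemv_kernel(Some(10))` selects the 8x1 tile and `padded_rows` reports 6 wasted rows, which is the kind of trade-off a real dispatcher has to weigh per matrix shape.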
…of branch)
Pins the [patch.crates-io] override to a specific commit SHA rather than
a branch, so libDF builds are reproducible while the upstream PR is under
review. Any further commits we push to the branch (in response to review
feedback) will not auto-propagate to libDF builds; we'll bump the rev
deliberately when we want to absorb them.
Pinned commit (b82d1f0): full kernel kit (4x1, 8x1, 16x1, 8x8 + per-M
dispatcher), no 8x4, with module-level #![allow(unsafe_op_in_unsafe_fn)].
Picks up the upstream review fix (inner unsafe { } blocks per kali's
request, plus cargo fmt). Rev d925624 is the new HEAD of
czoli1976:add-wasm-f32-full-kernel-kit; the previous b82d1f0 was
force-pushed away (rejected lint-allow approach).
…tch.crates-io])
Replaces the broken `[patch.crates-io] tract-linalg = { git = ..., rev = "d925624" }`
directive (which silently fell back to stock crates.io tract-linalg-0.22.1
because the git ref was at version 0.23.0-pre — version mismatch with
libDF's =0.22.1 constraint, so cargo ignored the patch).
Replacement: vendor a self-contained single-crate copy of
sonos/tract@v0.22.1's `linalg/` subcrate at `vendor/tract-linalg-0.22.1/`,
with the kernel kit cherry-picks applied (sonos/tract#2164's 4 commits:
4x1 GEMV, kit extension, 8x4 drop, review-feedback inner-unsafe-blocks).
The vendor crate is at version 0.22.1 (matches libDF's =0.22.1 requirement)
and is single-crate (its tract-data dep comes from crates.io, avoiding
the multi-workspace tract-data version conflict that caused the previous
git+branch attempt to fail with 103 type-mismatch errors).
Validated:
- Cargo.lock now has ONE tract-linalg entry (path-resolved, 0.22.1)
- WASM size: 10,007,237 bytes raw (was 9,971,007 with stock kernel-kit-free
fallback — +36 KB confirms kernel kit symbols are linked)
- Bench: RTF 0.0516, per-frame mean 0.5157 ms (matches prior kernel-kit
best; was 0.108 with broken patch falling back to stock 4x4-only)
Drop this vendor dir entirely once tract publishes a release including
the merged kernel kit (sonos/tract#2164 was merged 2026-04-28; awaiting
0.22.x cherry-pick or 0.23.x cut).
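The first validation item (a single, path-resolved tract-linalg entry in Cargo.lock) can be checked mechanically. The sketch below is a hypothetical helper, not part of libDF. It relies on the fact that Cargo.lock entries for path dependencies carry no `source` line, whereas registry and git entries do. It does naive line matching rather than full TOML parsing.

```rust
/// One `[[package]]` entry from a Cargo.lock, reduced to the fields we need.
#[derive(Debug, PartialEq)]
struct LockEntry {
    version: String,
    /// `None` for path dependencies, which have no `source` line.
    source: Option<String>,
}

/// Extracts the quoted value from a line such as `name = "tract-linalg"`.
fn quoted_value<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let rest = line.trim().strip_prefix(key)?.trim_start();
    let rest = rest.strip_prefix('=')?.trim();
    rest.strip_prefix('"')?.strip_suffix('"')
}

/// Returns every lockfile entry whose package name equals `package`.
fn find_lock_entries(lock_text: &str, package: &str) -> Vec<LockEntry> {
    lock_text
        .split("[[package]]")
        .filter_map(|block| {
            let name = block.lines().find_map(|l| quoted_value(l, "name"))?;
            if name != package {
                return None;
            }
            let version = block.lines().find_map(|l| quoted_value(l, "version"))?;
            let source = block.lines().find_map(|l| quoted_value(l, "source"));
            Some(LockEntry {
                version: version.to_string(),
                source: source.map(str::to_string),
            })
        })
        .collect()
}
```

A silent fallback to crates.io, as described for the broken patch, would show up here as an entry whose `source` starts with `registry+`.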
Hot loop on the per-frame ERB feature path: dot-product over a band of
Complex32 against itself (or a reference). The wasm32 build with
`+simd128` was leaving this loop scalar: `wasm-objdump` shows zero v128
ops for the function body in the production build. Replace the inner
accumulator with a 4-wide f32x4 reduction using `core::arch::wasm32`
intrinsics.
Output is bit-exact identical (FNV-1a 20ea4579c427f925 unchanged across
Chromium / WebKit / Firefox, single-threaded and 4-thread).
Same-machine focused bench, Chromium, 5-run alternated, 300 iter × 20
frames per measurement (t-test):
- vanilla_mono control: 3.755 -> 3.750 ms (no change, sanity)
- my_mt_1t: 3.748 -> 3.723 ms (-0.67%, t=2.22)
- my_mt_4t: 4.679 -> 4.646 ms (-0.71%, t=2.45)
Native builds use the existing scalar reduction via cfg gating; no
behaviour change off wasm32.
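A minimal sketch of the shape of such a reduction, under stated assumptions: `C32` is a local stand-in for `num_complex::Complex32`, and the function names are illustrative rather than libDF's. Lane-wise accumulation reorders the additions relative to the scalar loop, so this sketch does not by itself guarantee the bit-exactness the commit reports for its own implementation.

```rust
/// Stand-in for `num_complex::Complex32`: two packed f32 values.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
struct C32 {
    re: f32,
    im: f32,
}

/// Scalar reference: sum of x.re * p.re + x.im * p.im over the band.
fn band_corr_scalar(x: &[C32], p: &[C32]) -> f32 {
    let mut acc = 0.0f32;
    for (a, b) in x.iter().zip(p.iter()) {
        acc += a.re * b.re + a.im * b.im;
    }
    acc
}

/// wasm32 path: each v128 holds two complex values (four f32 lanes).
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
fn band_corr(x: &[C32], p: &[C32]) -> f32 {
    use core::arch::wasm32::{
        f32x4_add, f32x4_extract_lane, f32x4_mul, f32x4_splat, v128, v128_load,
    };
    let n = x.len().min(p.len());
    let pairs = n / 2;
    let mut acc = f32x4_splat(0.0);
    for i in 0..pairs {
        // SAFETY: 2 * i + 1 < n, and C32 is #[repr(C)] { re, im }, so two
        // values are four contiguous f32. v128_load accepts unaligned pointers.
        let (a, b) = unsafe {
            (
                v128_load(x.as_ptr().add(2 * i) as *const v128),
                v128_load(p.as_ptr().add(2 * i) as *const v128),
            )
        };
        acc = f32x4_add(acc, f32x4_mul(a, b));
    }
    // Horizontal sum of the four lanes, then the scalar tail (odd length).
    let mut sum = (f32x4_extract_lane::<0>(acc) + f32x4_extract_lane::<1>(acc))
        + (f32x4_extract_lane::<2>(acc) + f32x4_extract_lane::<3>(acc));
    for i in 2 * pairs..n {
        sum += x[i].re * p[i].re + x[i].im * p[i].im;
    }
    sum
}

/// Every other target keeps the scalar reduction.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
fn band_corr(x: &[C32], p: &[C32]) -> f32 {
    band_corr_scalar(x, p)
}
```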
Adds f32x4 vectorization for three more hot DSP functions in the
df_process_frame inference path, on top of the compute_band_corr
work in this PR's first commit:
* band_mean_norm_erb (called from feat_erb per frame): per-bin
IIR mean-norm. State is per-bin (no recurrence between bins) so
straightforward 4-wide SIMD over all ERB bins.
* apply_band_gain (called from apply_mask post-network): Complex32
x f32 scalar mul-in-place per ERB band. Reinterprets
&mut [Complex32] as &mut [f32] of length 2N (Complex32 is
#[repr(C)] {re, im}, identical layout). 4-wide SIMD multiplies.
Also redirects DFState::apply_mask to call apply_band_gain (the
Complex32 specialisation) instead of the generic
apply_interp_band_gain<T>, since the existing apply_band_gain
function is already structurally identical.
* apply_window_in_place (called from frame_synthesis per frame):
f32 mul-in-place. Signature changed from generic
IntoIterator<Item=&'a f32> to &[f32] (the sole caller already
passes &state.window which IS a slice). 4-wide SIMD multiplies.
Each function keeps the original scalar implementation as the
non-wasm32 fallback via #[cfg(not(target_arch = "wasm32"))].
Bit-identical output verified: FNV-1a hash of df_process_frame
output stream over 3000 random frames matches the Rikorose main
baseline exactly across all 3 independent bench runs on
Node v20.11.1 / V8.
Wasm size delta vs baseline: +835 bytes total (compute_band_corr
+699; the 3 new helpers add net +136 bytes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
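The slice reinterpretation described for apply_band_gain can be sketched as follows. `C32` is a local stand-in for `num_complex::Complex32`, and the function signature is an assumption, not necessarily libDF's. The 4-wide step is modelled with `chunks_exact_mut(4)` instead of intrinsics so the example runs on any target.

```rust
/// Stand-in for `num_complex::Complex32`.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
struct C32 {
    re: f32,
    im: f32,
}

/// Views a complex slice as interleaved [re, im, re, im, ...] floats.
fn as_f32_mut(xs: &mut [C32]) -> &mut [f32] {
    // SAFETY: C32 is #[repr(C)] with exactly two f32 fields (size 8, align 4,
    // no padding), so N values occupy 2N contiguous f32 slots. The returned
    // slice holds the only mutable borrow of `xs` for its lifetime.
    unsafe { core::slice::from_raw_parts_mut(xs.as_mut_ptr() as *mut f32, xs.len() * 2) }
}

/// Multiplies each band of `xs` by its gain. `band_sizes[i]` is the number
/// of complex bins in band i (assumed layout). Panics if the bands cover
/// more bins than `xs` holds.
fn apply_band_gain(xs: &mut [C32], gains: &[f32], band_sizes: &[usize]) {
    let flat = as_f32_mut(xs);
    let mut start = 0;
    for (&size, &g) in band_sizes.iter().zip(gains.iter()) {
        let band = &mut flat[start..start + 2 * size];
        // Four floats per step: the unit one f32x4 multiply would cover.
        let mut chunks = band.chunks_exact_mut(4);
        for chunk in &mut chunks {
            for v in chunk.iter_mut() {
                *v *= g;
            }
        }
        // Scalar tail for bands whose float count is not a multiple of 4.
        for v in chunks.into_remainder() {
            *v *= g;
        }
        start += 2 * size;
    }
}
```

Because every float is multiplied by the same per-band gain exactly once, the chunked form performs the same IEEE operations as the scalar loop, which is consistent with the bit-identical output reported above.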
Adds f32x4 SIMD for the two band-unit-norm functions in feat_cplx /
feat_cplx_t (called per frame inside df_process_frame).
The trick is de-interleaving &mut [Complex32]'s [re,im,re,im,...]
layout so we can compute the per-bin norm (sqrt(re^2 + im^2))
lane-wise. Strategy: load 4 Complex32 (8 f32) as 2 v128s, use
i32x4_shuffle to build pure-real and pure-imag vectors, compute
norm in 4-wide SIMD, update state, then divide xs by sqrt(state).
* band_unit_norm (xs: &mut [Complex32]) — re-interleaves the
per-bin sqrt(state) divisor via two i32x4_shuffles to match
the [re,im,re,im] xs layout, then divides 4 Complex32 (8 f32)
at a time.
* band_unit_norm_t (xs: &[Complex32], out: &mut [f32]) — same
norm computation but writes to o_re / o_im split halves of
out (CONTIGUOUS), so no re-interleave step is needed for the
divide.
Used (re*re + im*im).sqrt() instead of Complex32::norm()'s libm
hypot. For DFN3's audio-spectrum magnitudes (no overflow/underflow
regime), both produce identical bits — verified by FNV-1a hash of
df_process_frame output stream over N=3000 deterministic random
frames matching baseline exactly across 5 independent runs on
Node v20.11.1 / V8.
Wasm size delta: +678 bytes vs the 4-function bundle commit.
Total over no-SIMD baseline: +1513 bytes for all 6 vectorisations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
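The de-interleave and re-interleave index patterns can be modelled on plain arrays. In this sketch, `shuffle` is a portable stand-in for wasm32's `i32x4_shuffle`, where indices 0..=3 select lanes of the first operand and 4..=7 lanes of the second. The exponential-smoothing state update is an assumption about the function's behaviour; the commit message only says the state is updated.

```rust
type V4 = [f32; 4];

/// Portable model of `i32x4_shuffle::<I0, I1, I2, I3>(a, b)`.
fn shuffle(a: V4, b: V4, idx: [usize; 4]) -> V4 {
    let cat = [a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]];
    [cat[idx[0]], cat[idx[1]], cat[idx[2]], cat[idx[3]]]
}

/// Normalises four complex bins stored as
/// lo = [re0, im0, re1, im1] and hi = [re2, im2, re3, im3].
fn unit_norm_4(lo: &mut V4, hi: &mut V4, state: &mut V4, alpha: f32) {
    // De-interleave into pure-real and pure-imaginary vectors.
    let re = shuffle(*lo, *hi, [0, 2, 4, 6]);
    let im = shuffle(*lo, *hi, [1, 3, 5, 7]);

    let mut div = [0.0f32; 4];
    for k in 0..4 {
        let norm = (re[k] * re[k] + im[k] * im[k]).sqrt();
        // Assumed update rule: exponential smoothing of the per-bin norm.
        state[k] = norm * (1.0 - alpha) + state[k] * alpha;
        div[k] = state[k].sqrt();
    }

    // Re-interleave the divisor so each bin's re and im share one value.
    let div_lo = shuffle(div, div, [0, 0, 1, 1]);
    let div_hi = shuffle(div, div, [2, 2, 3, 3]);
    for k in 0..4 {
        lo[k] /= div_lo[k];
        hi[k] /= div_hi[k];
    }
}
```

The `band_unit_norm_t` variant described above would skip the last two shuffles, since its real and imaginary outputs live in separate contiguous halves.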
Three more loops in frame_synthesis emit scalar code on wasm32
despite +simd128 (unlike the frame_analysis windowing loops which
LLVM auto-vec'd; something about the nested zip().zip() iterator
pattern in frame_synthesis vs the izip!() pattern in frame_analysis
defeats auto-vectorization).
Three changes:
* out[i] = x_first[i] + synthesis_mem[i] (overlap-add to output)
— new f32_add_to(a, b, out) helper, three-slice element-wise
add via 4-wide v128 + f32x4_add.
* s_first[i] += xs_first[i] (overlap-add for next frame, in-place)
— new f32_add_inplace(xs, ys) helper, two-slice element-wise
in-place add.
* s_second[i] = xs_second[i] (override left-shifted buffer)
— replaced the explicit loop with copy_from_slice; the compiler
likely emitted memcpy already, but the stdlib idiom is clearer
and lets the optimiser pick the best implementation.
Bit-identical output verified: FNV-1a hash 53ae8dfc3595faf0
unchanged across N=3000 deterministic frames over 6 independent
bench runs.
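For reference, a 64-bit FNV-1a hash over an f32 output stream can be sketched as below. The 16-hex-digit values quoted in these commits suggest the 64-bit variant, but the bench harness itself is not shown, so the byte order and the way frames are fed in are assumptions.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Folds `bytes` into a running 64-bit FNV-1a hash (XOR, then multiply).
fn fnv1a64_update(mut hash: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hashes the raw bit patterns of every sample, frame after frame.
/// Little-endian byte order is an assumption of this sketch.
fn hash_frames<'a>(frames: impl IntoIterator<Item = &'a [f32]>) -> u64 {
    let mut hash = FNV_OFFSET;
    for frame in frames {
        for sample in frame {
            hash = fnv1a64_update(hash, &sample.to_bits().to_le_bytes());
        }
    }
    hash
}
```

Hashing bit patterns rather than comparing with a tolerance means even a sign-of-zero or last-bit rounding difference changes the digest, which is what makes an unchanged hash a meaningful bit-identity check.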
Speed: median bundle_synth vs the previous 6-function bundle is
-1.2% RTF; mean over 6 iters is -3.1%. Several runs showed -5% to
-11% additional gain (those runs had background CPU activity that
hit the previous bundle harder). Real direction, modest absolute
gain, no quality cost.
Wasm size delta: -24 bytes vs previous bundle (copy_from_slice
emits less code than the explicit loop). Net total: +1489 bytes
over the no-SIMD baseline for all 8 vectorisations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
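The three steps above can be sketched as follows. The helper names come from the commit message, but their bodies, the argument roles of `f32_add_inplace`, and the surrounding `overlap_add` function (including the buffer rotation) are assumptions made for illustration. The 4-wide step is written as a fixed-width inner loop rather than with intrinsics so the example runs on any target.

```rust
/// out[i] = a[i] + b[i], four elements per step with a scalar tail.
fn f32_add_to(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = out.len().min(a.len()).min(b.len());
    let mut i = 0;
    while i + 4 <= n {
        // Fixed-width block: the unit a single f32x4_add would cover.
        for k in 0..4 {
            out[i + k] = a[i + k] + b[i + k];
        }
        i += 4;
    }
    while i < n {
        out[i] = a[i] + b[i];
        i += 1;
    }
}

/// xs[i] += ys[i] (assumed argument roles).
fn f32_add_inplace(xs: &mut [f32], ys: &[f32]) {
    for (x, &y) in xs.iter_mut().zip(ys.iter()) {
        *x += y;
    }
}

/// Hypothetical overlap-add step: `x` is the windowed inverse-FFT output,
/// `mem` the overlap memory, and `out` receives one hop of samples.
fn overlap_add(x: &[f32], mem: &mut [f32], out: &mut [f32]) {
    let hop = out.len();
    assert_eq!(x.len(), hop + mem.len());
    assert!(mem.len() >= hop);
    let (x_first, x_second) = x.split_at(hop);

    // 1. Emit the current hop: new samples plus stored overlap.
    f32_add_to(x_first, &mem[..hop], out);

    // Shift the consumed hop out of the overlap memory.
    let split = mem.len() - hop;
    mem.rotate_left(hop);
    let (s_first, s_second) = mem.split_at_mut(split);
    let (xs_first, xs_second) = x_second.split_at(split);

    // 2. Accumulate into the part that still overlaps future frames.
    f32_add_inplace(s_first, xs_first);
    // 3. Overwrite the freshly shifted-in tail.
    s_second.copy_from_slice(xs_second);
}
```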
Summary
libDF-side integration for Vonage's DFN3 WASM kernel investigation.
Captures the dependency upgrade and [patch.crates-io] wiring needed
to land the WASM SIMD kernel kit being developed at
czoli1976/tract@add-wasm-f32-full-kernel-kit.
Internal documentation PR, not for upstream contribution to
Rikorose/DeepFilterNet at this time (the kernel work needs to land
upstream in sonos/tract first; that's tracked separately at
sonos/tract#2161).
End-to-end impact
Net: -60% RTF, 2.5x faster on the 27.29s benchmark clip. Audio output
bit-identical to scalar reference.
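The headline figures follow from the standard definition of real-time factor (processing time divided by audio duration). A small sketch of the arithmetic, with hypothetical function names:

```rust
/// Real-time factor: seconds of compute per second of audio.
fn rtf(processing_s: f64, audio_s: f64) -> f64 {
    processing_s / audio_s
}

/// How many times faster the new build is.
fn speedup(rtf_before: f64, rtf_after: f64) -> f64 {
    rtf_before / rtf_after
}

/// Relative RTF reduction in percent.
fn reduction_pct(rtf_before: f64, rtf_after: f64) -> f64 {
    (1.0 - rtf_after / rtf_before) * 100.0
}
```

With the reported values, `speedup(0.1290, 0.0516)` is 2.5 and `reduction_pct(0.1290, 0.0516)` is 60, up to floating-point rounding. On the 27.29 s clip that corresponds to roughly 3.52 s of compute dropping to roughly 1.41 s.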
Changes
- libDF/Cargo.toml: ^0.21.4 → =0.22.1 (4 crates)
- libDF/src/tract.rs: m.symbol_table.sym("S") → m.symbols.sym("S")
- libDF/src/{tract,transforms,wasm,wav_utils}.rs + bin/enhance_wav.rs: use ndarray::* → use tract_core::ndarray::* (ndarray import no longer resolves)
- Cargo.toml (workspace): [patch.crates-io] override pointing tract-linalg at local fork
- Cargo.lock
Discovery 1: production builds need RUSTFLAGS=+simd128
Critical foundational find: production builds (Mezon's wasm-pack,
libDF's build_wasm_package.sh) never set
`RUSTFLAGS="-C target-feature=+simd128"`. As a result,
tract-linalg/src/wasm.rs (gated by
`#[cfg(all(target_family = "wasm", target_feature = "simd128"))]`)
was being excluded from the build entirely, so every WASM build was
running the scalar generic_f32_4x4 fallback.
Fixing this alone (one line in build_wasm_package.sh) yielded a 16%
RTF reduction. The kernel kit work then builds on top of that.
This isn't a code change in this PR — it's a build-script change that
must be applied to the consumer (the Vonage NS library, downstream of
libDF). Documented here so the integration story is self-contained.
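The effect of the missing flag can be reproduced in isolation. The sketch below uses the same `cfg` predicate quoted above; the module and function names are hypothetical and are not taken from tract-linalg.

```rust
/// Compiled only when the target is wasm AND simd128 is enabled at build time.
#[cfg(all(target_family = "wasm", target_feature = "simd128"))]
mod wasm_kernels {
    pub const AVAILABLE: bool = true;
}

/// Without the flag, the SIMD module above does not exist in the build at
/// all; this stub stands in for the scalar fallback.
#[cfg(not(all(target_family = "wasm", target_feature = "simd128")))]
mod wasm_kernels {
    pub const AVAILABLE: bool = false;
}

/// Reports which path this particular build ended up with.
fn compiled_kernel_path() -> &'static str {
    if cfg!(all(target_family = "wasm", target_feature = "simd128")) {
        "wasm simd128 kernels"
    } else {
        "scalar generic fallback"
    }
}
```

Because `cfg` is resolved at compile time, a wasm build without `-C target-feature=+simd128` produces no error or warning; the SIMD code is simply absent. That is why the omission went unnoticed until the disassembly and timings were inspected.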
Caveat: [patch.crates-io] uses an absolute local path
The [patch.crates-io] override in workspace Cargo.toml currently
points at:
```toml
tract-linalg = { path = "/Users/CZoli/Desktop/tract-fork/tract-linalg-0.22.1" }
```
This is non-reproducible for anyone else. To make it portable, the
patch could be repointed at a git revision on
czoli1976/tract once the kernel
kit is stabilized:
```toml
tract-linalg = { git = "https://github.com/czoli1976/tract", branch = "add-wasm-f32-full-kernel-kit" }
```
Doing that now would mix concerns; deferred until kernel kit lands
upstream or is given a stable internal version.
Tests
cargo test -p df --no-default-features --features tract,wav-utils
passes on macOS native. WASM tests live in the patched tract-linalg
fork (1508/1508 pass on wasm32-wasip1 + wasmtime + simd128, see
czoli1976/tract#2).
See also
(full kit) and czoli1976/tract#1
(upstream-PR-ready 4x1 only)
WASM_SIMD_KERNEL_INVESTIGATION.md in the dfn3-wasm-opt-v2.1 worktree (Vonage internal)