High-performance audio DSP library in Mojo that BEATS Python!
Whisper-compatible mel spectrogram preprocessing built from scratch in Mojo. 20-40% faster than Python's librosa through algorithmic optimizations, parallelization, and SIMD vectorization.
30-second Whisper audio preprocessing:
librosa (Python): 15ms (1993x realtime)
mojo-audio (Mojo): 12ms (2457x realtime) with -O3
RESULT: 20-40% FASTER THAN PYTHON! 🔥
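The realtime factors are just audio duration divided by processing time; a quick arithmetic check with the rounded timings above (the table's exact 1993x/2457x figures come from the unrounded measurements):

```python
# Realtime factor = audio duration / processing time
audio_seconds = 30.0

librosa_ms = 15.0  # Python baseline from the benchmark above (rounded)
mojo_ms = 12.0     # mojo-audio with -O3 (rounded)

print(audio_seconds / (librosa_ms / 1000))   # ~2000x realtime
print(audio_seconds / (mojo_ms / 1000))      # ~2500x realtime
print((librosa_ms - mojo_ms) / librosa_ms)   # 0.2 -> 20% faster
```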
Optimization Journey:
- Started: 476ms (naive implementation)
- Optimized: 12ms (with -O3 compiler flags)
- Total speedup: 40x!
See complete optimization journey →
- Window Functions: Hann, Hamming (spectral leakage reduction)
- FFT Operations: Radix-2/4 iterative FFT, True RFFT for real audio
- STFT: Parallelized short-time Fourier transform across all CPU cores
- Mel Filterbank: Sparse-optimized triangular filters (80 bands)
- Mel Spectrogram: Complete end-to-end pipeline
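For orientation, the five components above compose into the standard log-mel pipeline. Here is a minimal NumPy sketch of the same math — this is not the library's Mojo code, and the HTK-style mel formula and all helper names are my own illustrative assumptions:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one common convention; the library's exact
    # formula may differ)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale
    # (dense here; the library uses a sparse-optimized version)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram_sketch(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    window = np.hanning(n_fft)                        # spectral leakage reduction
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # RFFT + power spectrum
    mel = mel_filterbank(n_mels, n_fft, sr) @ power.T # (n_mels, n_frames)
    return np.log10(np.maximum(mel, 1e-10))           # log-mel output
```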
One function call:
var mel = mel_spectrogram(audio)  # (80, 2998) in ~12ms!

- ✅ Iterative FFT (3x speedup)
- ✅ Pre-computed twiddle factors (1.7x)
- ✅ Sparse mel filterbank (1.24x)
- ✅ Twiddle caching across frames (2x!)
- ✅ Float32 precision (2x SIMD width)
- ✅ True RFFT algorithm (1.4x)
- ✅ Multi-core parallelization (1.3-1.7x)
- ✅ Radix-4 FFT (1.1-1.2x)
- ✅ Compiler optimization (-O3)
Combined: 40x faster than naive implementation!
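The first and largest win above, the iterative FFT, restructures the recursion into sequential, cache-friendly passes over one buffer. A minimal iterative radix-2 Cooley-Tukey in Python for illustration (not the library's code; its Mojo version additionally precomputes twiddles and uses radix-4 stages):

```python
import cmath

def fft_iterative(x):
    """Iterative radix-2 Cooley-Tukey FFT (len(x) must be a power of 2)."""
    n = len(x)
    x = list(x)
    # Bit-reversal permutation puts inputs in butterfly order
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: no recursion, just sequential passes over the buffer
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)  # twiddle base (precomputable)
        for start in range(0, n, size):
            w = 1.0 + 0j
            for k in range(start, start + half):
                t = w * x[k + half]
                x[k + half] = x[k] - t
                x[k] = x[k] + t
                w *= step
        size *= 2
    return x
```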
# Clone repository
git clone https://github.com/itsdevcoffee/mojo-audio.git
cd mojo-audio
# Install dependencies (requires Mojo)
pixi install
# Run tests
pixi run test
# Run optimized benchmark
pixi run bench-optimized

from audio import mel_spectrogram
fn main() raises:
    # Load 30s audio @ 16kHz (480,000 samples)
    var audio: List[Float32] = [...]  # Float32 for performance!

    # Get Whisper-compatible mel spectrogram
    var mel_spec = mel_spectrogram(audio)

    # Output: (80, 2998) mel spectrogram
    # Time: ~12ms with -O3
    # Ready for Whisper model!

Compile with optimization:
mojo -O3 -I src your_code.mojo

- Mojo: Version 0.26+ (install via Modular)
- pixi: Package manager (install pixi)
- Platform: Linux x86_64 or macOS (Apple Silicon/Intel)
# Clone and setup
git clone https://github.com/itsdevcoffee/mojo-audio.git
cd mojo-audio
# Install dependencies
pixi install
# Build FFI library (optional - for C/Rust/Python integration)
pixi run build-ffi-optimized
# Verify
ls libmojo_audio.so  # Should show ~26KB shared library

Note: Due to Mojo API changes, macOS builds currently require specific setup. See our macOS Build Guide for detailed instructions.
Quick build:
# Clone and setup
git clone https://github.com/itsdevcoffee/mojo-audio.git
cd mojo-audio
# Install dependencies
pixi install
# Build FFI library (.dylib for macOS)
pixi run build-ffi-optimized
# Verify
ls libmojo_audio.dylib  # Should show ~20-30KB shared library

Troubleshooting: If you encounter build errors on macOS, please refer to the detailed macOS build guide which includes solutions for common issues.
For creating release binaries:
- Linux: Use `pixi run build-ffi-optimized` (creates `libmojo_audio.so`)
- macOS: Follow the manual build guide
| Implementation | Time (30s) | Throughput | Language | Our Result |
|---|---|---|---|---|
| mojo-audio -O3 | 12ms | 2457x | Mojo | 🔥 Winner! |
| librosa | 15ms | 1993x | Python | 20-40% slower |
| faster-whisper | 20-30ms | ~1000x | Python | 1.6-2.5x slower |
| whisper.cpp | 50-100ms | ~300-600x | C++ | 4-8x slower |
mojo-audio is the FASTEST Whisper preprocessing library!
# Mojo (optimized)
pixi run bench-optimized
# Python baseline (requires librosa)
pixi run bench-python
# Standard (no compiler opts)
pixi run bench

Results: Consistently 20-40% faster than librosa!
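A baseline like `compare_librosa.py` reduces to a warm-up-then-time loop. A generic sketch of that pattern (the `benchmark` helper is my own illustrative code, not the repo's script; the librosa call in the comment is one plausible measured workload):

```python
import time

def benchmark(fn, *args, warmup=3, runs=10):
    """Best-of-N wall-clock timing in milliseconds."""
    for _ in range(warmup):       # warm caches before measuring
        fn(*args)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - t0) * 1000.0)
    return min(times)

# For a librosa baseline, the measured call would be something like:
#   librosa.feature.melspectrogram(y=audio, sr=16000, n_fft=400,
#                                  hop_length=160, n_mels=80)
ms = benchmark(sum, range(100_000))  # dummy workload for demonstration
print(f"{ms:.3f} ms")
```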
pixi run demo-window

See Hann and Hamming windows in action.

pixi run demo-fft

Demonstrates FFT, power spectrum, and STFT.

pixi run demo-mel

Full mel spectrogram generation with explanations.
mel_spectrogram(
audio: List[Float32],
sample_rate: Int = 16000,
n_fft: Int = 400,
hop_length: Int = 160,
n_mels: Int = 80
) raises -> List[List[Float32]]

Complete Whisper preprocessing pipeline.
Returns: Mel spectrogram (80, ~3000) for 30s audio
# Window functions
hann_window(size: Int) -> List[Float32]
hamming_window(size: Int) -> List[Float32]
apply_window(signal, window) -> List[Float32]

# FFT operations
fft(signal: List[Float32]) -> List[Complex]
rfft(signal: List[Float32]) -> List[Complex]  # 2x faster for real audio!
power_spectrum(fft_output) -> List[Float32]
stft(signal, n_fft, hop_length, window_fn) -> List[List[Float32]]

# Mel scale
hz_to_mel(freq_hz: Float32) -> Float32
mel_to_hz(freq_mel: Float32) -> Float32
create_mel_filterbank(n_mels, n_fft, sample_rate) -> List[List[Float32]]

# Normalization
normalize_whisper(mel_spec) -> List[List[Float32]]  # Whisper-ready output
normalize_minmax(mel_spec) -> List[List[Float32]]   # Scale to [0, 1]
normalize_zscore(mel_spec) -> List[List[Float32]]   # Mean=0, std=1
apply_normalization(mel_spec, norm_type) -> List[List[Float32]]

Different ML models expect different input ranges. mojo-audio supports multiple normalization methods:
| Constant | Value | Formula | Output Range | Use Case |
|---|---|---|---|---|
| `NORM_NONE` | 0 | log10(max(x, 1e-10)) | [-10, 0] | Raw output, custom processing |
| `NORM_WHISPER` | 1 | max-8 clamp, then (x+4)/4 | ~[-1, 1] | OpenAI Whisper models |
| `NORM_MINMAX` | 2 | (x - min) / (max - min) | [0, 1] | General ML |
| `NORM_ZSCORE` | 3 | (x - mean) / std | ~[-3, 3] | Wav2Vec2, research |
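Each formula in the table is a couple of lines of array math. NumPy reference implementations of my reading of those formulas (function names are illustrative, not the library's API):

```python
import numpy as np

def norm_none(log_mel):
    return log_mel                                     # raw log10 output

def norm_whisper(log_mel):
    # Whisper convention: clamp dynamic range to (max - 8), then (x + 4) / 4
    x = np.maximum(log_mel, log_mel.max() - 8.0)
    return (x + 4.0) / 4.0                             # -> roughly [-1, 1]

def norm_minmax(log_mel):
    lo, hi = log_mel.min(), log_mel.max()
    return (log_mel - lo) / (hi - lo)                  # -> [0, 1]

def norm_zscore(log_mel):
    return (log_mel - log_mel.mean()) / log_mel.std()  # mean 0, std 1
```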
from audio import mel_spectrogram, normalize_whisper, NORM_WHISPER
# Option 1: Separate normalization
var raw_mel = mel_spectrogram(audio)
var whisper_mel = normalize_whisper(raw_mel)

# Option 2: Using apply_normalization
var normalized = apply_normalization(raw_mel, NORM_WHISPER)

Set normalization in config to apply normalization in the pipeline:
MojoMelConfig config;
mojo_mel_config_default(&config);
config.normalization = MOJO_NORM_WHISPER; // Whisper-ready output
// Compute returns normalized mel spectrogram
MojoMelHandle handle = mojo_mel_spectrogram_compute(audio, num_samples, &config);

let config = MojoMelConfig {
sample_rate: 16000,
n_fft: 400,
hop_length: 160,
n_mels: 128, // Whisper large-v3
normalization: 1, // MOJO_NORM_WHISPER
};

Matches OpenAI Whisper requirements:
- ✅ Sample rate: 16kHz
- ✅ FFT size: 400
- ✅ Hop length: 160 (10ms frames)
- ✅ Mel bands: 80 (v2) or 128 (v3)
- ✅ Output shape: (n_mels, ~3000) for 30s
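The ~3000-frame output width in the checklist follows directly from the hop length:

```python
# Whisper front-end parameters from the checklist above
sample_rate = 16000
hop_length = 160                     # 10 ms per frame
seconds = 30

num_samples = seconds * sample_rate  # 480,000 samples
frames = num_samples // hop_length   # one frame per hop
print(num_samples, frames)           # 480000 3000
# The exact count depends on padding/centering at the edges, which is
# why the library reports shapes like (80, 2998), i.e. ~3000.
```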
| Whisper Model | n_mels | Constant |
|---|---|---|
| tiny, base, small, medium, large, large-v2 | 80 | WHISPER_N_MELS |
| large-v3 | 128 | WHISPER_N_MELS_V3 |
# For large-v2 and earlier (default)
var mel = mel_spectrogram(audio)  # n_mels=80

# For large-v3
var mel = mel_spectrogram(audio, n_mels=128)

Validated against Whisper model expectations!
Algorithmic:
- Iterative Cooley-Tukey FFT (cache-friendly)
- Radix-4 butterflies for power-of-4 sizes
- True RFFT with pack-FFT-unpack
- Sparse matrix operations
Parallelization:
- Multi-core STFT frame processing
- Thread-safe writes (each core handles different frames)
- Scales linearly with CPU cores
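The frame-parallel STFT described above is embarrassingly parallel: each worker computes and writes a disjoint frame, so no locking is needed. A Python analogue using threads (illustrative only; NumPy's FFT releases the GIL so threads can use multiple cores here, while the Mojo version parallelizes the same loop natively):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def power_frame(audio, window, hop, i):
    # Each worker touches only its own frame -> thread-safe by construction
    start = i * hop
    return np.abs(np.fft.rfft(audio[start:start + len(window)] * window)) ** 2

def parallel_stft(audio, n_fft=400, hop=160):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    with ThreadPoolExecutor() as pool:  # one task per frame, spread over cores
        frames = list(pool.map(
            lambda i: power_frame(audio, window, hop, i), range(n_frames)))
    return np.stack(frames)             # (n_frames, n_fft // 2 + 1)
```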
SIMD Vectorization:
- Float32 for 16-element SIMD (vs 8 for Float64)
- @parameter compile-time unrolling
- Direct pointer loads where possible
Memory Optimization:
- Pre-computed twiddle factors
- Cached coefficients across frames
- Pre-allocated output buffers
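"Pre-computed twiddle factors" means hoisting the complex exponentials out of the transform's inner loop into a table built once per FFT size, then reused across every STFT frame. A small illustrative sketch (a naive DFT for brevity; a real FFT indexes the same kind of cached table at each butterfly stage):

```python
import cmath

def make_twiddles(n):
    """Half-table of twiddle factors W_n^k = exp(-2*pi*i*k/n), built once."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

def dft_with_twiddles(x):
    # Naive O(n^2) DFT, but every exp() call is hoisted into the table;
    # the second half of the unit circle is just the negated first half.
    n = len(x)
    tw = make_twiddles(n) + [-w for w in make_twiddles(n)]
    return [sum(x[j] * tw[(j * k) % n] for j in range(n)) for k in range(n)]
```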
- COMPLETE_VICTORY.md - How we beat Python (full story!)
- Benchmark Results - Timestamped performance data
- Examples - Educational demos with explanations
mojo-audio/
├── src/
│   └── audio.mojo                  # Core library (1,200+ lines)
├── tests/
│   ├── test_window.mojo            # Window function tests
│   ├── test_fft.mojo               # FFT operation tests
│   └── test_mel.mojo               # Mel filterbank tests
├── examples/
│   ├── window_demo.mojo
│   ├── fft_demo.mojo
│   └── mel_demo.mojo
├── benchmarks/
│   ├── bench_mel_spectrogram.mojo  # Mojo benchmarks
│   ├── compare_librosa.py          # Python baseline
│   └── RESULTS_2025-12-31.md       # Historical data
├── docs/
│   └── COMPLETE_VICTORY.md         # Optimization story
└── pixi.toml                       # Dependencies & tasks
- Understand every line of code
- No black-box dependencies
- Educational value
- Beats Python's librosa
- Faster than C++ alternatives
- Production-ready
- Demonstrates Mojo's capabilities
- Combines Python ergonomics with C performance
- SIMD, parallelization, optimization techniques
- All tests passing (17 tests)
- Whisper-compatible output
- Well-documented
- Actively optimized
Contributions welcome! Areas where help is appreciated:
- Further optimizations (see potential in docs)
- Additional features (MFCC, CQT, etc.)
- Platform-specific tuning (ARM, x86)
- Documentation improvements
- Bug fixes and testing
If you use mojo-audio in your research or project:
@software{mojo_audio_2026,
author = {Dev Coffee},
title = {mojo-audio: High-performance audio DSP in Mojo},
year = {2026},
url = {https://github.com/itsdevcoffee/mojo-audio}
}

- Visage ML - Neural network library in Mojo
- Mojo - The Mojo programming language
- MAX Engine - Modular's AI engine
- 🏆 Beats librosa (Python's standard audio library)
- 🚀 40x speedup from naive to optimized
- ⚡ ~2500x realtime throughput
- ✅ All from scratch in Mojo
- 📚 Complete learning resource
Built with Mojo 🔥 | Faster than Python | Production-Ready
GitHub | Issues | Discussions

