Current speaker verification benchmarks (VoxCeleb, SITW, CN-Celeb) evaluate models on audio whose recording conditions resemble the training data. Real-world deployments face a different challenge: the same speaker may be encountered through vastly different acoustic carriers, whether a phone call one day, a video conference the next, or a recorded message played through laptop speakers later.
The CASE Benchmark directly measures this capability: Can your model identify the same speaker regardless of how their voice was transmitted?
For each speaker in the evaluation set, we create multiple versions of their utterances:
- Clean (original high-quality)
- Codec-degraded (GSM, G.711, Opus, etc.)
- Mic-filtered (webcam, headset, laptop mic simulations)
- Noise-added (various SNR levels)
- Reverb-added (simulated room acoustics)
- Playback chain (codec → speaker → room → microphone)
Trials are constructed to test cross-carrier matching: enrollment in clean condition, verification in degraded condition.
The CASE Benchmark is built on two datasets:
VoxCeleb1-O (official test set):
- 40 speakers, ~400 utterances
- Celebrity interviews with varied recording conditions
LibriSpeech test-clean:
- 40 speakers, ~392 utterances
- Read audiobook speech in quiet conditions
This combination provides diversity in speaking styles (spontaneous vs read speech) and original recording conditions. Using standard test sets keeps the evaluation fair and comparable: every model is scored on the same speakers under the same protocols, whether or not those speakers appeared in its training data.
Audio codecs compress speech for transmission, introducing artifacts and frequency limitations.
| Codec | Bitrate | Use Case |
|---|---|---|
| GSM-FR | 13 kbps | 2G mobile |
| G.711 μ-law | 64 kbps | North American PSTN |
| G.711 A-law | 64 kbps | European PSTN |
| Opus 6k | 6 kbps | Low-bandwidth VoIP |
| Opus 12k | 12 kbps | Standard VoIP |
| Opus 24k | 24 kbps | High-quality VoIP |
| MP3 32k | 32 kbps | Compressed storage |
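The benchmark presumably applies these codecs with real encoder implementations, but the character of the distortion can be illustrated with a minimal G.711 μ-law compand-quantize-expand roundtrip. `mu_law_roundtrip` is a hypothetical helper, not part of the benchmark, and real G.711 uses a segmented lookup table rather than the closed-form curve below:

```python
import numpy as np

def mu_law_roundtrip(x, mu=255, bits=8):
    """Approximate G.711 mu-law degradation: logarithmically compress the
    dynamic range, quantize to 8 bits, then expand back to linear.
    (Sketch only; real G.711 uses a segmented lookup table.)"""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand
    levels = 2 ** (bits - 1) - 1
    q = np.round(y * levels) / levels                          # 8-bit quantize
    return np.sign(q) * ((1 + mu) ** np.abs(q) - 1) / mu       # expand

t = np.linspace(0, 1, 8000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)   # 440 Hz tone at 8 kHz
coded = mu_law_roundtrip(clean)
```

The quantization error is small in absolute terms but audible, and it is exactly the kind of nonlinearity that shifts speaker embeddings between carriers.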
Consumer microphones have varying frequency responses that color the recorded audio.
| Profile | Characteristics |
|---|---|
| Webcam (budget) | Bass rolloff <200Hz, harsh HF |
| Webcam (quality) | Flatter response |
| USB Headset | Narrow band, speech-optimized |
| Laptop internal | Heavy bass rolloff, resonance |
| Phone | Telephony band (300-3400 Hz) |
| Smartphone (flagship) | Wide band, good response |
| Conference (ceiling) | Room coloration |
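The exact filter curves behind these labels are not specified here. As an idealized sketch, the "Phone" profile can be approximated with a brickwall FFT band-pass; real microphone simulations would use smooth EQ curves rather than hard cutoffs:

```python
import numpy as np

def phone_band_filter(audio, fs=16000, lo=300.0, hi=3400.0):
    """Idealized 'Phone' profile: zero all spectral content outside the
    classic 300-3400 Hz telephony band (a brickwall stand-in for a real
    microphone's smooth frequency response)."""
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(audio))

fs = 16000
t = np.arange(fs) / fs
rumble = np.sin(2 * np.pi * 100 * t)    # below the band: should be removed
speech = np.sin(2 * np.pi * 1000 * t)   # inside the band: should survive

def rms(x):
    return np.sqrt(np.mean(x ** 2))
```

Low-frequency content is removed entirely while in-band content passes unchanged, mimicking the bandwidth limitation in the table above.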
Additive background noise at various signal-to-noise ratios (SNR).
| SNR | Difficulty |
|---|---|
| 25 dB | Easy (quiet background) |
| 20 dB | Moderate |
| 15 dB | Challenging |
| 10 dB | Difficult |
| 5 dB | Very difficult |
Noise Source: DEMAND corpus (real environmental recordings: domestic, office, transport, etc.)
Data Leakage Warning: The benchmark uses DEMAND noise, NOT MUSAN. If you train with MUSAN noise augmentation, this provides proper train/eval separation. If you train with DEMAND, there will be overlap - consider using MUSAN for training instead.
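Independent of the noise source, mixing at a target SNR amounts to scaling the noise so the speech-to-noise power ratio matches the target. A minimal sketch; `mix_at_snr` is a hypothetical helper, not the benchmark's tooling:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`,
    then add it to the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=15)
```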
Room reverberation applied using real Room Impulse Responses (RIRs).
RIR Sources (for benchmark):
- OpenSLR-28: ~417 real RIRs from various rooms
- BUT ReverbDB: ~1,500 real RIRs from 8 rooms with multiple microphone positions
Data Leakage Warning: The benchmark uses real RIRs from OpenSLR-28 and BUT ReverbDB. To ensure proper train/eval separation:
- Recommended for training: Simulated RIRs using pyroomacoustics or OpenSLR-26
- Avoid for training: OpenSLR-28 and BUT ReverbDB (these are used in the benchmark)
This design ensures models are tested on real acoustic conditions they haven't seen during training.
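Mechanically, applying an RIR is a convolution of the dry signal with the impulse response. The sketch below uses a synthetic exponentially decaying impulse response in place of a measured one; the benchmark itself uses real RIRs:

```python
import numpy as np

def add_reverb(audio, rir):
    """Convolve dry audio with a room impulse response and trim to the
    original length, as is typical for reverb augmentation."""
    wet = np.convolve(audio, rir)
    return wet[: len(audio)]

# Synthetic stand-in for a measured RIR: direct path plus exponential decay
rng = np.random.default_rng(0)
rir = rng.standard_normal(2000) * np.exp(-np.arange(2000) / 300.0)
rir[0] = 1.0                  # direct path
rir /= np.max(np.abs(rir))    # normalize peak

dry = np.zeros(8000)
dry[100] = 1.0                # unit impulse as a toy "utterance"
wet = add_reverb(dry, rir)
```

Convolving a unit impulse simply reproduces the RIR starting at the impulse position, which makes the effect easy to verify.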
Full replay simulation: audio is played through a speaker and re-recorded. The chain applies, in order:
- Codec applied (GSM, G.711)
- Speaker simulation (laptop, phone speaker)
- Room acoustics (reverb + early reflections)
- Microphone recording (budget webcam, laptop mic)
- Noise added (SNR 15-25 dB)
This represents the hardest real-world scenario: a voice message played back and recorded.
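The chain above can be sketched as an ordered composition of stage functions. The stages below are toy stand-ins, not the benchmark's actual codec, speaker, room, or microphone models:

```python
import numpy as np

def apply_chain(audio, stages):
    """Run audio through an ordered list of degradation stages."""
    for stage in stages:
        audio = stage(audio)
    return audio

# Toy stand-ins for the real stages in the playback chain
rng = np.random.default_rng(42)
stages = [
    lambda x: np.clip(x * 1.5, -1.0, 1.0),                # codec saturation
    lambda x: np.convolve(x, [1.0, 0.4, 0.1])[: len(x)],  # speaker/room IR
    lambda x: x - x.mean(),                               # mic DC removal
    lambda x: x + 0.05 * rng.standard_normal(len(x)),     # additive noise
]

clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
degraded = apply_chain(clean, stages)
```

Because each stage feeds the next, errors compound, which is why this protocol is the hardest in the suite.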
- 10,000 trials per protocol (5,000 target + 5,000 impostor)
- 24 protocols = 240,000 total trials
- Balanced: Equal same-speaker and different-speaker pairs
Target trials (same speaker):
- Randomly sample enrollment-verification pairs from same speaker
- Enrollment always clean, verification always degraded
Impostor trials (different speaker):
- Same-gender constraint to increase difficulty
- Random pairing across speakers
All random operations use fixed seeds:
- Trial generation: seed 42
- Degradation application: documented per-file
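Under these rules, trial generation can be sketched as follows. `make_trials` and the toy speaker/gender data are illustrative, not the benchmark's actual tooling; the pairs are utterance IDs, with the clean/degraded versions selected per protocol:

```python
import random

def make_trials(utts, gender, n_target, n_impostor, seed=42):
    """utts: {speaker: [utterance ids]}; gender: {speaker: 'm' or 'f'}.
    Returns (enroll, verify, label) triples; label 1 = same speaker."""
    rng = random.Random(seed)          # fixed seed, per the spec
    speakers = sorted(utts)
    trials = []
    for _ in range(n_target):          # target: two utts, same speaker
        s = rng.choice(speakers)
        enroll, verify = rng.sample(utts[s], 2)
        trials.append((enroll, verify, 1))
    for _ in range(n_impostor):        # impostor: different speakers, same gender
        while True:
            a, b = rng.sample(speakers, 2)
            if gender[a] == gender[b]:
                break
        trials.append((rng.choice(utts[a]), rng.choice(utts[b]), 0))
    return trials

utts = {f"spk{i}": [f"spk{i}_u{j}" for j in range(3)] for i in range(4)}
gender = {"spk0": "m", "spk1": "m", "spk2": "f", "spk3": "f"}
trials = make_trials(utts, gender, n_target=5, n_impostor=5)
```

With a fixed seed, the same trial list is produced on every run, which is what makes results reproducible across evaluations.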
The error rate at the operating threshold where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR).
EER = FAR(t*) = FRR(t*), where t* is the crossover threshold
Lower is better. Reported as a percentage.
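A minimal EER computation over raw trial scores, sweeping every observed score as a candidate threshold (a sketch, not the benchmark's scoring code):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Sweep candidate thresholds and return the error rate at the point
    where FAR (impostors accepted) and FRR (targets rejected) are closest."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, imp]))
    # FRR(t): fraction of targets scoring below t (rejected)
    frr = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    # FAR(t): fraction of impostors scoring at or above t (accepted)
    far = np.array([(imp >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```

With finite trial counts the two step functions rarely cross exactly, so the midpoint at the closest threshold is a common convention.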
Minimum Detection Cost Function (minDCF): the NIST SRE standard metric, with:
- P_target = 0.01
- C_miss = 1
- C_fa = 1
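This cost, commonly reported as minDCF, can be sketched as below. Normalizing by the cost of the best trivial (accept-all or reject-all) system follows the usual NIST convention, which is assumed here:

```python
import numpy as np

def min_dcf(target_scores, impostor_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum of the detection cost over all thresholds, normalized by
    the cost of the best trivial (accept-all / reject-all) system."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every score, plus one above the max (reject all)
    thresholds = np.concatenate([np.sort(np.concatenate([tgt, imp])), [np.inf]])
    p_miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    p_fa = np.array([(imp >= t).mean() for t in thresholds])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
```

With P_target = 0.01, misses are weighted far less than in a balanced setting, so minDCF stresses the low-false-alarm operating region that EER ignores.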
The primary robustness metric measures how much performance is lost due to carrier effects:
Degradation Factor = Absolute EER − Clean EER
Where:
- Absolute EER = weighted average EER across all protocols
- Clean EER = EER on the clean_clean protocol (baseline)
Interpretation:
- Lower is better (more robust to carrier effects)
- A model with +2% degradation loses 2 percentage points due to carriers
- Independent of baseline performance - directly measures carrier susceptibility
Note on CASE-Score v1: An earlier metric used normalized ratios (EER_degraded / EER_clean), but this can misleadingly reward models with poor baselines. See Metrics for full discussion.
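The definition above can be computed directly from per-protocol EERs. Uniform protocol weights, and the protocol names other than clean_clean, are assumptions of this sketch:

```python
import numpy as np

def degradation_factor(protocol_eers, clean_key="clean_clean", weights=None):
    """protocol_eers: {protocol: EER in %}. Absolute EER is the (weighted)
    mean over all protocols; the factor is its gap to the clean baseline.
    Uniform weights are assumed; the benchmark's weighting may differ."""
    keys = sorted(protocol_eers)
    eers = np.array([protocol_eers[k] for k in keys])
    w = np.ones(len(keys)) if weights is None else np.array([weights[k] for k in keys])
    absolute_eer = float(np.sum(w * eers) / np.sum(w))
    return absolute_eer - protocol_eers[clean_key]

# Hypothetical protocol EERs: absolute EER = 3.0%, clean EER = 1.0%
eers = {"clean_clean": 1.0, "clean_gsm": 3.0, "clean_noise15": 5.0}
factor = degradation_factor(eers)
```

Because the clean baseline is subtracted rather than divided out, a strong model with a tiny clean EER is not penalized the way it would be under the ratio-based CASE-Score v1.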
- Simulated degradations: Real codecs and microphones may differ from simulations
- VoxCeleb source: Original recordings already have some channel effects
- English-centric: VoxCeleb speakers are primarily English-speaking
- Single-condition trials: Each trial tests one degradation (not combined)
If you use the CASE Benchmark, please cite:
@misc{case_benchmark_2026,
  title={CASE Benchmark: Carrier-Agnostic Speaker Verification Evaluation},
  author={Gitter, Ben},
  year={2026},
  howpublished={\url{https://github.com/gittb/case-benchmark}}
}