Current speaker verification benchmarks (VoxCeleb, SITW, CN-Celeb) evaluate models on audio whose recording conditions resemble the training data. Real-world deployments face a different challenge: the same speaker may be encountered through vastly different acoustic carriers, whether a phone call one day, a video conference the next, or a recorded message played through laptop speakers later.
The CASE Benchmark directly measures this capability: Can your model identify the same speaker regardless of how their voice was transmitted?
For each speaker in the evaluation set, we create multiple versions of their utterances:
- Clean (original high-quality)
- Codec-degraded (GSM, G.711, Opus, etc.)
- Mic-filtered (webcam, headset, laptop mic simulations)
- Noise-added (various SNR levels)
- Reverb-added (simulated room acoustics)
- Playback chain (codec → speaker → room → microphone)
Trials are constructed to test cross-carrier matching: enrollment in clean condition, verification in degraded condition.
The CASE Benchmark is built on two datasets:
VoxCeleb1-O (official test set):
- 40 speakers, ~400 utterances
- Celebrity interviews with varied recording conditions
LibriSpeech test-clean:
- 40 speakers, ~392 utterances
- Read audiobook speech in quiet conditions
This combination provides diversity in speaking styles (spontaneous vs read speech) and original recording conditions. Using standard test sets keeps the evaluation fair and comparable: every model is scored on the same speakers under the same protocols, whether or not those speakers appeared in its training data.
Audio codecs compress speech for transmission, introducing artifacts and frequency limitations.
| Codec | Bitrate | Use Case |
|---|---|---|
| GSM-FR | 13 kbps | 2G mobile |
| G.711 μ-law | 64 kbps | North American PSTN |
| G.711 A-law | 64 kbps | European PSTN |
| Opus 6k | 6 kbps | Low-bandwidth VoIP |
| Opus 12k | 12 kbps | Standard VoIP |
| Opus 24k | 24 kbps | High-quality VoIP |
| MP3 32k | 32 kbps | Compressed storage |
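The benchmark presumably applies these codecs with real encoder implementations, but the character of the distortion can be illustrated with a minimal G.711 μ-law compand-quantize-expand roundtrip. `mu_law_roundtrip` is a hypothetical helper, not part of the benchmark, and real G.711 uses a segmented lookup table rather than the closed-form curve below:

```python
import numpy as np

def mu_law_roundtrip(x, mu=255, bits=8):
    """Approximate G.711 mu-law degradation: logarithmically compress the
    dynamic range, quantize to 8 bits, then expand back to linear.
    (Sketch only; real G.711 uses a segmented lookup table.)"""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand
    levels = 2 ** (bits - 1) - 1
    q = np.round(y * levels) / levels                          # 8-bit quantize
    return np.sign(q) * ((1 + mu) ** np.abs(q) - 1) / mu       # expand

t = np.linspace(0, 1, 8000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)   # 440 Hz tone at 8 kHz
coded = mu_law_roundtrip(clean)
```

The quantization error is small in absolute terms but audible, and it is exactly the kind of nonlinearity that shifts speaker embeddings between carriers.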
Consumer microphones have varying frequency responses that color the recorded audio.
| Profile | Characteristics |
|---|---|
| Webcam (budget) | Bass rolloff <200Hz, harsh HF |
| Webcam (quality) | Flatter response |
| USB Headset | Narrow band, speech-optimized |
| Laptop internal | Heavy bass rolloff, resonance |
| Phone | Telephony band (300-3400 Hz) |
| Smartphone (flagship) | Wide band, good response |
| Conference (ceiling) | Room coloration |
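The exact filter curves behind these labels are not specified here. As an idealized sketch, the "Phone" profile can be approximated with a brickwall FFT band-pass; real microphone simulations would use smooth EQ curves rather than hard cutoffs:

```python
import numpy as np

def phone_band_filter(audio, fs=16000, lo=300.0, hi=3400.0):
    """Idealized 'Phone' profile: zero all spectral content outside the
    classic 300-3400 Hz telephony band (a brickwall stand-in for a real
    microphone's smooth frequency response)."""
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(audio))

fs = 16000
t = np.arange(fs) / fs
rumble = np.sin(2 * np.pi * 100 * t)    # below the band: should be removed
speech = np.sin(2 * np.pi * 1000 * t)   # inside the band: should survive

def rms(x):
    return np.sqrt(np.mean(x ** 2))
```

Low-frequency content is removed entirely while in-band content passes unchanged, mimicking the bandwidth limitation in the table above.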
Additive background noise at various signal-to-noise ratios (SNR).
| SNR | Difficulty |
|---|---|
| 25 dB | Easy (quiet background) |
| 20 dB | Moderate |
| 15 dB | Challenging |
| 10 dB | Difficult |
| 5 dB | Very difficult |
Noise Source: DEMAND corpus (real environmental recordings: domestic, office, transport, etc.)
Data Leakage Warning: The benchmark uses DEMAND noise, NOT MUSAN. If you train with MUSAN noise augmentation, this provides proper train/eval separation. If you train with DEMAND, there will be overlap - consider using MUSAN for training instead.
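Independent of the noise source, mixing at a target SNR amounts to scaling the noise so the speech-to-noise power ratio matches the target. A minimal sketch; `mix_at_snr` is a hypothetical helper, not the benchmark's tooling:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`,
    then add it to the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=15)
```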
Room reverberation applied using real Room Impulse Responses (RIRs).
RIR Sources (for benchmark):
- OpenSLR-28: ~417 real RIRs from various rooms
- BUT ReverbDB: ~1,500 real RIRs from 8 rooms with multiple microphone positions
Data Leakage Warning: The benchmark uses real RIRs from OpenSLR-28 and BUT ReverbDB. To ensure proper train/eval separation:
- Recommended for training: Simulated RIRs using pyroomacoustics or OpenSLR-26
- Avoid for training: OpenSLR-28 and BUT ReverbDB (these are used in the benchmark)
This design ensures models are tested on real acoustic conditions they haven't seen during training.
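Mechanically, applying an RIR is a convolution of the dry signal with the impulse response. The sketch below uses a synthetic exponentially decaying impulse response in place of a measured one; the benchmark itself uses real RIRs:

```python
import numpy as np

def add_reverb(audio, rir):
    """Convolve dry audio with a room impulse response and trim to the
    original length, as is typical for reverb augmentation."""
    wet = np.convolve(audio, rir)
    return wet[: len(audio)]

# Synthetic stand-in for a measured RIR: direct path plus exponential decay
rng = np.random.default_rng(0)
rir = rng.standard_normal(2000) * np.exp(-np.arange(2000) / 300.0)
rir[0] = 1.0                  # direct path
rir /= np.max(np.abs(rir))    # normalize peak

dry = np.zeros(8000)
dry[100] = 1.0                # unit impulse as a toy "utterance"
wet = add_reverb(dry, rir)
```

Convolving a unit impulse simply reproduces the RIR starting at the impulse position, which makes the effect easy to verify.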
Full replay simulation: audio is played through a speaker and re-recorded. The chain applies, in order:
- Codec applied (GSM, G.711)
- Speaker simulation (laptop, phone speaker)
- Room acoustics (reverb + early reflections)
- Microphone recording (budget webcam, laptop mic)
- Noise added (SNR 15-25 dB)
This represents the hardest real-world scenario: a voice message played back and recorded.
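The chain above can be sketched as an ordered composition of stage functions. The stages below are toy stand-ins, not the benchmark's actual codec, speaker, room, or microphone models:

```python
import numpy as np

def apply_chain(audio, stages):
    """Run audio through an ordered list of degradation stages."""
    for stage in stages:
        audio = stage(audio)
    return audio

# Toy stand-ins for the real stages in the playback chain
rng = np.random.default_rng(42)
stages = [
    lambda x: np.clip(x * 1.5, -1.0, 1.0),                # codec saturation
    lambda x: np.convolve(x, [1.0, 0.4, 0.1])[: len(x)],  # speaker/room IR
    lambda x: x - x.mean(),                               # mic DC removal
    lambda x: x + 0.05 * rng.standard_normal(len(x)),     # additive noise
]

clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
degraded = apply_chain(clean, stages)
```

Because each stage feeds the next, errors compound, which is why this protocol is the hardest in the suite.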
- 10,000 trials per protocol (5,000 target + 5,000 impostor)
- 24 protocols = 240,000 total trials
- Balanced: Equal same-speaker and different-speaker pairs
Target trials (same speaker):
- Randomly sample enrollment-verification pairs from same speaker
- Enrollment always clean, verification always degraded
Impostor trials (different speaker):
- Same-gender constraint to increase difficulty
- Random pairing across speakers
All random operations use fixed seeds:
- Trial generation: seed 42
- Degradation application: documented per-file
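Under these rules, trial generation can be sketched as follows. `make_trials` and the toy speaker/gender data are illustrative, not the benchmark's actual tooling; the pairs are utterance IDs, with the clean/degraded versions selected per protocol:

```python
import random

def make_trials(utts, gender, n_target, n_impostor, seed=42):
    """utts: {speaker: [utterance ids]}; gender: {speaker: 'm' or 'f'}.
    Returns (enroll, verify, label) triples; label 1 = same speaker."""
    rng = random.Random(seed)          # fixed seed, per the spec
    speakers = sorted(utts)
    trials = []
    for _ in range(n_target):          # target: two utts, same speaker
        s = rng.choice(speakers)
        enroll, verify = rng.sample(utts[s], 2)
        trials.append((enroll, verify, 1))
    for _ in range(n_impostor):        # impostor: different speakers, same gender
        while True:
            a, b = rng.sample(speakers, 2)
            if gender[a] == gender[b]:
                break
        trials.append((rng.choice(utts[a]), rng.choice(utts[b]), 0))
    return trials

utts = {f"spk{i}": [f"spk{i}_u{j}" for j in range(3)] for i in range(4)}
gender = {"spk0": "m", "spk1": "m", "spk2": "f", "spk3": "f"}
trials = make_trials(utts, gender, n_target=5, n_impostor=5)
```

With a fixed seed, the same trial list is produced on every run, which is what makes results reproducible across evaluations.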
The error rate at the operating threshold where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR).
EER = FAR(t*) = FRR(t*), where t* is the crossover threshold
Lower is better. Reported as a percentage.
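A minimal EER computation over raw trial scores, sweeping every observed score as a candidate threshold (a sketch, not the benchmark's scoring code):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Sweep candidate thresholds and return the error rate at the point
    where FAR (impostors accepted) and FRR (targets rejected) are closest."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, imp]))
    # FRR(t): fraction of targets scoring below t (rejected)
    frr = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    # FAR(t): fraction of impostors scoring at or above t (accepted)
    far = np.array([(imp >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```

With finite trial counts the two step functions rarely cross exactly, so the midpoint at the closest threshold is a common convention.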
Minimum Detection Cost Function (minDCF): the NIST SRE standard metric, with:
- P_target = 0.01
- C_miss = 1
- C_fa = 1
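This cost, commonly reported as minDCF, can be sketched as below. Normalizing by the cost of the best trivial (accept-all or reject-all) system follows the usual NIST convention, which is assumed here:

```python
import numpy as np

def min_dcf(target_scores, impostor_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum of the detection cost over all thresholds, normalized by
    the cost of the best trivial (accept-all / reject-all) system."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every score, plus one above the max (reject all)
    thresholds = np.concatenate([np.sort(np.concatenate([tgt, imp])), [np.inf]])
    p_miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    p_fa = np.array([(imp >= t).mean() for t in thresholds])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
```

With P_target = 0.01, misses are weighted far less than in a balanced setting, so minDCF stresses the low-false-alarm operating region that EER ignores.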
The primary robustness metric measures how much performance is lost due to carrier effects:
Degradation Factor = Absolute EER − Clean EER
Where:
- Absolute EER = weighted average EER across all protocols
- Clean EER = EER on the clean_clean protocol (baseline)
Interpretation:
- Lower is better (more robust to carrier effects)
- A model with +2% degradation loses 2 percentage points due to carriers
- Independent of baseline performance - directly measures carrier susceptibility
Note on CASE-Score v1: An earlier metric used normalized ratios (EER_degraded / EER_clean), but this can misleadingly reward models with poor baselines. See Metrics for full discussion.
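The definition above can be computed directly from per-protocol EERs. Uniform protocol weights, and the protocol names other than clean_clean, are assumptions of this sketch:

```python
import numpy as np

def degradation_factor(protocol_eers, clean_key="clean_clean", weights=None):
    """protocol_eers: {protocol: EER in %}. Absolute EER is the (weighted)
    mean over all protocols; the factor is its gap to the clean baseline.
    Uniform weights are assumed; the benchmark's weighting may differ."""
    keys = sorted(protocol_eers)
    eers = np.array([protocol_eers[k] for k in keys])
    w = np.ones(len(keys)) if weights is None else np.array([weights[k] for k in keys])
    absolute_eer = float(np.sum(w * eers) / np.sum(w))
    return absolute_eer - protocol_eers[clean_key]

# Hypothetical protocol EERs: absolute EER = 3.0%, clean EER = 1.0%
eers = {"clean_clean": 1.0, "clean_gsm": 3.0, "clean_noise15": 5.0}
factor = degradation_factor(eers)
```

Because the clean baseline is subtracted rather than divided out, a strong model with a tiny clean EER is not penalized the way it would be under the ratio-based CASE-Score v1.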
- Simulated degradations: Real codecs and microphones may differ from simulations
- VoxCeleb source: Original recordings already have some channel effects
- English-centric: VoxCeleb speakers are primarily English-speaking
- Single-condition trials: Each trial tests one degradation (not combined)
If you use the CASE Benchmark, please cite:
@misc{case_benchmark_2026,
  title={CASE Benchmark: Carrier-Agnostic Speaker Verification Evaluation},
  author={Gitter, Ben},
  year={2026},
  howpublished={\url{https://github.com/gittb/case-benchmark}}
}