This is the official repository for the paper: Bayesian Speech Synthesizers Can Learn from Multiple Teachers.
This codebase provides training, data preparation, and evaluation pipelines for BELLE, and also reproduces MELLE.
- End‑to‑end training and evaluation pipelines aligned with the paper.
- Modular data pipeline with optional TTS‑based augmentation.
- Multi‑model training scripts for BELLE / MELLE / BELLE‑stream.
- Zero‑shot TTS evaluation suite under evaluate-zero-shot-tts.
BELLE reframes TTS as Bayesian inference rather than deterministic regression. It models acoustic targets with a Normal‑Inverse‑Gamma distribution to capture data‑dependent aleatoric uncertainty without increasing parameters or inference latency. To learn reliable variance from single‑reference datasets, BELLE introduces a one‑to‑many training strategy that leverages synthetic samples as a statistical support set. The framework naturally supports high‑quality streaming generation.
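To make the idea concrete, the sketch below shows what an evidential output head looks like in the spirit of deep evidential regression: instead of predicting a mel frame directly, the model predicts the four Normal‑Inverse‑Gamma parameters, from which the aleatoric variance follows in closed form. Module and dimension names are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Illustrative NIG head: maps decoder states to the four
    Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta) per mel bin."""

    def __init__(self, hidden_dim: int, mel_dim: int = 80):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 4 * mel_dim)

    def forward(self, h: torch.Tensor):
        gamma, nu, alpha, beta = self.proj(h).chunk(4, dim=-1)
        nu = F.softplus(nu)              # nu > 0
        alpha = 1.0 + F.softplus(alpha)  # alpha > 1 keeps the variance finite
        beta = F.softplus(beta)          # beta > 0
        return gamma, nu, alpha, beta

# The expected aleatoric variance is available in closed form, so no
# sampling is needed at inference time:  E[sigma^2] = beta / (alpha - 1)
```

The training objective is the standard NIG negative log‑likelihood of evidential regression; see the paper for the exact loss used here.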
Architecture at a glance: see the model overview figure under Figures/.
Key contributions (summary):
- Bayesian evidential learning for continuous AR TTS with uncertainty modeling.
- One‑to‑many training with multiple teachers to improve robustness.
- Strong results at a smaller data scale, plus streaming capability.
BELLE/
├─ belle/ # Core library (models, modules, utils)
├─ egs/librispeech/ # Data prep + training entrypoints
├─ evaluate-zero-shot-tts/ # Zero‑shot evaluation suite
├─ tts-launch/ # TTS augmentation launch scripts
├─ scripts/ # High‑level run scripts (train/eval/env)
├─ pretrained/ # Pretrained checkpoints (if any)
└─ Figures/ # Paper figures (optional)
The environment setup script is provided at scripts/setup.sh.
It creates a conda environment, installs dependencies, pulls k2, installs icefall, and sets up evaluation dependencies.
⚠️ Note: The k2 wheel must match your CUDA and PyTorch versions.
See the comment inside scripts/setup.sh for guidance.
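A quick way to read off the versions you need before choosing a k2 wheel (assuming PyTorch is already installed):

```python
# Print the local PyTorch / CUDA versions to pick a matching k2 wheel.
import torch

print("torch:", torch.__version__)   # e.g. 2.1.0
print("cuda:", torch.version.cuda)   # e.g. 12.1 (None on CPU-only builds)
print("cuda available:", torch.cuda.is_available())
```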
Please download the following pretrained weights and place them under the pretrained directory as indicated:
- Vocoder (HiFi-GAN, LibriTTS):
- ASR (Conformer WER):
- ASR (HuBERT WER):
- Speaker Similarity (WavLM‑Uni):
- MOS (UTMOS):
Path configuration note: If you place these files in a different location, update the paths in:
- belle/data/tokenizer.py (vocoder path)
- evaluate-zero-shot-tts/evaluate_new.py (ASR / SIM model paths)
- evaluate-zero-shot-tts/utmos/predict_speechmos.py (MOS model path)
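A quick sanity check before running anything is to verify the weights are where the code expects them. The filenames below are placeholders, not the actual checkpoint names; substitute whatever you downloaded.

```python
from pathlib import Path

# Placeholder filenames -- replace with the checkpoints you actually downloaded.
expected = [
    Path("pretrained/hifigan_generator.ckpt"),  # vocoder
    Path("pretrained/wavlm_uni.pt"),            # speaker-similarity model
    Path("pretrained/utmos.ckpt"),              # MOS predictor
]
missing = [p for p in expected if not p.exists()]
if missing:
    raise FileNotFoundError(f"Missing pretrained weights: {missing}")
```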
First, download LibriSpeech from the official page (https://www.openslr.org/12) and extract it to:
- egs/librispeech/download/LibriSpeech (see Step 0 in the script below)
The end‑to‑end data pipeline is implemented in:
Overview of steps:
- Prepare LibriSpeech manifests and tokenize text (G2P phonemes).
- Apply VAD and regenerate manifests.
- Filter by duration (see the sketch after this list).
- (Optional) TTS‑based augmentation using multiple models.
- VAD on synthesized audio and filter failed items.
- Prepare training cuts for BELLE / MELLE.
- Prepare prompt‑augmented cuts for BELLE‑stream.
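As a concrete picture of the duration‑filtering step, here is a minimal sketch using lhotse (the manifest/cut library behind icefall‑style recipes, which matches the "cuts" terminology above). The filename and duration bounds are illustrative, not the pipeline's actual values:

```python
from lhotse import CutSet

# Load a cut manifest, keep utterances within a duration window, and save.
cuts = CutSet.from_file("egs/librispeech/data/cuts_train.jsonl.gz")  # illustrative path
cuts = cuts.filter(lambda c: 1.0 <= c.duration <= 20.0)              # keep 1-20 s cuts
cuts.to_file("egs/librispeech/data/cuts_train_filtered.jsonl.gz")
```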
Training is driven by the scripts below (each script loads the matching settings file):
- BELLE: scripts/run_belle.sh → scripts/settings_belle.sh
- MELLE: scripts/run_melle.sh → scripts/settings_melle.sh
- BELLE‑stream: scripts/run_stream.sh → scripts/settings_stream.sh
All training scripts invoke egs/librispeech/bin/trainer.py via torchrun for distributed training.
Typical workflow:
- Run the data pipeline.
- Start training with the corresponding run script listed above.
- For BELLE‑stream, initialize from the last BELLE checkpoint and set start_epoch=2, train_stage=2 (already reflected in scripts/settings_stream.sh).
Evaluation is managed under evaluate-zero-shot-tts.
The pipeline covers both inference and evaluation, and supports the following metrics:
- WER (HuBERT / Conformer ASR backends)
- MOS (UTMOS)
- Similarity (WavLM‑Uni: sim_o, sim_r; see the sketch after this list)
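For the similarity metric, sim_o scores the generated speech against the original reference and sim_r against the vocoder‑resynthesized reference (the usual VALL‑E‑style convention; confirm against evaluate-zero-shot-tts/README.md). Both reduce to cosine similarity between speaker embeddings; a minimal sketch, with the embeddings assumed to come from a model such as WavLM‑Uni:

```python
import torch
import torch.nn.functional as F

def speaker_similarity(emb_gen: torch.Tensor, emb_ref: torch.Tensor) -> float:
    """Cosine similarity between two fixed-size speaker embeddings."""
    return F.cosine_similarity(emb_gen, emb_ref, dim=-1).item()

# sim_o: emb_ref extracted from the original prompt audio
# sim_r: emb_ref extracted from the vocoder-resynthesized prompt audio
```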
Download the evaluation dataset here:
Place it under the project root with the following structure (excerpt):
evaluate-zero-shot-tts/evalset
├── librispeech-test-clean
│ ├── exp_aligned_pl3_r3
│ │ ├── {FILE_NAME}_wav_c_{TRIAL_ID}.txt
│ │ ├── {FILE_NAME}_wav_c_{TRIAL_ID}.wav
│ │ ├── {FILE_NAME}_wav_g.txt
│ │ ├── {FILE_NAME}_wav_g.wav
│ │ ├── {FILE_NAME}_wav_p_{TRIAL_ID}.txt
│ │ ├── {FILE_NAME}_wav_p_{TRIAL_ID}.wav
│ │ ├── {FILE_NAME}_wav_pg_{TRIAL_ID}.txt
│ │ ├── {FILE_NAME}_wav_pg_{TRIAL_ID}.wav
│ │ ...
│ └── exp_base_pl3_r3
│ ...
We made several modifications to the original repository keonlee9420/evaluate-zero-shot-tts (e.g., adding Conformer‑based WER and UTMOS support) to better fit the BELLE evaluation pipeline. For full details, see evaluate-zero-shot-tts/README.md.
If you use this repository, please cite the paper:
@article{zhang2025bayesian,
  title={Bayesian Speech Synthesizers Can Learn from Multiple Teachers},
  author={Zhang, Ziyang and Gao, Yifan and Xu, Xuenan and Wu, Wen and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2510.24372},
  year={2025}
}
See LICENSE.
We heavily referenced the implementation from lifeiteng/vall-e for the training framework and keonlee9420/evaluate-zero-shot-tts for the evaluation framework. This project also builds upon icefall and related open‑source speech toolkits.
