This directory contains scripts and configs for training, fine-tuning, and evaluating OmniVoice.
| Use Case | Script | Description |
|---|---|---|
| Training from scratch | run_emilia.sh | Full pipeline on the Emilia dataset (data check, tokenization, training) |
| Fine-tuning | run_finetune.sh | Fine-tune from a pretrained checkpoint using your own JSONL data |
| Evaluation | run_eval.sh | Evaluate WER, speaker similarity, and UTMOS on standard test sets |
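run_eval.sh computes these metrics with proper ASR and speaker models; as a refresher on the headline metric, word error rate is the word-level edit (Levenshtein) distance divided by the reference length. A minimal illustrative sketch, not the project's actual implementation:

```python
# Minimal WER sketch (illustrative only; run_eval.sh uses a full ASR pipeline).
# WER = (substitutions + insertions + deletions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution / six words
```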
run_emilia.sh runs the full pipeline in 3 stages:
| Stage | What it does |
|---|---|
| 0 | Verify the Emilia dataset and JSONL manifests are in place |
| 1 | Tokenize audio into WebDataset shards |
| 2 | Launch multi-GPU training with accelerate |
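The stage selection follows the usual Kaldi-style gating pattern: a stage runs only if its number falls inside `[stage, stop_stage]`. A sketch of that pattern (the stage bodies here are placeholders, not the actual contents of run_emilia.sh):

```python
# Illustrative sketch of stage/stop_stage gating as used by Kaldi-style
# pipeline scripts; the stage descriptions mirror run_emilia.sh, but the
# bodies are placeholders.
stage = 0       # first stage to run
stop_stage = 2  # last stage to run

def maybe_run(n: int, description: str, executed: list) -> None:
    """Run stage n only if it falls inside [stage, stop_stage]."""
    if stage <= n <= stop_stage:
        executed.append(n)
        print(f"Stage {n}: {description}")

executed = []
maybe_run(0, "verify the Emilia dataset and JSONL manifests", executed)
maybe_run(1, "tokenize audio into WebDataset shards", executed)
maybe_run(2, "launch multi-GPU training with accelerate", executed)
# With stage=0 and stop_stage=2, all three stages run; setting both to 1
# would run only the tokenization stage.
```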
Prerequisites:
- Download the Emilia dataset from OpenXLab and place it under `download/`:

  ```
  download/Amphion___Emilia
  └── raw
      ├── EN
      └── ZH
  ```

- Obtain JSONL manifests and place them in `data/emilia/manifests/`:
  - `emilia_en_train.jsonl`, `emilia_en_dev.jsonl`
  - `emilia_zh_train.jsonl`, `emilia_zh_dev.jsonl`
You can generate them from the raw data, or download pre-processed manifests from HuggingFace.
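Before launching the pipeline, you can check the manifests yourself; a quick sketch (the paths simply match the layout described above):

```python
from pathlib import Path

# Expected manifest files, matching the layout described above.
manifest_dir = Path("data/emilia/manifests")
expected = [
    "emilia_en_train.jsonl", "emilia_en_dev.jsonl",
    "emilia_zh_train.jsonl", "emilia_zh_dev.jsonl",
]
missing = [name for name in expected if not (manifest_dir / name).exists()]
if missing:
    print("Missing manifests:", ", ".join(missing))
else:
    print("All manifests in place.")
```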
Run the full pipeline:
```bash
bash examples/run_emilia.sh
```

Or run individual stages by setting `stage` and `stop_stage` at the top of the script (e.g. `stage=1`, `stop_stage=1` to run only tokenization).
See docs/training.md for config details, checkpoint resuming, and TensorBoard monitoring.
run_finetune.sh fine-tunes from a pretrained checkpoint on your own data.
Create a JSONL manifest where each line describes one audio sample:
```json
{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"}
```

`id`, `audio_path`, and `text` are mandatory; `language_id` is optional.
See docs/data_preparation.md for the full data format specification.
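A manifest in this format can be written and sanity-checked with a few lines of Python. This is only an illustrative sketch; docs/data_preparation.md remains the authoritative spec:

```python
import json

# Illustrative: write a two-line JSONL manifest and validate the
# mandatory fields (id, audio_path, text); language_id is optional.
samples = [
    {"id": "sample_001", "audio_path": "/data/audio/001.wav",
     "text": "Hello world", "language_id": "en"},
    {"id": "sample_002", "audio_path": "/data/audio/002.wav",
     "text": "你好世界", "language_id": "zh"},
]

with open("my_data_train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Sanity check: every line must parse as JSON and carry the mandatory keys.
with open("my_data_train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        assert {"id", "audio_path", "text"} <= record.keys()
```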
Edit the variables at the top of run_finetune.sh:
```bash
TRAIN_JSONL="data/my_data_train.jsonl"   # path to training JSONL
DEV_JSONL="data/my_data_dev.jsonl"       # path to dev JSONL
GPU_IDS="0,1"                            # GPUs to use
NUM_GPUS=2                               # number of GPUs
OUTPUT_DIR="exp/omnivoice_finetune"      # output directory
```

Then run:

```bash
bash examples/run_finetune.sh
```

The script will:
- Tokenize your audio into WebDataset shards
- Launch fine-tuning with `accelerate`
The main differences between the fine-tuning config (`config/train_config_finetune.json`) and the Emilia training config (`config/train_config_emilia.json`) are:
| Parameter | Emilia (from scratch) | Fine-tune | Why |
|---|---|---|---|
| `init_from_checkpoint` | `null` | `"k2-fsa/OmniVoice"` | Load pretrained weights |
| `steps` | 300,000 | 5,000 | Fewer steps suffice for fine-tuning; tune to your data/task |
| `learning_rate` | 1e-4 | 5e-5 | Lower LR for fine-tuning; tune to your data/task |
To use a different pretrained checkpoint, modify init_from_checkpoint in the config file.
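That override amounts to changing one JSON field. A sketch of the edit (the keys and defaults come from the table above; the config is built inline here as a stand-in, and the checkpoint name is hypothetical — in practice you would load and edit `config/train_config_finetune.json`):

```python
import json

# Stand-in for config/train_config_finetune.json, using the values from
# the table above; in practice, load the real file instead.
config = {
    "init_from_checkpoint": "k2-fsa/OmniVoice",
    "steps": 5000,
    "learning_rate": 5e-5,
}

# Point fine-tuning at a different pretrained checkpoint (hypothetical name).
config["init_from_checkpoint"] = "my-org/my-pretrained-checkpoint"

with open("train_config_finetune.json", "w") as f:
    json.dump(config, f, indent=2)
```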
Install evaluation dependencies first:
```bash
pip install omnivoice[eval]
# or
uv sync --extra eval
```

Supported test sets: `librispeech_pc`, `seedtts_en`, `seedtts_zh`, `fleurs`, `minimax`.

```bash
bash examples/run_eval.sh
```

See docs/evaluation.md for metrics details, test set preparation, and running individual metrics.