OmniVoice Examples

This directory contains scripts and configs for training, fine-tuning, and evaluating OmniVoice.

Use Case	Script	Description
Training from scratch	run_emilia.sh	Full pipeline on the Emilia dataset (data check, tokenization, training)
Fine-tuning	run_finetune.sh	Fine-tune from a pretrained checkpoint using your own JSONL data
Evaluation	run_eval.sh	Evaluate WER, speaker similarity, and UTMOS on standard test sets

Training from Scratch (Emilia)

run_emilia.sh runs the full pipeline in 3 stages:

Stage	What it does
0	Verify the Emilia dataset and JSONL manifests are in place
1	Tokenize audio into WebDataset shards
2	Launch multi-GPU training with `accelerate`

Prerequisites:

Download the Emilia dataset from OpenXLab and place it under download/:

download/Amphion___Emilia
└── raw
    ├── EN
    └── ZH

Obtain JSONL manifests and place them in data/emilia/manifests/:
- emilia_en_train.jsonl, emilia_en_dev.jsonl
- emilia_zh_train.jsonl, emilia_zh_dev.jsonl
You can generate them from the raw data, or download pre-processed manifests from HuggingFace.

Run the full pipeline:

bash examples/run_emilia.sh

Or run individual stages by setting stage and stop_stage at the top of the script (e.g. stage=1, stop_stage=1 to only tokenize).

See docs/training.md for config details, checkpoint resuming, and TensorBoard monitoring.

Fine-tuning

run_finetune.sh fine-tunes from a pretrained checkpoint on your own data.

Step 1: Prepare Your Data

Create a JSONL manifest where each line describes one audio sample:

{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"}

id, audio_path, and text are mandatory. language_id is optional.

See docs/data_preparation.md for the full data format specification.

Step 2: Configure the Script

Edit the variables at the top of run_finetune.sh:

TRAIN_JSONL="data/my_data_train.jsonl"   # path to training JSONL
DEV_JSONL="data/my_data_dev.jsonl"       # path to dev JSONL
GPU_IDS="0,1"                            # GPUs to use
NUM_GPUS=2
OUTPUT_DIR="exp/omnivoice_finetune"      # output directory

Step 3: Run

bash examples/run_finetune.sh

The script will:

Tokenize your audio into WebDataset shards
Launch fine-tuning with accelerate

Main difference between fine-tuning config (config/train_config_finetune.json) and the Emilia training config (config/train_config_emilia.json) are:

Parameter	Emilia (from scratch)	Fine-tune	Why
`init_from_checkpoint`	`null`	`"k2-fsa/OmniVoice"`	Load pretrained weights
`steps`	300,000	5,000	Fewer steps for fine-tuning, can be tuned according to your data/task.
`learning_rate`	1e-4	5e-5	Lower LR for fine-tuning, can be tuned according to your data/task

To use a different pretrained checkpoint, modify init_from_checkpoint in the config file.

Evaluation

Install evaluation dependencies first:

pip install omnivoice[eval]
# or
uv sync --extra eval

Supported test sets: librispeech_pc, seedtts_en, seedtts_zh, fleurs, minimax.

bash examples/run_eval.sh

See docs/evaluation.md for metrics details, test set preparation, and running individual metrics.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OmniVoice Examples

Training from Scratch (Emilia)

Fine-tuning

Step 1: Prepare Your Data

Step 2: Configure the Script

Step 3: Run

Evaluation

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

OmniVoice Examples

Training from Scratch (Emilia)

Fine-tuning

Step 1: Prepare Your Data

Step 2: Configure the Script

Step 3: Run

Evaluation