Implementation of a non-autoregressive Transformer-based neural network for Text-to-Speech (TTS).
This repository is managed by TartuNLP, and it is a fork of the implementation by Axel Springer. Our contributions compared to the original repository are:
- Support for grapheme-based synthesis
- Multi-speaker synthesis
- Pretrained models for Estonian
- Open source TTS applications:
- Numerous minor changes to streamline training and make the repository easier to use with new datasets.
When using this repository or models for research, please cite the following paper:
@article{R2tsep_2022,
title = {Estonian Text-to-Speech Synthesis with Non-autoregressive Transformers},
author = {Liisa R\"{a}tsep and Rasmus Lellep and Mark Fishel},
journal = {Baltic Journal of Modern Computing},
volume = {10},
number = {3},
year = 2022
}
The original code is based, among others, on the following papers:
- Neural Speech Synthesis with Transformer Network
- FastSpeech: Fast, Robust and Controllable Text to Speech
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
- FastPitch: Parallel Text-to-speech with Pitch Prediction
The models are compatible with the pre-trained vocoders:
Being non-autoregressive, this Transformer model is:
- Robust: No repeats and failed attention modes for challenging sentences.
- Fast: With no autoregression, predictions take a fraction of the time.
- Controllable: It is possible to control the speed and pitch of the generated utterance.
Estonian and multispeaker samples can be found on the samples page.
Samples from the original implementation can be found on the original samples page.
- 05/26: Made into installable library and added a CLI (TartuNLP)
- 06/22: Multi-speaker synthesis (TartuNLP)
- 05/22: Merged updates from the original repository (TartuNLP)
- 06/21: Grapheme-based synthesis and Estonian models (TartuNLP)
- 06/20: Added normalisation and pre-trained models compatible with the faster MelGAN vocoder.
- 11/20: Added pitch prediction. The autoregressive model is now specialized as an Aligner, and Forward is now the only TTS model. Changed model architectures. Discontinued WaveRNN support. Improved duration extraction with Dijkstra's algorithm.
- 03/20: Vocoding branch.
The repository can be installed with pip:
pip install git+https://github.com/TartuNLP/TransformerTTS.git
For a specific version, append the tag name:
pip install git+https://github.com/TartuNLP/TransformerTTS.git@v2.0.0
Or install locally from the source code:
git clone https://github.com/TartuNLP/TransformerTTS.git
cd TransformerTTS
pip install -e .
You can directly use LJSpeech to create the training dataset.
- If training on LJSpeech, or if unsure, simply use config/training_config.yaml to create MelGAN or HiFiGAN compatible models.
- Use the command line flags to specify the dataset location and where preprocessed data, logs and model files should be saved. Information about configuration flags can be seen with the -h flag of each script.
Prepare a folder containing your metadata and wav files, for instance
dataset_folder/
├── metadata.csv
└── wavs/
├── file_1.wav
├── ...
└── file_n.wav
If metadata.csv has the following format
wav_file_name|transcription
or
wav_file_name|transcription|speaker_id
you can use the ljspeech preprocessor in data/metadata_readers.py; otherwise, add your own reader to the same file.
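As an illustration of the pipe-delimited format above, the following is a minimal sketch of how such a file could be parsed. The function name `read_pipe_metadata`, its return type, and the default speaker id of "0" are assumptions made for this example; a reader added to data/metadata_readers.py may need a different signature, so check the existing readers in that file first.

```python
import csv
from pathlib import Path
from typing import Dict, Tuple, Union


def read_pipe_metadata(metadata_path: Union[str, Path]) -> Dict[str, Tuple[str, str]]:
    """Map each wav file name to (transcription, speaker_id).

    Hypothetical helper: the name, signature and return type are assumptions
    for illustration and may differ from what data/metadata_readers.py expects.
    """
    entries = {}
    with open(metadata_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            wav_name, text = row[0], row[1]
            # Assumption: rows without a third column get speaker id "0".
            speaker_id = row[2] if len(row) > 2 else "0"
            entries[wav_name] = (text, speaker_id)
    return entries
```

Calling `read_pipe_metadata("dataset_folder/metadata.csv")` would return a dictionary keyed by wav file name, which makes it easy to spot missing or duplicated entries before preprocessing.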
Make sure that:
- the metadata reader function name is the same as the metadata_reader field in training_config.yaml.
- the metadata file (can be anything) is specified under metadata_path in training_config.yaml.
- for multispeaker training, review the multispeaker and n_speakers values.
- to disable phonemization, edit the text_settings section of the configuration file.
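The checklist above can be expressed as a small sanity check on an already loaded configuration. This is only a sketch: `check_training_config` is a hypothetical helper that is not part of the repository, it takes the configuration as a plain dictionary, and it assumes that multispeaker is a truthy flag and n_speakers a positive integer. The actual value types in training_config.yaml may differ.

```python
from typing import Any, Dict, Iterable, List


def check_training_config(config: Dict[str, Any], reader_names: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems found in the config dict.

    Hypothetical helper for illustration; not part of the repository.
    """
    problems = []
    readers = set(reader_names)
    # The reader named in the config must exist among the known reader functions.
    if config.get("metadata_reader") not in readers:
        problems.append("metadata_reader does not match a known reader function")
    if not config.get("metadata_path"):
        problems.append("metadata_path is not set")
    # Assumption: multispeaker is a truthy flag and n_speakers an integer count.
    if config.get("multispeaker"):
        n_speakers = config.get("n_speakers")
        if not isinstance(n_speakers, int) or n_speakers < 1:
            problems.append("multispeaker training needs a positive n_speakers")
    return problems
```

An empty list means none of the checked fields looks wrong; it does not guarantee that the rest of the configuration is valid.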
Change the --config argument based on the configuration of your choice.
transformer-tts train \
--config $CONFIG_FILE_PATH \
--save-directory $MODEL_PATH \
--mel-directory $DATA_PATH/mels \
--pitch-directory $DATA_PATH/pitch \
--duration-directory $DATA_PATH/durations \
--character-pitch-directory $DATA_PATH/char-pitch \
--test-files $TEST_FILES
To resume training, simply use the same command with the same configuration and model path.
Training and model settings can be configured in training_config.yaml
To monitor training with TensorBoard:
tensorboard --logdir $MODEL_PATH/logs
To export a trained model:
transformer-tts save_model \
--config $CONFIG_FILE_PATH \
--save-directory $MODEL_PATH \
--checkpoint-path $CHECKPOINT_PATH \
--target-dir $WEIGHTS_PATH
The model will be saved as an mdl.keras file in the specified target directory. If no target directory is specified, the weights will be saved in the model root directory. If no checkpoint path is specified, the latest checkpoint will be used.
Prediction can be done with the transformer-tts predict command; for the full specification, check the command's help flag:
transformer-tts predict -h
Alternatively, you can load the model directly in your own code:
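import tensorflow as tf
from transformer_tts.model import ForwardTransformer

# Load the exported model; the custom class must be passed so Keras can deserialize it.
model = tf.keras.models.load_model(
    "mdl.keras",
    custom_objects={"ForwardTransformer": ForwardTransformer},
)
# sentence, speed and speaker_id must be defined by the caller before this point.
tts_out = model.predict(sentence, speed_regulator=speed, speaker_id=speaker_id)
# Convert the predicted mel spectrogram to a NumPy array and transpose it.
mel_spec = tts_out["mel"].numpy().T
Newer models are added to the Releases of this repository.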
TartuNLP - the NLP research group at the University of Tartu.
Francesco Cardinale from Axel Springer for the original implementation.
MelGAN and WaveRNN: data normalization and samples' vocoders are from these repos.
Erogol and the Mozilla TTS team for the lively exchange on the topic.
See LICENSE for details.