With the advances in Automatic Speech Recognition (ASR) for high-resource languages, this paper aims to open the door to understanding the magnitude of data needed for speech recognition tasks.
In this paper, we compare Wav2Vec2XLSR (56K hours of data) against Wav2Vec2IPA, a fine-tuned Wav2Vec2-based architecture model trained on <5 hours of data from the TIMIT speech corpus (0.5% of baseline training data). From our evaluations, we find that the baseline model outperforms the smaller data model by 8.3 percentage points with respect to Phoneme Error Rate.
In addition, to address a gap in the availability of the TIMIT speech corpus, we also introduce 2 new datasets on HuggingFace Hub: `kylelovesllms/timit_asr` and `kylelovesllms/timit_asr_ipa`.
Additionally, this research on smaller models intends to address current issues in Natural Language Processing (NLP) surrounding environmental sustainability and democratizing access to AI with the advent of small machine learning models.
Click here for a full writeup on the methods, motivation, and evaluation.
The 3-step pipeline above, (1) Preprocess, (2) Fine-Tune, and (3) Evaluate, is implemented through notebooks, datasets, and models.
- Each of the 3 phases of the pipeline is encapsulated in a notebook file
| Pipeline Step | Notebook |
|---|---|
| Data Preprocessing | preprocessing/split_data.ipynb |
| Training | training/fine_tune_w2v2.ipynb |
| Evaluation (Baseline) | evaluation/eval_model_xlsr.ipynb |
| Evaluation (Wav2Vec2IPA) | evaluation/eval_model_w2v2ipa.ipynb |
- We expand on the existing HuggingFace `timit-asr/timit_asr` and contribute two HuggingFace datasets in this project:
| Name | Description |
|---|---|
| `kylelovesllms/timit_asr` | Implementation of TIMIT dataset with Test and Train split |
| `kylelovesllms/timit_asr_ipa` | Implementation of `kylelovesllms/timit_asr` with Validation Split and IPA Transcriptions |
- Similar to the datasets, the models used/trained in this project live in HuggingFace Hub
| Models | Description |
|---|---|
| `facebook/XLSR-Wav2Vec2` | Baseline Evaluation Model |
| `facebook/wav2vec2-base` | Pretrained Wav2Vec2 Model |
| `Wav2Vec2IPA` and `Wav2Vec2IpaTokenizer` | Fine-tuned Wav2Vec2-Base model trained in this repository |
Although HuggingFace has an implementation of the TIMIT database, timit-asr/timit_asr, there are problems that prevent us from using the HuggingFace Hub dataset directly:
- TIMIT transcriptions are phonetic but not IPA (TIMIT has its own transcription system closely related to IPA)
- The original TIMIT dataset contains only `Train`/`Test` dataset splits but no `Validation` dataset to tune hyper-parameters
- HuggingFace `timit-asr/timit_asr` uses only `1/5` of the entire speech corpus (`<1 hour`)
  - Since the dataset is `<5 hours` in total, using `1/5` of the dataset drastically impacts model performance
- HuggingFace `timit-asr/timit_asr` uses a deprecated dataset API, requiring users to download the audio files via 3rd party zip
To address these issues, we contribute two datasets to HuggingFace Hub:
- `kylelovesllms/timit_asr`
  - Reimplementation of HuggingFace `timit_asr` with the full `Train`/`Test` dataset (5 hours total)
  - Stores TIMIT database in native HuggingFace Hub Dataset API (Parquet format)
- `kylelovesllms/timit_asr_ipa`
  - Builds off of `timit_asr` and adds:
    - IPA transcriptions used for Wav2Vec2 Base fine-tuning
    - Stratified Validation Dataset
      - The validation dataset splits the Test Dataset in half following the TIMIT recommendation of having unique speakers between training and validation/test dataset
      - Stratifies speaker population based on `speaker sex` and `speaker dialect region`
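
The stratified Test/Validation split described above can be sketched roughly as follows. This is a minimal illustration, not the code in `timit_dataset_splitter.py`: the record keys (`speaker_id`, `sex`, `dialect_region`), the function name, and the toy speaker table are all assumptions. Splitting by speaker rather than by utterance keeps every speaker in exactly one split.

```python
import random
from collections import defaultdict


def split_speakers(speakers, seed=0):
    """Split speakers into (test, validation) halves, stratified by sex and dialect.

    `speakers` is a list of dicts with the hypothetical keys
    'speaker_id', 'sex', and 'dialect_region'.
    """
    # Group speaker IDs by stratum so each (sex, dialect) pair is halved separately.
    strata = defaultdict(list)
    for spk in speakers:
        strata[(spk["sex"], spk["dialect_region"])].append(spk["speaker_id"])

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    test_ids, val_ids = set(), set()
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        half = len(ids) // 2  # odd-sized strata leave the extra speaker in test
        val_ids.update(ids[:half])
        test_ids.update(ids[half:])
    return test_ids, val_ids


# Toy speaker table: 2 sexes x 2 dialect regions x 4 speakers each.
speakers = [
    {"speaker_id": f"{sex}{dr}{i}", "sex": sex, "dialect_region": dr}
    for sex in ("F", "M")
    for dr in ("DR1", "DR2")
    for i in range(4)
]
test_ids, val_ids = split_speakers(speakers)
```

Utterances would then be assigned to a split by looking up their speaker ID, so no speaker appears in both Test and Validation.
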
- To (re)create the datasets, run the `preprocessing/split_data.ipynb` notebook
- The table below lists the utils used in `split_data.ipynb`
| Dependency | Description |
|---|---|
| `timit_metadata_extractors.py` | Handles metadata extraction (speaker sex, dialect region, audio duration) and extracts transcriptions at the sentence, word, and phoneme level |
| `timit_ipa_translation.py` | Helper to `timit_metadata_extractors` for handling TIMIT transcription to IPA mapping |
| `timit_dataset_splitter.py` | Helper to `timit_metadata_extractors` to stratify the Test dataset into Test and Validation |
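
To give a feel for the TIMIT-to-IPA step, here is a small sketch of a lookup-based translation. The table is an illustrative subset, not the mapping in `timit_ipa_translation.py`, and the choice to drop silence and stop-closure labels is an assumption made for this example.

```python
# Illustrative subset of a TIMIT-to-IPA table; NOT the repository's actual mapping.
TIMIT_TO_IPA = {
    "sh": "ʃ", "iy": "i", "ae": "æ", "dh": "ð", "th": "θ",
    "ng": "ŋ", "ch": "tʃ", "jh": "dʒ", "s": "s", "k": "k", "t": "t",
}

# Assumption: silence markers and stop closures have no IPA symbol and are dropped.
SKIP = {"h#", "pau", "epi", "bcl", "dcl", "gcl", "pcl", "tcl", "kcl"}


def timit_to_ipa(phones):
    """Translate a list of TIMIT phone labels into an IPA string."""
    out = []
    for p in phones:
        if p in SKIP:
            continue
        if p not in TIMIT_TO_IPA:
            # Fail loudly so gaps in the table are noticed during preprocessing.
            raise KeyError(f"no IPA mapping for TIMIT phone {p!r}")
        out.append(TIMIT_TO_IPA[p])
    return "".join(out)


ipa = timit_to_ipa(["h#", "sh", "iy", "h#"])  # the word "she" with boundary silences
```

Raising on unknown labels, instead of silently skipping them, is a design choice in this sketch; a full table would cover the complete TIMIT phone inventory.
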
- To train the Wav2Vec2 IPA base model, run the `training/fine_tune_w2v2.ipynb` notebook
  - Note: to save the model, make sure to set `HF_ID` to your HuggingFace UserID
| Dependency | Description |
|---|---|
| `vocab_manual.json` | Learnable output tokens for final softmax layer in Wav2Vec2 Architecture. Also contains Connectionist Temporal Classification (CTC) Loss token settings (UNKnown token, PADding token, and space token) |
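
As a rough illustration of how such a vocabulary interacts with CTC, the sketch below builds a tiny token-to-id map and applies greedy CTC decoding (collapse repeats, then drop blanks). The token strings (`[PAD]`, `[UNK]`, `|`), the ids, and the use of the padding token as the CTC blank are assumptions for this example and may not match `vocab_manual.json`.

```python
import json

# Hypothetical vocabulary: token strings and ids are assumptions for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "ʃ": 3, "i": 4, "z": 5}
vocab_json = json.dumps(vocab, ensure_ascii=False, indent=2)  # what a vocab file might hold

id_to_token = {i: t for t, i in vocab.items()}
PAD_ID = vocab["[PAD]"]  # in this sketch the padding token doubles as the CTC blank
SPACE_TOKEN = "|"        # stands in for a space between words


def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeats, drop blanks, and map the space token to ' '."""
    tokens = []
    prev = None
    for i in frame_ids:
        if i != prev and i != PAD_ID:
            tokens.append(id_to_token[i])
        prev = i
    return "".join(tokens).replace(SPACE_TOKEN, " ")


# Per-frame argmax ids, as they might come out of the final softmax layer.
decoded = ctc_greedy_decode([3, 3, 0, 4, 4, 2, 2, 4, 5, 5])
```

Each key in the vocabulary corresponds to one output unit of the final layer, so adding or removing a token changes the size of that layer.
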
- To evaluate model performance, select one of the notebooks: `eval_model_w2v2ipa.ipynb` to evaluate the fine-tuned model and `eval_model_xlsr.ipynb` to evaluate the baseline model.
- At the time of writing, we implement vanilla `Character Error Rate (CER)` since the fine-tuned model produces fewer tokens than the ground truth transcription (making `CER` effectively `normalized CER`)
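
The relationship between `CER` and `normalized CER` can be made concrete with a short sketch. It assumes that normalized CER divides the edit distance by the longer of the two sequences; other definitions exist, and the function names here are illustrative, not the notebook's own.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between a reference and a hypothesis sequence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference symbol missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra symbol in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Vanilla CER: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def normalized_cer(ref, hyp):
    """Assumed definition: edit distance divided by the longer sequence length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / max(len(ref), len(hyp))


# A hypothesis shorter than the reference: both metrics agree.
short_case = (cer("ʃihæd", "ʃih"), normalized_cer("ʃihæd", "ʃih"))
# A hypothesis longer than the reference: vanilla CER is the larger of the two.
long_case = (cer("abc", "abcde"), normalized_cer("abc", "abcde"))
```

Under this definition the two metrics coincide whenever the hypothesis is no longer than the reference, which is the situation described for the fine-tuned model.
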
| Dependency | Description |
|---|---|
| `eval_helpers.py` | Removes extra symbols from baseline model to avoid mis-penalization during evaluation |
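
A cleanup step of this kind might look like the sketch below. The symbol set (primary stress, secondary stress, and length marks) is purely an assumption about what a baseline model could emit that the ground truth lacks; the actual symbols handled by `eval_helpers.py` may differ.

```python
# Assumption: stress and length marks are examples of "extra" symbols;
# the repository's actual list may differ.
EXTRA_SYMBOLS = "ˈˌː"


def strip_extra_symbols(text, extra=EXTRA_SYMBOLS):
    """Delete every character in `extra` from `text` before scoring."""
    table = str.maketrans("", "", extra)
    return text.translate(table)


cleaned = strip_extra_symbols("ˈʃiː")
```

Applying the same cleanup to every baseline prediction before computing the error rate keeps symbols that are absent from the reference transcriptions from counting as errors.
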
