Machine Learning Final Project
This project implements an end-to-end spoken language identification system using classical machine learning techniques. It is developed in two phases:
- Phase 1: Dataset Collection and Problem Understanding
- Phase 2: Processing, Modeling, and Evaluation
The goal is to classify each short speech segment into one of four languages and to analyze the structure of the data using both supervised and unsupervised learning.
In Phase 1, a multilingual speech dataset was created from audiobook-style podcast recordings in four languages:
- Italian
- German
- Korean
- Spanish
Each audio file:
- Is approximately one minute long
- Starts at the beginning of a sentence
- Ends at the end of a sentence
- Contains clean, continuous speech
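The files are about one minute long, while the stated goal is to classify short speech segments; the text does not say how segments are derived from the files. The following is a minimal sketch of one possible approach, assuming fixed-length, non-overlapping windows. The function name `segment_bounds`, the 5-second segment length, and the 16 kHz sample rate are hypothetical choices for illustration, not values taken from the project.

```python
def segment_bounds(num_samples, sample_rate, segment_seconds=5.0):
    """Return (start, end) sample indices of full-length, non-overlapping segments.

    Any trailing audio shorter than one full segment is dropped.
    """
    seg_len = int(round(segment_seconds * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment_seconds must be positive")
    return [
        (start, start + seg_len)
        for start in range(0, num_samples - seg_len + 1, seg_len)
    ]


# A one-minute recording at an assumed 16 kHz sample rate.
bounds = segment_bounds(num_samples=60 * 16000, sample_rate=16000)
print(len(bounds), bounds[0], bounds[-1])
```

Dropping the trailing partial window keeps every segment the same length, which simplifies feature extraction; padding the last window would be an equally reasonable alternative.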
Phase 1 focused on building a balanced and structured dataset suitable for machine learning.
In Phase 2, the collected audio data was processed and analyzed through:
- Data cleaning and preprocessing
- Feature extraction
- Supervised classification (multiple ML models)
- Unsupervised clustering
- Quantitative evaluation (Accuracy, F1-score, Confusion Matrix, Silhouette Score)
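To make the supervised evaluation metrics listed above concrete, here is a small sketch that computes a confusion matrix, accuracy, and macro-averaged F1 for the four language classes using only the Python standard library. The helper names and the example labels are made up for illustration and are not results from the project; the choice of macro averaging is also an assumption, since the text does not say which F1 variant was used. The silhouette score is not covered here.

```python
from collections import Counter

LANGUAGES = ["Italian", "German", "Korean", "Spanish"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def macro_f1(matrix):
    """Unweighted mean of per-class F1; a class with no support or hits scores 0."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["Italian", "Italian", "German", "German",
          "Korean", "Korean", "Spanish", "Spanish"]
y_pred = ["Italian", "Spanish", "German", "German",
          "Korean", "Korean", "Spanish", "Italian"]

cm = confusion_matrix(y_true, y_pred, LANGUAGES)
for language, row in zip(LANGUAGES, cm):
    print(f"{language:8s} {row}")
print("accuracy:", accuracy(cm))
print("macro F1:", macro_f1(cm))
```

In practice, a library such as scikit-learn offers equivalent functions in `sklearn.metrics` (`accuracy_score`, `f1_score`, `confusion_matrix`, `silhouette_score`); the hand-written versions only show what the numbers mean.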
The project demonstrates a complete workflow from raw speech data to language classification and analysis.