AmirhosseinChami/Spoken-Language-Identification

Multilingual Speech Language Identification

Machine Learning Final Project

Overview

This project implements an end-to-end spoken language identification system using classical machine learning techniques. It was developed in two phases:

  • Phase 1: Dataset Collection and Problem Understanding
  • Phase 2: Processing, Modeling, and Evaluation

The goal is to classify short speech segments into one of four languages (Italian, German, Korean, or Spanish) and to analyze the structure of the data using both supervised and unsupervised learning.


Phase 1 – Dataset Construction

A multilingual speech dataset was created from audiobook-style podcast recordings in:

  • Italian
  • German
  • Korean
  • Spanish

Each audio file:

  • Is approximately one minute long
  • Starts at the beginning of a sentence
  • Ends at the end of a sentence
  • Contains clean, continuous speech

This phase focused on building a balanced and structured dataset suitable for machine learning.
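The duration and balance requirements above can be checked automatically. The following is a minimal sketch, not the project's own tooling: it assumes the clips are stored as PCM WAV files, and the function names, the 60-second target, and the 10-second tolerance are hypothetical choices made for illustration.

```python
import wave
from collections import Counter


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_dataset(clips, target=60.0, tolerance=10.0):
    """Summarize a dataset given as a list of (path, language) pairs.

    Returns the number of clips per language and the paths whose duration
    differs from `target` by more than `tolerance` seconds.
    Both thresholds are illustrative assumptions.
    """
    counts = Counter(language for _, language in clips)
    off_target = [
        path
        for path, _ in clips
        if abs(clip_duration_seconds(path) - target) > tolerance
    ]
    return counts, off_target


def is_balanced(counts):
    """True when every language has the same number of clips."""
    return len(set(counts.values())) <= 1
```

Sentence boundaries and speech cleanliness cannot be verified this way and would still need manual listening.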


Phase 2 – Machine Learning Pipeline

In this phase, the collected audio data was processed and analyzed through:

  • Data cleaning and preprocessing
  • Feature extraction
  • Supervised classification (multiple ML models)
  • Unsupervised clustering
  • Quantitative evaluation (Accuracy, F1-score, Confusion Matrix, Silhouette Score)
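The supervised metrics named in the list (Accuracy, F1-score, Confusion Matrix) can be sketched in plain Python as follows. This is an illustrative implementation, not the project's evaluation code. The function names are hypothetical, and macro averaging of the F1-score is an assumption, since the averaging scheme is not stated here.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true languages, columns are predicted languages."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # true class i, predicted other
        fp = sum(row[i] for row in matrix) - tp   # predicted i, true class other
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

With four balanced classes, accuracy and macro F1 tend to be close, while the confusion matrix shows which language pairs are mistaken for each other.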

The project demonstrates a complete workflow from raw speech data to language classification and analysis.
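For the clustering step, the Silhouette Score listed among the evaluation metrics compares how close each sample is to its own cluster with how close it is to the nearest other cluster. The sketch below is a plain-Python illustration under stated assumptions: features are fixed-length numeric vectors, distance is Euclidean, and the function names are hypothetical rather than taken from the project.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, in the range [-1, 1]."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        largest = max(a, b)
        if largest > 0:
            total += (b - a) / largest
    return total / len(points)
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest that the clusters overlap or that samples are assigned to the wrong cluster.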

About

This repository contains the files for the final project of the machine learning course.
