A research-based full-stack web application that classifies music genres using both machine learning and deep learning models. The project compares traditional algorithms like SVM and k-NN with CNN and CNN+BiLSTM architectures on two benchmark datasets: GTZAN and FMA-Small.
- Dual-model evaluation: ML vs DL for music genre classification
- Deep Learning Models: CNN and hybrid CNN + BiLSTM with SE block and Attention
- High Accuracy: Up to 93% classification accuracy on FMA-Small
- Feature Extraction: MFCCs and Mel Spectrograms
- User Interface: Django-based frontend with registration, login, and history tracking
- REST API: For real-time genre predictions
- GTZAN Dataset: 1000 tracks across 10 genres
- FMA-Small: 8000 30-second high-quality audio clips
- All data is preprocessed into `.wav` format and converted to 128x128 Mel spectrograms.
- Scikit-learn
- TensorFlow / Keras
- HTML/CSS/Bootstrap (with Django templates)
- JavaScript
- Logistic Regression, k-NN, SVM
- CNN
- CNN + BiLSTM + Attention + SE Block
- Audio Length: 30 seconds
- Sampling Rate: 44,100 Hz
- Feature Type: Mel-spectrogram
- Mel-Spectrogram Shape: `128 (mel bands) × 128 (time frames)`
- Color Channel: Grayscale (`1 channel`)
- Final Input Shape: `(128, 128, 1)`
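
The input shape above can be illustrated with a small NumPy sketch. It assumes a log-scaled Mel spectrogram with 128 bands has already been computed; how the project reduces a 30-second clip to 128 time frames is not stated, so the center crop / right pad used here and the helper name `to_model_input` are assumptions for illustration only.

```python
import numpy as np

TARGET_MELS = 128    # mel bands, from the configuration above
TARGET_FRAMES = 128  # time frames, from the configuration above


def to_model_input(mel_db):
    """Turn a (128, n_frames) Mel spectrogram into a (128, 128, 1) array.

    Hypothetical helper: center-crops long inputs, right-pads short ones,
    then min-max scales to [0, 1] and adds a grayscale channel axis.
    """
    mel_db = np.asarray(mel_db, dtype=np.float32)
    if mel_db.ndim != 2 or mel_db.shape[0] != TARGET_MELS:
        raise ValueError("expected an array of shape (128, n_frames)")

    n_frames = mel_db.shape[1]
    if n_frames >= TARGET_FRAMES:
        # Keep the middle 128 frames.
        start = (n_frames - TARGET_FRAMES) // 2
        mel_db = mel_db[:, start:start + TARGET_FRAMES]
    else:
        # Pad on the right with the quietest value in the clip.
        pad = TARGET_FRAMES - n_frames
        mel_db = np.pad(mel_db, ((0, 0), (0, pad)),
                        mode="constant", constant_values=mel_db.min())

    lo, hi = mel_db.min(), mel_db.max()
    scaled = (mel_db - lo) / (hi - lo) if hi > lo else np.zeros_like(mel_db)
    return scaled[..., np.newaxis]  # single grayscale channel
```
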
- Total Layers: 5 (Convolution + Pooling + Dense)
- Dropout Rate: 45%
- Activation Functions: ReLU (hidden), Softmax (output)
- Loss Function: Categorical Crossentropy
- Optimizer: Adam
- Output Classes: 8 genres
- Best Accuracy: 92%
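
A minimal Keras sketch of a CNN matching the configuration above. Only the dropout rate, activations, loss, optimizer, input shape, and class count come from this README; the split of the five layers into two convolutions, two poolings, and one dense layer, as well as the filter counts and kernel size, are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(num_classes=8, input_shape=(128, 128, 1)):
    """Small CNN sketch: 2 conv + 2 pooling + 1 dense layer (assumed layout)."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(0.45),  # 45% dropout, as listed above
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```
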
- CNN Layers: 4 blocks with increasing filters (32 → 256)
- LSTM: Bidirectional LSTM with 128 units
- Attention: Applied after BiLSTM
- Dropout Rate: 50%
- Output Classes:
  - GTZAN: 10 genres
  - FMA-Small: 8 genres
- Best Accuracy:
  - GTZAN: 91%
  - FMA-Small: 93%
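
The hybrid architecture could be sketched in Keras roughly as follows. The four convolutional blocks, the 128-unit bidirectional LSTM, attention after the BiLSTM, and 50% dropout come from the list above. The filter progression 32, 64, 128, 256, the position of the SE block (after the last convolutional block), the SE reduction ratio, and the simple learned attention pooling are assumptions, not the project's confirmed design.

```python
from tensorflow import keras
from tensorflow.keras import layers


def se_block(x, ratio=8):
    """Squeeze-and-Excitation: reweight channels using global context."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                 # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)    # excitation
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])


def build_cnn_bilstm(num_classes, input_shape=(128, 128, 1)):
    inputs = keras.Input(shape=input_shape)  # (mel bands, time frames, 1)
    x = inputs
    for filters in (32, 64, 128, 256):       # assumed doubling per block
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = se_block(x)                          # assumed SE placement

    # Make time the sequence axis: (mel, time, ch) -> (time, mel * ch).
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # Attention pooling: softmax-weighted sum over time steps.
    scores = layers.Dense(1)(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Dot(axes=1)([weights, x])
    context = layers.Flatten()(context)

    x = layers.Dropout(0.5)(context)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_cnn_bilstm(10)` gives the GTZAN variant and `build_cnn_bilstm(8)` the FMA-Small variant.
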
- Input Features: MFCC vectors or flattened spectrograms
- Best Hyperparameters:
  - SVM: RBF kernel, `gamma = scale`
  - k-NN: 5 neighbors, Manhattan distance, distance-based weight
  - Logistic Regression: `C=10`, L2 penalty, saga solver
- Output Classes: 8 or 10 (based on dataset)
- Accuracy Range: 57%–76%
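
The hyperparameters listed above map onto scikit-learn estimators roughly as shown below. The `StandardScaler` step, the `max_iter` value, and the helper name `build_baselines` are assumptions added for the sketch; the README does not say whether features were scaled.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_baselines():
    """Return the three traditional models with the settings listed above."""
    return {
        "svm": make_pipeline(
            StandardScaler(),
            SVC(kernel="rbf", gamma="scale"),
        ),
        "knn": make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=5, metric="manhattan",
                                 weights="distance"),
        ),
        "logreg": make_pipeline(
            StandardScaler(),
            # L2 is scikit-learn's default penalty, so it is not passed here.
            LogisticRegression(C=10, solver="saga", max_iter=1000),
        ),
    }
```

Each pipeline is then trained with `fit(X, y)`, where `X` holds one MFCC vector or flattened spectrogram per track.
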
The models were compared using precision, recall, and F1-score.
On the GTZAN dataset, the deep learning architectures clearly outperform the traditional ML models, with an F1 score of 0.91 versus 0.74 to 0.75.
| Model Category | Model | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Traditional ML | k-NN | 0.74 | 0.74 | 0.74 |
| Traditional ML | SVM | 0.74 | 0.75 | 0.74 |
| Traditional ML | Logistic Regression | 0.75 | 0.76 | 0.75 |
| Deep Learning | CNN | 0.91 | 0.91 | 0.91 |
| Deep Learning | CNN + BiLSTM | 0.91 | 0.91 | 0.91 |
Figure 1: Confusion Matrix of CNN + BiLSTM model trained on GTZAN Dataset (10 genres).
On the FMA-Small dataset, the CNN + BiLSTM hybrid was the best-performing model, reaching 0.93 on precision, recall, and F1 score.
| Model Category | Model | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Traditional ML | k-NN | 0.51 | 0.49 | 0.48 |
| Traditional ML | SVM | 0.62 | 0.61 | 0.61 |
| Traditional ML | Logistic Regression | 0.57 | 0.57 | 0.57 |
| Deep Learning | CNN | 0.92 | 0.92 | 0.92 |
| Deep Learning | CNN + BiLSTM | 0.93 | 0.93 | 0.93 |
Figure 2: Confusion Matrix of CNN + BiLSTM model trained on FMA-Small Dataset (8 genres).
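
For reference, precision, recall, and F1 can be derived directly from a confusion matrix like the ones in the figures. The sketch below uses macro averaging (an unweighted mean over genres); whether the tables report macro or weighted averages is not stated, so that choice is an assumption.

```python
def macro_scores(confusion):
    """Macro precision, recall, and F1 from a square confusion matrix.

    confusion[i][j] is the number of tracks of true genre i predicted as j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```
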
- Datasets Used:
  - GTZAN (10 genres × 100 tracks)
  - FMA-Small (8 genres × 1000 tracks, 8000 clips in total)
- Preprocessing Steps:
  - Convert audio to `.wav`
  - Normalize audio levels
  - Extract 128×128 Mel-spectrograms
  - Store as `.npy` or tensor images
- Data Augmentation (for deep models):
  - Time shifting
  - Pitch shifting
  - Additive Gaussian noise
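
Two of the augmentations listed above can be sketched on a raw waveform with NumPy alone. The circular shift, the maximum shift fraction, and the target signal-to-noise ratio are assumed values chosen for illustration; pitch shifting is left out because it needs a resampling or phase-vocoder routine from an audio library.

```python
import numpy as np


def time_shift(y, max_fraction=0.1, rng=None):
    """Circularly shift the waveform by up to max_fraction of its length."""
    if rng is None:
        rng = np.random.default_rng()
    limit = int(len(y) * max_fraction)
    shift = int(rng.integers(-limit, limit + 1))
    return np.roll(y, shift)


def add_gaussian_noise(y, snr_db=20.0, rng=None):
    """Add zero-mean Gaussian noise at the given signal-to-noise ratio (dB)."""
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise
```
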
| Setting | Value |
|---|---|
| Epochs (CNN) | 50 |
| Epochs (CNN + BiLSTM) | 50 |
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 0.001 |
| Loss Function | Categorical Crossentropy |
| Evaluation Metrics | Accuracy, Precision, Recall, F1 Score |
| Validation Split | 20% (Stratified) |
| Learning Rate Scheduler | ReduceLROnPlateau (patience=3, factor=0.5) |
| Model Saving | ModelCheckpoint (save_best_only=True) |
| Early Stopping | EarlyStopping (patience=7, restore_best_weights=True) |
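
The three callbacks in the table can be set up in Keras along these lines. The patience, factor, and `save_best_only` values come from the table; monitoring `val_loss` and the checkpoint filename `best_model.h5` are assumptions.

```python
from tensorflow import keras


def build_callbacks(checkpoint_path="best_model.h5"):
    """Callbacks matching the training table (monitored metric is assumed)."""
    return [
        keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=7, restore_best_weights=True),
        keras.callbacks.ModelCheckpoint(
            checkpoint_path, monitor="val_loss", save_best_only=True),
        keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", patience=3, factor=0.5),
    ]
```

The list is passed to training as `model.fit(..., epochs=50, batch_size=32, callbacks=build_callbacks())`.
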
- Saved Format: `.h5` (Keras HDF5 format)
- Model Selection: Automatically saved best model during training
- Callback Tools Used:
  - `EarlyStopping` for preventing overfitting
  - `ModelCheckpoint` for best-model saving
  - `ReduceLROnPlateau` for adaptive learning rate
- Final Evaluation: Performed on held-out test set
- Visualization: Confusion matrix plotted for all experiments

