A comprehensive, modular machine learning platform for classification and regression tasks with an intuitive Streamlit interface. Built with scikit-learn, XGBoost, and LightGBM.
MLInf provides a complete end-to-end machine learning workflow from data loading to model deployment, featuring intelligent preprocessing, automated hyperparameter configuration, training visualizations, and model explainability tools.
- Multiple Data Sources: Upload CSV, Excel, JSON, Parquet files, load from URLs, or use built-in scikit-learn datasets
- Intelligent Preprocessing: Automatic encoding strategy for categorical features (one-hot vs frequency encoding based on cardinality)
- Missing Value Handling: Multiple imputation strategies (mean, median, mode, constant, forward/backward fill)
- Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler
- Data Validation: Automatic class balance checking, duplicate detection, missing value analysis
Classification Models:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
- Gradient Boosting Classifier (scikit-learn)
- XGBoost Classifier
- LightGBM Classifier
- Neural Network (MLP Classifier)
Regression Models:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- Support Vector Regression (SVR)
- Gradient Boosting Regressor (scikit-learn)
- XGBoost Regressor
- LightGBM Regressor
- Dynamic Hyperparameter UI: Automatically generated configuration widgets for all models
- Training History Tracking: Real-time loss curves and convergence monitoring
- Early Stopping: Configurable early stopping for supported models
- Overfitting Detection: Automatic train-validation gap analysis
- Cross-Validation: K-fold cross-validation support
- Comprehensive Metrics:
  - Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
  - Regression: MSE, RMSE, MAE, R², Adjusted R²
- Rich Visualizations:
  - Confusion matrices
  - ROC curves and Precision-Recall curves
  - Feature importance plots
  - Training loss curves
  - Residual plots (regression)
  - Actual vs Predicted plots
- Model Comparison: Side-by-side comparison of multiple trained models
- Single Prediction: Interactive form-based predictions
- Batch Prediction: Upload files for bulk predictions with probability outputs
- Model Export: Download trained models as zip files with metadata
- Probability Outputs: Class probabilities for classification tasks
- Feature Importance: Built-in feature importance from tree-based models
- Modular Design: Plugin-based model registration system
- Extensible: Add new models by simply dropping files in the models directory
- Type-Safe: Comprehensive type hints throughout the codebase
- Error Handling: Robust exception handling with detailed error messages
- Session Management: Persistent state across UI interactions
# Clone the repository
git clone https://github.com/el-Badr07/mlinf.git
cd mlinf
# Build and run with Docker Compose
docker-compose up -d
# Access the application at http://localhost:8501

Docker Benefits:
- No dependency conflicts
- Isolated environment
- Easy deployment and scaling
- Consistent behavior across systems
- Automatic restarts
Docker Commands:
# Start the application
docker-compose up -d
# Stop the application
docker-compose down
# View logs
docker-compose logs -f
# Rebuild after code changes
docker-compose up -d --build
# Remove everything including volumes
docker-compose down -v

# Clone the repository
git clone https://github.com/el-Badr07/mlinf.git
cd mlinf
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Launch the application
streamlit run ui/app.py

mlinf/
├── src/
│   ├── core/                        # Core abstractions and registry
│   │   ├── base_model.py            # Base model interface
│   │   └── registry.py              # Model registration system
│   ├── data/                        # Data handling modules
│   │   ├── loaders.py               # File and URL loaders
│   │   ├── validators.py            # Data validation utilities
│   │   ├── preprocessors.py         # Preprocessing pipelines
│   │   └── sklearn_datasets.py      # Built-in dataset loader
│   ├── models/                      # Model implementations
│   │   ├── classification/          # Classification models
│   │   └── regression/              # Regression models
│   ├── training/                    # Training utilities
│   │   └── trainer.py               # Model training logic
│   ├── evaluation/                  # Evaluation modules
│   │   ├── metrics.py               # Metric calculations
│   │   └── visualizations.py        # Plotting utilities
│   ├── inference/                   # Inference utilities
│   │   └── predictor.py             # Prediction logic
│   ├── explainability/              # Model explainability
│   │   ├── shap_explainer.py
│   │   └── lime_explainer.py
│   ├── persistence/                 # Model saving/loading
│   │   └── model_saver.py
│   └── utils/                       # Utility functions
├── ui/
│   ├── pages/                       # Streamlit pages
│   │   ├── 1_📤_Data_Upload.py
│   │   ├── 2_🔧_Preprocessing.py
│   │   ├── 3_🎯_Model_Training.py
│   │   ├── 4_📊_Evaluation.py
│   │   └── 5_🔮_Inference.py
│   ├── ui_utils/                    # UI utilities
│   │   ├── session_state.py
│   │   └── hyperparam_widgets.py
│   └── app.py                       # Main application
├── configs/                         # Configuration files
├── tests/                           # Test suite
├── requirements.txt                 # Python dependencies
└── README.md
- Choose from three data sources:
  - Upload File: CSV, Excel, JSON, Parquet formats
  - Load from URL: Direct URL to dataset
  - Built-in Datasets: 8 scikit-learn datasets, including iris, wine, breast_cancer, digits, diabetes, california_housing, and linnerud
- Automatic data profiling: summary statistics, missing-value counts, and duplicate detection
- Select target variable and features
- Automatic task type detection (classification/regression)
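
The file formats listed above can be handled by dispatching on the file extension. The following is a minimal sketch using pandas, not the actual code in src/data/loaders.py; `READERS` and `load_table` are hypothetical names introduced for illustration. Reading Excel and Parquet files requires an optional pandas engine (for example openpyxl or pyarrow) to be installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader function.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_table(path: str) -> pd.DataFrame:
    """Load a tabular file, choosing the reader from the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    # The original string is passed through unchanged, so anything the
    # chosen pandas reader accepts (local path or URL) can be used here.
    return reader(path)
```

An unsupported extension raises `ValueError` instead of silently guessing a format, which gives the UI something clear to report.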
- Missing Values: Choose imputation strategy per feature
- Categorical Encoding:
  - Auto mode: One-hot encoding for low cardinality (<20 unique values)
  - Frequency encoding for high cardinality features
  - Manual override available
- Numerical Scaling: StandardScaler, MinMaxScaler, or RobustScaler
- Train/Test Split: Configurable split ratio with stratification for classification
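
The auto encoding rule described above (one-hot below 20 unique values, frequency encoding otherwise) can be sketched as follows. This is an illustration under stated assumptions, not MLInf's preprocessing code: `auto_encode` and `max_onehot_cardinality` are hypothetical names, and every non-numeric column is treated as categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def auto_encode(df: pd.DataFrame, max_onehot_cardinality: int = 20) -> pd.DataFrame:
    """One-hot encode low-cardinality columns and frequency-encode the rest."""
    out = df.copy()
    # Assumption: any non-numeric column is a categorical feature.
    cat_cols = [col for col in out.columns if not is_numeric_dtype(out[col])]
    for col in cat_cols:
        if out[col].nunique() < max_onehot_cardinality:
            # Low cardinality: one indicator column per category.
            dummies = pd.get_dummies(out[col], prefix=col)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # High cardinality: replace each category by its relative frequency.
            freqs = out[col].value_counts(normalize=True)
            out[col] = out[col].astype(object).map(freqs).astype(float)
    return out
```

Frequency encoding keeps the feature in a single numeric column, which avoids the column explosion one-hot encoding would cause for identifiers or free-text categories.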
- Model Selection: Choose one or multiple models to train
- Hyperparameter Configuration:
  - Dynamic UI widgets auto-generated from model schemas
  - Model-specific parameters (trees, depth, learning rate, etc.)
  - Early stopping configuration for supported models
- Training Execution:
  - Real-time training progress
  - Loss curve visualization during training
  - Overfitting detection alerts
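
One common way to check for overfitting is to compare training and validation scores across cross-validation folds. The sketch below uses scikit-learn's `cross_validate` on the bundled iris dataset. The 0.10 threshold and the name `GAP_THRESHOLD` are assumptions chosen for illustration; the source does not state how MLInf computes its alerts.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation, keeping the training scores so the gap can be measured.
cv = cross_validate(model, X, y, cv=5, scoring="accuracy", return_train_score=True)

train_acc = cv["train_score"].mean()
val_acc = cv["test_score"].mean()
gap = train_acc - val_acc

# Hypothetical rule: flag a train-validation gap above 10 percentage points.
GAP_THRESHOLD = 0.10
if gap > GAP_THRESHOLD:
    print(f"Possible overfitting: train={train_acc:.3f}, validation={val_acc:.3f}")
else:
    print(f"Train-validation gap of {gap:.3f} is within the threshold")
```

A fixed threshold is a blunt instrument; a reasonable value depends on the metric and the dataset size.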
- Performance Metrics:
  - Classification: Confusion matrix, ROC-AUC, precision-recall curves
  - Regression: Actual vs predicted plots, residual analysis
- Model Comparison: Compare metrics across all trained models
- Feature Importance: View which features drive predictions
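
The regression metrics MLInf reports (MSE, RMSE, MAE, R², adjusted R²) can be computed as sketched below. Adjusted R² is not provided by scikit-learn, so this sketch derives it from R² with the standard formula. `regression_metrics` is a hypothetical helper, not the function in src/evaluation/metrics.py.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, n_features: int) -> dict:
    """Compute MSE, RMSE, MAE, R² and adjusted R² for regression predictions."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("Adjusted R² needs more samples than features + 1")
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R² penalises R² for the number of predictors p:
    #   1 - (1 - R²) * (n - 1) / (n - p - 1)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2),
        "adjusted_r2": float(adj_r2),
    }
```

Unlike plain R², adjusted R² can decrease when a feature is added that contributes little, which makes it more suitable for comparing models with different numbers of features.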
- Single Prediction:
  - Interactive form with all feature inputs
  - Probability outputs for classification
- Batch Prediction:
  - Upload new data file
  - Automatic preprocessing application
  - Download results with predictions and probabilities
- Model Download: Export trained model as zip file with metadata
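
Exporting a model as a zip file with metadata can be done with Joblib and the standard library. The sketch below shows one possible layout. The archive member names `model.joblib` and `metadata.json` and both function names are assumptions, not the format MLInf actually writes.

```python
import io
import json
import zipfile

import joblib


def export_model_zip(model, metadata: dict) -> bytes:
    """Bundle a fitted model and its metadata into an in-memory zip archive."""
    model_buf = io.BytesIO()
    joblib.dump(model, model_buf)  # joblib accepts a file-like object
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("model.joblib", model_buf.getvalue())
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
    return zip_buf.getvalue()


def load_model_zip(data: bytes):
    """Inverse of export_model_zip: return (model, metadata)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        model = joblib.load(io.BytesIO(zf.read("model.joblib")))
        metadata = json.loads(zf.read("metadata.json"))
    return model, metadata
```

Returning bytes rather than writing to disk suits a web UI, where the archive is handed straight to a download widget. Joblib files are pickle-based, so archives should only be loaded from trusted sources.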
The modular architecture makes it easy to add new models:
- Create a new file in src/models/classification/ or src/models/regression/
- Inherit from BaseModel
- Implement required methods and add hyperparameter schema
- The model will be automatically registered and appear in the UI
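
Automatic registration of this kind is often built on a class decorator that records each model class in a shared dictionary. The following is a generic sketch of that pattern, not the contents of src/core/registry.py; `_MODEL_REGISTRY`, `get_models`, and `DemoClassifier` are hypothetical, and the real `register_model` may behave differently.

```python
from typing import Dict

# Hypothetical global registry keyed by each model's display name.
_MODEL_REGISTRY: Dict[str, type] = {}


def register_model(cls: type) -> type:
    """Class decorator that records a model class under its model_name."""
    name = getattr(cls, "model_name", cls.__name__)
    if name in _MODEL_REGISTRY:
        raise ValueError(f"Model {name!r} is already registered")
    _MODEL_REGISTRY[name] = cls
    return cls  # the class itself is returned unchanged


def get_models(model_type: str) -> Dict[str, type]:
    """Return all registered models of one task type, e.g. for a UI dropdown."""
    return {
        name: cls
        for name, cls in _MODEL_REGISTRY.items()
        if getattr(cls, "model_type", None) == model_type
    }


@register_model
class DemoClassifier:
    model_name = "Demo Classifier"
    model_type = "classification"


print(sorted(get_models("classification")))
```

A decorator only runs when its module is imported, so picking up files dropped into a directory would also require importing every module in the models packages at startup, for example with `pkgutil` and `importlib`.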
Example:
from typing import Any, Dict

from core import BaseModel, register_model


@register_model
class MyCustomClassifier(BaseModel):
    model_name = "My Custom Classifier"
    model_type = "classification"

    @classmethod
    def get_hyperparameter_schema(cls) -> Dict[str, Dict[str, Any]]:
        # Each entry describes one hyperparameter and drives the generated UI widget.
        return {
            'my_param': {
                'type': 'int',
                'default': 100,
                'min': 10,
                'max': 1000,
                'description': 'My custom parameter'
            }
        }

    def build_model(self):
        # SomeClassifier is a placeholder; substitute a real scikit-learn
        # estimator whose constructor accepts the parameters in the schema.
        from sklearn.ensemble import SomeClassifier
        return SomeClassifier(**self.hyperparameters)

Access 8 popular machine learning datasets directly, including:
- Iris - Iris flower classification (150 samples, 4 features, 3 classes)
- Wine - Wine classification (178 samples, 13 features, 3 classes)
- Breast Cancer - Cancer diagnosis (569 samples, 30 features, 2 classes)
- Digits - Handwritten digit recognition (1797 samples, 64 features, 10 classes)
- Diabetes - Diabetes progression regression (442 samples, 10 features)
- California Housing - Housing price prediction (20640 samples, 8 features)
- Linnerud - Multi-output regression (20 samples, 3 features, 3 targets)
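
The sample, feature, and class counts above can be checked directly against scikit-learn. This sketch covers the four classification datasets that ship with the library; `LOADERS` and `load_builtin` are hypothetical names, not the API of src/data/sklearn_datasets.py. California Housing is left out because scikit-learn downloads it on first use via `fetch_california_housing`.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine

# Hypothetical name-to-loader mapping for datasets bundled with scikit-learn.
LOADERS = {
    "iris": load_iris,
    "wine": load_wine,
    "breast_cancer": load_breast_cancer,
    "digits": load_digits,
}


def load_builtin(name: str):
    """Return (features DataFrame, target Series) for a bundled dataset."""
    bunch = LOADERS[name](as_frame=True)
    return bunch.data, bunch.target


for name in LOADERS:
    X, y = load_builtin(name)
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features, {y.nunique()} classes")
```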
- Core ML: scikit-learn, XGBoost, LightGBM
- UI Framework: Streamlit
- Visualization: Plotly, Matplotlib, Seaborn
- Explainability: SHAP, LIME
- Data Processing: Pandas, NumPy
- Model Persistence: Joblib
- Python 3.8+
- See requirements.txt for full dependency list