A machine learning project that predicts Formula 1 podium finishes (Top 3 positions) using historical race and qualifying data from the 2024 and 2025 F1 seasons. Built with Python, scikit-learn, and FastF1 API.
This project uses Random Forest Classifier models to predict which drivers will finish on the podium in F1 races. The system includes:
- Data Pipeline: Automated ETL process extracting race and qualifying results
- Feature Engineering: 22+ predictive features including driver momentum, season statistics, and track characteristics
- Model Training: Multiple model iterations with hyperparameter tuning
- Evaluation Tools: Comprehensive metrics and visualization dashboards
- Prediction Scripts: Ready-to-use prediction scripts for upcoming races
Model Performance: Up to 87.5% accuracy on held-out races, depending on the model variant (83.3% to 87.5% across the three models listed in the model table).
- Python: 3.12 or higher
- Package Manager: pip or uv (recommended)
- Internet Connection: Required for downloading F1 data via FastF1 API
If you have uv installed:
# Clone the repository (or navigate to project directory)
cd f1-ml-project
# Install dependencies
uv sync
# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

f1-ml-project/
├── data/
│ ├── raw/ # Raw data from FastF1 API
│ │ ├── qualifying/ # Qualifying results by year
│ │ └── race/ # Race results by year
│ └── processed/ # Processed and feature-engineered data
├── notebooks/ # Jupyter notebooks for analysis
│ ├── save_f1_qualy_results.ipynb
│ ├── save_f1_race_results.ipynb
│ ├── merge_qualy_race.ipynb
│ ├── build_features.ipynb
│ ├── build_improved_features.ipynb
│ ├── train_model*.ipynb
│ ├── evaluate_model.ipynb
│ └── predict_qatar_gp.ipynb
├── src/
│ ├── etl/ # Extract, Transform, Load scripts
│ ├── features/ # Feature engineering scripts
│ └── models/ # Model files and evaluation
├── experiments/ # Experimental scripts
└── pyproject.toml # Project configuration and dependencies
- Start Jupyter Lab:
  jupyter lab
- Navigate to notebooks/ and run in order:
  - save_f1_qualy_results.ipynb - Fetches qualifying session data
  - save_f1_race_results.ipynb - Fetches race session data

Note: Update the YEAR variable in each notebook to specify which season(s) to download (2024, 2025, etc.)
# From project root directory
cd src/etl
# Update YEAR variable in the script, then run:
python save_f1_qualy_results.py
python save_f1_race_results.py

Output: Raw CSV files in data/raw/qualifying/ and data/raw/race/ directories.
Run the merge notebook or script to combine qualifying and race results:
# Using notebook (recommended)
jupyter lab notebooks/merge_qualy_race.ipynb
# Or using Python script
python src/etl/process_qualy_race_results.py

Output: data/processed/2024_master_results.csv and 2025_master_results.csv
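The merge step can be pictured roughly as follows. This is a minimal sketch on toy data, not the project's actual script: the column names (Year, Round, Driver, QualiPosition, RacePosition, Podium) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the raw CSVs; all column names here are assumptions.
qualifying = pd.DataFrame({
    "Year": [2024, 2024, 2024],
    "Round": [1, 1, 1],
    "Driver": ["VER", "LEC", "SAI"],
    "QualiPosition": [1, 2, 4],
})
race = pd.DataFrame({
    "Year": [2024, 2024, 2024],
    "Round": [1, 1, 1],
    "Driver": ["VER", "LEC", "SAI"],
    "RacePosition": [1, 4, 3],
})

# One row per driver per race; validate= raises if a key is duplicated.
master = qualifying.merge(
    race, on=["Year", "Round", "Driver"], how="inner", validate="one_to_one"
)

# Binary target: 1 if the driver finished in the top 3, else 0.
master["Podium"] = (master["RacePosition"] <= 3).astype(int)
print(master)
```

Using an inner join drops drivers who appear in only one of the two sessions; whether that is the right choice for the real data depends on how the raw files record non-starters.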
Create features for model training:
# Using notebook (recommended)
jupyter lab notebooks/build_improved_features.ipynb
# Or using Python script
python src/features/build_improved_features.py

Output: data/processed/improved_feature_engineered_data.csv
Train a model using one of the training notebooks:
jupyter lab notebooks/train_model.ipynb # Basic model
jupyter lab notebooks/train_model2.ipynb # Improved model
jupyter lab notebooks/train_model3.ipynb # Hyperparameter-tuned model

Output: Trained model files (.pkl) saved in src/models/
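For orientation, training and saving a model of this kind might look like the sketch below. It uses synthetic data, assumed hyperparameters, and a temporary file path; the notebooks may use different settings or a different serializer such as joblib.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered feature table (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)  # minority positive class

# Hyperparameter values are illustrative, not the project's tuned settings.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X, y)

# Save and reload the fitted model as a .pkl file (hypothetical path).
model_path = Path(tempfile.gettempdir()) / "example_podium_model.pkl"
with model_path.open("wb") as f:
    pickle.dump(model, f)
with model_path.open("rb") as f:
    restored = pickle.load(f)
```

A pickled model should be loaded with the same scikit-learn version that produced it, and only from files you trust.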
Evaluate model performance:
jupyter lab notebooks/evaluate_model.ipynb # Basic evaluation
jupyter lab notebooks/advanced_evaluation.ipynb # Advanced metrics

Predict podium finishes for a specific race:
# Predict Qatar GP (example)
jupyter lab notebooks/predict_qatar_gp.ipynb
# Or use the Python script
python experiments/predict_gp/predict_qatar_gp.py

FastF1 caches downloaded data to speed up future runs. The cache is stored in:
- notebooks/fastf1_cache_dir/ (for notebooks)
- Or a specified directory in Python scripts
You can modify the cache location in the scripts/notebooks if needed.
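Changing the cache location could look something like this. The helper name enable_fastf1_cache is hypothetical; fastf1.Cache.enable_cache is the FastF1 call that sets the cache directory. The directory is created up front on the assumption that FastF1 expects it to exist already.

```python
from pathlib import Path


def enable_fastf1_cache(cache_dir: str) -> Path:
    """Create cache_dir if needed and point FastF1 at it (hypothetical helper)."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    try:
        import fastf1
    except ImportError:
        # FastF1 is not installed; the directory is still prepared.
        return path
    fastf1.Cache.enable_cache(str(path))
    return path
```

In a notebook this would be called once before any session is loaded, for example enable_fastf1_cache("fastf1_cache_dir").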
To download data for a specific year, update the YEAR variable in:
- notebooks/save_f1_qualy_results.ipynb
- notebooks/save_f1_race_results.ipynb
- Or corresponding Python scripts in src/etl/
| Model | Training Data | Test Data | Accuracy | Location |
|---|---|---|---|---|
| Basic Model | 2024 R1-16 | 2024 R17-24 | 87.5% | src/models/podium_model.pkl |
| Improved Model | 2024 R1-14 | 2024 R15-24 | 86% | src/models/improved_podium_model.pkl |
| Tuned 2025 Model | 2025 R1-19 | 2025 R20-24 | 83.3% | src/models/tuned_podium_model.pkl |
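The train/test boundaries in the table are round numbers within a season. A split of that shape can be sketched as below; the helper split_by_round, the Round column name, and the 20-entries-per-race toy data are assumptions for illustration.

```python
import pandas as pd


def split_by_round(df: pd.DataFrame, last_train_round: int, round_col: str = "Round"):
    """Split chronologically: rounds up to last_train_round train, later rounds test."""
    train = df[df[round_col] <= last_train_round]
    test = df[df[round_col] > last_train_round]
    return train, test


# Toy season: 24 rounds with 20 entries each (assumed sizes).
season = pd.DataFrame({"Round": [r for r in range(1, 25) for _ in range(20)]})

# Mirrors the Basic Model row: train on R1-16, test on R17-24.
train, test = split_by_round(season, last_train_round=16)
print(len(train), len(test))
```

Because every test round comes after every training round, the evaluation resembles predicting future races rather than filling in past ones.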
The models use 22+ engineered features including:
- Qualifying times (Q1, Q2, Q3 in seconds)
- Team one-hot encodings
- Driver momentum (rolling averages of last 3 races)
- Season cumulative statistics (points, DNFs)
- Lag features (previous race positions)
- Track type indicators (street circuit vs permanent)
- All features use shift(1) to ensure only historical data is used
- Time-based train/test splits (no random splits)
- Season statistics calculated with proper temporal alignment
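The effect of shift(1) is easiest to see on a small example. The sketch below builds a lag feature and a 3-race rolling average per driver; the column names (Driver, Round, Position, PrevPosition, Momentum3) and the data are made up for illustration and may not match the project's feature names.

```python
import pandas as pd

# Toy results for two drivers over five rounds (made-up data).
results = pd.DataFrame({
    "Driver": ["VER"] * 5 + ["NOR"] * 5,
    "Round": list(range(1, 6)) * 2,
    "Position": [1, 2, 1, 3, 1, 4, 3, 2, 1, 2],
}).sort_values(["Driver", "Round"])

grouped = results.groupby("Driver")["Position"]

# Lag feature: the finishing position in the previous race.
results["PrevPosition"] = grouped.shift(1)

# Momentum: mean of the last 3 races, shifted so the current race is excluded.
results["Momentum3"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
print(results)
```

Each driver's first race has no history, so both features are NaN there; those rows need to be dropped or filled before training.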
- Models use class_weight='balanced' to handle imbalanced podium predictions
- Only ~30 podium finishes per 200+ race entries
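To see what class_weight='balanced' does at roughly this ratio, the weights can be computed directly with scikit-learn. The counts below (30 podiums, 170 non-podiums) are a rounded illustration, not the project's exact numbers.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative label vector: 30 podium entries (1) and 170 others (0).
y = np.array([1] * 30 + [0] * 170)

# 'balanced' uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(dict(zip([0, 1], weights.round(3))))
```

With these counts a podium row weighs about 3.33 and a non-podium row about 0.59, so each class contributes equally to the training objective overall.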
- Team names are normalized across seasons (e.g., RB/AlphaTauri → Racing Bulls)
- Critical for multi-year analysis
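One simple way to do this kind of normalization is a lookup table applied before one-hot encoding. The sketch below is an assumption about the approach: the TEAM_ALIASES mapping and the normalize_team helper are hypothetical and cover only the example named above.

```python
import pandas as pd

# Hypothetical alias table; only the mapping mentioned in this README is included.
TEAM_ALIASES = {
    "RB": "Racing Bulls",
    "AlphaTauri": "Racing Bulls",
}


def normalize_team(name: str) -> str:
    """Map historical team names to one canonical name; unknown names pass through."""
    cleaned = name.strip()
    return TEAM_ALIASES.get(cleaned, cleaned)


teams = pd.Series(["RB", "AlphaTauri", "Ferrari", " Racing Bulls "])
normalized = teams.map(normalize_team)

# One-hot encode after normalizing so each team yields a single column.
one_hot = pd.get_dummies(normalized, prefix="Team")
print(one_hot.columns.tolist())
```

Without the normalization step, the same team would appear as separate one-hot columns in different seasons, and a model trained on one season would see an unknown team in the next.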
- fastf1 (>=3.6.1) - F1 data API
- pandas (>=2.3.3) - Data manipulation
- scikit-learn (>=1.7.2) - Machine learning
- numpy (>=2.3.4) - Numerical operations
- matplotlib (>=3.10.7) - Visualization
- seaborn (>=0.13.2) - Statistical visualization
- jupyter (>=1.1.1) - Jupyter notebooks
- jupyterlab (>=4.4.10) - Jupyter Lab interface
See pyproject.toml for complete dependency list.
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request