Comprehensive exploratory data analysis, visualization, and interactive dashboard for US domestic flight delays using historical airline on-time performance data combined with weather information.
This project analyzes US flight delay patterns using publicly available Bureau of Transportation Statistics (BTS) on-time performance data, enriched with weather conditions at departure and arrival airports.
Key features:
- Data cleaning and feature engineering
- Exploratory Data Analysis (EDA): carriers, airports, time-of-day, day-of-week, month, delay causes
- Geographic visualization of delay hotspots
- Interactive Streamlit dashboard with filters and dynamic charts
- Basic predictive modeling foundation (Random Forest / XGBoost ready)
Main questions answered:
- Which airlines have the highest delay rates?
- What are the most common causes of delays?
- When (time of day, day of week, month) do delays peak?
- Which airports are the worst delay hotspots?
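As a rough illustration of how the first and third questions can be approached with pandas, here is a minimal sketch on toy data. The column names (`carrier`, `dep_hour`, `dep_delay_min`) are hypothetical and need to be mapped to whatever your dataset uses. The 15-minute cutoff is an assumption that follows the usual BTS on-time convention.

```python
import pandas as pd

# Toy data; column names are hypothetical placeholders, not the project's schema.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "DL", "UA"],
    "dep_hour": [6, 18, 6, 18, 18, 12],
    "dep_delay_min": [5, 40, -3, 20, 10, 0],
})

# Assumption: a flight counts as delayed at 15 or more minutes late.
flights["is_delayed"] = flights["dep_delay_min"] >= 15

# Share of delayed flights per carrier, worst first.
delay_rate_by_carrier = (
    flights.groupby("carrier")["is_delayed"].mean().sort_values(ascending=False)
)

# Share of delayed flights per scheduled departure hour.
delay_rate_by_hour = flights.groupby("dep_hour")["is_delayed"].mean()

print(delay_rate_by_carrier)
print(delay_rate_by_hour)
```

The same `groupby(...).mean()` pattern should carry over to day of week, month, or origin airport by swapping the grouping column.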
flight-delay-analysis/
├── data/
│   └── raw/                      # original downloaded CSVs
├── notebooks/
│   ├── notebook.ipynb            # main Jupyter notebook with EDA & modeling
│   └── cleaned_flights.parquet   # cleaned & processed dataset
├── scripts/
│   ├── dashboard.py              # web interface to the data analysis
│   ├── cleaned_flights.parquet   # cleaned & processed dataset
│   └── airport_coords.py         # airport coordinates for mapping
├── dashboard.py                  # Streamlit app
├── requirements.txt
├── readme.md
└── demos/                        # screenshots & saved plots
- Python 3.10+
- Data processing: pandas, numpy
- Visualization: matplotlib, seaborn, plotly, folium
- Dashboard: streamlit, streamlit-folium
- Modeling (optional): scikit-learn, xgboost
# requirements.txt
pandas
numpy
matplotlib
seaborn
plotly
folium
streamlit
streamlit-folium
scikit-learn # optional for modeling
xgboost # optional
- Clone the repository
git clone https://github.com/prashant-sharma-cmd/flight-delay-analysis.git
cd flight-delay-analysis
- Install Dependencies
# Recommended: create a virtual environment first
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
pip install -r requirements.txt
- Prepare the Data
Download a suitable dataset to the data/ folder, for example:
Kaggle: Airline Delay and Cancellation Data
or BTS TranStats: https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ
Name the dataset you want to analyze flights_with_weather.
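For orientation, a cleaning pass over a downloaded file might look like the sketch below. It is not the notebook's actual pipeline. The CSV headers (`FL_DATE`, `CARRIER`, `ORIGIN`, `DEP_DELAY`, `CANCELLED`) are assumptions and will differ between Kaggle and BTS exports. An in-memory string stands in for the real file so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for a downloaded CSV; the headers are hypothetical.
raw_csv = io.StringIO(
    "FL_DATE,CARRIER,ORIGIN,DEP_DELAY,CANCELLED\n"
    "2023-01-01,AA,JFK,12.0,0\n"
    "2023-01-01,DL,ATL,,1\n"
    "2023-01-02,UA,ORD,47.0,0\n"
    "2023-01-02,UA,ORD,47.0,0\n"
)

df = pd.read_csv(raw_csv, parse_dates=["FL_DATE"])

cleaned = (
    df.drop_duplicates()                     # remove exact duplicate rows
      .loc[lambda d: d["CANCELLED"] == 0]    # keep flights that actually flew
      .dropna(subset=["DEP_DELAY"])          # need a delay value to analyze
      .assign(
          # Simple calendar features for the time-based EDA.
          day_of_week=lambda d: d["FL_DATE"].dt.day_name(),
          month=lambda d: d["FL_DATE"].dt.month,
      )
      .reset_index(drop=True)
)

print(cleaned)
```

To persist the result as a Parquet file such as cleaned_flights.parquet, `cleaned.to_parquet(...)` can be used. Note that pandas needs a Parquet engine (pyarrow or fastparquet) for this, and neither is listed in requirements.txt.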
- Explore the notebook
Open notebooks/notebook.ipynb in Jupyter / VS Code / JupyterLab for detailed EDA, charts, and modeling experiments.
- Run the Interactive Dashboard
cd scripts
streamlit run dashboard.py
Open http://localhost:8501 in your browser.
- Charts in the Streamlit app do not render their data points.
- The application can only read datasets covering a single month.
- Large datasets (>5M rows) may need sampling or Dask for faster processing.
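One possible way to handle the large-dataset case without extra dependencies is to sample while reading the CSV in chunks. The sketch below is an illustration under assumptions, not part of the project: `sample_csv` is a hypothetical helper, and the column names in the demo are made up.

```python
import io
import pandas as pd

def sample_csv(source, frac=0.1, chunksize=100_000, seed=42):
    """Return a random sample of a CSV, reading it chunk by chunk.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    parts = [
        # A different seed per chunk avoids picking the same row positions each time.
        chunk.sample(frac=frac, random_state=seed + i)
        for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize))
    ]
    return pd.concat(parts, ignore_index=True)

# Demo on an in-memory CSV of 1,000 rows (made-up columns).
rows = "\n".join(f"{i},{i % 60}" for i in range(1000))
demo = io.StringIO("flight_id,dep_delay_min\n" + rows)

sample = sample_csv(demo, frac=0.1, chunksize=100)
print(len(sample))  # 10 chunks x 10 rows each = 100
```

Sampling per chunk keeps roughly the same fraction from every part of the file. If exact, uniformly random row selection matters, a different approach may be preferable.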
- Primary data: U.S. Department of Transportation – Bureau of Transportation Statistics (BTS) https://www.transtats.bts.gov/
- Weather integration: commonly found in Kaggle merged datasets
- Airport coordinates: OpenFlights / OurAirports (public domain)
Inspired by many excellent Kaggle notebooks on flight delays. Thanks to the open data community and BTS for making this kind of analysis possible.