AeroPredict: Transformer Based Real-Time Flight Delay Predictor

This project is a flight delay prediction system based on Transformers. It utilizes a Multi-Task Learning (MTL) framework to simultaneously solve two problems:

Classification Task: Predict whether a flight will be delayed (the model outputs logits, which are converted to a delay probability).
Regression Task: Predict the exact delay time (in minutes).

📋 Project Overview

Core Features

Multi-Task Learning (MTL): The model shares a common underlying feature extraction network, which then branches into a Classifier Head and a Regressor Head, so the classification and regression objectives are trained jointly (evaluated by Accuracy and MAE, respectively).
Tabular Transformer: A Transformer architecture designed specifically for tabular data, using Embeddings for categorical features and Linear Projection for numerical features.
Complete Data Pipeline: Includes data extraction, data cleaning, leakage-proof feature engineering, data standardization, and PyTorch Dataset encapsulation.
Baseline Comparison: Integrates Random Forest and Gradient Boosting Machine as strong baselines to benchmark the performance of the deep learning model.
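
The data pipeline listed above (dropping future features, building labels, standardizing, encoding) can be sketched roughly as follows. This is a minimal illustration, not the code in main.ipynb: the toy data, every column name other than DEP_DELAY and IS_DELAYED, the 15-minute delay threshold, and the choice to fit the scaler on training rows only are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame; all column names except DEP_DELAY / IS_DELAYED are hypothetical.
df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "UA", "AA", "DL", "UA", "AA", "DL"],
    "ORIGIN": ["JFK", "LAX", "ORD", "LAX", "JFK", "ORD", "ORD", "JFK"],
    "TEMP_C": [3.0, 18.5, -2.0, 20.1, 5.5, 0.0, -4.2, 7.3],
    "WIND_KMH": [12.0, 5.0, 30.0, 8.0, 15.0, 22.0, 35.0, 10.0],
    "DEP_DELAY": [0.0, 25.0, 90.0, 5.0, 16.0, 0.0, 120.0, 10.0],
    "ACTUAL_ARRIVAL": ["x"] * 8,  # only known after the flight: a "future feature"
})

DELAY_THRESHOLD_MIN = 15  # assumption: the README does not state the cutoff

# 1. Drop future features so the model never sees post-departure information.
df = df.drop(columns=["ACTUAL_ARRIVAL"])

# 2. Build the classification label from the regression target.
df["IS_DELAYED"] = (df["DEP_DELAY"] > DELAY_THRESHOLD_MIN).astype(int)

cat_cols = ["AIRLINE", "ORIGIN"]
num_cols = ["TEMP_C", "WIND_KMH"]

# 3. Split first, so scaling statistics come from training rows only.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

# 4. Standardize numerical features (fit on train, apply to both splits).
scaler = StandardScaler().fit(train_df[num_cols])
train_df[num_cols] = scaler.transform(train_df[num_cols])
test_df[num_cols] = scaler.transform(test_df[num_cols])

# 5. Label-encode categorical features into integer ids for nn.Embedding.
#    Fitted on the full column here so the test split has no unseen category.
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder().fit(df[col])
    train_df[col] = encoders[col].transform(train_df[col])
    test_df[col] = encoders[col].transform(test_df[col])
```

Fitting the scaler after the split is one common way to keep test-set statistics out of training; whether the notebook does exactly this is not stated here.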

File Structure

extract.py: Extracts the raw data from Aviationstack and Open-Meteo, covering both flight information and weather information.
raw_flights_data_10k.csv: The raw data produced by extract.py. It contains 10K diverse flight records from 2024.
main.ipynb: The core file of the project. It covers data processing and cleaning, model definition, model training and evaluation, and the baseline comparison. See the notebook for details.
environment.yml: The conda environment specification for the project.
main.md and main.pdf: The original output from our experiment.

🚀 Quick Start

Clone the repository to your local machine and set up the environment:

git clone https://github.com/Resurgamm/AeroPredict.git
cd AeroPredict
conda env create -f environment.yml

Before running extract.py, get an Aviationstack API key and replace

AVIATIONSTACK_API_KEY = 'YOUR_AVIATIONSTACK_API_KEY'  # Replace with your Aviationstack key

with your own API key.

You can also use the provided raw_flights_data_10k.csv directly, or modify extract.py to extract more information or data from other years.

Open main.ipynb, run the cells, and have fun!

Execution Flow

Data Extraction: Automatically extracts 10K records of real flight data, including airports, schedules, weather, etc.
Data Preprocessing:
- Data Cleaning: Removes unused features and "future features" (like actual arrival time) to prevent data leakage.
- Label Construction: Creates classification labels (IS_DELAYED) and regression labels (DEP_DELAY).
- Standardization: Applies StandardScaler to numerical features.
- Encoding: Applies Label Encoding to categorical features.
Model Training:
- Trains our model, AeroPredictTransformer.
- Loss Function: Loss = BCEWithLogitsLoss (Classification) + MSELoss (Regression).
- Includes learning rate decay strategy (ReduceLROnPlateau).
Evaluation: Outputs Classification Accuracy, AUC, and Regression Mean Absolute Error (MAE).
Baseline Comparison: Trains Random Forest and Gradient Boosting Machine (GBM) classifiers and regressors, and outputs comparative metrics.
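
The Model Training step above (summed BCEWithLogitsLoss and MSELoss, with ReduceLROnPlateau) can be sketched as follows. This is an illustration under assumptions, not the notebook's training loop: TinyTwoHeadNet is a hypothetical stand-in for AeroPredictTransformer, the data is random, and the optimizer, learning rate, and scheduler settings are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyTwoHeadNet(nn.Module):
    """Hypothetical stand-in model: one shared body, two task heads."""
    def __init__(self, n_features: int, hidden: int = 16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, 1)  # delay logit
        self.reg_head = nn.Linear(hidden, 1)  # standardized delay

    def forward(self, x):
        h = self.body(x)
        return self.cls_head(h).squeeze(-1), self.reg_head(h).squeeze(-1)

model = TinyTwoHeadNet(n_features=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # assumed optimizer
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)
bce = nn.BCEWithLogitsLoss()  # classification loss, takes raw logits
mse = nn.MSELoss()            # regression loss on the standardized delay

# Random toy batch standing in for real flight features and labels.
x = torch.randn(64, 8)
y_cls = (torch.rand(64) > 0.5).float()
y_reg = torch.randn(64)

losses = []
for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    logits, delay_pred = model(x)
    # Multi-task objective: unweighted sum of the two task losses.
    loss = bce(logits, y_cls) + mse(delay_pred, y_reg)
    loss.backward()
    optimizer.step()
    # In a real run the scheduler would watch a validation loss instead.
    scheduler.step(loss.item())
    losses.append(loss.item())
```

Because the two losses are summed without weights, their relative scale matters; standardizing the regression target keeps the MSE term in a range comparable to the BCE term.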

🧠 Model Architecture

The model uses our custom AeroPredictTransformer architecture:

Input Layer:
- Categorical Features: Mapped to d_model dimensional vectors via nn.Embedding.
- Numerical Features: Projected to vectors of the same dimension via nn.Linear(1, d_model).
Feature Fusion:
- Concatenates all feature vectors into a sequence: [Batch, Num_Features, d_model].
Backbone:
- Transformer Encoder: Uses Self-Attention mechanisms to capture global interactions between features (2 Layers, 4 Heads).
Heads:
- Classifier Head: FC -> ReLU -> Dropout -> FC (Logits) -> Outputs delay logits; applying a sigmoid yields the delay probability.
- Regressor Head: FC -> ReLU -> Dropout -> FC (Scalar) -> Outputs standardized delay time.
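
The layers described above can be assembled roughly as in the sketch below. It is not the AeroPredictTransformer code from main.ipynb: the class name, the default d_model and dropout values, the feed-forward width, and in particular the mean pooling over feature tokens are assumptions, since the description above does not say how the encoded sequence is reduced before the heads.

```python
import torch
import torch.nn as nn

class TabularTransformerSketch(nn.Module):
    """Hypothetical sketch of a two-head tabular Transformer."""

    def __init__(self, cat_cardinalities, n_num, d_model=32,
                 n_heads=4, n_layers=2, dropout=0.1):
        super().__init__()
        # One embedding table per categorical feature.
        self.cat_embeds = nn.ModuleList(
            [nn.Embedding(card, d_model) for card in cat_cardinalities]
        )
        # One Linear(1, d_model) projection per numerical feature.
        self.num_projs = nn.ModuleList(
            [nn.Linear(1, d_model) for _ in range(n_num)]
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        def make_head():
            # FC -> ReLU -> Dropout -> FC, as in the head description.
            return nn.Sequential(
                nn.Linear(d_model, d_model), nn.ReLU(),
                nn.Dropout(dropout), nn.Linear(d_model, 1),
            )

        self.cls_head = make_head()
        self.reg_head = make_head()

    def forward(self, x_cat, x_num):
        # x_cat: [B, n_cat] integer ids; x_num: [B, n_num] floats.
        cat_tokens = [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)]
        num_tokens = [proj(x_num[:, i:i + 1]) for i, proj in enumerate(self.num_projs)]
        tokens = torch.stack(cat_tokens + num_tokens, dim=1)  # [B, F, d_model]
        encoded = self.encoder(tokens)                        # [B, F, d_model]
        pooled = encoded.mean(dim=1)  # assumption: mean pooling over features
        return self.cls_head(pooled).squeeze(-1), self.reg_head(pooled).squeeze(-1)

# Usage on a toy batch: 2 categorical and 3 numerical features.
model = TabularTransformerSketch(cat_cardinalities=[10, 50], n_num=3).eval()
x_cat = torch.randint(0, 10, (4, 2))
x_num = torch.randn(4, 3)
with torch.no_grad():
    logits, delay_std = model(x_cat, x_num)
prob = torch.sigmoid(logits)  # logits -> delay probability
```

Treating each feature as one token lets self-attention model pairwise interactions between, for example, the departure airport and a weather reading, at a cost that grows with the square of the number of features.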

📊 Performance Metrics

We evaluate our model and baselines with the following metrics: Accuracy and AUC-ROC for classification, and MAE for regression.

After running the notebook, you will see output similar to the following:

=== Transformer Evaluation (Full Features) ===
Accuracy : 0.6190
ROC AUC  : 0.6269
MAE      : 27.19 min

=== Baseline 1: Random Forest (Full Features) ===
Accuracy : 0.6120
ROC AUC  : 0.6193
MAE      : 28.15 min

=== Baseline 2: Gradient Boosting (Full Features) ===
Accuracy : 0.6140
ROC AUC  : 0.6298
MAE      : 27.29 min

In this run, our model achieves slightly higher classification accuracy and lower regression error (MAE) than both baselines, while Gradient Boosting has a marginally higher ROC AUC (0.6298 vs. 0.6269).
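
Since the Regressor Head outputs a standardized delay while MAE is reported in minutes, the predictions have to be mapped back to minutes before scoring. The sketch below shows one way the three metrics could be computed. The function name `evaluate`, the 0.5 decision threshold, and the use of a StandardScaler fitted on the delay target are assumptions, and the toy numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, roc_auc_score
from sklearn.preprocessing import StandardScaler

def evaluate(logits, y_cls, delay_std_pred, delay_min_true, delay_scaler):
    """Hypothetical helper: Accuracy, ROC AUC, and MAE in minutes."""
    prob = 1.0 / (1.0 + np.exp(-logits))   # sigmoid: logits -> probability
    pred_cls = (prob >= 0.5).astype(int)   # assumed 0.5 decision threshold
    # Undo target standardization so the MAE is expressed in minutes.
    delay_min_pred = delay_scaler.inverse_transform(
        delay_std_pred.reshape(-1, 1)
    ).ravel()
    return {
        "accuracy": accuracy_score(y_cls, pred_cls),
        "roc_auc": roc_auc_score(y_cls, prob),
        "mae_min": mean_absolute_error(delay_min_true, delay_min_pred),
    }

# Toy example: four flights with true delays in minutes.
delay_min_true = np.array([0.0, 10.0, 20.0, 30.0])
delay_scaler = StandardScaler().fit(delay_min_true.reshape(-1, 1))

y_cls = np.array([0, 0, 1, 1])
logits = np.array([-2.0, 1.0, 0.5, 3.0])
# Pretend each prediction is off by 5 minutes, expressed in standardized units.
offsets = np.array([5.0, -5.0, 5.0, -5.0])
delay_std_pred = delay_scaler.transform(
    (delay_min_true + offsets).reshape(-1, 1)
).ravel()

metrics = evaluate(logits, y_cls, delay_std_pred, delay_min_true, delay_scaler)
print(metrics)
```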
