This repository provides a reproducible implementation of the Dynamic Allocation Model (DAM) for predicting execution time of Apache Spark applications using a deterministic, stage-based simulation framework.
DAM extends the Static Allocation Model (SAM) by incorporating dynamic executor allocation, which allows it to model Spark execution behavior more realistically.
The model captures:
- Backlog-based scaling: Executors are added when tasks accumulate and cores are fully utilized
- Idle-time-based scaling: Executors are removed when resources remain idle
- Minimum executor constraint: Ensures baseline parallelism throughout execution
This approach provides an interpretable and lightweight alternative to data-driven performance models.
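The three scaling rules listed above can be illustrated with a minimal sketch. It is not the notebook's code: the `AllocationPolicy` class, the `next_core_count` function, the fixed scale-up step, the upper bound on cores, and the idle timeout are all assumptions made for illustration. Only the minimum of 4 cores comes from the experiment described in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllocationPolicy:
    """Hypothetical policy parameters; names and defaults are assumptions."""
    min_cores: int = 4          # baseline parallelism that is never released
    max_cores: int = 64         # assumed upper bound on allocated cores
    scale_up_step: int = 4      # assumed number of cores added per scale-up
    idle_timeout: float = 60.0  # assumed seconds a core may idle before removal


def next_core_count(policy, total_cores, busy_cores, pending_tasks, idle_times):
    """Return the core count after one allocation decision.

    idle_times holds, for each currently idle core, how long it has been idle.
    """
    # Backlog-based scaling: tasks are waiting and every core is busy.
    if pending_tasks > 0 and busy_cores == total_cores:
        return min(total_cores + policy.scale_up_step, policy.max_cores)

    # Idle-time-based scaling: release cores that idled past the timeout,
    # but never drop below the minimum executor constraint.
    expired = sum(1 for t in idle_times if t >= policy.idle_timeout)
    return max(total_cores - expired, policy.min_cores)
```

For example, under these assumed defaults, `next_core_count(AllocationPolicy(), 8, 8, 5, [])` grows the pool to 12 cores, while a pool of 8 cores with five cores idle past the timeout shrinks only to the minimum of 4. A fixed step is a simplification chosen to keep the sketch short.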
Key features:
- DAG-based stage execution modeling
- Task-level simulation of Spark jobs
- Round-robin scheduling across stages
- Dynamic executor scaling (add/remove cores)
- Deterministic and reproducible execution time prediction
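As a rough illustration of round-robin scheduling across stages, the sketch below interleaves tasks from the ready stages and assigns each one to whichever core becomes free first. The function names, the plain-dictionary input, and the fixed core count are assumptions for this example; the notebook's scheduler may differ, and dynamic scaling is deliberately left out here.

```python
import heapq
from collections import deque


def round_robin_order(stage_tasks):
    """Interleave tasks from ready stages: one task per stage per turn."""
    queues = deque(
        (name, deque(durations))
        for name, durations in stage_tasks.items()
        if durations  # skip stages that have no tasks
    )
    order = []
    while queues:
        name, tasks = queues.popleft()
        order.append((name, tasks.popleft()))
        if tasks:
            # The stage still has tasks, so it rejoins the end of the rotation.
            queues.append((name, tasks))
    return order


def simulate_makespan(stage_tasks, num_cores):
    """Assign tasks in round-robin order to the core that frees up first."""
    core_free_at = [0.0] * num_cores  # min-heap of core availability times
    heapq.heapify(core_free_at)
    finish = 0.0
    for _stage, duration in round_robin_order(stage_tasks):
        start = heapq.heappop(core_free_at)
        end = start + duration
        heapq.heappush(core_free_at, end)
        finish = max(finish, end)
    return finish
```

Because the task order and the core selection are both fixed by the input, repeated runs with the same stages and core count return the same makespan, which is the property the deterministic prediction relies on.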
The complete implementation and experiment are contained in:
Q52_DAM.ipynb
The notebook includes:
- Definition of Stage, Task, and Core classes
- DAG construction for Query-52
- Dynamic allocation logic (DAM)
- Execution time simulation
- Validation against measured results
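To give a sense of how the Stage, Task, and Core classes and the DAG could fit together, here is a minimal sketch. The field names, the `execution_order` helper, and the four-stage example DAG are hypothetical; they are not taken from `Q52_DAM.ipynb` and do not reproduce the actual Query-52 DAG. The sketch assumes Python 3.9 or later for `graphlib`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    duration: float  # assumed: task run time in seconds


@dataclass
class Stage:
    name: str
    tasks: list = field(default_factory=list)
    parents: list = field(default_factory=list)  # names of upstream stages


@dataclass
class Core:
    core_id: int
    free_at: float = 0.0  # simulated time at which the core becomes available


def execution_order(stages):
    """Return stage names so that every stage follows all of its parents."""
    graph = {stage.name: set(stage.parents) for stage in stages}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical DAG: two scans feed a join, which feeds an aggregation.
example_dag = [
    Stage("scan_a", tasks=[Task(1.0), Task(1.0)]),
    Stage("scan_b", tasks=[Task(2.0)]),
    Stage("join", tasks=[Task(3.0)], parents=["scan_a", "scan_b"]),
    Stage("agg", tasks=[Task(0.5)], parents=["join"]),
]
```

Calling `execution_order(example_dag)` yields an order in which both scans precede `join` and `agg` comes last; a simulator can then release a stage's tasks only after all of its parents have finished.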
Experimental setup:
- Workload: TPC-DS Query-52
- Data Size: 500 GB
- Execution Mode: Dynamic allocation
- Minimum Executor Cores: 4
- Overhead Included: Warm-up and stage transfer time
DAM produces execution time predictions that closely match the measured runtimes under the selected configurations.
- Prediction error range: ~2% to 9%
- Average prediction error: ~4.96%
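For readers who want to compute comparable figures on their own runs, the sketch below uses the common absolute percentage error definition. This README does not state which formula was used, so the definition is an assumption, and the function names are hypothetical.

```python
def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured runtime."""
    return abs(predicted - measured) / measured * 100.0


def mean_percent_error(pairs):
    """Average percent error over (predicted, measured) runtime pairs."""
    errors = [percent_error(predicted, measured) for predicted, measured in pairs]
    return sum(errors) / len(errors)
```

Each pair holds one predicted runtime and the corresponding measured runtime, in the same time unit, for a single configuration.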
If you use this work, please cite:
@article{tariq2026dam,
title = {Predicting Runtime in Spark-Like Systems with Allocation-Aware Deterministic Models},
author = {Tariq, H. and Das, O.},
journal = {Software: Practice and Experience},
year = {2026},
note = {Manuscript under review}
}
@inproceedings{tariq2022sam,
title = {A Deterministic Model to Predict Execution Time of Spark Applications},
author = {Tariq, H. and Das, O.},
booktitle = {Computer Performance Engineering (EPEW)},
year = {2022},
publisher = {Springer}
}