Big Data Analytics Mini Project 2 — Spark Case Study analyzing the Online Retail II dataset with three Spark APIs (RDD, DataFrame, SQL) and a comparative performance / optimization analysis.
Which customer segments drive revenue, and how do their purchasing patterns evolve over time and across countries?
A combined RFM customer segmentation + sales-trend analysis answered through 10 queries, each implemented three ways.
spark-retail-case-study/
├── REPORT.md # full project report
├── requirements.txt # pyspark, pandas, matplotlib
├── data/
│ ├── raw/online_retail_II.csv # source CSV
│ └── parquet/ # generated by the prep step
├── src/
│ ├── common/ # SparkSession factory, schema, timer
│ ├── prep/ # CSV → Parquet conversion
│ ├── rdd/ # 10 queries, RDD API
│ ├── dataframe/ # 10 queries, DataFrame API
│ ├── sql/ # 10 queries, Spark SQL
│ ├── benchmark/ # capture_explain, run_all, make_report_tables
│ └── optimization/ # caching, partition-pruning, scalability demos
├── outputs/ # generated artifacts (regenerated by scripts)
│ ├── explain/ # text dumps of execution plans
│ ├── results/ # query results (CSV / preview)
│ ├── metrics/ # performance.csv
│ └── plots/ # comparison charts
└── scripts/ # convenience shell scripts
- Python 3.9+ (tested on 3.9 and 3.11)
- Java 11 (required by Spark 3.5)
- ~2 GB free disk for the Parquet copy and benchmark outputs
- Internet connection on first run (to fetch the dataset from UCI)
bash scripts/setup.sh # creates .venv and installs requirements
source .venv/bin/activate
bash scripts/download_data.sh # ~95 MB CSV from UCI; takes ~30 sThe dataset (data/raw/online_retail_II.csv) is not in the repo — it is downloaded from UCI Online Retail II by download_data.sh. The script unzips the workbook UCI ships, concatenates the two sheets, and writes the combined CSV. Re-running the script is a no-op once the file exists.
# 1. Convert CSV → Parquet (one-time, ~1 minute)
bash scripts/run_prep.sh # depends on data/raw/online_retail_II.csv being present
# 2. Run a single query
bash scripts/run_query.sh dataframe 01 # = python -m src.dataframe.q01_filter
bash scripts/run_query.sh sql 05
bash scripts/run_query.sh rdd 10
# 3. Capture every .explain() output
python -m src.benchmark.capture_explain
# 4. Full benchmark sweep (30 implementations × 3 runs ≈ 8–15 minutes)
bash scripts/run_benchmark.sh
# 5. Optimization demos
python -m src.optimization.caching_demo
python -m src.optimization.partition_pruning_demo
python -m src.optimization.scalability_test
# 6. Build the Markdown tables embedded in REPORT.md
python -m src.benchmark.make_report_tables| Rubric item | Where to look |
|---|---|
| Problem statement & dataset description | REPORT.md §1–2 |
| 10 queries × 3 APIs (30 implementations) | src/rdd/, src/dataframe/, src/sql/ |
.explain() outputs |
outputs/explain/ |
| Logical / optimized / physical plan comparison | REPORT.md §4 |
| Performance comparison table | outputs/metrics/performance.csv and REPORT.md §5 |
| Broadcast vs sort-merge join | Q8 (broadcast) and Q9 (sort-merge) |
| CSV vs Parquet comparison | src/optimization/csv_vs_parquet.py (and Q3 run on both) |
| Partition pruning | src/optimization/partition_pruning_demo.py |
| Caching | src/optimization/caching_demo.py |
| Scalability tests | src/optimization/scalability_test.py |
| Final insights | REPORT.md §6 |