Spark Retail Case Study

Mini Project 2 for Big Data Analytics: a Spark case study that analyzes the Online Retail II dataset with three Spark APIs (RDD, DataFrame, SQL) and compares their performance and optimization behavior.

Research question

Which customer segments drive revenue, and how do their purchasing patterns evolve over time and across countries?

The question is answered with a combined RFM (recency, frequency, monetary) customer segmentation and sales-trend analysis, expressed as 10 queries, each implemented three ways (RDD, DataFrame, SQL).
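
As a rough illustration of what the RFM part computes, here is a minimal plain-Python sketch, not the project's Spark code. The tuple layout, the `rfm` function name, and the sample rows are all made up for this example. The rule of skipping rows without a customer ID is an assumption about how such rows are handled.

```python
from collections import defaultdict
from datetime import datetime


def rfm(rows, snapshot):
    """Return {customer: (recency_days, frequency, monetary)}.

    rows: iterable of (customer_id, invoice_no, invoice_date, quantity, price).
    This tuple layout is hypothetical; it is not the project's schema.
    """
    last_seen = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer, invoice, when, quantity, price in rows:
        if customer is None:
            continue  # assumption: rows without a customer ID are excluded
        last_seen[customer] = max(last_seen.get(customer, when), when)
        invoices[customer].add(invoice)       # frequency counts distinct invoices
        spend[customer] += quantity * price   # monetary is total line value
    return {
        c: ((snapshot - last_seen[c]).days, len(invoices[c]), round(spend[c], 2))
        for c in last_seen
    }


# Made-up rows, for illustration only.
sample = [
    (17850, "489434", datetime(2010, 12, 1), 6, 2.55),
    (17850, "489434", datetime(2010, 12, 1), 2, 3.39),
    (17850, "489500", datetime(2010, 12, 5), 1, 10.00),
    (13047, "489435", datetime(2010, 11, 20), 4, 1.25),
    (None, "489436", datetime(2010, 12, 2), 1, 9.99),
]
scores = rfm(sample, snapshot=datetime(2010, 12, 10))
```

In Spark the same shape would typically be a group-by on the customer key with three aggregates; segments are then derived by bucketing each of the three values.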

Repository layout

spark-retail-case-study/
├── REPORT.md                       # full project report
├── requirements.txt                # pyspark, pandas, matplotlib
├── data/
│   ├── raw/online_retail_II.csv    # source CSV
│   └── parquet/                    # generated by the prep step
├── src/
│   ├── common/                     # SparkSession factory, schema, timer
│   ├── prep/                       # CSV → Parquet conversion
│   ├── rdd/                        # 10 queries, RDD API
│   ├── dataframe/                  # 10 queries, DataFrame API
│   ├── sql/                        # 10 queries, Spark SQL
│   ├── benchmark/                  # capture_explain, run_all, make_report_tables
│   └── optimization/               # caching, partition-pruning, scalability demos
├── outputs/                        # generated artifacts (regenerated by scripts)
│   ├── explain/                    # text dumps of execution plans
│   ├── results/                    # query results (CSV / preview)
│   ├── metrics/                    # performance.csv
│   └── plots/                      # comparison charts
└── scripts/                        # convenience shell scripts

Prerequisites

Python 3.9+ (tested on 3.9 and 3.11)
Java 11 (Spark 3.5 needs a JVM; it supports Java 8, 11, and 17)
~2 GB free disk for the Parquet copy and benchmark outputs
Internet connection on first run (to fetch the dataset from UCI)

Setup

bash scripts/setup.sh           # creates .venv and installs requirements
source .venv/bin/activate
bash scripts/download_data.sh   # ~95 MB CSV from UCI; takes ~30 s

The dataset (data/raw/online_retail_II.csv) is not in the repo; download_data.sh fetches it from the UCI Online Retail II page. The script unzips the workbook UCI ships, concatenates its two sheets, and writes the combined CSV. Re-running the script is a no-op once the file exists.
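
The "concatenate two sheets, skip if the file already exists" behavior can be sketched in Python as follows. This is only an illustration of the idea, not the contents of download_data.sh: the `write_combined_csv` helper is hypothetical, the sheets are assumed to be already parsed into row lists that share one header, and the sample rows are made up.

```python
import csv
import tempfile
from pathlib import Path


def write_combined_csv(sheets, out_path):
    """Write all sheets into one CSV; return False (and do nothing) if it exists.

    sheets: list of row lists, each starting with the same header row.
    """
    out_path = Path(out_path)
    if out_path.exists():
        return False  # re-running is a no-op
    out_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = out_path.with_suffix(".tmp")
    with tmp.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for index, rows in enumerate(sheets):
            # Keep the header from the first sheet only.
            writer.writerows(rows if index == 0 else rows[1:])
    tmp.replace(out_path)  # rename last so a partial file never looks complete
    return True


# Made-up rows standing in for the two workbook sheets.
header = ["Invoice", "Quantity", "Price"]
sheet_a = [header, ["489434", "6", "2.55"]]
sheet_b = [header, ["536365", "2", "3.39"], ["536366", "1", "1.85"]]

with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "raw" / "online_retail_II.csv"
    first = write_combined_csv([sheet_a, sheet_b], target)
    second = write_combined_csv([sheet_a, sheet_b], target)
    line_count = len(target.read_text(encoding="utf-8").splitlines())
```

Writing to a temporary name and renaming at the end means an interrupted run does not leave a truncated file that a later run would mistake for a finished download.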

Running

# 1. Convert CSV → Parquet (one-time, ~1 minute)
bash scripts/run_prep.sh           # depends on data/raw/online_retail_II.csv being present

# 2. Run a single query
bash scripts/run_query.sh dataframe 01     # = python -m src.dataframe.q01_filter
bash scripts/run_query.sh sql 05
bash scripts/run_query.sh rdd 10

# 3. Capture every .explain() output
python -m src.benchmark.capture_explain

# 4. Full benchmark sweep (30 implementations × 3 runs ≈ 8–15 minutes)
bash scripts/run_benchmark.sh

# 5. Optimization demos
python -m src.optimization.caching_demo
python -m src.optimization.partition_pruning_demo
python -m src.optimization.scalability_test

# 6. Build the Markdown tables embedded in REPORT.md
python -m src.benchmark.make_report_tables
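
The "30 implementations × 3 runs" sweep in step 4 follows a common pattern: call each implementation several times and record one row per run. The sketch below shows that pattern in plain Python under stated assumptions. The `time_runs` and `benchmark` helpers, the row fields, and the stand-in callables are hypothetical and may not match what src/benchmark or outputs/metrics/performance.csv actually use.

```python
import time


def time_runs(fn, runs=3):
    """Call fn `runs` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def benchmark(implementations, runs=3):
    """implementations maps (api, query) to a zero-argument callable."""
    rows = []
    for (api, query), fn in implementations.items():
        for run, seconds in enumerate(time_runs(fn, runs), start=1):
            rows.append({"api": api, "query": query, "run": run, "seconds": seconds})
    return rows


# Stand-in callables; a real entry would run a Spark query end to end.
implementations = {
    ("dataframe", "01"): lambda: sum(range(10_000)),
    ("sql", "01"): lambda: sorted(range(10_000), reverse=True),
}
rows = benchmark(implementations, runs=3)
```

Because Spark evaluates lazily, each timed callable has to end in an action (for example `count()` or `collect()`); otherwise the timer only measures plan construction.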

How the deliverables map to files

| Rubric item | Where to look |
| --- | --- |
| Problem statement & dataset description | REPORT.md §1–2 |
| 10 queries × 3 APIs (30 implementations) | `src/rdd/`, `src/dataframe/`, `src/sql/` |
| `.explain()` outputs | `outputs/explain/` |
| Logical / optimized / physical plan comparison | REPORT.md §4 |
| Performance comparison table | `outputs/metrics/performance.csv` and REPORT.md §5 |
| Broadcast vs sort-merge join | Q8 (broadcast) and Q9 (sort-merge) |
| CSV vs Parquet comparison | `src/optimization/csv_vs_parquet.py` (and Q3 run on both) |
| Partition pruning | `src/optimization/partition_pruning_demo.py` |
| Caching | `src/optimization/caching_demo.py` |
| Scalability tests | `src/optimization/scalability_test.py` |
| Final insights | REPORT.md §6 |
