Skip to content

Marwan947/Big-Data-Processing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark Retail Case Study

Big Data Analytics Mini Project 2 — Spark Case Study analyzing the Online Retail II dataset with three Spark APIs (RDD, DataFrame, SQL) and a comparative performance / optimization analysis.

Research question

Which customer segments drive revenue, and how do their purchasing patterns evolve over time and across countries?

A combined RFM customer segmentation + sales-trend analysis answered through 10 queries, each implemented three ways.

Repository layout

spark-retail-case-study/
├── REPORT.md                       # full project report
├── requirements.txt                # pyspark, pandas, matplotlib
├── data/
│   ├── raw/online_retail_II.csv    # source CSV
│   └── parquet/                    # generated by the prep step
├── src/
│   ├── common/                     # SparkSession factory, schema, timer
│   ├── prep/                       # CSV → Parquet conversion
│   ├── rdd/                        # 10 queries, RDD API
│   ├── dataframe/                  # 10 queries, DataFrame API
│   ├── sql/                        # 10 queries, Spark SQL
│   ├── benchmark/                  # capture_explain, run_all, make_report_tables
│   └── optimization/               # caching, partition-pruning, scalability demos
├── outputs/                        # generated artifacts (regenerated by scripts)
│   ├── explain/                    # text dumps of execution plans
│   ├── results/                    # query results (CSV / preview)
│   ├── metrics/                    # performance.csv
│   └── plots/                      # comparison charts
└── scripts/                        # convenience shell scripts

Prerequisites

  • Python 3.9+ (tested on 3.9 and 3.11)
  • Java 11 (required by Spark 3.5)
  • ~2 GB free disk for the Parquet copy and benchmark outputs
  • Internet connection on first run (to fetch the dataset from UCI)

Setup

bash scripts/setup.sh           # creates .venv and installs requirements
source .venv/bin/activate
bash scripts/download_data.sh   # ~95 MB CSV from UCI; takes ~30 s

The dataset (data/raw/online_retail_II.csv) is not in the repo — it is downloaded from UCI Online Retail II by download_data.sh. The script unzips the workbook UCI ships, concatenates the two sheets, and writes the combined CSV. Re-running the script is a no-op once the file exists.

Running

# 1. Convert CSV → Parquet (one-time, ~1 minute)
bash scripts/run_prep.sh           # depends on data/raw/online_retail_II.csv being present

# 2. Run a single query
bash scripts/run_query.sh dataframe 01     # = python -m src.dataframe.q01_filter
bash scripts/run_query.sh sql 05
bash scripts/run_query.sh rdd 10

# 3. Capture every .explain() output
python -m src.benchmark.capture_explain

# 4. Full benchmark sweep (30 implementations × 3 runs ≈ 8–15 minutes)
bash scripts/run_benchmark.sh

# 5. Optimization demos
python -m src.optimization.caching_demo
python -m src.optimization.partition_pruning_demo
python -m src.optimization.scalability_test

# 6. Build the Markdown tables embedded in REPORT.md
python -m src.benchmark.make_report_tables

How the deliverables map to files

Rubric item Where to look
Problem statement & dataset description REPORT.md §1–2
10 queries × 3 APIs (30 implementations) src/rdd/, src/dataframe/, src/sql/
.explain() outputs outputs/explain/
Logical / optimized / physical plan comparison REPORT.md §4
Performance comparison table outputs/metrics/performance.csv and REPORT.md §5
Broadcast vs sort-merge join Q8 (broadcast) and Q9 (sort-merge)
CSV vs Parquet comparison src/optimization/csv_vs_parquet.py (and Q3 run on both)
Partition pruning src/optimization/partition_pruning_demo.py
Caching src/optimization/caching_demo.py
Scalability tests src/optimization/scalability_test.py
Final insights REPORT.md §6

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors