Real-Time Big Data Analytics Pipeline for Flash Sale Stockout Prevention

Overview

Flash sales such as 11.11 and 12.12 introduce extreme demand volatility in e-commerce systems. Inventory depletion can occur within minutes, while traditional batch analytics react too late — after revenue and customer trust are already lost.

This project implements a fully dockerized, real-time Big Data Analytics (BDA) pipeline designed to prevent stockouts during high-traffic flash sales by providing minute-level, action-oriented insights to decision-makers.

The system continuously ingests high-velocity data, computes distributed analytics, caches KPIs for low-latency access, archives historical data, and surfaces only decision-relevant signals through a tactical dashboard.

Business Problem

During flash sales:

  • Demand spikes are non-linear and time-dependent

  • Inventory drains faster than batch refresh cycles

  • Managers lack visibility into where and when stockouts will occur

  • Reactive reporting leads to:

    • Lost revenue
    • Order cancellations
    • Wasted ad spend
    • Fulfillment bottlenecks

Objective

Enable managers to answer — within one minute:

  1. Which products are about to stock out?
  2. Where is the risk occurring (city / warehouse)?
  3. What action should be taken immediately? (restock, pause ads, reroute fulfillment, monitor)

System Architecture (High-Level)

End-to-end flow:

Statistical Data Generator
        ↓
   MongoDB (Hot Data) ───→ Hadoop HDFS (Archival & Cold Storage)
        ↓
Apache Spark (OLAP & KPIs)
        ↓
   Redis (KPI Cache)
        ↓
Live Dashboard (Streamlit)

Orchestration: Apache Airflow
Deployment: Fully containerized using Docker & Docker Compose

(Architecture diagram)

Technology Stack

Layer            Technology                        Purpose
Data Generation  Python (Statistical Simulation)   Realistic flash-sale behavior
Fresh Storage    MongoDB                           High-write, low-latency ingestion
Processing       Apache Spark                      Distributed OLAP analytics
Orchestration    Apache Airflow                    Scheduling & pipeline control
Caching          Redis                             Sub-second KPI reads
Archival         Hadoop HDFS                       Scalable cold storage
Visualization    Streamlit                         Tactical real-time dashboard
Infrastructure   Docker                            Reproducibility & isolation

Scale & Data Volume

This project operates at realistic big-data scale:

  • ~69 million streaming records generated
  • ~16 GB of incoming data
  • Continuous ingestion throughout runtime
  • Automatic archival once thresholds are crossed

This ensures:

  • Non-trivial join sizes
  • Meaningful Spark execution plans
  • Realistic performance characteristics

Statistical Data Generation (Not Random)


Data is generated using parameterized statistical models, not uniform randomness.

Key characteristics

  • Context-aware Poisson-based order arrivals

    • Orders per minute sampled from Poisson(λ)

    • λ dynamically adjusted using:

      • Flash sale intensity
      • Discount levels
      • City demand factors
      • Warehouse pressure
  • Weighted product selection

    • Probability proportional to:

      popularity × discount_multiplier × scarcity_multiplier
      
  • Scarcity correlation

    • Lower inventory increases selection bias
    • Enables natural stockout emergence
  • Log-normal price modeling

    • Prices centered around discounted base price
    • Category-level variance for realism

This approach produces:

  • Demand skew
  • Hot SKUs
  • Geographic imbalance
  • Emergent stockouts (not hard-coded)
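
For illustration only, the sketch below shows how this kind of sampling could be wired together with NumPy. The parameter values and variable names (base_lambda, popularity, the multipliers) are hypothetical placeholders, not the repository's generator code.

# Minimal sketch of the sampling logic described above (hypothetical parameters).
import numpy as np

rng = np.random.default_rng(42)

# Context-aware arrival rate: baseline scaled by flash-sale intensity,
# discount level, and city demand factors (values are illustrative).
base_lambda = 120
lam = base_lambda * 2.5 * 1.4 * 1.1
orders_this_minute = rng.poisson(lam)

# Weighted product selection: popularity x discount_multiplier x scarcity_multiplier
popularity = np.array([0.50, 0.30, 0.15, 0.05])
discount_multiplier = np.array([1.8, 1.2, 1.0, 1.0])
inventory = np.array([40, 900, 2500, 8000])
scarcity_multiplier = 1.0 + 1.0 / np.log1p(inventory)   # lower stock -> stronger bias

weights = popularity * discount_multiplier * scarcity_multiplier
probs = weights / weights.sum()
chosen_products = rng.choice(len(probs), size=orders_this_minute, p=probs)

# Log-normal prices centred on the discounted base price, with category-level variance
base_price = 4999.0
discounted = base_price * 0.6
prices = rng.lognormal(mean=np.log(discounted), sigma=0.15, size=orders_this_minute)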

Data Model (Fact–Dimension Schema)

Data Schema

Fact Tables (Streaming)

orders_fact

  • order_id
  • product_id
  • seller_id
  • warehouse_id
  • city_id
  • order_qty
  • order_value
  • order_ts

inventory_fact

  • inventory_event_id
  • product_id
  • warehouse_id
  • inventory_on_hand
  • reserved_stock
  • fulfillment_delay
  • inventory_ts

Dimension Tables

  • products_dim (category, base price)
  • warehouses_dim (capacity, location)
  • cities_dim (city, province)

All analytics require multi-table joins, enabling OLAP-style queries.
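
For reference, here is a hypothetical orders_fact document as it might be inserted into MongoDB with pymongo. Field names follow the schema above; the database and collection names are assumptions, not the project's exact configuration.

# Sketch: inserting one orders_fact record into MongoDB (illustrative names).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["flashsale"]

order_doc = {
    "order_id": "ORD-000123",
    "product_id": "P-0042",
    "seller_id": "S-0007",
    "warehouse_id": "WH-LHE-01",
    "city_id": "C-LHE",
    "order_qty": 2,
    "order_value": 5998.0,
    "order_ts": datetime.now(timezone.utc),
}
db.orders_fact.insert_one(order_doc)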

Real-Time KPIs (Computed Every Minute and Cached in Redis)

KPIs

  • Orders per minute
  • Sales velocity (units/min per product & warehouse)
  • Stockout ETA (minutes-to-stockout)
  • High-risk SKU count
  • Warehouse overload count
  • Geographic risk distribution
  • Risk trend over time
  • Actionable alerts table

Query characteristics

  • Uses WHERE, GROUP BY, and HAVING
  • Time-windowed (last 15 minutes)
  • Early aggregation before joins
  • Derived KPIs instead of raw scans
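
A rough PySpark sketch of this style of query is shown below. It assumes the fact tables have been loaded from MongoDB and registered as temporary views; the view names, the 30-minute risk threshold, and the exact SQL are illustrative, not the repository's job code.

# Sketch of a minute-level KPI query in Spark SQL (illustrative only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kpi_sketch").getOrCreate()

kpis = spark.sql("""
    WITH recent_orders AS (                       -- early aggregation before the join
        SELECT product_id, warehouse_id,
               SUM(order_qty) / 15.0 AS units_per_min
        FROM orders_fact
        WHERE order_ts >= current_timestamp() - INTERVAL 15 MINUTES
        GROUP BY product_id, warehouse_id
        HAVING SUM(order_qty) > 0
    ),
    latest_inventory AS (
        SELECT product_id, warehouse_id,
               MAX_BY(inventory_on_hand, inventory_ts) AS on_hand
        FROM inventory_fact
        GROUP BY product_id, warehouse_id
    )
    SELECT o.product_id, o.warehouse_id, o.units_per_min,
           i.on_hand / o.units_per_min AS stockout_eta_min    -- minutes-to-stockout
    FROM recent_orders o
    JOIN latest_inventory i
      ON o.product_id = i.product_id AND o.warehouse_id = i.warehouse_id
    WHERE i.on_hand / o.units_per_min < 30                    -- keep only high-risk SKUs
""")
kpis.show()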

Orchestration (Airflow DAGs)

DAGs implemented

  1. Dimension Seeding DAG

    • Initializes reference tables
  2. Minute-Level KPI DAG

    • Triggers Spark analytics every minute
    • Updates Redis cache
  3. Archival DAG

    • Moves older data from MongoDB → HDFS
    • Keeps hot storage lean

Each DAG is isolated for:

  • Fault containment
  • Debuggability
  • Operational clarity
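
As a rough illustration, a minimal version of the minute-level KPI DAG could look like the sketch below. The operator choice, script paths, and task names are assumptions, not the repository's DAG definitions.

# Sketch of a minute-level KPI DAG (illustrative; paths are hypothetical).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="minute_level_kpis",
    start_date=datetime(2024, 1, 1),
    schedule_interval="* * * * *",   # trigger every minute
    catchup=False,
    max_active_runs=1,
) as dag:
    compute_kpis = BashOperator(
        task_id="spark_compute_kpis",
        bash_command="spark-submit /opt/jobs/compute_kpis.py",   # hypothetical path
    )
    refresh_cache = BashOperator(
        task_id="update_redis_cache",
        bash_command="python /opt/jobs/write_kpis_to_redis.py",  # hypothetical path
    )
    compute_kpis >> refresh_cache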

Performance Optimizations

Optimizations are applied at every layer:

Ingestion

  • Rate-controlled generation
  • Single write target (MongoDB)

Storage

  • Hot–cold data separation (MongoDB ↔ HDFS)
  • Time-windowed queries only

Processing

  • Early aggregation
  • Join ordering
  • Derived KPIs
  • No full scans

Caching

  • Redis stores only pre-aggregated KPIs
  • Eliminates Spark/Mongo hits from the dashboard
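
A minimal sketch of this caching step, assuming a JSON payload and an illustrative key name and TTL (the real key layout may differ):

# Sketch: caching a pre-aggregated KPI payload in Redis after each Spark run.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

kpi_payload = {
    "orders_per_min": 1840,
    "high_risk_sku_count": 12,
    "computed_at": "2024-11-11T20:01:00Z",
}
# Short TTL so the dashboard never serves stale minute-level KPIs.
r.setex("kpi:summary:latest", 180, json.dumps(kpi_payload))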

Visualization

  • Dashboard reads Redis only
  • Sub-second refresh
  • Minimal, decision-centric visuals

Metadata & Governance

  • Clear schema definitions
  • Timestamped records
  • Partitioned archival (date / hour)
  • Airflow logs provide lineage and execution metadata
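
As a sketch, the date/hour partition layout could be derived like this. The archive root matches the HDFS directories created in the setup steps below; the helper itself is illustrative.

# Sketch: building a date/hour-partitioned archive path for a batch of records.
from datetime import datetime, timezone

def archive_path(table: str, ts: datetime) -> str:
    """Return an HDFS path partitioned by date and hour for the given table."""
    return (
        f"/daraz_flashsale_archive/{table}/"
        f"date={ts:%Y-%m-%d}/hour={ts:%H}"
    )

print(archive_path("orders_fact", datetime(2024, 11, 11, 20, 5, tzinfo=timezone.utc)))
# -> /daraz_flashsale_archive/orders_fact/date=2024-11-11/hour=20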

Dashboard Design

The dashboard is tactical, not descriptive.

Designed to help managers:

  • Spot stockout risk early
  • Identify where it’s happening
  • Decide action immediately

No scrolling, no heavy charts — only actionable insights.
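
A minimal Streamlit sketch of this read-only pattern, assuming an illustrative Redis key and metric layout (not the project's exact dashboard code):

# Sketch: a tactical Streamlit view that reads pre-aggregated KPIs from Redis only.
import json
import redis
import streamlit as st

r = redis.Redis(host="redis", port=6379, decode_responses=True)
kpis = json.loads(r.get("kpi:summary:latest") or "{}")

st.title("Flash Sale Stockout Risk")
col1, col2 = st.columns(2)
col1.metric("Orders / min", kpis.get("orders_per_min", 0))
col2.metric("High-risk SKUs", kpis.get("high_risk_sku_count", 0))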

Why This Matters

This project demonstrates how real-time analytics, when engineered correctly, can shift decision-making from reactive reporting to preventive action.

It reflects:

  • Realistic data scale
  • Industry-grade architecture
  • Practical engineering trade-offs
  • Business-first analytics design

Status

✔ Fully implemented
✔ Dockerized & reproducible
✔ Real-time & scalable
✔ Industry-aligned architecture

How to Run Locally (Dockerized Setup)

This project is fully containerized using Docker Compose. The steps below bring up the complete real-time Big Data Analytics pipeline locally, including ingestion, processing, orchestration, caching, archival, and visualization.

Prerequisites

  • Docker ≥ 24.x
  • Docker Compose (v2)
  • At least 8–12 GB RAM recommended
  • Linux (tested)

Build and start core services

Build all images and start base infrastructure services:

docker compose up -d --build

This initializes:

  • MongoDB (fresh streaming data)
  • Redis (KPI cache)
  • Hadoop NameNode & DataNode (archival storage)
  • Spark Master & Worker
  • Airflow container (initial state)

Initialize Airflow metadata database

Airflow requires an internal metadata database before it can schedule DAGs:

docker compose run --rm airflow airflow db init

This creates the Airflow metadata schema used for:

  • DAG state
  • Task execution history
  • Scheduling metadata

Create Airflow admin user

Create an admin user to access the Airflow UI:

docker compose run --rm airflow airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin

Start Airflow services

Start the Airflow container in detached mode:

docker compose up -d airflow

(Optional) Reset the admin password if required:

docker exec -it airflow airflow users reset-password \
  --username admin \
  --password admin

Start the streaming data generator

Launch the statistical flash-sale data generator:

docker compose up -d generator

This begins:

  • Continuous order generation
  • Inventory updates
  • High-velocity streaming inserts into MongoDB

The generator simulates flash-sale behavior using statistical models, not random noise.

Start the dashboard

Launch the real-time dashboard:

docker compose up -d dashboard

The dashboard:

  • Reads only from Redis
  • Displays pre-aggregated KPIs
  • Updates automatically as data changes

Access it at:

http://localhost:8501

Start Airflow scheduler and webserver

Run the Airflow scheduler and webserver inside the container:

docker exec -it airflow bash -lc "airflow webserver -D && airflow scheduler -D"

Access the Airflow UI at:

http://localhost:8088

From here you can:

  • Enable DAGs
  • Monitor task execution
  • Inspect logs
  • Observe minute-level analytics & archival workflows

Prepare HDFS archive directories

Initialize the HDFS directory structure for archived data and metadata:

docker exec -it namenode bash -lc "
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/metadata &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/orders_fact &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/inventory_fact
"

These directories store:

  • Archived fact data (orders, inventory)
  • Metadata describing archival batches
  • Partitioned historical data (date/hour-based)

What Happens After Startup

Once all services are running:

  • The generator continuously inserts streaming data into MongoDB
  • Airflow DAGs trigger Spark jobs every minute
  • Spark computes KPIs using join-based OLAP queries
  • Redis caches KPIs for low-latency access
  • HDFS stores archived historical data
  • The dashboard updates live without recomputation

This setup enables real-time stockout risk detection under flash-sale load.

Verifying the System

Optional checks:

# Check Redis KPIs
docker exec -it redis redis-cli KEYS '*'

# Check MongoDB collections
docker exec -it mongo mongosh

# Check HDFS UI
http://localhost:9870

Shutting Down

To stop all services:

docker compose down

To remove volumes as well:

docker compose down -v

ℹ️ Notes

  • Kafka is intentionally not used — ingestion is deterministic and controlled for reproducible analytics.
  • The pipeline is designed for minute-level micro-batching, not event-driven chaos.
  • Redis is used strictly as a KPI cache, not a source of truth.

About

A complete end-to-end Dockerized Real-Time Big Data Analytics Pipeline using MongoDB, Spark, Redis, Airflow, and HDFS with a live analytics dashboard.
