Real-Time Big Data Analytics Pipeline for Flash Sale Stockout Prevention

Overview

Flash sales such as 11.11 and 12.12 introduce extreme demand volatility in e-commerce systems. Inventory depletion can occur within minutes, while traditional batch analytics react too late — after revenue and customer trust are already lost.

This project implements a fully dockerized, real-time Big Data Analytics (BDA) pipeline designed to prevent stockouts during high-traffic flash sales by providing minute-level, action-oriented insights to decision-makers.

The system continuously ingests high-velocity data, computes distributed analytics, caches KPIs for low-latency access, archives historical data, and surfaces only decision-relevant signals through a tactical dashboard.

Business Problem

During flash sales:

  • Demand spikes are non-linear and time-dependent

  • Inventory drains faster than batch refresh cycles

  • Managers lack visibility into where and when stockouts will occur

  • Reactive reporting leads to:

    • Lost revenue
    • Order cancellations
    • Wasted ad spend
    • Fulfillment bottlenecks

Objective

Enable managers to answer — within one minute:

  1. Which products are about to stock out?
  2. Where is the risk occurring (city / warehouse)?
  3. What action should be taken immediately? (restock, pause ads, reroute fulfillment, monitor)

System Architecture (High-Level)

End-to-end flow:

Statistical Data Generator
        ↓
   MongoDB (Hot Data) ───→ Hadoop HDFS (Archival & Cold Storage)
        ↓
Apache Spark (OLAP & KPIs)
        ↓
   Redis (KPI Cache)
        ↓
Live Dashboard (Streamlit)

Orchestration: Apache Airflow
Deployment: Fully containerized using Docker & Docker Compose

(Architecture diagram)

Technology Stack

Layer            Technology                        Purpose
Data Generation  Python (Statistical Simulation)   Realistic flash-sale behavior
Fresh Storage    MongoDB                           High-write, low-latency ingestion
Processing       Apache Spark                      Distributed OLAP analytics
Orchestration    Apache Airflow                    Scheduling & pipeline control
Caching          Redis                             Sub-second KPI reads
Archival         Hadoop HDFS                       Scalable cold storage
Visualization    Streamlit                         Tactical real-time dashboard
Infrastructure   Docker                            Reproducibility & isolation

Scale & Data Volume

This project operates at realistic big-data scale:

  • ~69 million streaming records generated
  • ~16 GB of incoming data
  • Continuous ingestion throughout runtime
  • Automatic archival once thresholds are crossed

This ensures:

  • Non-trivial join sizes
  • Meaningful Spark execution plans
  • Realistic performance characteristics

Statistical Data Generation (Not Random)


Data is generated using parameterized statistical models, not uniform randomness.

Key characteristics

  • Context-aware Poisson-based order arrivals

    • Orders per minute sampled from Poisson(λ)

    • λ dynamically adjusted using:

      • Flash sale intensity
      • Discount levels
      • City demand factors
      • Warehouse pressure
  • Weighted product selection

    • Probability proportional to:

      popularity × discount_multiplier × scarcity_multiplier
      
  • Scarcity correlation

    • Lower inventory increases selection bias
    • Enables natural stockout emergence
  • Log-normal price modeling

    • Prices centered around discounted base price
    • Category-level variance for realism

This approach produces:

  • Demand skew
  • Hot SKUs
  • Geographic imbalance
  • Emergent stockouts (not hard-coded)
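
For illustration only, the sketch below shows how this kind of sampling could be wired together with NumPy. The parameter values and variable names (base_lambda, popularity, the multipliers) are hypothetical placeholders, not the repository's generator code.

# Minimal sketch of the sampling logic described above (hypothetical parameters).
import numpy as np

rng = np.random.default_rng(42)

# Context-aware arrival rate: baseline scaled by flash-sale intensity,
# discount level, and city demand factors (values are illustrative).
base_lambda = 120
lam = base_lambda * 2.5 * 1.4 * 1.1
orders_this_minute = rng.poisson(lam)

# Weighted product selection: popularity x discount_multiplier x scarcity_multiplier
popularity = np.array([0.50, 0.30, 0.15, 0.05])
discount_multiplier = np.array([1.8, 1.2, 1.0, 1.0])
inventory = np.array([40, 900, 2500, 8000])
scarcity_multiplier = 1.0 + 1.0 / np.log1p(inventory)   # lower stock -> stronger bias

weights = popularity * discount_multiplier * scarcity_multiplier
probs = weights / weights.sum()
chosen_products = rng.choice(len(probs), size=orders_this_minute, p=probs)

# Log-normal prices centred on the discounted base price, with category-level variance
base_price = 4999.0
discounted = base_price * 0.6
prices = rng.lognormal(mean=np.log(discounted), sigma=0.15, size=orders_this_minute)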

Data Model (Fact–Dimension Schema)

Data Schema

Fact Tables (Streaming)

orders_fact

  • order_id
  • product_id
  • seller_id
  • warehouse_id
  • city_id
  • order_qty
  • order_value
  • order_ts

inventory_fact

  • inventory_event_id
  • product_id
  • warehouse_id
  • inventory_on_hand
  • reserved_stock
  • fulfillment_delay
  • inventory_ts

Dimension Tables

  • products_dim (category, base price)
  • warehouses_dim (capacity, location)
  • cities_dim (city, province)

All analytics require multi-table joins, enabling OLAP-style queries.
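
For reference, here is a hypothetical orders_fact document as it might be inserted into MongoDB with pymongo. Field names follow the schema above; the database and collection names are assumptions, not the project's exact configuration.

# Sketch: inserting one orders_fact record into MongoDB (illustrative names).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["flashsale"]

order_doc = {
    "order_id": "ORD-000123",
    "product_id": "P-0042",
    "seller_id": "S-0007",
    "warehouse_id": "WH-LHE-01",
    "city_id": "C-LHE",
    "order_qty": 2,
    "order_value": 5998.0,
    "order_ts": datetime.now(timezone.utc),
}
db.orders_fact.insert_one(order_doc)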

Real-Time KPIs (Computed Every Minute and Cached in Redis)

KPIs

  • Orders per minute
  • Sales velocity (units/min per product & warehouse)
  • Stockout ETA (minutes-to-stockout)
  • High-risk SKU count
  • Warehouse overload count
  • Geographic risk distribution
  • Risk trend over time
  • Actionable alerts table

Query characteristics

  • Uses WHERE, GROUP BY, and HAVING
  • Time-windowed (last 15 minutes)
  • Early aggregation before joins
  • Derived KPIs instead of raw scans
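
A rough PySpark sketch of this style of query is shown below. It assumes the fact tables have been loaded from MongoDB and registered as temporary views; the view names, the 30-minute risk threshold, and the exact SQL are illustrative, not the repository's job code.

# Sketch of a minute-level KPI query in Spark SQL (illustrative only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kpi_sketch").getOrCreate()

kpis = spark.sql("""
    WITH recent_orders AS (                       -- early aggregation before the join
        SELECT product_id, warehouse_id,
               SUM(order_qty) / 15.0 AS units_per_min
        FROM orders_fact
        WHERE order_ts >= current_timestamp() - INTERVAL 15 MINUTES
        GROUP BY product_id, warehouse_id
        HAVING SUM(order_qty) > 0
    ),
    latest_inventory AS (
        SELECT product_id, warehouse_id,
               MAX_BY(inventory_on_hand, inventory_ts) AS on_hand
        FROM inventory_fact
        GROUP BY product_id, warehouse_id
    )
    SELECT o.product_id, o.warehouse_id, o.units_per_min,
           i.on_hand / o.units_per_min AS stockout_eta_min    -- minutes-to-stockout
    FROM recent_orders o
    JOIN latest_inventory i
      ON o.product_id = i.product_id AND o.warehouse_id = i.warehouse_id
    WHERE i.on_hand / o.units_per_min < 30                    -- keep only high-risk SKUs
""")
kpis.show()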

Orchestration (Airflow DAGs)

DAGs implemented

  1. Dimension Seeding DAG

    • Initializes reference tables
  2. Minute-Level KPI DAG

    • Triggers Spark analytics every minute
    • Updates Redis cache
  3. Archival DAG

    • Moves older data from MongoDB → HDFS
    • Keeps hot storage lean

Each DAG is isolated for:

  • Fault containment
  • Debuggability
  • Operational clarity
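
As a rough illustration, a minimal version of the minute-level KPI DAG could look like the sketch below. The operator choice, script paths, and task names are assumptions, not the repository's DAG definitions.

# Sketch of a minute-level KPI DAG (illustrative; paths are hypothetical).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="minute_level_kpis",
    start_date=datetime(2024, 1, 1),
    schedule_interval="* * * * *",   # trigger every minute
    catchup=False,
    max_active_runs=1,
) as dag:
    compute_kpis = BashOperator(
        task_id="spark_compute_kpis",
        bash_command="spark-submit /opt/jobs/compute_kpis.py",   # hypothetical path
    )
    refresh_cache = BashOperator(
        task_id="update_redis_cache",
        bash_command="python /opt/jobs/write_kpis_to_redis.py",  # hypothetical path
    )
    compute_kpis >> refresh_cache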

Performance Optimizations

Optimizations are applied at every layer:

Ingestion

  • Rate-controlled generation
  • Single write target (MongoDB)

Storage

  • Hot–cold data separation (MongoDB ↔ HDFS)
  • Time-windowed queries only

Processing

  • Early aggregation
  • Join ordering
  • Derived KPIs
  • No full scans

Caching

  • Redis stores only pre-aggregated KPIs
  • Eliminates Spark/Mongo hits from the dashboard
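
A minimal sketch of this caching step, assuming a JSON payload and an illustrative key name and TTL (the real key layout may differ):

# Sketch: caching a pre-aggregated KPI payload in Redis after each Spark run.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

kpi_payload = {
    "orders_per_min": 1840,
    "high_risk_sku_count": 12,
    "computed_at": "2024-11-11T20:01:00Z",
}
# Short TTL so the dashboard never serves stale minute-level KPIs.
r.setex("kpi:summary:latest", 180, json.dumps(kpi_payload))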

Visualization

  • Dashboard reads Redis only
  • Sub-second refresh
  • Minimal, decision-centric visuals

Metadata & Governance

  • Clear schema definitions
  • Timestamped records
  • Partitioned archival (date / hour)
  • Airflow logs provide lineage and execution metadata
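
As a sketch, the date/hour partition layout could be derived like this. The archive root matches the HDFS directories created in the setup steps below; the helper itself is illustrative.

# Sketch: building a date/hour-partitioned archive path for a batch of records.
from datetime import datetime, timezone

def archive_path(table: str, ts: datetime) -> str:
    """Return an HDFS path partitioned by date and hour for the given table."""
    return (
        f"/daraz_flashsale_archive/{table}/"
        f"date={ts:%Y-%m-%d}/hour={ts:%H}"
    )

print(archive_path("orders_fact", datetime(2024, 11, 11, 20, 5, tzinfo=timezone.utc)))
# -> /daraz_flashsale_archive/orders_fact/date=2024-11-11/hour=20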

Dashboard Design

The dashboard is tactical, not descriptive.

Designed to help managers:

  • Spot stockout risk early
  • Identify where it’s happening
  • Decide action immediately

No scrolling, no heavy charts — only actionable insights.
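
A minimal Streamlit sketch of this read-only pattern, assuming an illustrative Redis key and metric layout (not the project's exact dashboard code):

# Sketch: a tactical Streamlit view that reads pre-aggregated KPIs from Redis only.
import json
import redis
import streamlit as st

r = redis.Redis(host="redis", port=6379, decode_responses=True)
kpis = json.loads(r.get("kpi:summary:latest") or "{}")

st.title("Flash Sale Stockout Risk")
col1, col2 = st.columns(2)
col1.metric("Orders / min", kpis.get("orders_per_min", 0))
col2.metric("High-risk SKUs", kpis.get("high_risk_sku_count", 0))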

Why This Matters

This project demonstrates how real-time analytics, when engineered correctly, can shift decision-making from reactive reporting to preventive action.

It reflects:

  • Realistic data scale
  • Industry-grade architecture
  • Practical engineering trade-offs
  • Business-first analytics design

Status

✔ Fully implemented
✔ Dockerized & reproducible
✔ Real-time & scalable
✔ Industry-aligned architecture

How to Run Locally (Dockerized Setup)

This project is fully containerized using Docker Compose. The steps below bring up the complete real-time Big Data Analytics pipeline locally, including ingestion, processing, orchestration, caching, archival, and visualization.

Prerequisites

  • Docker ≥ 24.x
  • Docker Compose (v2)
  • At least 8–12 GB RAM recommended
  • Linux (tested)

Build and start core services

Build all images and start base infrastructure services:

docker compose up -d --build

This initializes:

  • MongoDB (fresh streaming data)
  • Redis (KPI cache)
  • Hadoop NameNode & DataNode (archival storage)
  • Spark Master & Worker
  • Airflow container (initial state)

Initialize Airflow metadata database

Airflow requires an internal metadata database before it can schedule DAGs:

docker compose run --rm airflow airflow db init

This creates the Airflow metadata schema used for:

  • DAG state
  • Task execution history
  • Scheduling metadata

Create Airflow admin user

Create an admin user to access the Airflow UI:

docker compose run --rm airflow airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin

Start Airflow services

Start the Airflow container in detached mode:

docker compose up -d airflow

(Optional) Reset the admin password if required:

docker exec -it airflow airflow users reset-password \
  --username admin \
  --password admin

Start the streaming data generator

Launch the statistical flash-sale data generator:

docker compose up -d generator

This begins:

  • Continuous order generation
  • Inventory updates
  • High-velocity streaming inserts into MongoDB

The generator simulates flash-sale behavior using statistical models, not random noise.

Start the dashboard

Launch the real-time dashboard:

docker compose up -d dashboard

The dashboard:

  • Reads only from Redis
  • Displays pre-aggregated KPIs
  • Updates automatically as data changes

Access it at:

http://localhost:8501

Start Airflow scheduler and webserver

Run the Airflow scheduler and webserver inside the container:

docker exec -it airflow bash -lc "airflow webserver -D && airflow scheduler -D"

Access the Airflow UI at:

http://localhost:8088

From here you can:

  • Enable DAGs
  • Monitor task execution
  • Inspect logs
  • Observe minute-level analytics & archival workflows

Prepare HDFS archive directories

Initialize the HDFS directory structure for archived data and metadata:

docker exec -it namenode bash -lc "
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/metadata &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/orders_fact &&
/opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/inventory_fact
"

These directories store:

  • Archived fact data (orders, inventory)
  • Metadata describing archival batches
  • Partitioned historical data (date/hour-based)

What Happens After Startup

Once all services are running:

  • The generator continuously inserts streaming data into MongoDB
  • Airflow DAGs trigger Spark jobs every minute
  • Spark computes KPIs using join-based OLAP queries
  • Redis caches KPIs for low-latency access
  • HDFS stores archived historical data
  • The dashboard updates live without recomputation

This setup enables real-time stockout risk detection under flash-sale load.

Verifying the System

Optional checks:

# Check Redis KPIs
docker exec -it redis redis-cli KEYS '*'

# Check MongoDB collections
docker exec -it mongo mongosh

# Check HDFS UI
http://localhost:9870

Shutting Down

To stop all services:

docker compose down

To remove volumes as well:

docker compose down -v

ℹ️ Notes

  • Kafka is intentionally not used — ingestion is deterministic and controlled for reproducible analytics.
  • The pipeline is designed for minute-level micro-batching, not event-driven chaos.
  • Redis is used strictly as a KPI cache, not a source of truth.

About

A complete end-to-end Dockerized Real-Time Big Data Analytics Pipeline using MongoDB, Spark, Redis, Airflow, and HDFS with a live analytics dashboard.
