Flash sales such as 11.11 and 12.12 introduce extreme demand volatility in e-commerce systems. Inventory depletion can occur within minutes, while traditional batch analytics react too late — after revenue and customer trust are already lost.
This project implements a fully dockerized, real-time Big Data Analytics (BDA) pipeline designed to prevent stockouts during high-traffic flash sales by providing minute-level, action-oriented insights to decision-makers.
The system continuously ingests high-velocity data, computes distributed analytics, caches KPIs for low-latency access, archives historical data, and surfaces only decision-relevant signals through a tactical dashboard.
During flash sales:
- Demand spikes are non-linear and time-dependent
- Inventory drains faster than batch refresh cycles
- Managers lack visibility into where and when stockouts will occur
- Reactive reporting leads to:
  - Lost revenue
  - Order cancellations
  - Wasted ad spend
  - Fulfillment bottlenecks
Enable managers to answer — within one minute:
- Which products are about to stock out?
- Where is the risk occurring (city / warehouse)?
- What action should be taken immediately? (restock, pause ads, reroute fulfillment, monitor)
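As a purely illustrative sketch of how the third question could be answered mechanically from the minutes-to-stockout KPI (the thresholds and function name are assumptions, not the project's actual rules):

```python
def recommend_action(minutes_to_stockout: float, ad_spend_active: bool) -> str:
    """Map a stockout ETA to one of the actions listed above (illustrative thresholds)."""
    if minutes_to_stockout < 15:
        return "restock / reroute fulfillment"
    if minutes_to_stockout < 30 and ad_spend_active:
        return "pause ads"
    if minutes_to_stockout < 60:
        return "monitor closely"
    return "no action"
```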
End-to-end flow:
```
Statistical Data Generator
        ↓
MongoDB (Hot Data)
        ↓
Apache Spark (OLAP & KPIs)
        ↓
Redis (KPI Cache)
        ↓
Live Dashboard (Streamlit)
        ↘
Hadoop HDFS (Archival & Cold Storage)
```
Orchestration: Apache Airflow
Deployment: Fully containerized using Docker & Docker Compose
| Layer | Technology | Purpose |
|---|---|---|
| Data Generation | Python (Statistical Simulation) | Realistic flash-sale behavior |
| Fresh Storage | MongoDB | High-write, low-latency ingestion |
| Processing | Apache Spark | Distributed OLAP analytics |
| Orchestration | Apache Airflow | Scheduling & pipeline control |
| Caching | Redis | Sub-second KPI reads |
| Archival | Hadoop HDFS | Scalable cold storage |
| Visualization | Streamlit | Tactical real-time dashboard |
| Infrastructure | Docker | Reproducibility & isolation |
This project operates at realistic big-data scale:
- ~69 million streaming records generated
- ~16 GB of incoming data
- Continuous ingestion throughout runtime
- Automatic archival once thresholds are crossed
This ensures:
- Non-trivial join sizes
- Meaningful Spark execution plans
- Realistic performance characteristics
Data is generated using parameterized statistical models, not uniform randomness.
- Context-aware Poisson-based order arrivals
  - Orders per minute sampled from Poisson(λ)
  - λ dynamically adjusted using:
    - Flash sale intensity
    - Discount levels
    - City demand factors
    - Warehouse pressure
- Weighted product selection
  - Probability proportional to: popularity × discount_multiplier × scarcity_multiplier
- Scarcity correlation
  - Lower inventory increases selection bias
  - Enables natural stockout emergence
- Log-normal price modeling
  - Prices centered around discounted base price
  - Category-level variance for realism
This approach produces:
- Demand skew
- Hot SKUs
- Geographic imbalance
- Emergent stockouts (not hard-coded)
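A minimal sketch of one generation tick under these models. The product dictionary keys (popularity, discount, inventory_on_hand, base_price, sigma) and the exact λ adjustment are assumptions for illustration; the actual generator may parameterize them differently:

```python
import numpy as np

def sample_minute_orders(products, lam_base, flash_intensity, city_factor, warehouse_pressure):
    """Generate one minute of orders using the statistical models described above."""
    # Context-aware Poisson arrivals: λ scales with the sale context.
    lam = lam_base * flash_intensity * city_factor * warehouse_pressure
    n_orders = np.random.poisson(lam)

    # Weighted product selection: popularity × discount multiplier × scarcity multiplier.
    weights = np.array([
        p["popularity"]
        * (1.0 + p["discount"])                      # discount multiplier
        * (1.0 / max(p["inventory_on_hand"], 1))     # scarcity multiplier
        for p in products
    ])
    probs = weights / weights.sum()
    chosen = np.random.choice(len(products), size=n_orders, p=probs)

    orders = []
    for idx in chosen:
        p = products[idx]
        # Log-normal price centered on the discounted base price, with category-level variance.
        discounted = p["base_price"] * (1.0 - p["discount"])
        price = np.random.lognormal(mean=np.log(discounted), sigma=p.get("sigma", 0.1))
        orders.append({"product_id": p["product_id"], "order_qty": 1, "order_value": round(price, 2)})
    return orders
```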
Each order event maps to the orders_fact table:
- order_id
- product_id
- seller_id
- warehouse_id
- city_id
- order_qty
- order_value
- order_ts

Each inventory event maps to the inventory_fact table:
- inventory_event_id
- product_id
- warehouse_id
- inventory_on_hand
- reserved_stock
- fulfillment_delay
- inventory_ts

Dimension tables:
- products_dim (category, base price)
- warehouses_dim (capacity, location)
- cities_dim (city, province)
All analytics require multi-table joins, enabling OLAP-style queries.
Key KPIs computed every minute:
- Orders per minute
- Sales velocity (units/min per product & warehouse)
- Stockout ETA (minutes-to-stockout)
- High-risk SKU count
- Warehouse overload count
- Geographic risk distribution
- Risk trend over time
- Actionable alerts table
Query design:
- Uses WHERE, GROUP BY, and HAVING
- Time-windowed (last 15 minutes)
- Early aggregation before joins
- Derived KPIs instead of raw scans
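A sketch of one such minute-level KPI computation, combining a time-windowed WHERE, early GROUP BY/HAVING aggregation, dimension joins, and a derived stockout ETA. The temp-view registration, the Redis key name, and the 30-minute risk threshold are illustrative assumptions, not the project's exact queries:

```python
import json

import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flashsale_kpis").getOrCreate()

# Assumes orders_fact, inventory_fact, and products_dim are registered as temp views
# (e.g., loaded from MongoDB via a Spark connector).
kpi_df = spark.sql("""
    WITH recent_orders AS (                      -- early aggregation before the joins
        SELECT product_id,
               warehouse_id,
               SUM(order_qty) / 15.0 AS units_per_min
        FROM orders_fact
        WHERE order_ts >= current_timestamp() - INTERVAL 15 MINUTES
        GROUP BY product_id, warehouse_id
        HAVING SUM(order_qty) > 0                -- ignore SKUs with no recent demand
    ),
    latest_inventory AS (
        SELECT product_id, warehouse_id, inventory_on_hand
        FROM (
            SELECT product_id, warehouse_id, inventory_on_hand,
                   ROW_NUMBER() OVER (PARTITION BY product_id, warehouse_id
                                      ORDER BY inventory_ts DESC) AS rn
            FROM inventory_fact
        ) ranked
        WHERE rn = 1
    )
    SELECT o.product_id,
           o.warehouse_id,
           p.category,
           o.units_per_min,
           i.inventory_on_hand / o.units_per_min AS minutes_to_stockout
    FROM recent_orders o
    JOIN latest_inventory i
      ON o.product_id = i.product_id AND o.warehouse_id = i.warehouse_id
    JOIN products_dim p
      ON o.product_id = p.product_id
    WHERE i.inventory_on_hand / o.units_per_min < 30   -- illustrative risk threshold
""")

# Cache the pre-aggregated result so the dashboard never queries Spark or MongoDB.
rows = [row.asDict() for row in kpi_df.collect()]
r = redis.Redis(host="redis", port=6379, decode_responses=True)
r.set("kpi:high_risk_skus", json.dumps(rows, default=str), ex=120)  # hypothetical key; expires if stale
```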
Airflow orchestrates three DAGs:
- Dimension Seeding DAG
  - Initializes reference tables
- Minute-Level KPI DAG
  - Triggers Spark analytics every minute
  - Updates Redis cache
- Archival DAG
  - Moves older data from MongoDB → HDFS
  - Keeps hot storage lean
Each DAG is isolated for:
- Fault containment
- Debuggability
- Operational clarity
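A minimal sketch of how the minute-level KPI DAG could be declared, assuming Airflow 2.x and a hypothetical job script path (/opt/jobs/compute_kpis.py); the real DAGs may use different operators and task boundaries:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minute-level KPI DAG: trigger the Spark KPI job every minute and refresh the Redis cache.
with DAG(
    dag_id="minute_level_kpis",
    start_date=datetime(2024, 11, 11),
    schedule_interval="* * * * *",        # every minute
    catchup=False,
    max_active_runs=1,                    # avoid overlapping micro-batches
    default_args={"retries": 1, "retry_delay": timedelta(seconds=30)},
) as dag:

    compute_kpis = BashOperator(
        task_id="compute_kpis",
        # Hypothetical job script: reads the last 15 minutes from MongoDB,
        # computes KPIs in Spark, and writes them to Redis.
        bash_command="spark-submit /opt/jobs/compute_kpis.py",
    )
```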
Optimizations are applied at every layer:
Generator:
- Rate-controlled generation
- Single write target (MongoDB)

Storage:
- Hot–cold data separation (MongoDB ↔ HDFS)

Spark:
- Time-windowed queries only
- Early aggregation
- Join ordering
- Derived KPIs
- No full scans

Redis:
- Redis stores only pre-aggregated KPIs
- Eliminates Spark/Mongo hits from dashboard

Dashboard:
- Dashboard reads Redis only
- Sub-second refresh
- Minimal, decision-centric visuals
Data governance:
- Clear schema definitions
- Timestamped records
- Partitioned archival (date/hour)
- Airflow logs provide lineage and execution metadata
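A minimal sketch of the hot-to-cold archival step (MongoDB → HDFS, partitioned by date/hour). The connection URIs, database/collection names, NameNode port, and one-hour retention cutoff are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

ARCHIVE_ROOT = "hdfs://namenode:9000/daraz_flashsale_archive"   # assumed NameNode RPC port
CUTOFF = datetime.now(timezone.utc) - timedelta(hours=1)        # assumed hot-data window

spark = SparkSession.builder.appName("archival").getOrCreate()
mongo = MongoClient("mongodb://mongo:27017")                    # assumed connection URI
orders = mongo["flashsale"]["orders_fact"]                      # assumed db/collection names

# 1. Pull the cold slice out of MongoDB (hot storage).
cold = list(orders.find({"order_ts": {"$lt": CUTOFF}}, {"_id": 0}))

if cold:
    # 2. Write it to HDFS, partitioned by date and hour for cheap historical scans.
    df = spark.createDataFrame(cold)
    (df.withColumn("date", F.to_date("order_ts"))
       .withColumn("hour", F.hour("order_ts"))
       .write.mode("append")
       .partitionBy("date", "hour")
       .parquet(f"{ARCHIVE_ROOT}/orders_fact"))

    # 3. Keep hot storage lean: delete the rows that were just archived.
    orders.delete_many({"order_ts": {"$lt": CUTOFF}})
```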
The dashboard is tactical, not descriptive.
Designed to help managers:
- Spot stockout risk early
- Identify where it’s happening
- Decide action immediately
No scrolling, no heavy charts — only actionable insights.
This project demonstrates how real-time analytics, when engineered correctly, can shift decision-making from reactive reporting to preventive action.
It reflects:
- Realistic data scale
- Industry-grade architecture
- Practical engineering trade-offs
- Business-first analytics design
✔ Fully implemented ✔ Dockerized & reproducible ✔ Real-time & scalable ✔ Industry-aligned architecture
This project is fully containerized using Docker Compose. The steps below bring up the complete real-time Big Data Analytics pipeline locally, including ingestion, processing, orchestration, caching, archival, and visualization.
Prerequisites
- Docker ≥ 24.x
- Docker Compose (v2)
- At least 8–12 GB RAM recommended
- Linux (tested)
Build all images and start base infrastructure services:
```bash
docker compose up -d --build
```
This initializes:
- MongoDB (fresh streaming data)
- Redis (KPI cache)
- Hadoop NameNode & DataNode (archival storage)
- Spark Master & Worker
- Airflow container (initial state)
Airflow requires an internal metadata database before it can schedule DAGs:
```bash
docker compose run --rm airflow airflow db init
```
This creates the Airflow metadata schema used for:
- DAG state
- Task execution history
- Scheduling metadata
Create an admin user to access the Airflow UI:
```bash
docker compose run --rm airflow airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin
```
Start the Airflow container in detached mode:
```bash
docker compose up -d airflow
```
(Optional) Reset the admin password if required:
```bash
docker exec -it airflow airflow users reset-password \
  --username admin \
  --password admin
```
Launch the statistical flash-sale data generator:
```bash
docker compose up -d generator
```
This begins:
- Continuous order generation
- Inventory updates
- High-velocity streaming inserts into MongoDB
The generator simulates flash-sale behavior using statistical models, not random noise.
Launch the real-time dashboard:
```bash
docker compose up -d dashboard
```
The dashboard:
- Reads only from Redis
- Displays pre-aggregated KPIs
- Updates automatically as data changes
Access it at:
http://localhost:8501
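A minimal sketch of the Redis-only read path in Streamlit, assuming the hypothetical kpi:high_risk_skus key used in the Spark sketch above; the actual dashboard layout and key names may differ:

```python
import json

import pandas as pd
import redis
import streamlit as st

st.set_page_config(page_title="Flash-Sale Stockout Risk", layout="wide")

# The dashboard reads only pre-aggregated KPIs from Redis; no Spark or MongoDB calls.
r = redis.Redis(host="redis", port=6379, decode_responses=True)
raw = r.get("kpi:high_risk_skus")
rows = json.loads(raw) if raw else []

st.metric("High-risk SKUs", len(rows))
if rows:
    st.dataframe(pd.DataFrame(rows).sort_values("minutes_to_stockout"))
else:
    st.info("No high-risk SKUs in the current window.")
```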
Run the Airflow scheduler and webserver inside the container:
```bash
docker exec -it airflow bash -lc "airflow webserver -D && airflow scheduler -D"
```
Access the Airflow UI at:
http://localhost:8088
From here you can:
- Enable DAGs
- Monitor task execution
- Inspect logs
- Observe minute-level analytics & archival workflows
Initialize the HDFS directory structure for archived data and metadata:
```bash
docker exec -it namenode bash -lc "
  /opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/metadata &&
  /opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/orders_fact &&
  /opt/hadoop-3.2.1/bin/hdfs dfs -mkdir -p /daraz_flashsale_archive/inventory_fact
"
```
These directories store:
- Archived fact data (orders, inventory)
- Metadata describing archival batches
- Partitioned historical data (date/hour-based)
Once all services are running:
- The generator continuously inserts streaming data into MongoDB
- Airflow DAGs trigger Spark jobs every minute
- Spark computes KPIs using join-based OLAP queries
- Redis caches KPIs for low-latency access
- HDFS stores archived historical data
- The dashboard updates live without recomputation
This setup enables real-time stockout risk detection under flash-sale load.
Optional checks:
```bash
# Check Redis KPIs
docker exec -it redis redis-cli KEYS '*'

# Check MongoDB collections
docker exec -it mongo mongosh

# Check HDFS UI
# http://localhost:9870
```
To stop all services:
```bash
docker compose down
```
To remove volumes as well:
```bash
docker compose down -v
```
- Kafka is intentionally not used; ingestion is deterministic and controlled for reproducible analytics.
- The pipeline is designed for minute-level micro-batching, not event-driven chaos.
- Redis is used strictly as a KPI cache, not a source of truth.


