A production-grade real-time click stream analytics system demonstrating Kafka Streams patterns, windowed aggregations, and fault-tolerant stream processing.
This system processes website click events in real-time, tracking user sessions, page views, and revenue metrics with persistent state stores and comprehensive error handling.
- Real-time Processing: Processes click events with sub-second latency
- Windowed Aggregations:
- Tumbling windows (5-minute fixed intervals)
- Session windows (user sessions with 10-minute timeout)
- Hopping windows (sliding 10-minute windows)
- Fault Tolerance:
- Persistent state stores (survives crashes)
- Automatic checkpointing every 100 events
- Exactly-once-style semantics via coordinated state checkpoints and offset commits
- Error Handling: Dead Letter Queue for failed events
- State Management: Crash recovery with state restoration
Click Generator → Kafka Topic (clickstream-events) → Stream Processors → Dead Letter Queue (clickstream-dlq)
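The events flowing through this pipeline are JSON payloads. A minimal sketch of what a click event might look like — field names here are illustrative assumptions, not the exact schema emitted by click_generator.py:

```python
import json
import time
import uuid

# Hypothetical click event: field names are illustrative, not the
# exact schema produced by click_generator.py.
def make_click_event(user_id: str, page_url: str, duration_s: float) -> dict:
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "page_url": page_url,
        "duration_s": duration_s,
        "timestamp": time.time(),
    }

event = make_click_event("user_0042", "/products", 12.5)
payload = json.dumps(event).encode("utf-8")  # bytes a producer would publish
print(json.loads(payload)["page_url"])  # -> /products
```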
Stream Processors:
- Basic Processor: Real-time stateful aggregations (page views, sessions, revenue)
- Windowed Processor: Time-based windows (tumbling, session, hopping)
- Fault-Tolerant Processor: Persistent state stores with crash recovery and error handling
- Apache Kafka (Confluent Platform 7.5.0) - Event streaming platform
- Python 3.8+ - Stream processing logic
- Docker Compose - Infrastructure orchestration
- Docker Desktop 20.10+
- Docker Compose 1.29+
- Python 3.8+
- Git
git clone https://github.com/ezechimere/clickstream-analytics.git
cd clickstream-analytics
cd ../../../docker
docker-compose up -d
Wait 60 seconds for all services to start.
docker exec -it kafka bash
cd /bin
kafka-topics --create \
--bootstrap-server localhost:9092 \
--topic clickstream-events \
--partitions 3 \
--replication-factor 1
kafka-topics --create \
--bootstrap-server localhost:9092 \
--topic clickstream-dlq \
--partitions 1 \
--replication-factor 1
exit
python -m venv venv
source venv/Scripts/activate # Windows Git Bash
# or: .\venv\Scripts\Activate.ps1 # PowerShell
# or: source venv/bin/activate # Linux/Mac
pip install kafka-python
Terminal 1 - Start Basic Processor:
python -u src/clickstream_processor_native.py
Terminal 2 - Generate Click Events:
python -u src/click_generator.py
Output:
[PAGE STATS] /home: 20 views, avg duration: 145.2s
[SESSION] user_0042: 10 pages in session session_user_0042_...
[POPULAR] /products reached 100 views!
==========================================================
TOP 5 PAGES:
1. /home: 234 views, avg 145.2s
2. /products: 198 views, avg 138.5s
...
==========================================================
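The TOP 5 report above boils down to sorting per-page counters and dividing accumulated duration by view count. A minimal sketch (variable names are illustrative, not the processor's internals):

```python
from collections import defaultdict

# Running per-page state, as the basic processor would accumulate it
page_views = defaultdict(int)
total_duration = defaultdict(float)

for page, duration in [("/home", 120.0), ("/home", 170.4), ("/products", 138.5)]:
    page_views[page] += 1
    total_duration[page] += duration

# Top-5 report: sort by view count, compute the running average on the fly
top = sorted(page_views.items(), key=lambda kv: kv[1], reverse=True)[:5]
for rank, (page, views) in enumerate(top, 1):
    avg = total_duration[page] / views
    print(f"{rank}. {page}: {views} views, avg {avg:.1f}s")
```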
Run Windowed Processor:
python -u src/clickstream_windows.py
Output:
======================================================================
[TUMBLING WINDOW] 08:10 - 08:15
Total Views: 551
Unique Users: 28
Top Pages:
1. /search: 42 views
2. /product/2: 40 views
3. /product/3: 40 views
======================================================================
[SESSION ENDED] user_0095
Duration: 26.6 minutes
Pages Viewed: 79
Path: /home -> /products -> /cart -> /checkout -> /account
[HOPPING WINDOW] Last 10 minutes:
Events in window: 2900
Trending: /cart (210 views)
Run Fault-Tolerant Processor:
python -u src/fault_tolerant_processor.py
Test Crash Recovery:
- Let processor run and checkpoint
- Press Ctrl+C (simulates crash)
- Restart:
python -u src/fault_tolerant_processor.py
- State restored from disk automatically
Test Error Handling:
# Generate bad data
python -u src/bad_data_generator.py
Output:
[VALIDATION ERROR] Missing required field: page_url
[DLQ] Sent event to dead letter queue
# Processing continues without crashing
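The validate-or-route-to-DLQ pattern shown above can be sketched as follows. The ValidationError class, the required-field list, and the local `dlq` list are illustrative assumptions — the real processor publishes failures to the clickstream-dlq topic:

```python
class ValidationError(Exception):
    pass

REQUIRED_FIELDS = ("user_id", "page_url", "timestamp")  # assumed schema

def validate(event: dict) -> None:
    for field in REQUIRED_FIELDS:
        if field not in event:
            raise ValidationError(f"Missing required field: {field}")

dlq = []  # stand-in for the clickstream-dlq topic

def process(event: dict) -> bool:
    try:
        validate(event)
        # ... real aggregation logic would go here ...
        return True
    except ValidationError as err:
        dlq.append({"event": event, "error": str(err)})
        return False  # keep consuming; don't crash the processor

process({"user_id": "u1", "timestamp": 1})  # missing page_url -> routed to DLQ
process({"user_id": "u1", "page_url": "/home", "timestamp": 2})
print(len(dlq))  # -> 1
```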
- Kafka UI: http://localhost:8090
- DLQ Inspection:
docker exec -it kafka /bin/kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic clickstream-dlq \
--from-beginning
# Count page views
page_views[page] += 1
# Calculate running averages
avg_duration = total_duration / view_count
# Track user sessions
user_sessions[session_id] = {
'pages': [...],
'duration': ...,
'start_time': ...
}
Fixed-size, non-overlapping time windows:
[08:00-08:05] [08:05-08:10] [08:10-08:15]
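Assigning an event to its tumbling window reduces to flooring the timestamp to the window size — a sketch of the idea, assuming epoch-second timestamps:

```python
WINDOW_SIZE = 300  # 5 minutes, matching self.window_size

def tumbling_window(ts: float, size: int = WINDOW_SIZE) -> tuple:
    """Return the [start, end) bounds of the tumbling window containing ts."""
    start = int(ts) - int(ts) % size
    return (start, start + size)

# 08:12:34 (expressed as seconds since midnight) falls in [08:10, 08:15)
assert tumbling_window(8 * 3600 + 12 * 60 + 34) == (8 * 3600 + 10 * 60, 8 * 3600 + 15 * 60)
```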
Per-user sessions with inactivity timeout:
User activity → ... → 10 min gap → Session ends
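A session window closes when the gap since the user's last event exceeds the timeout. A minimal sketch of that bookkeeping (state shapes are illustrative, not the processor's actual data structures):

```python
SESSION_TIMEOUT = 600  # 10 minutes, matching self.session_timeout

last_seen = {}  # user_id -> timestamp of that user's last event
sessions = {}   # user_id -> pages viewed in the current session

def on_event(user_id: str, page: str, ts: float) -> None:
    prev = last_seen.get(user_id)
    if prev is not None and ts - prev > SESSION_TIMEOUT:
        ended = sessions.pop(user_id)  # gap exceeded: emit the old session
        print(f"[SESSION ENDED] {user_id}: {len(ended)} pages")
    sessions.setdefault(user_id, []).append(page)
    last_seen[user_id] = ts

on_event("user_0095", "/home", 0)
on_event("user_0095", "/products", 120)
on_event("user_0095", "/home", 120 + 601)  # gap > timeout: starts a new session
print(sessions["user_0095"])  # -> ['/home']
```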
Overlapping time windows (sliding):
Window 1: [08:00-08:10]
Window 2: [08:05-08:15] # 5-minute hop
Window 3: [08:10-08:20]
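Because hopping windows overlap, each event belongs to size/hop windows at once. A sketch that enumerates the windows covering a timestamp, under the assumption of epoch-second timestamps:

```python
WINDOW = 600  # 10-minute hopping window, matching self.hopping_window_size
HOP = 300     # 5-minute hop

def hopping_windows(ts: float, size: int = WINDOW, hop: int = HOP) -> list:
    """Return the [start, end) bounds of every window containing ts."""
    earliest = (int(ts) - size + hop) - (int(ts) - size + hop) % hop
    earliest = max(earliest, 0)
    return [(s, s + size) for s in range(earliest, int(ts) + 1, hop)
            if s <= ts < s + size]

# An event at t=700s is covered by the windows starting at 300s and 600s
print(hopping_windows(700))  # -> [(300, 900), (600, 1200)]
```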
# Automatic checkpointing every 100 events
if events_since_checkpoint >= 100:
save_state_to_disk()
commit_kafka_offsets()
# On startup
if state_file_exists():
load_previous_state()
resume_from_last_checkpoint()
try:
process_event(event)
except ValidationError as e:
send_to_dlq(event, error)
continue  # Keep processing
- Throughput: 10 events/second (scalable to thousands)
- Latency: < 100ms per event
- State Size: ~1MB per 10,000 events
- Checkpoint Overhead: ~50ms per 100 events
- Recovery Time: < 5 seconds for typical state
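The checkpoint-and-restore cycle described above can be sketched with an atomic write (json dump to a temp file, then os.replace). The file path and snapshot shape are illustrative assumptions, not the processor's actual on-disk format:

```python
import json
import os

STATE_FILE = "state_store/checkpoint.json"  # illustrative path

def save_checkpoint(state: dict, offset: int, path: str = STATE_FILE) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"state": state, "offset": offset}, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written file

def load_checkpoint(path: str = STATE_FILE):
    if not os.path.exists(path):
        return {}, 0  # cold start: empty state, beginning of topic
    with open(path) as f:
        snapshot = json.load(f)
    return snapshot["state"], snapshot["offset"]

save_checkpoint({"page_views": {"/home": 234}}, offset=11700)
state, offset = load_checkpoint()
print(offset)  # -> 11700
```

Restoring state and the committed offset together is what keeps replayed events and saved aggregates consistent after a crash.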
This system demonstrates capabilities for:
- E-commerce Analytics: Real-time trending products, conversion funnels
- User Behavior Analysis: Session tracking, page journeys, bounce rates
- Revenue Attribution: Revenue per page, customer lifetime value
- Anomaly Detection: Traffic spikes, suspicious patterns
- A/B Testing: Real-time experiment results
clickstream-analytics/
├── src/
│ ├── click_generator.py # Event generator (10 events/sec)
│ ├── clickstream_processor_native.py # Basic stateful processor
│ ├── clickstream_windows.py # Windowed aggregations
│ ├── fault_tolerant_processor.py # Fault tolerance + DLQ
│ └── bad_data_generator.py # Test error handling
├── state_store/ # Persistent state (created at runtime)
├── .gitignore
└── README.md
Edit src/click_generator.py:
generator.generate_stream(
events_per_second=10, # Adjust throughput
duration_seconds=300 # Adjust duration
)
Edit src/clickstream_windows.py:
self.window_size = 300 # Tumbling window (seconds)
self.session_timeout = 600 # Session timeout (seconds)
self.hopping_window_size = 600 # Hopping window size (seconds)
Edit src/fault_tolerant_processor.py:
self.checkpoint_interval = 100 # Events between checkpoints
- Grafana dashboards for real-time metrics
- Machine learning-based anomaly detection
- Multi-level aggregations (hourly, daily rollups)
- Integration with PostgreSQL for analytics storage
- User segmentation and cohort analysis
- Real-time recommendation engine
This is a portfolio/demonstration project. For production use, the following additional features would be needed:
- Authentication & authorization
- Encrypted connections
- High availability setup
- Distributed state stores
- Advanced monitoring & alerting
- Backup & disaster recovery
MIT License - See LICENSE file for details
Conrad Mba - Software Engineer | Data Engineering & Real-Time Systems | Kafka Specialist
- LinkedIn: https://www.linkedin.com/in/conrad-mba/
- Email: mbaconrad@gmail.com
- GitHub: https://github.com/ezechimere