A production-grade real-time click stream analytics system demonstrating Kafka Streams patterns, windowed aggregations, and fault-tolerant stream processing.
This system processes website click events in real-time, tracking user sessions, page views, and revenue metrics with persistent state stores and comprehensive error handling.
- Real-time Processing: Processes click events with sub-second latency
- Windowed Aggregations:
- Tumbling windows (5-minute fixed intervals)
- Session windows (user sessions with 10-minute timeout)
- Hopping windows (sliding 10-minute windows)
- Fault Tolerance:
- Persistent state stores (survives crashes)
- Automatic checkpointing every 100 events
- Exactly-once-style semantics via coordinated state checkpoints and offset commits
- Error Handling: Dead Letter Queue for failed events
- State Management: Crash recovery with state restoration
Click Generator → Kafka Topic (clickstream-events) → Stream Processors → Dead Letter Queue (clickstream-dlq)
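The events flowing through this pipeline are JSON payloads. A minimal sketch of what a click event might look like — field names here are illustrative assumptions, not the exact schema emitted by click_generator.py:

```python
import json
import time
import uuid

# Hypothetical click event: field names are illustrative, not the
# exact schema produced by click_generator.py.
def make_click_event(user_id: str, page_url: str, duration_s: float) -> dict:
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "page_url": page_url,
        "duration_s": duration_s,
        "timestamp": time.time(),
    }

event = make_click_event("user_0042", "/products", 12.5)
payload = json.dumps(event).encode("utf-8")  # bytes a producer would publish
print(json.loads(payload)["page_url"])  # -> /products
```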
Stream Processors:
- Basic Processor: Real-time stateful aggregations (page views, sessions, revenue)
- Windowed Processor: Time-based windows (tumbling, session, hopping)
- Fault-Tolerant Processor: Persistent state stores with crash recovery and error handling
- Apache Kafka (Confluent Platform 7.5.0) - Event streaming platform
- Python 3.8+ - Stream processing logic
- Docker Compose - Infrastructure orchestration
- Docker Desktop 20.10+
- Docker Compose 1.29+
- Python 3.8+
- Git
git clone https://github.com/ezechimere/clickstream-analytics.git
cd clickstream-analytics
cd ../../../docker
docker-compose up -d
Wait 60 seconds for all services to start.
docker exec -it kafka bash
cd /bin
kafka-topics --create \
--bootstrap-server localhost:9092 \
--topic clickstream-events \
--partitions 3 \
--replication-factor 1
kafka-topics --create \
--bootstrap-server localhost:9092 \
--topic clickstream-dlq \
--partitions 1 \
--replication-factor 1
exit
python -m venv venv
source venv/Scripts/activate # Windows Git Bash
# or: .\venv\Scripts\Activate.ps1 # PowerShell
# or: source venv/bin/activate # Linux/Mac
pip install kafka-python
Terminal 1 - Start Basic Processor:
python -u src/clickstream_processor_native.py
Terminal 2 - Generate Click Events:
python -u src/click_generator.py
Output:
[PAGE STATS] /home: 20 views, avg duration: 145.2s
[SESSION] user_0042: 10 pages in session session_user_0042_...
[POPULAR] /products reached 100 views!
==========================================================
TOP 5 PAGES:
1. /home: 234 views, avg 145.2s
2. /products: 198 views, avg 138.5s
...
==========================================================
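The TOP 5 report above boils down to sorting per-page counters and dividing accumulated duration by view count. A minimal sketch (variable names are illustrative, not the processor's internals):

```python
from collections import defaultdict

# Running per-page state, as the basic processor would accumulate it
page_views = defaultdict(int)
total_duration = defaultdict(float)

for page, duration in [("/home", 120.0), ("/home", 170.4), ("/products", 138.5)]:
    page_views[page] += 1
    total_duration[page] += duration

# Top-5 report: sort by view count, compute the running average on the fly
top = sorted(page_views.items(), key=lambda kv: kv[1], reverse=True)[:5]
for rank, (page, views) in enumerate(top, 1):
    avg = total_duration[page] / views
    print(f"{rank}. {page}: {views} views, avg {avg:.1f}s")
```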
Run Windowed Processor:
python -u src/clickstream_windows.py
Output:
======================================================================
[TUMBLING WINDOW] 08:10 - 08:15
Total Views: 551
Unique Users: 28
Top Pages:
1. /search: 42 views
2. /product/2: 40 views
3. /product/3: 40 views
======================================================================
[SESSION ENDED] user_0095
Duration: 26.6 minutes
Pages Viewed: 79
Path: /home -> /products -> /cart -> /checkout -> /account
[HOPPING WINDOW] Last 10 minutes:
Events in window: 2900
Trending: /cart (210 views)
Run Fault-Tolerant Processor:
python -u src/fault_tolerant_processor.py
Test Crash Recovery:
- Let processor run and checkpoint
- Press Ctrl+C (simulates crash)
- Restart:
python -u src/fault_tolerant_processor.py
- State restored from disk automatically
Test Error Handling:
# Generate bad data
python -u src/bad_data_generator.py
Output:
[VALIDATION ERROR] Missing required field: page_url
[DLQ] Sent event to dead letter queue
# Processing continues without crashing
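The validate-or-route-to-DLQ pattern shown above can be sketched as follows. The ValidationError class, the required-field list, and the local `dlq` list are illustrative assumptions — the real processor publishes failures to the clickstream-dlq topic:

```python
class ValidationError(Exception):
    pass

REQUIRED_FIELDS = ("user_id", "page_url", "timestamp")  # assumed schema

def validate(event: dict) -> None:
    for field in REQUIRED_FIELDS:
        if field not in event:
            raise ValidationError(f"Missing required field: {field}")

dlq = []  # stand-in for the clickstream-dlq topic

def process(event: dict) -> bool:
    try:
        validate(event)
        # ... real aggregation logic would go here ...
        return True
    except ValidationError as err:
        dlq.append({"event": event, "error": str(err)})
        return False  # keep consuming; don't crash the processor

process({"user_id": "u1", "timestamp": 1})  # missing page_url -> routed to DLQ
process({"user_id": "u1", "page_url": "/home", "timestamp": 2})
print(len(dlq))  # -> 1
```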
- Kafka UI: http://localhost:8090
- DLQ Inspection:
docker exec -it kafka /bin/kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic clickstream-dlq \
--from-beginning
# Count page views
page_views[page] += 1
# Calculate running averages
avg_duration = total_duration / view_count
# Track user sessions
user_sessions[session_id] = {
'pages': [...],
'duration': ...,
'start_time': ...
}
Fixed-size, non-overlapping time windows:
[08:00-08:05] [08:05-08:10] [08:10-08:15]
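Assigning an event to its tumbling window reduces to flooring the timestamp to the window size — a sketch of the idea, assuming epoch-second timestamps:

```python
WINDOW_SIZE = 300  # 5 minutes, matching self.window_size

def tumbling_window(ts: float, size: int = WINDOW_SIZE) -> tuple:
    """Return the [start, end) bounds of the tumbling window containing ts."""
    start = int(ts) - int(ts) % size
    return (start, start + size)

# 08:12:34 (expressed as seconds since midnight) falls in [08:10, 08:15)
assert tumbling_window(8 * 3600 + 12 * 60 + 34) == (8 * 3600 + 10 * 60, 8 * 3600 + 15 * 60)
```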
Per-user sessions with inactivity timeout:
User activity → ... → 10 min gap → Session ends
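A session window closes when the gap since the user's last event exceeds the timeout. A minimal sketch of that bookkeeping (state shapes are illustrative, not the processor's actual data structures):

```python
SESSION_TIMEOUT = 600  # 10 minutes, matching self.session_timeout

last_seen = {}  # user_id -> timestamp of that user's last event
sessions = {}   # user_id -> pages viewed in the current session

def on_event(user_id: str, page: str, ts: float) -> None:
    prev = last_seen.get(user_id)
    if prev is not None and ts - prev > SESSION_TIMEOUT:
        ended = sessions.pop(user_id)  # gap exceeded: emit the old session
        print(f"[SESSION ENDED] {user_id}: {len(ended)} pages")
    sessions.setdefault(user_id, []).append(page)
    last_seen[user_id] = ts

on_event("user_0095", "/home", 0)
on_event("user_0095", "/products", 120)
on_event("user_0095", "/home", 120 + 601)  # gap > timeout: starts a new session
print(sessions["user_0095"])  # -> ['/home']
```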
Overlapping time windows (sliding):
Window 1: [08:00-08:10]
Window 2: [08:05-08:15] # 5-minute hop
Window 3: [08:10-08:20]
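Because hopping windows overlap, each event belongs to size/hop windows at once. A sketch that enumerates the windows covering a timestamp, under the assumption of epoch-second timestamps:

```python
WINDOW = 600  # 10-minute hopping window, matching self.hopping_window_size
HOP = 300     # 5-minute hop

def hopping_windows(ts: float, size: int = WINDOW, hop: int = HOP) -> list:
    """Return the [start, end) bounds of every window containing ts."""
    earliest = (int(ts) - size + hop) - (int(ts) - size + hop) % hop
    earliest = max(earliest, 0)
    return [(s, s + size) for s in range(earliest, int(ts) + 1, hop)
            if s <= ts < s + size]

# An event at t=700s is covered by the windows starting at 300s and 600s
print(hopping_windows(700))  # -> [(300, 900), (600, 1200)]
```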
# Automatic checkpointing every 100 events
if events_since_checkpoint >= 100:
save_state_to_disk()
commit_kafka_offsets()
# On startup
if state_file_exists():
load_previous_state()
resume_from_last_checkpoint()
try:
process_event(event)
except ValidationError as e:
send_to_dlq(event, error)
continue  # Keep processing
- Throughput: 10 events/second (scalable to thousands)
- Latency: < 100ms per event
- State Size: ~1MB per 10,000 events
- Checkpoint Overhead: ~50ms per 100 events
- Recovery Time: < 5 seconds for typical state
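The checkpoint-and-restore cycle described above can be sketched with an atomic write (json dump to a temp file, then os.replace). The file path and snapshot shape are illustrative assumptions, not the processor's actual on-disk format:

```python
import json
import os

STATE_FILE = "state_store/checkpoint.json"  # illustrative path

def save_checkpoint(state: dict, offset: int, path: str = STATE_FILE) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"state": state, "offset": offset}, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written file

def load_checkpoint(path: str = STATE_FILE):
    if not os.path.exists(path):
        return {}, 0  # cold start: empty state, beginning of topic
    with open(path) as f:
        snapshot = json.load(f)
    return snapshot["state"], snapshot["offset"]

save_checkpoint({"page_views": {"/home": 234}}, offset=11700)
state, offset = load_checkpoint()
print(offset)  # -> 11700
```

Restoring state and the committed offset together is what keeps replayed events and saved aggregates consistent after a crash.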
This system demonstrates capabilities for:
- E-commerce Analytics: Real-time trending products, conversion funnels
- User Behavior Analysis: Session tracking, page journeys, bounce rates
- Revenue Attribution: Revenue per page, customer lifetime value
- Anomaly Detection: Traffic spikes, suspicious patterns
- A/B Testing: Real-time experiment results
clickstream-analytics/
├── src/
│ ├── click_generator.py # Event generator (10 events/sec)
│ ├── clickstream_processor_native.py # Basic stateful processor
│ ├── clickstream_windows.py # Windowed aggregations
│ ├── fault_tolerant_processor.py # Fault tolerance + DLQ
│ └── bad_data_generator.py # Test error handling
├── state_store/ # Persistent state (created at runtime)
├── .gitignore
└── README.md
Edit src/click_generator.py:
generator.generate_stream(
events_per_second=10, # Adjust throughput
duration_seconds=300 # Adjust duration
)
Edit src/clickstream_windows.py:
self.window_size = 300 # Tumbling window (seconds)
self.session_timeout = 600 # Session timeout (seconds)
self.hopping_window_size = 600 # Hopping window size (seconds)
Edit src/fault_tolerant_processor.py:
self.checkpoint_interval = 100 # Events between checkpoints
- Grafana dashboards for real-time metrics
- Machine learning-based anomaly detection
- Multi-level aggregations (hourly, daily rollups)
- Integration with PostgreSQL for analytics storage
- User segmentation and cohort analysis
- Real-time recommendation engine
This is a portfolio/demonstration project. For production use, the following additional features would be needed:
- Authentication & authorization
- Encrypted connections
- High availability setup
- Distributed state stores
- Advanced monitoring & alerting
- Backup & disaster recovery
MIT License - See LICENSE file for details
Conrad Mba - Software Engineer | Data Engineering & Real-Time Systems | Kafka Specialist
- LinkedIn: https://www.linkedin.com/in/conrad-mba/
- Email: mbaconrad@gmail.com
- GitHub: https://github.com/ezechimere