An interactive browser-based simulation game that teaches distributed systems through hands-on incident response. Fix real infrastructure problems — no cluster required.
Distributed Ops Game drops you into running production systems that are starting to fail. Your job: diagnose the issue, apply the right configuration, and restore system health before the system crashes.
Every scenario is grounded in a real-world engineering situation, from a pizza ordering platform overwhelmed at rush hour, to a Redis cluster in split-brain, to a Flink job with unbounded state growth. Each technology's simulation engine models that technology's actual behavior in TypeScript, so the mechanics you learn transfer directly to real systems.
150 scenarios across 5 technology tracks, from beginner through master difficulty.
| Technology | Scenarios | Concepts Covered |
|---|---|---|
| 🟠 Apache Kafka | 30 | Topics, Partitions, Consumer Groups, Replication, Exactly-Once, Kafka Streams, Schema Registry, MirrorMaker |
| 🔴 Redis | 30 | Data Structures, Pub/Sub, Streams, Persistence, Sentinel, Cluster, Eviction, Redlock, RediSearch |
| 🟡 Elasticsearch | 30 | Shards, Mappings, Query DSL, Aggregations, ILM, CCR, Snapshots, ML Anomaly Detection, EQL |
| 🔵 Apache Flink | 30 | Windowing, Watermarks, Checkpointing, State Backends, Backpressure, Exactly-Once, CEP, Rescaling |
| 🟣 RabbitMQ | 30 | Exchanges, Queues, Routing, Publisher Confirms, Dead Letters, Quorum Queues, Streams, Federation |
All 5 tracks are available from the start — no cross-technology unlock requirements.
- Choose a technology track — select from Kafka, Redis, Elasticsearch, Flink, or RabbitMQ
- Read the briefing — understand the system architecture and the symptom being reported
- Watch the simulation — entities animate in real time with flowing message particles
- Diagnose the failure — use the metrics panel (throughput, latency, error rate, health score) to identify the root cause
- Apply the fix — adjust configuration in the control panel
- Sustain recovery — hold the fix for 10 consecutive ticks to win
Score is based on time taken, hints used, final system health, and a correctness penalty for duplicates. Each scenario within a track unlocks the next.
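The unlock rule can be sketched as below. This is a minimal illustration, not the game's code: `TrackProgress` and `isScenarioUnlocked` are hypothetical names, and the sketch assumes a scenario counts as completed once it has a recorded score.

```typescript
// Hypothetical progress shape: best score per completed scenario, keyed by 0-based index.
type TrackProgress = { bestScores: Record<number, number> };

// The first scenario of a track is always open; each later one opens
// once the scenario directly before it has a recorded score.
function isScenarioUnlocked(progress: TrackProgress, index: number): boolean {
  if (index === 0) return true;
  return (index - 1) in progress.bestScores;
}

const kafkaProgress: TrackProgress = { bestScores: { 0: 840, 1: 610 } };
console.log(isScenarioUnlocked(kafkaProgress, 2)); // true: scenario 1 is done
console.log(isScenarioUnlocked(kafkaProgress, 3)); // false: scenario 2 is not
```

Because progress is kept per track, the same check applies independently to each of the five technologies.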
| # | Scenario | Difficulty | Primary Concept |
|---|---|---|---|
| 1 | Pizza Order System | Beginner | Consumer lag, max.poll.records |
| 2 | Flash Sale Inventory | Easy | Partitions, consumer groups |
| 3 | Ride-Sharing Dispatch | Easy | Message keys, partition routing |
| 4 | Chat App Fan-Out | Easy | Multiple consumer groups, auto.offset.reset |
| 5 | IoT Sensor Pipeline | Medium | Batching, linger.ms, compression |
| 6 | Stock Market Data Feed | Medium | Key strategy, per-symbol ordering |
| 7 | Payment Gateway | Medium | Idempotent producer, acks=all |
| 8 | Log Aggregation Pipeline | Medium | Retention (time + size), cleanup.policy |
| 9 | E-Commerce Order Pipeline | Medium | Transactional read-process-write |
| 10 | Audit Log Compliance | Medium | Replication factor, min.insync.replicas |
| 11 | Real-Time Analytics Dashboard | Medium-Hard | Manual commit, offset reset |
| 12 | Video Streaming Platform | Medium-Hard | Large messages, fetch.max.bytes |
| 13 | Supply Chain Event Tracker | Medium-Hard | Transactions, isolation.level |
| 14 | Gaming Leaderboard | Medium-Hard | Partition scaling, rebalance |
| 15 | Healthcare Patient Monitor | Hard | session.timeout.ms, SLA enforcement |
| 16 | Microservices Event Bus | Hard | Dead letter queue, retry logic |
| 17 | Database CDC Sync | Hard | Log compaction, exactly-once |
| 18 | Fraud Detection Engine | Expert | Kafka Streams, stateful windowing |
| 19 | Schema Registry Migration | Expert | Schema evolution, BACKWARD compatibility |
| 20 | Multi-DC Disaster Recovery | Master | MirrorMaker, geo-replication lag |
| 21 | Log Compaction Deep Dive | Hard | Tombstones, compaction lag, cleaner threads |
| 22 | Consumer Rebalance Storm | Hard | Eager vs cooperative sticky rebalancing |
| 23 | Quota Throttling Crisis | Hard | Producer/consumer byte-rate quotas |
| 24 | Kafka Connect — JDBC Sink | Hard | Sink connector, error tolerance, DLQ |
| 25 | Debezium CDC Source | Hard | CDC, binlog offset, idempotent producer |
| 26 | Schema Forward Compatibility | Expert | FORWARD compat, field removal, Avro unions |
| 27 | Partition Leadership Imbalance | Expert | Preferred replica election, leader skew |
| 28 | Active-Active Geo-Replication | Expert | MirrorMaker 2, cycle detection |
| 29 | ACL & SASL Security Incident | Expert | SASL/PLAIN, ACLs, authorization failures |
| 30 | Multi-Tenant Cluster Isolation | Master | Quotas per client-id, namespace isolation |
| # | Scenario | Difficulty | Primary Concept |
|---|---|---|---|
| 1 | Session Cache Miss Storm | Beginner | GET/SET, TTL, cache-aside pattern |
| 2 | Leaderboard Sorted Set | Easy | ZADD/ZRANGE, sorted sets |
| 3 | Shopping Cart Hash | Easy | HSET/HGET, hash operations |
| 4 | Rate Limiter Race Condition | Easy | INCR + EXPIRE, atomic operations |
| 5 | Pub/Sub Fan-Out Failure | Easy | PUBLISH/SUBSCRIBE vs Streams |
| 6 | Task Queue Data Loss | Medium | LPUSH/BRPOPLPUSH, reliable queue |
| 7 | Cache Stampede | Medium | Thundering herd, mutex lock |
| 8 | Inventory Race Condition | Medium | WATCH/MULTI/EXEC transactions |
| 9 | Bloom Filter Memory | Medium | Probabilistic structures, false positives |
| 10 | Geospatial Delivery Zones | Medium | GEOADD/GEORADIUS, spatial queries |
| 11 | RDB Snapshot Blocking | Medium | BGSAVE, fork, COW, latency spikes |
| 12 | AOF Rewrite Overhead | Medium | appendfsync, AOF rewrite, disk I/O |
| 13 | Memory Eviction Crisis | Medium-Hard | maxmemory-policy, LRU vs LFU |
| 14 | Streams Consumer Group | Medium-Hard | XADD/XREADGROUP, pending entries |
| 15 | Keyspace Notification Flood | Medium-Hard | notify-keyspace-events, filtering |
| 16 | Lua Script Blocking | Medium-Hard | EVAL, event loop, atomicity |
| 17 | Pipeline Throughput | Medium-Hard | Pipelining, RTT reduction |
| 18 | Sentinel Failover | Hard | Sentinel quorum, leader election |
| 19 | Cluster Slot Resharding | Hard | CLUSTER RESHARD, MOVED redirects |
| 20 | Hot Key Overload | Hard | Key sharding, read replicas |
| 21 | Redlock Race Condition | Hard | SET NX PX, fencing tokens |
| 22 | Replica Lag Under Load | Hard | repl-backlog-size, partial resync |
| 23 | Connection Pool Exhaustion | Hard | maxclients, connection multiplexing |
| 24 | Large Value Fragmentation | Hard | OBJECT ENCODING, compression |
| 25 | Time Series High Cardinality | Expert | RedisTimeSeries, downsampling |
| 26 | RediSearch Index Corruption | Expert | FT.CREATE, SORTABLE, query optimization |
| 27 | Transaction Isolation Failure | Expert | MULTI/EXEC, WATCH, retry backoff |
| 28 | Cluster Brain-Split | Expert | Quorum, cluster-require-full-coverage |
| 29 | ACL Security Breach | Expert | ACL SETUSER, command categories |
| 30 | Active-Active Geo-Replication | Master | CRDT, conflict resolution, causal consistency |
| # | Scenario | Difficulty | Primary Concept |
|---|---|---|---|
| 1 | Unassigned Shards | Beginner | Primary shard allocation, cluster yellow/red |
| 2 | Index Not Found | Easy | Index creation, dynamic vs explicit mappings |
| 3 | Slow Query | Easy | match vs term queries, _source filtering |
| 4 | Mapping Conflict | Easy | Field type mismatch, strict mapping |
| 5 | Over-Sharding OOM | Medium | Shard sizing, heap per shard |
| 6 | Relevance Tuning | Medium | BM25 scoring, field boost |
| 7 | Analyzer Mismatch | Medium | standard vs keyword analyzers |
| 8 | Nested Object Query | Medium | nested field type, nested query |
| 9 | Aggregation Memory OOM | Medium | Terms agg circuit breaker |
| 10 | Index Template Migration | Medium | Template priority, component templates |
| 11 | Reindex Performance | Medium-Hard | Sliced scroll, pipeline ingest |
| 12 | Disk Watermark Breach | Medium-Hard | flood_stage, read-only index |
| 13 | Split-Brain Cluster | Medium-Hard | Master quorum, voting config |
| 14 | Alias Rollover | Medium-Hard | Write alias, ILM rollover |
| 15 | Ingest Pipeline Failure | Medium-Hard | Enrich processor, GeoIP, refresh |
| 16 | ILM Policy Misconfiguration | Medium-Hard | hot/warm/cold/delete phases |
| 17 | Cross-Cluster Replication Lag | Hard | CCR leader/follower, lag monitoring |
| 18 | Snapshot Restore Failure | Hard | SLM policy, partial restore |
| 19 | Deep Pagination OOM | Hard | search_after, point-in-time |
| 20 | Circuit Breaker Tripping | Hard | Request/fielddata breakers, heap |
| 21 | Security Role Mapping | Hard | Document-level security, field masking |
| 22 | Watcher Alert Latency | Hard | Trigger, condition, action throttle |
| 23 | EQL Sequence Matching | Expert | EQL syntax, max_span |
| 24 | ML Anomaly Detection | Expert | Datafeed, job state, index patterns |
| 25 | Runtime Field Performance | Expert | Painless scripts, doc_values |
| 26 | Async Search | Expert | Long-running queries, status polling |
| 27 | Percolator Queries | Expert | Document matching, alerting |
| 28 | Geo-Shape Indexing | Expert | geo_shape, BKD tree, spatial |
| 29 | Transform Pivot Aggregation | Expert | Transforms, checkpointing |
| 30 | Cross-Cluster Search | Master | CCS, skip_unavailable, minimize_roundtrips |
| # | Scenario | Difficulty | Primary Concept |
|---|---|---|---|
| 1 | DataStream Backpressure | Beginner | Operator chaining, throughput |
| 2 | Tumbling Window Late Data | Easy | Window triggers, allowedLateness |
| 3 | Event Time Semantics | Easy | Watermarks, out-of-order records |
| 4 | Unbounded ValueState | Easy | StateTtlConfig, key TTL |
| 5 | Sliding Window Memory | Medium | Window pane overhead |
| 6 | Session Window Timeout | Medium | Session gap, dynamic sessions |
| 7 | Checkpoint Failure | Medium | Checkpoint barriers, recovery |
| 8 | Savepoint Migration | Medium | Operator UIDs, restore |
| 9 | Kafka Source Offset | Medium | scan.startup.mode, backfill |
| 10 | Side Output Late Data | Medium | Tagged outputs, late events |
| 11 | Async I/O DB Lookup | Medium-Hard | AsyncFunction, capacity |
| 12 | RocksDB State Backend | Medium-Hard | Heap vs RocksDB, incremental checkpoints |
| 13 | Broadcast State Rules | Medium-Hard | BroadcastStream, dynamic config |
| 14 | Temporal Join | Medium-Hard | Versioned table, event-time join |
| 15 | Watermark Alignment | Medium-Hard | Multi-source drift, idle timeout |
| 16 | Late Data Side Output | Medium-Hard | allowedLateness, side output |
| 17 | Task Manager OOM | Hard | Managed memory fraction, network buffers |
| 18 | State Rescaling | Hard | Key group redistribution, savepoint |
| 19 | Exactly-Once Sink | Hard | TwoPhaseCommitSinkFunction, pre-commit |
| 20 | CEP Pattern Matching | Hard | Strict/relaxed contiguity |
| 21 | State TTL Cleanup | Hard | StateTtlConfig, background cleanup |
| 22 | Dynamic Parallelism | Hard | Per-operator parallelism, auto-scaling |
| 23 | Flink SQL CDC Pipeline | Expert | CDC connector, upsert Kafka sink |
| 24 | Temporal Join Versioned Table | Expert | Versioned table, changelog mode |
| 25 | Prometheus Metrics | Expert | MetricGroup, custom reporters |
| 26 | Multi-Sink Fan-Out | Expert | Independent exactly-once per sink |
| 27 | Changelog Compaction | Expert | CHANGELOG_MODE, retract vs upsert |
| 28 | Kubernetes HA | Expert | Application mode, HA config |
| 29 | Global Window Trigger | Master | Custom trigger, purge logic |
| 30 | Unified Batch + Streaming | Master | BATCH execution mode, bounded source |
| # | Scenario | Difficulty | Primary Concept |
|---|---|---|---|
| 1 | Queue Overflow | Beginner | max-length, consumer overload |
| 2 | Direct Exchange Routing | Easy | Routing keys, bindings |
| 3 | Fanout Broadcast | Easy | Fanout exchange, multi-queue |
| 4 | Topic Exchange Wildcards | Easy | #/* routing patterns |
| 5 | Message TTL Expiry | Medium | Per-message TTL, x-message-ttl |
| 6 | Dead Letter Infinite Loop | Medium | DLX, nack requeue=false |
| 7 | Priority Queue Starvation | Medium | x-max-priority, fairness |
| 8 | Manual Acknowledgements | Medium | manual-ack, nack on failure |
| 9 | Publisher Confirms | Medium | confirm mode, retry on nack |
| 10 | Prefetch Throttling | Medium | basic.qos, unacked limits |
| 11 | Lazy Queue Memory | Medium-Hard | Lazy mode, disk spooling |
| 12 | Headers Exchange | Medium-Hard | x-match all/any, complex routing |
| 13 | Classic Mirrored HA | Medium-Hard | ha-mode, ha-sync-mode |
| 14 | Shovel Plugin | Medium-Hard | Shovel config, frame max |
| 15 | Federation Link | Medium-Hard | Federation upstream, link state |
| 16 | Vhost Isolation | Medium-Hard | Virtual hosts, per-vhost limits |
| 17 | Memory Alarm Blocking | Hard | vm_memory_high_watermark, flow control |
| 18 | Disk Free Alarm | Hard | disk_free_limit, publish blocking |
| 19 | Quorum Queue Election | Hard | Raft consensus, leader election |
| 20 | Classic → Quorum Migration | Hard | Drain-and-delete migration |
| 21 | Split-Brain Partition | Hard | cluster_partition_handling, autoheal |
| 22 | Connection Storm | Hard | channel_max, connection pooling |
| 23 | OAuth 2.0 Auth | Hard | rabbitmq-auth-backend-oauth2, JWT |
| 24 | Per-User Rate Limiting | Hard | Credit flow, per-connection rate |
| 25 | Consistent Hash Exchange | Expert | Sharded queues, slot redistribution |
| 26 | Stream Queue Throughput | Expert | RabbitMQ Streams, publisher offsets |
| 27 | Stream Offset Replay | Expert | Offset spec, timestamp-based restart |
| 28 | Delayed Message Exchange | Expert | rabbitmq-delayed-message-exchange |
| 29 | Multi-AZ Active-Passive | Expert | Quorum queues, node evacuation |
| 30 | Cross-Protocol AMQP → MQTT | Master | MQTT plugin, QoS levels, session |
| Concern | Library |
|---|---|
| UI | React 18 + TypeScript |
| Build | Vite 6 |
| Styling | Tailwind CSS 4 |
| State | Zustand 5 |
| Animations | Framer Motion 12 |
| Charts | Recharts 3 |
| Testing | Vitest + @testing-library/react |
The simulation engines (src/technologies/*/engine/) are pure TypeScript with no React dependency — they can be unit-tested without a browser and run headless.
# Install dependencies
npm install
# Start dev server
npm run dev
# Run tests
npm test
# Production build
npm run build
Requires Node 18+. No external services needed; everything runs in-browser.
src/
├── technologies/ # Per-technology engines + scenarios
│ ├── types.ts # TechKey, TechDefinition, TECH_DEFINITIONS
│ ├── kafka/
│ │ ├── engine/ # Kafka simulation engine (11 modules)
│ │ └── scenarios/ # 30 Kafka scenario definitions
│ ├── redis/
│ │ ├── engine/ # Redis simulation engine
│ │ └── scenarios/ # 30 Redis scenario definitions
│ ├── elasticsearch/
│ │ ├── engine/ # Elasticsearch simulation engine
│ │ └── scenarios/ # 30 ES scenario definitions
│ ├── flink/
│ │ ├── engine/ # Flink simulation engine
│ │ └── scenarios/ # 30 Flink scenario definitions
│ └── rabbitmq/
│ ├── engine/ # RabbitMQ simulation engine
│ └── scenarios/ # 30 RabbitMQ scenario definitions
│
├── store/
│ ├── gameStore # Phase, tech selection, per-tech progress (localStorage)
│ ├── simulationStore # Live snapshot from active engine
│ └── metricsStore # 300-tick circular buffer for charts
│
└── components/
├── screens/
│ ├── TechnologyLobby # 5-technology selection screen
│ ├── MainMenu # Per-tech scenario grid
│ └── GameScreen # Simulation canvas + control panel
├── canvas/ # SimulationCanvas, node components, particles
├── panels/ # ControlPanel — per-entity config UI
├── metrics/ # MetricsPanel, Recharts charts
└── tutorial/ # HintPanel, scenario briefing
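The 300-tick circular buffer noted for metricsStore can be illustrated with a small ring buffer. This is a sketch under assumptions: `RingBuffer` is a hypothetical class, not the store's actual implementation, and the real store may keep richer per-tick records than plain numbers.

```typescript
// Fixed-capacity ring buffer: once full, each push overwrites the oldest entry.
class RingBuffer<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;
  private start = 0; // index of the oldest entry once the buffer is full

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length < this.capacity) {
      this.items.push(item);
    } else {
      this.items[this.start] = item;
      this.start = (this.start + 1) % this.capacity;
    }
  }

  // Entries in chronological order, oldest first, e.g. to feed a chart.
  toArray(): T[] {
    return [...this.items.slice(this.start), ...this.items.slice(0, this.start)];
  }
}

const history = new RingBuffer<number>(300);
for (let tick = 0; tick < 450; tick++) history.push(tick);
const points = history.toArray();
console.log(points.length, points[0], points[points.length - 1]); // 300 150 449
```

Capping the history keeps memory constant no matter how long a scenario runs, while charts still see the most recent 300 ticks.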
Each engine runs a 100ms tick loop (scalable to 2×/4× speed):
- Entity step: update entity states based on current config
- Failure injector: fire scripted FailureEvent at the right tick
- Metrics step: compute health score, error rate, throughput
- Victory check: evaluate conditions; 10 consecutive passes = win
- Emit: push snapshot to simulationStore
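The five steps of a tick can be sketched as a toy engine. Everything here is an illustrative assumption rather than the game's code: the `SimState`, `Snapshot`, and `MiniEngine` names, the fields given to `FailureEvent`, the lag-based health formula, and the 90-point victory threshold. A plain callback stands in for the push to simulationStore.

```typescript
// Illustrative shapes only; the game's real types may differ.
interface SimState { producerRate: number; consumerRate: number; lag: number }
interface FailureEvent { atTick: number; apply: (state: SimState) => void }
interface Snapshot { tick: number; lag: number; healthScore: number; won: boolean }

const TICKS_TO_WIN = 10;

class MiniEngine {
  private tick = 0;
  private streak = 0;
  private readonly state: SimState;
  private readonly failures: FailureEvent[];
  private readonly emit: (snapshot: Snapshot) => void;

  constructor(state: SimState, failures: FailureEvent[], emit: (snapshot: Snapshot) => void) {
    this.state = state;
    this.failures = failures;
    this.emit = emit;
  }

  // Stand-in for the control panel: the player changes configuration.
  configure(patch: Partial<SimState>): void {
    Object.assign(this.state, patch);
  }

  step(): Snapshot {
    this.tick += 1;
    const s = this.state;
    // 1. Entity step: the backlog grows or drains according to current config.
    s.lag = Math.max(0, s.lag + s.producerRate - s.consumerRate);
    // 2. Failure injector: fire any event scripted for this tick.
    for (const f of this.failures) if (f.atTick === this.tick) f.apply(s);
    // 3. Metrics step: toy health score in [0, 100].
    const healthScore = Math.max(0, 100 - s.lag);
    // 4. Victory check: count consecutive passing ticks, reset on any failure.
    this.streak = healthScore >= 90 ? this.streak + 1 : 0;
    // 5. Emit: hand the snapshot to whoever is listening.
    const snapshot: Snapshot = { tick: this.tick, lag: s.lag, healthScore, won: this.streak >= TICKS_TO_WIN };
    this.emit(snapshot);
    return snapshot;
  }
}

const emitted: Snapshot[] = [];
const engine = new MiniEngine(
  { producerRate: 10, consumerRate: 10, lag: 0 },
  [{ atTick: 3, apply: (s) => { s.producerRate = 30; } }], // scripted traffic spike
  (snap) => { emitted.push(snap); },
);
for (let i = 0; i < 5; i++) engine.step(); // failure fires at tick 3, lag builds up
engine.configure({ consumerRate: 50 });    // the player's fix
let last = engine.step();
while (!last.won) last = engine.step();
console.log(last.tick); // 16: lag drains by tick 7, then 10 healthy ticks in a row
```

Because the sketch has no React or DOM dependency, it can be stepped synchronously in a unit test, as above. In the browser, a timer such as `setInterval(() => engine.step(), 100 / speed)` would be one way to drive it at 1×, 2×, or 4× speed.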
score = 1000
− (secondsTaken × 5) # time penalty
− (hintsUsed × 50) # hint penalty
+ (finalHealthScore × 2) # health bonus (max +200)
− (duplicates × 10) # correctness penalty
Stars: ≥ 800 → 3★, ≥ 500 → 2★, ≥ 1 → 1★
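As a worked example, the formula and star thresholds translate to the sketch below. The `RunResult`, `computeScore`, and `starsFor` names are hypothetical. The sketch assumes finalHealthScore ranges from 0 to 100 (implied by the "max +200" note), applies no clamping or rounding, and assumes a score below 1 earns no stars, which the thresholds do not state.

```typescript
// Hypothetical input shape; field names mirror the terms in the formula.
interface RunResult {
  secondsTaken: number;
  hintsUsed: number;
  finalHealthScore: number; // assumed range 0..100
  duplicates: number;
}

function computeScore(r: RunResult): number {
  return 1000
    - r.secondsTaken * 5      // time penalty
    - r.hintsUsed * 50        // hint penalty
    + r.finalHealthScore * 2  // health bonus
    - r.duplicates * 10;      // correctness penalty
}

function starsFor(score: number): 0 | 1 | 2 | 3 {
  if (score >= 800) return 3;
  if (score >= 500) return 2;
  if (score >= 1) return 1;
  return 0; // assumption: below 1 point earns no stars
}

const quickRun = computeScore({ secondsTaken: 60, hintsUsed: 1, finalHealthScore: 95, duplicates: 0 });
console.log(quickRun, starsFor(quickRun)); // 840 3

const slowRun = computeScore({ secondsTaken: 120, hintsUsed: 2, finalHealthScore: 80, duplicates: 3 });
console.log(slowRun, starsFor(slowRun)); // 430 1
```

In the first run, 1000 − 300 − 50 + 190 − 0 = 840, which clears the 800-point bar for three stars. In the second, 1000 − 600 − 100 + 160 − 30 = 430, which falls short of 500 and earns one star.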
Progress and scores persist per technology to localStorage.
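One way per-technology persistence could look is sketched below. The key format, the `KeyValueStore` interface, the helper names, and the keep-the-best-score rule are all assumptions. `TechKey` is named in src/technologies/types.ts, but its definition is not shown here, so the union below is a guess. The storage is injected so the sketch also runs outside a browser.

```typescript
// Minimal subset of the Web Storage API used by this sketch.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Assumed definition; the real TechKey may differ.
type TechKey = "kafka" | "redis" | "elasticsearch" | "flink" | "rabbitmq";
type BestScores = Record<string, number>; // scenario id -> best score

// Hypothetical key format: one entry per technology.
const storageKey = (tech: TechKey): string => `distributed-ops:${tech}:scores`;

function loadScores(store: KeyValueStore, tech: TechKey): BestScores {
  const raw = store.getItem(storageKey(tech));
  if (raw === null) return {};
  try {
    return JSON.parse(raw) as BestScores;
  } catch {
    return {}; // corrupted entry: start fresh rather than crash
  }
}

// Assumption: only the best score per scenario is kept.
function saveScore(store: KeyValueStore, tech: TechKey, scenarioId: string, score: number): void {
  const scores = loadScores(store, tech);
  scores[scenarioId] = Math.max(scores[scenarioId] ?? 0, score);
  store.setItem(storageKey(tech), JSON.stringify(scores));
}

// In-memory stand-in; in the browser, window.localStorage satisfies the same interface.
const memory = new Map<string, string>();
const fakeStore: KeyValueStore = {
  getItem: (k) => memory.get(k) ?? null,
  setItem: (k, v) => { memory.set(k, v); },
};

saveScore(fakeStore, "kafka", "pizza-order-system", 840);
saveScore(fakeStore, "kafka", "pizza-order-system", 610); // lower score is ignored
console.log(loadScores(fakeStore, "kafka")); // { 'pizza-order-system': 840 }
console.log(loadScores(fakeStore, "redis")); // {}
```

Keeping one storage entry per technology means resetting or corrupting one track's data leaves the other four untouched.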
consumer-lag · partitions · consumer-groups · message-keys · auto.offset.reset · linger.ms · batch.size · compression · idempotent-producer · acks · transactions · isolation.level · retention · compaction · replication-factor · min.insync.replicas · manual-commit · max.request.size · session.timeout.ms · dead-letter-queue · schema-evolution · mirrormaker · kafka-connect · quotas · acl
strings · hashes · lists · sets · sorted-sets · streams · pub-sub · ttl · eviction · rdb · aof · pipelining · transactions · lua-scripts · sentinel · cluster · redlock · keyspace-notifications · bloom-filter · geospatial · timeseries · redisearch
shards · replicas · mappings · analyzers · query-dsl · aggregations · ilm · aliases · rollover · ccr · snapshots · ingest-pipelines · circuit-breakers · security · eql · ml-anomaly-detection · runtime-fields · async-search · percolator · transforms
datastream · windowing · watermarks · event-time · processing-time · checkpointing · savepoints · state-backends · rocksdb · backpressure · exactly-once · cep · broadcast-state · temporal-join · async-io · rescaling · flink-sql · kubernetes-ha
exchanges · queues · bindings · routing-keys · publisher-confirms · consumer-acks · prefetch · dead-letter-exchange · ttl · priority-queues · lazy-queues · quorum-queues · streams · federation · shovel · vhosts · flow-control · oauth2 · consistent-hash · mqtt