LabWatch Platform is a distributed, event-driven monitoring system designed to simulate real-world infrastructure alerting workflows.
It ingests machine telemetry, processes events asynchronously using Kafka, and manages alert lifecycle state using a microservice architecture.
- Event-driven architecture using Kafka
- Microservices built with Spring Boot
- Real-time alert processing pipeline
- Alert deduplication and lifecycle management
- Dockerized system for consistent deployment
- PostgreSQL-backed persistence layer
Client / Agent
↓
monitoring-api (REST ingestion)
↓
Kafka (health-events topic)
↓
alert-engine (async processing)
↓
ai-engine-service (anomaly detection)
↓
PostgreSQL (alerts + events)
- Registers machines/agents before telemetry ingestion
- Receives telemetry via REST (POST /api/v1/telemetry/snapshots)
- Optionally validates X-Agent-Token on ingestion
- Publishes events to Kafka topic (health-events)
- Consumes Kafka events asynchronously
- Applies threshold-based alert logic
- Prevents duplicate ACTIVE alerts
- Transitions alerts from ACTIVE → RESOLVED
- Persists alerts in PostgreSQL
- Consumes Kafka health-events
- Maintains rolling baselines per machine + metric type
- Detects anomalies with rolling average, standard deviation, and z-score
- Publishes anomaly messages to Kafka topic anomaly-events
- Persists detected anomalies in PostgreSQL
- Exposes REST API at GET /api/anomalies
- Agents can register through POST /api/v1/agents/register
- monitoring-api now tracks machines and agent records separately
- Agent auth can be enabled with LABWATCH_AGENT_AUTH_ENABLED=true
- Dashboard can switch between multiple reported machines while keeping the single-machine view intact
- Users can register and log in through POST /api/v1/auth/register and POST /api/v1/auth/login
- JWT auth is optional and disabled by default for local development
- Machines can remain unowned for backward compatibility, then be claimed later by a user
- Claimed machines are filtered to their owner when auth is enabled
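
The ownership rules above can be sketched as a small filter. This is only an illustration: the class, field, and method names are hypothetical, and it assumes that unowned machines stay visible to every logged-in user so that they can still be claimed.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of owner-based filtering; not the actual monitoring-api code.
final class MachineVisibilitySketch {

    static final class Machine {
        final String identifier;
        final String ownerId; // null means the machine is unowned

        Machine(String identifier, String ownerId) {
            this.identifier = identifier;
            this.ownerId = ownerId;
        }
    }

    /** Returns the machines a user may see under the rules described above. */
    static List<Machine> visibleTo(List<Machine> all, String userId, boolean authEnabled) {
        if (!authEnabled) {
            return all; // auth disabled: everything stays visible
        }
        return all.stream()
                // Assumption: unowned machines remain visible so they can be claimed.
                .filter(m -> m.ownerId == null || m.ownerId.equals(userId))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Machine> machines = List.of(
                new Machine("lab-pc-01", "user-a"),
                new Machine("lab-pc-02", "user-b"),
                new Machine("lab-pc-03", null));
        for (Machine m : visibleTo(machines, "user-a", true)) {
            System.out.println(m.identifier);
        }
    }
}
```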
Decoupled services using Kafka to enable scalability and fault tolerance.
Prevents alert spam by ensuring only one ACTIVE alert exists per machine + alert type.
Alerts automatically transition: ACTIVE → RESOLVED
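
A minimal sketch of how deduplication and the ACTIVE → RESOLVED transition could fit together. It keeps state in memory for illustration, whereas alert-engine persists alerts in PostgreSQL. The class and method names, and the idea of passing the threshold per call, are assumptions rather than the project's actual API.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of the alert lifecycle; not the real alert-engine code.
final class AlertLifecycleSketch {

    enum Status { ACTIVE, RESOLVED }

    static final class Alert {
        final String machineId;
        final String alertType;
        final Instant createdAt;
        Instant resolvedAt;            // stays null while the alert is ACTIVE
        Status status = Status.ACTIVE;

        Alert(String machineId, String alertType, Instant createdAt) {
            this.machineId = machineId;
            this.alertType = alertType;
            this.createdAt = createdAt;
        }
    }

    // Holds at most one ACTIVE alert per machine + alert type.
    private final Map<String, Alert> activeByKey = new HashMap<>();

    /** Returns the alert that was just created or resolved, or empty if nothing changed. */
    Optional<Alert> onMetric(String machineId, String alertType,
                             double value, double threshold, Instant now) {
        String key = machineId + "|" + alertType;
        Alert active = activeByKey.get(key);

        if (value > threshold) {
            if (active != null) {
                return Optional.empty(); // duplicate suppressed: an ACTIVE alert already exists
            }
            Alert created = new Alert(machineId, alertType, now);
            activeByKey.put(key, created);
            return Optional.of(created);
        }

        if (active == null) {
            return Optional.empty();     // metric is normal and nothing is open
        }
        active.status = Status.RESOLVED; // metric normalized: close the open alert
        active.resolvedAt = now;
        activeByKey.remove(key);
        return Optional.of(active);
    }

    public static void main(String[] args) {
        AlertLifecycleSketch engine = new AlertLifecycleSketch();
        Instant now = Instant.now();
        System.out.println(engine.onMetric("lab-pc-01", "CPU", 95.0, 90.0, now).isPresent());
        System.out.println(engine.onMetric("lab-pc-01", "CPU", 97.0, 90.0, now).isPresent());
    }
}
```

Keying the map on machine plus alert type is what enforces the single-ACTIVE rule; in a database-backed version the same guarantee would typically come from a lookup or a uniqueness constraint on those columns.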
Each alert includes:
- createdAt
- resolvedAt
Supports CPU, Memory, and Disk thresholds.
Uses a rolling window with configurable minimum samples and z-score threshold to flag outlier telemetry values.
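
The rolling-window check can be sketched as follows. This is an assumption-laden illustration rather than the ai-engine-service implementation: the class name, the use of population standard deviation, and the choice to score a value before adding it to the window are all hypothetical. One instance would be kept per machine + metric type.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical rolling z-score detector; parameter names are illustrative.
final class RollingZScoreSketch {
    private final int windowSize;
    private final int minSamples;
    private final double zThreshold;
    private final Deque<Double> window = new ArrayDeque<>();

    RollingZScoreSketch(int windowSize, int minSamples, double zThreshold) {
        this.windowSize = windowSize;
        this.minSamples = minSamples;
        this.zThreshold = zThreshold;
    }

    /** Scores the value against the current baseline, then adds it to the window. */
    boolean isAnomaly(double value) {
        boolean anomaly = false;
        if (window.size() >= minSamples) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean))
                    .sum() / window.size();
            double stdDev = Math.sqrt(variance);
            // A perfectly flat baseline has no spread, so no z-score is defined; skip flagging.
            if (stdDev > 0.0) {
                anomaly = Math.abs(value - mean) / stdDev > zThreshold;
            }
        }
        window.addLast(value);
        if (window.size() > windowSize) {
            window.removeFirst(); // drop the oldest sample to keep the window bounded
        }
        return anomaly;
    }

    public static void main(String[] args) {
        RollingZScoreSketch detector = new RollingZScoreSketch(10, 5, 3.0);
        double[] cpu = {50, 52, 48, 51, 49, 50, 90};
        for (double sample : cpu) {
            System.out.println(sample + " anomaly=" + detector.isAnomaly(sample));
        }
    }
}
```

Requiring a minimum number of samples avoids flagging values while the baseline is still too small to be meaningful.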
- Docker Desktop
docker compose up --build -d

./scripts/seed-demo-telemetry.sh

- Profile guide: docs/ENVIRONMENT_PROFILES.md
- Local/demo Compose startup defaults to LABWATCH_SPRING_PROFILE=demo
- Persistent-schema services now use Flyway migrations with ddl-auto=validate
| Service | URL |
|---|---|
| monitoring-api | http://localhost:8089 |
| alert-engine | http://localhost:8088 |
| ai-engine-service | http://localhost:8090 |
POST /api/v1/agents/register
{
"machineIdentifier": "lab-pc-01",
"hostname": "lab-pc-01.local",
"osType": "Darwin",
"osVersion": "23.5.0",
"agentVersion": "1.0.0"
}

POST /api/v1/telemetry/snapshots
Include X-Agent-Token when agent auth is enabled.
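
As a rough sketch of calling these endpoints from a Java agent, the helper below builds the requests with the JDK's built-in HTTP client types (Java 11+). The base URL is the monitoring-api address from the service table and the registration body mirrors the example above. The class and method names are hypothetical, and the snapshot payload is left to the caller because its schema is not shown here. How an agent obtains its token is likewise not covered, so the token is treated as an opaque string.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical client-side helper; only builds requests, it does not send them.
final class AgentRequestSketch {
    static final String BASE_URL = "http://localhost:8089"; // monitoring-api

    /** Builds a JSON POST; the X-Agent-Token header is added only when a token is given. */
    static HttpRequest jsonPost(String path, String jsonBody, String agentToken) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + path))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody));
        if (agentToken != null && !agentToken.isBlank()) {
            builder.header("X-Agent-Token", agentToken);
        }
        return builder.build();
    }

    /** Registration request using the example payload from this README. */
    static HttpRequest registerRequest() {
        String body = "{"
                + "\"machineIdentifier\": \"lab-pc-01\","
                + "\"hostname\": \"lab-pc-01.local\","
                + "\"osType\": \"Darwin\","
                + "\"osVersion\": \"23.5.0\","
                + "\"agentVersion\": \"1.0.0\""
                + "}";
        return jsonPost("/api/v1/agents/register", body, null);
    }

    public static void main(String[] args) {
        HttpRequest request = registerRequest();
        System.out.println(request.method() + " " + request.uri());
    }
}
```

A request built this way can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` once the stack is running.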
GET /api/v1/machines
POST /api/v1/auth/register
{
"email": "user@example.com",
"password": "password123",
"displayName": "Derwin"
}

POST /api/v1/auth/login
POST /api/v1/machines/{machineIdentifier}/claim
Requires Authorization: Bearer <jwt>.
GET /api/alerts
- Machine sends event → monitoring-api
- Event stored + published to Kafka
- alert-engine consumes event
- ai-engine-service evaluates the same event stream for anomalies
- Alert created if threshold exceeded
- Anomaly published to anomaly-events when z-score exceeds the configured threshold
- Alert resolved when metric normalizes
- Java
- Spring Boot
- Spring Data JPA (Hibernate)
- PostgreSQL
- Apache Kafka
- Docker + Docker Compose
- Maven
- Alert severity levels (INFO / WARNING / CRITICAL)
- Multi-user account ownership for machines
- Observability (metrics + logging)
- Cloud deployment (AWS)
- Deployment and demo environment notes: docs/DEPLOYMENT_READINESS.md
- Backend stability runbook: docs/STABILITY_TESTING.md
- LABWATCH_AUTH_ENABLED=false
- AI_PROVIDER=mock
- Landing page still shows auth/product entry points
- Dashboard remains directly accessible for recruiter demos
- LABWATCH_AUTH_ENABLED=true
- JWT auth is active
- First registered user becomes ADMIN
- Later users default to OPERATOR
- LABWATCH_AUTH_ENABLED=false
- LABWATCH_AGENT_AUTH_ENABLED=false
- Dashboard works without login
- Existing unowned machines remain visible
- Set LABWATCH_AUTH_ENABLED=true
- Set JWT_SECRET to a non-default value
- Optionally set JWT_EXPIRATION_MINUTES
- Register/login in the dashboard, then claim unowned machines from the machine sidebar
- monitoring-api, alert-engine, and ai-engine-service now use Flyway migrations.
- Each service keeps a dedicated Flyway history table because the platform shares one PostgreSQL database.
- Existing databases can transition safely with baseline-on-migrate=true.
- Fresh environments are created from versioned SQL migration files instead of ddl-auto=update.
- Existing machine rows remain valid because ownership is nullable.
- Existing machines will show as unowned until a logged-in user claims them.
- Distributed system design
- Event-driven architecture with Kafka
- Microservice communication patterns
- Backend system scalability concepts
- Real-world alert lifecycle handling
- DevOps fundamentals with Docker
Derwin Bell