A distributed job scheduling system built in Go with Raft consensus, gRPC communication, and production-grade security.
- Distributed Consensus: Raft-based leader election with automatic failover
- Job Scheduling: Multiple policies (round-robin, least-loaded, capacity-aware)
- Security: mTLS, JWT authentication, RBAC authorization
- Observability: Prometheus metrics, OpenTelemetry tracing, structured audit logs
- Persistence: BoltDB storage with crash recovery
- Job Types: Image processing, web scraping, data analysis, custom types
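To make the scheduling policies concrete, here is a minimal, stdlib-only sketch of least-loaded selection with a capacity cutoff. The `Worker` type and field names are illustrative assumptions, not conductor's actual scheduler types:

```go
package main

import "fmt"

// Worker is an illustrative stand-in for a registered worker node.
type Worker struct {
	ID       string
	Capacity int // maximum concurrent jobs
	Running  int // jobs currently executing
}

// pickLeastLoaded returns the worker with the most free capacity,
// skipping workers that are already full (capacity-aware variant).
func pickLeastLoaded(workers []Worker) (Worker, bool) {
	best, found := Worker{}, false
	for _, w := range workers {
		free := w.Capacity - w.Running
		if free <= 0 {
			continue // capacity-aware: never over-commit a full worker
		}
		if !found || free > best.Capacity-best.Running {
			best, found = w, true
		}
	}
	return best, found
}

func main() {
	workers := []Worker{
		{ID: "worker-1", Capacity: 4, Running: 3},
		{ID: "worker-2", Capacity: 4, Running: 1},
		{ID: "worker-3", Capacity: 2, Running: 2}, // full, skipped
	}
	if w, ok := pickLeastLoaded(workers); ok {
		fmt.Println("selected:", w.ID) // selected: worker-2
	}
}
```

Round-robin differs only in cycling an index over the healthy workers instead of comparing free capacity.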
- Go 1.25+
- Make
- protoc (for proto generation)
```bash
make build   # Build all binaries
make proto   # Regenerate protobuf files
make help    # Show all available targets
```

```bash
# Terminal 1: Start master
make run-master

# Terminal 2: Start worker
make run-worker

# Terminal 3: Submit a job
./bin/client -addr localhost:9000 -type image_processing -payload '{"url": "example.com/image.jpg"}'
```

```bash
./scripts/start-cluster.sh
```

This script starts a single-node cluster with automatic worker registration. Useful for local development and testing.
```bash
./scripts/start-ha-cluster.sh
```

This script bootstraps a 3-node Raft cluster for high-availability testing. Nodes use the following ports:
- Master 1: gRPC 9000, HTTP 8080 (bootstrap/leader)
- Master 2: gRPC 9002, HTTP 8081
- Master 3: gRPC 9004, HTTP 8082
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Master 1   │◄───►│  Master 2   │◄───►│  Master 3   │
│  (Leader)   │      │ (Follower)  │      │ (Follower)  │
└──────┬──────┘      └─────────────┘      └─────────────┘
       │ Raft Consensus
       ▼
┌─────────────┐      ┌─────────────┐
│  Worker 1   │      │  Worker 2   │
└─────────────┘      └─────────────┘
```
| Component | Description |
|---|---|
| Master | Raft leader handles job submission, scheduling, and state replication |
| Worker | Executes jobs and reports results via heartbeat |
| Client | CLI for job submission and status queries |
| Type | Description |
|---|---|
| image_processing | Image manipulation tasks |
| web_scraping | Web page scraping |
| data_analysis | Data processing jobs |
| test_sleep | Job that sleeps (used for testing) |
Configuration uses profile-based YAML files + environment variable overrides:
| Profile | File | Use Case |
|---|---|---|
| development | config/dev.yaml | Local testing, relaxed security |
| staging | config/staging.yaml | Pre-production with TLS |
| production | config/prod.yaml | Full security, HA required |
```bash
# Core
CONDUCTOR_PROFILE=staging      # Profile: development, staging, production
NODE_ID=master-1               # Unique node identifier
GRPC_MASTER_PORT=9000          # gRPC listen port
HTTP_PORT=8080                 # HTTP API port

# Security
JWT_SECRET_KEY=<secret>        # JWT signing key (required in staging/prod)
SECURITY_TLS_ENABLED=true      # Enable mTLS
SECURITY_TLS_SKIP_VERIFY=true  # Skip TLS verification (dev only)

# Cluster
BOOTSTRAP=true                 # Bootstrap a new Raft cluster
JOIN_ADDR=localhost:9000       # Address of an existing cluster to join
```

| Endpoint | Description |
|---|---|
| GET /health | Health check (returns OK) |
| GET /ready | Readiness check |
| GET /cluster/status | Cluster info (leader, peers, state) |
| GET /failover/status | Failover detector status |
| Method | Description |
|---|---|
| SubmitJob | Submit a new job |
| GetJobStatus | Query job status by ID |
| ListJobs | List all jobs with filtering |
| CancelJob | Cancel a pending or running job |
| RegisterWorker | Register a new worker |
| Heartbeat | Worker heartbeat with stats |
```bash
curl http://localhost:9090/metrics | grep conductor
```

Key metrics:

- `conductor_jobs_total{type, status}` - Job counts
- `conductor_scheduling_latency_seconds` - Scheduling latency
- `conductor_raft_is_leader` - Leadership status
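The `grep conductor` filter above can also be done in-process. A sketch over the Prometheus text exposition format; the sample values are made up:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterConductor extracts conductor_* series from Prometheus
// text-format exposition output, like `grep conductor` on /metrics.
func filterConductor(exposition string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "conductor_") {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Illustrative scrape output; real values come from /metrics.
	sample := `go_goroutines 12
conductor_jobs_total{type="image_processing",status="completed"} 42
conductor_raft_is_leader 1`
	for _, l := range filterConductor(sample) {
		fmt.Println(l)
	}
}
```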
Trace IDs are logged for each RPC call. Configure OTLP exporter:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4318
OTEL_TRACE_ENABLED=true
```

Structured JSON logs for security events:

```json
{"logger":"audit","msg":"Authentication successful","user_id":"admin","method":"/proto.MasterService/SubmitJob"}
{"logger":"rbac-audit","msg":"Authorization successful","user_id":"admin","method":"/proto.MasterService/SubmitJob"}
```

| Script | Description |
|---|---|
| scripts/start-cluster.sh | Start dev cluster (1 master + 1 worker) |
| scripts/start-ha-cluster.sh | Start HA cluster (3 masters + 1 worker) |
```bash
make build       # Build all binaries
make test        # Run tests with coverage
make test-short  # Run tests (skip slow tests)
make check       # Run fmt + vet + staticcheck
make clean       # Clean build artifacts
make stop        # Stop all conductor processes
make ha-cluster  # Start HA cluster
```

```
conductor/
├── cmd/              # Entry points
│   ├── master/       # Master node
│   ├── worker/       # Worker node
│   └── client/       # CLI client
├── internal/         # Private packages
│   ├── config/       # Configuration
│   ├── consensus/    # Raft consensus
│   ├── scheduler/    # Job scheduling
│   ├── storage/      # Persistence (BoltDB)
│   ├── security/     # TLS, JWT, RBAC
│   └── metrics/      # Prometheus metrics
├── api/proto/        # gRPC definitions
├── config/           # YAML configs
└── scripts/          # Cluster scripts
```
MIT