Summary
Implement a Node Agent that runs on monitored hosts to collect metrics, execute health checks, and report status to the central config-server. This enables distributed monitoring without requiring direct network access to all targets.
Background
Currently, the monitoring system relies on Prometheus directly scraping targets. A Node Agent provides:
- Push-based metrics: Targets behind firewalls can push data out
- Local script execution: Run health checks locally on the target
- Reduced network complexity: Only agent→server communication needed
- Edge computing: Process/aggregate data before sending
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                     Central Infrastructure                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Config    │◄───│   Config-   │◄───│     PostgreSQL      │  │
│  │   Server    │    │  Server UI  │    │     (metadata)      │  │
│  └──────┬──────┘    └─────────────┘    └─────────────────────┘  │
│         │                                                       │
│         │ REST (gRPC extensible)                                │
└─────────┼───────────────────────────────────────────────────────┘
          │
          │ (Outbound from agents)
          │
┌─────────┼───────────────────────────────────────────────────────┐
│         ▼               Monitored Hosts                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │ Node Agent  │    │ Node Agent  │    │ Node Agent  │          │
│  │  (host-1)   │    │  (host-2)   │    │  (host-3)   │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│   Local checks       Local checks       Local checks            │
│   - Scripts          - Scripts          - Scripts               │
│   - Exporters        - Exporters        - Exporters             │
└─────────────────────────────────────────────────────────────────┘
Core Features
1. Registration & Discovery
- Agent registers with config-server on startup
- Agent-side UUID: Agent generates and persists its own ID at installation
- Receives target configuration (which checks to run)
- Sends periodic heartbeats to signal liveness to config-server
- Auto-reconnect on connection loss
2. Configuration Sync
- Pull assigned script policies from config-server
- Watch for configuration changes
- Hot-reload without restart
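One way to get hot-reload without a restart is to compare a digest of each fetched policy document with the active one and swap only on change. The sketch below assumes the policies arrive as an opaque byte slice; `Syncer` and `PolicySet` are hypothetical types introduced for illustration, and the fetch itself (polling or watching) is left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// PolicySet is a hypothetical container for the script policies assigned to this agent.
type PolicySet struct {
	Raw []byte // policy document as returned by the config-server
}

// Syncer holds the active policies and replaces them only when the content changes.
type Syncer struct {
	mu       sync.RWMutex
	current  PolicySet
	checksum string
}

// Apply stores the fetched policies if they differ from the active set.
// It returns true when a reload actually happened.
func (s *Syncer) Apply(fetched []byte) bool {
	sum := sha256.Sum256(fetched)
	digest := hex.EncodeToString(sum[:])

	s.mu.Lock()
	defer s.mu.Unlock()
	if digest == s.checksum {
		return false
	}
	// Copy the bytes so later changes to the caller's slice cannot affect the active set.
	s.current = PolicySet{Raw: append([]byte(nil), fetched...)}
	s.checksum = digest
	return true
}

// Current returns the active policy set for the check scheduler to read.
func (s *Syncer) Current() PolicySet {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.current
}

func main() {
	var s Syncer
	fmt.Println(s.Apply([]byte(`{"checks":["disk-space"]}`)))        // true: first load
	fmt.Println(s.Apply([]byte(`{"checks":["disk-space"]}`)))        // false: unchanged
	fmt.Println(s.Apply([]byte(`{"checks":["disk-space","load"]}`))) // true: changed
}
```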
3. Check Execution
- Execute assigned health check scripts
- Collect script output and exit codes
- Respect check intervals and timeouts
- Report results to config-server
4. Metrics Collection (Future)
- Scrape local exporters
- Aggregate and push to central location
- Remote write to Prometheus/VictoriaMetrics
- Note: Deferred to future phase
Technical Requirements
Communication Protocol
REST (Selected) - with abstraction layer for future gRPC support
POST /api/v1/bootstrap-tokens/register # Extend existing API
POST /api/v1/agents/{id}/heartbeat
POST /api/v1/agents/{id}/check-results
GET /api/v1/checks/target/{id} # Use existing API
Requests go through an API handler abstraction so that gRPC can be added later as a second protocol.
Agent ID & Auto-matching Strategy
The agent-side UUID approach is adopted.
A UUID is generated when the agent is installed and stored in /etc/aami/agent-id. The agent submits this ID to the server during registration.
Installation Script
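#!/bin/bash
# install-agent.sh
# Generate the agent ID once at install time; later runs keep the existing ID.
set -euo pipefail

AGENT_ID_FILE="/etc/aami/agent-id"
# -s also regenerates the ID if an earlier run left an empty file behind
if [ ! -s "$AGENT_ID_FILE" ]; then
    mkdir -p "$(dirname "$AGENT_ID_FILE")"
    uuidgen > "$AGENT_ID_FILE"
fi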
Cloud-init (AWS/GCP/Azure)
#cloud-config
# agent_uuid is injected from IaC (see the Terraform example)
write_files:
  - path: /etc/aami/agent-id
    permissions: '0644'
    content: |
      ${agent_uuid}
Terraform Example
resource "random_uuid" "agent_id" {}

resource "aws_instance" "monitored" {
  # ami, instance_type, and other required arguments omitted for brevity
  user_data = templatefile("cloud-init.yaml", {
    agent_uuid = random_uuid.agent_id.result
  })
}

# Pre-register in config-server (optional)
resource "aami_target" "this" {
  id       = random_uuid.agent_id.result
  # count.index is only valid on resources that set count, so a fixed name is used here
  hostname = "web-server"
  group_id = aami_group.web.id
}
Matching Behavior
| Scenario | AgentID | Behavior |
|---|---|---|
| New registration (ID provided) | Provided | Create Target with AgentID |
| New registration (no ID) | Empty | Server generates UUID |
| Re-registration/restart | Provided | Connect to existing Target |
| Pre-registered | Provided | Connect Agent to existing Target |
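The four rows reduce to two decisions on the server: generate an ID when none is submitted, then either connect to an existing Target or create one. The sketch below uses a hypothetical in-memory `Registry`; the real config-server stores targets in PostgreSQL, and the type and field names here are illustrative only.

```go
package main

import "fmt"

// Target is a simplified stand-in for the config-server's target record.
type Target struct {
	ID        string
	Hostname  string
	Connected bool
}

// Registry is a hypothetical in-memory store keyed by target ID.
type Registry struct {
	targets map[string]*Target
	newID   func() string // server-side ID generator, injected to keep the example deterministic
}

// Register applies the matching rules and reports whether a new target was created.
func (r *Registry) Register(agentID, hostname string) (target *Target, created bool) {
	if agentID == "" {
		agentID = r.newID() // no ID submitted: the server generates one
	}
	if existing, ok := r.targets[agentID]; ok {
		existing.Connected = true // re-registration or pre-registered target
		return existing, false
	}
	target = &Target{ID: agentID, Hostname: hostname, Connected: true}
	r.targets[agentID] = target
	return target, true
}

func main() {
	reg := &Registry{
		targets: map[string]*Target{
			"pre-registered-id": {ID: "pre-registered-id", Hostname: "web-server"},
		},
		newID: func() string { return "server-generated-id" },
	}
	// One call per table row, in order.
	for _, id := range []string{"agent-id-1", "", "agent-id-1", "pre-registered-id"} {
		target, created := reg.Register(id, "host-1")
		fmt.Println(target.ID, created)
	}
}
```

Re-registration and pre-registration take the same code path, which matches the table: in both cases the agent is connected to a Target that already exists.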
Agent Configuration
# /etc/aami/agent.yaml
server:
  address: "config-server.example.com:8443"
  tls:
    enabled: true
    ca_cert: "/etc/aami/ca.crt"
    client_cert: "/etc/aami/agent.crt"
    client_key: "/etc/aami/agent.key"
agent:
  id_file: "/etc/aami/agent-id" # Agent-side UUID file path
  hostname: "" # Auto-detected if empty
  labels:
    environment: "production"
    datacenter: "dc1"
heartbeat:
  interval: 30s
  timeout: 10s
checks:
  default_timeout: 30s
  max_concurrent: 10
  result_buffer_size: 1000
logging:
  level: "info"
  format: "json"
Security
- mTLS for agent-server communication
- Agent authentication via client certificates or bootstrap tokens
- Script execution sandboxing (resource limits, allowed paths)
- Signed script policies to prevent tampering
API Changes Required
Bootstrap Register Extension
Current:
type BootstrapRegister struct {
	Token     string
	Hostname  string
	IPAddress string
	GroupID   string
	Labels    map[string]string
	Metadata  map[string]string
}
Updated:
type BootstrapRegister struct {
	Token     string
	AgentID   string // New: Agent-submitted ID (optional)
	Hostname  string
	IPAddress string
	GroupID   string
	Labels    map[string]string
	Metadata  map[string]string
}
Check Results Reporting (New)
POST /api/v1/agents/{id}/check-results
{
  "results": [
    {
      "check_id": "disk-space",
      "status": "critical",
      "exit_code": 2,
      "output": "Disk usage 95%",
      "duration_ms": 150,
      "executed_at": "2024-01-15T10:30:00Z"
    }
  ]
}
Tasks
Phase 1: Core Agent
- services/node-agent directory structure
Phase 2: Check Execution
Phase 3: Config-Server Integration
Phase 4: Security & Production
Dependencies
Decisions Made
| Item | Decision | Notes |
|---|---|---|
| Protocol | REST | Design with API handler abstraction for gRPC extensibility |
| Metrics push | Deferred | Excluded from Phase 1, to be separated into future issue |
| Auto-matching | Agent-side UUID | Generate UUID at installation, Cloud environment compatible |
Related Work