feat: add node conditions and agent drain/cordon#225

Merged
retr0h merged 14 commits into main from feat/node-conditions-drain
Mar 6, 2026

Conversation

@retr0h
Collaborator

@retr0h retr0h commented Mar 6, 2026

Summary

  • Add threshold-based node conditions (MemoryPressure, HighLoad, DiskPressure) evaluated agent-side on each heartbeat tick, with configurable thresholds and CLI display in agent list (CONDITIONS column) and agent get (full details)
  • Add agent drain/cordon lifecycle (Ready → Draining → Cordoned → Ready) with drain/undrain CLI commands, REST API endpoints, and NATS consumer subscribe/unsubscribe for graceful maintenance
  • Add dedicated agent-state KV bucket (no TTL) for persistent drain flags and append-only timeline events, separating operator state from ephemeral heartbeat data
  • Add agent:write permission for drain/undrain operations, included in the admin role by default
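
The drain/cordon lifecycle above can be read as a small state machine. The following is a minimal sketch of that reading, not the PR's code: the `AgentState` type, the `allowed` map, and the `Transition` function are hypothetical names, and it only permits the cycle stated in the summary (Ready → Draining → Cordoned → Ready). Whether an undrain may interrupt a drain that is still in progress is not stated here, so this sketch rejects it.

```go
package main

import "fmt"

// AgentState is a hypothetical type for the lifecycle states named in the summary.
type AgentState string

const (
	StateReady    AgentState = "Ready"
	StateDraining AgentState = "Draining"
	StateCordoned AgentState = "Cordoned"
)

// allowed maps each state to the single state it may move to next,
// following the cycle Ready -> Draining -> Cordoned -> Ready.
var allowed = map[AgentState]AgentState{
	StateReady:    StateDraining,
	StateDraining: StateCordoned,
	StateCordoned: StateReady,
}

// Transition returns the new state, or the unchanged state and an error
// when the requested move is not part of the cycle.
func Transition(from, to AgentState) (AgentState, error) {
	if next, ok := allowed[from]; ok && next == to {
		return to, nil
	}
	return from, fmt.Errorf("invalid transition %s -> %s", from, to)
}
```

Calling `Transition(StateReady, StateDraining)` succeeds, while `Transition(StateReady, StateCordoned)` returns an error and leaves the state at Ready.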

Test plan

  • Unit tests for condition evaluation (threshold math, transition tracking)
  • Unit tests for drain flag operations (CheckDrainFlag, SetDrainFlag, DeleteDrainFlag) — 100% coverage
  • Unit tests for state machine transitions and consumer stop/start
  • Unit tests for timeline event storage and retrieval
  • Unit tests for CountExpectedAgents state filtering (cordoned/draining excluded) — 100% coverage
  • HTTP wiring tests for drain/undrain endpoints with RBAC (401, 403, 200, 404, 409)
  • All packages pass: go test ./...
  • Lint clean: just go::vet — 0 issues

🤖 Generated with Claude Code

retr0h and others added 13 commits March 5, 2026 13:05
Design doc covering Kubernetes-inspired node conditions
(MemoryPressure, HighLoad, DiskPressure) and agent drain/cordon
lifecycle with append-only timeline events.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add Condition type with MemoryPressure, HighLoad, DiskPressure
- Add agent state constants (Ready, Draining, Cordoned)
- Add condition evaluation functions with 100% test coverage
- Add conditions config (thresholds) to AgentConfig
- Extend heartbeat to collect disk stats and evaluate conditions
- Add WriteAgentTimelineEvent/GetAgentTimeline/ComputeAgentState
- Add agent:write permission to admin role

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
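
To illustrate the threshold-based evaluation this commit describes, here is a minimal sketch. All type and field names (`Condition`, `Thresholds`, `Sample`, `Evaluate`) are assumptions made for the example, as is the choice to compare the 1-minute load per CPU for HighLoad; the actual definitions in the PR may differ.

```go
package main

// Condition is a hypothetical shape for a node condition.
type Condition struct {
	Type   string
	Status bool
}

// Thresholds holds hypothetical configurable limits.
type Thresholds struct {
	MemoryPercent float64 // used-memory percent above which MemoryPressure is true
	LoadPerCPU    float64 // 1-minute load per CPU above which HighLoad is true
	DiskPercent   float64 // used-disk percent above which DiskPressure is true
}

// Sample is what a heartbeat tick might collect (assumed fields).
type Sample struct {
	MemUsedPercent  float64
	Load1           float64
	CPUs            int
	DiskUsedPercent float64
}

// Evaluate compares one sample against the thresholds and returns
// the three conditions in a fixed order.
func Evaluate(s Sample, t Thresholds) []Condition {
	cpus := s.CPUs
	if cpus < 1 {
		cpus = 1 // avoid dividing by zero on a malformed sample
	}
	return []Condition{
		{Type: "MemoryPressure", Status: s.MemUsedPercent > t.MemoryPercent},
		{Type: "HighLoad", Status: s.Load1/float64(cpus) > t.LoadPerCPU},
		{Type: "DiskPressure", Status: s.DiskUsedPercent > t.DiskPercent},
	}
}
```

Because evaluation is a pure function of the sample and the thresholds, it can run on every heartbeat tick and is straightforward to cover with table-driven tests.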
Replaced by docs/plans/2026-03-05-node-conditions-drain-design.md

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- POST /agent/{hostname}/drain with agent:write permission
- POST /agent/{hostname}/undrain with agent:write permission
- Add NodeCondition and TimelineEvent schemas to OpenAPI
- Add state, conditions, timeline fields to AgentInfo
- Full RBAC test coverage (401, 403, 200)
- Unit tests for drain (4 cases) and undrain (5 cases)

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
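
The status codes exercised by the RBAC tests can be pictured as an ordered series of checks. The sketch below is an assumption about that ordering, not the handler from this PR: `DrainRequest`, `DrainStatus`, and `hasPermission` are hypothetical, and treating 409 as "agent is already draining or cordoned" is a guess based on the status codes listed in the test plan.

```go
package main

import "net/http"

// DrainRequest captures the facts a drain handler would need (hypothetical).
type DrainRequest struct {
	Authenticated  bool
	Permissions    []string
	AgentFound     bool
	AlreadyDrained bool // assumed meaning of the 409 case
}

// hasPermission reports whether want is in the caller's permission list.
func hasPermission(perms []string, want string) bool {
	for _, p := range perms {
		if p == want {
			return true
		}
	}
	return false
}

// DrainStatus returns the HTTP status a drain request would receive,
// checking authentication first, then authorization, then agent state.
func DrainStatus(r DrainRequest) int {
	switch {
	case !r.Authenticated:
		return http.StatusUnauthorized
	case !hasPermission(r.Permissions, "agent:write"):
		return http.StatusForbidden
	case !r.AgentFound:
		return http.StatusNotFound
	case r.AlreadyDrained:
		return http.StatusConflict
	default:
		return http.StatusOK
	}
}
```

Checking authentication and permission before looking up the agent means an unauthorized caller cannot learn which hostnames exist.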
- Agent detects drain flag on heartbeat, transitions state
- writeRegistration includes State field
- GetAgent returns timeline events and conditions/state
- agentInfoFromRegistration maps conditions and state
- Full test coverage for drain detection and timeline

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add agent drain/undrain subcommands with --hostname flag. Extend agent
list with CONDITIONS column showing active conditions. Extend agent get
with conditions table, state display, and timeline table.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add agent-lifecycle.md feature page covering node conditions and
drain/cordon. Add drain/undrain CLI docs. Update agent list/get docs
with conditions and timeline. Add agent:write to permissions tables.
Add conditions config to configuration reference. Update README,
navbar dropdown, and system architecture.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
On macOS and Linux, "free" memory is tiny because the OS uses RAM for
file cache. Using Total - Free overstates pressure. Available memory
(Free + Cached) is reclaimable and better reflects actual pressure.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
gopsutil does not populate Cached on macOS (it's always 0), so
Free + Cached was identical to Free — causing false 96% pressure
on machines with plenty of reclaimable memory. Use the Available
field instead, which gopsutil computes correctly per-platform.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
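
The difference between the two calculations in these memory commits is easy to see with numbers. This sketch redeclares a small `MemStats` struct with fields named after those on gopsutil's `VirtualMemoryStat` (`Total`, `Available`, `Free`) so it stays dependency-free; the function names and the sample values are made up for illustration.

```go
package main

// MemStats is a dependency-free stand-in for the gopsutil fields discussed above.
type MemStats struct {
	Total     uint64
	Available uint64
	Free      uint64
}

// usedPercent returns the share of total that is not in the unused pool.
func usedPercent(total, unused uint64) float64 {
	if total == 0 || unused > total {
		return 0 // guard against division by zero and unsigned underflow
	}
	return float64(total-unused) * 100 / float64(total)
}

// PressureFromFree uses Total - Free, which counts file cache as used.
func PressureFromFree(m MemStats) float64 {
	return usedPercent(m.Total, m.Free)
}

// PressureFromAvailable uses Total - Available, which treats
// reclaimable memory as unused.
func PressureFromAvailable(m MemStats) float64 {
	return usedPercent(m.Total, m.Available)
}
```

With `MemStats{Total: 1000, Available: 500, Free: 40}`, `PressureFromFree` reports 96 while `PressureFromAvailable` reports 50, which mirrors the false 96% reading described in the commit message.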
On drain, cancel the consumer context and wait for all consumer
goroutines to exit — the agent stops receiving new jobs from NATS.
On undrain, create a new consumer context and restart all consumers.
Consumers use a separate WaitGroup from heartbeat/facts so drain
only affects job delivery.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
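
The stop/start behaviour described in this commit follows a common Go pattern: a cancellable context shared by the consumer goroutines, plus a WaitGroup dedicated to them. The sketch below shows that pattern under stated assumptions; `Consumers`, `Start`, `Stop`, `Running`, and `BlockUntilCancelled` are hypothetical names, and real consumers would pull messages from NATS rather than simply waiting on the context.

```go
package main

import (
	"context"
	"sync"
)

// Consumers owns the goroutines that receive jobs. Heartbeat and facts
// loops would live elsewhere, with their own WaitGroup.
type Consumers struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	wg     sync.WaitGroup
	run    []func(ctx context.Context)
}

// NewConsumers registers the functions to run as consumers.
func NewConsumers(run ...func(ctx context.Context)) *Consumers {
	return &Consumers{run: run}
}

// Start creates a fresh context and launches every consumer (undrain).
func (c *Consumers) Start() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cancel != nil {
		return // already running
	}
	ctx, cancel := context.WithCancel(context.Background())
	c.cancel = cancel
	for _, fn := range c.run {
		c.wg.Add(1)
		go func(fn func(context.Context)) {
			defer c.wg.Done()
			fn(ctx)
		}(fn)
	}
}

// Stop cancels the consumer context and waits for all consumers to exit (drain).
func (c *Consumers) Stop() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cancel == nil {
		return // already stopped
	}
	c.cancel()
	c.wg.Wait()
	c.cancel = nil
}

// Running reports whether consumers are currently started.
func (c *Consumers) Running() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cancel != nil
}

// BlockUntilCancelled is a stand-in consumer that exits when its context ends.
func BlockUntilCancelled(ctx context.Context) {
	<-ctx.Done()
}
```

A new context is created on every `Start` because a cancelled context cannot be reused, which is why undrain has to rebuild it rather than resume the old one.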
Move drain flags and timeline events from ephemeral buckets
(job-queue 1h TTL, agent-registry 30s TTL) to a dedicated
agent-state KV bucket with no TTL. This ensures operator
drain/undrain actions and state transition history persist
indefinitely. Add tests for drain flag operations and state
filtering in CountExpectedAgents.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
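
As a rough model of what the dedicated bucket stores, the sketch below keeps drain flags and an append-only timeline in memory. It is only an illustration of the data shape: the real implementation uses a NATS KV bucket, and the `StateStore` type, its method signatures, and the `TimelineEvent` fields are assumptions (the method names echo the functions listed in the test plan, but their actual signatures are not shown in this PR description).

```go
package main

import (
	"sync"
	"time"
)

// TimelineEvent is a hypothetical record of one state transition.
type TimelineEvent struct {
	Timestamp time.Time
	Event     string
}

// StateStore is an in-memory stand-in for a KV bucket with no TTL:
// nothing written here expires on its own.
type StateStore struct {
	mu       sync.Mutex
	drain    map[string]bool
	timeline map[string][]TimelineEvent
}

// NewStateStore returns an empty store.
func NewStateStore() *StateStore {
	return &StateStore{
		drain:    make(map[string]bool),
		timeline: make(map[string][]TimelineEvent),
	}
}

// SetDrainFlag marks the host as requested to drain.
func (s *StateStore) SetDrainFlag(hostname string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.drain[hostname] = true
}

// CheckDrainFlag reports whether a drain has been requested for the host.
func (s *StateStore) CheckDrainFlag(hostname string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.drain[hostname]
}

// DeleteDrainFlag clears the drain request (undrain).
func (s *StateStore) DeleteDrainFlag(hostname string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.drain, hostname)
}

// WriteTimelineEvent appends an event; existing entries are never modified.
func (s *StateStore) WriteTimelineEvent(hostname, event string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.timeline[hostname] = append(s.timeline[hostname], TimelineEvent{
		Timestamp: time.Now(),
		Event:     event,
	})
}

// Timeline returns a copy of the host's events in insertion order.
func (s *StateStore) Timeline(hostname string) []TimelineEvent {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]TimelineEvent(nil), s.timeline[hostname]...)
}
```

Keeping the flag and the timeline separate from heartbeat data means an agent that stops heartbeating still has its drain request and history intact when it comes back.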
Clarify that all scenarios for a function (success, errors, edge cases)
belong as rows in a single table-driven test — never split into separate
TestFoo, TestFooError, TestFooNilResponse methods.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
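
A short sketch of the convention this commit describes, using the standard `testing` package: every scenario for one function is a row in one table, and each row runs as a subtest. The `Agent` type, `countExpectedAgents`, and the case table are hypothetical stand-ins modelled on the CountExpectedAgents filtering mentioned in the test plan. The project may use a different test framework, and a table is often declared inside the test function rather than at package level as it is here.

```go
package main

import "testing"

// Agent is a hypothetical minimal agent record.
type Agent struct {
	Hostname string
	State    string
}

// countExpectedAgents counts agents that should receive work,
// skipping those that are draining or cordoned.
func countExpectedAgents(agents []Agent) int {
	n := 0
	for _, a := range agents {
		if a.State == "Draining" || a.State == "Cordoned" {
			continue
		}
		n++
	}
	return n
}

// countCases holds every scenario for countExpectedAgents in one table.
var countCases = []struct {
	name   string
	agents []Agent
	want   int
}{
	{name: "no agents", agents: nil, want: 0},
	{name: "all ready", agents: []Agent{{"a", "Ready"}, {"b", "Ready"}}, want: 2},
	{name: "draining excluded", agents: []Agent{{"a", "Ready"}, {"b", "Draining"}}, want: 1},
	{name: "cordoned excluded", agents: []Agent{{"a", "Cordoned"}}, want: 0},
}

// TestCountExpectedAgents runs each table row as a named subtest.
func TestCountExpectedAgents(t *testing.T) {
	for _, tc := range countCases {
		t.Run(tc.name, func(t *testing.T) {
			if got := countExpectedAgents(tc.agents); got != tc.want {
				t.Errorf("countExpectedAgents() = %d, want %d", got, tc.want)
			}
		})
	}
}
```

Adding a new edge case then means adding one row, not a new `TestCountExpectedAgentsSomething` function.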
@github-actions
Contributor

github-actions bot commented Mar 6, 2026

Thank you for contributing to this project! 😊🕹️

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@retr0h retr0h force-pushed the feat/node-conditions-drain branch from ed79a86 to 2c83e9f March 6, 2026 00:26
@retr0h retr0h merged commit a76f54e into main Mar 6, 2026
7 checks passed
@retr0h retr0h deleted the feat/node-conditions-drain branch March 6, 2026 00:31
@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 82.73810% with 58 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/api/agent/agent_list.go | 3.22% | 27 Missing and 3 partials ⚠️ |
| internal/api/agent/agent_drain.go | 55.00% | 7 Missing and 2 partials ⚠️ |
| internal/api/agent/agent_undrain.go | 62.50% | 7 Missing and 2 partials ⚠️ |
| internal/cli/nats.go | 0.00% | 6 Missing ⚠️ |
| internal/job/client/agent.go | 97.97% | 1 Missing and 1 partial ⚠️ |
| internal/job/client/query.go | 77.77% | 1 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (82.73%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
- Coverage   99.98%   98.94%   -1.04%     
==========================================
  Files         157      161       +4     
  Lines        5293     5611     +318     
==========================================
+ Hits         5292     5552     +260     
- Misses          1       50      +49     
- Partials        0        9       +9     
| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/agent/condition.go | 100.00% <100.00%> (ø) |
| internal/agent/consumer.go | 100.00% <100.00%> (ø) |
| internal/agent/drain.go | 100.00% <100.00%> (ø) |
| internal/agent/facts.go | 100.00% <100.00%> (ø) |
| internal/agent/heartbeat.go | 100.00% <100.00%> (ø) |
| internal/agent/server.go | 100.00% <100.00%> (ø) |
| internal/authtoken/permissions.go | 100.00% <ø> (ø) |
| internal/job/client/client.go | 100.00% <100.00%> (ø) |
| internal/job/subjects.go | 100.00% <100.00%> (ø) |
| internal/provider/node/mem/darwin_get_vm.go | 100.00% <100.00%> (ø) |
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 04438c7...2c83e9f.
