feat: add node conditions and agent drain/cordon by retr0h · Pull Request #225 · osapi-io/osapi

retr0h · 2026-03-06T00:24:09Z

Summary

- Add threshold-based node conditions (MemoryPressure, HighLoad, DiskPressure) evaluated agent-side on each heartbeat tick, with configurable thresholds and CLI display in agent list (CONDITIONS column) and agent get (full details)
- Add agent drain/cordon lifecycle (Ready → Draining → Cordoned → Ready) with drain/undrain CLI commands, REST API endpoints, and NATS consumer subscribe/unsubscribe for graceful maintenance
- Add dedicated agent-state KV bucket (no TTL) for persistent drain flags and append-only timeline events, separating operator state from ephemeral heartbeat data
- Add agent:write permission for drain/undrain operations, included in the admin role by default
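
The lifecycle in the summary can be read as a small state machine. The sketch below is only an illustration: the type and function names (`AgentState`, `Transition`) are assumptions, not the PR's actual identifiers, and it permits only the edges named in the summary.

```go
package main

import "fmt"

// AgentState mirrors the three lifecycle states named in the summary.
// The type name itself is hypothetical.
type AgentState string

const (
	StateReady    AgentState = "Ready"
	StateDraining AgentState = "Draining"
	StateCordoned AgentState = "Cordoned"
)

// allowed holds the single outgoing edge for each state in the
// Ready -> Draining -> Cordoned -> Ready cycle. Whether the real agent
// permits other edges (for example an early undrain from Draining) is not
// stated in the PR, so this sketch rejects them.
var allowed = map[AgentState]AgentState{
	StateReady:    StateDraining,
	StateDraining: StateCordoned,
	StateCordoned: StateReady,
}

// Transition returns the new state, or the unchanged state plus an error
// when the requested edge is not part of the cycle.
func Transition(from, to AgentState) (AgentState, error) {
	if allowed[from] != to {
		return from, fmt.Errorf("invalid transition %s -> %s", from, to)
	}
	return to, nil
}
```

Keeping the edges in a map makes the permitted cycle easy to audit and to extend if more transitions turn out to be valid.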

Test plan

- Unit tests for condition evaluation (threshold math, transition tracking)
- Unit tests for drain flag operations (CheckDrainFlag, SetDrainFlag, DeleteDrainFlag), 100% coverage
- Unit tests for state machine transitions and consumer stop/start
- Unit tests for timeline event storage and retrieval
- Unit tests for CountExpectedAgents state filtering (cordoned/draining excluded), 100% coverage
- HTTP wiring tests for drain/undrain endpoints with RBAC (401, 403, 200, 404, 409)
- All packages pass: go test ./...
- Lint clean: just go::vet, 0 issues

🤖 Generated with Claude Code

Design doc covering Kubernetes-inspired node conditions (MemoryPressure, HighLoad, DiskPressure) and agent drain/cordon lifecycle with append-only timeline events. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Add Condition type with MemoryPressure, HighLoad, DiskPressure
- Add agent state constants (Ready, Draining, Cordoned)
- Add condition evaluation functions with 100% test coverage
- Add conditions config (thresholds) to AgentConfig
- Extend heartbeat to collect disk stats and evaluate conditions
- Add WriteAgentTimelineEvent/GetAgentTimeline/ComputeAgentState
- Add agent:write permission to admin role
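
As a rough illustration of threshold-based evaluation, the following sketch compares one heartbeat sample against configured limits. The struct shapes, field names, and the choice of load-per-CPU for HighLoad are assumptions made for the example; the PR does not show the real `Condition` type or the config keys.

```go
package main

// Thresholds is a hypothetical stand-in for the conditions config.
type Thresholds struct {
	MemoryUsedPercent float64
	DiskUsedPercent   float64
	LoadPerCPU        float64 // assumed normalisation: 1-minute load / CPU count
}

// Sample is a hypothetical snapshot gathered on one heartbeat tick.
type Sample struct {
	MemTotal, MemAvailable uint64
	DiskTotal, DiskUsed    uint64
	Load1                  float64
	CPUs                   int
}

// Condition reports whether a named condition is currently active.
type Condition struct {
	Type   string
	Active bool
}

// percent returns part/total as a percentage, or 0 when total is 0.
func percent(part, total uint64) float64 {
	if total == 0 {
		return 0
	}
	return float64(part) / float64(total) * 100
}

// Evaluate returns the three conditions in a fixed order:
// MemoryPressure, HighLoad, DiskPressure.
func Evaluate(s Sample, t Thresholds) []Condition {
	var memUsed uint64
	if s.MemTotal > s.MemAvailable { // guard against unsigned underflow
		memUsed = s.MemTotal - s.MemAvailable
	}
	loadPerCPU := 0.0
	if s.CPUs > 0 {
		loadPerCPU = s.Load1 / float64(s.CPUs)
	}
	return []Condition{
		{Type: "MemoryPressure", Active: percent(memUsed, s.MemTotal) > t.MemoryUsedPercent},
		{Type: "HighLoad", Active: loadPerCPU > t.LoadPerCPU},
		{Type: "DiskPressure", Active: percent(s.DiskUsed, s.DiskTotal) > t.DiskUsedPercent},
	}
}
```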

Replaced by docs/plans/2026-03-05-node-conditions-drain-design.md

- POST /agent/{hostname}/drain with agent:write permission
- POST /agent/{hostname}/undrain with agent:write permission
- Add NodeCondition and TimelineEvent schemas to OpenAPI
- Add state, conditions, timeline fields to AgentInfo
- Full RBAC test coverage (401, 403, 200)
- Unit tests for drain (4 cases) and undrain (5 cases)
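
The status codes exercised by the HTTP tests (401, 403, 404, 409, 200) suggest an ordering of checks for the drain endpoint. The sketch below is one plausible ordering, not the PR's handler: `DrainRequest` and `DrainStatus` are hypothetical names, and mapping 409 to "agent is not currently Ready" is an assumption.

```go
package main

import "net/http"

// DrainRequest collects the facts a drain handler would need.
// All field names are hypothetical.
type DrainRequest struct {
	Authenticated bool
	Permissions   []string
	AgentFound    bool
	AgentState    string
}

// hasPermission reports whether want appears in perms.
func hasPermission(perms []string, want string) bool {
	for _, p := range perms {
		if p == want {
			return true
		}
	}
	return false
}

// DrainStatus returns the HTTP status a drain request would receive,
// checking authentication, then authorization, then resource state.
func DrainStatus(r DrainRequest) int {
	switch {
	case !r.Authenticated:
		return http.StatusUnauthorized // 401
	case !hasPermission(r.Permissions, "agent:write"):
		return http.StatusForbidden // 403
	case !r.AgentFound:
		return http.StatusNotFound // 404
	case r.AgentState != "Ready":
		return http.StatusConflict // 409 (assumed: already draining or cordoned)
	default:
		return http.StatusOK // 200
	}
}
```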

- Agent detects drain flag on heartbeat, transitions state
- writeRegistration includes State field
- GetAgent returns timeline events and conditions/state
- agentInfoFromRegistration maps conditions and state
- Full test coverage for drain detection and timeline

Add agent drain/undrain subcommands with --hostname flag. Extend agent list with CONDITIONS column showing active conditions. Extend agent get with conditions table, state display, and timeline table.

Add agent-lifecycle.md feature page covering node conditions and drain/cordon. Add drain/undrain CLI docs. Update agent list/get docs with conditions and timeline. Add agent:write to permissions tables. Add conditions config to configuration reference. Update README, navbar dropdown, and system architecture.

On macOS and Linux, "free" memory is tiny because the OS uses RAM for file cache. Using Total - Free overstates pressure. Available memory (Free + Cached) is reclaimable and better reflects actual pressure.

gopsutil does not populate Cached on macOS (it's always 0), so Free + Cached was identical to Free, causing a false 96% pressure reading on machines with plenty of reclaimable memory. Use the Available field instead, which gopsutil computes correctly per-platform.
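
To make the arithmetic concrete, the sketch below compares the two formulas on a made-up 16 GiB machine. `MemStats` is a local stand-in for the relevant fields of gopsutil's `mem.VirtualMemoryStat` (which does expose `Total`, `Free`, and `Available`); the function names and the sample numbers are illustrative assumptions.

```go
package main

// MemStats holds the three fields relevant to the fix, in MiB here.
type MemStats struct {
	Total, Free, Available uint64
}

// usedPercent returns (total - unused) / total as a percentage,
// guarding against a zero total and unsigned underflow.
func usedPercent(total, unused uint64) float64 {
	if total == 0 || unused >= total {
		return 0
	}
	return float64(total-unused) / float64(total) * 100
}

// naiveUsedPercent treats only Free memory as unused, which counts
// reclaimable file cache as pressure.
func naiveUsedPercent(m MemStats) float64 {
	return usedPercent(m.Total, m.Free)
}

// availableUsedPercent treats Available memory as unused, which is
// the approach the commit message describes.
func availableUsedPercent(m MemStats) float64 {
	return usedPercent(m.Total, m.Available)
}
```

With `MemStats{Total: 16384, Free: 655, Available: 8192}`, the naive formula reports roughly 96% used while the Available-based one reports 50%.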

On drain, cancel the consumer context and wait for all consumer goroutines to exit, so the agent stops receiving new jobs from NATS. On undrain, create a new consumer context and restart all consumers. Consumers use a separate WaitGroup from heartbeat/facts so drain only affects job delivery.

Move drain flags and timeline events from ephemeral buckets (job-queue 1h TTL, agent-registry 30s TTL) to a dedicated agent-state KV bucket with no TTL. This ensures operator drain/undrain actions and state transition history persist indefinitely. Add tests for drain flag operations and state filtering in CountExpectedAgents.

Clarify that all scenarios for a function (success, errors, edge cases) belong as rows in a single table-driven test; never split them into separate TestFoo, TestFooError, TestFooNilResponse methods.
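
A minimal sketch of that convention, using only the standard `testing` package: the function under test (`usedPercent`) and its cases are invented for illustration, and the repository's own tests may use a different harness.

```go
package main

import "testing"

// usedPercent is a small example function to test.
func usedPercent(used, total uint64) float64 {
	if total == 0 {
		return 0
	}
	return float64(used) / float64(total) * 100
}

// TestUsedPercent keeps success and edge cases as rows of one table
// instead of separate TestUsedPercentZero, TestUsedPercentFull functions.
func TestUsedPercent(t *testing.T) {
	tests := []struct {
		name  string
		used  uint64
		total uint64
		want  float64
	}{
		{name: "half used", used: 50, total: 100, want: 50},
		{name: "fully used", used: 100, total: 100, want: 100},
		{name: "zero total is treated as no usage", used: 10, total: 0, want: 0},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := usedPercent(tc.used, tc.total); got != tc.want {
				t.Errorf("usedPercent(%d, %d) = %v, want %v", tc.used, tc.total, got, tc.want)
			}
		})
	}
}
```

Adding a new scenario then means adding one row, and each row still appears as its own named subtest in the output.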

github-actions · 2026-03-06T00:24:20Z

Thank you for contributing to this project! 😊🕹️

codecov · 2026-03-06T00:57:26Z

Codecov Report

❌ Patch coverage is 82.73810% with 58 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/api/agent/agent_list.go | 3.22% | 27 Missing and 3 partials ⚠️ |
| internal/api/agent/agent_drain.go | 55.00% | 7 Missing and 2 partials ⚠️ |
| internal/api/agent/agent_undrain.go | 62.50% | 7 Missing and 2 partials ⚠️ |
| internal/cli/nats.go | 0.00% | 6 Missing ⚠️ |
| internal/job/client/agent.go | 97.97% | 1 Missing and 1 partial ⚠️ |
| internal/job/client/query.go | 77.77% | 1 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (82.73%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
- Coverage   99.98%   98.94%   -1.04%     
==========================================
  Files         157      161       +4     
  Lines        5293     5611     +318     
==========================================
+ Hits         5292     5552     +260     
- Misses          1       50      +49     
- Partials        0        9       +9

| Files with missing lines | Coverage Δ |
|---|---|
| internal/agent/condition.go | `100.00% <100.00%> (ø)` |
| internal/agent/consumer.go | `100.00% <100.00%> (ø)` |
| internal/agent/drain.go | `100.00% <100.00%> (ø)` |
| internal/agent/facts.go | `100.00% <100.00%> (ø)` |
| internal/agent/heartbeat.go | `100.00% <100.00%> (ø)` |
| internal/agent/server.go | `100.00% <100.00%> (ø)` |
| internal/authtoken/permissions.go | `100.00% <ø> (ø)` |
| internal/job/client/client.go | `100.00% <100.00%> (ø)` |
| internal/job/subjects.go | `100.00% <100.00%> (ø)` |
| internal/provider/node/mem/darwin_get_vm.go | `100.00% <100.00%> (ø)` |

... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04438c7...2c83e9f.

retr0h and others added 13 commits March 5, 2026 13:05

chore: remove legacy kubernetes patterns backlog

55a3957

Replaced by docs/plans/2026-03-05-node-conditions-drain-design.md

style: fix goimports formatting in condition_test.go

3f5a1f2

github-actions bot added kind/go kind/yaml kind/docs test/unit labels Mar 6, 2026

style: fix long lines in subjects_public_test.go

2c83e9f

retr0h force-pushed the feat/node-conditions-drain branch from ed79a86 to 2c83e9f Compare March 6, 2026 00:26

retr0h merged commit a76f54e into main Mar 6, 2026
7 checks passed

retr0h deleted the feat/node-conditions-drain branch March 6, 2026 00:31

retr0h merged 14 commits into main from feat/node-conditions-drain
