feat: add node conditions and agent drain/cordon #225
Merged
Design doc covering Kubernetes-inspired node conditions (MemoryPressure, HighLoad, DiskPressure) and agent drain/cordon lifecycle with append-only timeline events. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add Condition type with MemoryPressure, HighLoad, DiskPressure
- Add agent state constants (Ready, Draining, Cordoned)
- Add condition evaluation functions with 100% test coverage
- Add conditions config (thresholds) to AgentConfig
- Extend heartbeat to collect disk stats and evaluate conditions
- Add WriteAgentTimelineEvent/GetAgentTimeline/ComputeAgentState
- Add agent:write permission to admin role
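As a rough illustration of the condition evaluation described in this commit, the sketch below checks one heartbeat sample against configured thresholds. The `Thresholds`, `Sample`, and `evaluate` names, the field layout of `Condition`, and the choice to normalise load by CPU count are all assumptions made for the example; only the three condition names come from the commit message.

```go
package main

import "fmt"

// Condition is a Kubernetes-style node condition.
// The field layout is illustrative, not the project's actual type.
type Condition struct {
	Type   string
	Status bool
	Reason string
}

// Thresholds holds hypothetical limits; the real AgentConfig keys may differ.
type Thresholds struct {
	MemoryPercent float64 // used memory, percent of total
	LoadPerCPU    float64 // 1-minute load average divided by CPU count
	DiskPercent   float64 // used disk, percent of capacity
}

// Sample is one heartbeat's worth of host metrics (hypothetical shape).
type Sample struct {
	MemUsedPercent  float64
	Load1           float64
	CPUs            int
	DiskUsedPercent float64
}

// evaluate returns all three conditions, each marked active when its
// metric meets or exceeds the configured threshold.
func evaluate(s Sample, t Thresholds) []Condition {
	cpus := s.CPUs
	if cpus < 1 {
		cpus = 1 // avoid dividing by zero on a bad sample
	}
	loadPerCPU := s.Load1 / float64(cpus)
	return []Condition{
		{
			Type:   "MemoryPressure",
			Status: s.MemUsedPercent >= t.MemoryPercent,
			Reason: fmt.Sprintf("memory %.1f%% used", s.MemUsedPercent),
		},
		{
			Type:   "HighLoad",
			Status: loadPerCPU >= t.LoadPerCPU,
			Reason: fmt.Sprintf("load %.2f per CPU", loadPerCPU),
		},
		{
			Type:   "DiskPressure",
			Status: s.DiskUsedPercent >= t.DiskPercent,
			Reason: fmt.Sprintf("disk %.1f%% used", s.DiskUsedPercent),
		},
	}
}

func main() {
	t := Thresholds{MemoryPercent: 90, LoadPerCPU: 2, DiskPercent: 90}
	s := Sample{MemUsedPercent: 95, Load1: 1.5, CPUs: 4, DiskUsedPercent: 40}
	for _, c := range evaluate(s, t) {
		fmt.Printf("%-15s active=%-5t %s\n", c.Type, c.Status, c.Reason)
	}
}
```

Keeping evaluation a pure function of a sample and thresholds is one way to make full test coverage straightforward, since no host probing is needed in tests.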
Replaced by docs/plans/2026-03-05-node-conditions-drain-design.md
- POST /agent/{hostname}/drain with agent:write permission
- POST /agent/{hostname}/undrain with agent:write permission
- Add NodeCondition and TimelineEvent schemas to OpenAPI
- Add state, conditions, timeline fields to AgentInfo
- Full RBAC test coverage (401, 403, 200)
- Unit tests for drain (4 cases) and undrain (5 cases)
- Agent detects drain flag on heartbeat, transitions state
- writeRegistration includes State field
- GetAgent returns timeline events and conditions/state
- agentInfoFromRegistration maps conditions and state
- Full test coverage for drain detection and timeline
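The heartbeat-driven transition can be pictured as a small state machine that appends a timeline event only when the state actually changes. The sketch below is a guess at the semantics: it assumes an agent is Draining while in-flight jobs remain and Cordoned once it is idle, which the commit message does not spell out. `Agent`, `Event`, `nextState`, and `heartbeat` are hypothetical names.

```go
package main

import "fmt"

type State string

const (
	Ready    State = "Ready"
	Draining State = "Draining"
	Cordoned State = "Cordoned"
)

// Event is one append-only timeline entry (illustrative shape).
type Event struct {
	State   State
	Message string
}

// Agent tracks the current state and its transition history.
type Agent struct {
	State    State
	Timeline []Event
}

// nextState derives the target state from the drain flag and the number
// of jobs still running. ASSUMPTION: Draining means "drain requested,
// work still in flight"; Cordoned means "drain requested, idle".
func nextState(drainRequested bool, inFlight int) State {
	switch {
	case !drainRequested:
		return Ready
	case inFlight > 0:
		return Draining
	default:
		return Cordoned
	}
}

// heartbeat applies one observation and records an event on change.
func (a *Agent) heartbeat(drainRequested bool, inFlight int) {
	next := nextState(drainRequested, inFlight)
	if next == a.State {
		return // no transition, so nothing is appended
	}
	a.Timeline = append(a.Timeline, Event{
		State:   next,
		Message: fmt.Sprintf("%s -> %s", a.State, next),
	})
	a.State = next
}

func main() {
	a := &Agent{State: Ready}
	a.heartbeat(true, 2)  // drain flag seen, two jobs still running
	a.heartbeat(true, 0)  // jobs finished
	a.heartbeat(false, 0) // operator ran undrain
	for _, e := range a.Timeline {
		fmt.Println(e.Message)
	}
}
```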
Add agent drain/undrain subcommands with --hostname flag. Extend agent list with CONDITIONS column showing active conditions. Extend agent get with conditions table, state display, and timeline table.
Add agent-lifecycle.md feature page covering node conditions and drain/cordon. Add drain/undrain CLI docs. Update agent list/get docs with conditions and timeline. Add agent:write to permissions tables. Add conditions config to configuration reference. Update README, navbar dropdown, and system architecture.
On macOS and Linux, "free" memory is often tiny because the OS uses otherwise idle RAM for file cache. Computing usage as Total - Free therefore overstates pressure. Available memory (Free + Cached) counts the reclaimable cache as usable and better reflects actual pressure.
gopsutil does not populate Cached on macOS (it is always 0), so Free + Cached was identical to Free, causing false 96% pressure on machines with plenty of reclaimable memory. Use the Available field instead, which gopsutil computes correctly per platform.
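The difference between the two calculations is easy to see with plain numbers. The sketch below takes totals as arguments rather than calling gopsutil, so it runs without dependencies; in the agent, the inputs would presumably come from the `Total`, `Free`, and `Available` fields of gopsutil's `mem.VirtualMemory()` result. The function name `usedPercent` and the sample figures are made up for illustration.

```go
package main

import "fmt"

// usedPercent reports memory pressure as a percentage of total,
// treating unused as the amount that is not in use.
func usedPercent(total, unused uint64) float64 {
	if total == 0 {
		return 0 // no data; avoid dividing by zero
	}
	if unused > total {
		unused = total // guard against inconsistent readings
	}
	return float64(total-unused) / float64(total) * 100
}

func main() {
	// Hypothetical 16 GiB machine, values in MiB: little memory is
	// strictly free, but half of it is reclaimable file cache.
	const (
		total     = 16384
		free      = 655
		available = 8192
	)
	fmt.Printf("Total - Free:      %.1f%% used\n", usedPercent(total, free))
	fmt.Printf("Total - Available: %.1f%% used\n", usedPercent(total, available))
}
```

With a MemoryPressure threshold near 90%, the first figure would trip the condition on a healthy machine while the second would not.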
On drain, cancel the consumer context and wait for all consumer goroutines to exit, so the agent stops receiving new jobs from NATS. On undrain, create a new consumer context and restart all consumers. Consumers use a separate WaitGroup from heartbeat/facts so drain only affects job delivery.
Move drain flags and timeline events from ephemeral buckets (job-queue 1h TTL, agent-registry 30s TTL) to a dedicated agent-state KV bucket with no TTL. This ensures operator drain/undrain actions and state transition history persist indefinitely. Add tests for drain flag operations and state filtering in CountExpectedAgents.
Clarify that all scenarios for a function (success, errors, edge cases) belong as rows in a single table-driven test and must never be split into separate TestFoo, TestFooError, TestFooNilResponse methods.
Contributor
Thank you for contributing to this project! 😊🕹️
ed79a86 to 2c83e9f
Codecov Report
❌ Your patch status has failed because the patch coverage (82.73%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
- Coverage   99.98%   98.94%   -1.04%
==========================================
  Files         157      161       +4
  Lines        5293     5611     +318
==========================================
+ Hits         5292     5552     +260
- Misses          1       50      +49
- Partials        0        9       +9
Continue to review full report in Codecov by Sentry.
Summary
- agent list (CONDITIONS column) and agent get (full details)
- drain/undrain CLI commands, REST API endpoints, and NATS consumer subscribe/unsubscribe for graceful maintenance
- agent:write permission for drain/undrain operations, included in the admin role by default

Test plan
- go test ./...
- just go::vet: 0 issues