feat(pools): pool exhaustion handling + opt-in agent auth auto-reset #49

Merged
nnemirovsky merged 19 commits into main from pool-exhaustion-auth-reset
May 23, 2026
Conversation

@nnemirovsky
Owner

Problem

In the live deployment, both members of openai_pool hit the OpenAI Codex usage-limit 429 (a multi-hour quota window). sluice cooled each member for a flat 60s, then the all-cooling degrade path re-served the soonest-recovering member, which returned 429 again. The result was a perpetual openai_oauth to openai_oauth_2 failover flap and roughly two Telegram notices per minute, indefinitely. The sticky-failover fix (#48) stopped the snap-to-position-0 flap but left this degrade-path flap in place.

A second problem: when the whole pool is exhausted, the agent can latch "usage limit reached" and stop, and it then needs a manual hermes auth reset to resume.

What this does

Three fixes for the flap plus an opt-in auto-reset.

  • B1: derive the member cooldown from the upstream Retry-After / x-ratelimit-reset* headers (clamped 10s to 6h) instead of a flat 60s, so a quota-exhausted member stays cooled for the real window and is not re-probed every 60s.
  • A1: classify the pool as exhausted when no healthy member exists, not only when the failover target equals the failing member. Also collapse the exhausted-notice dedup key so the flap direction cannot produce two keys.
  • A2: edge-triggered notices. One "pool exhausted" on the healthy-to-exhausted edge and one "pool recovered" on the way back, driven by a time-based recovery monitor on the server (the agent emits no recovering traffic while latched). No periodic spam.
  • Auto-reset (opt-in, per pool): a new auth_reset_target column on the pool. When it is set and the pool recovers, sluice runs the agent auth-reset command (hermes profile, executed as uid 10000:10000 to avoid root-chowning auth.json) so the agent un-latches. An empty target means no reset, so this is off by default.

auth_reset_target is reachable from CLI, REST, and Telegram through the channel-agnostic internal/poolops, per the channel parity rule.

Notes for review

  • The auto-reset is opt-in and may not be needed in practice. We saw hermes recover on its own once a member's window reset. Leave auth_reset_target unset to get the flap and spam fix without sluice ever touching the agent.
  • B1 only fully engages if the upstream 429 carries a recognized header. With no header the cooldown falls back to 60s, but A1 and A2 still prevent the notice storm. Capturing a real Codex usage-limit 429 to confirm which header it sends is a follow-up (see the plan's Post-Completion section).
  • New migration 000008_pool_auth_reset adds the column. Its down migration snapshots the FK-tied member rows before the table rebuild so the cascade does not drop them.

Testing

go test ./... passes across all 14 packages, -race clean on the touched packages, gofumpt and go vet ./... clean, and make generate produces no diff. New unit tests cover the cooldown header parsing and clamp, exhaustion detection, the recovery monitor edges (unequal cooldowns, pool removed while exhausted, clean shutdown), the auth-reset argv and uid, validation parity across all channels, and migration 000008 up and down on a populated table.

Plan: docs/plans/completed/20260522-pool-exhaustion-and-agent-auth-reset.md

This comment was marked as outdated.

@nnemirovsky nnemirovsky requested a review from Copilot May 23, 2026 02:34

Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • internal/api/api.gen.go: Language not supported

@nnemirovsky nnemirovsky merged commit ccc7255 into main May 23, 2026
7 of 8 checks passed
@nnemirovsky nnemirovsky deleted the pool-exhaustion-auth-reset branch May 23, 2026 03:25

2 participants