feat(pause-ttl): auto-expire stale pauses at all gate-traversal sites #464
Merged
cirwel merged 1 commit on May 19, 2026
Sleep-wake artifacts can trigger a categorizer-driven pause whose underlying 'stale state input' cause has long since resolved, but the pause itself persists indefinitely because every subsequent gate-traversal is rejected before the categorizer can re-evaluate. The 2026-05-09 → 2026-05-18 Watcher/Sentinel/Lumen silence (recovered 2026-05-18 via operator self_recovery for three agents) was this class of bug.

This PR adds a TTL: pauses older than PAUSE_AUTO_EXPIRE_SECONDS (default 72h, env-overridable via UNITARES_PAUSE_AUTO_EXPIRE_SECONDS) auto-clear at any gate the agent traverses (process_agent_update via Phase 1, KG/dialectic/etc. via check_agent_can_operate).

On expire: status flips to 'active', paused_at clears (preserving the system invariant status=paused ⟺ paused_at truthy that auto_ground_truth.py and agent_metadata_model.py rely on), a lifecycle event is appended, and an audit row goes to audit.events as event_type=pause_auto_expired for operator visibility. The categorizer's existing gap_suppression (governance_monitor.py:865, GAP_RECOVERY_CYCLES) handles the first-after-gap state; a genuinely degraded agent re-pauses on the next cycle via the normal circuit-breaker path.

Council-passed across 3 lanes after a v1 redraft. v1 was scoped to phases.py only and missed: asymmetric coverage (KG/dialectic still blocked), a TZ-naive comparison bug on non-UTC hosts, an env var that was documented but not wired, an invariant violation from retaining paused_at, and a missing test for the TZ case. All addressed in v2.

Architectural decisions:
- Helpers live in src/mcp_handlers/support/pause_ttl.py so both sync (check_agent_can_operate) and async (Phase 1) gates share behavior.
- The sync entry point schedules persistence fire-and-forget, using the same pattern as coordination_failure_emit._schedule_coordination_events_dual_write.
- All pause sources in the codebase are categorizer-driven (only agent_loop_detection.py:513 sets status=paused, and only on decision_action=='pause' from monitor_decision.py's four pause paths); loop-detection uses a separate cooldown mechanism and is not affected.

Live-verifier flagged that 9 of 11 currently paused agents are >24h stale and would auto-expire on first check-in under this change. Under the 72h default, 9 of 11 would still auto-expire on the FIRST gate-traversal after merge, since they have not attempted a check-in for weeks. The categorizer will re-pause any that are genuinely degraded.
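The expiry rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in pause_ttl.py: the function names `pause_ttl_seconds`, `_as_utc`, and `maybe_expire_pause`, the dict-shaped agent record, the `lifecycle_events` key, and the choice to treat naive timestamps as UTC are all assumptions made for the example. Only the env var name, the 72h default, and the status/paused_at invariant come from the description.

```python
import os
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_SECONDS = 72 * 3600  # 72h default, per the PR description


def pause_ttl_seconds(env=None) -> int:
    """Read the TTL from the environment, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("UNITARES_PAUSE_AUTO_EXPIRE_SECONDS")
    try:
        value = int(raw) if raw is not None else DEFAULT_TTL_SECONDS
    except ValueError:
        return DEFAULT_TTL_SECONDS
    return value if value > 0 else DEFAULT_TTL_SECONDS


def _as_utc(ts: datetime) -> datetime:
    # Assumption: naive timestamps were written in UTC. Normalizing both
    # sides avoids comparing naive and aware datetimes on non-UTC hosts.
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def maybe_expire_pause(agent: dict, now=None, ttl_seconds=None) -> bool:
    """Clear a stale pause in place; return True if the pause expired."""
    if agent.get("status") != "paused" or not agent.get("paused_at"):
        return False
    now = _as_utc(now or datetime.now(timezone.utc))
    ttl = pause_ttl_seconds() if ttl_seconds is None else ttl_seconds
    paused_at = _as_utc(agent["paused_at"])
    if now - paused_at < timedelta(seconds=ttl):
        return False
    # Flip both fields together so status=paused <=> paused_at truthy holds.
    agent["status"] = "active"
    agent["paused_at"] = None
    agent.setdefault("lifecycle_events", []).append(
        {
            "event": "pause_auto_expired",
            "paused_at": paused_at.isoformat(),
            "expired_at": now.isoformat(),
        }
    )
    return True
```

Clearing status and paused_at in the same step is what keeps the invariant intact; the audit row and persistence are left out of this sketch because they depend on the project's own storage layer.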
✅ Documentation Validation Passed
Tool Count: 7 tools
All documentation is synchronized with the codebase.
    )
except Exception as exc:  # noqa: BLE001 - observability MUST NOT mask
    logger.debug(
        "[pause-ttl] audit emit failed for %s: %r", agent_uuid[:12], exc

except Exception as exc:  # noqa: BLE001
    logger.debug(
        "[pause-ttl] persistence schedule failed for %s: %r",
        agent_uuid[:12],

logger.warning(
    "[pause-ttl] auto-expired stale pause for %s (paused_at=%s); "
    "categorizer will re-evaluate (sync path)",
    agent_uuid[:12],
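The fragments above suggest that audit emission and persistence scheduling are treated as best-effort: a failure is logged at debug level and never propagates to the gate. A minimal sketch of that shape, assuming a hypothetical `emit_best_effort` wrapper and a caller-supplied `emit` callable (neither name appears in the diff):

```python
import logging

logger = logging.getLogger("pause_ttl_sketch")


def emit_best_effort(emit, agent_uuid: str, **fields) -> bool:
    """Call an audit emitter; swallow and log any failure.

    Returns True if the emitter ran cleanly, False otherwise. The caller
    carries on either way, so an observability outage cannot block the
    agent from being un-paused.
    """
    try:
        emit(agent_uuid=agent_uuid, **fields)
        return True
    except Exception as exc:  # noqa: BLE001 - deliberately broad
        logger.debug(
            "[pause-ttl] audit emit failed for %s: %r", agent_uuid[:12], exc
        )
        return False
```

The broad except is intentional here: the expiry decision has already been made by the time the emitter runs, so the only thing a raised exception could do is hide that outcome.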
cirwel added a commit that referenced this pull request on May 19, 2026
…vent GC drop (#465)

Per asyncio docs (Python 3.11+): bare loop.create_task(coro) returns a Task that CPython GC can collect mid-await if no caller holds a reference. Three sites in the codebase had this hazard, all flagged by Watcher (#69f2ccbc, #0a0616c2, #acfc7012; 2026-05-18):

1. src/audit_log.py _write_entry: fire-and-forget PG audit tail. The coroutine awaits asyncpg via append_audit_event_async; GC mid-await would silently drop the row that backs audit.events queries (the same surface §129 reads from).
2. src/mcp_handlers/updates/phases.py: _persist_thread_identity_async (per check-in, high frequency).
3. src/mcp_handlers/updates/phases.py: _persist_inferred_purpose_async (per check-in, high frequency).

All three fixed using the established 'module-local set + add_done_callback' pattern already in src/coordination_failure_emit.py:_inflight_dedicated_writes and src/mcp_handlers/support/pause_ttl.py:_inflight_persistence_tasks. Regression tests in tests/test_p001_task_refs.py pin the pattern at each new helper.

These were pre-existing bugs surfaced during the pause-ttl work (PR #464); they predate that change and live on master. Watcher fingerprints will be resolved against this commit per the audit-trail convention in CLAUDE.md.
Auto-shipped by ship.sh — runtime path. Auto-merge is enabled; CI gate applies.