
feat(pause-ttl): auto-expire stale pauses at all gate-traversal sites#464

Merged
cirwel merged 1 commit into master from claude/auto/20260518-183851-featpause-ttl-auto-expire-stale-pauses-a
May 19, 2026


@cirwel cirwel commented May 19, 2026

Auto-shipped by ship.sh — runtime path. Auto-merge is enabled; CI gate applies.

Sleep-wake artifacts can trigger a categorizer-driven pause whose underlying 'stale state input' cause has long since resolved, but the pause itself persists indefinitely because every subsequent gate-traversal is rejected before the categorizer can re-evaluate. The 2026-05-09 → 2026-05-18 Watcher/Sentinel/Lumen silence (recovered 2026-05-18 via operator self_recovery for three agents) was this class of bug.

This PR adds a TTL: pauses older than PAUSE_AUTO_EXPIRE_SECONDS (default 72h, env-overridable via UNITARES_PAUSE_AUTO_EXPIRE_SECONDS) auto-clear at any gate the agent traverses (process_agent_update via Phase 1, KG/dialectic/etc. via check_agent_can_operate). On expire: status flips to 'active', paused_at clears (preserving the system invariant status=paused ⟺ paused_at truthy that auto_ground_truth.py and agent_metadata_model.py rely on), a lifecycle event is appended, and an audit row goes to audit.events as event_type=pause_auto_expired for operator visibility.
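
The expiry transition described above can be sketched roughly as follows. This is a minimal illustration, not the code in pause_ttl.py: the function names, the dict-based `meta` record, and the shape of the lifecycle event are all assumptions made for the example; only the env var name, the 72h default, and the status/paused_at invariant come from the description.

```python
import os
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_SECONDS = 72 * 3600  # 72h default named in the PR description


def pause_ttl_seconds(env=os.environ):
    """Read the TTL from the env var, falling back to the default on bad input."""
    raw = env.get("UNITARES_PAUSE_AUTO_EXPIRE_SECONDS")
    try:
        value = float(raw) if raw is not None else DEFAULT_TTL_SECONDS
    except ValueError:
        return DEFAULT_TTL_SECONDS
    return value if value > 0 else DEFAULT_TTL_SECONDS


def maybe_expire_pause(meta, now, ttl_seconds):
    """Hypothetical helper: flip a stale pause to active; return True if expired.

    `meta` is a plain dict standing in for the real agent metadata object.
    `paused_at` and `now` are assumed to be timezone-aware datetimes.
    """
    paused_at = meta.get("paused_at")
    if meta.get("status") != "paused" or not paused_at:
        return False
    if (now - paused_at).total_seconds() <= ttl_seconds:
        return False
    meta["status"] = "active"
    meta["paused_at"] = None  # keeps status=paused <=> paused_at truthy
    # Event shape is assumed; the PR only says a lifecycle event is appended.
    meta.setdefault("lifecycle_events", []).append(
        {"event": "pause_auto_expired", "at": now.isoformat()}
    )
    return True
```

Clearing `paused_at` in the same step as the status flip is what keeps the invariant intact; a helper that only changed `status` would leave the record in the inconsistent state that v1 was flagged for.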

The categorizer's existing gap_suppression (governance_monitor.py:865, GAP_RECOVERY_CYCLES) handles the first-after-gap state; a genuinely degraded agent re-pauses on the next cycle via the normal circuit-breaker path.

Council-passed across 3 lanes after a v1 redraft. v1 was scoped to phases.py only and missed: asymmetric coverage (KG/dialectic still blocked), TZ-naive comparison bug on non-UTC hosts, env var documented but not wired, invariant violation from retaining paused_at, missing test for the TZ case. All addressed in v2.
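
For readers unfamiliar with the TZ-naive class of bug: comparing a stored timestamp against a naive local `datetime.now()` skews the computed pause age by the host's UTC offset. One way to avoid it is to normalise everything to aware UTC before subtracting. The sketch below assumes `paused_at` is stored as an ISO-8601 string and that naive values mean UTC; both are assumptions for illustration, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_paused_at(value):
    """Parse an ISO-8601 string; treat naive values as UTC (an assumption)."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def pause_age_seconds(paused_at_iso, now=None):
    """Age of the pause in seconds, computed entirely in aware UTC."""
    now = now or datetime.now(timezone.utc)
    return (now - parse_paused_at(paused_at_iso)).total_seconds()
```

With this shape, a naive timestamp, an explicit `+00:00` timestamp, and a `+02:00` timestamp for the same instant all yield the same age regardless of the host's local timezone.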

Architectural decisions:
- Helpers live in src/mcp_handlers/support/pause_ttl.py so both sync (check_agent_can_operate) and async (Phase 1) gates share behavior.
- The sync entry point schedules persistence as fire-and-forget, using the same pattern as coordination_failure_emit._schedule_coordination_events_dual_write.
- All pause sources in the codebase are categorizer-driven (only agent_loop_detection.py:513 sets status=paused, only on decision_action=='pause' from monitor_decision.py's four pause paths); loop-detection uses a separate cooldown mechanism and is not affected.

Live-verifier flagged that 9 of the 11 currently paused agents are more than 24h stale and would auto-expire on first check-in under this change. Under the 72h default, 9 of 11 would still auto-expire on the first gate-traversal after merge, because they have not attempted a check-in for weeks. The categorizer will re-pause any that are genuinely degraded.
@github-actions

✅ Documentation Validation Passed

Tool Count: 7 tools
Version: 2.13.0

All documentation is synchronized with the codebase.

    )
except Exception as exc:  # noqa: BLE001 — observability MUST NOT mask
    logger.debug(
        "[pause-ttl] audit emit failed for %s: %r", agent_uuid[:12], exc

except Exception as exc:  # noqa: BLE001
    logger.debug(
        "[pause-ttl] persistence schedule failed for %s: %r",
        agent_uuid[:12],

logger.warning(
    "[pause-ttl] auto-expired stale pause for %s (paused_at=%s); "
    "categorizer will re-evaluate (sync path)",
    agent_uuid[:12],
@cirwel cirwel merged commit bb0bf43 into master May 19, 2026
5 of 6 checks passed
@cirwel cirwel deleted the claude/auto/20260518-183851-featpause-ttl-auto-expire-stale-pauses-a branch May 19, 2026 01:16
cirwel added a commit that referenced this pull request May 19, 2026
…vent GC drop (#465)

Per asyncio docs (Python 3.11+): bare loop.create_task(coro) returns a Task that CPython GC can collect mid-await if no caller holds a reference. Three sites in the codebase had this hazard, all flagged by Watcher (#69f2ccbc, #0a0616c2, #acfc7012 — 2026-05-18):

1. src/audit_log.py _write_entry — fire-and-forget PG audit tail. Coroutine awaits asyncpg via append_audit_event_async; GC mid-await would silently drop the row that backs audit.events queries (the same surface §129 reads from).

2. src/mcp_handlers/updates/phases.py — _persist_thread_identity_async (per check-in, high frequency).

3. src/mcp_handlers/updates/phases.py — _persist_inferred_purpose_async (per check-in, high frequency).

All three fixed using the established 'module-local set + add_done_callback' pattern already in src/coordination_failure_emit.py:_inflight_dedicated_writes and src/mcp_handlers/support/pause_ttl.py:_inflight_persistence_tasks. Regression tests in tests/test_p001_task_refs.py pin the pattern at each new helper.

These were pre-existing bugs surfaced during the pause-ttl work (PR #464); they predate that change and live on master. Watcher fingerprints will be resolved against this commit per the audit-trail convention in CLAUDE.md.
