
fix: KEEP-570 reaper now detects monitors stuck across silent reconnects #1272

Merged
joelorzet merged 2 commits into staging from fix/KEEP-570-reaper-detects-stuck-monitor
May 15, 2026

Conversation


@suisuss commented on May 15, 2026

Summary

Fixes a coverage gap in BlockMonitorService where the reaper's isAlive() check could never see a monitor as stuck if its own in-process reconnect cycle kept firing. Discovered while investigating KEEP-570.

Why this bug was invisible

subscribeToBlocks reset lastBlockAdvanceAt to Date.now() on every reconnect, with the explicit intent of giving freshly-subscribed monitors a warmup window before the reaper considered them stuck. The side effect: any monitor whose internal reconnect cycle kept firing every BLOCK_ADVANCE_TIMEOUT_MS without ever receiving a real block looked permanently fresh to the reconciler. Each reconnect refreshed the same clock the reconciler was reading.
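
A minimal sketch of the pre-fix interaction, for illustration (ChainMonitor, subscribeToBlocks, lastBlockAdvanceAt, and isAlive are named in this PR; the bodies and wiring below are reconstructions, not the actual source):

  // Illustrative reconstruction of the pre-fix bug shape (TypeScript).
  const MONITOR_RECREATE_TIMEOUT_MS = 120_000; // reaper threshold, per the test plan

  class ChainMonitor {
    private lastBlockAdvanceAt: number | null = null;

    // Runs on cold start AND on every silent reconnect.
    private subscribeToBlocks(): void {
      // Intended as a warmup seed for a fresh subscription, but because it
      // also fires on reconnect, a silent_reconnect loop ticking every
      // BLOCK_ADVANCE_TIMEOUT_MS keeps this timestamp permanently fresh
      // even when no real block ever arrives.
      this.lastBlockAdvanceAt = Date.now();
    }

    // Polled by the dispatcher reaper.
    isAlive(): boolean {
      // Reads the very clock the reconnect loop keeps resetting.
      const sinceAdvance = Date.now() - (this.lastBlockAdvanceAt ?? 0);
      return sinceAdvance < MONITOR_RECREATE_TIMEOUT_MS;
    }
  }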

The dispatcher reaper consequently never tore down stuck monitors, so the in-process ChainMonitor state producing the stuck behavior was never replaced with a fresh instance, even though BlockMonitorService.startNewMonitor is the cheapest possible recovery action (and is effectively what a pod restart does for every chain at once).
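
For context, a hypothetical shape of that reaper path (startNewMonitor and isAlive come from this PR; the monitors map, the loop, and stop() are assumptions):

  // Hypothetical reconciler pass in BlockMonitorService (sketch only).
  interface ChainMonitor {
    isAlive(): boolean;
    stop(): void; // hypothetical teardown hook
  }

  class BlockMonitorService {
    private monitors = new Map<string, ChainMonitor>();

    private reapStuckMonitors(): void {
      for (const [chain, monitor] of this.monitors) {
        if (!monitor.isAlive()) {
          monitor.stop();
          // Cheapest recovery: effectively a per-chain pod restart.
          this.startNewMonitor(chain);
        }
      }
    }

    startNewMonitor(chain: string): void {
      // Creates and starts a fresh ChainMonitor for `chain` (body elided).
    }
  }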

Observation that motivated the fix

Prod log signal from keeperhub-block-common-77cf4b46f9-vv54z (uptime 5h 22m at sampling):

  • 6 chains stuck simultaneously, each at exactly 7 silent_reconnect cycles: Tempo Testnet, Polygon, Ethereum Sepolia, Base Sepolia, Avalanche Fuji, Arbitrum One
  • 5000 recent log lines: 0 "Monitor for X is dead, restarting" entries

A fresh ethers WebSocketProvider, run via kubectl exec inside that pod against the same URLs the dispatcher is silent on, returns BOTH_HEALTHY: 25+ newHeads in 45s. The failure is therefore in-process, and a fresh ChainMonitor instance is the diagnostic-by-action: if recreating one unsticks the chain, the bug lives in ChainMonitor state. The reaper is the path that was supposed to do exactly that and could not.
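
A probe along those lines (a sketch assuming ethers v6; WSS_URL is a placeholder for whichever endpoint the stuck monitor uses):

  // Standalone newHeads probe, run via kubectl exec inside the pod.
  import { WebSocketProvider } from "ethers";

  const provider = new WebSocketProvider(process.env.WSS_URL!);

  let heads = 0;
  provider.on("block", (blockNumber: number) => {
    heads += 1;
    console.log(`newHeads #${heads}: block ${blockNumber}`);
  });

  // Sample for 45s, mirroring the observation window above.
  setTimeout(async () => {
    console.log(heads > 0 ? `HEALTHY: ${heads} newHeads in 45s` : "SILENT: 0 newHeads in 45s");
    await provider.destroy();
    process.exit(0);
  }, 45_000);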

Change

Split the cold-start warmup baseline from the real-block-advance clock:

  • New private field monitorBootAt set in start(). Used as the staleness baseline before any real block arrives.
  • lastBlockAdvanceAt is no longer reset on resubscribe. It is updated only by processBlockRange on a real height advance, exactly as its comment claims.
  • isAlive() now reads:
const stalenessBaseline = this.lastBlockAdvanceAt ?? this.monitorBootAt;

so the reconciler can see "no real block since boot" as stale once MONITOR_RECREATE_TIMEOUT_MS elapses, while a freshly-booted monitor still gets a full warmup window.

The in-monitor noBlockTimer that drives BLOCK_ADVANCE_TIMEOUT_MS-based reconnects is unchanged. This PR only changes the reconciler's view of liveness.
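
A sketch of the post-fix shape (field and method names from this PR; bodies illustrative):

  const MONITOR_RECREATE_TIMEOUT_MS = 120_000;

  class ChainMonitor {
    private monitorBootAt = 0;
    private lastBlockAdvanceAt: number | null = null;

    start(): void {
      this.monitorBootAt = Date.now(); // cold-start warmup baseline
      this.subscribeToBlocks();
    }

    private subscribeToBlocks(): void {
      // No longer touches lastBlockAdvanceAt, on cold start or on reconnect.
    }

    private processBlockRange(_newHeight: number): void {
      // Now the only writer: a real height advance.
      this.lastBlockAdvanceAt = Date.now();
    }

    isAlive(): boolean {
      const stalenessBaseline = this.lastBlockAdvanceAt ?? this.monitorBootAt;
      return Date.now() - stalenessBaseline < MONITOR_RECREATE_TIMEOUT_MS;
    }
  }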

Test changes

  • "stays alive when blocks are arriving regularly" was passing for the wrong reason. With default BLOCK_ADVANCE_TIMEOUT_MS=60_000, the test's 5-minute gaps between emitted blocks triggered reconnect cycles, and the subscribe-time reset of lastBlockAdvanceAt masked the fact that no real blocks were being delivered to the reconnected providers. Pinned BLOCK_ADVANCE_TIMEOUT_MS to 60 min so the test asserts what its name claims.
  • Added "reaps a monitor stuck across silent reconnects with no real blocks" (sketched below). Drives three silent 60s windows and verifies isAlive() returns false once the 120s reaper threshold has elapsed.
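
A sketch of that regression test's shape, assuming vitest-style fake timers (the project's actual test framework and the monitor's construction are assumptions; vitest fakes Date.now alongside timers by default):

  import { expect, it, vi } from "vitest";

  it("reaps a monitor stuck across silent reconnects with no real blocks", () => {
    vi.useFakeTimers();
    const monitor = new ChainMonitor(/* constructor args elided */);
    monitor.start();

    // Three silent 60s windows: each elapses BLOCK_ADVANCE_TIMEOUT_MS and
    // fires an in-monitor reconnect, but no real block ever lands, so
    // lastBlockAdvanceAt stays null and monitorBootAt remains the baseline.
    for (let i = 0; i < 3; i++) {
      vi.advanceTimersByTime(60_000);
    }

    // 180s of silence since boot exceeds the 120s reaper threshold.
    expect(monitor.isAlive()).toBe(false);
    vi.useRealTimers();
  });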

Test plan

  • pnpm test:unit in keeperhub-scheduler: 81/81 pass (was 80, +1 regression test)
  • pnpm tsc --noEmit -p keeperhub-scheduler/tsconfig.json: clean
  • Canary deploy and observe whether stuck chains get reaped by the reconciler, and whether reaping unsticks them. The result discriminates between two hypotheses:
    • Reap unsticks the chain -> bug is in ChainMonitor per-instance state. New ChainMonitor = recovery.
    • Reap does not unstick -> bug is in shared process state below ChainMonitor. Need a different fix (or pod restart for now).

Commit messages

Before this change, subscribeToBlocks reset lastBlockAdvanceAt to the
current time on every reconnect, "seeding the staleness clock" so a
freshly-subscribed monitor would not be reaped before its first block
arrived. The side effect was that any monitor whose in-process reconnect
cycle kept firing every BLOCK_ADVANCE_TIMEOUT_MS without ever receiving
a real block looked permanently fresh to BlockMonitorService.isAlive(),
because each reconnect refreshed the same clock the reconciler was
checking. The dispatcher reaper consequently never tore down stuck
monitors, and the in-process state that was causing the stuck behavior
was never replaced.

Observed in prod: with the 5h22m-uptime dispatcher pod cycling 6 chains
through silent-reconnect loops every 5 minutes, 5000 log lines contained
zero "Monitor for X is dead, restarting" entries despite chains being
stuck for 30+ minutes each.

Change: split the cold-start warmup baseline from the real-block-advance
clock. New private field monitorBootAt is set in start() and used as the
staleness baseline before any real block arrives. lastBlockAdvanceAt is
no longer reset on resubscribe and is updated only by processBlockRange
on a real height advance. isAlive() now reads:

  stalenessBaseline = lastBlockAdvanceAt ?? monitorBootAt

so the reconciler can see "no real block since boot" as stale once
MONITOR_RECREATE_TIMEOUT_MS elapses, while a freshly-booted monitor still
gets a full warmup window. The in-monitor noBlockTimer (which is what
drives BLOCK_ADVANCE_TIMEOUT_MS-based reconnects) is unchanged.

Test changes:
- "stays alive when blocks are arriving regularly" was previously
  passing for the wrong reason: BLOCK_ADVANCE_TIMEOUT_MS was at its
  default 60s, the 5-min gaps between emitted blocks triggered
  reconnects, and the subscribe-time reset of lastBlockAdvanceAt
  masked the reality that no real blocks were being delivered to the
  reconnected providers. Pinned BLOCK_ADVANCE_TIMEOUT_MS to 60min so
  the test asserts what its name claims.
- Added a regression test that drives three silent 60s windows with no
  real blocks and confirms isAlive() returns false after the 120s
  reaper threshold.

81/81 scheduler unit tests pass.

Second commit (lint fix): a single formatter-only change in
chain-monitor.ts. The line-wrapped stalenessBaseline /
MONITOR_RECREATE_TIMEOUT_MS comparison was collapsed onto one line per
Biome rules. No logic change. 81/81 scheduler unit tests still pass,
typecheck clean. Resolves the lint failure blocking #1272.
@joelorzet requested review from a team, OleksandrUA, eskp, and joelorzet, and removed the request for a team on May 15, 2026 16:31
@joelorzet merged commit de67a13 into staging on May 15, 2026
45 checks passed
@joelorzet deleted the fix/KEEP-570-reaper-detects-stuck-monitor branch on May 15, 2026 16:32
@github-actions

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1272
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully
