
fix: KEEP-570 reaper now detects monitors stuck across silent reconnects #1272

Merged
joelorzet merged 2 commits into staging from fix/KEEP-570-reaper-detects-stuck-monitor
May 15, 2026

Conversation


@suisuss commented on May 15, 2026

Summary

Fixes a coverage gap in BlockMonitorService where the reaper's isAlive() check could never see a monitor as stuck if its own in-process reconnect cycle kept firing. Discovered while investigating KEEP-570.

Why this bug was invisible

subscribeToBlocks reset lastBlockAdvanceAt to Date.now() on every reconnect, with the explicit intent of giving freshly-subscribed monitors a warmup window before the reaper considered them stuck. The side effect: any monitor whose internal reconnect cycle kept firing every BLOCK_ADVANCE_TIMEOUT_MS without ever receiving a real block looked permanently fresh to the reconciler. Each reconnect refreshed the same clock the reconciler was reading.
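
A minimal sketch of the pre-fix interaction, for illustration (ChainMonitor, subscribeToBlocks, lastBlockAdvanceAt, and isAlive are named in this PR; the bodies and wiring below are reconstructions, not the actual source):

  // Illustrative reconstruction of the pre-fix bug shape (TypeScript).
  const MONITOR_RECREATE_TIMEOUT_MS = 120_000; // reaper threshold, per the test plan

  class ChainMonitor {
    private lastBlockAdvanceAt: number | null = null;

    // Runs on cold start AND on every silent reconnect.
    private subscribeToBlocks(): void {
      // Intended as a warmup seed for a fresh subscription, but because it
      // also fires on reconnect, a silent_reconnect loop ticking every
      // BLOCK_ADVANCE_TIMEOUT_MS keeps this timestamp permanently fresh
      // even when no real block ever arrives.
      this.lastBlockAdvanceAt = Date.now();
    }

    // Polled by the dispatcher reaper.
    isAlive(): boolean {
      // Reads the very clock the reconnect loop keeps resetting.
      const sinceAdvance = Date.now() - (this.lastBlockAdvanceAt ?? 0);
      return sinceAdvance < MONITOR_RECREATE_TIMEOUT_MS;
    }
  }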

The dispatcher reaper consequently never tore down stuck monitors, so the in-process ChainMonitor state producing the stuck behavior was never replaced with a fresh instance, even though BlockMonitorService.startNewMonitor is the cheapest possible recovery action (and is effectively what a pod restart does for every chain at once).
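
For context, a hypothetical shape of that reaper path (startNewMonitor and isAlive come from this PR; the monitors map, the loop, and stop() are assumptions):

  // Hypothetical reconciler pass in BlockMonitorService (sketch only).
  interface ChainMonitor {
    isAlive(): boolean;
    stop(): void; // hypothetical teardown hook
  }

  class BlockMonitorService {
    private monitors = new Map<string, ChainMonitor>();

    private reapStuckMonitors(): void {
      for (const [chain, monitor] of this.monitors) {
        if (!monitor.isAlive()) {
          monitor.stop();
          // Cheapest recovery: effectively a per-chain pod restart.
          this.startNewMonitor(chain);
        }
      }
    }

    startNewMonitor(chain: string): void {
      // Creates and starts a fresh ChainMonitor for `chain` (body elided).
    }
  }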

Observation that motivated the fix

Prod log signal from keeperhub-block-common-77cf4b46f9-vv54z (uptime 5h 22m at sampling):

  • 6 chains stuck simultaneously, each at exactly 7 silent_reconnect cycles: Tempo Testnet, Polygon, Ethereum Sepolia, Base Sepolia, Avalanche Fuji, Arbitrum One
  • 5000 recent log lines: 0 "Monitor for X is dead, restarting" entries

A fresh ethers WebSocketProvider, run via kubectl exec inside that pod against the same URLs the dispatcher is silent on, returns BOTH_HEALTHY: 25+ newHeads in 45s. The failure is therefore in-process, and a fresh ChainMonitor instance is the diagnostic-by-action: if recreating one unsticks the chain, the bug lives in ChainMonitor state. The reaper is the path that was supposed to do exactly that and could not.
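
A probe along those lines (a sketch assuming ethers v6; WSS_URL is a placeholder for whichever endpoint the stuck monitor uses):

  // Standalone newHeads probe, run via kubectl exec inside the pod.
  import { WebSocketProvider } from "ethers";

  const provider = new WebSocketProvider(process.env.WSS_URL!);

  let heads = 0;
  provider.on("block", (blockNumber: number) => {
    heads += 1;
    console.log(`newHeads #${heads}: block ${blockNumber}`);
  });

  // Sample for 45s, mirroring the observation window above.
  setTimeout(async () => {
    console.log(heads > 0 ? `HEALTHY: ${heads} newHeads in 45s` : "SILENT: 0 newHeads in 45s");
    await provider.destroy();
    process.exit(0);
  }, 45_000);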

Change

Split the cold-start warmup baseline from the real-block-advance clock:

  • New private field monitorBootAt set in start(). Used as the staleness baseline before any real block arrives.
  • lastBlockAdvanceAt is no longer reset on resubscribe. It is updated only by processBlockRange on a real height advance, exactly as its comment claims.
  • isAlive() now reads:
const stalenessBaseline = this.lastBlockAdvanceAt ?? this.monitorBootAt;

so the reconciler can see "no real block since boot" as stale once MONITOR_RECREATE_TIMEOUT_MS elapses, while a freshly-booted monitor still gets a full warmup window.

The in-monitor noBlockTimer that drives BLOCK_ADVANCE_TIMEOUT_MS-based reconnects is unchanged. This PR only changes the reconciler's view of liveness.
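
A sketch of the post-fix shape (field and method names from this PR; bodies illustrative):

  const MONITOR_RECREATE_TIMEOUT_MS = 120_000;

  class ChainMonitor {
    private monitorBootAt = 0;
    private lastBlockAdvanceAt: number | null = null;

    start(): void {
      this.monitorBootAt = Date.now(); // cold-start warmup baseline
      this.subscribeToBlocks();
    }

    private subscribeToBlocks(): void {
      // No longer touches lastBlockAdvanceAt, on cold start or on reconnect.
    }

    private processBlockRange(_newHeight: number): void {
      // Now the only writer: a real height advance.
      this.lastBlockAdvanceAt = Date.now();
    }

    isAlive(): boolean {
      const stalenessBaseline = this.lastBlockAdvanceAt ?? this.monitorBootAt;
      return Date.now() - stalenessBaseline < MONITOR_RECREATE_TIMEOUT_MS;
    }
  }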

Test changes

  • "stays alive when blocks are arriving regularly" was passing for the wrong reason. With default BLOCK_ADVANCE_TIMEOUT_MS=60_000, the test's 5-minute gaps between emitted blocks triggered reconnect cycles, and the subscribe-time reset of lastBlockAdvanceAt masked the fact that no real blocks were being delivered to the reconnected providers. Pinned BLOCK_ADVANCE_TIMEOUT_MS to 60 min so the test asserts what its name claims.
  • Added "reaps a monitor stuck across silent reconnects with no real blocks" (sketched below). Drives three silent 60s windows and verifies isAlive() returns false once the 120s reaper threshold has elapsed.
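
A sketch of that regression test's shape, assuming vitest-style fake timers (the project's actual test framework and the monitor's construction are assumptions; vitest fakes Date.now alongside timers by default):

  import { expect, it, vi } from "vitest";

  it("reaps a monitor stuck across silent reconnects with no real blocks", () => {
    vi.useFakeTimers();
    const monitor = new ChainMonitor(/* constructor args elided */);
    monitor.start();

    // Three silent 60s windows: each elapses BLOCK_ADVANCE_TIMEOUT_MS and
    // fires an in-monitor reconnect, but no real block ever lands, so
    // lastBlockAdvanceAt stays null and monitorBootAt remains the baseline.
    for (let i = 0; i < 3; i++) {
      vi.advanceTimersByTime(60_000);
    }

    // 180s of silence since boot exceeds the 120s reaper threshold.
    expect(monitor.isAlive()).toBe(false);
    vi.useRealTimers();
  });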

Test plan

  • pnpm test:unit in keeperhub-scheduler: 81/81 pass (was 80, +1 regression test)
  • pnpm tsc --noEmit -p keeperhub-scheduler/tsconfig.json: clean
  • Canary deploy and observe whether stuck chains get reaped by the reconciler, and whether reaping unsticks them. The result discriminates between two hypotheses:
    • Reap unsticks the chain -> bug is in ChainMonitor per-instance state. New ChainMonitor = recovery.
    • Reap does not unstick -> bug is in shared process state below ChainMonitor. Need a different fix (or pod restart for now).

Commit messages

Before this change, subscribeToBlocks reset lastBlockAdvanceAt to the
current time on every reconnect, "seeding the staleness clock" so a
freshly-subscribed monitor would not be reaped before its first block
arrived. The side effect was that any monitor whose in-process reconnect
cycle kept firing every BLOCK_ADVANCE_TIMEOUT_MS without ever receiving
a real block looked permanently fresh to BlockMonitorService.isAlive(),
because each reconnect refreshed the same clock the reconciler was
checking. The dispatcher reaper consequently never tore down stuck
monitors, and the in-process state that was causing the stuck behavior
was never replaced.

Observed in prod: with the 5h22m-uptime dispatcher pod cycling 6 chains
through silent-reconnect loops every 5 minutes, 5000 log lines contained
zero "Monitor for X is dead, restarting" entries despite chains being
stuck for 30+ minutes each.

Change: split the cold-start warmup baseline from the real-block-advance
clock. New private field monitorBootAt is set in start() and used as the
staleness baseline before any real block arrives. lastBlockAdvanceAt is
no longer reset on resubscribe and is updated only by processBlockRange
on a real height advance. isAlive() now reads:

  stalenessBaseline = lastBlockAdvanceAt ?? monitorBootAt

so the reconciler can see "no real block since boot" as stale once
MONITOR_RECREATE_TIMEOUT_MS elapses, while a freshly-booted monitor still
gets a full warmup window. The in-monitor noBlockTimer (which is what
drives BLOCK_ADVANCE_TIMEOUT_MS-based reconnects) is unchanged.

Test changes:
- "stays alive when blocks are arriving regularly" was previously
  passing for the wrong reason: BLOCK_ADVANCE_TIMEOUT_MS was at its
  default 60s, the 5-min gaps between emitted blocks triggered
  reconnects, and the subscribe-time reset of lastBlockAdvanceAt
  masked the reality that no real blocks were being delivered to the
  reconnected providers. Pinned BLOCK_ADVANCE_TIMEOUT_MS to 60min so
  the test asserts what its name claims.
- Added a regression test that drives three silent 60s windows with no
  real blocks and confirms isAlive() returns false after the 120s
  reaper threshold.

81/81 scheduler unit tests pass.

Second commit (lint fix): a single formatter-only change in
chain-monitor.ts. The line-wrapped stalenessBaseline /
MONITOR_RECREATE_TIMEOUT_MS comparison was collapsed onto one line per
Biome rules. No logic change. 81/81 scheduler unit tests still pass,
typecheck clean. Resolves the lint failure blocking #1272.
@joelorzet requested review from a team, OleksandrUA, eskp, and joelorzet, and removed the request for a team on May 15, 2026 16:31
@joelorzet merged commit de67a13 into staging on May 15, 2026
45 checks passed
@joelorzet deleted the fix/KEEP-570-reaper-detects-stuck-monitor branch on May 15, 2026 16:32
@github-actions

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1272
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully
