fix: KEEP-570 reaper now detects monitors stuck across silent reconnects #1272
Merged
Before this change, subscribeToBlocks reset lastBlockAdvanceAt to the current time on every reconnect, "seeding the staleness clock" so a freshly-subscribed monitor would not be reaped before its first block arrived. The side effect was that any monitor whose in-process reconnect cycle kept firing every BLOCK_ADVANCE_TIMEOUT_MS without ever receiving a real block looked permanently fresh to BlockMonitorService.isAlive() - because each reconnect refreshed the same clock the reconciler was checking. The dispatcher reaper consequently never tore down stuck monitors, and the in-process state that was causing the stuck-ness was never replaced.

Observed in prod: with the 5h22m-uptime dispatcher pod cycling 6 chains through silent-reconnect loops every 5 minutes, 5000 log lines contained zero "Monitor for X is dead, restarting" entries despite chains being stuck for 30+ minutes each.

Change: split the cold-start warmup baseline from the real-block-advance clock. New private field monitorBootAt is set in start() and used as the staleness baseline before any real block arrives. lastBlockAdvanceAt is no longer reset on resubscribe and is updated only by processBlockRange on a real height advance. isAlive() now reads: stalenessBaseline = lastBlockAdvanceAt ?? monitorBootAt, so the reconciler can see "no real block since boot" as stale once MONITOR_RECREATE_TIMEOUT_MS elapses, while a freshly-booted monitor still gets a full warmup window. The in-monitor noBlockTimer (which is what drives BLOCK_ADVANCE_TIMEOUT_MS-based reconnects) is unchanged.

Test changes:
- "stays alive when blocks are arriving regularly" was previously passing for the wrong reason: BLOCK_ADVANCE_TIMEOUT_MS was at its default 60s, the 5-min gaps between emitted blocks triggered reconnects, and the subscribe-time reset of lastBlockAdvanceAt masked the reality that no real blocks were being delivered to the reconnected providers. Pinned BLOCK_ADVANCE_TIMEOUT_MS to 60min so the test asserts what its name claims.
- Added a regression test that drives three silent 60s windows with no real blocks and confirms isAlive() returns false after the 120s reaper threshold.

81/81 scheduler unit tests pass.
Single formatter-only fix in chain-monitor.ts: line-wrap on the stalenessBaseline / MONITOR_RECREATE_TIMEOUT_MS comparison collapsed onto one line per Biome rules. No logic change. 81/81 scheduler unit tests still pass, typecheck clean. Resolves the lint failure blocking #1272.
Summary
Fixes a coverage gap in `BlockMonitorService` where the reaper's `isAlive()` check could never see a monitor as stuck if its own in-process reconnect cycle kept firing. Discovered while investigating KEEP-570.

Why this bug was invisible
`subscribeToBlocks` reset `lastBlockAdvanceAt` to `Date.now()` on every reconnect, with the explicit intent of giving freshly-subscribed monitors a warmup window before the reaper considered them stuck. The side effect: any monitor whose internal reconnect cycle kept firing every `BLOCK_ADVANCE_TIMEOUT_MS` without ever receiving a real block looked permanently fresh to the reconciler. Each reconnect refreshed the same clock the reconciler was reading.

The dispatcher reaper consequently never tore down stuck monitors. The in-process `ChainMonitor` state that was producing the stuck-ness was therefore never replaced with a fresh instance, even though `BlockMonitorService.startNewMonitor` is the cheapest possible recovery action (and is what a pod restart effectively does for every chain at once).
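To make the masking mechanism concrete, here is a minimal sketch of the pre-fix shape. The wrapping class, field placement, and the timeout constant are condensed for illustration; the real code is split across `ChainMonitor` and `BlockMonitorService` and is not this small:

```ts
// Pre-fix shape, condensed into one illustrative class.
const MONITOR_RECREATE_TIMEOUT_MS = 120_000; // illustrative; 120s is the reaper threshold quoted in the tests

class PreFixMonitorSketch {
  private lastBlockAdvanceAt = Date.now();

  // Runs on the initial subscribe AND on every BLOCK_ADVANCE_TIMEOUT_MS-driven reconnect.
  subscribeToBlocks(): void {
    // The warmup seed: because it also fires on silent reconnects, a monitor stuck
    // in a reconnect loop keeps refreshing the very clock the reconciler reads.
    this.lastBlockAdvanceAt = Date.now();
    // ... (re)attach provider listeners here ...
  }

  // What the reconciler consulted: permanently "fresh" while reconnects keep firing.
  isAlive(now: number = Date.now()): boolean {
    return now - this.lastBlockAdvanceAt < MONITOR_RECREATE_TIMEOUT_MS;
  }
}
```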
Observation that motivated the fix

Prod log signal from `keeperhub-block-common-77cf4b46f9-vv54z` (uptime 5h 22m at sampling): 6 chains cycling through silent-reconnect loops every 5 minutes, and 5000 log lines containing zero "Monitor for X is dead, restarting" entries despite chains being stuck for 30+ minutes each.

A `kubectl exec`'d fresh ethers `WebSocketProvider` in that pod, against the same URLs the dispatcher is silent on, returns BOTH_HEALTHY: 25+ `newHeads` in 45s. So the failure is in-process, and a fresh `ChainMonitor` instance is the diagnostic-by-action: if recreating one unsticks the chain, the bug lives in `ChainMonitor` state. The reaper is the path that was supposed to do that and could not.
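For reference, the probe is roughly this shape (a sketch assuming ethers v6; `RPC_WS_URL` is a placeholder, not the actual endpoint or script used in the pod):

```ts
import { WebSocketProvider } from "ethers";

// Count newHeads over a fixed window; a healthy endpoint during the incident
// produced 25+ heads in 45s while the in-process monitor stayed silent.
async function probeNewHeads(url: string, windowMs = 45_000): Promise<number> {
  const provider = new WebSocketProvider(url);
  let heads = 0;
  provider.on("block", (blockNumber: number) => {
    heads += 1;
    console.log(`newHead ${heads}: block ${blockNumber}`);
  });
  await new Promise((resolve) => setTimeout(resolve, windowMs));
  await provider.destroy();
  return heads;
}

probeNewHeads(process.env.RPC_WS_URL ?? "wss://example.invalid")
  .then((heads) => console.log(heads > 0 ? "HEALTHY" : "SILENT"));
```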
Change

Split the cold-start warmup baseline from the real-block-advance clock:
- `monitorBootAt`: new private field set in `start()`. Used as the staleness baseline before any real block arrives.
- `lastBlockAdvanceAt` is no longer reset on resubscribe. It is updated only by `processBlockRange` on a real height advance, exactly as its comment claims.
- `isAlive()` now reads `stalenessBaseline = lastBlockAdvanceAt ?? monitorBootAt` (sketched below), so the reconciler can see "no real block since boot" as stale once `MONITOR_RECREATE_TIMEOUT_MS` elapses, while a freshly-booted monitor still gets a full warmup window.
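A condensed sketch of the post-fix liveness read. `monitorBootAt`, `lastBlockAdvanceAt`, `start()`, `processBlockRange`, and `MONITOR_RECREATE_TIMEOUT_MS` are the names used in this PR; the wrapping class, the `onRealBlockAdvance()` stand-in, and the 120s constant (taken from the reaper threshold quoted in the tests) are illustrative:

```ts
const MONITOR_RECREATE_TIMEOUT_MS = 120_000; // illustrative value

class PostFixMonitorSketch {
  private monitorBootAt = 0;            // cold-start warmup baseline, seeded in start()
  private lastBlockAdvanceAt?: number;  // set only on a real height advance

  start(): void {
    this.monitorBootAt = Date.now();
    // ... subscribe to blocks, arm noBlockTimer, etc. ...
  }

  // Stand-in for the processBlockRange path: runs only when the height actually advances.
  onRealBlockAdvance(): void {
    this.lastBlockAdvanceAt = Date.now();
  }

  // Post-fix, resubscribes touch neither field, so the reconciler sees
  // "no real block since boot" as stale once the threshold elapses.
  isAlive(now: number = Date.now()): boolean {
    const stalenessBaseline = this.lastBlockAdvanceAt ?? this.monitorBootAt;
    return now - stalenessBaseline < MONITOR_RECREATE_TIMEOUT_MS;
  }
}
```

The `??` fallback is the whole fix from the reaper's point of view: a monitor that has never seen a real block now ages from boot, not from its most recent reconnect.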
The in-monitor `noBlockTimer` that drives `BLOCK_ADVANCE_TIMEOUT_MS`-based reconnects is unchanged. This PR only changes the reconciler's view of liveness.

Test changes
"stays alive when blocks are arriving regularly"was passing for the wrong reason. With defaultBLOCK_ADVANCE_TIMEOUT_MS=60_000, the test's 5-minute gaps between emitted blocks triggered reconnect cycles, and the subscribe-time reset oflastBlockAdvanceAtmasked the fact that no real blocks were being delivered to the reconnected providers. PinnedBLOCK_ADVANCE_TIMEOUT_MSto 60 min so the test asserts what its name claims."reaps a monitor stuck across silent reconnects with no real blocks". Drives three silent 60s windows and verifiesisAlive()returns false after the 120s reaper threshold.Test plan
Test plan

- `pnpm test:unit` in `keeperhub-scheduler`: 81/81 pass (was 80, +1 regression test)
- `pnpm tsc --noEmit -p keeperhub-scheduler/tsconfig.json`: clean
- The next prod stuck-chain event is the live check: if the reaper's recreation unsticks the chain, the bug is confirmed to live in `ChainMonitor` per-instance state. New `ChainMonitor` = recovery.
- If a chain stays stuck even after its monitor is recreated, the bug is not in `ChainMonitor`. Need a different fix (or pod restart for now).
Related

`feat/KEEP-570-block-trigger-sqs-drops`: independent. Once both land, the next stuck-chain event will produce a self-discriminating warning line AND the reaper will actually act on it.