add pollHealthChecker interface for optional RPC health checks #83
Krish-vemula wants to merge 15 commits into main
Conversation
Add optional interface for chain-specific RPC clients to run extra health checks during alive-loop polling. Failures count toward poll failure threshold. Enables chain integrations to detect issues like missing historical state.
…r finalized state availability with configurable threshold and regex-based error classification.
Force-pushed 72b9577 to 61b3d25
multinode/node_lifecycle.go (Outdated)

```go
case <-time.After(dialRetryBackoff.Duration()):
	lggr.Tracew("Trying to re-dial RPC node", "nodeState", n.getCachedState())

	state := n.createVerifiedConn(ctx, lggr)
```
No need to create a new connection on every iteration
Fixed. Moved createVerifiedConn before the loop so we dial once. We now respect the returned state via declareState(state): if the chain ID check fails, the node transitions to InvalidChainID instead of staying in FinalizedStateNotAvailable. Also added nodeStateFinalizedStateNotAvailable as a valid source state for transitions to Unreachable, InvalidChainID, and Syncing in the FSM.
multinode/node_lifecycle.go (Outdated)

```go
state := n.createVerifiedConn(ctx, lggr)
if state != nodeStateAlive {
	n.setState(nodeStateFinalizedStateNotAvailable)
```
We should transition to the state returned by createVerifiedConn if it's not alive.
Imagine a case where the chain ID check fails: we should not mark the RPC as nodeStateFinalizedStateNotAvailable, but should mark the chain ID as invalid.
multinode/node_lifecycle_test.go (Outdated)

```go
node.wg.Add(1)
go node.aliveLoop()

tests.AssertEventually(t, func() bool {
```
This is flaky. It's possible that the transition to alive occurs before this line gets executed.
Fixed. Replaced the transient state assertion with AssertLogEventually on the "RPC Node cannot serve finalized state" log message, which can't be missed even if the node transitions back to Alive before the poll.
multinode/node_lifecycle.go (Outdated)

```go
	return
}
// Separate finalized state availability check
stateCheckCtx, stateCheckCancel := context.WithTimeout(ctx, pollInterval)
```
No need to poll if finalizedStateCheckFailureThreshold is set to 0
Let's also log a message to indicate if this health check is enabled
- Fixed. Wrapped the entire CheckFinalizedStateAvailability call in if finalizedStateCheckFailureThreshold > 0 so no RPC call is made when the check is disabled.
- Added. Logs "Finalized state availability check enabled" (with threshold value) or "Finalized state availability check disabled" at loop startup, following the same Debug/Debugw pattern used for the polling check.
```go
case nodeStateClosed:
	return
default:
	panic(fmt.Sprintf("finalizedStateNotAvailableLoop can only run for node in FinalizedStateNotAvailable state, got: %s", state))
```
Is this panic caught somewhere upstream or do we let it crash the node?
It is not caught upstream; it is intentionally allowed to crash the Chainlink application process.
This matches the exact pattern used for every other state loop in this package (aliveLoop, outOfSyncLoop, unreachableLoop, invalidChainIDLoop, and syncingLoop).
These loops are spawned as independent goroutines directly by the FSM transition functions (e.g., transitionToFinalizedStateNotAvailable). Because the FSM strictly controls state via mutexes, if a loop starts while the node is not in the expected state, the core FSM logic is fundamentally broken or corrupted (e.g., a race condition bypassing the mutex).
In that scenario, crashing the application is the intended behaviour rather than continuing to operate with a corrupted internal state machine. It forces the Node Operator's process manager to restart the application with a clean state.
@dhaidashenko - could you double-check and verify my understanding here?
Agree, panic is intended and signals bug in the code. Crash is needed to let the LOOP process/node auto-recover.
Summary

Adds a CheckFinalizedStateAvailability method to the RPCClient interface, enabling chain-specific RPC clients to detect non-archive nodes that cannot serve historical state at the finalized block. When enabled via FinalizedStateCheckFailureThreshold, consecutive failures transition the node to a new FinalizedStateNotAvailable state, removing it from the live pool. The node periodically re-dials and re-verifies availability, transitioning back to Alive once the check passes.

Key details:
- CheckFinalizedStateAvailability is optional: RPCClientBase provides a no-op default, so chains that don't need it are unaffected
- FinalizedStateCheckFailureThreshold = 0 (default) disables the check entirely with no polling overhead
- ErrFinalizedStateUnavailable errors are treated as RPC reachability errors and count toward PollFailureThreshold
- New nodeStateFinalizedStateNotAvailable state with full lifecycle loop and transitions

Supports: #352