
add pollHealthChecker interface for optional RPC health checks#83

Open
Krish-vemula wants to merge 15 commits into main from cre/PLEX-2476

Conversation

@Krish-vemula (Contributor) commented Feb 17, 2026

Summary

Adds a CheckFinalizedStateAvailability method to the RPCClient interface, enabling chain-specific RPC clients to detect non-archive nodes that cannot serve historical state at the finalized block.

When enabled via FinalizedStateCheckFailureThreshold, consecutive failures transition the node to a new FinalizedStateNotAvailable state, removing it from the live pool. The node periodically re-dials and re-verifies availability, transitioning back to Alive once the check passes.

Key details:

  • CheckFinalizedStateAvailability is optional: RPCClientBase provides a no-op default, so chains that don't need it are unaffected
  • Setting FinalizedStateCheckFailureThreshold = 0 (default) disables the check entirely with no polling overhead
  • Non-ErrFinalizedStateUnavailable errors are treated as RPC reachability errors and count toward PollFailureThreshold
  • New FSM state nodeStateFinalizedStateNotAvailable with full lifecycle loop and transitions

Supports: #352

@Krish-vemula Krish-vemula marked this pull request as ready for review February 18, 2026 23:02
@Krish-vemula Krish-vemula requested a review from a team as a code owner February 18, 2026 23:02
Add optional interface for chain-specific RPC clients to run extra health
checks during alive-loop polling. Failures count toward poll failure threshold.

Enables chain integrations to detect issues like missing historical state.
…r finalized state availability with configurable threshold and regex-based error classification.
@Krish-vemula Krish-vemula marked this pull request as draft March 10, 2026 18:11
@Krish-vemula Krish-vemula marked this pull request as ready for review March 10, 2026 20:24
case <-time.After(dialRetryBackoff.Duration()):
	lggr.Tracew("Trying to re-dial RPC node", "nodeState", n.getCachedState())

	state := n.createVerifiedConn(ctx, lggr)
Contributor
No need to create a new connection on every iteration

Contributor Author
Fixed. Moved createVerifiedConn before the loop so we dial once, and we now respect the returned state via declareState(state): if the chain ID check fails, the node transitions to InvalidChainID instead of staying in FinalizedStateNotAvailable. Also added nodeStateFinalizedStateNotAvailable as a valid source state for transitions to Unreachable, InvalidChainID, and Syncing in the FSM.


state := n.createVerifiedConn(ctx, lggr)
if state != nodeStateAlive {
	n.setState(nodeStateFinalizedStateNotAvailable)
Contributor
We should transition to the state returned by createVerifiedConn if it's not Alive. Imagine a case where the chain ID check fails; in that case we should not mark the RPC as nodeStateFinalizedStateNotAvailable, but should mark the chain ID as invalid.

node.wg.Add(1)
go node.aliveLoop()

tests.AssertEventually(t, func() bool {
Contributor
This is flaky. It's possible that the transition to alive occurs before this line gets executed.

Contributor Author
Fixed. Replaced the transient state assertion with AssertLogEventually on the "RPC Node cannot serve finalized state" log message, which can't be missed even if the node transitions back to Alive before the poll.

@Krish-vemula Krish-vemula marked this pull request as draft March 11, 2026 16:56
@Krish-vemula Krish-vemula marked this pull request as ready for review March 11, 2026 18:08
return
}
// Separate finalized state availability check
stateCheckCtx, stateCheckCancel := context.WithTimeout(ctx, pollInterval)
Contributor
No need to poll if finalizedStateCheckFailureThreshold is set to 0

Contributor
Let's also log a message to indicate if this health check is enabled

Contributor Author
  • Fixed. Wrapped the entire CheckFinalizedStateAvailability call in if finalizedStateCheckFailureThreshold > 0 so no RPC call is made when the check is disabled.
  • Added. Logs "Finalized state availability check enabled" (with threshold value) or "Finalized state availability check disabled" at loop startup, following the same Debug/Debugw pattern used for the polling check.

case nodeStateClosed:
	return
default:
	panic(fmt.Sprintf("finalizedStateNotAvailableLoop can only run for node in FinalizedStateNotAvailable state, got: %s", state))
Contributor
Is this panic caught somewhere upstream or do we let it crash the node?

Contributor Author
It is not caught upstream; it is intentionally allowed to crash the Chainlink application process.

This matches the exact pattern used for every other state loop in this package (aliveLoop, outOfSyncLoop, unreachableLoop, invalidChainIDLoop, and syncingLoop).

These loops are spawned as independent goroutines directly by the FSM transition functions (e.g., transitionToFinalizedStateNotAvailable). Because the FSM strictly controls state via mutexes, if a loop starts and the node is somehow not in the expected state, the core FSM logic is fundamentally broken (e.g., a race condition bypassing the mutex).

In that scenario, crashing the application is the intended behaviour rather than continuing to operate with a corrupted internal state machine. It forces the Node Operator's process manager to restart the application with a clean state.

@dhaidashenko - could you double-check and verify my understanding here?

Contributor
Agree, the panic is intended and signals a bug in the code. The crash is needed to let the LOOP process/node auto-recover.

