add pollHealthChecker interface for optional RPC health checks #83
Krish-vemula wants to merge 15 commits into main
Conversation
Add optional interface for chain-specific RPC clients to run extra health checks during alive-loop polling. Failures count toward poll failure threshold. Enables chain integrations to detect issues like missing historical state.
…r finalized state availability with configurable threshold and regex-based error classification.
Force-pushed 72b9577 to 61b3d25
multinode/node_lifecycle.go (Outdated)

```go
case <-time.After(dialRetryBackoff.Duration()):
	lggr.Tracew("Trying to re-dial RPC node", "nodeState", n.getCachedState())

	state := n.createVerifiedConn(ctx, lggr)
```
No need to create a new connection on every iteration
Fixed. Moved createVerifiedConn before the loop so we dial once. We now respect the returned state via declareState(state): if the chain ID check fails, the node transitions to InvalidChainID instead of staying in FinalizedStateNotAvailable. Also added nodeStateFinalizedStateNotAvailable as a valid source state for transitions to Unreachable, InvalidChainID, and Syncing in the FSM.
multinode/node_lifecycle.go (Outdated)

```go
state := n.createVerifiedConn(ctx, lggr)
if state != nodeStateAlive {
	n.setState(nodeStateFinalizedStateNotAvailable)
```
We should transition to the state returned by createVerifiedConn if it's not alive.
Imagine a case where the chain ID check fails: we should not mark the RPC as nodeStateFinalizedStateNotAvailable, but should mark the chain ID as invalid.
multinode/node_lifecycle_test.go (Outdated)

```go
node.wg.Add(1)
go node.aliveLoop()

tests.AssertEventually(t, func() bool {
```
This is flaky. It's possible that the transition to alive occurs before this line gets executed.
Fixed. Replaced the transient state assertion with AssertLogEventually on the "RPC Node cannot serve finalized state" log message, which can't be missed even if the node transitions back to Alive before the poll.
multinode/node_lifecycle.go (Outdated)

```go
	return
}
// Separate finalized state availability check
stateCheckCtx, stateCheckCancel := context.WithTimeout(ctx, pollInterval)
```
No need to poll if finalizedStateCheckFailureThreshold is set to 0
Let's also log a message to indicate if this health check is enabled
- Fixed. Wrapped the entire CheckFinalizedStateAvailability call in if finalizedStateCheckFailureThreshold > 0 so no RPC call is made when the check is disabled.
- Added. Logs "Finalized state availability check enabled" (with threshold value) or "Finalized state availability check disabled" at loop startup, following the same Debug/Debugw pattern used for the polling check.
```go
case nodeStateClosed:
	return
default:
	panic(fmt.Sprintf("finalizedStateNotAvailableLoop can only run for node in FinalizedStateNotAvailable state, got: %s", state))
```
Is this panic caught somewhere upstream or do we let it crash the node?
It is not caught upstream; it is intentionally allowed to crash the Chainlink application process.
This matches the exact pattern used for every other state loop in this package (aliveLoop, outOfSyncLoop, unreachableLoop, invalidChainIDLoop, and syncingLoop).
These loops are spawned as independent goroutines directly by the FSM transition functions (e.g., transitionToFinalizedStateNotAvailable). Because the FSM strictly controls state via mutexes, if a loop starts while the node is not in the expected state, the core FSM logic is fundamentally broken or corrupted (e.g., a race condition bypassing the mutex).
In that scenario, crashing the application is the intended behaviour rather than continuing to operate with a corrupted internal state machine. It forces the Node Operator's process manager to restart the application with a clean state.
@dhaidashenko - could you double-check and verify my understanding here?
Agree, panic is intended and signals bug in the code. Crash is needed to let the LOOP process/node auto-recover.
Summary

Adds a CheckFinalizedStateAvailability method to the RPCClient interface, enabling chain-specific RPC clients to detect non-archive nodes that cannot serve historical state at the finalized block. When enabled via FinalizedStateCheckFailureThreshold, consecutive failures transition the node to a new FinalizedStateNotAvailable state, removing it from the live pool. The node periodically re-dials and re-verifies availability, transitioning back to Alive once the check passes.

Key details:
- CheckFinalizedStateAvailability is optional: RPCClientBase provides a no-op default, so chains that don't need it are unaffected
- FinalizedStateCheckFailureThreshold = 0 (default) disables the check entirely with no polling overhead
- ErrFinalizedStateUnavailable errors are treated as RPC reachability errors and count toward PollFailureThreshold
- New nodeStateFinalizedStateNotAvailable state with full lifecycle loop and transitions

Supports: #352