Skip to content

Attempt to fix progression halting via tainted SyncStatus#435

Open
Ayiga wants to merge 1 commit into
celo-integration-rebase-17from
ts/fix/reorg-detection-progress-halting
Open

Attempt to fix progression halting via tainted SyncStatus#435
Ayiga wants to merge 1 commit into
celo-integration-rebase-17from
ts/fix/reorg-detection-progress-halting

Conversation

@Ayiga
Copy link
Copy Markdown
Member

@Ayiga Ayiga commented May 13, 2026

This PR:

Attempts to resolve a potential stalling issue due to a tainted SyncStatus, and a mismatched ChannelManager head.


When a `SyncStatus` from the `L2` comes in and moves the `SafeL2`
reference backwards (at the very least in the case of `0` specifically)
we can end up in a state where progression of the L2 chain through
Espresso halts entirely.

One such case where this can occur is when the `Streamer` gets `Reset`
back to a prior point, or even potentially to a `Future` point, which
causes a desync in the progress of a `ChannelManager` versus the
`Streamer` itself causing the `Streamer` not to have the next expected
`Batch` to be served.

To address this specific issue, when loading `batches` for the `L1`
submission, we add additional consideration to the `ChannelManager` to
ensure that we are looking at the correct block in addition to the
correct parent hash.
@claude
Copy link
Copy Markdown

claude Bot commented May 13, 2026

Claude finished @Ayiga's task in 2m 35s —— View job


Review of PR #435: Attempt to fix progression halting via tainted SyncStatus

  • Gather context and read changed files
  • Review op-batcher/batcher/espresso.go changes
  • Review espresso/environment/8_reorg_test.go changes
  • Post review feedback

Summary

This PR addresses a stalling issue where a tainted SyncStatus (e.g., SafeL2.Number regressing to 0) causes the batcher's peekNextBatch to lose track of the chain tip, halting progression.

The approach is sound. The key insight is replacing the reliance on channelMgr.tip (which is only set after AddL2Block and reset to zero on Clear) with a more robust strategy:

  1. Use the last block in channelMgr.blocks as the authoritative tip — this is the actual pending state
  2. Fall back to SafeL2 when blocks is empty, immediately deriving the tip hash
  3. Skip the streamer forward via skipStreamerToTargetBlockHeight to consume already-processed batches

The old code had a fragile recovery path: after Clear(), tip would be common.Hash{} and could only recover if the next peeked batch happened to be exactly SafeL2.Number + 1. If the sync status was tainted, this condition might never be met, causing permanent stalling.

What looks good

  • The skipStreamerToTargetBlockHeight helper is clean and correctly handles the "best effort" semantics
  • The test (TestE2eEspressoStreamerSyncStatusTainting) directly exercises the failure path by calling Refresh with safeBatchNumber = 0 and verifying the chain continues
  • Good addition of defer env.Stop(t, ...) calls to the existing TestE2eDevnetWithL1Reorg test — proper cleanup prevents resource leaks in CI
  • Consistent dereferencing pattern change from (*batch).Method() to batch.Method() where the pointer-to-interface is unnecessary

Issues found

I left detailed inline comments. The main items:

  1. Doc comment typoisHashNoptSet in the comment vs isHashEmpty as the actual function name (see comment)
  2. Concurrency consideration — The blocks slice is read under the mutex, but the derived targetBlock is used outside it for skipping. Worth confirming the single-caller invariant and documenting it (see comment)
  3. Edge case question — When blocks is empty and the peeked batch doesn't match targetBlock, we still skip to SafeL2.Number + 1. This seems safe because Refresh(safeBatchNumber=0) triggers a streamer reset, but the recovery sequence after a tainted-then-recovered SyncStatus deserves attention (see comment)
  4. Test cleanup — Commented-out client declarations in the new test (see comment)

Overall

The fix correctly addresses the root cause and the test validates the exact failure scenario. The issues above are minor — the doc typo is trivially fixable, the concurrency note is a documentation ask, and the edge case question is about verifying behavior rather than a bug. This looks good to merge after addressing the minor items.

Comment on lines +877 to +878

// isHashNoptSet is a helper function to check if a hash is the zero
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the doc comment says isHashNoptSet but the function is named isHashEmpty. The comment should be updated to match.

Suggested change
// isHashNoptSet is a helper function to check if a hash is the zero
// isHashEmpty is a helper function to check if a hash is the zero
// value (not set).

if tipBlock != (SizedBlock{}) && tipBlock.Block != nil {
targetBlock = tipBlock.NumberU64() + 1
tipHash = tipBlock.Hash()
} else if batch := l.EspressoStreamer().Peek(ctx); batch != nil && batch.Number() == targetBlock {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Peek call is done outside the mutex, then skipStreamerToTargetBlockHeight (line 915) and another Peek (line 916) follow. But since espressoBatchLoadingLoop calls peekNextBatch in a tight for loop (line 969-971) and then immediately AddL2Block (which updates blocks), there's a TOCTOU window: between reading blocks (line 893) and using the derived targetBlock for skipping (line 915), the loading loop could concurrently add blocks that change what targetBlock should be.

In practice this may be safe because peekNextBatch is only called from espressoBatchLoadingLoop (single goroutine), so the blocks can't change between reading them and using them here. But it's worth confirming — is peekNextBatch guaranteed to only be called from the single loading loop goroutine? If so, a brief comment noting the single-caller invariant would help future readers.

Comment on lines +898 to +900
targetBlock := syncStatus.SafeL2.Number + 1
tipHash := syncStatus.SafeL2.Hash
if tipBlock != (SizedBlock{}) && tipBlock.Block != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core logic change and the heart of the fix. In the old code, tip was read from channelMgr.tip — a hash that's only set after AddL2Block succeeds. After a Clear(), tip would be common.Hash{} and the only way to recover was if the peeked batch happened to be exactly SafeL2.Number + 1.

The new approach improves on this significantly:

  1. Uses the last block in channelMgr.blocks as the authoritative tip (since blocks is what actually tracks pending state)
  2. Falls back to SafeL2 when blocks is empty, immediately setting tipHash instead of requiring a second Peek check
  3. skipStreamerToTargetBlockHeight advances the streamer past already-known blocks

This should handle the tainted SyncStatus case well — if SafeL2.Number regresses to 0, targetBlock becomes 1, and skipStreamerToTargetBlockHeight won't skip anything meaningful if the streamer is already ahead, so the streamer just continues from where it is.

One question: when channelMgr.blocks is empty and the peeked batch number does not equal targetBlock (neither branch of the if/else if matches), we fall through with tipHash = syncStatus.SafeL2.Hash and targetBlock = syncStatus.SafeL2.Number + 1. The skip call will then consume batches up to targetBlock. But if the streamer is already well past SafeL2.Number + 1 (e.g., after a tainted SyncStatus resets SafeL2.Number to 0 then recovers), won't this skip consume batches we actually need?

I think this is fine because Refresh with safeBatchNumber=0 would trigger a streamer reset (per the interface: "This will only automatically reset the Streamer if the safeBatchNumber moves backwards"), so the streamer would be repositioned. But it'd be worth verifying that the skip + peek sequence is correct in the scenario where SyncStatus recovers from a tainted value between ticks.

Comment on lines +276 to +278
l2Seq := system.NodeClient(e2esys.RoleSeq)
// l1Client := system.NodeClient(e2esys.RoleL1)
// caffClient := system.NodeClient(env.RoleCaffNode)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: these commented-out client declarations suggest the test may be incomplete or these were used during development. If they're planned for future use, that's fine; otherwise consider removing them to keep the test clean.

status, err := l2Node.SyncStatus(ctx)
require.NoError(t, err)

err = system.BatchSubmitter.EspressoStreamer().Refresh(ctx, status.FinalizedL1, 0, status.FinalizedL2.L1Origin)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test — this directly exercises the tainted SyncStatus path by calling Refresh with safeBatchNumber = 0. The subsequent WaitForBlockToBeFinalized with status.UnsafeL2.Number + 2 validates that the chain continues to progress despite the tainted refresh.

One consideration: with a 5-minute timeout and only needing 2 more blocks finalized, this should be robust. But if the devnet is slow to start, the initial WaitForBlockToBeFinalized(5) with a 2-minute timeout on line 279 might be tight. Worth monitoring in CI.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new end-to-end test for EspressoStreamer sync status tainting and improves resource cleanup in existing tests. In the batcher, it refactors the batch peeking logic to more accurately track the chain tip using blocks from the channel manager and adds a utility to skip streamer batches to a target height. Feedback includes correcting a documentation typo, ensuring consistent use of the .Len() method for block queues, simplifying redundant conditional checks, and standardizing hexadecimal formatting in log entries.

}
}

// isHashNoptSet is a helper function to check if a hash is the zero
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo in the function documentation: isHashNoptSet should be isHashEmpty to match the function name.

Suggested change
// isHashNoptSet is a helper function to check if a hash is the zero
// isHashEmpty is a helper function to check if a hash is the zero

Comment on lines +893 to +894
if len(l.channelMgr.blocks) > 0 {
tipBlock = l.channelMgr.blocks[len(l.channelMgr.blocks)-1]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is an inconsistency in how the length of l.channelMgr.blocks is accessed. In op-batcher/batcher/channel_manager.go (e.g., line 98 and 632), the code uses the .Len() method, whereas here it uses the built-in len() function. For consistency and to avoid potential compilation issues if queue.Queue is a struct type, it is recommended to use the .Len() method throughout the package.

Suggested change
if len(l.channelMgr.blocks) > 0 {
tipBlock = l.channelMgr.blocks[len(l.channelMgr.blocks)-1]
if l.channelMgr.blocks.Len() > 0 {
tipBlock = l.channelMgr.blocks[l.channelMgr.blocks.Len()-1]


targetBlock := syncStatus.SafeL2.Number + 1
tipHash := syncStatus.SafeL2.Hash
if tipBlock != (SizedBlock{}) && tipBlock.Block != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check tipBlock != (SizedBlock{}) is redundant if tipBlock.Block != nil is also checked. Simplifying this check improves readability and ensures that we only proceed if we have a valid block pointer to access.

Suggested change
if tipBlock != (SizedBlock{}) && tipBlock.Block != nil {
if tipBlock.Block != nil {

Comment on lines +909 to +911
"batchNr", batch.Number(),
"batchParent", batch.Header().ParentHash.Hex(),
"tip", tipHash,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency in logging, it is recommended to use .Hex() for all common.Hash types. This ensures that the output is always formatted as a hexadecimal string, matching the style used for batchParent on line 910.

Suggested change
"batchNr", batch.Number(),
"batchParent", batch.Header().ParentHash.Hex(),
"tip", tipHash,
"batchNr", batch.Number(),
"batchParent", batch.Header().ParentHash.Hex(),
"tip", tipHash.Hex(),

Comment on lines 933 to +935
"batchNr", (*batch).Number(),
"batchParent", (*batch).Header().ParentHash,
"tip", tip,
"tip", tipHash,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logging here is inconsistent with previous log entries in the same function (e.g., line 910). It is recommended to use .Hex() for common.Hash types. Additionally, the dereference (*batch) is unnecessary in Go when calling methods on a pointer to a struct.

Suggested change
"batchNr", (*batch).Number(),
"batchParent", (*batch).Header().ParentHash,
"tip", tip,
"tip", tipHash,
"batchNr", batch.Number(),
"batchParent", batch.Header().ParentHash.Hex(),
"tip", tipHash.Hex(),

Copy link
Copy Markdown
Member

@shenkeyao shenkeyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! Some of the Claude comments seem reasonable, though, so worth addressing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants