Attempt to fix progression halting via tainted SyncStatus#435
Conversation
When a `SyncStatus` from the `L2` comes in and moves the `SafeL2` reference backwards (at the very least in the case of `0` specifically) we can end up in a state where progression of the L2 chain through Espresso halts entirely. One such case where this can occur is when the `Streamer` gets `Reset` back to a prior point, or even potentially to a `Future` point, which causes a desync in the progress of a `ChannelManager` versus the `Streamer` itself causing the `Streamer` not to have the next expected `Batch` to be served. To address this specific issue, when loading `batches` for the `L1` submission, we add additional consideration to the `ChannelManager` to ensure that we are looking at the correct block in addition to the correct parent hash.
|
Claude finished @Ayiga's task in 2m 35s —— View job Review of PR #435: Attempt to fix progression halting via tainted SyncStatus
SummaryThis PR addresses a stalling issue where a tainted The approach is sound. The key insight is replacing the reliance on
The old code had a fragile recovery path: after What looks good
Issues foundI left detailed inline comments. The main items:
OverallThe fix correctly addresses the root cause and the test validates the exact failure scenario. The issues above are minor — the doc typo is trivially fixable, the concurrency note is a documentation ask, and the edge case question is about verifying behavior rather than a bug. This looks good to merge after addressing the minor items. |
|
|
||
| // isHashNoptSet is a helper function to check if a hash is the zero |
There was a problem hiding this comment.
Nit: the doc comment says isHashNoptSet but the function is named isHashEmpty. The comment should be updated to match.
| // isHashNoptSet is a helper function to check if a hash is the zero | |
| // isHashEmpty is a helper function to check if a hash is the zero | |
| // value (not set). |
| if tipBlock != (SizedBlock{}) && tipBlock.Block != nil { | ||
| targetBlock = tipBlock.NumberU64() + 1 | ||
| tipHash = tipBlock.Hash() | ||
| } else if batch := l.EspressoStreamer().Peek(ctx); batch != nil && batch.Number() == targetBlock { |
There was a problem hiding this comment.
This Peek call is done outside the mutex, then skipStreamerToTargetBlockHeight (line 915) and another Peek (line 916) follow. But since espressoBatchLoadingLoop calls peekNextBatch in a tight for loop (line 969-971) and then immediately AddL2Block (which updates blocks), there's a TOCTOU window: between reading blocks (line 893) and using the derived targetBlock for skipping (line 915), the loading loop could concurrently add blocks that change what targetBlock should be.
In practice this may be safe because peekNextBatch is only called from espressoBatchLoadingLoop (single goroutine), so the blocks can't change between reading them and using them here. But it's worth confirming — is peekNextBatch guaranteed to only be called from the single loading loop goroutine? If so, a brief comment noting the single-caller invariant would help future readers.
| targetBlock := syncStatus.SafeL2.Number + 1 | ||
| tipHash := syncStatus.SafeL2.Hash | ||
| if tipBlock != (SizedBlock{}) && tipBlock.Block != nil { |
There was a problem hiding this comment.
This is the core logic change and the heart of the fix. In the old code, tip was read from channelMgr.tip — a hash that's only set after AddL2Block succeeds. After a Clear(), tip would be common.Hash{} and the only way to recover was if the peeked batch happened to be exactly SafeL2.Number + 1.
The new approach improves on this significantly:
- Uses the last block in
channelMgr.blocksas the authoritative tip (sinceblocksis what actually tracks pending state) - Falls back to
SafeL2when blocks is empty, immediately settingtipHashinstead of requiring a second Peek check skipStreamerToTargetBlockHeightadvances the streamer past already-known blocks
This should handle the tainted SyncStatus case well — if SafeL2.Number regresses to 0, targetBlock becomes 1, and skipStreamerToTargetBlockHeight won't skip anything meaningful if the streamer is already ahead, so the streamer just continues from where it is.
One question: when channelMgr.blocks is empty and the peeked batch number does not equal targetBlock (neither branch of the if/else if matches), we fall through with tipHash = syncStatus.SafeL2.Hash and targetBlock = syncStatus.SafeL2.Number + 1. The skip call will then consume batches up to targetBlock. But if the streamer is already well past SafeL2.Number + 1 (e.g., after a tainted SyncStatus resets SafeL2.Number to 0 then recovers), won't this skip consume batches we actually need?
I think this is fine because Refresh with safeBatchNumber=0 would trigger a streamer reset (per the interface: "This will only automatically reset the Streamer if the safeBatchNumber moves backwards"), so the streamer would be repositioned. But it'd be worth verifying that the skip + peek sequence is correct in the scenario where SyncStatus recovers from a tainted value between ticks.
| l2Seq := system.NodeClient(e2esys.RoleSeq) | ||
| // l1Client := system.NodeClient(e2esys.RoleL1) | ||
| // caffClient := system.NodeClient(env.RoleCaffNode) |
There was a problem hiding this comment.
Nit: these commented-out client declarations suggest the test may be incomplete or these were used during development. If they're planned for future use, that's fine; otherwise consider removing them to keep the test clean.
| status, err := l2Node.SyncStatus(ctx) | ||
| require.NoError(t, err) | ||
|
|
||
| err = system.BatchSubmitter.EspressoStreamer().Refresh(ctx, status.FinalizedL1, 0, status.FinalizedL2.L1Origin) |
There was a problem hiding this comment.
Good test — this directly exercises the tainted SyncStatus path by calling Refresh with safeBatchNumber = 0. The subsequent WaitForBlockToBeFinalized with status.UnsafeL2.Number + 2 validates that the chain continues to progress despite the tainted refresh.
One consideration: with a 5-minute timeout and only needing 2 more blocks finalized, this should be robust. But if the devnet is slow to start, the initial WaitForBlockToBeFinalized(5) with a 2-minute timeout on line 279 might be tight. Worth monitoring in CI.
There was a problem hiding this comment.
Code Review
This pull request introduces a new end-to-end test for EspressoStreamer sync status tainting and improves resource cleanup in existing tests. In the batcher, it refactors the batch peeking logic to more accurately track the chain tip using blocks from the channel manager and adds a utility to skip streamer batches to a target height. Feedback includes correcting a documentation typo, ensuring consistent use of the .Len() method for block queues, simplifying redundant conditional checks, and standardizing hexadecimal formatting in log entries.
| } | ||
| } | ||
|
|
||
| // isHashNoptSet is a helper function to check if a hash is the zero |
There was a problem hiding this comment.
| if len(l.channelMgr.blocks) > 0 { | ||
| tipBlock = l.channelMgr.blocks[len(l.channelMgr.blocks)-1] |
There was a problem hiding this comment.
There is an inconsistency in how the length of l.channelMgr.blocks is accessed. In op-batcher/batcher/channel_manager.go (e.g., line 98 and 632), the code uses the .Len() method, whereas here it uses the built-in len() function. For consistency and to avoid potential compilation issues if queue.Queue is a struct type, it is recommended to use the .Len() method throughout the package.
| if len(l.channelMgr.blocks) > 0 { | |
| tipBlock = l.channelMgr.blocks[len(l.channelMgr.blocks)-1] | |
| if l.channelMgr.blocks.Len() > 0 { | |
| tipBlock = l.channelMgr.blocks[l.channelMgr.blocks.Len()-1] |
|
|
||
| targetBlock := syncStatus.SafeL2.Number + 1 | ||
| tipHash := syncStatus.SafeL2.Hash | ||
| if tipBlock != (SizedBlock{}) && tipBlock.Block != nil { |
There was a problem hiding this comment.
The check tipBlock != (SizedBlock{}) is redundant if tipBlock.Block != nil is also checked. Simplifying this check improves readability and ensures that we only proceed if we have a valid block pointer to access.
| if tipBlock != (SizedBlock{}) && tipBlock.Block != nil { | |
| if tipBlock.Block != nil { |
| "batchNr", batch.Number(), | ||
| "batchParent", batch.Header().ParentHash.Hex(), | ||
| "tip", tipHash, |
There was a problem hiding this comment.
For consistency in logging, it is recommended to use .Hex() for all common.Hash types. This ensures that the output is always formatted as a hexadecimal string, matching the style used for batchParent on line 910.
| "batchNr", batch.Number(), | |
| "batchParent", batch.Header().ParentHash.Hex(), | |
| "tip", tipHash, | |
| "batchNr", batch.Number(), | |
| "batchParent", batch.Header().ParentHash.Hex(), | |
| "tip", tipHash.Hex(), |
| "batchNr", (*batch).Number(), | ||
| "batchParent", (*batch).Header().ParentHash, | ||
| "tip", tip, | ||
| "tip", tipHash, |
There was a problem hiding this comment.
The logging here is inconsistent with previous log entries in the same function (e.g., line 910). It is recommended to use .Hex() for common.Hash types. Additionally, the dereference (*batch) is unnecessary in Go when calling methods on a pointer to a struct.
| "batchNr", (*batch).Number(), | |
| "batchParent", (*batch).Header().ParentHash, | |
| "tip", tip, | |
| "tip", tipHash, | |
| "batchNr", batch.Number(), | |
| "batchParent", batch.Header().ParentHash.Hex(), | |
| "tip", tipHash.Hex(), |
shenkeyao
left a comment
There was a problem hiding this comment.
Looks good overall! Some of the Claude comments seem reasonable, though, so worth addressing.
This PR:
Attempts to resolve a potential stalling issue due to a tainted SyncStatus, and a mismatched ChannelManager head.