Skip to content

bridge: fix checkpoint tx height (off-by-one) post tx_index removal#593

Merged
lucca30 merged 3 commits into
developfrom
lmartins/fix-checkpoint-tx-block-height
May 19, 2026
Merged

bridge: fix checkpoint tx height (off-by-one) post tx_index removal#593
lucca30 merged 3 commits into
developfrom
lmartins/fix-checkpoint-tx-block-height

Conversation

@lucca30
Copy link
Copy Markdown
Contributor

@lucca30 lucca30 commented May 19, 2026

Summary

Hot-fix for the checkpoints-to-L1 regression introduced by #587. The bridge stopped submitting checkpoints because the new tx-bytes lookup reads from the wrong block. Single-line fix (height - 1), with a comment so the invariant is impossible to miss next time.

Background

#587 made indexer = "null" the default for heimdall by removing the bridge's last dependency on cometbft's tx_index. The previous flow looked up the checkpoint tx by hash through node.Tx(hash, prove=true) — that path traversed the tx indexer and didn't care which block the tx lived in.

The replacement, helper.QueryTxBytesFromBlock(cliCtx, hash, height), reads the block at height directly and scans block.Txs for the matching hash. That requires the caller to pass the exact height of the block that contains the checkpoint tx. The new code passed the wrong height.

What was wrong

bridge/processor/checkpoint.go was passing the blockHeight attached to the EventTypeCheckpoint event — but that's the height where the event was emitted, not the height of the checkpoint tx. They differ by one. Walkthrough:

  1. A validator broadcasts MsgCheckpointBlock. The tx is included in heimdall block H. Vote extensions over H carry the side-tx signatures.
  2. At block H+1, app.PreBlocker (app/abci.go) tallies the vote extensions from H and runs the side-tx post-handler against the txs that were in H — those txs are loaded via StakeKeeper.GetLastBlockTxs(ctx), which is populated at the end of every block by SetLastBlockTxs(ctx, req.Txs[1:]) (app/abci.go:536).
  3. Inside the post-handler, x/checkpoint/keeper/side_msg_server.go emits EventTypeCheckpoint. Because PreBlocker runs at H+1, the event lands in H+1's FinalizeBlockEvents.
  4. The bridge listener (bridge/listener/heimdall.go:104) polls heimdall blocks and forwards events to the processor with the block height where it found them — i.e. H+1.
  5. The processor then called QueryTxBytesFromBlock(cliCtx, txHash, H+1). node.Block(H+1) does not contain the MsgCheckpointBlock tx (that tx is in H), so findTxInBlock returns "tx not found", the bridge errors out, and no checkpoint is submitted to L1.

Symptom in CI: the smoke test for checkpoints polls /checkpoints/latest 80+ times and never sees one (gh run view 26037788347 --repo 0xPolygon/heimdall-v2).

The old tx_index path masked the issue completely — node.Tx(hash, true) was looking the tx up by hash, not by block, so the wrong height was harmless then. The new direct-block path made the off-by-one observable for the first time.

What's right now

Pass height - 1 to QueryTxBytesFromBlock. The invariant is exact, not approximate: the post-handler runs at H+1 against exactly the txs saved from H. There is no scenario where the tx would be in H-2 or H — the lastBlockTxs state is rewritten every block. So a precise -1 is correct (and a windowed scan would just be noise).

Also added a multi-line comment at the callsite explaining the H-1 invariant and pointing at app/abci.go so the next reader doesn't have to retrace this. The previous comment only said "this works with indexer=null" — true but it hid the precondition on height.

Why this won't repeat

  • The comment in bridge/processor/checkpoint.go now names the exact files (StakeKeeper.SetLastBlockTxs/GetLastBlockTxs, app/abci.go) that establish the H-1 invariant. Anyone touching this callsite has to read past it.
  • The variable rename heighttxBlockHeight at the callsite makes it obvious that the value passed to QueryTxBytesFromBlock is not the event's height — they are semantically distinct.
  • QueryTxBytesFromBlock is now only ever called from this one site. If a second caller is ever added, they will encounter the same naming distinction and the same comment via grep.
  • The kurtosis E2E suite catches this end-to-end: the checkpoints smoke test verifies a checkpoint actually lands on L1 within the polling window. A future regression in this code path will fail CI the same way pruning defaults + bridge tx_index-free refactor #587 did.

Verification

Reproduced the regression locally and confirmed the fix end-to-end:

  1. Built heimdall-v2:local from this branch and bor:local from bor/develop.

  2. Ran kurtosis run with a 4-validator + 1-rpc config (same participant shape as pos-workflows/configs/kurtosis-e2e.yml).

  3. Polled /checkpoints/latest on the first validator:

    {"checkpoint":{"id":"2","proposer":"0xeee6f79486542f85290920073947bc9672c6ace5","start_block":"128","end_block":"255",...}}
    

    Two checkpoints landed in the same window where CI saw zero with the broken develop image. Bridge logs show sendCheckpointToHeimdall followed by a successful root-chain submission.

Confirming the broader claim from #587

Re-grepped origin/develop for any remaining cometbft tx-index reads in heimdall code:

git grep "node\.Tx(\|QueryTxWithProof\|TxSearch\|TxsSearch" -- bridge/ helper/ x/ app/ cmd/

Only hit is the definition of helper.QueryTxWithProof itself (helper/query.go:35) — and it has no callers. So with this fix in, indexer = "null" is genuinely safe on every node role (validator, sentry, RPC, bridge). The dead QueryTxWithProof can be removed in a separate cleanup PR.

Test plan

  • Built heimdall-v2:local with the fix.
  • Ran kurtosis locally with the same participant shape as CI; checkpoints landed within ~3 min of devnet start.
  • Confirmed no other tx_index readers remain in the codebase.
  • CI: kurtosis-e2e smoke tests pass on this branch (the regression that broke pruning defaults + bridge tx_index-free refactor #587's merge).

PR #587 replaced the cometbft tx_index lookup in the bridge checkpoint
flow with helper.QueryTxBytesFromBlock(hash, height), passing the block
height attached to the EventTypeCheckpoint event. That height is wrong
by one: the event is emitted in PreBlocker against the *previous*
block's txs, so node.Block(height) returns a block that does not
contain the MsgCheckpointBlock tx, and the bridge fails to submit
checkpoints to L1.

Pass height-1 and document the invariant inline so this can't silently
regress again.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.1%. Comparing base (3f642d8) to head (525cd94).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop    #593   +/-   ##
=======================================
  Coverage     45.1%   45.1%           
=======================================
  Files          172     172           
  Lines        18660   18663    +3     
=======================================
+ Hits          8428    8435    +7     
+ Misses        9025    9020    -5     
- Partials      1207    1208    +1     
Files with missing lines Coverage Δ
bridge/processor/checkpoint.go 6.3% <100.0%> (+2.0%) ⬆️

... and 1 file with indirect coverage changes

Files with missing lines Coverage Δ
bridge/processor/checkpoint.go 6.3% <100.0%> (+2.0%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

marcello33
marcello33 previously approved these changes May 19, 2026
Comment thread bridge/processor/checkpoint.go Outdated
Routes the QueryTxBytesFromBlock call through a small named helper that
owns the "event at H, tx in H-1" invariant. The helper is unit-testable
via a queryTxBytesFromBlock function field on CheckpointProcessor, so a
regression in the height calculation now fails a fast unit test instead
of only surfacing in the kurtosis E2E.

The unit test is not a substitute for the E2E (the surrounding flow is
too integration-shaped to mock fully). It is a fast tripwire on the one
line that broke #587.
Adds a unit test that drives createAndSendCheckpointToRootChain with a
queryTxBytesFromBlock stub that fails, asserting the original error is
propagated verbatim (no swallow, no fall-through into decode/sign with
nil bytes). Pushes coverage on the diff from 66.6% to 100% — all newly
added lines are now exercised.
@sonarqubecloud
Copy link
Copy Markdown

@lucca30 lucca30 merged commit 9cd3055 into develop May 19, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants