Conversation

@alpe alpe commented Jan 6, 2026

Replaces #2836

alpe added 28 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...

github-actions bot commented Jan 6, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Jan 9, 2026, 12:13 PM


claude bot commented Jan 6, 2026

Claude finished @alpe's task in 2m 40s.


Code Review: RAFT High Availability Implementation

I've completed a comprehensive review of this RAFT integration PR. This is a substantial architectural change introducing high availability through HashiCorp's RAFT consensus. Below are my findings organized by severity and category.


Critical Issues 🔴

1. Bootstrap Mode Restriction Blocks Non-Bootstrap Nodes

File: pkg/raft/node.go:108-115

The Start() method returns an error if Bootstrap is false, which prevents non-bootstrap nodes from joining an existing cluster:

if !n.config.Bootstrap {
    return fmt.Errorf("raft cluster requires bootstrap mode")
}

Impact: This makes it impossible to add new nodes to an existing cluster dynamically, severely limiting cluster scalability and operational flexibility.

Recommendation: Only execute bootstrap logic when Bootstrap is true; allow non-bootstrap nodes to start and join an existing cluster via AddPeer.
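
For illustration, a minimal sketch of the suggested gating, assuming the node keeps an n.raft handle and the config field names used elsewhere in this PR (BootstrapCluster is hashicorp/raft's standard API; the field names are assumptions):

func (n *Node) Start() error {
	if n.config.Bootstrap {
		// Bootstrap only the very first node; BootstrapCluster errors if the store is already initialized.
		cfg := raft.Configuration{Servers: []raft.Server{{
			ID:      raft.ServerID(n.config.NodeID),        // assumed config field
			Address: raft.ServerAddress(n.config.RaftAddr), // assumed config field
		}}}
		if err := n.raft.BootstrapCluster(cfg).Error(); err != nil {
			return fmt.Errorf("bootstrap raft cluster: %w", err)
		}
	}
	// Non-bootstrap nodes start as followers and wait to be added by the leader via AddPeer.
	return nil
}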


2. Naked Returns Swallow Errors

File: node/full.go:113-119

Two critical error paths use naked return statements that will return (nil, nil) instead of propagating errors:

if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
    return  // ❌ Should be: return nil, err
}

Impact: Initialization errors are silently ignored, leading to nil pointer dereferences and runtime panics.

Recommendation: Change both occurrences to return nil, err.


3. Race Condition in Leader Election Sync Check

File: pkg/raft/election.go:96-109

The leadership transition logic has a TOCTOU (time-of-check-time-of-use) race:

raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if d.node.leaderID() != d.node.NodeID() {  // ⚠️ Leadership could change between these lines
    continue
}
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
    // Pass on leadership
}

Impact: Could result in dual leadership or "split-brain" scenarios where two nodes believe they're leaders, potentially causing double-signing and chain forks.

Recommendation: Check leadership status before and after sync, with proper locking. Consider adding a monotonic leadership epoch/term counter.
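
A hedged sketch of the before/after re-check; currentTerm() is a hypothetical helper (hashicorp/raft exposes the term via Stats(), so the exact accessor is an assumption):

termBefore := d.node.currentTerm() // hypothetical helper, e.g. parsed from raft.Stats()["term"]
raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if d.node.leaderID() != d.node.NodeID() || d.node.currentTerm() != termBefore {
	continue // leadership (or the term) changed while waiting; do not act as leader
}
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
	// still behind: transfer leadership away instead of producing blocks
}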


4. Missing Error Handling in RAFT State Persistence

File: block/internal/executing/executor.go:474-478

RAFT broadcast errors during block production could lead to state inconsistency:

if err := e.raftNode.Broadcast(e.ctx, raftState); err != nil {
    return fmt.Errorf("failed to propose block to raft: %w", err)
}
// ... but block is still committed locally even if RAFT broadcast failed

Impact: If RAFT broadcast fails, the local node commits the block but cluster doesn't receive it, causing divergence.

Recommendation: Either fail block production entirely on RAFT broadcast failure, or implement retry/rollback logic.
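
One hedged option is to treat the RAFT proposal as a prerequisite of the local commit, i.e. broadcast first and abort block production on failure (commitBlock below is a hypothetical stand-in for whatever the executor actually calls):

if err := e.raftNode.Broadcast(e.ctx, raftState); err != nil {
	// Nothing has been committed locally yet, so this node and the cluster stay consistent;
	// the block is simply re-produced on the next attempt or by the new leader.
	return fmt.Errorf("failed to propose block to raft, aborting block production: %w", err)
}
if err := e.commitBlock(ctx, header, data); err != nil { // hypothetical commit step
	return fmt.Errorf("commit block after raft broadcast: %w", err)
}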


High Priority Issues 🟡

5. Insufficient Context Cancellation Timeout

File: pkg/raft/election.go:68-70

Worker cancellation uses parent context which may be already cancelled:

workerCtx, cancel := context.WithCancel(ctx)

Issue: When switching roles, the parent context might be cancelled before workers gracefully shut down, leading to abrupt termination.

Recommendation: Use context.WithTimeout(context.Background(), 5*time.Second) for shutdown operations.
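
A minimal sketch of a detached shutdown context, assuming a Stop-style method on the worker (the 5-second budget is illustrative):

// Do not tie worker shutdown to the (possibly already cancelled) parent context.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := worker.Stop(shutdownCtx); err != nil { // worker.Stop is a hypothetical shutdown hook
	d.logger.Warn().Err(err).Msg("worker did not shut down cleanly")
}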


6. Potential Memory Leak in Apply Channel

File: pkg/raft/node.go:293-299

The FSM apply callback drops messages on full channel:

select {
case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
default:
    f.logger.Warn().Msg("apply channel full, dropping message")
}

Issue: While non-blocking is good, dropping RAFT consensus messages silently could lead to followers permanently falling behind.

Recommendation: Add metrics for dropped messages and consider triggering a full sync if too many drops occur.
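
A hedged sketch of instrumenting the drop path with a Prometheus counter (the metric and the resync trigger are illustrative, not part of this PR):

select {
case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
default:
	droppedApplyMsgs.Inc() // assumed prometheus.Counter, registered with the node's metrics
	f.logger.Warn().Uint64("index", log.Index).Msg("apply channel full, dropping message")
	// Optionally: once drops pass a threshold, signal the syncer to fall back to a full DA/P2P sync.
}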


7. Missing Validation in State Transitions

File: pkg/raft/types.go:13-21

The assertValid function only checks height/timestamp ordering but doesn't validate:

  • Chain continuity (hash chains)
  • Non-negative timestamps
  • Reasonable timestamp drift

Recommendation: Add comprehensive validation including hash chain verification, timestamp bounds, and state transition rules.
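
A hedged sketch of the extended validation; the state type and field names (HeaderHash, LastHeaderHash, Timestamp) are assumptions about the replicated state, and maxClockDrift is an illustrative bound:

func assertValid(prev, next *RaftBlockState) error { // type name is an assumption
	if next.Height != prev.Height+1 {
		return fmt.Errorf("non-contiguous height: %d -> %d", prev.Height, next.Height)
	}
	if !bytes.Equal(next.LastHeaderHash, prev.HeaderHash) { // assumed hash-chain fields
		return fmt.Errorf("hash chain broken at height %d", next.Height)
	}
	if next.Timestamp.Before(prev.Timestamp) || next.Timestamp.After(time.Now().Add(maxClockDrift)) {
		return fmt.Errorf("timestamp out of bounds at height %d", next.Height)
	}
	return nil
}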


8. Inadequate Duplicate Server Detection

File: pkg/raft/node.go:377-391

deduplicateServers only deduplicates by ID, ignoring address conflicts:

key := server.ID  // ⚠️ Should also check for duplicate addresses

Issue: Two servers with different IDs but same address could cause connection issues.

Recommendation: Validate both ID and address uniqueness, return error on address conflicts.
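
A hedged sketch that keeps the ID-based dedup but fails fast on address conflicts (the added error return is new relative to the current helper):

func deduplicateServers(servers []raft.Server) ([]raft.Server, error) {
	byID := make(map[raft.ServerID]struct{}, len(servers))
	byAddr := make(map[raft.ServerAddress]raft.ServerID, len(servers))
	out := make([]raft.Server, 0, len(servers))
	for _, s := range servers {
		if _, ok := byID[s.ID]; ok {
			continue // same ID listed twice: keep the first occurrence
		}
		if owner, ok := byAddr[s.Address]; ok && owner != s.ID {
			return nil, fmt.Errorf("address %s shared by servers %s and %s", s.Address, owner, s.ID)
		}
		byID[s.ID] = struct{}{}
		byAddr[s.Address] = s.ID
		out = append(out, s)
	}
	return out, nil
}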


Medium Priority Issues 🟢

9. Misleading Method Name

File: block/internal/cache/pending_headers.go:69-71

func (ph *PendingHeaders) GetLastSubmittedDataHeight() uint64

This method is on PendingHeaders but retrieves data height, which is confusing.

Recommendation: Rename to GetLastSubmittedHeaderHeight() for consistency.


10. Debug Log at Wrong Level

File: block/internal/syncing/syncer.go:184

s.logger.Info().Msg("received raft block state")  // Should be Debug

Recommendation: Change to Debug() level to avoid log spam in production.


11. Non-Idiomatic Go Naming

File: pkg/rpc/server/http.go:143-146

NodeId string `json:"node_id"`  // Should be NodeID

Recommendation: Use NodeID per Go conventions for initialisms.


12. Redundant Type Conversion

File: block/internal/executing/executor.go:623-625

for i, tx := range batchData.Transactions {
    data.Txs[i] = tx  // types.Tx is already []byte
}

Recommendation: Replace loop with data.Txs = append(data.Txs, batchData.Transactions...) for efficiency.


Architecture & Design 📐

Strengths ✅

  1. Clean separation of concerns - RAFT logic isolated in pkg/raft/, interfaces well-defined
  2. Proper use of FSM pattern - State machine implementation follows RAFT best practices
  3. Dynamic leader election - Elegant failover mechanism with DynamicLeaderElection
  4. Non-blocking apply channel - Prevents RAFT cluster blocking on slow consumers
  5. Comprehensive E2E tests - test/e2e/failover_e2e_test.go provides good coverage

Areas for Improvement ⚠️

  1. Test coverage: Only 40% patch coverage (567 lines uncovered). Critical paths in pkg/raft/node.go (12% coverage) and block/internal/syncing/raft_retriever.go (0% coverage) need tests.

  2. Configuration complexity: 8 new RAFT config options added without migration guide or validation rules.

  3. Leadership transfer logic: The leadership transfer in election.go:104 needs retry logic and timeout handling.

  4. Snapshot management: No tests for snapshot/restore functionality, which is critical for cluster recovery.

  5. Metrics: Missing key metrics for:

    • Leader election count/duration
    • RAFT message latency
    • Follower lag
    • Dropped apply messages
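
On item 5, a minimal Prometheus sketch of the suggested instruments (names are illustrative; registration with the node's existing metrics registry is assumed):

var (
	leaderElections = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "raft_leader_elections_total", Help: "Leadership changes observed by this node."})
	applyLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "raft_apply_latency_seconds", Help: "Latency from Broadcast to FSM apply."})
	followerLag = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "raft_follower_lag", Help: "Leader last index minus this node's applied index."})
	droppedApplyMsgs = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "raft_dropped_apply_msgs_total", Help: "FSM apply messages dropped due to a full channel."})
)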

Security Considerations 🔒

  1. TLS Not Enforced: RAFT transport uses TCP without TLS. For production, mutual TLS should be mandatory.

  2. No Authentication: RAFT AddPeer accepts any nodeID/address without authentication. Consider adding JWT or mTLS.

  3. Snapshot Security: Snapshots stored on disk without encryption. Consider encrypting at rest.

  4. Log Poisoning: No validation of log entries before applying to FSM beyond basic unmarshaling.
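
On point 1, hashicorp/raft does not ship a TLS transport, but NewNetworkTransport accepts any raft.StreamLayer, so mutual TLS can be layered in; a hedged sketch (the tls.Config is assumed to come from node configuration):

// A TLS-wrapping stream layer for raft.NewNetworkTransport.
type tlsStreamLayer struct {
	net.Listener            // a tls.Listener wrapping the raw TCP listener (server side: ClientCAs + RequireAndVerifyClientCert)
	tlsConfig *tls.Config   // client-side config (Certificates + RootCAs)
}

func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	d := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(d, "tcp", string(addr), t.tlsConfig)
}

// usage sketch: transport := raft.NewNetworkTransport(&tlsStreamLayer{...}, 3, 10*time.Second, os.Stderr)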


Performance & Resource Management ⚡

  1. BoltDB Disk Usage: Two separate BoltDB files (raft-log.db and raft-stable.db) could grow large. Need log compaction strategy.

  2. Sync Wait Timeout: waitForMsgsLanded uses SendTimeout which may be too short for large clusters under load.

  3. Context Propagation: Several places create new background contexts instead of propagating parent context, losing trace information.
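
On point 1, hashicorp/raft already truncates the log after snapshots; the knobs live on raft.Config, so bounding BoltDB growth is mostly a tuning exercise (values below are illustrative):

cfg := raft.DefaultConfig()
cfg.SnapshotInterval = 2 * time.Minute // how often to check whether a snapshot is due
cfg.SnapshotThreshold = 4096           // take a snapshot after this many new log entries
cfg.TrailingLogs = 1024                // entries retained after a snapshot for follower catch-up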


Testing Gaps 🧪

Required test additions:

  1. Unit tests for:

    • pkg/raft/node.go bootstrap and non-bootstrap paths
    • pkg/raft/election.go all leadership transition scenarios
    • block/internal/syncing/raft_retriever.go apply loop edge cases
  2. Integration tests for:

    • Adding node to existing cluster
    • Network partition scenarios
    • Leader crash during block production
    • Snapshot creation and restoration
  3. Chaos tests for:

    • Random node failures
    • Network delays/packet loss
    • Disk I/O failures

Documentation 📝

Missing/Incomplete:

  1. No ADR (Architecture Decision Record) for RAFT integration approach
  2. Configuration parameters lack detailed descriptions of valid ranges
  3. No runbook for operational procedures (adding nodes, leadership transfer, recovery)
  4. Migration guide missing for existing deployments

Recommendations Summary

Before Merge:

  1. ✅ Fix critical issues #1-4 (bootstrap, error handling, race conditions)
  2. ✅ Add unit tests for core RAFT paths to reach >70% coverage
  3. ✅ Add configuration validation with sensible defaults
  4. ✅ Document RAFT configuration options in CLAUDE.md

Post-Merge (Can be follow-up PRs):

  1. Add TLS support for RAFT transport
  2. Implement comprehensive metrics and monitoring
  3. Add chaos/resilience tests
  4. Create operational runbook

Conclusion

This is a substantial and well-architected RAFT integration that brings true high availability to ev-node. The core design is sound with proper separation of concerns and elegant use of Go patterns. However, critical bugs in error handling and bootstrap logic must be fixed before merge to prevent production issues.

The 40% test coverage is concerning for such critical infrastructure. I strongly recommend adding tests for at least the core RAFT node lifecycle and leader election paths before merging.

Overall: Approve with required changes ✅ (after addressing critical issues #1-4)


@gemini-code-assist

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Rollkit network's resilience and fault tolerance by integrating the Raft consensus algorithm. The changes enable a robust failover mechanism, allowing nodes to seamlessly transition between leader and follower roles. This ensures continuous operation and consistent state across the network, even if individual nodes experience failures. The update touches various core components, from dependency management and block processing to RPC services and node lifecycle management, all aimed at achieving high availability.

Highlights

  • Raft Consensus Integration: Introduced Raft consensus for high availability, enabling dynamic leader election and state replication. This involves new dependencies on github.com/hashicorp/raft and github.com/hashicorp/raft-boltdb.
  • Node Architecture Refactoring: The node's core logic has been refactored to support a failover mechanism, allowing nodes to dynamically switch between aggregator (leader) and sync (follower) modes based on Raft's leadership status. This includes changes to how P2P clients are initialized and passed to node startup commands.
  • Block Processing with Raft: The block executor now checks Raft leadership before producing new blocks and proposes the block state to the Raft cluster after execution, ensuring state consistency across the cluster.
  • Cache and State Management Enhancements: Added methods to retrieve and persist the last submitted DA header and data heights from the cache. Logic was introduced to skip in-flight (temporary) headers and data during cache retrieval to prevent processing incomplete blocks.
  • New RPC Endpoints and Configuration: Added new RPC endpoints for querying Raft node status (e.g., /raft/node) and introduced new configuration flags for Raft-specific settings like node_id, raft_addr, bootstrap, and peers.
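
For reference, a quick way to check the new /raft/node endpoint from Go once a node is running (the base URL and the is_leader field name are assumptions; node_id matches the handler shown in the review below):

resp, err := http.Get("http://localhost:7331/raft/node") // assumed RPC address
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
var status struct {
	IsLeader bool   `json:"is_leader"` // assumed tag
	NodeID   string `json:"node_id"`
}
if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
	log.Fatal(err)
}
fmt.Printf("leader=%v node=%s\n", status.IsLeader, status.NodeID)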


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces high availability via RAFT consensus, which is a significant architectural change. The implementation is extensive, touching many parts of the system from node startup to block processing and syncing. Overall, the changes are well-structured, introducing clear interfaces for RAFT integration and refactoring the node components to support dynamic role switching between leader and follower. I've identified a few critical issues related to error handling and startup logic that need to be addressed, along with some medium-severity issues for improving code clarity and maintainability. Great work on this complex feature.

I am having trouble creating individual review comments, so my feedback is listed below.

node/full.go (113-115)

high

The naked return here will not propagate the error from newSingleRoleElector. This will cause the function to return (nil, nil), masking the underlying error. The error should be returned to the caller.

		if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
			return nil, err
		}

node/full.go (117-119)

high

Similar to the previous case, the naked return here will swallow the error from newSingleRoleElector. The error should be propagated up to the caller.

		if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
			return nil, err
		}

pkg/raft/node.go (111-113)

high

This check prevents a node from starting if Bootstrap is false, which is problematic for nodes joining an existing cluster. A new node attempting to join will fail to start. The bootstrap logic should only execute if n.config.Bootstrap is true, and the function should return nil otherwise, allowing non-bootstrap nodes to start and join a cluster.

block/internal/cache/pending_headers.go (69-71)

medium

The method name GetLastSubmittedDataHeight is misleading as it's part of the PendingHeaders struct. For clarity and consistency, it should be renamed to GetLastSubmittedHeaderHeight.

This change will also require updating the call site in block/internal/cache/manager.go.

func (ph *PendingHeaders) GetLastSubmittedHeaderHeight() uint64 {
	return ph.base.getLastSubmittedHeight()
}

block/internal/executing/executor.go (570-572)

medium

The explicit type conversion types.Tx(tx) is redundant since types.Tx is an alias for []byte, and tx is already of type []byte. The change to a direct assignment is good, but it seems this loop could be replaced with a single, more efficient append call.

	data.Txs = append(data.Txs, batchData.Transactions...)

block/internal/syncing/syncer.go (184)

medium

This log message seems to be for debugging purposes, indicated by the +++ prefix. It should be logged at the Debug level instead of Info to avoid cluttering the logs in a production environment.

			s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight).Uint64("data_height", state.LastSubmittedDADataHeight).Msg("received raft block state")

pkg/rpc/server/http.go (143-146)

medium

To adhere to Go's naming conventions for initialisms, the struct field NodeId should be renamed to NodeID.

				NodeID   string `json:"node_id"`
			}{
				IsLeader: raftNode.IsLeader(),
				NodeID:   raftNode.NodeID(),


codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 40.12672% with 567 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.75%. Comparing base (453a8a4) to head (49ba50f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
pkg/raft/node.go | 12.12% | 174 Missing ⚠️
pkg/raft/node_mock.go | 45.40% | 74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go | 0.00% | 63 Missing ⚠️
node/full.go | 32.30% | 37 Missing and 7 partials ⚠️
block/internal/syncing/syncer.go | 25.45% | 39 Missing and 2 partials ⚠️
node/failover.go | 74.45% | 22 Missing and 13 partials ⚠️
block/internal/executing/executor.go | 5.55% | 30 Missing and 4 partials ⚠️
pkg/raft/election.go | 80.23% | 12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go | 6.66% | 13 Missing and 1 partial ⚠️
block/components.go | 27.27% | 7 Missing and 1 partial ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2954      +/-   ##
==========================================
- Coverage   58.74%   55.75%   -3.00%     
==========================================
  Files          93      102       +9     
  Lines        8863     9814     +951     
==========================================
+ Hits         5207     5472     +265     
- Misses       3067     3714     +647     
- Partials      589      628      +39     
Flag | Coverage Δ
combined | 55.75% <40.12%> (-3.00%) ⬇️


alpe added 6 commits January 7, 2026 16:11
* main:
  feat(tracing): add tracing to EngineClient (#2959)
  chore: inject W3C headers into engine client and eth client (#2958)
  feat: adding tracing for Executor and added initial configuration (#2957)