Skip to content
This repository was archived by the owner on May 29, 2026. It is now read-only.

fix(qbft): ignore benign duplicate gossip failures#514

Merged
gauravahuja merged 5 commits into
mainfrom
ci/runner-large-and-test-flakes
May 12, 2026
Merged

fix(qbft): ignore benign duplicate gossip failures#514
gauravahuja merged 5 commits into
mainfrom
ci/runner-large-and-test-flakes

Conversation

@gauravahuja

@gauravahuja gauravahuja commented May 8, 2026

Copy link
Copy Markdown
Contributor

LFDT-Lineth/lineth-monorepo#2540

Summary

This PR fixes the QBFT integration-test failure by addressing the duplicate-gossip path directly instead of changing CI runner size or integration-test concurrency.

  • Makes QbftGossiper a strict no-op. Locally created QBFT messages are already sent through Besu's ValidatorMulticaster, and replayed future-buffer messages were already propagated by libp2p when first received.
  • Keeps P2PValidatorMulticaster.send blocking on the broadcast future, but treats libp2p NoPeersForOutboundMessageException and MessageAlreadySeenException as benign outcomes. Other broadcast failures still propagate.
  • Adds unit coverage for the gossiper and multicaster behavior.
  • Replaces the fixed sleep in MaruPeerScoringTest "node disconnects validator when requests time out" with bounded awaits for peer connection and disconnection.
  • Updates CHANGELOG.md for the observable validator robustness improvement.

What this does not change

  • The testing / maru tests job remains on gha-runner-scale-set-ubuntu-24-amd64-med.
  • :app:integrationTest keeps the existing Gradle/JUnit concurrency configuration from main; this PR no longer changes forkEvery, maxParallelForks, or runner size.

Test plan

  • ./gradlew :consensus:test --tests '*QbftGossiperTest' --tests '*P2PValidatorMulticasterTest'
  • ./gradlew :app:integrationTest --tests 'maru.app.MaruMultiValidatorTest.block production recovers after 2 nodes offline and 1 returns' --rerun-tasks -Dmaru.integrationTest.maxParallelForks=1
  • ./gradlew build
  • CI: testing / maru tests passed on pushed branch
  • CI: main workflow passed on pushed branch
  • CI: MetaMask Security Code Scanner passed on pushed branch

@codecov-commenter

codecov-commenter commented May 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@e8533da). Learn more about missing BASE report.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #514   +/-   ##
=======================================
  Coverage        ?   84.33%           
  Complexity      ?     1187           
=======================================
  Files           ?      231           
  Lines           ?     8373           
  Branches        ?      683           
=======================================
  Hits            ?     7061           
  Misses          ?      971           
  Partials        ?      341           
Flag Coverage Δ
kotlin 84.33% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gauravahuja gauravahuja requested a review from Filter94 May 8, 2026 20:17
gauravahuja and others added 3 commits May 11, 2026 16:47
…ep race

This addresses three sources of CI flakiness in :app:integrationTest:

1. **Log bleed-through between test classes.** The root build.gradle enables
   junit.jupiter.execution.parallel.mode.classes.default = "concurrent" globally
   for unit tests. With maxParallelForks = 1, this causes integration test
   classes to run concurrently in the same JVM, intermixing their stdout into
   each other's HTML test reports and creating CPU/port/peer contention. The
   :app:integrationTest task now overrides the mode to "same_thread" so each
   test class fully executes (BesuCluster + MaruApp lifecycle) before the
   next one starts. Verified: MaruPeerScoringTest's logs no longer appear in
   MaruMultiValidatorTest reports.

2. **Runner size.** The "testing / maru tests" job ran on the shared *-med
   runner pool. Heavy integration tests with 4 BesuClusters + 4 MaruApps
   per test contend with sibling jobs (chaos-tests on -large) on the same
   shared infra. Bump to *-large for the maru tests job to match what
   chaos-testing already uses.

3. **MaruPeerScoringTest "node disconnects validator when requests time out"
   flake.** The test had a fixed `sleep(timeout - 1.seconds)` followed by an
   exact `assertEquals(peerCount, 1)`. On slow CI, peer connection takes
   longer than the sleep window and the assertion fires before the connection
   is established, observing peerCount=0. Replaced with `await.untilAsserted
   { peerCount == 1 }` followed by a bounded `await` for peerCount==0 after
   the request timeout fires.

P2PTest still has Thread.sleep at lines 345 (gossipsub heartbeat) and
754/759 (manual polling loop). The 1100ms gossipsub wait can't be
trivially converted without exposing a mesh-readiness API; the polling
loop sleeps are correct usage. Left as-is with their existing comments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rks=2

Replaces the same_thread JUnit override with Gradle's per-class JVM fork
isolation:
- forkEvery = 1: each test class runs in a fresh JVM
- maxParallelForks = max(2, CPUs/4): multiple JVMs in parallel

This preserves concurrency (each fork has its own stdout/stderr captured by
Gradle and attributed to the right test class — no log bleed between sibling
classes) while still parallelizing class-level test execution. On the *-large
runner (8 CPU, 32 GB) this gives 2 parallel forks @ 4 GB heap each, ~13 GB
resident total.

Override via -Dmaru.integrationTest.maxParallelForks=N if you need to tune.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gauravahuja gauravahuja force-pushed the ci/runner-large-and-test-flakes branch from 0a6f58a to dd6b27b Compare May 11, 2026 21:11

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit dd6b27b. Configure here.

Comment thread app/build.gradle Outdated
@gauravahuja gauravahuja changed the title ci: parallel-isolated integration tests, bump runner to large, fix sleep race fix(qbft): ignore benign duplicate gossip failures May 11, 2026
class QbftGossiper(
val p2PValidatorMulticaster: P2PValidatorMulticaster,
) : QbftGossiper {
class QbftGossiper : QbftGossiper {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments are a bit outdated now. No libp2p involved. Also, maybe we can call it a NoOpGossiper?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update them after migration to monorepo

@gauravahuja gauravahuja merged commit af70311 into main May 12, 2026
45 checks passed
@gauravahuja gauravahuja deleted the ci/runner-large-and-test-flakes branch May 12, 2026 16:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants