fix(qbft): ignore benign duplicate gossip failures#514
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #514 +/- ##
=======================================
Coverage ? 84.33%
Complexity ? 1187
=======================================
Files ? 231
Lines ? 8373
Branches ? 683
=======================================
Hits ? 7061
Misses ? 971
Partials ? 341
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
…ep race
This addresses three sources of CI flakiness in :app:integrationTest:
1. **Log bleed-through between test classes.** The root build.gradle enables
junit.jupiter.execution.parallel.mode.classes.default = "concurrent" globally
for unit tests. With maxParallelForks = 1, this causes integration test
classes to run concurrently in the same JVM, intermixing their stdout into
each other's HTML test reports and creating CPU/port/peer contention. The
:app:integrationTest task now overrides the mode to "same_thread" so each
test class fully executes (BesuCluster + MaruApp lifecycle) before the
next one starts. Verified: MaruPeerScoringTest's logs no longer appear in
MaruMultiValidatorTest reports.
2. **Runner size.** The "testing / maru tests" job ran on the shared *-med
runner pool. Heavy integration tests with 4 BesuClusters + 4 MaruApps
per test contend with sibling jobs (chaos-tests on -large) on the same
shared infra. Bump to *-large for the maru tests job to match what
chaos-testing already uses.
3. **MaruPeerScoringTest "node disconnects validator when requests time out"
flake.** The test had a fixed `sleep(timeout - 1.seconds)` followed by an
exact `assertEquals(peerCount, 1)`. On slow CI, peer connection takes
longer than the sleep window and the assertion fires before the connection
is established, observing peerCount=0. Replaced with `await.untilAsserted
{ peerCount == 1 }` followed by a bounded `await` for peerCount==0 after
the request timeout fires.
P2PTest still has Thread.sleep at lines 345 (gossipsub heartbeat) and
754/759 (manual polling loop). The 1100ms gossipsub wait can't be
trivially converted without exposing a mesh-readiness API; the polling
loop sleeps are correct usage. Left as-is with their existing comments.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rks=2 Replaces the same_thread JUnit override with Gradle's per-class JVM fork isolation: - forkEvery = 1: each test class runs in a fresh JVM - maxParallelForks = max(2, CPUs/4): multiple JVMs in parallel This preserves concurrency (each fork has its own stdout/stderr captured by Gradle and attributed to the right test class — no log bleed between sibling classes) while still parallelizing class-level test execution. On the *-large runner (8 CPU, 32 GB) this gives 2 parallel forks @ 4 GB heap each, ~13 GB resident total. Override via -Dmaru.integrationTest.maxParallelForks=N if you need to tune. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0a6f58a to
dd6b27b
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit dd6b27b. Configure here.
| class QbftGossiper( | ||
| val p2PValidatorMulticaster: P2PValidatorMulticaster, | ||
| ) : QbftGossiper { | ||
| class QbftGossiper : QbftGossiper { |
There was a problem hiding this comment.
The comments are a bit outdated now. No libp2p involved. Also, maybe we can call it a NoOpGossiper?
There was a problem hiding this comment.
I'll update them after migration to monorepo

LFDT-Lineth/lineth-monorepo#2540
Summary
This PR fixes the QBFT integration-test failure by addressing the duplicate-gossip path directly instead of changing CI runner size or integration-test concurrency.
QbftGossipera strict no-op. Locally created QBFT messages are already sent through Besu'sValidatorMulticaster, and replayed future-buffer messages were already propagated by libp2p when first received.P2PValidatorMulticaster.sendblocking on the broadcast future, but treats libp2pNoPeersForOutboundMessageExceptionandMessageAlreadySeenExceptionas benign outcomes. Other broadcast failures still propagate.MaruPeerScoringTest "node disconnects validator when requests time out"with bounded awaits for peer connection and disconnection.CHANGELOG.mdfor the observable validator robustness improvement.What this does not change
testing / maru testsjob remains ongha-runner-scale-set-ubuntu-24-amd64-med.:app:integrationTestkeeps the existing Gradle/JUnit concurrency configuration frommain; this PR no longer changesforkEvery,maxParallelForks, or runner size.Test plan
./gradlew :consensus:test --tests '*QbftGossiperTest' --tests '*P2PValidatorMulticasterTest'./gradlew :app:integrationTest --tests 'maru.app.MaruMultiValidatorTest.block production recovers after 2 nodes offline and 1 returns' --rerun-tasks -Dmaru.integrationTest.maxParallelForks=1./gradlew buildtesting / maru testspassed on pushed branchmainworkflow passed on pushed branch