V8.2.6 Release Testnet #4100

Merged
zsculac merged 33 commits into v6/release/testnet from v6/prerelease/testnet
Mar 26, 2026
@zsculac zsculac commented Mar 26, 2026

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Test A
  • Test B

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Bojan131 and others added 30 commits March 18, 2026 11:58
- Add semaphore (3 concurrent) to limit parallel replication messages
- Batch replication to groups of minAcks+2 with early exit when minimum reached
- Wrap individual node messages in try/catch so one failing peer doesn't kill the whole operation
- Add single retry on NACK before giving up on a peer
- Increase publish message timeout from 15s to 60s for large knowledge assets

Made-with: Cursor
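
The replication changes listed in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `Semaphore`, `replicateInBatches`, and the `sendToPeer` callback (resolving to `'ACK'` or `'NACK'`, or throwing) are hypothetical names, and the early exit here happens between batches.

```javascript
// Hypothetical counting semaphore: caps the number of tasks running at once.
class Semaphore {
    constructor(limit) {
        this.limit = limit;
        this.active = 0;
        this.waiters = [];
    }

    async run(task) {
        if (this.active < this.limit) {
            this.active += 1;
        } else {
            // Wait until a finishing task hands its slot over.
            await new Promise((resolve) => this.waiters.push(resolve));
        }
        try {
            return await task();
        } finally {
            const next = this.waiters.shift();
            if (next) next(); // slot passes directly to the next waiter
            else this.active -= 1;
        }
    }
}

// sendToPeer(peer) is an assumed callback: resolves to 'ACK' or 'NACK', or throws.
async function replicateInBatches(peers, minAcks, sendToPeer, concurrency = 3) {
    const semaphore = new Semaphore(concurrency);
    const batchSize = minAcks + 2;
    let acks = 0;

    const attempt = async (peer) => {
        try {
            let reply = await sendToPeer(peer);
            if (reply === 'NACK') reply = await sendToPeer(peer); // single retry
            if (reply === 'ACK') acks += 1;
        } catch (error) {
            // A failing peer is treated as "no ACK" and must not abort the operation.
        }
    };

    // Stop contacting further batches once the minimum is reached.
    for (let i = 0; i < peers.length && acks < minAcks; i += batchSize) {
        const batch = peers.slice(i, i + batchSize);
        await Promise.all(batch.map((peer) => semaphore.run(() => attempt(peer))));
    }
    return { acks, minAcksReached: acks >= minAcks };
}
```

With `minAcks = 2` the batch size is 4, so at most four peers are contacted before the first check, and never more than three of them concurrently.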
- Add retry logic for transient errors (socket hang up, connection reset, 502/503, fee too low)
- Only bump gas price on nonce/fee errors, not on network errors
- Add gas estimation retries for transient RPC failures
- Add 60s timeout on RPC provider connection to prevent node hanging on unresponsive RPCs
- Fix BigNumber NaN bug in gas price calculation during retry
- Fix Gnosis EIP-1559 gas params and double gwei-parsing in gas price comparison
- Add RPC failover for blockchain event fetching (try all providers before failing)
- Log warning when blockchain events are missed due to large block gaps

Made-with: Cursor
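
The split between "retry as-is" and "retry with a higher gas price" described in this commit might look like the sketch below. The function names, the error-message patterns, and the 20% bump are assumptions made for illustration; gas prices are plain `BigInt` values in wei rather than `ethers.BigNumber`, and backoff delays are omitted.

```javascript
// Assumed message patterns; real RPC providers word these errors differently.
const FEE_PATTERNS = [/fee too low/i, /nonce/i];
const TRANSIENT_PATTERNS = [/socket hang up/i, /connection reset|ECONNRESET/i, /\b50[23]\b/];

function classifyError(error) {
    const message = String(error && error.message);
    if (FEE_PATTERNS.some((pattern) => pattern.test(message))) return 'fee';
    if (TRANSIENT_PATTERNS.some((pattern) => pattern.test(message))) return 'transient';
    return 'fatal';
}

// sendTx(gasPrice) is an assumed callback that submits the transaction.
async function sendWithRetry(sendTx, gasPrice, maxAttempts = 3) {
    let price = gasPrice; // BigInt, wei
    for (let attempt = 1; ; attempt += 1) {
        try {
            return await sendTx(price);
        } catch (error) {
            const kind = classifyError(error);
            if (kind === 'fatal' || attempt >= maxAttempts) throw error;
            // Only fee/nonce errors justify paying more; network errors retry unchanged.
            if (kind === 'fee') price = (price * 120n) / 100n;
        }
    }
}
```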
- Add pending storage cache fallback in get-command so gets work right after publish
- Add merkle root index in pending-storage-service for fast operationId lookups
- Clean up merkle root index entries on cache removal to prevent memory leak
- Add retry logic in publish-finalization-command for reading cached assertion data
- Reduce cache retry window from 100s to 25s (5 retries x 5s) for faster failure detection
- Add getPublishOperationIdByUal repository method for cache lookups

Made-with: Cursor
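
As an illustration of the merkle root index and its cleanup on removal, here is a small in-memory sketch. `PendingStorageIndex` and its method names are hypothetical and do not mirror the actual pending-storage-service; the point is only that every cache removal also drops the matching index entry.

```javascript
// Hypothetical in-memory cache with a secondary index by merkle root.
class PendingStorageIndex {
    constructor() {
        this.datasets = new Map(); // operationId -> { merkleRoot, assertion }
        this.operationIdByMerkleRoot = new Map(); // merkleRoot -> operationId
    }

    cacheDataset(operationId, merkleRoot, assertion) {
        this.remove(operationId); // drop any stale index entry first
        this.datasets.set(operationId, { merkleRoot, assertion });
        this.operationIdByMerkleRoot.set(merkleRoot, operationId);
    }

    getByMerkleRoot(merkleRoot) {
        const operationId = this.operationIdByMerkleRoot.get(merkleRoot);
        if (operationId === undefined) return null;
        return { operationId, ...this.datasets.get(operationId) };
    }

    remove(operationId) {
        const entry = this.datasets.get(operationId);
        if (!entry) return false;
        this.datasets.delete(operationId);
        // Remove the index entry too, otherwise the index grows without bound.
        if (this.operationIdByMerkleRoot.get(entry.merkleRoot) === operationId) {
            this.operationIdByMerkleRoot.delete(entry.merkleRoot);
        }
        return true;
    }
}
```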
- Move mutex from individual services to base OperationService with per-operation granularity
- Previously all operations shared one global mutex per service, which caused unnecessary blocking
  between unrelated operations. Now each operationId gets its own mutex.
- Clean up mutex when operation completes or fails to prevent memory buildup
- Fix updateMinAcksReached ordering in publish-service — was being called after markOperationAsCompleted
  which could cause a race where the result API returns before the flag is set
- Fix result controller to return signatures when status is COMPLETED even if minAcksReached flag
  wasn't set yet (handles the race condition from the client side too)
- Add try/catch around signature loading in result controller to not crash on missing files

Made-with: Cursor
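
The move from one global mutex per service to one mutex per operationId can be illustrated with a promise-chaining sketch. `KeyedMutex` is a hypothetical name and this is not the locking code used in OperationService; it only shows that work for the same key is serialized while unrelated keys proceed independently.

```javascript
// Hypothetical keyed mutex: tasks with the same key run one after another.
class KeyedMutex {
    constructor() {
        this.tails = new Map(); // key -> promise that settles when the last task ends
    }

    runExclusive(key, task) {
        const previous = this.tails.get(key) || Promise.resolve();
        const run = previous.then(() => task());
        // Keep the chain alive even if a task rejects.
        this.tails.set(key, run.catch(() => {}));
        return run;
    }
}
```

Entries in `tails` are never deleted in this sketch, so a long-running process would need some form of cleanup.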
bwrap sandbox fails on GitHub Actions runners with
"Failed RTM_NEWADDR: Operation not permitted". GitHub Actions
already provides process isolation so the sandbox is unnecessary.

Made-with: Cursor
- Don't delete mutex on operation completion/failure; late responses
  could create a new mutex and break serialization for the same operationId
- Track terminal operations with timestamps so late responses short-circuit
  inside the same mutex instead of processing stale data
- Add periodic sweeper (5min TTL) to clean up old terminal mutexes and
  prevent memory buildup

Made-with: Cursor
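
A rough sketch of the terminal-operation tracking with a TTL sweep described in this commit. `TerminalOperations` is a hypothetical class, and the injected `now` clock is an assumption added to make the behaviour easy to check; the 5-minute default comes from the commit message.

```javascript
// Hypothetical tracker: remembers when an operation became terminal.
class TerminalOperations {
    constructor(ttlMs = 5 * 60 * 1000, now = () => Date.now()) {
        this.ttlMs = ttlMs;
        this.now = now;
        this.terminalAt = new Map(); // operationId -> timestamp (ms)
    }

    markTerminal(operationId) {
        this.terminalAt.set(operationId, this.now());
    }

    // Late responses can check this inside the operation's mutex and return early.
    isTerminal(operationId) {
        return this.terminalAt.has(operationId);
    }

    // Drops entries older than the TTL; returns how many were removed.
    sweep() {
        const cutoff = this.now() - this.ttlMs;
        let removed = 0;
        for (const [operationId, timestamp] of this.terminalAt) {
            if (timestamp < cutoff) {
                this.terminalAt.delete(operationId);
                removed += 1;
            }
        }
        return removed;
    }
}
```

A periodic sweeper could then be something like `setInterval(() => tracker.sweep(), 60000).unref()`, where the one-minute interval is an arbitrary choice.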
Initialize responses as empty array instead of 0 so the
terminal short-circuit path returns an empty result instead
of crashing on for..of iteration.

Made-with: Cursor
Wrap sendMessageResponse in handleError with try/catch so a
closed stream does not bring down the entire node.

Made-with: Cursor
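
The guard described here could look like the sketch below. `handleErrorSafely`, the `sendMessageResponse` callback signature, and the NACK payload shape are assumptions for illustration, not the node's real interfaces.

```javascript
// Hypothetical wrapper: a failed error response is logged, never rethrown.
async function handleErrorSafely(sendMessageResponse, error, logger = console) {
    try {
        await sendMessageResponse({ type: 'NACK', errorMessage: error.message });
        return true;
    } catch (sendError) {
        // For example, the peer closed the stream before the response was written.
        logger.warn(`Could not send error response: ${sendError.message}`);
        return false;
    }
}
```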
Always initialize the operationId entry in the status map so
callers don't crash when responses is empty (terminal short-circuit).

Made-with: Cursor
- Skip cache fallback for single-KA requests so a KA-scoped
  get never receives the whole KC payload
- Filter cached assertion by contentType so public-only requests
  never receive private triples
- Run validateResponse() on cached data before completing, same
  validation gate as the normal local and network paths

Made-with: Cursor
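
To make the contentType filtering concrete, here is a minimal sketch. It assumes a cached assertion shaped like `{ public: [...], private: [...] }` and contentType values of `'public'`, `'private'`, or `'all'`; both the shape and the function name are assumptions, not taken from the PR.

```javascript
// Hypothetical filter: returns only the parts the requested contentType allows.
function filterAssertionByContentType(assertion, contentType) {
    const filtered = {};
    if (contentType !== 'private' && assertion.public) {
        filtered.public = assertion.public;
    }
    // Private triples are only included when explicitly requested.
    if (contentType !== 'public' && assertion.private) {
        filtered.private = assertion.private;
    }
    return filtered;
}
```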
…tructor

The cacheDataset call at line 137 uses this.pendingStorageService
but it was never assigned in the constructor, causing a runtime crash.

Made-with: Cursor
Fix local get after publish and cache reliability
Fix operation concurrency and response handling
…lication

Made-with: Cursor

# Conflicts:
#	src/commands/protocols/publish/sender/publish-replication-command.js
Made-with: Cursor
@zsculac zsculac requested a review from branarakic as a code owner March 26, 2026 15:52

@github-actions github-actions bot left a comment

This PR introduces meaningful resilience changes (retry/failover, pending-cache fallback, centralized response mutexing) and a new Codex review workflow, but it also introduces correctness regressions in operation finalization and GET response shape. OperationService now marks operations terminal before persistence succeeds, which can drop later responses and leave operations stuck after transient failures. GET cache fallback can complete requests without metadata even when includeMetadata is requested. Maintainability is mixed: fault-tolerance intent improved, but the new paths need safer state transitions and regression coverage.

endStatuses,
options = {},
) {
this._markOperationTerminal(operationId);

🔴 Bug: _markOperationTerminal(operationId) is set before status/cache writes complete, so if finalization throws, later responses are skipped by getResponsesStatuses and the operation can remain stuck. Mark terminal only after completion/failed state is durably persisted.
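
The ordering suggested in this comment can be sketched as follows. `finalizeOperation` and its two callbacks are hypothetical stand-ins for the real status/cache writes and for `_markOperationTerminal`; the sketch only shows that a failed write leaves the operation non-terminal so later responses are still processed.

```javascript
// Hypothetical finalizer: persist first, mark terminal only on success.
async function finalizeOperation(operationId, persistFinalState, markTerminal) {
    // If this throws, the operation is NOT marked terminal and can be retried.
    await persistFinalState(operationId);
    markTerminal(operationId);
}
```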


if (!cachePassed) return null;

const cachedResponseData = { assertion: filteredAssertion };

🔴 Bug: Cache fallback always builds cachedResponseData with only assertion, so GET requests with includeMetadata=true can complete without metadata on this path. Pass includeMetadata into _tryCacheFallback and either populate metadata or bypass fallback when metadata is required.
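
One way to read this suggestion, as a sketch only: the helper name and the idea of a separately cached `cachedMetadata` value are assumptions. Returning `null` stands for "bypass the fallback and use the normal local or network path".

```javascript
// Hypothetical builder for the cache-fallback response.
function buildCachedResponse(filteredAssertion, cachedMetadata, includeMetadata) {
    if (!includeMetadata) return { assertion: filteredAssertion };
    // Metadata was requested but is not cached: do not complete from the cache.
    if (!cachedMetadata) return null;
    return { assertion: filteredAssertion, metadata: cachedMetadata };
}
```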

gasPrice &&
ethers.utils.parseUnits(gasPrice.toString(), 'gwei').gt(this.defaultGasPrice)
) {
if (gasPrice && gasPrice.gt && gasPrice.gt(this.defaultGasPrice)) {

🟡 Issue: In the response.data.result path, gasPrice is a Number, so gasPrice.gt is undefined and the oracle value is always ignored in favor of defaults. Normalize both oracle formats to ethers.BigNumber (e.g., ethers.BigNumber.from(response.data.result)) before this comparison.
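
The normalization idea can be shown without the ethers dependency. The sketch below uses native `BigInt` as a stand-in for `ethers.BigNumber`, so it is not the fix proposed above, only the same pattern; it also assumes that both values are already expressed in wei. `toWei` and `pickGasPrice` are hypothetical names.

```javascript
// Converts a number, decimal/hex string, BigInt, or BigNumber-like object to BigInt.
function toWei(value) {
    if (typeof value === 'bigint') return value;
    if (typeof value === 'number') {
        return Number.isSafeInteger(value) && value >= 0 ? BigInt(value) : null;
    }
    const text = String(value); // BigNumber-like objects stringify to decimal digits
    return /^(0x[0-9a-fA-F]+|\d+)$/.test(text) ? BigInt(text) : null;
}

// Uses the oracle price only when it parses and exceeds the default.
function pickGasPrice(oracleValue, defaultGasPrice) {
    const price = toWei(oracleValue);
    return price !== null && price > defaultGasPrice ? price : defaultGasPrice;
}
```

Because every input format goes through the same conversion, a plain Number from the oracle is compared like any other value instead of being silently ignored.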

return Command.empty();
}

async _tryCacheFallback(

🟡 Issue: This new pending-cache fallback materially changes GET behavior, but the PR adds no regression tests for key branches (includeMetadata, private-only content, permissioned paranet checks). Add focused tests to lock the response contract and avoid silent drift.

@zsculac zsculac merged commit d41b1df into v6/release/testnet Mar 26, 2026
8 of 10 checks passed

2 participants