- Add semaphore (3 concurrent) to limit parallel replication messages
- Batch replication to groups of minAcks+2 with early exit when minimum reached
- Wrap individual node messages in try/catch so one failing peer doesn't kill the whole operation
- Add single retry on NACK before giving up on a peer
- Increase publish message timeout from 15s to 60s for large knowledge assets

Made-with: Cursor
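A minimal sketch of the replication loop described above. It assumes a hypothetical `sendToNode(node)` that resolves to `'ACK'` or `'NACK'` or throws on transport errors. `Semaphore`, `replicateToNode` and `replicate` are illustrative names, not the PR's code. A thrown error gives up on that peer only, a NACK is retried once, and no new batch starts once `minAcks` is reached.

```javascript
// Hypothetical sketch: names and the ACK/NACK reply shape are assumptions.

// Minimal promise-based counting semaphore.
class Semaphore {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.active < this.limit) {
      this.active += 1;
      return;
    }
    // Wait until release() hands a slot over directly.
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // transfer the slot to a waiter; `active` is unchanged
    } else {
      this.active -= 1;
    }
  }
}

// One peer: an error gives up on this peer only; a NACK is retried once.
async function replicateToNode(node, sendToNode) {
  for (let attempt = 0; attempt < 2; attempt += 1) {
    try {
      const reply = await sendToNode(node);
      if (reply === 'ACK') return true;
    } catch (err) {
      return false; // failing peer must not abort the whole operation
    }
  }
  return false;
}

// Batches of (minAcks + 2), at most `maxConcurrent` messages in flight,
// and no new batch once minAcks has been reached.
async function replicate(nodes, minAcks, sendToNode, maxConcurrent = 3) {
  const semaphore = new Semaphore(maxConcurrent);
  const batchSize = minAcks + 2;
  let acks = 0;

  for (let i = 0; i < nodes.length && acks < minAcks; i += batchSize) {
    const batch = nodes.slice(i, i + batchSize);
    const results = await Promise.all(
      batch.map(async (node) => {
        await semaphore.acquire();
        try {
          return await replicateToNode(node, sendToNode);
        } finally {
          semaphore.release();
        }
      }),
    );
    acks += results.filter(Boolean).length;
  }
  return { acks, minAcksReached: acks >= minAcks };
}
```

Sizing the batch slightly above `minAcks` leaves room for a couple of failures without contacting every peer up front.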
- Add retry logic for transient errors (socket hang up, connection reset, 502/503, fee too low)
- Only bump gas price on nonce/fee errors, not on network errors
- Add gas estimation retries for transient RPC failures
- Add 60s timeout on RPC provider connection to prevent node hanging on unresponsive RPCs
- Fix BigNumber NaN bug in gas price calculation during retry
- Fix Gnosis EIP-1559 gas params and double gwei-parsing in gas price comparison
- Add RPC failover for blockchain event fetching (try all providers before failing)
- Log warning when blockchain events are missed due to large block gaps

Made-with: Cursor
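A sketch of the "retry transient errors, bump gas only on fee/nonce errors" policy. The error patterns, the 20% bump, the retry count and the `sendTx(gasPrice)` callback are all assumptions chosen for illustration. Gas prices are native `BigInt` values in wei here, not ethers.BigNumber.

```javascript
// Hypothetical sketch: patterns, bump factor and sendTx signature are assumptions.
const NETWORK_ERROR_PATTERNS = [/socket hang up/i, /ECONNRESET/i, /connection reset/i, /\b50[23]\b/];
const FEE_ERROR_PATTERNS = [/nonce too low/i, /fee too low/i, /replacement transaction underpriced/i];

function classifyError(err) {
  const message = String(err && err.message ? err.message : err);
  // Fee/nonce errors are checked first: they are retryable AND need a higher price.
  if (FEE_ERROR_PATTERNS.some((re) => re.test(message))) return 'fee';
  if (NETWORK_ERROR_PATTERNS.some((re) => re.test(message))) return 'network';
  return 'fatal';
}

async function sendWithRetry(sendTx, initialGasPrice, { maxRetries = 3, bumpPercent = 20n } = {}) {
  let gasPrice = initialGasPrice; // BigInt, in wei
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await sendTx(gasPrice);
    } catch (err) {
      const kind = classifyError(err);
      if (kind === 'fatal' || attempt >= maxRetries) throw err;
      if (kind === 'fee') {
        // Network errors retry at the same price; only fee/nonce errors pay more.
        gasPrice = (gasPrice * (100n + bumpPercent)) / 100n;
      }
    }
  }
}
```

Keeping the price flat on network errors avoids overpaying when the RPC, not the transaction, was the problem.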
- Add pending storage cache fallback in get-command so gets work right after publish
- Add merkle root index in pending-storage-service for fast operationId lookups
- Clean up merkle root index entries on cache removal to prevent memory leak
- Add retry logic in publish-finalization-command for reading cached assertion data
- Reduce cache retry window from 100s to 25s (5 retries x 5s) for faster failure detection
- Add getPublishOperationIdByUal repository method for cache lookups

Made-with: Cursor
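A sketch of a pending-storage cache with a secondary merkle-root index, assuming an in-memory `Map` store. `PendingStorageCache`, `cacheDataset`, `getByMerkleRoot` and `remove` are hypothetical names. The point is that `remove` must also delete the index entry. Otherwise the index keeps growing, which is the leak mentioned above.

```javascript
// Hypothetical sketch: class and method names are illustrative only.
class PendingStorageCache {
  constructor() {
    this.byOperationId = new Map(); // operationId -> { merkleRoot, dataset }
    this.operationIdByMerkleRoot = new Map(); // merkleRoot -> operationId
  }

  cacheDataset(operationId, merkleRoot, dataset) {
    this.byOperationId.set(operationId, { merkleRoot, dataset });
    this.operationIdByMerkleRoot.set(merkleRoot, operationId);
  }

  // O(1) lookup instead of scanning every cached entry for a matching root.
  getByMerkleRoot(merkleRoot) {
    const operationId = this.operationIdByMerkleRoot.get(merkleRoot);
    return operationId === undefined ? undefined : this.byOperationId.get(operationId);
  }

  remove(operationId) {
    const entry = this.byOperationId.get(operationId);
    if (!entry) return;
    this.byOperationId.delete(operationId);
    // Only drop the index entry if it still points at this operation.
    if (this.operationIdByMerkleRoot.get(entry.merkleRoot) === operationId) {
      this.operationIdByMerkleRoot.delete(entry.merkleRoot);
    }
  }
}
```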
- Move mutex from individual services to base OperationService with per-operation granularity
- Previously all operations shared one global mutex per service, which caused unnecessary blocking between unrelated operations. Now each operationId gets its own mutex.
- Clean up mutex when operation completes or fails to prevent memory buildup
- Fix updateMinAcksReached ordering in publish-service: it was being called after markOperationAsCompleted, which could cause a race where the result API returns before the flag is set
- Fix result controller to return signatures when status is COMPLETED even if the minAcksReached flag wasn't set yet (handles the race condition from the client side too)
- Add try/catch around signature loading in result controller to not crash on missing files

Made-with: Cursor
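A sketch of per-operation locking. It uses a promise chain keyed by operationId as the lock. `KeyedMutex` and `runExclusive` are hypothetical names, and the real OperationService may use a mutex library instead. Work for the same operationId runs strictly in order, while unrelated operations never wait on each other.

```javascript
// Hypothetical sketch: a promise-chain lock per key, not the PR's implementation.
class KeyedMutex {
  constructor() {
    // key -> promise that settles when the most recent holder has finished
    this.tails = new Map();
  }

  async runExclusive(key, fn) {
    const previous = this.tails.get(key) || Promise.resolve();
    let release;
    const current = new Promise((resolve) => {
      release = resolve;
    });
    // The next caller for this key waits for both the previous holder and us.
    this.tails.set(key, previous.then(() => current));
    await previous;
    try {
      return await fn();
    } finally {
      release();
    }
  }
}
```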
add codex review
bwrap sandbox fails on GitHub Actions runners with "Failed RTM_NEWADDR: Operation not permitted". GitHub Actions already provides process isolation so the sandbox is unnecessary.

Made-with: Cursor
- Don't delete mutex on operation completion/failure; late responses could create a new mutex and break serialization for the same operationId
- Track terminal operations with timestamps so late responses short-circuit inside the same mutex instead of processing stale data
- Add periodic sweeper (5min TTL) to clean up old terminal mutexes and prevent memory buildup

Made-with: Cursor
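A sketch of terminal-operation tracking with a TTL sweeper. Only the 5-minute TTL comes from the commit. `TerminalTracker`, `markTerminal`, `isTerminal` and `sweep` are hypothetical. `sweep` returns the swept ids so the caller can also drop the matching mutexes.

```javascript
// Hypothetical sketch: only the 5-minute TTL is taken from the commit message.
const TERMINAL_TTL_MS = 5 * 60 * 1000;

class TerminalTracker {
  constructor(ttlMs = TERMINAL_TTL_MS) {
    this.ttlMs = ttlMs;
    this.terminalAt = new Map(); // operationId -> time (ms) it became terminal
  }

  markTerminal(operationId, now = Date.now()) {
    this.terminalAt.set(operationId, now);
  }

  // Checked inside the per-operation mutex: late responses are short-circuited.
  isTerminal(operationId) {
    return this.terminalAt.has(operationId);
  }

  // Removes entries older than the TTL and returns their ids so the caller
  // can release the matching mutexes as well.
  sweep(now = Date.now()) {
    const swept = [];
    for (const [operationId, timestamp] of this.terminalAt) {
      if (now - timestamp >= this.ttlMs) {
        this.terminalAt.delete(operationId); // deleting during Map iteration is safe
        swept.push(operationId);
      }
    }
    return swept;
  }
}
```

In a long-running node the sweep could be scheduled with `setInterval(() => tracker.sweep(), 60000).unref()`, so the timer does not keep the process alive on shutdown.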
Initialize responses as an empty array instead of 0 so the terminal short-circuit path returns an empty result instead of crashing on for..of iteration.

Made-with: Cursor
Wrap sendMessageResponse in handleError with try/catch so a closed stream does not bring down the entire node.

Made-with: Cursor
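A sketch of an error handler that cannot throw when the reply stream is gone. `handleError`'s signature, the NACK payload shape, and the injected `sendMessageResponse` and `logger` are assumptions standing in for the node's networking module and logger.

```javascript
// Hypothetical sketch: signature and payload shape are assumptions.
async function handleError(sendMessageResponse, logger, operationId, error) {
  try {
    await sendMessageResponse({ operationId, type: 'NACK', message: error.message });
    return true;
  } catch (sendError) {
    // The peer hung up or the stream was closed: log and continue instead of crashing.
    logger.warn(`Could not send error response for ${operationId}: ${sendError.message}`);
    return false;
  }
}
```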
Always initialize the operationId entry in the status map so callers don't crash when responses is empty (terminal short-circuit).

Made-with: Cursor
- Skip cache fallback for single-KA requests so a KA-scoped get never receives the whole KC payload
- Filter cached assertion by contentType so public-only requests never receive private triples
- Run validateResponse() on cached data before completing, same validation gate as the normal local and network paths

Made-with: Cursor
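A sketch of the three guards listed above, applied in order. The request fields (`isSingleKaRequest`, `contentType` with values `public`/`private`/`all`), the assertion shape and the injected `validateResponse` are assumptions. Returning `null` means "not served from cache, take the normal path".

```javascript
// Hypothetical sketch: field names and contentType values are assumptions.
const CONTENT_TYPE_PARTS = {
  public: ['public'],
  private: ['private'],
  all: ['public', 'private'],
};

function tryCacheFallback(request, cachedAssertion, validateResponse) {
  // Guard 1: a KA-scoped get must never be answered with the whole KC payload.
  if (request.isSingleKaRequest) return null;
  if (!cachedAssertion) return null;

  // Guard 2: copy only the requested parts, so public-only requests
  // can never see private triples.
  const parts = CONTENT_TYPE_PARTS[request.contentType];
  if (!parts) return null; // unknown content type: use the normal path
  const filteredAssertion = {};
  for (const part of parts) {
    filteredAssertion[part] = cachedAssertion[part] || [];
  }

  // Guard 3: same validation gate as the local and network paths.
  if (!validateResponse(filteredAssertion)) return null;

  return { assertion: filteredAssertion };
}
```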
…tructor

The cacheDataset call at line 137 uses this.pendingStorageService but it was never assigned in the constructor, causing a runtime crash.

Made-with: Cursor
Fix local get after publish and cache reliability
Fix operation concurrency and response handling
Fix blockchain transaction error handling
…lication

Made-with: Cursor

# Conflicts:
# src/commands/protocols/publish/sender/publish-replication-command.js
Made-with: Cursor
Fix publish replication reliability
fix: remove sandbox from codex review workflow
This PR introduces meaningful resilience changes (retry/failover, pending-cache fallback, centralized response mutexing) and a new Codex review workflow, but it also introduces correctness regressions in operation finalization and GET response shape. OperationService now marks operations terminal before persistence succeeds, which can drop later responses and leave operations stuck after transient failures. GET cache fallback can complete requests without metadata even when includeMetadata is requested. Maintainability is mixed: fault-tolerance intent improved, but the new paths need safer state transitions and regression coverage.

        endStatuses,
        options = {},
    ) {
        this._markOperationTerminal(operationId);
🔴 Bug: _markOperationTerminal(operationId) is called before the status/cache writes complete, so if finalization throws, later responses are skipped by getResponsesStatuses and the operation can remain stuck. Mark the operation terminal only after its completed/failed state has been durably persisted.
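A sketch of the ordering the reviewer suggests. `finalizeOperation`, `persistStatus`, `cacheResult` and `markTerminal` are hypothetical stand-ins for the OperationService internals. If either write rejects, the operation is never marked terminal, so later responses are still processed and can retry finalization.

```javascript
// Hypothetical sketch: helper names are stand-ins, not OperationService methods.
async function finalizeOperation(operationId, { persistStatus, cacheResult, markTerminal }) {
  // Persist first; a rejection here leaves the operation non-terminal.
  await persistStatus(operationId);
  await cacheResult(operationId);
  // Only now is it safe to short-circuit late responses.
  markTerminal(operationId);
}
```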

        if (!cachePassed) return null;

        const cachedResponseData = { assertion: filteredAssertion };
🔴 Bug: Cache fallback always builds cachedResponseData with only assertion, so GET requests with includeMetadata=true can complete without metadata on this path. Pass includeMetadata into _tryCacheFallback and either populate metadata or bypass fallback when metadata is required.
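A sketch of the suggested fix: pass `includeMetadata` into the fallback and bypass the cache when metadata is required but not cached. `buildCachedResponse` and the `cachedMetadata` option are hypothetical names. Returning `null` sends the request down the normal path, which can attach metadata.

```javascript
// Hypothetical sketch: function and option names are illustrative only.
function buildCachedResponse(filteredAssertion, { includeMetadata = false, cachedMetadata = null } = {}) {
  if (!includeMetadata) {
    return { assertion: filteredAssertion };
  }
  if (!cachedMetadata) {
    // Metadata was requested but is not cached: do not complete without it.
    return null;
  }
  return { assertion: filteredAssertion, metadata: cachedMetadata };
}
```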

        gasPrice &&
        ethers.utils.parseUnits(gasPrice.toString(), 'gwei').gt(this.defaultGasPrice)
    ) {
    if (gasPrice && gasPrice.gt && gasPrice.gt(this.defaultGasPrice)) {
🟡 Issue: In the response.data.result path, gasPrice is a Number, so gasPrice.gt is undefined and the oracle value is always ignored in favor of defaults. Normalize both oracle formats to ethers.BigNumber (e.g., ethers.BigNumber.from(response.data.result)) before this comparison.
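A sketch of the normalization idea. It uses native `BigInt` in place of ethers.BigNumber, so no dependency is needed. It assumes oracle values arriving as a number or numeric string are in gwei, and a `bigint` is already in wei. `toWei` and `pickGasPrice` are hypothetical names. Unparseable values such as `NaN` fall back to the default instead of poisoning the comparison.

```javascript
// Hypothetical sketch: BigInt stands in for ethers.BigNumber; units are assumed.
const WEI_PER_GWEI = 1000000000n;

function toWei(gasPrice) {
  if (typeof gasPrice === 'bigint') return gasPrice; // already normalized (wei)
  if (typeof gasPrice === 'number' && Number.isFinite(gasPrice) && gasPrice >= 0) {
    // Convert gwei to wei once, rounding so BigInt() never sees a fraction.
    return BigInt(Math.round(gasPrice * 1e9));
  }
  if (typeof gasPrice === 'string' && /^\d+$/.test(gasPrice)) {
    return BigInt(gasPrice) * WEI_PER_GWEI;
  }
  return null; // unparseable (e.g. NaN): caller uses the default
}

function pickGasPrice(oracleValue, defaultGasPriceWei) {
  const wei = toWei(oracleValue);
  return wei !== null && wei > defaultGasPriceWei ? wei : defaultGasPriceWei;
}
```

Converting to a single unit at one point also prevents the double gwei-parsing described in the commit message.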

        return Command.empty();
    }

    async _tryCacheFallback(
🟡 Issue: This new pending-cache fallback materially changes GET behavior, but the PR adds no regression tests for key branches (includeMetadata, private-only content, permissioned paranet checks). Add focused tests to lock the response contract and avoid silent drift.