contract,transport: reduce log spam during backend outages by full-bars · Pull Request #180 · urnetwork/connect

full-bars · 2026-05-14T21:10:23Z

Problem

During platform outages or degraded connectivity, two log lines flood terminal output at high volume, making it impossible to read logs or assess provider health:

```
[t]auth error = Timeout. # one per transport, every ~5s
[contract]oob err = Timeout.; backing off create contract OOB requests for 1m0s # one per ContractManager instance
```

With 1000 proxies running, each proxy spawns its own `PlatformTransport` and its own `ContractManager`. Under a platform outage all 1000 can log simultaneously and continuously.

Root cause

Both error paths had rate limiters scoped to the wrong level:

`[t]auth error` — rate limiter was added per `PlatformTransport` instance. 1000 transports = 1000 independent limiters, each fires freely.
`[contract]oob err` — `markCreateContractOobError()` used a per-`ContractManager` mutex. 1000 proxies = 1000 `ContractManager` instances (one created per `provideWithProxy` call), each fires independently.

The fix in both cases: package-level `atomic.Int64` shared across all instances, so the entire process emits at most one log line per minute regardless of how many proxies are running.

Changes

`transport.go`

Added `lastAuthErrLogNano atomic.Int64` and `suppressedAuthErrCount atomic.Int64` at package level
`shouldLogAuthErr()` — CAS-based rate limiter, max one emit per minute globally across all `PlatformTransport` instances
Suppressed count is reported on the next emitted line: `[t]auth error = Timeout. (32 suppressed)`
Errors that are suppressed are still logged at `glog.V(1)` for operators running with `-v`

`transfer_contract_manager.go`

Added `lastOobErrLogNano atomic.Int64` and `suppressedOobErrCount atomic.Int64` at package level
`shouldLogOobErr()` — same CAS pattern
`markCreateContractOobError()` changed to void — still sets per-instance `createContractOobErrorBackoffUntil` to gate future contract creation attempts, but logging is now handled separately via the global limiter
Suppressed count reported on next emit: `[contract]oob err = Timeout.; backing off ... (34 suppressed)`

Stress test results

Tested on a live Detroit provider running 1000 proxies using `tc netem` applied via `nsenter` into the container network namespace. Five stages run across two builds.

Stages 1–2 — `netem delay 2000ms loss 50%` (8 minutes each)

Metric	Result
`[t]auth error` lines	9 — exactly 1 per minute
`[contract]oob err` lines	0
Peak suppressed count	3,964

Auth fix and OOB fix both verified. Suppressed counts climbed from 52 → 925 → 3,030 → ~3,960 as all 1000 proxies cascaded into timeout simultaneously.

Stages 3–5 — `netem loss 95% / 99% / 100%` (10 minutes each)

Metric	Result
`[t]auth error` lines	1 per minute — fix held at all loss levels
`[contract]oob err` lines	0–1 per stage, suppressed count reported correctly
Peak suppressed count	3,980
New pattern observed	`[r]drop = Timeout.` floods under 95%+ loss

Organic validation

Under real peak load (no netem), observed: `[t]auth error ... (101 suppressed)` and `[t]auth error ... (3,952 suppressed)` — single log lines absorbing thousands of simultaneous transport failures.

Expected behavior after fix

Log line	Before	After
`[t]auth error`	Every ~5s per transport × N proxies	At most once per minute globally, suppressed count reported
`[contract]oob err`	Every timeout per ContractManager × N proxies	At most once per minute globally, suppressed count reported

During URnetwork outages or degraded backend conditions, two log lines spam the console on every retry cycle, making it difficult to observe other events: [contract]oob err = Timeout [t]auth error <id> = No successful strategy found. This commit addresses both. contract: OOB error backoff (ryanmello07) Adds SignStoredContract/VerifyStoredContract helpers and a NetworkEventTimeChangeHmac cutover (2026-07-01) for a planned HMAC format migration - verifiers accept both legacy and standard forms at all times so the rollout is asymmetric. Adds CreateContractOobErrorBackoff (default 1 min) to ContractManagerSettings. After the first OOB error, CreateContract returns early for the duration of the backoff window. The log line fires at most once per minute instead of once per in-flight request. transport: auth error logged on state transition only runH1 and runH3 both retry the platform connection in a tight loop. Previously every failed attempt emitted an INFO log line. Now the first failure logs at INFO; subsequent retries in the same failure run are demoted to V(1). The flag resets on successful connect so the next disconnection surfaces at INFO again.

full-bars · 2026-05-15T20:43:22Z

Testing update — moving to draft

Ran a live outage simulation against this PR on a Detroit server running 200 DC proxies. After ~13 hours of warmup the provider reached 877–2,143 active connections with sustained throughput of 1.4 GiB/hr. During that window there was also a real platform OOB condition lasting 10+ hours, which gave extended natural validation of the [contract]oob err rate-limiting behavior.

What the test found:

The [contract]oob err fix works correctly under normal conditions. However, two issues surfaced under aggressive tc netem stress testing (2000ms delay + 50% packet loss with 2,143 connections):

[contract]oob err double-fire — two goroutines could both pass the backoff check before either had set it, resulting in duplicate log lines. The check and the set were not atomic.
[t]auth error not fully rate-limited — the existing fix suppresses repeated errors per transport, but with 2,143 transports all timing out simultaneously, each logs once at the same moment — still producing a wall of log lines. There was no global rate limit across transports.

Fixes applied in this update:

transfer_contract_manager.go: markCreateContractOobError() now atomically checks and sets the backoff inside a single mutex lock, returning false if already active. The log only fires when it returns true.
transport.go: PlatformTransport gains a shared shouldLogAuthErr() method that rate-limits auth error logging to once per minute across all transport goroutines on the same transport instance. Per-transport authErrLogged is preserved for within-session suppression.

Next steps: Rebuilding the test binary and restarting the container for a second full warmup + stress test cycle. Moving to draft until that validates.

markCreateContractOobError now atomically checks and sets the backoff inside a single mutex lock, preventing duplicate log lines when two goroutines race through the check simultaneously. PlatformTransport gains shouldLogAuthErr() which rate-limits [t]auth error logging to once per minute across all transport goroutines on the same instance, preventing log floods when many transports time out simultaneously. Per-transport authErrLogged is preserved for within-session suppression.

Both log lines were rate-limited at the wrong scope. Each proxy creates its own ContractManager and PlatformTransport, so per-instance limiters allow 1000 instances to all log simultaneously during an outage. Switches both to package-level atomic.Int64 CAS rate limiters (max one log line per minute globally across all instances). Suppressed count is reported on each emitted line so operators know how many errors were hidden. Per-instance OOB backoff still gates contract creation. Verified under live stress testing with 1000 proxies + tc netem: - [t]auth error: exactly 1 line/min, suppressed counts up to 3,980 - [contract]oob err: 0-1 lines/min, suppressed count reported correctly

full-bars · 2026-05-18T00:22:27Z

"No Successful Strategy Found" errors were not reproducible under netem-based stress testing. Even at 100% packet loss, the provider enters transport auth timeout mode before HTTP-layer strategy selection is reached. This condition may require a different trigger (e.g. routes present but all destination peers unavailable) rather than raw network blackout. It remains a known log spam candidate for a follow-up PR once a reliable reproduction method is established.

The timeout spam addressed here ([t]auth error and [contract]oob err) was fully reproducible and verified fixed under all test conditions.

Connection drop errors flood logs during backend outages without suppression, unlike [t]auth error and [contract]oob err. Implement identical package-level atomic rate-limiting pattern: max one message per minute globally, suppression count reported. Validated in production incident 2026-05-30.

full-bars · 2026-05-30T03:34:01Z

Production Validation

This PR has been validated in production during a recent backend API outage (2026-05-29 approx 3:10 - 3:21PM, ~11 minutes duration). Analysis of ~120 servers across multiple geographic regions (North America: New York, Los Angeles, Houston, Dallas, Denver, Chicago, Kansas City, Atlanta; Europe: London, Berlin, Stockholm) confirmed the existing fixes for [t]auth error and [contract]oob err log spam are effective. The outage also revealed one additional error type that needed the same treatment, which this PR now addresses.

1. [r]drop Error Rate-Limiting

Today's outage revealed that connection drop errors ([r]drop) flood logs the same way [t]auth error and [contract]oob err did before this PR's fixes. This PR adds rate-limiting for [r]drop using the identical pattern (package-level atomic counter, max 1 message/min, suppression count reported), completing comprehensive log spam prevention across all three error types that occur during backend outages.

2. Bandwidth Leak Prevention (Deferred)

Today's incident also validated the anti-bandwidth-leak measures needed to address issues #175 and #181. I implemented and proved these mechanisms in the custom build, but they require foundational changes to the contract and transfer systems that will be more effective as a dedicated PR. This allows proper testing of the degradation detection thresholds and throttling behavior without conflating them with log-spam fixes.

Lessons Learned - Faster Data Collection

This incident also informed improvements to how I collect incident data. I developed an automated baseline collection system that gathers version info, system metrics, bandwidth history, and logs from roughly 120 servers in under 60 seconds. This means future outages will have complete fleet-wide context captured immediately, enabling faster root cause analysis and validation of fixes like those in this PR. :)

full-bars mentioned this pull request May 14, 2026

Bandwidth amplification during backend outages (resend queue + ack loss) #181

Open

Ryanmello07 requested a review from xcolwell May 14, 2026 21:39

full-bars marked this pull request as draft May 15, 2026 20:43

full-bars added 2 commits May 15, 2026 13:44

full-bars marked this pull request as ready for review May 18, 2026 00:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

contract,transport: reduce log spam during backend outages#180

contract,transport: reduce log spam during backend outages#180
full-bars wants to merge 4 commits into
urnetwork:mainfrom
full-bars:fix/reduce-outage-log-spam

full-bars commented May 14, 2026 •

edited

Loading

Uh oh!

full-bars commented May 15, 2026

Uh oh!

full-bars commented May 18, 2026 •

edited

Loading

Uh oh!

full-bars commented May 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

full-bars commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Changes

Stress test results

Stages 1–2 — `netem delay 2000ms loss 50%` (8 minutes each)

Stages 3–5 — `netem loss 95% / 99% / 100%` (10 minutes each)

Organic validation

Expected behavior after fix

Uh oh!

full-bars commented May 15, 2026

Uh oh!

full-bars commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

full-bars commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Production Validation

1. [r]drop Error Rate-Limiting

2. Bandwidth Leak Prevention (Deferred)

Lessons Learned - Faster Data Collection

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

full-bars commented May 14, 2026 •

edited

Loading

full-bars commented May 18, 2026 •

edited

Loading

full-bars commented May 30, 2026 •

edited

Loading