Skip to content

contract,transport: reduce log spam during backend outages#180

Open
full-bars wants to merge 4 commits into
urnetwork:mainfrom
full-bars:fix/reduce-outage-log-spam
Open

contract,transport: reduce log spam during backend outages#180
full-bars wants to merge 4 commits into
urnetwork:mainfrom
full-bars:fix/reduce-outage-log-spam

Conversation

@full-bars
Copy link
Copy Markdown

@full-bars full-bars commented May 14, 2026

Problem

During platform outages or degraded connectivity, two log lines flood terminal output at high volume, making it impossible to read logs or assess provider health:

```
[t]auth error = Timeout. # one per transport, every ~5s
[contract]oob err = Timeout.; backing off create contract OOB requests for 1m0s # one per ContractManager instance
```

With 1000 proxies running, each proxy spawns its own `PlatformTransport` and its own `ContractManager`. Under a platform outage all 1000 can log simultaneously and continuously.

Root cause

Both error paths had rate limiters scoped to the wrong level:

  • `[t]auth error` — rate limiter was added per `PlatformTransport` instance. 1000 transports = 1000 independent limiters, each fires freely.
  • `[contract]oob err` — `markCreateContractOobError()` used a per-`ContractManager` mutex. 1000 proxies = 1000 `ContractManager` instances (one created per `provideWithProxy` call), each fires independently.

The fix in both cases: package-level `atomic.Int64` shared across all instances, so the entire process emits at most one log line per minute regardless of how many proxies are running.

Changes

`transport.go`

  • Added `lastAuthErrLogNano atomic.Int64` and `suppressedAuthErrCount atomic.Int64` at package level
  • `shouldLogAuthErr()` — CAS-based rate limiter, max one emit per minute globally across all `PlatformTransport` instances
  • Suppressed count is reported on the next emitted line: `[t]auth error = Timeout. (32 suppressed)`
  • Errors that are suppressed are still logged at `glog.V(1)` for operators running with `-v`

`transfer_contract_manager.go`

  • Added `lastOobErrLogNano atomic.Int64` and `suppressedOobErrCount atomic.Int64` at package level
  • `shouldLogOobErr()` — same CAS pattern
  • `markCreateContractOobError()` changed to void — still sets per-instance `createContractOobErrorBackoffUntil` to gate future contract creation attempts, but logging is now handled separately via the global limiter
  • Suppressed count reported on next emit: `[contract]oob err = Timeout.; backing off ... (34 suppressed)`

Stress test results

Tested on a live Detroit provider running 1000 proxies using `tc netem` applied via `nsenter` into the container network namespace. Five stages run across two builds.

Stages 1–2 — `netem delay 2000ms loss 50%` (8 minutes each)

Metric Result
`[t]auth error` lines 9 — exactly 1 per minute
`[contract]oob err` lines 0
Peak suppressed count 3,964

Auth fix and OOB fix both verified. Suppressed counts climbed from 52 → 925 → 3,030 → ~3,960 as all 1000 proxies cascaded into timeout simultaneously.

Stages 3–5 — `netem loss 95% / 99% / 100%` (10 minutes each)

Metric Result
`[t]auth error` lines 1 per minute — fix held at all loss levels
`[contract]oob err` lines 0–1 per stage, suppressed count reported correctly
Peak suppressed count 3,980
New pattern observed `[r]drop = Timeout.` floods under 95%+ loss

Organic validation

Under real peak load (no netem), observed: `[t]auth error ... (101 suppressed)` and `[t]auth error ... (3,952 suppressed)` — single log lines absorbing thousands of simultaneous transport failures.

Expected behavior after fix

Log line Before After
`[t]auth error` Every ~5s per transport × N proxies At most once per minute globally, suppressed count reported
`[contract]oob err` Every timeout per ContractManager × N proxies At most once per minute globally, suppressed count reported

During URnetwork outages or degraded backend conditions, two log lines
spam the console on every retry cycle, making it difficult to observe
other events:

  [contract]oob err = Timeout
  [t]auth error <id> = No successful strategy found.

This commit addresses both.

contract: OOB error backoff (ryanmello07)
Adds SignStoredContract/VerifyStoredContract helpers and a
NetworkEventTimeChangeHmac cutover (2026-07-01) for a planned HMAC
format migration - verifiers accept both legacy and standard forms at
all times so the rollout is asymmetric.

Adds CreateContractOobErrorBackoff (default 1 min) to
ContractManagerSettings. After the first OOB error, CreateContract
returns early for the duration of the backoff window. The log line fires
at most once per minute instead of once per in-flight request.

transport: auth error logged on state transition only
runH1 and runH3 both retry the platform connection in a tight loop.
Previously every failed attempt emitted an INFO log line. Now the first
failure logs at INFO; subsequent retries in the same failure run are
demoted to V(1). The flag resets on successful connect so the next
disconnection surfaces at INFO again.
@full-bars
Copy link
Copy Markdown
Author

Testing update — moving to draft

Ran a live outage simulation against this PR on a Detroit server running 200 DC proxies. After ~13 hours of warmup the provider reached 877–2,143 active connections with sustained throughput of 1.4 GiB/hr. During that window there was also a real platform OOB condition lasting 10+ hours, which gave extended natural validation of the [contract]oob err rate-limiting behavior.

What the test found:

The [contract]oob err fix works correctly under normal conditions. However, two issues surfaced under aggressive tc netem stress testing (2000ms delay + 50% packet loss with 2,143 connections):

  1. [contract]oob err double-fire — two goroutines could both pass the backoff check before either had set it, resulting in duplicate log lines. The check and the set were not atomic.

  2. [t]auth error not fully rate-limited — the existing fix suppresses repeated errors per transport, but with 2,143 transports all timing out simultaneously, each logs once at the same moment — still producing a wall of log lines. There was no global rate limit across transports.

Fixes applied in this update:

  • transfer_contract_manager.go: markCreateContractOobError() now atomically checks and sets the backoff inside a single mutex lock, returning false if already active. The log only fires when it returns true.
  • transport.go: PlatformTransport gains a shared shouldLogAuthErr() method that rate-limits auth error logging to once per minute across all transport goroutines on the same transport instance. Per-transport authErrLogged is preserved for within-session suppression.

Next steps: Rebuilding the test binary and restarting the container for a second full warmup + stress test cycle. Moving to draft until that validates.

@full-bars full-bars marked this pull request as draft May 15, 2026 20:43
full-bars added 2 commits May 15, 2026 13:44
markCreateContractOobError now atomically checks and sets the backoff
inside a single mutex lock, preventing duplicate log lines when two
goroutines race through the check simultaneously.

PlatformTransport gains shouldLogAuthErr() which rate-limits [t]auth
error logging to once per minute across all transport goroutines on the
same instance, preventing log floods when many transports time out
simultaneously. Per-transport authErrLogged is preserved for
within-session suppression.
Both log lines were rate-limited at the wrong scope. Each proxy creates
its own ContractManager and PlatformTransport, so per-instance limiters
allow 1000 instances to all log simultaneously during an outage.

Switches both to package-level atomic.Int64 CAS rate limiters (max one
log line per minute globally across all instances). Suppressed count is
reported on each emitted line so operators know how many errors were
hidden. Per-instance OOB backoff still gates contract creation.

Verified under live stress testing with 1000 proxies + tc netem:
- [t]auth error: exactly 1 line/min, suppressed counts up to 3,980
- [contract]oob err: 0-1 lines/min, suppressed count reported correctly
@full-bars full-bars marked this pull request as ready for review May 18, 2026 00:19
@full-bars
Copy link
Copy Markdown
Author

full-bars commented May 18, 2026

"No Successful Strategy Found" errors were not reproducible under netem-based stress testing. Even at 100% packet loss, the provider enters transport auth timeout mode before HTTP-layer strategy selection is reached. This condition may require a different trigger (e.g. routes present but all destination peers unavailable) rather than raw network blackout. It remains a known log spam candidate for a follow-up PR once a reliable reproduction method is established.

The timeout spam addressed here ([t]auth error and [contract]oob err) was fully reproducible and verified fixed under all test conditions.

Connection drop errors flood logs during backend outages without suppression,
unlike [t]auth error and [contract]oob err. Implement identical package-level
atomic rate-limiting pattern: max one message per minute globally, suppression
count reported. Validated in production incident 2026-05-30.
@full-bars
Copy link
Copy Markdown
Author

full-bars commented May 30, 2026

Production Validation

This PR has been validated in production during a recent backend API outage (2026-05-29 approx 3:10 - 3:21PM, ~11 minutes duration). Analysis of ~120 servers across multiple geographic regions (North America: New York, Los Angeles, Houston, Dallas, Denver, Chicago, Kansas City, Atlanta; Europe: London, Berlin, Stockholm) confirmed the existing fixes for [t]auth error and [contract]oob err log spam are effective. The outage also revealed one additional error type that needed the same treatment, which this PR now addresses.

1. [r]drop Error Rate-Limiting

Today's outage revealed that connection drop errors ([r]drop) flood logs the same way [t]auth error and [contract]oob err did before this PR's fixes. This PR adds rate-limiting for [r]drop using the identical pattern (package-level atomic counter, max 1 message/min, suppression count reported), completing comprehensive log spam prevention across all three error types that occur during backend outages.

2. Bandwidth Leak Prevention (Deferred)

Today's incident also validated the anti-bandwidth-leak measures needed to address issues #175 and #181. I implemented and proved these mechanisms in the custom build, but they require foundational changes to the contract and transfer systems that will be more effective as a dedicated PR. This allows proper testing of the degradation detection thresholds and throttling behavior without conflating them with log-spam fixes.

Lessons Learned - Faster Data Collection

This incident also informed improvements to how I collect incident data. I developed an automated baseline collection system that gathers version info, system metrics, bandwidth history, and logs from roughly 120 servers in under 60 seconds. This means future outages will have complete fleet-wide context captured immediately, enabling faster root cause analysis and validation of fixes like those in this PR. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant