Skip to content

Port bandwidth leak prevention: backend degradation detection & gating#182

Open
full-bars wants to merge 1 commit into
urnetwork:mainfrom
full-bars:fix/bandwidth-leak-prevention
Open

Port bandwidth leak prevention: backend degradation detection & gating#182
full-bars wants to merge 1 commit into
urnetwork:mainfrom
full-bars:fix/bandwidth-leak-prevention

Conversation

@full-bars
Copy link
Copy Markdown

@full-bars full-bars commented May 30, 2026

Bandwidth Leak Prevention: Backend Degradation Detection & Gating

I've ported the proven backend degradation detection and contract gating mechanisms
from my v3.23-fix implementation
to upstream. This addresses the data leak during control API outages (issues #175, #181).

The Problem

When the control API is unreachable, providers continue attempting contract creation
and queueing transfers against a dead API. Without throttling, this becomes a massive
retry storm: thousands of route attempts per second, each spawning goroutines and
queuing transfers that will never complete because clients can't authorize.

The provider essentially screams at an API that isn't even listening, blasting the
network with data that has no chance of reaching a client. On metered bandwidth
connections (common for proxy providers), a sustained outage can consume an entire
month's bandwidth allocation in just a few hours. For proxy users relying on that
bandwidth, this translates to service loss beyond the API outage itself — the pipe
is wide open but the data is being wasted.

The core issue: there's no signal to the provider that the API is down. So it keeps
trying, keeps transferring, keeps wasting bandwidth, until either the outage ends or
the budget is exhausted. This fix adds that signal via degradation detection, then
uses it to stop the bleeding.

The Solution

Backend Degradation Detection:

  • Track consecutive backend failures (auth timeouts + OOB errors) with atomic counters
  • Require threshold (3+ consecutive) + recency check (within 2 minutes) to avoid false
    positives from transient timeouts
  • Distinguish sustained outages from normal churn on a busy provider

Gating & Throttling When Degraded:

  • Skip contract creation goroutines (don't queue work against dead API)
  • Increase contract retry interval from 5s to 30s (reduce API call frequency)
  • Apply exponential backoff to resend timeouts (spread 16 retries across much longer
    intervals instead of sending them in quick succession)

Production Validation

I validated this strategy during a recent control API outage on my fleet (2026-05-30, 3:10-3:21 PM PDT):

  • Scope: ~120 servers across North America + Europe
  • Duration: 11 minutes of outage window
  • Result: Bandwidth consumption dropped to ~12% of baseline during outage, confirming
    that the mechanisms effectively reduce (not eliminate) the leak
  • Post-recovery: Traffic briefly spikes as in-flight data is retried, but this is
    intended cleanup behavior (clients and contracts are long gone, so it amounts to
    clearing queues). The strategy remains effective — the leak is contained during the
    actual outage window.

Important caveat: An 11-minute window provides limited statistical power. The data
shows the mechanism works as designed, but edge cases or longer outages might reveal
new issues. The fix is based on production code that's been running in my
v3.23-fix implementation, so
confidence is high.

Implementation Details

  • Atomics ensure safe concurrent updates across multiple goroutines
  • Threshold + recency window (3+ failures within 2 min) prevents false triggers on
    transient timeouts
  • Exponential backoff caps at 64× RTT to avoid excessively long waits
  • Contract creation gating at the goroutine launch site (not in API layer) to avoid
    feedback loops

Related PRs

I've ported the backend degradation detection and contract gating mechanisms
from my [v3.23-fix implementation](https://github.com/full-bars/urnetwork-3.23-fix)
to upstream. This implementation distinguishes sustained API outages from
transient timeouts, then gates contract creation and applies exponential
backoff to resend queues when degraded.

This prevents providers from wasting bandwidth on contract attempts and
retransmissions during control API failures, effectively reducing the bandwidth
leak during outage windows. The mechanism was validated in production during a
recent incident across ~120 servers with 11 minutes of outage data.

Changes:
- transport.go: Backend degradation detection atomics, threshold logic, auth error tracking
- transfer_contract_manager.go: OOB error failure tracking and reset on success
- transfer.go: Contract retry backoff, contract creation gating, resend exponential backoff
@full-bars
Copy link
Copy Markdown
Author

@xcolwell — requesting review on this bandwidth leak prevention PR.

This ports the backend degradation detection and contract gating mechanisms from the fork I maintain to upstream, addressing issues #175 and #181. The implementation distinguishes sustained API outages from transient timeouts, then gates contract creation and applies exponential backoff when degraded.

Validated in production during a recent 11-minute control API outage across ~120 servers—bandwidth consumption dropped to ~12% of baseline, confirming the mechanisms effectively reduce the leak.

Related: PR#180 (log spam fixes for auth/contract/drop errors).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant