[BUG] Massive Bandwidth Leak and Resource Exhaustion during Server-Side Outage #175

@full-bars

Bug Report: Severe Bandwidth & Resource Leak During Server-Side Outages

Summary

The URnetwork provider software (both Docker and standalone binary) exhibits a severe "bandwidth leak" bug during server-side outages. When the central infrastructure is unreachable or authentication fails, the provider enters an uncontrolled high-frequency retry loop.

This loop consumes massive amounts of raw bandwidth and CPU while producing zero credited transfer or productive work.


Observed Behavior

  • Productive Traffic vs. Burned Bandwidth: Between 12:00 AM and 05:00 AM, the official transfer stats show 0 bytes of change, confirming the network was non-functional. However, vnstat recorded approximately 125 GiB of raw network traffic during that same window.
  • High CPU Usage: CPU usage remains pegged at 100–150%, with host load averages exceeding 14.0.
  • Log Spam: Logs are flooded with [t]auth error messages (No successful strategy found / Timeout) at high frequency.
  • Critical Resource Leaks: Resident memory (RES) grew from 1.1 GiB to 1.3 GiB in 20 minutes. The process accumulated nearly 5,000 file descriptors, indicating that failed authentication attempts are not being properly cleaned up.

Environment

| Field | Value |
| --- | --- |
| OS | Ubuntu 24.04.4 LTS (aarch64) |
| Binary Version | 2026.3.23+895075980 |
| Docker Image | bringyour/community-provider:g4-latest |

Supporting Evidence

1. Proxy Host Request Stats (Last 24 Hours)

The provider-side stats for the last 24 hours show that the software hit the central infrastructure nearly 30 million times, an average of roughly 338 requests per second.

| Host | Requests | Usage | Avg. bytes per request |
| --- | --- | --- | --- |
| connect.bringyour.com | 25,810,722 | 346.24 GB | 14,403.88 B |
| api.bringyour.com | 3,366,462 | 45.55 GB | 14,527.95 B |
| Total | 29,177,184 | 391.79 GB | |

2. Transfer Stats History

The transfer history shows no change in unpaid data since ~12:17 AM, confirming zero productive work during the outage window.

| Timestamp | Unpaid (GB) | Change (GB) |
| --- | --- | --- |
| 04/23/2026 05:00:49 AM | 560.499 | 0.000 |
| 04/23/2026 04:00:08 AM | 560.499 | 0.000 |
| 04/23/2026 03:00:12 AM | 560.499 | 0.000 |
| 04/23/2026 02:00:16 AM | 560.499 | 0.000 |
| 04/23/2026 01:00:11 AM | 560.499 | 0.000 |
| 04/23/2026 12:17:01 AM | 560.499 | 0.000 |
| 04/23/2026 12:00:50 AM | 560.499 | 1.857 |

3. vnstat Hourly Data

vnstat recorded sustained high traffic during the same period as the outage:

enp0s6  /  hourly (2026-04-23)
        hour        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
        00:00     10.43 GiB |   15.51 GiB |   25.94 GiB |   61.90 Mbit/s
        01:00      9.19 GiB |   15.20 GiB |   24.39 GiB |   58.19 Mbit/s
        02:00      9.42 GiB |   16.53 GiB |   25.95 GiB |   61.92 Mbit/s
        03:00      6.35 GiB |   10.75 GiB |   17.09 GiB |   40.78 Mbit/s
        04:00      6.22 GiB |   11.05 GiB |   17.26 GiB |   41.19 Mbit/s
        05:00      6.48 GiB |    7.99 GiB |   14.46 GiB |  138.05 Mbit/s
    ------------------------+-------------+-------------+---------------
    Total burned during outage: ~125 GiB in 5 hours on this single server.

4. Process Status (top)

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1612252 ubuntu  20   0 2836788   1.1g  17552 S 116.7   4.9   20,22 urnetwork

5. Log Snippet

I0423 12:12:37.038664 40 transport.go:459] [t]auth error = No successful strategy found.
I0423 05:11:28.987004 1612252 transport.go:485] [t]auth error = Timeout.

6. Socket State & Resource Leak Snapshot

Analysis of the ~4,600–5,000 connections shows the process is stuck in a massive, high-concurrency retry loop:

| Metric | Value | Description |
| --- | --- | --- |
| SYN-SENT | ~1,600 | Aggressively opening new connections simultaneously without rate-limiting |
| ESTABLISHED | ~3,000 | Technically open but non-functional/idle connections that are never reaped |
| FD Count | ~4,950 | Open file descriptors accumulated from uncleaned failed auth attempts |
| Memory Growth | +200 MB in 20 min | RES grew from 1.1 GiB to 1.3 GiB with zero productive traffic |

Reproduction Steps

  1. Run a URnetwork provider instance.
  2. Block access to URnetwork central authentication servers, or wait for a server-side outage.
  3. Observe vnstat, resident memory, FD counts, and CPU usage.

Suggested Fix

  • Exponential Backoff: Implement a mandatory increasing delay between auth retries.
  • Circuit Breaker: Suspend network activity if No successful strategy found persists beyond a threshold.
  • Log Throttling: Rate-limit duplicate error messages to prevent log flooding.
  • Connection Management: Ensure all sockets are explicitly closed on authentication failure, and implement a reaper for idle/zombie ESTABLISHED connections.
