Bandwidth amplification during backend outages (resend queue + ack loss) #181

@full-bars

Summary

During URnetwork backend outages, significant volumes of traffic leak through the data plane even though the control plane (API) is unreachable. This was identified during code review on #180.

Suspected mechanism

The resend queue in transfer.go and the multi-client race in ip_remote_multi_client.go interact to amplify traffic when the platform is degraded:

1. Resend queue loops on ack loss (transfer.go ~lines 1569–1605)

When the platform is degraded, writes to providers can still succeed (data plane connections may remain up) but acks are never returned. After each write the resend queue re-queues the item unconditionally, whether the write succeeded or failed, and uses the last known scaled RTT as the retry interval:

err := c()
if err != nil {
    glog.V(1).Infof("[s]resend drop = %s", err)
}
// re-queued regardless of whether write succeeded or failed
item.sendCount += 1
item.resendTime = sendTime.Add(self.rttWindow.ScaledRtt())
self.resendQueue.Add(item)

A packet in flight at the time of the outage can be retransmitted many times before the send sequence closes at CreateContractTimeout (30s default). At a pre-outage RTT of 50ms, and assuming the scaled RTT stays close to that value because no new acks arrive to update it, a single packet could be sent ~600 times in that window (30s / 50ms).

2. Multi-client race duplicates packets (ip_remote_multi_client.go ~lines 1245–1264)

When no established client exists for a path, MultiRaceClientCount clients race in parallel. Each receives a MessagePoolShareReadOnly copy of the packet and sends it independently:

for _, client := range raceOrderedClients {
    go send(client) // each client sends its own copy
}

During an outage, the window expansion loop (ip_remote_multi_client.go ~line 2106) continues creating new clients every 15s (WindowResizeTimeout). More clients in the window means more copies of each packet sent per resend cycle.
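To show how the two effects multiply, here is a toy model of the fan-out. It is a sketch under stated assumptions, not the code in ip_remote_multi_client.go: raceSend and totalWrites are hypothetical names, the client count of 4 is made up, and each "write" is just a counter increment.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// raceSend models the fan-out: one packet is handed to every racing client
// and each client writes its own copy. It returns the number of writes made.
func raceSend(clients int) int64 {
	var writes int64
	var wg sync.WaitGroup
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&writes, 1) // stands in for one client's write
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&writes)
}

// totalWrites repeats the fan-out once per resend cycle.
func totalWrites(resendCycles, clients int) int64 {
	var total int64
	for i := 0; i < resendCycles; i++ {
		total += raceSend(clients)
	}
	return total
}

func main() {
	// 600 resend cycles (the 50ms RTT figure from this issue) and a
	// hypothetical window of 4 racing clients.
	fmt.Println(totalWrites(600, 4)) // 2400
}
```

In this model the total is simply resend cycles times clients, so every client added by the window expansion loop raises the per-packet cost by another full set of resends.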

3. No global circuit breaker

There is no mechanism that detects a backend outage and halts packet acceptance or resend attempts across all sequences. Each send sequence independently retries for up to 30 seconds before closing.

Expected vs observed behavior

  • Expected: when the API is unreachable, traffic stops or degrades gracefully to near-zero
  • Observed: large volumes of traffic continue to flow (and amplify), with each in-flight packet resent for up to 30s until its send sequence closes

Suggested investigation areas

  • Adding a per-packet max resend count to the resend queue
  • Coordinating outage detection across sequences (a shared backoff flag similar to CreateContractOobErrorBackoff, added in #180 "contract,transport: reduce log spam during backend outages", but for the data plane)
  • Reviewing whether MultiRaceClientCount should be reduced or bypassed when the platform is known to be degraded
