Bandwidth amplification during backend outages (resend queue + ack loss) #181

## Summary

During URnetwork backend outages, significant volumes of traffic leak through the data plane even though the control plane (API) is unreachable. This was identified during code review on #180.

## Suspected mechanism

The resend queue in `transfer.go` and the multi-client race in `ip_remote_multi_client.go` interact to amplify traffic when the platform is degraded:

**1. Resend queue loops on ack loss** (`transfer.go` ~line 1569–1605)

When the platform is degraded, writes to providers can still succeed (data plane connections may remain up) but acks are never returned. The resend queue re-queues items unconditionally after each write — whether the write succeeded or failed — using the last known scaled RTT as the retry interval:

```go
err := c()
if err != nil {
    glog.V(1).Infof("[s]resend drop = %s", err)
}
// re-queued regardless of whether write succeeded or failed
item.sendCount += 1
item.resendTime = sendTime.Add(self.rttWindow.ScaledRtt())
self.resendQueue.Add(item)
```

A packet in flight at the time of the outage can be retransmitted many times before the send sequence closes at `CreateContractTimeout` (30s default). At a pre-outage RTT of 50ms, and assuming the scaled RTT stays close to that value, a single packet could be sent ~600 times in that window (30s / 50ms).
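
The ~600 figure follows from dividing the timeout by the resend interval. A minimal sketch of that arithmetic, assuming the resend interval stays fixed at the pre-outage value for the whole window (`maxResends` is a hypothetical helper, not a function in `transfer.go`):

```go
package main

import (
	"fmt"
	"time"
)

// maxResends estimates how many times one unacked item is written when it is
// re-queued at a fixed interval until the send sequence times out.
// Hypothetical helper for illustration only.
func maxResends(timeout, resendInterval time.Duration) int {
	if resendInterval <= 0 {
		return 0 // avoid dividing by zero for an unset RTT
	}
	return int(timeout / resendInterval)
}

func main() {
	// 30s timeout, 50ms resend interval
	fmt.Println(maxResends(30*time.Second, 50*time.Millisecond)) // 600
}
```

If the scaled RTT is a multiple of the measured RTT, the count shrinks by the same factor, so the estimate is an upper bound under this assumption.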

**2. Multi-client race duplicates packets** (`ip_remote_multi_client.go` ~line 1245–1264)

When no established client exists for a path, `MultiRaceClientCount` clients race in parallel. Each receives a `MessagePoolShareReadOnly` copy of the packet and sends it independently:

```go
for _, client := range raceOrderedClients {
    go send(client) // each client sends its own copy
}
```

During an outage, the window expansion loop (`ip_remote_multi_client.go` ~line 2106) continues creating new clients every 15s (`WindowResizeTimeout`). More clients in the window means more copies of each packet sent per resend cycle.
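
To get a rough sense of how the two effects compound, the sketch below counts packet copies for a single unacked packet when every resend cycle fans out to all clients currently in the window. It is a back-of-the-envelope model, not the actual scheduling logic: the initial client count of 4 and the growth of one client per resize are assumed values chosen for illustration, and `totalCopies` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// totalCopies counts the copies of one unacked packet sent before the
// sequence times out, assuming each resend cycle sends one copy per client
// and the window gains clientsPerResize clients every resizeInterval.
// Hypothetical model for illustration only.
func totalCopies(timeout, resendInterval, resizeInterval time.Duration, initialClients, clientsPerResize int) int {
	if resendInterval <= 0 || resizeInterval <= 0 {
		return 0 // guard against an endless loop or a division by zero
	}
	total := 0
	for t := time.Duration(0); t < timeout; t += resendInterval {
		clients := initialClients + int(t/resizeInterval)*clientsPerResize
		total += clients
	}
	return total
}

func main() {
	// 30s timeout, 50ms resend interval, 15s resize interval,
	// 4 initial clients (assumed), 1 new client per resize (assumed)
	fmt.Println(totalCopies(30*time.Second, 50*time.Millisecond, 15*time.Second, 4, 1)) // 2700
}
```

Under these assumptions the first 300 cycles send 4 copies each and the next 300 send 5 each, for 2700 copies of a single packet.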

**3. No global circuit breaker**

There is no mechanism that detects a backend outage and halts packet acceptance or resend attempts across all sequences. Each send sequence independently retries for up to 30 seconds before closing.
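
One possible shape for such a breaker is sketched below. It assumes every send sequence consults a shared object before writing and reports acks to it; `OutageBreaker`, its methods, and the timeout values are all hypothetical and do not exist in the current code. The breaker opens when a send has been waiting for an ack longer than `ackTimeout`, then lets through one probe per `probeInterval` so a recovered backend can close it again.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// OutageBreaker is a hypothetical breaker shared by all send sequences.
type OutageBreaker struct {
	mu            sync.Mutex
	ackTimeout    time.Duration
	probeInterval time.Duration
	pendingSince  time.Time // zero when no send is awaiting an ack
	lastProbe     time.Time
}

func NewOutageBreaker(ackTimeout, probeInterval time.Duration) *OutageBreaker {
	return &OutageBreaker{ackTimeout: ackTimeout, probeInterval: probeInterval}
}

// AllowSend reports whether a write may proceed at time now.
func (b *OutageBreaker) AllowSend(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.pendingSince.IsZero() {
		b.pendingSince = now // first send since the last ack
		return true
	}
	if now.Sub(b.pendingSince) < b.ackTimeout {
		return true // closed: acks are not overdue yet
	}
	// Open: allow a single probe per probeInterval.
	if now.Sub(b.lastProbe) >= b.probeInterval {
		b.lastProbe = now
		return true
	}
	return false
}

// RecordAck closes the breaker. Treating any ack as proof of liveness is a
// simplification; it does not track individual packets.
func (b *OutageBreaker) RecordAck() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pendingSince = time.Time{}
}

// at converts a millisecond offset into a timestamp for the demo.
func at(ms int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(ms) * time.Millisecond)
}

func main() {
	b := NewOutageBreaker(2*time.Second, time.Second)
	for _, ms := range []int{0, 1000, 2500, 2600, 3600} {
		fmt.Println(ms, b.AllowSend(at(ms)))
	}
	b.RecordAck()
	fmt.Println(3700, b.AllowSend(at(3700)))
}
```

In the demo, the sends at 0ms and 1000ms pass, the send at 2500ms goes through as the first probe, the send at 2600ms is rejected, and sends are accepted again once an ack arrives. Whether a single ack should close the breaker, and what the timeouts should be, would need to be decided against real traffic.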

## Expected vs observed behavior

- **Expected**: when the API is unreachable, traffic stops or degrades gracefully to near-zero
- **Observed**: large volumes of traffic continue to flow (and amplify), with each send sequence retransmitting its unacked packets for up to 30s

## Suggested investigation areas

- Adding a per-packet max resend count to the resend queue
- Coordinating outage detection across sequences (a shared backoff flag similar to `CreateContractOobErrorBackoff` added in #180, but for the data plane)
- Reviewing whether `MultiRaceClientCount` should be reduced or bypassed when the platform is known to be degraded
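
As a starting point for the first item, the sketch below combines a per-packet send cap with exponential backoff on the resend delay. It is an illustration under stated assumptions, not a proposed patch: `nextResendDelay`, the cap of 8 sends, and the 1s maximum delay are hypothetical, and the real change would need to fit how `sendCount` and `ScaledRtt()` are used in `transfer.go`.

```go
package main

import (
	"fmt"
	"time"
)

// nextResendDelay returns the delay before the next resend of an item that
// has already been sent sendCount times, doubling the delay on each resend up
// to maxDelay. It returns false once the item has reached maxSendCount and
// should be dropped instead of re-queued. Hypothetical helper.
func nextResendDelay(sendCount int, scaledRtt, maxDelay time.Duration, maxSendCount int) (time.Duration, bool) {
	if sendCount >= maxSendCount {
		return 0, false
	}
	delay := scaledRtt
	if delay <= 0 {
		delay = maxDelay // no RTT sample yet: fall back to the slowest rate
	}
	for i := 1; i < sendCount && delay < maxDelay; i++ {
		delay *= 2
	}
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay, true
}

func main() {
	for sendCount := 1; sendCount <= 8; sendCount++ {
		delay, ok := nextResendDelay(sendCount, 50*time.Millisecond, time.Second, 8)
		fmt.Println(sendCount, delay, ok)
	}
}
```

With a 50ms scaled RTT the delays grow 50ms, 100ms, 200ms, and so on until they reach the 1s cap, and the eighth send is the last, compared with the fixed-interval re-queue in the current loop.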

## Related

- PR #180 (log spam reduction, where this was identified)
