Summary
During URnetwork backend outages, significant volumes of traffic leak through the data plane even though the control plane (API) is unreachable. This was identified during code review on #180.
Suspected mechanism
The resend queue in transfer.go and the multi-client race in ip_remote_multi_client.go interact to amplify traffic when the platform is degraded:
1. Resend queue loops on ack loss (transfer.go ~line 1569–1605)
When the platform is degraded, writes to providers can still succeed (data plane connections may remain up) but acks are never returned. The resend queue re-queues items unconditionally after each write — whether the write succeeded or failed — using the last known scaled RTT as the retry interval:
err := c()
if err != nil {
	glog.V(1).Infof("[s]resend drop = %s", err)
}
// re-queued regardless of whether write succeeded or failed
item.sendCount += 1
item.resendTime = sendTime.Add(self.rttWindow.ScaledRtt())
self.resendQueue.Add(item)
A packet in flight at the time of the outage can be retransmitted many times before the send sequence closes at CreateContractTimeout (30s default). At a pre-outage RTT of 50ms, a single packet could be sent ~600 times in that window (30s / 50ms = 600 resend intervals).
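The ~600 figure, and the effect of the per-packet resend cap suggested later in this issue, can be checked with a small sketch. It assumes the resend interval stays constant at the pre-outage RTT; `resendAttempts` and the cap value of 8 are hypothetical and do not exist in transfer.go.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultTimeout = 30 * time.Second      // CreateContractTimeout default, per this issue
	assumedRTT     = 50 * time.Millisecond // assumed pre-outage scaled RTT
)

// resendAttempts estimates how many times one unacked packet is written
// when it is re-queued every interval until timeout elapses.
// maxResend <= 0 means "no cap", which is the behavior described above.
func resendAttempts(timeout, interval time.Duration, maxResend int) int {
	if interval <= 0 {
		return 0
	}
	attempts := int(timeout / interval)
	if maxResend > 0 && attempts > maxResend {
		attempts = maxResend
	}
	return attempts
}

func main() {
	fmt.Println(resendAttempts(defaultTimeout, assumedRTT, 0)) // uncapped: 600
	fmt.Println(resendAttempts(defaultTimeout, assumedRTT, 8)) // hypothetical cap: 8
}
```

A real cap would also need to decide what happens to the packet once the limit is reached (drop it, or close the sequence early); the sketch only models the count.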
2. Multi-client race duplicates packets (ip_remote_multi_client.go ~line 1245–1264)
When no established client exists for a path, MultiRaceClientCount clients race in parallel. Each receives a MessagePoolShareReadOnly copy of the packet and sends it independently:
for _, client := range raceOrderedClients {
	go send(client) // each client sends its own copy
}
During an outage, the window expansion loop (ip_remote_multi_client.go ~line 2106) continues creating new clients every 15s (WindowResizeTimeout). More clients in the window means more copies of each packet sent per resend cycle.
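As a rough worst-case model of how the two mechanisms compound, the sketch below counts the copies of a single packet written over one send sequence. It assumes every resend cycle fans out to every client currently in the window. The initial client count of 4 and the one client added per resize are made-up illustration values, not the actual MultiRaceClientCount or window growth rate, and `copiesPerPacket` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

const (
	sequenceTimeout = 30 * time.Second      // CreateContractTimeout default, per this issue
	resendInterval  = 50 * time.Millisecond // assumed pre-outage scaled RTT
	resizeInterval  = 15 * time.Second      // WindowResizeTimeout, per this issue
)

// copiesPerPacket sums, over every resend cycle in the timeout window,
// the number of clients assumed to be sending a copy at that moment.
// The client count grows by addedPerResize each time resize elapses.
func copiesPerPacket(timeout, interval, resize time.Duration, initialClients, addedPerResize int) int {
	if interval <= 0 || resize <= 0 {
		return 0
	}
	total := 0
	for elapsed := time.Duration(0); elapsed < timeout; elapsed += interval {
		clients := initialClients + int(elapsed/resize)*addedPerResize
		total += clients
	}
	return total
}

func main() {
	// Single client, no window growth: matches the resend-only estimate.
	fmt.Println(copiesPerPacket(sequenceTimeout, resendInterval, resizeInterval, 1, 0)) // 600
	// Hypothetical: 4 racing clients, 1 more after the 15s resize.
	fmt.Println(copiesPerPacket(sequenceTimeout, resendInterval, resizeInterval, 4, 1)) // 2700
}
```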
3. No global circuit breaker
There is no mechanism that detects a backend outage and halts packet acceptance or resend attempts across all sequences. Each send sequence independently retries for up to 30 seconds before closing.
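One possible shape for such a mechanism is a process-wide gate that every send sequence consults before resending. This is a minimal sketch, not existing code: `OutageBreaker`, its methods, and the 5s silence threshold are all hypothetical, and it treats "no ack from any sequence for a while" as the outage signal, which is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Fixed reference values so the example is deterministic.
var start = time.Unix(0, 0)

const maxSilence = 5 * time.Second // hypothetical threshold

// OutageBreaker is a hypothetical gate shared by all send sequences.
// It blocks resends when no ack from any sequence has been seen for
// longer than maxSilent.
type OutageBreaker struct {
	mu        sync.Mutex
	lastAck   time.Time
	maxSilent time.Duration
}

func NewOutageBreaker(now time.Time, maxSilent time.Duration) *OutageBreaker {
	return &OutageBreaker{lastAck: now, maxSilent: maxSilent}
}

// RecordAck notes that some sequence received an ack at time now.
func (b *OutageBreaker) RecordAck(now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if now.After(b.lastAck) {
		b.lastAck = now
	}
}

// AllowResend reports whether resends should proceed at time now.
func (b *OutageBreaker) AllowResend(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return now.Sub(b.lastAck) <= b.maxSilent
}

func main() {
	b := NewOutageBreaker(start, maxSilence)
	fmt.Println(b.AllowResend(start.Add(1 * time.Second))) // true: ack seen recently
	fmt.Println(b.AllowResend(start.Add(6 * time.Second))) // false: 6s without any ack
	b.RecordAck(start.Add(7 * time.Second))
	fmt.Println(b.AllowResend(start.Add(8 * time.Second))) // true: acks have resumed
}
```

Because the gate only covers resends, first transmissions would still go out and act as probes, so the breaker can close again once acks return. Whether an idle link should count as silence is a design question this sketch leaves open.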
Expected vs observed behavior
Expected: when the API is unreachable, traffic stops or degrades gracefully to near-zero
Observed: large volumes of traffic continue to flow (and amplify) for up to 30s per sequence per packet
Suggested investigation areas
Adding a per-packet max resend count to the resend queue
[…] CreateContractOobErrorBackoff added in "contract,transport: reduce log spam during backend outages" #180, but for the data plane)
[…] MultiRaceClientCount should be reduced or bypassed when the platform is known to be degraded
Related