device: reduce reconnect delay after server restart from 120 s to ~5 s #2

Open
full-bars wants to merge 1 commit into urnetwork:master from full-bars:fix/session-rotation-on-drain-startup

@full-bars

Problem

When the WireGuard server is restarted or redeployed, clients still hold session keypairs that the new instance knows nothing about. WireGuard has no protocol-level "server is going away" message, so clients keep using the stale session until it comes up for rekeying at RekeyAfterTime (120 s). During that window, packets are silently dropped.

Raised in the URnetwork Discord: the userspacewireguard fork was identified as the right place to fix this because the kernel implementation has no equivalent hook.

Changes

1. Server-initiated handshake on startup (device.go, upLocked)

SendHandshakeInitiation is now called for every configured peer when the device comes up, in addition to the existing persistent-keepalive send. The new server proactively reaches out to all peers; clients respond within RekeyTimeout (5 s). Reconnect time after a restart drops from up to 120 s to under 5 s for peers with a known endpoint.
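As a rough illustration of the startup loop described above, here is a minimal, self-contained sketch. It does not use the fork's real types: `peer`, `sendKeepalive`, `sendHandshakeInitiation`, and `bringUp` are hypothetical stand-ins, and the rule that a peer without a known endpoint cannot be contacted is an assumption made for the model.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for the fork's Peer type; it only
// carries the fields this sketch needs.
type peer struct {
	hasEndpoint         bool // false models a peer whose endpoint is not yet known
	persistentKeepalive int  // seconds; 0 means disabled
	keepalivesSent      int
	handshakesSent      int
}

// sendKeepalive is a stub that only counts calls.
func (p *peer) sendKeepalive() { p.keepalivesSent++ }

// sendHandshakeInitiation is a stub. It reports whether an initiation
// could be sent; in this model that requires a known endpoint.
func (p *peer) sendHandshakeInitiation() bool {
	if !p.hasEndpoint {
		return false
	}
	p.handshakesSent++
	return true
}

// bringUp models the device coming up: the existing keepalive send stays,
// and a handshake initiation is now attempted for every configured peer.
// It returns how many peers were actually contacted.
func bringUp(peers []*peer) int {
	contacted := 0
	for _, p := range peers {
		if p.persistentKeepalive > 0 {
			p.sendKeepalive()
		}
		if p.sendHandshakeInitiation() {
			contacted++
		}
	}
	return contacted
}

func main() {
	peers := []*peer{
		{hasEndpoint: true, persistentKeepalive: 25},
		{hasEndpoint: false},
	}
	fmt.Println("peers contacted on startup:", bringUp(peers))
}
```

In this model the peer without an endpoint is skipped, which corresponds to the slower "dynamic endpoint" case: the server has to wait for the client to send first.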

2. DrainPeers method + Config.Drain flag (device.go / uapi.go)

DrainPeers() calls ExpireCurrentKeypairs() on every peer — already implemented on Peer, just not wired up at the device level. Exhausting the send nonce makes the client's very next outbound packet trigger a fresh handshake instead of silently failing.

Config.Drain bool exposes this via IpcSet2 so deployment scripts can signal a drain through the existing IPC path before bringing the old process down:

device.IpcSet2(&device.Config{Drain: true})
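The drain path can be pictured with a small toy model. Everything in it is illustrative: the lowercase `device`, `peer`, `keypair`, and `config` types, the `ipcSet` method, and the `rejectAfterMessages` value are placeholders chosen for the sketch, not the fork's actual definitions.

```go
package main

import "fmt"

// rejectAfterMessages is a placeholder limit for this sketch, not the
// value of the real constant.
const rejectAfterMessages = 1 << 60

// keypair and peer are hypothetical, stripped-down stand-ins.
type keypair struct{ sendNonce uint64 }

type peer struct{ current *keypair }

// expireCurrentKeypairs pushes the send nonce to the limit so the
// keypair can no longer be used to encrypt outbound packets.
func (p *peer) expireCurrentKeypairs() {
	if p.current != nil {
		p.current.sendNonce = rejectAfterMessages
	}
}

// needsHandshake reports whether the next send on this peer would have
// to start a new handshake instead of reusing the current keypair.
func (p *peer) needsHandshake() bool {
	return p.current == nil || p.current.sendNonce >= rejectAfterMessages
}

// config models only the new Drain flag.
type config struct{ Drain bool }

type device struct{ peers []*peer }

// drainPeers expires the current keypairs of every peer.
func (d *device) drainPeers() {
	for _, p := range d.peers {
		p.expireCurrentKeypairs()
	}
}

// ipcSet applies a config; only the Drain flag is modelled here.
func (d *device) ipcSet(c *config) {
	if c.Drain {
		d.drainPeers()
	}
}

func main() {
	d := &device{peers: []*peer{{current: &keypair{sendNonce: 7}}}}
	fmt.Println("needs handshake before drain:", d.peers[0].needsHandshake())
	d.ipcSet(&config{Drain: true})
	fmt.Println("needs handshake after drain:", d.peers[0].needsHandshake())
}
```

Routing the drain through the config setter keeps it on the same path deployment scripts already use, so no separate control channel is needed.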

Behaviour summary

| Scenario | Before | After |
| --- | --- | --- |
| Server restarts, peer has known endpoint | up to 120 s | ~5 s |
| Server restarts, peer has dynamic endpoint | up to 120 s | up to PersistentKeepalive + 15 s |
| Graceful drain before restart | up to 120 s | immediate re-handshake on next send |

Notes

  • bindtest compile errors are pre-existing on master and unrelated to these changes (./device/... builds cleanly).
  • A proper protocol-level CloseNotify message would be the cleanest long-term fix, but these two hooks address the problem without any protocol changes and are backwards-compatible.

Without a way to notify clients of a server restart, clients wait up to
RekeyAfterTime (120 s) before re-handshaking with the new instance.

Two changes to close that gap:

- upLocked: call SendHandshakeInitiation for every peer when the device
  comes up. The new server proactively reaches out to all configured
  peers so they can re-establish in under RekeyTimeout (5 s) rather
  than waiting for natural session expiry.

- DrainPeers / Config.Drain: add an explicit drain signal. Calling
  DrainPeers() (or setting Drain: true in IpcSet2) expires all current
  keypairs so the next send from each client triggers an immediate
  re-handshake instead of silently using a session the restarted server
  no longer knows about.