sandbox: prevent duplicate bridge IP assignment (MIR-1238)#856
Conversation
The sandbox IPAM handed the same rt0 address to two live sandboxes on a runner, causing ARP conflicts, martian-source drops, and connection flapping for the affected apps. netdb cannot itself hold two reserved rows for one IP (ip is a PRIMARY KEY); the bug is a divergence between netdb's lease bookkeeping and the addresses actually live on the bridge. A sandbox's lease was released in netdb while its container kept running, and the now-"free" address was later handed to a second sandbox. The container watchdog is the source of the premature release: it freed an IP whenever a labeled container was missing from the node-scoped sandbox list (which is built from an index that can lag), released it before the container was actually removed, and released it even when removal failed. Defense in depth: - watchdog: release an IP only after the container is confirmed removed (leave it reserved on failure); before treating a container as orphaned, re-fetch its sandbox entity directly by ID (a linearizable key lookup, not an index read) so a sandbox the index transiently omitted is never killed or has its IP released. - AllocateOnBridge: verify a freshly reserved address is not already live on the bridge (from running-container labels); on conflict keep it quarantined and retry, so a divergence can never surface as a duplicate handed to a new sandbox. - IPReconciler: periodically (and on boot) re-reserve any address live on the bridge but free in netdb, and conservatively reap reservations with no live owner after several consecutive misses. Adds unit tests for netdb concurrency/ReservedAddrs, the conflict-free reserve loop, and the reconciler, plus an iso regression test that a running sandbox missing from the node index is kept.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThis PR adds IP reconciliation and conflict-aware allocation to keep netdb IP reservations aligned with actual live bridge usage. The changes introduce a new Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
controllers/sandbox/watchdog.go (1)
152-152:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix typo in comment.
"sandboes" should be "sandboxes".
📝 Proposed fix
- // Consider all non DEAD sandboes as valid. + // Consider all non DEAD sandboxes as valid.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controllers/sandbox/watchdog.go` at line 152, Fix the typo in the comment that currently reads "Consider all non DEAD sandboes as valid."—change "sandboes" to "sandboxes" so the comment reads "Consider all non DEAD sandboxes as valid." Locate this comment string in controllers/sandbox/watchdog.go and update the text accordingly.
🧹 Nitpick comments (1)
pkg/netdb/netdb.go (1)
436-440: ⚡ Quick winLog unparseable IP addresses during scan.
Silently skipping rows that fail
netip.ParseAddrcould mask database corruption or stale/invalid data without alerting operators. Add a log statement beforecontinueso administrators can investigate and repair the corrupted entries.📊 Suggested logging for parse failures
addr, err := netip.ParseAddr(ipStr) if err != nil { - // Skip rows that can't be parsed rather than failing the whole scan. + s.netdb.Log.Warn("skipping unparseable IP in reserved addresses scan", + "ip", ipStr, "subnet", s.net.String(), "error", err) continue }Note: This assumes a logger is available on
s.netdb. If not, consider passing a logger toReservedAddrs()or using a package-level logger.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/netdb/netdb.go` around lines 436 - 440, In ReservedAddrs (or the function containing the netip.ParseAddr call), add a log call that reports the unparseable ipStr before the continue so parse failures are visible; reference the netip.ParseAddr invocation and use the service/struct logger (s.netdb or another logger passed into ReservedAddrs) to log a message including the offending ipStr and the parse error to aid investigation of corrupted or stale DB entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@controllers/sandbox/watchdog.go`:
- Line 152: Fix the typo in the comment that currently reads "Consider all non
DEAD sandboes as valid."—change "sandboes" to "sandboxes" so the comment reads
"Consider all non DEAD sandboxes as valid." Locate this comment string in
controllers/sandbox/watchdog.go and update the text accordingly.
---
Nitpick comments:
In `@pkg/netdb/netdb.go`:
- Around line 436-440: In ReservedAddrs (or the function containing the
netip.ParseAddr call), add a log call that reports the unparseable ipStr before
the continue so parse failures are visible; reference the netip.ParseAddr
invocation and use the service/struct logger (s.netdb or another logger passed
into ReservedAddrs) to log a message including the offending ipStr and the parse
error to aid investigation of corrupted or stale DB entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: acaa6416-6826-4322-ae31-38c0c6d92650
📒 Files selected for processing (10)
controllers/sandbox/ipreconciler.gocontrollers/sandbox/ipreconciler_test.gocontrollers/sandbox/sandbox.gocontrollers/sandbox/sandbox_frozen_test.gocontrollers/sandbox/watchdog.gocontrollers/sandbox/watchdogtest/container_test.gonetwork/config.gonetwork/config_test.gopkg/netdb/netdb.gopkg/netdb/netdb_test.go
There was a problem hiding this comment.
Review: sandbox — prevent duplicate bridge IP assignment (MIR-1238)
This PR tackles a real, concrete bug: netdb lease bookkeeping can diverge from what's live on the bridge, and the divergence causes a second sandbox to receive an IP already in use. The fix is multi-layered — a conflict-check at allocation time, a repair/reap reconciler, and a hardening of the watchdog's orphan-detection against stale index reads. I read through all ten changed files, checked the entity-access plumbing, and exercised a few edge-case paths mentally. Here's what I found.
What the change does
graph LR
A["AllocateOnBridge"] -->|"calls inUse() first"| B["reserveConflictFree<br/>(network/config.go)"]
B -->|"skips live addrs,<br/>quarantines them"| C["netdb.Subnet.Reserve()"]
D["IPReconciler<br/>(ipreconciler.go)"] -->|"every 5 min"| E["liveBridgeIPs<br/>(sandbox.go)"]
E -->|"reads containerd labels"| F["Re-reserves lost addrs<br/>or reaps stale leases"]
G["ContainerWatchdog<br/>(watchdog.go)"] -->|"per-container re-fetch"| H["sandboxStillValid<br/>(EAC.Get — linearizable)"]
H -->|"stale index guard"| I["Only then mark orphaned<br/>and release IP after removal"]
Issues I found
1. CheckInterval is hard-coded in sandbox.go but IPReconciler already has a default (minor redundancy, not a bug)
In sandbox.go line 567, CheckInterval: 5 * time.Minute is set explicitly on the struct. The IPReconciler.Start() already supplies this default when CheckInterval == 0. The explicit value wins over the default, so behaviour is correct, but it's confusing to maintain in two places — a future reader might change the default in Start() and wonder why it has no effect.
2. reap logic: the miss counter is never cleared when ReleaseAddr fails
In ipreconciler.go lines 143–146:
if err := r.Subnet.ReleaseAddr(addr); err != nil {
r.Log.Error(...)
continue // ← miss counter for this addr is NOT deleted
}
delete(r.misses, addr)After a failed release, r.misses[addr] stays at ReleaseAfterMisses (the threshold). On the very next cycle, r.misses[addr]++ pushes it above the threshold again, and a release is attempted again. This is actually fine and arguably desirable (keep retrying), but the subsequent cleanup block at lines 154–158 will not remove it from the map either (because it's still in reservedSet). So for a persistently-failing release, the miss counter grows indefinitely. Not a correctness bug, but a map that grows without bound is worth noting.
3. sandboxStillValid — the errors.Is(err, cond.ErrNotFound{}) idiom
Looking at cond.ErrNotFound.Is():
func (e ErrNotFound) Is(target error) bool {
_, ok := target.(ErrNotFound)
return ok
}errors.Is passes the target to the receiver's .Is() method. Here the receiver is err (the actual error), and target is cond.ErrNotFound{}. So the matching is checking target.(ErrNotFound) — if target is ErrNotFound, it returns true. This is the correct use of the pattern (same as all other callsites in the repo). No bug.
However, there's a subtler question: does EAC.Get ever return a cond.ErrNotFound for a genuinely absent entity, or does it return (resp, nil) with resp.HasEntity() == false? Looking at the generated EntityAccessClient.Get implementation (line 2875), it surfaces RPC errors including not-found via cond.RemoteError which maps "not-found" back to cond.ErrNotFound. The !resp.HasEntity() check on line 355 handles the success-but-absent case. Both paths return false (treat as gone), so the logic is correct.
4. liveBridgeIPs does not filter by container state — dead containers count as live
liveBridgeIPs (sandbox.go ~line 1390) iterates all containerd containers with our entity label and reads their IP label, regardless of whether the task is running or stopped. A container whose task has exited but whose containerd record still exists (e.g. between SIGKILL and removeContainer) will contribute its IP to the live set, causing the reconciler to re-reserve that address and possibly block its reuse. This is conservative (fail-safe direction), but could cause an address to stay reserved for an extra reconciler cycle (5–15 minutes by default) after the watchdog's removal attempt fails and before the reconciler reaps it. Given the conservative reap threshold (3 misses × 5 min = 15 min), this window is probably fine, but it's worth documenting the assumption.
5. allocBridgeMaxAttempts = 32 is not explicitly tested
The test TestReserveConflictFree covers the happy path and the nil-live-set path, but doesn't verify the allocBridgeMaxAttempts exhaustion path returns an error. A /24 subnet with 252 usable addresses has enough headroom that a conflict storm of 32+ is extremely unlikely, but the failure mode (returning an error that blocks sandbox creation) is production-impacting enough to warrant a test case.
6. Watchdog: sandboxStillValid is called for every container not in validContainers, including ones the index explicitly returned as invalid
The per-ID re-fetch is done for every container absent from validContainers. This means containers whose sandbox entities were explicitly returned by the index lookup as DEAD (past grace) also trigger a re-fetch. The current path is still correct (the re-fetch will return false for them), but it adds N extra RPCs where N is the number of stale containers — one per watchdog cycle. In a heavily-drained node this is harmless; on a busy system with many legitimately-dead sandboxes it's a small extra load. Not a blocker, but worth a comment.
What's done well
- The decision to release the IP only after confirmed container removal (watchdog.go, lines 293–307) directly closes the specific race. This is the right ordering.
- The
reserveConflictFreequarantine strategy (leave conflicting IPs reserved rather than releasing them) is safe: it prevents re-issue of a potentially live address without requiring the caller to handle cleanup. sandboxStillValid's fail-safereturn trueon any non-not-found error is the correct conservative choice for a component protecting live IP assignments.IPReconciler.Start()runs an immediate initial reconcile before the first ticker fires, which means a restart heals any bookkeeping divergence that accumulated during downtime without waiting 5 minutes.- Test coverage is real: the three
IPReconcilerunit tests cover repair, threshold-based reaping, and the no-reap-while-live invariant. The watchdog integration test directly exercises the stale-index guard scenario.
Summary
The core logic is sound and the changes are well-motivated. The concerns above are mostly caveats rather than blockers: the most actionable ones are the growing miss-counter map (issue 2) and the lack of an exhaustion-path test for reserveConflictFree (issue 5). Neither prevents merging, but both are cheap to fix and I'd feel better seeing them addressed.
| } | ||
|
|
||
| if err := r.Subnet.ReleaseAddr(addr); err != nil { | ||
| r.Log.Error("ip reconciler failed to release leaked address", "addr", addr, "error", err) |
There was a problem hiding this comment.
When ReleaseAddr fails, r.misses[addr] is left at the threshold value. On the next cycle, r.misses[addr]++ pushes it above the threshold and another release is attempted — which is probably the desired retry behaviour. However, the miss counter for this address will never be cleaned up by the guard block below (lines 154–158) as long as the address stays reserved, so it grows unboundedly on each failed cycle. Consider either capping the counter at ReleaseAfterMisses after a failure, or adding a comment explaining the retry-forever intent.
| Log: c.Log.With("module", "ip-reconciler"), | ||
| Subnet: c.Subnet, | ||
| LiveIPs: func(ctx context.Context) (map[netip.Addr]bool, error) { | ||
| return c.liveBridgeIPs(ctx) |
There was a problem hiding this comment.
CheckInterval: 5 * time.Minute is already the default applied in IPReconciler.Start() when CheckInterval == 0. Setting it here makes the two places diverge silently when one is changed. Either remove the explicit assignment here and rely on the default, or add a named constant that both sites share.
| // already marked it reserved, so it is quarantined and won't be handed out | ||
| // again) and the next address is tried. | ||
| func reserveConflictFree(log *slog.Logger, bridge string, subnet *netdb.Subnet, live map[netip.Addr]bool) (netip.Prefix, error) { | ||
| for attempt := 0; ; attempt++ { |
There was a problem hiding this comment.
The allocBridgeMaxAttempts exhaustion path (returning an error after 32 conflicted addresses) has no test. This error blocks sandbox creation, so it would be useful to have a test that pre-fills the first 32 addresses as live and verifies the function returns a clear error rather than silently succeeding with an unexpected address.
| // directly by ID — a linearizable key lookup, not an index read. If the | ||
| // sandbox still exists and is not DEAD past the grace window, the index | ||
| // merely missed it: never kill it or release its IP (MIR-1238). | ||
| if w.sandboxStillValid(cleanupCtx, sc.sandboxID, now, graceWindow) { |
There was a problem hiding this comment.
sandboxStillValid is called for every container not in validContainers, including those whose sandbox entities were explicitly returned by the node-scoped index as DEAD past the grace window. For those the re-fetch is redundant (it'll return false again). Consider only calling sandboxStillValid when the sandbox ID is absent from the index response entirely — i.e. when validContainers never included it — not when it was present and expired.
- ipreconciler: cap the miss counter on a failed ReleaseAddr so a persistently failing release retries every cycle without the counter (and map entry) growing without bound. - watchdog: only re-fetch a sandbox entity when the node index has no record of it; when the index returned it but it is DEAD past grace, that verdict is authoritative, so skip the redundant per-cycle RPC. - sandbox: rely on the IPReconciler's default CheckInterval instead of duplicating the 5-minute value; document that liveBridgeIPs does not filter on task state and why that is conservative. - network: test the allocBridgeMaxAttempts exhaustion path returns a clear error.
There was a problem hiding this comment.
Re-review: sandbox — prevent duplicate bridge IP assignment (MIR-1238)
I re-read this PR at the updated head (01d7300) — there are now two commits. All four of the specific issues I flagged as inline comments in my first review have been addressed. Here's the accounting:
What changed since my first review
Issue #1 — CheckInterval in two places (sandbox.go)
Fixed. The IPReconciler initialization in sandbox.go (around line 569) no longer hard-codes CheckInterval: 5 * time.Minute. The comment now reads // CheckInterval is left at the IPReconciler default (see Start). The default in Start() is the single source of truth.
Issue #2 — Miss counter grows without bound on persistent ReleaseAddr failure
Fixed. ipreconciler.go lines 145–148 now cap the counter:
// Cap the counter so a persistently failing release retries every
// cycle without letting the miss count (and the map entry) grow
// without bound.
r.misses[addr] = r.ReleaseAfterMissesThat's exactly what I asked for: retry-every-cycle semantics with a bounded map.
Issue #5 — allocBridgeMaxAttempts exhaustion path not tested
Fixed. network/config_test.go now contains "errors when every candidate up to the attempt limit is live", which pre-fills .2 through .2+32 as live and asserts reserveConflictFree returns an error containing "no conflict-free address". The quarantine behaviour (.2 stays reserved after being skipped) is also verified in the first sub-test.
Issue #6 — sandboxStillValid called redundantly for index-returned DEAD sandboxes
Fixed cleanly. The updated CleanupOrphanedContainers builds indexedSandboxes (a set of all sandbox IDs the index returned, regardless of state), and the re-fetch guard is:
if !indexedSandboxes[sc.sandboxID] && w.sandboxStillValid(...) {sandboxStillValid is now only invoked for sandboxes completely absent from the index response. If the index returned a sandbox and it's DEAD past the grace window, the existing verdict is used directly — no extra RPC. That's the precise fix I requested.
Remaining observations from my first review (still fine)
- Issue #3 (the
errors.Iscorrectness question) was always correct — no change needed, confirmed again. - Issue #4 (
liveBridgeIPsconservatively counts exiting containers as live) is now explicitly documented in a multi-line comment onliveBridgeIPsitself explaining why this is intentional and what the worst-case consequence is. The reasoning is sound.
A small residual note
The typo sandboes (for "sandboxes") at watchdog.go line 159 was flagged by the other bot reviewer and is still present. It's in a comment, not code, and it doesn't affect correctness at all — but it's cheap to fix.
Overall
The core logic is sound and was already so in the first review. This round of updates closes every actionable concern I raised. The test suite is substantively better: the exhaustion path test, the quarantine invariant, the concurrent-reservation stress test in netdb_test.go, and the integration test for index-missed live sandboxes all give real confidence in the properties this change is claiming. I'm comfortable marking this ready to merge.
|
|
||
| indexedSandboxes[sb.ID.String()] = true | ||
|
|
||
| // Consider all non DEAD sandboes as valid. |
There was a problem hiding this comment.
Typo in comment: sandboes → sandboxes. This was also flagged by the other reviewer; trivial to fix.
Problem
On garden runner-1 the sandbox IPAM handed
10.8.64.2to two live sandboxes (reviewagent + sportsagent) on thert0bridge, causing ARP conflicts,martian sourcekernel drops, and constant flapping of the apps' long-lived chitchat WebSocket connections (connection reset by peer/i/o timeout) — surfacing to users as intermittent "empty reply (no responder)".Root cause
netdb cannot itself hold two reserved rows for one IP (
ipis aPRIMARY KEY). The bug is a divergence between netdb's lease bookkeeping and the addresses actually live on the bridge: a sandbox's.2lease was released in netdb while its container kept running on.2, and ~19h later the now-"free".2was handed to a second sandbox.The container watchdog caused the premature release. When a labeled container was missing from the node-scoped sandbox list (built from an index that can lag), it:
A transient/stale index → a live sandbox looks orphaned → its IP returns to the pool → reused.
Fix (defense in depth)
controllers/sandbox/watchdog.go): release an IP only after the container is confirmed removed (leave it reserved on failure); before treating a container as orphaned, re-fetch its sandbox entity directly by ID (a linearizable key lookup, not an index read) so a sandbox the index transiently omitted is never killed or has its IP released.network/config.go,sandbox.go):AllocateOnBridgeverifies a freshly reserved address isn't already live on the bridge (from running-container labels — no netns entry, no ARP); on conflict it keeps the address quarantined and retries, so a divergence can never surface as a duplicate.controllers/sandbox/ipreconciler.go): periodically (and on boot) re-reserves any address live on the bridge but free in netdb (repairing the exact divergence), and conservatively reaps reservations with no live owner after several consecutive misses.Tests
Reservenever duplicates;ReservedAddrs.reserveConflictFreeskips a live-but-free address and quarantines it.make lint: 0 issues. iso:controllers/sandbox(120),watchdogtest(14),pkg/netdball pass.Notes
sandbox.gois a frozen file (saga-transition guard); the changes are in shared code (AllocateNetwork,Initlifecycle, both reached by the saga path viainner), so the frozen hash was updated per the test's documented procedure.