Skip to content

sandbox: prevent duplicate bridge IP assignment (MIR-1238)#856

Open
evanphx wants to merge 2 commits into
mainfrom
mir-1238-duplicate-sandbox-ip-10-8-64-2-assigned-to-two-san
Open

sandbox: prevent duplicate bridge IP assignment (MIR-1238)#856
evanphx wants to merge 2 commits into
mainfrom
mir-1238-duplicate-sandbox-ip-10-8-64-2-assigned-to-two-san

Conversation

@evanphx

@evanphx evanphx commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem

On garden runner-1 the sandbox IPAM handed 10.8.64.2 to two live sandboxes (reviewagent + sportsagent) on the rt0 bridge, causing ARP conflicts, martian source kernel drops, and constant flapping of the apps' long-lived chitchat WebSocket connections (connection reset by peer / i/o timeout) — surfacing to users as intermittent "empty reply (no responder)".

Root cause

netdb cannot itself hold two reserved rows for one IP (ip is a PRIMARY KEY). The bug is a divergence between netdb's lease bookkeeping and the addresses actually live on the bridge: a sandbox's .2 lease was released in netdb while its container kept running on .2, and ~19h later the now-"free" .2 was handed to a second sandbox.

The container watchdog caused the premature release. When a labeled container was missing from the node-scoped sandbox list (built from an index that can lag), it:

  1. released the IP before removing the container, and
  2. released it even when removal failed, and
  3. never re-checked whether the sandbox actually still existed.

A transient/stale index → a live sandbox looks orphaned → its IP returns to the pool → reused.

Fix (defense in depth)

  • Watchdog (controllers/sandbox/watchdog.go): release an IP only after the container is confirmed removed (leave it reserved on failure); before treating a container as orphaned, re-fetch its sandbox entity directly by ID (a linearizable key lookup, not an index read) so a sandbox the index transiently omitted is never killed or has its IP released.
  • Allocation-time guard (network/config.go, sandbox.go): AllocateOnBridge verifies a freshly reserved address isn't already live on the bridge (from running-container labels — no netns entry, no ARP); on conflict it keeps the address quarantined and retries, so a divergence can never surface as a duplicate.
  • IPReconciler (controllers/sandbox/ipreconciler.go): periodically (and on boot) re-reserves any address live on the bridge but free in netdb (repairing the exact divergence), and conservatively reaps reservations with no live owner after several consecutive misses.

Tests

  • netdb: concurrent Reserve never duplicates; ReservedAddrs.
  • network: reserveConflictFree skips a live-but-free address and quarantines it.
  • reconciler: re-reserves a lost live address; reaps only after the miss threshold; never reaps a live address.
  • iso regression test: a RUNNING sandbox missing from the node index is kept, not orphaned.

make lint: 0 issues. iso: controllers/sandbox (120), watchdogtest (14), pkg/netdb all pass.

Notes

  • sandbox.go is a frozen file (saga-transition guard); the changes are in shared code (AllocateNetwork, Init lifecycle, both reached by the saga path via inner), so the frozen hash was updated per the test's documented procedure.
  • The live garden duplicate was mitigated separately by recycling the two colliding sandboxes onto distinct IPs.

The sandbox IPAM handed the same rt0 address to two live sandboxes on a
runner, causing ARP conflicts, martian-source drops, and connection
flapping for the affected apps. netdb cannot itself hold two reserved
rows for one IP (ip is a PRIMARY KEY); the bug is a divergence between
netdb's lease bookkeeping and the addresses actually live on the bridge.
A sandbox's lease was released in netdb while its container kept running,
and the now-"free" address was later handed to a second sandbox.

The container watchdog is the source of the premature release: it freed
an IP whenever a labeled container was missing from the node-scoped
sandbox list (which is built from an index that can lag), released it
before the container was actually removed, and released it even when
removal failed.

Defense in depth:

- watchdog: release an IP only after the container is confirmed removed
  (leave it reserved on failure); before treating a container as
  orphaned, re-fetch its sandbox entity directly by ID (a linearizable
  key lookup, not an index read) so a sandbox the index transiently
  omitted is never killed or has its IP released.

- AllocateOnBridge: verify a freshly reserved address is not already
  live on the bridge (from running-container labels); on conflict keep
  it quarantined and retry, so a divergence can never surface as a
  duplicate handed to a new sandbox.

- IPReconciler: periodically (and on boot) re-reserve any address live
  on the bridge but free in netdb, and conservatively reap reservations
  with no live owner after several consecutive misses.

Adds unit tests for netdb concurrency/ReservedAddrs, the conflict-free
reserve loop, and the reconciler, plus an iso regression test that a
running sandbox missing from the node index is kept.
@evanphx evanphx requested a review from a team as a code owner June 12, 2026 04:01
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: adddca02-dfd3-4732-b932-7f7d4e29725e

📥 Commits

Reviewing files that changed from the base of the PR and between 194b69f and 01d7300.

📒 Files selected for processing (5)
  • controllers/sandbox/ipreconciler.go
  • controllers/sandbox/sandbox.go
  • controllers/sandbox/sandbox_frozen_test.go
  • controllers/sandbox/watchdog.go
  • network/config_test.go
✅ Files skipped from review due to trivial changes (1)
  • controllers/sandbox/sandbox_frozen_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • controllers/sandbox/watchdog.go
  • controllers/sandbox/ipreconciler.go
  • controllers/sandbox/sandbox.go

📝 Walkthrough

Walkthrough

This PR adds IP reconciliation and conflict-aware allocation to keep netdb IP reservations aligned with actual live bridge usage. The changes introduce a new IPReconciler component that periodically repairs missing IP reservations and reaps stale ones, update network allocation to detect and skip conflicting addresses, integrate the reconciler into SandboxController with a live-IP callback, and refactor the watchdog's orphan cleanup to re-fetch sandboxes and gate IP release on container removal success, preventing premature cleanup during index staleness.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
controllers/sandbox/watchdog.go (1)

152-152: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix typo in comment.

"sandboes" should be "sandboxes".

📝 Proposed fix
-		// Consider all non DEAD sandboes as valid.
+		// Consider all non DEAD sandboxes as valid.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/sandbox/watchdog.go` at line 152, Fix the typo in the comment
that currently reads "Consider all non DEAD sandboes as valid."—change
"sandboes" to "sandboxes" so the comment reads "Consider all non DEAD sandboxes
as valid." Locate this comment string in controllers/sandbox/watchdog.go and
update the text accordingly.
🧹 Nitpick comments (1)
pkg/netdb/netdb.go (1)

436-440: ⚡ Quick win

Log unparseable IP addresses during scan.

Silently skipping rows that fail netip.ParseAddr could mask database corruption or stale/invalid data without alerting operators. Add a log statement before continue so administrators can investigate and repair the corrupted entries.

📊 Suggested logging for parse failures
 		addr, err := netip.ParseAddr(ipStr)
 		if err != nil {
-			// Skip rows that can't be parsed rather than failing the whole scan.
+			s.netdb.Log.Warn("skipping unparseable IP in reserved addresses scan",
+				"ip", ipStr, "subnet", s.net.String(), "error", err)
 			continue
 		}

Note: This assumes a logger is available on s.netdb. If not, consider passing a logger to ReservedAddrs() or using a package-level logger.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/netdb/netdb.go` around lines 436 - 440, In ReservedAddrs (or the function
containing the netip.ParseAddr call), add a log call that reports the
unparseable ipStr before the continue so parse failures are visible; reference
the netip.ParseAddr invocation and use the service/struct logger (s.netdb or
another logger passed into ReservedAddrs) to log a message including the
offending ipStr and the parse error to aid investigation of corrupted or stale
DB entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@controllers/sandbox/watchdog.go`:
- Line 152: Fix the typo in the comment that currently reads "Consider all non
DEAD sandboes as valid."—change "sandboes" to "sandboxes" so the comment reads
"Consider all non DEAD sandboxes as valid." Locate this comment string in
controllers/sandbox/watchdog.go and update the text accordingly.

---

Nitpick comments:
In `@pkg/netdb/netdb.go`:
- Around line 436-440: In ReservedAddrs (or the function containing the
netip.ParseAddr call), add a log call that reports the unparseable ipStr before
the continue so parse failures are visible; reference the netip.ParseAddr
invocation and use the service/struct logger (s.netdb or another logger passed
into ReservedAddrs) to log a message including the offending ipStr and the parse
error to aid investigation of corrupted or stale DB entries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: acaa6416-6826-4322-ae31-38c0c6d92650

📥 Commits

Reviewing files that changed from the base of the PR and between 7a42b90 and 194b69f.

📒 Files selected for processing (10)
  • controllers/sandbox/ipreconciler.go
  • controllers/sandbox/ipreconciler_test.go
  • controllers/sandbox/sandbox.go
  • controllers/sandbox/sandbox_frozen_test.go
  • controllers/sandbox/watchdog.go
  • controllers/sandbox/watchdogtest/container_test.go
  • network/config.go
  • network/config_test.go
  • pkg/netdb/netdb.go
  • pkg/netdb/netdb_test.go

@miren-code-agent miren-code-agent Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: sandbox — prevent duplicate bridge IP assignment (MIR-1238)

This PR tackles a real, concrete bug: netdb lease bookkeeping can diverge from what's live on the bridge, and the divergence causes a second sandbox to receive an IP already in use. The fix is multi-layered — a conflict-check at allocation time, a repair/reap reconciler, and a hardening of the watchdog's orphan-detection against stale index reads. I read through all ten changed files, checked the entity-access plumbing, and exercised a few edge-case paths mentally. Here's what I found.


What the change does

graph LR
  A["AllocateOnBridge"] -->|"calls inUse() first"| B["reserveConflictFree<br/>(network/config.go)"]
  B -->|"skips live addrs,<br/>quarantines them"| C["netdb.Subnet.Reserve()"]
  D["IPReconciler<br/>(ipreconciler.go)"] -->|"every 5 min"| E["liveBridgeIPs<br/>(sandbox.go)"]
  E -->|"reads containerd labels"| F["Re-reserves lost addrs<br/>or reaps stale leases"]
  G["ContainerWatchdog<br/>(watchdog.go)"] -->|"per-container re-fetch"| H["sandboxStillValid<br/>(EAC.Get — linearizable)"]
  H -->|"stale index guard"| I["Only then mark orphaned<br/>and release IP after removal"]
Loading

Issues I found

1. CheckInterval is hard-coded in sandbox.go but IPReconciler already has a default (minor redundancy, not a bug)

In sandbox.go line 567, CheckInterval: 5 * time.Minute is set explicitly on the struct. The IPReconciler.Start() already supplies this default when CheckInterval == 0. The explicit value wins over the default, so behaviour is correct, but it's confusing to maintain in two places — a future reader might change the default in Start() and wonder why it has no effect.

2. reap logic: the miss counter is never cleared when ReleaseAddr fails

In ipreconciler.go lines 143–146:

if err := r.Subnet.ReleaseAddr(addr); err != nil {
    r.Log.Error(...)
    continue   // ← miss counter for this addr is NOT deleted
}
delete(r.misses, addr)

After a failed release, r.misses[addr] stays at ReleaseAfterMisses (the threshold). On the very next cycle, r.misses[addr]++ pushes it above the threshold again, and a release is attempted again. This is actually fine and arguably desirable (keep retrying), but the subsequent cleanup block at lines 154–158 will not remove it from the map either (because it's still in reservedSet). So for a persistently-failing release, the miss counter grows indefinitely. Not a correctness bug, but a map that grows without bound is worth noting.

3. sandboxStillValid — the errors.Is(err, cond.ErrNotFound{}) idiom

Looking at cond.ErrNotFound.Is():

func (e ErrNotFound) Is(target error) bool {
    _, ok := target.(ErrNotFound)
    return ok
}

errors.Is passes the target to the receiver's .Is() method. Here the receiver is err (the actual error), and target is cond.ErrNotFound{}. So the matching is checking target.(ErrNotFound) — if target is ErrNotFound, it returns true. This is the correct use of the pattern (same as all other callsites in the repo). No bug.

However, there's a subtler question: does EAC.Get ever return a cond.ErrNotFound for a genuinely absent entity, or does it return (resp, nil) with resp.HasEntity() == false? Looking at the generated EntityAccessClient.Get implementation (line 2875), it surfaces RPC errors including not-found via cond.RemoteError which maps "not-found" back to cond.ErrNotFound. The !resp.HasEntity() check on line 355 handles the success-but-absent case. Both paths return false (treat as gone), so the logic is correct.

4. liveBridgeIPs does not filter by container state — dead containers count as live

liveBridgeIPs (sandbox.go ~line 1390) iterates all containerd containers with our entity label and reads their IP label, regardless of whether the task is running or stopped. A container whose task has exited but whose containerd record still exists (e.g. between SIGKILL and removeContainer) will contribute its IP to the live set, causing the reconciler to re-reserve that address and possibly block its reuse. This is conservative (fail-safe direction), but could cause an address to stay reserved for an extra reconciler cycle (5–15 minutes by default) after the watchdog's removal attempt fails and before the reconciler reaps it. Given the conservative reap threshold (3 misses × 5 min = 15 min), this window is probably fine, but it's worth documenting the assumption.

5. allocBridgeMaxAttempts = 32 is not explicitly tested

The test TestReserveConflictFree covers the happy path and the nil-live-set path, but doesn't verify the allocBridgeMaxAttempts exhaustion path returns an error. A /24 subnet with 252 usable addresses has enough headroom that a conflict storm of 32+ is extremely unlikely, but the failure mode (returning an error that blocks sandbox creation) is production-impacting enough to warrant a test case.

6. Watchdog: sandboxStillValid is called for every container not in validContainers, including ones the index explicitly returned as invalid

The per-ID re-fetch is done for every container absent from validContainers. This means containers whose sandbox entities were explicitly returned by the index lookup as DEAD (past grace) also trigger a re-fetch. The current path is still correct (the re-fetch will return false for them), but it adds N extra RPCs where N is the number of stale containers — one per watchdog cycle. In a heavily-drained node this is harmless; on a busy system with many legitimately-dead sandboxes it's a small extra load. Not a blocker, but worth a comment.


What's done well

  • The decision to release the IP only after confirmed container removal (watchdog.go, lines 293–307) directly closes the specific race. This is the right ordering.
  • The reserveConflictFree quarantine strategy (leave conflicting IPs reserved rather than releasing them) is safe: it prevents re-issue of a potentially live address without requiring the caller to handle cleanup.
  • sandboxStillValid's fail-safe return true on any non-not-found error is the correct conservative choice for a component protecting live IP assignments.
  • IPReconciler.Start() runs an immediate initial reconcile before the first ticker fires, which means a restart heals any bookkeeping divergence that accumulated during downtime without waiting 5 minutes.
  • Test coverage is real: the three IPReconciler unit tests cover repair, threshold-based reaping, and the no-reap-while-live invariant. The watchdog integration test directly exercises the stale-index guard scenario.

Summary

The core logic is sound and the changes are well-motivated. The concerns above are mostly caveats rather than blockers: the most actionable ones are the growing miss-counter map (issue 2) and the lack of an exhaustion-path test for reserveConflictFree (issue 5). Neither prevents merging, but both are cheap to fix and I'd feel better seeing them addressed.

}

if err := r.Subnet.ReleaseAddr(addr); err != nil {
r.Log.Error("ip reconciler failed to release leaked address", "addr", addr, "error", err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ReleaseAddr fails, r.misses[addr] is left at the threshold value. On the next cycle, r.misses[addr]++ pushes it above the threshold and another release is attempted — which is probably the desired retry behaviour. However, the miss counter for this address will never be cleaned up by the guard block below (lines 154–158) as long as the address stays reserved, so it grows unboundedly on each failed cycle. Consider either capping the counter at ReleaseAfterMisses after a failure, or adding a comment explaining the retry-forever intent.

Log: c.Log.With("module", "ip-reconciler"),
Subnet: c.Subnet,
LiveIPs: func(ctx context.Context) (map[netip.Addr]bool, error) {
return c.liveBridgeIPs(ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckInterval: 5 * time.Minute is already the default applied in IPReconciler.Start() when CheckInterval == 0. Setting it here makes the two places diverge silently when one is changed. Either remove the explicit assignment here and rely on the default, or add a named constant that both sites share.

Comment thread network/config.go
// already marked it reserved, so it is quarantined and won't be handed out
// again) and the next address is tried.
func reserveConflictFree(log *slog.Logger, bridge string, subnet *netdb.Subnet, live map[netip.Addr]bool) (netip.Prefix, error) {
for attempt := 0; ; attempt++ {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocBridgeMaxAttempts exhaustion path (returning an error after 32 conflicted addresses) has no test. This error blocks sandbox creation, so it would be useful to have a test that pre-fills the first 32 addresses as live and verifies the function returns a clear error rather than silently succeeding with an unexpected address.

Comment thread controllers/sandbox/watchdog.go Outdated
// directly by ID — a linearizable key lookup, not an index read. If the
// sandbox still exists and is not DEAD past the grace window, the index
// merely missed it: never kill it or release its IP (MIR-1238).
if w.sandboxStillValid(cleanupCtx, sc.sandboxID, now, graceWindow) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sandboxStillValid is called for every container not in validContainers, including those whose sandbox entities were explicitly returned by the node-scoped index as DEAD past the grace window. For those the re-fetch is redundant (it'll return false again). Consider only calling sandboxStillValid when the sandbox ID is absent from the index response entirely — i.e. when validContainers never included it — not when it was present and expired.

- ipreconciler: cap the miss counter on a failed ReleaseAddr so a
  persistently failing release retries every cycle without the counter
  (and map entry) growing without bound.
- watchdog: only re-fetch a sandbox entity when the node index has no
  record of it; when the index returned it but it is DEAD past grace,
  that verdict is authoritative, so skip the redundant per-cycle RPC.
- sandbox: rely on the IPReconciler's default CheckInterval instead of
  duplicating the 5-minute value; document that liveBridgeIPs does not
  filter on task state and why that is conservative.
- network: test the allocBridgeMaxAttempts exhaustion path returns a
  clear error.

@miren-code-agent miren-code-agent Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: sandbox — prevent duplicate bridge IP assignment (MIR-1238)

I re-read this PR at the updated head (01d7300) — there are now two commits. All four of the specific issues I flagged as inline comments in my first review have been addressed. Here's the accounting:


What changed since my first review

Issue #1CheckInterval in two places (sandbox.go)
Fixed. The IPReconciler initialization in sandbox.go (around line 569) no longer hard-codes CheckInterval: 5 * time.Minute. The comment now reads // CheckInterval is left at the IPReconciler default (see Start). The default in Start() is the single source of truth.

Issue #2 — Miss counter grows without bound on persistent ReleaseAddr failure
Fixed. ipreconciler.go lines 145–148 now cap the counter:

// Cap the counter so a persistently failing release retries every
// cycle without letting the miss count (and the map entry) grow
// without bound.
r.misses[addr] = r.ReleaseAfterMisses

That's exactly what I asked for: retry-every-cycle semantics with a bounded map.

Issue #5allocBridgeMaxAttempts exhaustion path not tested
Fixed. network/config_test.go now contains "errors when every candidate up to the attempt limit is live", which pre-fills .2 through .2+32 as live and asserts reserveConflictFree returns an error containing "no conflict-free address". The quarantine behaviour (.2 stays reserved after being skipped) is also verified in the first sub-test.

Issue #6sandboxStillValid called redundantly for index-returned DEAD sandboxes
Fixed cleanly. The updated CleanupOrphanedContainers builds indexedSandboxes (a set of all sandbox IDs the index returned, regardless of state), and the re-fetch guard is:

if !indexedSandboxes[sc.sandboxID] && w.sandboxStillValid(...) {

sandboxStillValid is now only invoked for sandboxes completely absent from the index response. If the index returned a sandbox and it's DEAD past the grace window, the existing verdict is used directly — no extra RPC. That's the precise fix I requested.


Remaining observations from my first review (still fine)

  • Issue #3 (the errors.Is correctness question) was always correct — no change needed, confirmed again.
  • Issue #4 (liveBridgeIPs conservatively counts exiting containers as live) is now explicitly documented in a multi-line comment on liveBridgeIPs itself explaining why this is intentional and what the worst-case consequence is. The reasoning is sound.

A small residual note

The typo sandboes (for "sandboxes") at watchdog.go line 159 was flagged by the other bot reviewer and is still present. It's in a comment, not code, and it doesn't affect correctness at all — but it's cheap to fix.


Overall

The core logic is sound and was already so in the first review. This round of updates closes every actionable concern I raised. The test suite is substantively better: the exhaustion path test, the quarantine invariant, the concurrent-reservation stress test in netdb_test.go, and the integration test for index-missed live sandboxes all give real confidence in the properties this change is claiming. I'm comfortable marking this ready to merge.


indexedSandboxes[sb.ID.String()] = true

// Consider all non DEAD sandboes as valid.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: sandboessandboxes. This was also flagged by the other reviewer; trivial to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant