refactor(orchestrator): Fix leaking resources on shutdown and add framework for better lifecycle tracking by wj-e2b · Pull Request #2770 · e2b-dev/infra

wj-e2b · 2026-05-21T01:09:02Z

No description provided.

cursor · 2026-05-21T01:09:09Z

PR Summary

High Risk
High risk because it changes orchestrator shutdown ordering, sandbox termination behavior (including process-group signaling), and firewall/network teardown; failures here can strand VMs or break node networking. It also introduces new concurrency primitives and lifecycle tracking that must compile and behave correctly under load.

Overview
This PR appears to introduce a shutdown lifecycle manager and sandbox drain/force-stop flow, but there are likely build-breaking issues: sync.WaitGroup is used with a non-existent Go method in Server.ForceStopSandboxes and setupSandboxLifecycle, so the orchestrator will not compile as written.

It also changes teardown semantics in ways that can cause cross-sandbox impact if incorrect, notably deleting the nftables table by a constant name in Firewall.Close, and switching Firecracker termination to signal the whole process group, which can kill unintended descendants if PID reuse or grouping assumptions are wrong.

^{Reviewed by Cursor Bugbot for commit 6443f22. Bugbot is set up for automated code reviews on this repo. Configure here.}

codecov · 2026-05-21T01:10:01Z

❌ 3 Tests Failed:

Tests completed	Failed	Passed	Skipped
2683	3	2680	5

View the full list of 7 ❄️ flaky test(s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateCreateSandbox

Flake rate in main: 49.84% (Passed 467 times, Failed 464 times)

Stack Traces | 0s run time

=== RUN   TestSnapshotTemplateCreateSandbox
=== PAUSE TestSnapshotTemplateCreateSandbox
=== CONT  TestSnapshotTemplateCreateSandbox
--- FAIL: TestSnapshotTemplateCreateSandbox (0.00s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create

Flake rate in main: 49.78% (Passed 467 times, Failed 463 times)

Stack Traces | 17.3s run time

=== RUN   TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
=== PAUSE TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
=== CONT  TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
    snapshot_template_test.go:283: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:37
        	            				.../api/sandboxes/snapshot_template_test.go:283
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
--- FAIL: TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create (17.29s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateList

Flake rate in main: 49.95% (Passed 467 times, Failed 466 times)

Stack Traces | 0s run time

=== RUN   TestSnapshotTemplateList
=== PAUSE TestSnapshotTemplateList
=== CONT  TestSnapshotTemplateList
--- FAIL: TestSnapshotTemplateList (0.00s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID

Flake rate in main: 49.95% (Passed 467 times, Failed 466 times)

Stack Traces | 16.7s run time

=== RUN   TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
=== PAUSE TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
=== CONT  TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
    snapshot_template_test.go:146: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:37
        	            				.../api/sandboxes/snapshot_template_test.go:146
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
--- FAIL: TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID (16.73s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestCACertInjection

Flake rate in main: 49.57% (Passed 467 times, Failed 459 times)

Stack Traces | 0.6s run time

=== RUN   TestCACertInjection
=== PAUSE TestCACertInjection
=== CONT  TestCACertInjection
    ca_cert_test.go:113: Command [cat] output: event:{start:{pid:1269}}
    ca_cert_test.go:113: 
        	Error Trace:	.../tests/envd/ca_cert_test.go:78
        	            				.../tests/envd/ca_cert_test.go:113
        	Error:      	Received unexpected error:
        	            	failed to execute command cat in sandbox inctvii2vkry1hdwg0f32: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestCACertInjection
        	Messages:   	read CA bundle
--- FAIL: TestCACertInjection (0.60s)

github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 62.54% (Passed 472 times, Failed 788 times)

Stack Traces | 83.7s run time

=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:27: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (83.68s)

github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 62.80% (Passed 462 times, Failed 780 times)

Stack Traces | 109s run time

=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1258}}
Executing command bash in sandbox i64g56zjl0rfuer3hg9yu (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory before tmpfs mount: 192 MB\nFree memory before tmpfs mount: 792 MB\nMemory to use in integrity test (80% of free, min 64MB): 633 MB\n"}}
Executing command bash in sandbox i64g56zjl0rfuer3hg9yu (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"633+0 records in\n633+0 records out\n663748608 bytes (664 MB, 633 MiB) copied, 3.47998 s, 191 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"o"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"a"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"b"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": \"dd"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" if=/dev/urandom of=/mnt/testfile bs=1M count=633\"\n\tUse"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"r time ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"seconds"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0.00\n\tSystem time (seconds): 3.45\n\tPercent "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"of CPU this "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"job g"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ot: 99%\n\tElap"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"sed ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"wall cl"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ock) t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ime (h"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:":mm:ss or"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" m:ss"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): 0:03.48\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Average shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAv"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"erage stac"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"k size (kby"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tes): 0\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Average"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" tota"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"l size ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"kbytes"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tMaxim"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"um resi"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"dent set"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" size (kbytes): 2620\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) pag"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e faults"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": 2\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Minor ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"reclaimin"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g a frame)"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" page "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"fault"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s: 344\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tVoluntary"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" cont"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ext s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"witch"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"es: 3\n\tInvoluntary context switches: 77\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket mess"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ages received: "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0\n\tSig"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"nals d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"eliv"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i2zo7a5t2bc76csj0kcwm
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1275}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{data:{stdout:"61a6d5d46b78411c88280155e23c4e6665bc4fefaf100ecf8b6dc91417929bc4\n"}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:80: Command [bash] completed successfully in sandbox i2zo7a5t2bc76csj0kcwm
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1279}}
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:81
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i2zo7a5t2bc76csj0kcwm: unavailable: HTTP status 502 Bad Gateway
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:78
        	            				.../tests/orchestrator/sandbox_memory_integrity_test.go:110
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (109.48s)

To view more test analytics, go to the Test Analytics Dashboard
_{📋 Got 3 mins? Take this short survey to help us improve Test Analytics.}

gemini-code-assist

Code Review

The use of wg.Go in ForceStopSandboxes will cause a compilation error because sync.WaitGroup does not provide that method. In doStop, returning early when WaitWithContext fails prevents memory resource cleanup, which can lead to leaked userfaultfd processes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6443f22c3b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-21T23:35:32Z

-		if err := closer.close(closeCtx); err != nil {
-			clog.Error(ctx, "error during shutdown", zap.Error(err))
-			success = false
+	registerClose("sandbox drain", []string{"orchestrator server", "network pool", "device pool"}, func(context.Context) error {


Close request ingress before starting sandbox drain

shutdown.Stop runs units in reverse topological order, and this registers sandbox drain after grpc server without any dependency between them, so drain executes while gRPC is still serving. In that window new template build requests can still start sandboxes (the earlier tmpl.Wait only waits for currently running builds and does not stop new ones), which can keep the sandbox count non-zero until timeout and trigger unnecessary forced shutdown. Add ordering so gRPC/HTTP listeners are stopped before entering the drain phase.

Useful? React with 👍 / 👎.

To double down on codex's comment, this close needs to happen as literally the first thing to happen on any signals, before any other closing can happen, and it needs to happen on its own (no concurrent closing can happen at the same time). it's not clear that's happening here.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 6443f22. Configure here.}

cursor · 2026-05-21T23:52:53Z

+				return s.waitSandboxLifecycles(ctx)
+			}
+		}
+	}


ForceStopSandboxes uses canceled closeCtx for waitSandboxLifecycles indirectly

Medium Severity

DrainSandboxes checks s.sandboxFactory.Sandboxes.Count() which only counts the live map. Once the count reaches zero, it calls waitSandboxLifecycles(ctx) to wait for lifecycle goroutines. However, the drain loop polls only every 5 seconds (sandboxDrainLogInterval). If the last sandbox exits immediately after a tick fires and shows remaining > 0, the drain will wait up to 5 additional seconds before detecting completion. More importantly, Count() reflects the live map (cleared by MarkStopping), but sandboxes could be removed from live by the normal stop path while their lifecycle goroutine is still running Close. The drain correctly calls waitSandboxLifecycles afterward, but the gap between "live count is 0" and "all lifecycle goroutines done" is only bridged if the drain context doesn't expire in between.

Additional Locations (1)

packages/orchestrator/pkg/factories/run.go#L820-L850

^{Reviewed by Cursor Bugbot for commit 6443f22. Configure here.}

cursor · 2026-05-21T23:52:53Z

 		}
+
+		return nil
+	})


Sandbox drain closure captures outer closeCtx ignoring parameter

Low Severity

The closure passed to registerClose for "sandbox drain" has an unnamed context.Context parameter but references closeCtx from the outer scope instead of using its parameter. While they happen to be the same value today (since shutdown.Stop(closeCtx) passes it through), this creates a fragile coupling. If the lifecycle manager's Stop method ever passes a different context (e.g., with additional deadline), the closure would ignore it.

^{Reviewed by Cursor Bugbot for commit 6443f22. Configure here.}

djeebus · 2026-05-22T00:31:57Z

+	registerUnit := func(unit lifecycle.Unit) {
+		if err := shutdown.Register(unit); err != nil {
+			logger.L().Fatal(ctx, "failed to register lifecycle unit", zap.String("unit", unit.Name), zap.Error(err))
+		}
+	}
+	registerClose := func(name string, after []string, closeFn func(context.Context) error) {
+		registerUnit(lifecycle.Unit{
+			Name:  name,
+			After: after,
+			Stop: func(closeCtx context.Context) error {
+				clog := globalLogger.With(zap.String("service", name), zap.Bool("forced", config.ForceStop))
+				clog.Info(ctx, "closing")
+
+				if err := closeFn(closeCtx); err != nil {
+					clog.Error(ctx, "error during shutdown", zap.Error(err))
+
+					return err
+				}
+
+				return nil
+			},
+		})
+	}


These feel like they should be part of the lifecycle.Manager struct, rather than helper functions here, no?

djeebus · 2026-05-22T00:34:45Z

 	}
 	templateCache.Start(ctx)
-	closers = append(closers, closer{"template cache", func(context.Context) error {
+	registerClose("template cache", []string{"feature flags", "limiter"}, func(context.Context) error {


The dependencies-as-strings make me a little nervous, in terms of resiliency to future changes. At the least they should be consts, but if you're going to build dependency injection from scratch, may as well use types. Or use fx!

djeebus · 2026-05-22T00:35:48Z

+			if unit.Stop == nil {
+				registerUnit(unit)
+			} else {
+				registerClose(unit.Name, unit.After, unit.Stop)
+			}


djeebus · 2026-05-22T00:38:42Z

+		}
+
+		if _, ok := m.names[unit.Name]; ok {
+			return fmt.Errorf("duplicate lifecycle unit %q", unit.Name)


If we allow these units to be replaced, it gives us an extension point to swap out implementations before executing in the future.

djeebus · 2026-05-22T00:39:17Z

+		}
+
+		m.names[unit.Name] = struct{}{}
+		unit.After = append([]string(nil), unit.After...)


djeebus · 2026-05-22T00:40:21Z

+	return errors.Join(errs...)
+}
+
+func (m *Manager) startOrder() ([]Unit, error) {


This function in particular makes me want to use a library sooner rather than later.

djeebus · 2026-05-22T18:21:22Z

-	live    *smap.Map[*Sandbox]
-	network *smap.Map[*Sandbox]
+	live       *smap.Map[*Sandbox]
+	lifecycles *smap.Map[lifecycleEntry]


Is this a superset of the live and network maps? Can we remove either of those and exclusively use this new map?

djeebus · 2026-05-22T20:18:29Z

 // MaxStartingInstancesPerNode feature flag and resize the semaphore.
 const startingSandboxesLimitRefreshInterval = 30 * time.Second

+const sandboxDrainLogInterval = 5 * time.Second


For sandboxes that may remain up for 48 hours or more, logging every 5 seconds "still not done" is a lot of logs. maybe once a minute is better? or maybe a sliding scale:

every 5 seconds for a minute

every minute for an hour

every 15 minutes for the next 48 hours

djeebus · 2026-05-22T20:54:29Z

+			if err := s.proxy.RemoveFromPool(sbx.LifecycleID); err != nil {
+				sbxLog.Warn(ctx, "failed to remove sandbox from proxy pool after force stop", zap.Error(err))
+			}


Why is this necessary on force stop but not drain?

djeebus · 2026-05-22T22:24:33Z

-		if err := closer.close(closeCtx); err != nil {
-			clog.Error(ctx, "error during shutdown", zap.Error(err))
-			success = false
+	registerClose("sandbox drain", []string{"orchestrator server", "network pool", "device pool"}, func(context.Context) error {


To double down on codex's comment, this close needs to happen as literally the first thing to happen on any signals, before any other closing can happen, and it needs to happen on its own (no concurrent closing can happen at the same time). it's not clear that's happening here.

djeebus · 2026-05-22T22:44:26Z

Before merging, we should do the following:

update traffic simulator to add some slow sandboxes (even just sleep $(( (RANDOM % 1800) + 1 )) would probably work for this purpose)
deploy this branch to somewhere that can be hit by traffic simulator (not sure how to deploy to a dev cluster, but that would work)
randomly restart one orchestrator nomad allocation every ~15 minutes
wait 24 hours
ensure that every sandbox ran correctly

That should give us enough confidence to say "this is fixed"

wj-e2b requested review from ValentaTomas, dobrac and jakubno as code owners May 21, 2026 01:09

cla-bot Bot added the cla-signed label May 21, 2026

gemini-code-assist Bot reviewed May 21, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/main.go

Comment thread packages/orchestrator/pkg/sandbox/sandbox.go

cursor Bot reviewed May 21, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/main.go

wj-e2b force-pushed the infra-orchestrator branch from 7324b01 to 8c5b901 Compare May 21, 2026 02:25

cursor Bot reviewed May 21, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go

wj-e2b added 7 commits May 21, 2026 13:58

feat(orchestrator): Add draining to orchestrator

1fec065

fix(orchestrator): track sandbox lifecycles for shutdown

ab8ab6d

fix(orchestrator): ignore expected shutdown errors

ed0101a

feat(orchestrator): add lifecycle manager

5618fe8

refactor(orchestrator): use lifecycle stop ordering

3c3f9cc

fix(orchestrator): clean partial network setup

ca1012c

fix(orchestrator): harden network teardown

6443f22

wj-e2b force-pushed the infra-orchestrator branch from 8c5b901 to 6443f22 Compare May 21, 2026 23:27

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

cursor Bot reviewed May 21, 2026

View reviewed changes

djeebus reviewed May 22, 2026

View reviewed changes

Conversation

wj-e2b commented May 21, 2026

Uh oh!

cursor Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

codecov Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ 3 Tests Failed:

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 21, 2026

Choose a reason for hiding this comment

ForceStopSandboxes uses canceled closeCtx for waitSandboxLifecycles indirectly

Uh oh!

cursor Bot May 21, 2026

Choose a reason for hiding this comment

Sandbox drain closure captures outer closeCtx ignoring parameter

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

djeebus commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cursor Bot commented May 21, 2026 •

edited

Loading

codecov Bot commented May 21, 2026 •

edited

Loading