Skip to content

refactor(orchestrator): Fix leaking resources on shutdown and add framework for better lifecycle tracking#2770

Open
wj-e2b wants to merge 7 commits into
mainfrom
infra-orchestrator
Open

refactor(orchestrator): Fix leaking resources on shutdown and add framework for better lifecycle tracking#2770
wj-e2b wants to merge 7 commits into
mainfrom
infra-orchestrator

Conversation

@wj-e2b
Copy link
Copy Markdown
Contributor

@wj-e2b wj-e2b commented May 21, 2026

No description provided.

@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

PR Summary

High Risk
High risk because it changes orchestrator shutdown ordering, sandbox termination behavior (including process-group signaling), and firewall/network teardown; failures here can strand VMs or break node networking. It also introduces new concurrency primitives and lifecycle tracking that must compile and behave correctly under load.

Overview
This PR appears to introduce a shutdown lifecycle manager and sandbox drain/force-stop flow, but there are likely build-breaking issues: sync.WaitGroup is used with a non-existent Go method in Server.ForceStopSandboxes and setupSandboxLifecycle, so the orchestrator will not compile as written.

It also changes teardown semantics in ways that can cause cross-sandbox impact if incorrect, notably deleting the nftables table by a constant name in Firewall.Close, and switching Firecracker termination to signal the whole process group, which can kill unintended descendants if PID reuse or grouping assumptions are wrong.

Reviewed by Cursor Bugbot for commit 6443f22. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

❌ 3 Tests Failed:

Tests completed Failed Passed Skipped
2683 3 2680 5
View the full list of 7 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateCreateSandbox

Flake rate in main: 49.84% (Passed 467 times, Failed 464 times)

Stack Traces | 0s run time
=== RUN   TestSnapshotTemplateCreateSandbox
=== PAUSE TestSnapshotTemplateCreateSandbox
=== CONT  TestSnapshotTemplateCreateSandbox
--- FAIL: TestSnapshotTemplateCreateSandbox (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create

Flake rate in main: 49.78% (Passed 467 times, Failed 463 times)

Stack Traces | 17.3s run time
=== RUN   TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
=== PAUSE TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
=== CONT  TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
    snapshot_template_test.go:283: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:37
        	            				.../api/sandboxes/snapshot_template_test.go:283
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
--- FAIL: TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create (17.29s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateList

Flake rate in main: 49.95% (Passed 467 times, Failed 466 times)

Stack Traces | 0s run time
=== RUN   TestSnapshotTemplateList
=== PAUSE TestSnapshotTemplateList
=== CONT  TestSnapshotTemplateList
--- FAIL: TestSnapshotTemplateList (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID

Flake rate in main: 49.95% (Passed 467 times, Failed 466 times)

Stack Traces | 16.7s run time
=== RUN   TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
=== PAUSE TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
=== CONT  TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
    snapshot_template_test.go:146: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:37
        	            				.../api/sandboxes/snapshot_template_test.go:146
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
--- FAIL: TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID (16.73s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestCACertInjection

Flake rate in main: 49.57% (Passed 467 times, Failed 459 times)

Stack Traces | 0.6s run time
=== RUN   TestCACertInjection
=== PAUSE TestCACertInjection
=== CONT  TestCACertInjection
    ca_cert_test.go:113: Command [cat] output: event:{start:{pid:1269}}
    ca_cert_test.go:113: 
        	Error Trace:	.../tests/envd/ca_cert_test.go:78
        	            				.../tests/envd/ca_cert_test.go:113
        	Error:      	Received unexpected error:
        	            	failed to execute command cat in sandbox inctvii2vkry1hdwg0f32: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestCACertInjection
        	Messages:   	read CA bundle
--- FAIL: TestCACertInjection (0.60s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 62.54% (Passed 472 times, Failed 788 times)

Stack Traces | 83.7s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:27: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (83.68s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 62.80% (Passed 462 times, Failed 780 times)

Stack Traces | 109s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1258}}
Executing command bash in sandbox i64g56zjl0rfuer3hg9yu (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory before tmpfs mount: 192 MB\nFree memory before tmpfs mount: 792 MB\nMemory to use in integrity test (80% of free, min 64MB): 633 MB\n"}}
Executing command bash in sandbox i64g56zjl0rfuer3hg9yu (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"633+0 records in\n633+0 records out\n663748608 bytes (664 MB, 633 MiB) copied, 3.47998 s, 191 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"o"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"a"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"b"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": \"dd"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" if=/dev/urandom of=/mnt/testfile bs=1M count=633\"\n\tUse"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"r time ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"seconds"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0.00\n\tSystem time (seconds): 3.45\n\tPercent "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"of CPU this "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"job g"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ot: 99%\n\tElap"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"sed ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"wall cl"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ock) t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ime (h"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:":mm:ss or"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" m:ss"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): 0:03.48\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Average shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAv"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"erage stac"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"k size (kby"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tes): 0\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Average"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" tota"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"l size ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"kbytes"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tMaxim"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"um resi"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"dent set"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" size (kbytes): 2620\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) pag"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e faults"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": 2\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Minor ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"reclaimin"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g a frame)"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" page "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"fault"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s: 344\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tVoluntary"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" cont"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ext s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"witch"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"es: 3\n\tInvoluntary context switches: 77\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket mess"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ages received: "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0\n\tSig"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"nals d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"eliv"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i2zo7a5t2bc76csj0kcwm
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1275}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{data:{stdout:"61a6d5d46b78411c88280155e23c4e6665bc4fefaf100ecf8b6dc91417929bc4\n"}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:80: Command [bash] completed successfully in sandbox i2zo7a5t2bc76csj0kcwm
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1279}}
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
Executing command bash in sandbox i2zo7a5t2bc76csj0kcwm (user: root)
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:81
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i2zo7a5t2bc76csj0kcwm: unavailable: HTTP status 502 Bad Gateway
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:78
        	            				.../tests/orchestrator/sandbox_memory_integrity_test.go:110
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (109.48s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The use of wg.Go in ForceStopSandboxes will cause a compilation error because sync.WaitGroup does not provide that method. In doStop, returning early when WaitWithContext fails prevents memory resource cleanup, which can lead to leaked userfaultfd processes.

Comment thread packages/orchestrator/pkg/server/main.go
Comment thread packages/orchestrator/pkg/sandbox/sandbox.go
Comment thread packages/orchestrator/pkg/server/main.go
@wj-e2b wj-e2b force-pushed the infra-orchestrator branch from 7324b01 to 8c5b901 Compare May 21, 2026 02:25
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@wj-e2b wj-e2b force-pushed the infra-orchestrator branch from 8c5b901 to 6443f22 Compare May 21, 2026 23:27
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6443f22c3b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if err := closer.close(closeCtx); err != nil {
clog.Error(ctx, "error during shutdown", zap.Error(err))
success = false
registerClose("sandbox drain", []string{"orchestrator server", "network pool", "device pool"}, func(context.Context) error {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Close request ingress before starting sandbox drain

shutdown.Stop runs units in reverse topological order, and this registers sandbox drain after grpc server without any dependency between them, so drain executes while gRPC is still serving. In that window new template build requests can still start sandboxes (the earlier tmpl.Wait only waits for currently running builds and does not stop new ones), which can keep the sandbox count non-zero until timeout and trigger unnecessary forced shutdown. Add ordering so gRPC/HTTP listeners are stopped before entering the drain phase.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To double down on codex's comment, this close needs to happen as literally the first thing to happen on any signals, before any other closing can happen, and it needs to happen on its own (no concurrent closing can happen at the same time). it's not clear that's happening here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6443f22. Configure here.

return s.waitSandboxLifecycles(ctx)
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ForceStopSandboxes uses canceled closeCtx for waitSandboxLifecycles indirectly

Medium Severity

DrainSandboxes checks s.sandboxFactory.Sandboxes.Count() which only counts the live map. Once the count reaches zero, it calls waitSandboxLifecycles(ctx) to wait for lifecycle goroutines. However, the drain loop polls only every 5 seconds (sandboxDrainLogInterval). If the last sandbox exits immediately after a tick fires and shows remaining > 0, the drain will wait up to 5 additional seconds before detecting completion. More importantly, Count() reflects the live map (cleared by MarkStopping), but sandboxes could be removed from live by the normal stop path while their lifecycle goroutine is still running Close. The drain correctly calls waitSandboxLifecycles afterward, but the gap between "live count is 0" and "all lifecycle goroutines done" is only bridged if the drain context doesn't expire in between.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6443f22. Configure here.

}

return nil
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sandbox drain closure captures outer closeCtx ignoring parameter

Low Severity

The closure passed to registerClose for "sandbox drain" has an unnamed context.Context parameter but references closeCtx from the outer scope instead of using its parameter. While they happen to be the same value today (since shutdown.Stop(closeCtx) passes it through), this creates a fragile coupling. If the lifecycle manager's Stop method ever passes a different context (e.g., with additional deadline), the closure would ignore it.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6443f22. Configure here.

Comment on lines +356 to +378
registerUnit := func(unit lifecycle.Unit) {
if err := shutdown.Register(unit); err != nil {
logger.L().Fatal(ctx, "failed to register lifecycle unit", zap.String("unit", unit.Name), zap.Error(err))
}
}
registerClose := func(name string, after []string, closeFn func(context.Context) error) {
registerUnit(lifecycle.Unit{
Name: name,
After: after,
Stop: func(closeCtx context.Context) error {
clog := globalLogger.With(zap.String("service", name), zap.Bool("forced", config.ForceStop))
clog.Info(ctx, "closing")

if err := closeFn(closeCtx); err != nil {
clog.Error(ctx, "error during shutdown", zap.Error(err))

return err
}

return nil
},
})
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These feel like they should be part of the lifecycle.Manager struct, rather than helper functions here, no?

}
templateCache.Start(ctx)
closers = append(closers, closer{"template cache", func(context.Context) error {
registerClose("template cache", []string{"feature flags", "limiter"}, func(context.Context) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependencies-as-strings make me a little nervous, in terms of resiliency to future changes. At the least they should be consts, but if you're going to build dependency injection from scratch, may as well use types. Or use fx!

Comment on lines +640 to +644
if unit.Stop == nil {
registerUnit(unit)
} else {
registerClose(unit.Name, unit.After, unit.Stop)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

}

if _, ok := m.names[unit.Name]; ok {
return fmt.Errorf("duplicate lifecycle unit %q", unit.Name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow these units to be replaced, it gives us an extension point to swap out implementations before executing in the future.

}

m.names[unit.Name] = struct{}{}
unit.After = append([]string(nil), unit.After...)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wtf??

return errors.Join(errs...)
}

func (m *Manager) startOrder() ([]Unit, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function in particular makes me want to use a library sooner rather than later.

live *smap.Map[*Sandbox]
network *smap.Map[*Sandbox]
live *smap.Map[*Sandbox]
lifecycles *smap.Map[lifecycleEntry]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a superset of the live and network maps? Can we remove either of those and exclusively use this new map?

// MaxStartingInstancesPerNode feature flag and resize the semaphore.
const startingSandboxesLimitRefreshInterval = 30 * time.Second

const sandboxDrainLogInterval = 5 * time.Second
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sandboxes that may remain up for 48 hours or more, logging every 5 seconds "still not done" is a lot of logs. maybe once a minute is better? or maybe a sliding scale:

  • every 5 seconds for a minute
  • every minute for an hour
  • every 15 minutes for the next 48 hours

Comment on lines +258 to +260
if err := s.proxy.RemoveFromPool(sbx.LifecycleID); err != nil {
sbxLog.Warn(ctx, "failed to remove sandbox from proxy pool after force stop", zap.Error(err))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary on force stop but not drain?

if err := closer.close(closeCtx); err != nil {
clog.Error(ctx, "error during shutdown", zap.Error(err))
success = false
registerClose("sandbox drain", []string{"orchestrator server", "network pool", "device pool"}, func(context.Context) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To double down on codex's comment, this close needs to happen as literally the first thing to happen on any signals, before any other closing can happen, and it needs to happen on its own (no concurrent closing can happen at the same time). it's not clear that's happening here.

@djeebus
Copy link
Copy Markdown
Contributor

djeebus commented May 22, 2026

Before merging, we should do the following:

  • update traffic simulator to add some slow sandboxes (even just sleep $(( (RANDOM % 1800) + 1 )) would probably work for this purpose)
  • deploy this branch to somewhere that can be hit by traffic simulator (not sure how to deploy to a dev cluster, but that would work)
  • randomly restart one orchestrator nomad allocation every ~15 minutes
  • wait 24 hours
  • ensure that every sandbox ran correctly

That should give us enough confidence to say "this is fixed"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants