Skip to content

feat: add scheduled cleanup for orphaned Firecracker processes#2853

Draft
AdaAibaby wants to merge 3 commits into
e2b-dev:mainfrom
AdaAibaby:orch-fc-orphan
Draft

feat: add scheduled cleanup for orphaned Firecracker processes#2853
AdaAibaby wants to merge 3 commits into
e2b-dev:mainfrom
AdaAibaby:orch-fc-orphan

Conversation

@AdaAibaby
Copy link
Copy Markdown
Contributor

Add periodic cleanup task to remove orphaned Firecracker processes on orchestrator nodes.

Orphaned FC processes will be automatically cleared regularly to prevent resource leakage and reduce host resource occupation.

@cla-bot cla-bot Bot added the cla-signed label May 29, 2026
@AdaAibaby AdaAibaby marked this pull request as draft May 29, 2026 03:50
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The containsInterface function uses substring matching to check if an iptables rule references an interface, which can incorrectly match a prefix of a live interface name and lead to accidental deletion of live rules.

The parseIPTablesRule function does not verify that the rule starts with the expected prefix before slicing, which can cause incorrect slicing and parsing of unrelated rules.

The stop channel in the reconciler is unbuffered, which can cause Stop() to drop the signal if Start() is currently executing a sweep.

Using candidate.Add(24 * time.Hour) to advance to the next day can shift the scheduled sweep time across daylight saving time transitions, whereas candidate.AddDate(0, 0, 1) should be used to preserve local wall-clock time.

Comment on lines +310 to +322
func containsInterface(rule, iface string) bool {
for _, flag := range []string{"-i " + iface, "-o " + iface, "--in-interface " + iface, "--out-interface " + iface} {
if len(rule) >= len(flag) {
for i := 0; i <= len(rule)-len(flag); i++ {
if rule[i:i+len(flag)] == flag {
return true
}
}
}
}

return false
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The containsInterface function uses substring matching to check if a rule references an interface. If an orphaned interface name is a prefix of a live interface name (for example, veth-4 and veth-42), containsInterface will incorrectly return true for the live interface's rules, leading to their deletion and network disruption for live sandboxes.

func containsInterface(rule, iface string) bool {
	var args []string
	var cur []byte
	for i := 0; i < len(rule); i++ {
		if rule[i] == ' ' {
			if len(cur) > 0 {
				args = append(args, string(cur))
				cur = cur[:0]
			}
		} else {
			cur = append(cur, rule[i])
		}
	}
	if len(cur) > 0 {
		args = append(args, string(cur))
	}

	for i := 0; i < len(args)-1; i++ {
		arg := args[i]
		if arg == "-i" || arg == "-o" || arg == "--in-interface" || arg == "--out-interface" {
			if args[i+1] == iface {
				return true
			}
		}
	}

	return false
}

Comment on lines +326 to +332
func parseIPTablesRule(rule, chain string) []string {
prefix := "-A " + chain + " "
if len(rule) <= len(prefix) {
return nil
}

rest := rule[len(prefix):]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The parseIPTablesRule function does not verify that the rule actually starts with the expected prefix before slicing. This can lead to incorrect slicing and parsing of unrelated rules (such as policy rules or other chain rules), resulting in malformed arguments being passed to DeleteIfExists.

Suggested change
func parseIPTablesRule(rule, chain string) []string {
prefix := "-A " + chain + " "
if len(rule) <= len(prefix) {
return nil
}
rest := rule[len(prefix):]
func parseIPTablesRule(rule, chain string) []string {
prefix := "-A " + chain + " "
if len(rule) <= len(prefix) || rule[:len(prefix)] != prefix {
return nil
}
rest := rule[len(prefix):]

Comment on lines +66 to +75
func NewReconciler(cfg Config, sandboxes *sandbox.Map) *Reconciler {
cfg.setDefaults()

return &Reconciler{
cfg: cfg,
sandboxes: sandboxes,
orchestratorPID: int32(os.Getpid()),
stop: make(chan struct{}),
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The stop channel is unbuffered, which causes Stop() to silently drop the stop signal if Start() is currently executing a sweep or is between loop iterations. Making the stop channel buffered with a capacity of 1 ensures that the stop signal is preserved and processed immediately when the loop returns to the select block.

Suggested change
func NewReconciler(cfg Config, sandboxes *sandbox.Map) *Reconciler {
cfg.setDefaults()
return &Reconciler{
cfg: cfg,
sandboxes: sandboxes,
orchestratorPID: int32(os.Getpid()),
stop: make(chan struct{}),
}
}
func NewReconciler(cfg Config, sandboxes *sandbox.Map) *Reconciler {
cfg.setDefaults()
return &Reconciler{
cfg: cfg,
sandboxes: sandboxes,
orchestratorPID: int32(os.Getpid()),
stop: make(chan struct{}, 1),
}
}

Comment on lines +327 to +329
if !candidate.After(now) {
candidate = candidate.Add(24 * time.Hour)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using candidate.Add(24 * time.Hour) to advance to the next day adds exactly 24 hours of physical time, which can shift the scheduled sweep time across daylight saving time (DST) transitions. Using candidate.AddDate(0, 0, 1) correctly preserves the local wall-clock time.

Suggested change
if !candidate.After(now) {
candidate = candidate.Add(24 * time.Hour)
}
if !candidate.After(now) {
candidate = candidate.AddDate(0, 0, 1)
}

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92ae181127

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +247 to +249
for _, sbx := range r.sandboxes.Items() {
if sbx.Slot != nil {
idxs[sbx.Slot.Idx] = struct{}{}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not treat pooled slots as orphaned

When the daily sweep runs, this only preserves slots attached to live sandboxes, but the network pool pre-creates slots with CreateNetwork and holds them in newSlots/reusedSlots before any sandbox is running (packages/orchestrator/pkg/sandbox/network/pool.go:139-157). Those warmed slots have veth-N interfaces but are absent from sandboxes.Items(), so scanOrphanedVeths marks them orphaned and cleanOrphanedVeths deletes their interfaces; the pool can later hand out the same *Slot, leaving new sandboxes with a broken/missing network device.

Useful? React with 👍 / 👎.

Comment on lines +240 to +247
{
"filter", "FORWARD",
[]string{"-i", vethName, "-j", "ACCEPT"},
},
{
"filter", "FORWARD",
[]string{"-o", vethName, "-j", "ACCEPT"},
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match the FORWARD rules created for each slot

For an actually orphaned veth, these deletions do not match the rules added by CreateNetwork: the real rules include the default gateway side (-i veth-N -o ens5 -j ACCEPT and -i ens5 -o veth-N -j ACCEPT in packages/orchestrator/pkg/sandbox/network/network.go:203-212), while these calls omit that extra interface and there is no later scan of filter/FORWARD. As a result the cleanup reports/removes the veth but leaves stale forwarding rules behind for every cleaned slot.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants