feat: add scheduled cleanup for orphaned Firecracker processes#2853
feat: add scheduled cleanup for orphaned Firecracker processes#2853AdaAibaby wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
The containsInterface function uses substring matching to check if an iptables rule references an interface, which can incorrectly match a prefix of a live interface name and lead to accidental deletion of live rules.
The parseIPTablesRule function does not verify that the rule starts with the expected prefix before slicing, which can cause incorrect slicing and parsing of unrelated rules.
The stop channel in the reconciler is unbuffered, which can cause Stop() to drop the signal if Start() is currently executing a sweep.
Using candidate.Add(24 * time.Hour) to advance to the next day can shift the scheduled sweep time across daylight saving time transitions, whereas candidate.AddDate(0, 0, 1) should be used to preserve local wall-clock time.
| func containsInterface(rule, iface string) bool { | ||
| for _, flag := range []string{"-i " + iface, "-o " + iface, "--in-interface " + iface, "--out-interface " + iface} { | ||
| if len(rule) >= len(flag) { | ||
| for i := 0; i <= len(rule)-len(flag); i++ { | ||
| if rule[i:i+len(flag)] == flag { | ||
| return true | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return false | ||
| } |
There was a problem hiding this comment.
The containsInterface function uses substring matching to check if a rule references an interface. If an orphaned interface name is a prefix of a live interface name (for example, veth-4 and veth-42), containsInterface will incorrectly return true for the live interface's rules, leading to their deletion and network disruption for live sandboxes.
func containsInterface(rule, iface string) bool {
var args []string
var cur []byte
for i := 0; i < len(rule); i++ {
if rule[i] == ' ' {
if len(cur) > 0 {
args = append(args, string(cur))
cur = cur[:0]
}
} else {
cur = append(cur, rule[i])
}
}
if len(cur) > 0 {
args = append(args, string(cur))
}
for i := 0; i < len(args)-1; i++ {
arg := args[i]
if arg == "-i" || arg == "-o" || arg == "--in-interface" || arg == "--out-interface" {
if args[i+1] == iface {
return true
}
}
}
return false
}| func parseIPTablesRule(rule, chain string) []string { | ||
| prefix := "-A " + chain + " " | ||
| if len(rule) <= len(prefix) { | ||
| return nil | ||
| } | ||
|
|
||
| rest := rule[len(prefix):] |
There was a problem hiding this comment.
The parseIPTablesRule function does not verify that the rule actually starts with the expected prefix before slicing. This can lead to incorrect slicing and parsing of unrelated rules (such as policy rules or other chain rules), resulting in malformed arguments being passed to DeleteIfExists.
| func parseIPTablesRule(rule, chain string) []string { | |
| prefix := "-A " + chain + " " | |
| if len(rule) <= len(prefix) { | |
| return nil | |
| } | |
| rest := rule[len(prefix):] | |
| func parseIPTablesRule(rule, chain string) []string { | |
| prefix := "-A " + chain + " " | |
| if len(rule) <= len(prefix) || rule[:len(prefix)] != prefix { | |
| return nil | |
| } | |
| rest := rule[len(prefix):] |
| func NewReconciler(cfg Config, sandboxes *sandbox.Map) *Reconciler { | ||
| cfg.setDefaults() | ||
|
|
||
| return &Reconciler{ | ||
| cfg: cfg, | ||
| sandboxes: sandboxes, | ||
| orchestratorPID: int32(os.Getpid()), | ||
| stop: make(chan struct{}), | ||
| } | ||
| } |
There was a problem hiding this comment.
The stop channel is unbuffered, which causes Stop() to silently drop the stop signal if Start() is currently executing a sweep or is between loop iterations. Making the stop channel buffered with a capacity of 1 ensures that the stop signal is preserved and processed immediately when the loop returns to the select block.
| func NewReconciler(cfg Config, sandboxes *sandbox.Map) *Reconciler { | |
| cfg.setDefaults() | |
| return &Reconciler{ | |
| cfg: cfg, | |
| sandboxes: sandboxes, | |
| orchestratorPID: int32(os.Getpid()), | |
| stop: make(chan struct{}), | |
| } | |
| } | |
| func NewReconciler(cfg Config, sandboxes *sandbox.Map) *Reconciler { | |
| cfg.setDefaults() | |
| return &Reconciler{ | |
| cfg: cfg, | |
| sandboxes: sandboxes, | |
| orchestratorPID: int32(os.Getpid()), | |
| stop: make(chan struct{}, 1), | |
| } | |
| } |
| if !candidate.After(now) { | ||
| candidate = candidate.Add(24 * time.Hour) | ||
| } |
There was a problem hiding this comment.
Using candidate.Add(24 * time.Hour) to advance to the next day adds exactly 24 hours of physical time, which can shift the scheduled sweep time across daylight saving time (DST) transitions. Using candidate.AddDate(0, 0, 1) correctly preserves the local wall-clock time.
| if !candidate.After(now) { | |
| candidate = candidate.Add(24 * time.Hour) | |
| } | |
| if !candidate.After(now) { | |
| candidate = candidate.AddDate(0, 0, 1) | |
| } |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 92ae181127
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for _, sbx := range r.sandboxes.Items() { | ||
| if sbx.Slot != nil { | ||
| idxs[sbx.Slot.Idx] = struct{}{} |
There was a problem hiding this comment.
Do not treat pooled slots as orphaned
When the daily sweep runs, this only preserves slots attached to live sandboxes, but the network pool pre-creates slots with CreateNetwork and holds them in newSlots/reusedSlots before any sandbox is running (packages/orchestrator/pkg/sandbox/network/pool.go:139-157). Those warmed slots have veth-N interfaces but are absent from sandboxes.Items(), so scanOrphanedVeths marks them orphaned and cleanOrphanedVeths deletes their interfaces; the pool can later hand out the same *Slot, leaving new sandboxes with a broken/missing network device.
Useful? React with 👍 / 👎.
| { | ||
| "filter", "FORWARD", | ||
| []string{"-i", vethName, "-j", "ACCEPT"}, | ||
| }, | ||
| { | ||
| "filter", "FORWARD", | ||
| []string{"-o", vethName, "-j", "ACCEPT"}, | ||
| }, |
There was a problem hiding this comment.
Match the FORWARD rules created for each slot
For an actually orphaned veth, these deletions do not match the rules added by CreateNetwork: the real rules include the default gateway side (-i veth-N -o ens5 -j ACCEPT and -i ens5 -o veth-N -j ACCEPT in packages/orchestrator/pkg/sandbox/network/network.go:203-212), while these calls omit that extra interface and there is no later scan of filter/FORWARD. As a result the cleanup reports/removes the veth but leaves stale forwarding rules behind for every cleaned slot.
Useful? React with 👍 / 👎.
Add periodic cleanup task to remove orphaned Firecracker processes on orchestrator nodes.
Orphaned FC processes will be automatically cleared regularly to prevent resource leakage and reduce host resource occupation.