Add rootfs reboot fallback for snapshots#2874
Conversation
cachedSeekable.StoreFile previously skipped NFS write-through whenever compression was enabled, so even the orchestrator that just built a template would miss NFS on the first read and hit GCS. Add a per-frame FrameSink PutOption and hook it from the compressed StoreFile path so each compressed frame is teed into a .frm file at its C-space offset as it is produced, matching the layout used by the read-miss writeback. Gated by the existing write-to-cache-on-writes feature flag.
Match the uncompressed write-through path: read MaxCacheWriterConcurrencyFlag and gate concurrent writeToCache calls through a per-upload semaphore. Same operator knob, same fallback-to-1 warning, no timeout, still fire-and-forget via goCtx so the upload doesn't wait on NFS. Also widen FrameSink to take context.Context so the callback is a plain method signature rather than something that has to be wrapped in a closure just to carry ctx.
Upload.Run now adds CompressUseCaseContext(u.useCase) to the ctx that flows
through the whole upload, so downstream flag reads can target template builds
("build") and snapshot pauses ("pause") independently. Enables turning on
EnableWriteThroughCacheFlag for template builds only.
Uses the same LD kind already wired through resolveCompressConfig.
TestCachedSeekable_FrameSinkPopulatesNFS exercises the sink directly but skips the StoreFile gating branch. Add two tests modeled on the existing uncompressed TestCachedFileObjectProvider_WriteFromFileSystem: - _WriteThrough: routes through StoreFile with compressed cfg + FF on, verifies every frame in the returned FrameTable lands at its expected .frm path on the temp NFS dir. Mock inner runs compressStream with the sink pulled from opts, mirroring fs/GCS backends. - _FlagOff_NoSink: asserts no FrameSink is attached when the EnableWriteThroughCacheFlag is false.
PR SummaryHigh Risk Overview Reviewed by Cursor Bugbot for commit 21bbc9f. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Code Review
In packages/orchestrator/pkg/server/sandboxes.go, the constructed SandboxCreateRequest in the non-memory snapshot checkpoint path is missing the Sandbox configuration field, which will cause a nil pointer dereference when createSandboxFromRootfs attempts to retrieve the sandbox configuration via req.GetSandbox().
Reuse the stored sandbox config when checkpoint resumes through the rootfs reboot path.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Resume races rootfs reboot
- Added check for memorySnapshotEnabled before falling back to rootfs reboot, preventing race condition when memory artifacts are still uploading.
Or push these changes by commenting:
@cursor push 665ade9913
Preview (665ade9913)
diff --git a/packages/orchestrator/pkg/server/sandboxes.go b/packages/orchestrator/pkg/server/sandboxes.go
--- a/packages/orchestrator/pkg/server/sandboxes.go
+++ b/packages/orchestrator/pkg/server/sandboxes.go
@@ -208,10 +208,12 @@
req.GetSandbox(),
)
if errors.Is(err, storage.ErrObjectNotExist) {
- telemetry.ReportEvent(ctx, "memory snapshot files missing, rebooting from rootfs")
- rebootFromRootfs = true
- childSpan.SetAttributes(attribute.Bool("sandbox.reboot_from_rootfs", true))
- sbx, err = s.createSandboxFromRootfs(ctx, template, config, runtime, req)
+ if !memorySnapshotEnabled(ctx) {
+ telemetry.ReportEvent(ctx, "memory snapshot disabled, rebooting from rootfs")
+ rebootFromRootfs = true
+ childSpan.SetAttributes(attribute.Bool("sandbox.reboot_from_rootfs", true))
+ sbx, err = s.createSandboxFromRootfs(ctx, template, config, runtime, req)
+ }
}
}
if err != nil {You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 31db71f. Configure here.
❌ 4 Tests Failed:
View the top 1 failed test(s) by shortest run time
View the full list of 3 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
|
Superseded by the smaller clean-branch PR #2875. |


Add a rootfs reboot fallback so selected snapshot flows can skip memory restore while preserving normal resume behavior by default.