perf(orch): eliminate balance_dirty_pages throttle on large VM resume #2858
kalyazin wants to merge 5 commits into
Code Review
Calling FileSize concurrently with Close can lead to a data race on the internal state of the os.File descriptor or cause a bad file descriptor error if the file is closed. To prevent this, acquire a read lock and check if the cache is closed before accessing the file descriptor.
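The locking pattern the review asks for can be sketched as follows. This is a minimal illustration, not the PR's code: the `cache` type, the `errClosed` sentinel, and the `newDemoCache` helper are hypothetical stand-ins, and the real Cache has more fields and may report closure differently.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
)

// errClosed is a hypothetical sentinel; the PR's actual error value is not shown.
var errClosed = errors.New("cache is closed")

// cache is a simplified stand-in for the orchestrator's block cache.
type cache struct {
	mu     sync.RWMutex
	file   *os.File
	closed bool
}

// FileSize holds the read lock for the whole stat call, so Close cannot
// release the descriptor while it is being used.
func (c *cache) FileSize() (int64, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()

	if c.closed {
		return 0, errClosed
	}
	if c.file == nil { // zero-size cache: no descriptor was stored
		return 0, nil
	}
	info, err := c.file.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

// Close takes the write lock, so it waits for in-flight FileSize calls.
func (c *cache) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.closed {
		return nil
	}
	c.closed = true
	if c.file == nil {
		return nil
	}
	return c.file.Close()
}

// newDemoCache backs the cache with an unlinked temporary file of the given size.
func newDemoCache(size int64) (*cache, error) {
	f, err := os.CreateTemp("", "cache-demo-*")
	if err != nil {
		return nil, err
	}
	// Unlink immediately; the open descriptor keeps the file alive.
	os.Remove(f.Name())
	if err := f.Truncate(size); err != nil {
		f.Close()
		return nil, err
	}
	return &cache{file: f}, nil
}

func main() {
	c, err := newDemoCache(4096)
	if err != nil {
		panic(err)
	}
	size, err := c.FileSize()
	fmt.Println(size, err)
	c.Close()
	_, err = c.FileSize()
	fmt.Println(err)
}
```

Because FileSize only takes the read lock, concurrent size queries do not serialize against each other; only Close excludes them.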
…ks.chunks.store

Add orchestrator.blocks.chunks.store native histogram that records the duration and byte count of each batch written into the mmap cache by the streaming chunker. The metric reveals balance_dirty_pages throttle events as p99 latency spikes (300–600 ms) while p50 remains low.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add orchestrator.host.balance_dirty_pages.threads counter, incremented by the stalled thread count at every 1 s poll for the lifetime of the process. rate() of this counter gives real-time throttle intensity (threads/s) that is non-zero only when dirty-page throttle is active.

The poller runs unconditionally via init(): the signal is simpler to reason about, carries no per-resume lifecycle complexity, and in practice the throttle is only visible during resume-time mmap writes anyway.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
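One poll of such a counter might look like the sketch below. The commit message does not say how stalled threads are detected; reading each thread's wait channel from /proc/self/task/*/wchan is an assumption made here for illustration, and readWchans, countStalled, and throttleSymbol are hypothetical names. The 1 s ticker and the metric export are left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// throttleSymbol is the kernel function this sketch looks for (assumed
// to be how a throttled thread shows up in its wait channel).
const throttleSymbol = "balance_dirty_pages"

// readWchans returns the wait-channel symbol of every thread under taskDir
// (for example /proc/self/task). Threads that exit mid-scan are skipped.
func readWchans(taskDir string) ([]string, error) {
	paths, err := filepath.Glob(filepath.Join(taskDir, "*", "wchan"))
	if err != nil {
		return nil, err
	}
	wchans := make([]string, 0, len(paths))
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		wchans = append(wchans, strings.TrimSpace(string(b)))
	}
	return wchans, nil
}

// countStalled counts threads whose wait channel starts with the throttle symbol.
func countStalled(wchans []string) int {
	n := 0
	for _, w := range wchans {
		if strings.HasPrefix(w, throttleSymbol) {
			n++
		}
	}
	return n
}

func main() {
	var counter int64 // stand-in for the monotonic metric

	// A single poll; the real poller would repeat this every second.
	wchans, err := readWchans("/proc/self/task")
	if err != nil {
		panic(err)
	}
	counter += int64(countStalled(wchans))
	fmt.Println("counter after one poll:", counter)
}
```

Adding the stalled-thread count to a monotonic counter, rather than exporting a gauge, is what lets rate() report throttle intensity even when a scrape falls between polls.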
Add NewMemfdCache(size, blockSize, name) that creates a Cache backed by an anonymous memfd (unix.MemfdCreate) instead of a named sparse file on disk. Writes go to RAM-backed tmpfs pages and therefore never enter the block-device dirty-page accounting path that triggers balance_dirty_pages throttling during large VM resumes.

Structural changes to Cache to support the memfd path:
- Add file *os.File field: kept alive for memfd caches so ExportToDiff can read from the fd without re-opening by path, and so the GC cannot close the fd while the mmap is in use.
- Add isMemfd bool field: gates the memfd-specific branches in Close, ExportToDiff, and FileSize.

ExportToDiff: use c.file directly when isMemfd=true instead of os.Open(c.filePath). SyncFileRange on tmpfs returns EINVAL; skip it entirely for memfd-backed caches (tmpfs pages are already in RAM, no writeback to schedule). CopyFileRange works correctly between tmpfs and the destination filesystem.

FileSize: acquire the read lock and check isClosed() before accessing c.file, preventing a race with concurrent Close. For memfd caches, fstat the stored fd and return stat.Blocks*512 (stat.Blocks is always in 512-byte units on Linux). Guard against c.file==nil for zero-size caches, where the fd is closed at construction and never stored.

Close: for memfd caches, close c.file to release the anonymous pages; skip os.RemoveAll since there is no filesystem path.

No callers are changed in this commit; behavior for existing NewCache paths is unchanged.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
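The stat.Blocks*512 computation can be illustrated with the standard library alone. This sketch is Linux-only and uses a sparse temporary file rather than a memfd, since the fstat call is the same for any open descriptor; allocatedBytes and demoSparse are hypothetical helper names, not functions from the PR.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// allocatedBytes returns the storage actually backing f.
// st_blocks is counted in 512-byte units on Linux, regardless of the
// filesystem's own block size.
func allocatedBytes(f *os.File) (int64, error) {
	var st syscall.Stat_t
	if err := syscall.Fstat(int(f.Fd()), &st); err != nil {
		return 0, err
	}
	return st.Blocks * 512, nil
}

// demoSparse creates a 1 MiB sparse temp file, writes one 4 KiB block,
// and returns its apparent and allocated sizes.
func demoSparse() (apparent, allocated int64, err error) {
	f, err := os.CreateTemp("", "sparse-demo-*")
	if err != nil {
		return 0, 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err = f.Truncate(1 << 20); err != nil {
		return 0, 0, err
	}
	if _, err = f.WriteAt(make([]byte, 4096), 0); err != nil {
		return 0, 0, err
	}
	info, err := f.Stat()
	if err != nil {
		return 0, 0, err
	}
	allocated, err = allocatedBytes(f)
	return info.Size(), allocated, err
}

func main() {
	apparent, allocated, err := demoSparse()
	if err != nil {
		panic(err)
	}
	// The allocated figure depends on the filesystem; it is normally
	// far below the apparent size for a mostly-empty sparse file.
	fmt.Println("apparent:", apparent, "allocated:", allocated)
}
```

Reporting allocated rather than apparent size matters for a cache that is created at full snapshot size but populated lazily: the apparent size would overstate memory or disk use by the unpopulated portion.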
Add LocalCacheMemfdFlag ("local-cache-memfd", default false) to control
whether the per-sandbox in-process mmap caches use anonymous memfds
instead of sparse files on md0.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
…caches

Wire NewMemfdCache into the two block.Cache creation sites that run on every sandbox resume, behind the local-cache-memfd feature flag:
- block.NewChunker (streaming_chunk.go): the read-through cache that materialises memfile chunks fetched from GCS / NFS into a local mmap.
- rootfs.NewNBDProvider (nbd.go): the COW overlay cache that captures rootfs writes from the running VM.

Both caches were previously backed by sparse files on md0 (NVMe RAID0). Writing ~30 GB of dirty pages into those files crossed the per-BDI dirty-page threshold and parked dozens of goroutines in balance_dirty_pages, stalling resumes of large-memory sandboxes for ~64 s.

With memfd backing the writes go to anonymous tmpfs pages, which are invisible to the block-device dirty-page accounting path. Measured improvement on a 32 GB memfile snapshot: ~64 s → ~16 s (~4×).

ExportToDiff on the rootfs cache is unaffected: the stored c.file fd is used in place of re-opening by path; CopyFileRange works correctly between tmpfs and the destination filesystem.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

Summary
Problem
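When resuming a sandbox with a large memory snapshot (e.g. 32 GB), the
orchestrator creates two mmap caches backed by sparse files on the
host's NVMe RAID:
- a read-through cache that materialises memfile chunks fetched from
  remote storage as Firecracker takes page faults;
- a COW overlay cache that captures rootfs writes from the running VM.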
Writing tens of gigabytes of dirty pages into block-device-backed files
triggers Linux's per-BDI dirty-page throttle (balance_dirty_pages). The
kernel stalls the goroutines doing the writes until the writeback daemon
catches up, blocking the entire resume for ~64 s, roughly 4× as long as
the actual I/O work.
What this PR adds
Metrics (always on, no flag required):
- orchestrator.blocks.chunks.store: native histogram of the time to
  fetch each batch of data from remote storage and write it into the
  local mmap cache. During a throttled resume the p99 spikes to
  300–600 ms while the p50 stays low; this ratio is the primary
  diagnostic signal.
- orchestrator.host.balance_dirty_pages.threads: counter incremented at
  each 1 s poll by the number of threads stalled in the kernel
  dirty-page throttle. A non-zero rate confirms the throttle is active;
  zero rules it out as the cause of slow resumes.
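To see why the p99/p50 ratio is a useful signal, consider a made-up sample in which only a couple of batches hit the throttle. The sketch below uses a plain nearest-rank percentile over raw samples, which is not how a native histogram computes quantiles; percentile and syntheticBatches are hypothetical helpers and the durations are invented, not measurements from the PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (pct in 1..100) of samples.
func percentile(samples []time.Duration, pct int) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	rank := (pct*len(sorted) + 99) / 100 // integer ceil(pct/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// syntheticBatches builds 100 invented batch durations:
// 98 unthrottled writes and 2 that stalled in the throttle.
func syntheticBatches() []time.Duration {
	samples := make([]time.Duration, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 5*time.Millisecond)
	}
	samples = append(samples, 400*time.Millisecond, 400*time.Millisecond)
	return samples
}

func main() {
	s := syntheticBatches()
	p50, p99 := percentile(s, 50), percentile(s, 99)
	fmt.Printf("p50=%v p99=%v ratio=%.0fx\n", p50, p99, float64(p99)/float64(p50))
}
```

A mean over the same samples would move only modestly, which is why a tail percentile compared against the median isolates the stall far more clearly than an average does.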
Fix (behind the local-cache-memfd feature flag, default off):
Replace the sparse files on md0 with anonymous memfd_create() mappings
for both caches. Anonymous tmpfs pages are invisible to the block-device
dirty-page accounting path, so the throttle never fires.
The ExportToDiff path (used when snapshotting a running sandbox) is
unaffected: the memfd file descriptor is kept open on the cache struct
and used directly with CopyFileRange, which works correctly across
tmpfs → ext4.
Measured improvement
Tested on a 32 GB memfile snapshot with a cold NFS chunk cache: