
perf(orch): eliminate balance_dirty_pages throttle on large VM resume #2858

Open
kalyazin wants to merge 5 commits into main from kalyazin/dirty-page-metrics-fix

@kalyazin kalyazin commented May 29, 2026

Summary

Problem

When resuming a sandbox with a large memory snapshot (e.g. 32 GB), the
orchestrator creates two mmap-backed caches backed by sparse files on the
host's NVMe RAID:

  • a read-through cache that materialises memfile chunks fetched from
    remote storage as Firecracker takes page faults;
  • a copy-on-write overlay cache that captures guest writes to the rootfs.

Writing tens of gigabytes of dirty pages into block-device-backed files
triggers Linux's per-BDI dirty-page throttle (balance_dirty_pages).
The kernel stalls the goroutines doing the writes until the writeback
daemon catches up, blocking the entire resume for ~64 s, roughly 4× longer
than the actual I/O work.

What this PR adds

Metrics (always on, no flag required):

  • orchestrator.blocks.chunks.store — native histogram of the time to
    fetch each batch of data from remote storage and write it into the
    local mmap cache. During a throttled resume the p99 spikes to
    300–600 ms while the p50 stays low; this ratio is the primary
    diagnostic signal.
  • orchestrator.host.balance_dirty_pages.threads — counter incremented at
    every 1 s poll by the number of process threads found stalled in the
    kernel's balance_dirty_pages throttle. A non-zero rate confirms the
    throttle is active; zero rules it out as the cause of slow resumes.

Fix (behind local-cache-memfd feature flag, default off):

Replace the sparse files on md0 with anonymous memfd_create() mappings
for both caches. Anonymous tmpfs pages are invisible to the block-device
dirty-page accounting path, so the throttle never fires.

The ExportToDiff path (used when snapshotting a running sandbox) is
unaffected: the memfd file descriptor is kept open on the cache struct
and used directly with CopyFileRange, which works correctly across
tmpfs→ext4.

Measured improvement

Tested on a 32 GB memfile snapshot with a cold NFS chunk cache:

Configuration           Time
Before (file-backed)    ~64 s
After (memfd)           ~16 s

@cla-bot cla-bot Bot added the cla-signed label May 29, 2026

cursor Bot commented May 29, 2026

PR Summary

Medium Risk
Resume and snapshot export paths change when the flag is enabled; default-off limits blast radius, but memfd ExportToDiff/CopyFileRange behavior should be validated on production filesystems.

Overview
Large VM resumes were slowed by Linux dirty-page throttling when memfile and rootfs mmap caches wrote through sparse NVMe-backed files. This PR adds always-on observability—a host counter sampling threads stalled in balance_dirty_pages and per-batch timing on chunk writes into the mmap cache—and an opt-in fix (local-cache-memfd) that backs those caches with anonymous memfds so resume traffic avoids block-device dirty accounting. Memfd caches keep an open fd for ExportToDiff (skipping sync_file_range on tmpfs) and close cleanly without removing a path; the resume-build CLI can force the flag for local testing.

Reviewed by Cursor Bugbot for commit 6ca28a2. Bugbot is set up for automated code reviews on this repo. Configure here.


codecov Bot commented May 29, 2026

❌ 3 Tests Failed:

Tests completed   Failed   Passed   Skipped
2686              3        2683     7
View the full list of 4 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics

Flake rate in main: 44.94% (Passed 696 times, Failed 568 times)

Stack Traces | 19.7s run time
=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:47: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:47
        	Error:      	Should NOT be empty, but was 0
        	Test:       	TestSandboxMetrics
--- FAIL: TestSandboxMetrics (19.73s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 57.92% (Passed 701 times, Failed 965 times)

Stack Traces | 65.2s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:27: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (65.24s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 58.04% (Passed 691 times, Failed 956 times)

Stack Traces | 194s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1254}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 189 MB\nFree memory before tmpfs mount: 795 MB\nMemory to use in integrity test (60% of free, min 64MB): 477 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"477+0 records in\n477+0 records out\n500170752 bytes (500 MB, 477 MiB) copied, 1.97708 s, 253 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=477\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 1.97\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:01.98\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2604\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 339\n\tVoluntary context swi"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tches: 4\n\tInvoluntary context switches: 19\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 671 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox ifsnmr6rb2idm6t3grsru
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1270}}
Executing command bash in sandbox i8rz6jv8hn61lctb4dwop (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{data:{stdout:"6a75e6f94ef6aa304b5a4115e52eaf82bf01b88b136395707f11708b1ff388f8\n"}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:80: Command [bash] completed successfully in sandbox ifsnmr6rb2idm6t3grsru
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1273}}
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
Executing command bash in sandbox ifsnmr6rb2idm6t3grsru (user: root)
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:81
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ifsnmr6rb2idm6t3grsru: unavailable: HTTP status 502 Bad Gateway
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:78
        	            				.../tests/orchestrator/sandbox_memory_integrity_test.go:110
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (193.51s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestMaskRequestHostAPIParameter

Flake rate in main: 42.44% (Passed 697 times, Failed 514 times)

Stack Traces | 4.8s run time
=== RUN   TestMaskRequestHostAPIParameter
=== PAUSE TestMaskRequestHostAPIParameter
=== CONT  TestMaskRequestHostAPIParameter
    mask_request_host_test.go:38: Command [apt-get] output: event:{start:{pid:1275}}
Executing command ls in sandbox ivd0whon4hdt2239fn3qs
    mask_request_host_test.go:38: Command [apt-get] output: event:{data:{stdout:"Hit:1 http://deb.debian.org/debian bookworm InRelease\n"}}
    mask_request_host_test.go:38: Command [apt-get] output: event:{data:{stdout:"Hit:2 http://deb.debian.org/debian bookworm-updates InRelease\n"}}
    mask_request_host_test.go:38: Command [apt-get] output: event:{data:{stdout:"Hit:3 http://deb.debian.org/debian-security bookworm-security InRelease\n"}}
    mask_request_host_test.go:38: Command [apt-get] output: event:{data:{stdout:"Reading package lists..."}}
    mask_request_host_test.go:38: Command [apt-get] output: event:{data:{stdout:"\n"}}
    mask_request_host_test.go:38: Command [apt-get] output: event:{end:{exited:true  status:"exit status 0"}}
    mask_request_host_test.go:38: Command [apt-get] completed successfully in sandbox iyurhfh6gxu9zjeet816j
Executing command apt-get in sandbox iyurhfh6gxu9zjeet816j (user: root)
    mask_request_host_test.go:40: Command [apt-get] output: event:{start:{pid:1376}}
    mask_request_host_test.go:41: 
        	Error Trace:	.../tests/proxies/mask_request_host_test.go:41
        	Error:      	Received unexpected error:
        	            	failed to execute command apt-get in sandbox iyurhfh6gxu9zjeet816j: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestMaskRequestHostAPIParameter
--- FAIL: TestMaskRequestHostAPIParameter (4.80s)


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

Calling FileSize concurrently with Close can lead to a data race on the internal state of the os.File descriptor or cause a bad file descriptor error if the file is closed. To prevent this, acquire a read lock and check if the cache is closed before accessing the file descriptor.

Comment thread packages/orchestrator/pkg/sandbox/block/cache.go
@kalyazin kalyazin force-pushed the kalyazin/dirty-page-metrics-fix branch from c7bd7d5 to 403a42e Compare May 29, 2026 15:40

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 403a42e. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/block/cache.go Outdated
@kalyazin kalyazin force-pushed the kalyazin/dirty-page-metrics-fix branch from 62c9de5 to 8430477 Compare May 29, 2026 15:52
…ks.chunks.store

Add orchestrator.blocks.chunks.store native histogram that records the
duration and byte count of each batch written into the mmap cache by the
streaming chunker. The metric reveals balance_dirty_pages throttle events
as p99 latency spikes (300–600 ms) while p50 remains low.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@kalyazin kalyazin force-pushed the kalyazin/dirty-page-metrics-fix branch from 8430477 to a580b03 Compare May 29, 2026 16:20
kalyazin added 4 commits May 29, 2026 17:34
Add orchestrator.host.balance_dirty_pages.threads counter, incremented
by the stalled thread count at every 1 s poll for the lifetime of the
process. rate() of this counter gives real-time throttle intensity
(threads/s) that is non-zero only when dirty-page throttle is active.

The poller runs unconditionally via init(): the signal is simpler to
reason about, carries no per-resume lifecycle complexity, and in
practice the throttle is only visible during resume-time mmap writes
anyway.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
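
The commit message above does not say how stalled threads are detected. The sketch below is one plausible approach, assuming the per-thread wait channel in /proc/self/task/*/wchan names balance_dirty_pages while a thread is throttled; that assumption, the helper names, and the in-process counter are all illustrative rather than taken from the PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"sync/atomic"
	"time"
)

// stalledTotal accumulates stalled-thread observations, like a monotonic
// counter that is incremented by the stalled thread count at every poll.
var stalledTotal atomic.Int64

// countStalled returns how many wait-channel names mention balance_dirty_pages.
func countStalled(wchans []string) int {
	n := 0
	for _, w := range wchans {
		if strings.Contains(w, "balance_dirty_pages") {
			n++
		}
	}
	return n
}

// readWchans reads the wait channel of every thread in this process.
// ASSUMPTION: the kernel exposes a symbol name in /proc/self/task/<tid>/wchan.
func readWchans() []string {
	paths, _ := filepath.Glob("/proc/self/task/*/wchan")
	out := make([]string, 0, len(paths))
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // the thread may have exited between glob and read
		}
		out = append(out, strings.TrimSpace(string(b)))
	}
	return out
}

// poll samples once per interval until stop is closed.
func poll(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			stalledTotal.Add(int64(countStalled(readWchans())))
		}
	}
}

func main() {
	stop := make(chan struct{})
	// The PR polls every 1 s; a short interval keeps this demo quick.
	go poll(10*time.Millisecond, stop)
	time.Sleep(50 * time.Millisecond)
	close(stop)
	fmt.Println("stalled-thread observations:", stalledTotal.Load())
}
```

Because the counter only ever grows by the number of stalled threads seen per poll, its rate is zero whenever no thread is throttled, which is the property the commit message relies on.
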
Add NewMemfdCache(size, blockSize, name) that creates a Cache backed by
an anonymous memfd (unix.MemfdCreate) instead of a named sparse file on
disk. Writes go to RAM-backed tmpfs pages and therefore never enter the
block-device dirty-page accounting path that triggers balance_dirty_pages
throttling during large VM resumes.

Structural changes to Cache to support the memfd path:
- Add file *os.File field: kept alive for memfd caches so ExportToDiff
  can read from the fd without re-opening by path, and so the GC cannot
  close the fd while the mmap is in use.
- Add isMemfd bool field: gates the memfd-specific branches in Close,
  ExportToDiff, and FileSize.

ExportToDiff: use c.file directly when isMemfd=true instead of
os.Open(c.filePath). SyncFileRange on tmpfs returns EINVAL; skip it
entirely for memfd-backed caches (tmpfs pages are already in RAM, no
writeback to schedule). CopyFileRange works correctly between tmpfs and
the destination filesystem.

FileSize: acquire the read lock and check isClosed() before accessing
c.file, preventing a race with concurrent Close. For memfd caches, fstat
the stored fd and return stat.Blocks*512 (stat.Blocks is always in
512-byte units on Linux). Guard against c.file==nil for zero-size caches,
where the fd is closed at construction and never stored.

Close: for memfd caches, close c.file to release the anonymous pages;
skip os.RemoveAll since there is no filesystem path.

No callers are changed in this commit; behavior for existing NewCache
paths is unchanged.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add LocalCacheMemfdFlag ("local-cache-memfd", default false) to control
whether the per-sandbox in-process mmap caches use anonymous memfds
instead of sparse files on md0.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

…caches

Wire NewMemfdCache into the two block.Cache creation sites that run on
every sandbox resume, behind the local-cache-memfd feature flag:

- block.NewChunker (streaming_chunk.go): the read-through cache that
  materialises memfile chunks fetched from GCS / NFS into a local mmap.
- rootfs.NewNBDProvider (nbd.go): the COW overlay cache that captures
  rootfs writes from the running VM.

Both caches were previously backed by sparse files on md0 (NVMe RAID0).
Writing ~30 GB of dirty pages into those files crossed the per-BDI
dirty-page threshold and parked dozens of goroutines in
balance_dirty_pages, stalling resumes of large-memory sandboxes for
~64 s.

With memfd backing the writes go to anonymous tmpfs pages, which are
invisible to the block-device dirty-page accounting path.  Measured
improvement on a 32 GB memfile snapshot: ~64 s → ~16 s (~4×).

ExportToDiff on the rootfs cache is unaffected: the stored c.file fd is
used in place of re-opening by path; CopyFileRange works correctly
between tmpfs and the destination filesystem.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@kalyazin kalyazin force-pushed the kalyazin/dirty-page-metrics-fix branch from a580b03 to 6ca28a2 Compare May 29, 2026 16:48
@kalyazin kalyazin marked this pull request as ready for review May 29, 2026 17:13