perf(iac): tune NFS caching and RPC parallelism for persistent volumes #2871
tomassrnka wants to merge 2 commits
Two changes targeting NFS-volume read latency:
1. Relax the persistent-volume mount cache options from `noac, lookupcache=none` to `actimeo=1, lookupcache=positive`. The current options force every `stat()` and lookup to round-trip to Filestore, which dominates metadata-heavy workloads. `actimeo=1` bounds cross-host attribute staleness to ~1s (below the VM-side NFS client's own cache floor) while eliminating the bulk of redundant `GETATTR` RPCs. `lookupcache=positive` keeps negative lookups uncached so new files created by peer sandboxes still appear promptly. Cross-sandbox strict coherency, when actually needed, should use NLM locks (already enabled via `lock,local_lock=none`) or out-of-band signaling.
2. Raise `sunrpc.tcp_slot_table_entries` to 128 (default 2) via modprobe.d on client nodes. With `nconnect=7` the host can open 7 TCP connections to Filestore, but the default 2 in-flight RPCs per connection caps total in-flight RPCs at 14, far too low for many concurrent sandboxes generating uncached metadata RPCs. `cmd/simulate-nfs-traffic` already validated this experimentally; this wires it into the production startup script.
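The in-flight RPC ceiling quoted in item 2 is a simple product, and a small sketch makes the before/after numbers concrete. The helper name `max_inflight_rpcs` is hypothetical, and the model assumes each connection is capped at exactly the configured slot count, as the description states.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): upper bound on concurrent RPCs
# to one NFS server, assuming every TCP connection is limited to SLOTS
# in-flight requests.
max_inflight_rpcs() {
  nconnect=$1   # number of TCP connections (the nconnect= mount option)
  slots=$2      # sunrpc.tcp_slot_table_entries
  echo $(( nconnect * slots ))
}

max_inflight_rpcs 7 2     # default slot table: prints 14
max_inflight_rpcs 7 128   # value proposed here: prints 896
```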
The `[ -d /proc/sys/sunrpc ]` guard already handles the "module not loaded" case; `tcp_slot_table_entries` and `tcp_max_slot_table_entries` have been stable kernel sysctls for over a decade. A write failure past the guard is a real misconfiguration we want to surface at boot, not silently swallow.
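For readers following the thread, here is a minimal sketch of the pattern being defended: persist the setting for the next module load, apply it at runtime only behind the guard, and let a failed write propagate. It is not the actual `start-client.sh` code. The function name and the `root` prefix argument are hypothetical (the prefix exists only so the logic can be exercised against a scratch directory), and it writes to the `/proc/sys` files directly, which is assumed to be equivalent to `sysctl -w`.

```shell
#!/bin/sh
# apply_sunrpc_slots [ROOT]
# ROOT is a hypothetical path prefix: empty on a real host, a scratch
# directory when dry-running. Not taken from the PR.
apply_sunrpc_slots() {
  root=${1:-}
  slots=128

  # Persistent part: picked up whenever the sunrpc module is loaded later.
  mkdir -p "$root/etc/modprobe.d" || return 1
  printf 'options sunrpc tcp_slot_table_entries=%s tcp_max_slot_table_entries=%s\n' \
    "$slots" "$slots" > "$root/etc/modprobe.d/sunrpc.conf" || return 1

  # Runtime part: only meaningful if the module is already loaded.
  # No "|| true" here, so a failed write is reported to the caller.
  if [ -d "$root/proc/sys/sunrpc" ]; then
    echo "$slots" > "$root/proc/sys/sunrpc/tcp_slot_table_entries" || return 1
    echo "$slots" > "$root/proc/sys/sunrpc/tcp_max_slot_table_entries" || return 1
  fi
}
```

On a real client node this would be called with no argument and as root; a non-zero return is the "surface it at boot" behaviour described above.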
Code Review
The sunrpc slot table settings in `iac/provider-gcp/nomad-cluster/scripts/start-client.sh` are module parameters located under `/sys/module/sunrpc/parameters/` rather than sysctl parameters under `/proc/sys/sunrpc/`. Checking `/proc/sys/sunrpc` and using `sysctl -w` will fail to apply the runtime settings, so they should be applied by writing directly to the module parameter files if the module is loaded.
    if [ -d /proc/sys/sunrpc ]; then
      sysctl -w sunrpc.tcp_slot_table_entries=128
      sysctl -w sunrpc.tcp_max_slot_table_entries=128
    fi
Gemini is mistaken here. On modern Linux kernels these are exposed via both interfaces — they read/write the same underlying variables in net/sunrpc/xprtsock.c:
- module parameters: `/sys/module/sunrpc/parameters/tcp_{slot,max_slot}_table_entries`
- sysctls: `/proc/sys/sunrpc/tcp_{slot,max_slot}_table_entries` (i.e. `sysctl sunrpc.tcp_slot_table_entries`)
The project's own benchmark already uses the sysctl path successfully: see `packages/orchestrator/cmd/simulate-nfs-traffic/experiments.go:72-75`, which shells out to `sysctl -w sunrpc.tcp_slot_table_entries=128` via `setSysFs`. That validates the path works on the kernels we run.
Also: the suggested replacement reintroduces `|| true`, which was explicitly removed in the second commit on this branch. The `[ -d ... ]` guard already covers the "module not loaded" case, and silently swallowing errors past it would hide real misconfiguration.
Keeping the current implementation.
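The claim that both interfaces expose the same variable can be spot-checked on a node. The sketch below reads both files and compares them; the function name and the optional `root` prefix are hypothetical and exist only so the comparison can also be run against a scratch directory.

```shell
#!/bin/sh
# check_slot_table [ROOT]
# Compares the sysctl view and the module-parameter view of
# tcp_slot_table_entries. ROOT is a hypothetical prefix, empty on a real host.
check_slot_table() {
  root=${1:-}
  sysctl_file="$root/proc/sys/sunrpc/tcp_slot_table_entries"
  param_file="$root/sys/module/sunrpc/parameters/tcp_slot_table_entries"

  # Either file missing usually means the sunrpc module is not loaded.
  if [ ! -r "$sysctl_file" ] || [ ! -r "$param_file" ]; then
    echo "sunrpc not loaded"
    return 2
  fi

  sysctl_val=$(cat "$sysctl_file")
  param_val=$(cat "$param_file")
  if [ "$sysctl_val" = "$param_val" ]; then
    echo "match: $sysctl_val"
  else
    echo "mismatch: sysctl=$sysctl_val param=$param_val"
    return 1
  fi
}
```

Running `check_slot_table` with no argument on a client after startup would be expected to print a matching value if the reply above holds for that kernel.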
❌ 4 Tests Failed (4 flagged as flaky ❄️)
Summary
Two changes targeting NFS-volume read latency for sandboxes mounting persistent volumes:
1. `default_persistent_volume_type_nfs_mount_options`: replace `noac, lookupcache=none` with `actimeo=1, lookupcache=positive`. The current options force every `stat()`/`lookup()` to round-trip to Filestore, which dominates metadata-heavy workloads. `actimeo=1` bounds cross-host attribute staleness to ~1s (below the VM-side NFS client's own cache floor) while eliminating the bulk of redundant `GETATTR` RPCs. Negative lookups stay uncached so new files created by peer sandboxes still appear promptly. Per-`volume_type` `mount_options` overrides remain available for workloads that explicitly need stricter coherency.
2. `sunrpc.tcp_slot_table_entries`: raise to 128 (default `2`) via `/etc/modprobe.d/sunrpc.conf` on client nodes. With `nconnect=7`, the host opens 7 TCP connections to Filestore, but the default 2 in-flight RPCs per connection caps total in-flight RPCs at 14, far too low for many concurrent sandboxes generating uncached metadata RPCs. `cmd/simulate-nfs-traffic` already validated this experimentally; this wires it into the production startup script. A runtime `sysctl -w` covers the case where the module is already loaded.
Rationale on the cache change
The previous `noac, lookupcache=none` defaults were inherited from the original chunks-cache mount options (commit `a56c28d`, Sep 2025), with the inline comment `// disable attribute caching. slower, but more reliable`. That mount was later flipped to caching in #1429 (Nov 2025). The persistent-volumes path added in #1893 (Mar 2026, "Add files API") reused the pre-#1429 conservative options without re-evaluating them for the new use case.
In the multi-mount case (N sandboxes attaching the same volume across multiple hosts), `noac` does bound host-side staleness, but the VM-side NFS client has its own attribute cache that the orchestrator cannot control. Real cross-sandbox staleness is bounded by the VM cache, not the host cache, so removing `noac` does not meaningfully degrade observable consistency. It does eliminate a Filestore RPC on every metadata op.
`lookupcache=positive` (not `none`) preserves prompt discovery of newly-created files from peer sandboxes, which is important for shared-state coordination workloads, while still caching successful lookups for hot paths.
Why this matters
Sandbox volume reads currently traverse: VM kernel NFSv3 → orchestrator go-nfs proxy → host kernel NFSv3 → Filestore. With `noac, lookupcache=none`, every `ls`, `stat`, `open(path)`, and directory walk on the hot path is a Filestore round-trip (~1–3ms in-zone). For metadata-heavy workloads this dominates end-to-end latency.
Test plan
- `cat /sys/module/sunrpc/parameters/tcp_slot_table_entries` reports 128
- `mount | grep persistent-volume-types` on a staging client shows `actimeo=1` and `lookupcache=pos` in the active mount options
- `cmd/simulate-nfs-traffic` against a staging persistent volume; compare p50/p99 read latency before/after for metadata-heavy scenarios
- `orchestrator.chroot.request.latency` and Filestore-side metrics over 24h post-rollout
🤖 Generated with Claude Code
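The mount-option check in the test plan can be scripted as a small predicate over a comma-separated option string. This is a sketch with a hypothetical function name. It also accepts the expanded form `acregmin=1,acregmax=1,acdirmin=1,acdirmax=1`, on the assumption (not confirmed in this PR) that some kernels report `actimeo` as its four component options rather than verbatim.

```shell
#!/bin/sh
# mount_opts_ok OPTS
# OPTS is a comma-separated NFS option list, e.g. the OPTIONS column of findmnt.
# Succeeds only if positive-only lookup caching and a 1s attribute cache
# timeout are both present.
mount_opts_ok() {
  opts=",$1,"

  # The test plan expects the active mount to show "lookupcache=pos".
  case "$opts" in
    *,lookupcache=pos,*) ;;
    *) return 1 ;;
  esac

  # Accept the literal option as written in the test plan...
  case "$opts" in
    *,actimeo=1,*) return 0 ;;
  esac

  # ...or, as an assumption, the four options that actimeo sets together.
  for opt in acregmin acregmax acdirmin acdirmax; do
    case "$opts" in
      *,"$opt"=1,*) ;;
      *) return 1 ;;
    esac
  done
}
```

One way to feed it on a staging client would be `mount_opts_ok "$(findmnt -n -o OPTIONS <mountpoint>)"`, with the mountpoint left to whoever runs the check.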