fix: add cgroup v2 preflight check with platform-specific guidance#657
fix: add cgroup v2 preflight check with platform-specific guidance#657
Conversation
Detect cgroup v2 misconfiguration before gateway startup and provide fix instructions for Linux (setup-spark), Docker Desktop (settings UI), and Colima (restart with --cgroupns-mode host). Closes #136
📝 WalkthroughWalkthroughAdds cgroup v2 detection and Docker cgroupns validation to onboarding. The preflight library exposes Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/preflight.test.js (1)
180-186: Test result depends on host environment.The test at lines 180-186 checks if the return is
nullorobject, which accommodates environments where/etc/docker/daemon.jsonexists. For deterministic test results, consider using a non-existent platform value.💡 More deterministic test
- it("returns null when no content provided and no file exists", () => { - // Use a platform where the daemon.json path won't exist - const result = readDaemonJson({ platform: "linux" }); - // On CI/dev machines /etc/docker/daemon.json may or may not exist, - // but we can at least verify the return type is object or null. - assert.ok(result === null || typeof result === "object"); + it("returns null when no content provided and no file exists", () => { + // Use a non-existent platform to ensure no path is tried + const result = readDaemonJson({ platform: "freebsd" }); + assert.equal(result, null); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/preflight.test.js` around lines 180 - 186, The test is non-deterministic because it relies on the host's /etc/docker/daemon.json; update the test for readDaemonJson to use a guaranteed-nonexistent platform string (e.g., "nonexistent-platform") or otherwise mock/stub filesystem access so the function cannot find a file, then assert the deterministic expected return (null) instead of allowing object|null — change the call to readDaemonJson({ platform: "nonexistent-platform" }) and assert result === null.bin/lib/preflight.js (1)
146-152: Consider movingrequirestatements to the top of the file.The inline
require("fs")andrequire("path").joincalls work but are inconsistent with the file's pattern wherenetis required at the top. Moving these to the module header improves readability and follows Node.js conventions.♻️ Proposed fix
Add at the top of the file (after line 7):
const net = require("net"); const { runCapture } = require("./runner"); +const fs = require("fs"); +const path = require("path");Then update the function:
- const fs = require("fs"); const paths = []; if (platform === "linux") { paths.push("/etc/docker/daemon.json"); } else if (platform === "darwin") { const home = process.env.HOME || "/tmp"; - paths.push(require("path").join(home, ".docker", "daemon.json")); + paths.push(path.join(home, ".docker", "daemon.json")); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/preflight.js` around lines 146 - 152, Move the inline requires to the module header: add top-level requires for fs and path (e.g., const fs = require("fs") and const path = require("path")). Then update the function that currently does const fs = require("fs") and uses require("path").join(...) to use the top-level fs and path variables (replace require("path").join with path.join). This keeps the existing variable names (fs, path) and preserves the logic that pushes "/etc/docker/daemon.json" for platform === "linux" and path.join(home, ".docker", "daemon.json") for platform === "darwin".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/preflight.js`:
- Around line 146-152: Move the inline requires to the module header: add
top-level requires for fs and path (e.g., const fs = require("fs") and const
path = require("path")). Then update the function that currently does const fs =
require("fs") and uses require("path").join(...) to use the top-level fs and
path variables (replace require("path").join with path.join). This keeps the
existing variable names (fs, path) and preserves the logic that pushes
"/etc/docker/daemon.json" for platform === "linux" and path.join(home,
".docker", "daemon.json") for platform === "darwin".
In `@test/preflight.test.js`:
- Around line 180-186: The test is non-deterministic because it relies on the
host's /etc/docker/daemon.json; update the test for readDaemonJson to use a
guaranteed-nonexistent platform string (e.g., "nonexistent-platform") or
otherwise mock/stub filesystem access so the function cannot find a file, then
assert the deterministic expected return (null) instead of allowing object|null
— change the call to readDaemonJson({ platform: "nonexistent-platform" }) and
assert result === null.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4c665242-6682-4fbd-937a-8e7b889714ca
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/lib/preflight.jstest/preflight.test.js
…or Docker Desktop
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
392-397: Recheck should tolerate daemon startup lag after remediation.At Line 393,
checkCgroupConfig()runs immediately after restart/reconfigure. This can fail transiently before Docker/Colima is fully ready, causing a false hard failure.Proposed hardening
- // Re-verify after auto-fix - const recheck = checkCgroupConfig({ runtime }); - if (!recheck.ok) { - console.error(` !! Fix applied but cgroup check still failing: ${recheck.reason}`); - process.exit(1); - } + // Re-verify after auto-fix (allow daemon startup time) + let recheck = { ok: false, reason: "not checked" }; + for (let attempt = 0; attempt < 15; attempt += 1) { + recheck = checkCgroupConfig({ runtime }); + if (recheck.ok) break; + sleep(2); + } + if (!recheck.ok) { + console.error(` !! Fix applied but cgroup check still failing: ${recheck.reason}`); + process.exit(1); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 392 - 397, The immediate call to checkCgroupConfig({ runtime }) after remediation can fail transiently while the daemon (Docker/Colima) is still starting; replace the single immediate recheck (recheck and recheck.ok handling) with a short retry loop that polls checkCgroupConfig({ runtime }) with a small sleep/backoff (e.g., try every 1–3s up to ~5–10 attempts) and only call process.exit(1) if all retries still return !recheck.ok; keep the existing error message (including recheck.reason) on the final failure so the behavior around checkCgroupConfig, runtime, recheck.ok and process.exit is preserved but tolerant to startup lag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 377-390: The fallback branch that runs spawnSync("sudo", ["-E",
"bash", path.join(SCRIPTS, "setup-spark.sh")]) is assuming
non-"colima"/non-"docker-desktop" runtimes are Linux; update the logic in the
block that currently uses the else branch (the Docker runtime check handling) to
first assert process.platform === "linux" before executing the setup-spark
remediation, and for non-Linux platforms (e.g., macOS where runtime may be
"unknown") print a clear message that automatic remediation is not available and
advise the user to run platform-specific steps; adjust the condition around the
spawnSync call and the messages (references: the variables/runtime-check, the
spawnSync call, SCRIPTS, setup-spark.sh) accordingly.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 392-397: The immediate call to checkCgroupConfig({ runtime })
after remediation can fail transiently while the daemon (Docker/Colima) is still
starting; replace the single immediate recheck (recheck and recheck.ok handling)
with a short retry loop that polls checkCgroupConfig({ runtime }) with a small
sleep/backoff (e.g., try every 1–3s up to ~5–10 attempts) and only call
process.exit(1) if all retries still return !recheck.ok; keep the existing error
message (including recheck.reason) on the final failure so the behavior around
checkCgroupConfig, runtime, recheck.ok and process.exit is preserved but
tolerant to startup lag.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9ba52ea1-cce8-491d-9415-2096e4a72146
📒 Files selected for processing (1)
bin/lib/onboard.js
| } else { | ||
| // Linux — auto-fix via setup-spark | ||
| console.log(" Fixing: running setup-spark to configure Docker for cgroupns=host..."); | ||
| const sparkResult = spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")], { | ||
| stdio: "inherit", | ||
| timeout: 120000, | ||
| }); | ||
| if (sparkResult.status !== 0) { | ||
| console.error(" Failed to configure Docker. Please run manually:"); | ||
| console.error(" sudo nemoclaw setup-spark"); | ||
| process.exit(1); | ||
| } | ||
| console.log(" ✓ Docker configured for cgroupns=host"); | ||
| } |
There was a problem hiding this comment.
Guard Linux-only remediation by OS, not just runtime label.
At Line 377, the fallback branch treats every non-colima/non-docker-desktop runtime as Linux. If runtime detection returns "unknown" on macOS, this will run setup-spark.sh and show incorrect remediation.
Proposed fix
- } else {
- // Linux — auto-fix via setup-spark
- console.log(" Fixing: running setup-spark to configure Docker for cgroupns=host...");
- const sparkResult = spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")], {
- stdio: "inherit",
- timeout: 120000,
- });
- if (sparkResult.status !== 0) {
- console.error(" Failed to configure Docker. Please run manually:");
- console.error(" sudo nemoclaw setup-spark");
- process.exit(1);
- }
- console.log(" ✓ Docker configured for cgroupns=host");
+ } else if (process.platform === "linux") {
+ // Linux — auto-fix via setup-spark
+ console.log(" Fixing: running setup-spark to configure Docker for cgroupns=host...");
+ const sparkResult = spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")], {
+ stdio: "inherit",
+ timeout: 120000,
+ });
+ if (sparkResult.status !== 0) {
+ console.error(" Failed to configure Docker. Please run manually:");
+ console.error(" sudo nemoclaw setup-spark");
+ process.exit(1);
+ }
+ console.log(" ✓ Docker configured for cgroupns=host");
+ } else {
+ console.error(" Could not auto-fix cgroup configuration for this platform/runtime.");
+ console.error(" Please configure Docker to use cgroupns=host and re-run onboard.");
+ process.exit(1);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 377 - 390, The fallback branch that runs
spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")]) is
assuming non-"colima"/non-"docker-desktop" runtimes are Linux; update the logic
in the block that currently uses the else branch (the Docker runtime check
handling) to first assert process.platform === "linux" before executing the
setup-spark remediation, and for non-Linux platforms (e.g., macOS where runtime
may be "unknown") print a clear message that automatic remediation is not
available and advise the user to run platform-specific steps; adjust the
condition around the spawnSync call and the messages (references: the
variables/runtime-check, the spawnSync call, SCRIPTS, setup-spark.sh)
accordingly.
|
Note: OpenShell fixed this server-side in v0.0.7+ (NVIDIA/OpenShell#329) — the gateway container now explicitly requests Users on OpenShell 0.0.7+ should not hit this. The preflight check in this PR is belt-and-suspenders for users on older OpenShell versions. Upgrading is the simplest fix: bash scripts/install-openshell.sh
openshell --version # should be 0.0.7+
nemoclaw onboard |
|
Closing — the root fix is upgrading OpenShell to 0.0.7+ which handles cgroupns server-side. Replacing with a PR that upgrades openshell when the installed version is too old. |
Summary
Detect cgroup v2 misconfiguration during onboard preflight, before the gateway starts, with platform-specific fix instructions. Previously the gateway would crash with a cryptic k3s error deep in startup.
Changes
bin/lib/preflight.js— three new functions:isCgroupV2()— detects cgroup v2 viastat(Linux) ordocker info(macOS)readDaemonJson()— reads Docker daemon config from platform-appropriate pathcheckCgroupConfig()— orchestrates the check, returns platform-specific fixbin/lib/onboard.js— inserts cgroup check after runtime detection, before OpenShell CLI checktest/preflight.test.js— 14 new test cases using opts injection (no real Docker/cgroup interaction)Fix guidance by platform
sudo nemoclaw setup-spark"default-cgroupns-mode": "host"→ Apply & restartcolima stop && colima start --cgroupns-mode hostCloses #136
Test plan
npm test— 233/233 passSummary by CodeRabbit
New Features
Tests