Skip to content

fix: add cgroup v2 preflight check with platform-specific guidance#657

Closed
ericksoa wants to merge 2 commits intomainfrom
fix/cgroup-v2-preflight
Closed

fix: add cgroup v2 preflight check with platform-specific guidance#657
ericksoa wants to merge 2 commits intomainfrom
fix/cgroup-v2-preflight

Conversation

@ericksoa
Copy link
Contributor

@ericksoa ericksoa commented Mar 22, 2026

Summary

Detect cgroup v2 misconfiguration during onboard preflight, before the gateway starts, with platform-specific fix instructions. Previously the gateway would crash with a cryptic k3s error deep in startup.

Changes

bin/lib/preflight.js — three new functions:

  • isCgroupV2() — detects cgroup v2 via stat (Linux) or docker info (macOS)
  • readDaemonJson() — reads Docker daemon config from platform-appropriate path
  • checkCgroupConfig() — orchestrates the check, returns platform-specific fix

bin/lib/onboard.js — inserts cgroup check after runtime detection, before OpenShell CLI check

test/preflight.test.js — 14 new test cases using opts injection (no real Docker/cgroup interaction)

Fix guidance by platform

Platform Fix
Linux sudo nemoclaw setup-spark
Docker Desktop Settings → Docker Engine → add "default-cgroupns-mode": "host" → Apply & restart
Colima colima stop && colima start --cgroupns-mode host

Closes #136

Test plan

  • npm test — 233/233 pass
  • Manual on macOS Docker Desktop: verify preflight detects cgroup v2 and shows correct instructions
  • Manual on Linux without cgroupns=host: verify preflight detects and shows setup-spark instructions

Summary by CodeRabbit

  • New Features

    • Added cgroup v2 validation during onboarding with clear success/failure messaging.
    • When misconfigured, onboarding now offers runtime-specific guidance and attempts fixes where possible (automatic attempt for Colima, scripted fix for other runtimes, explicit GUI instructions for Docker Desktop).
    • Continues with installation if cgroup check passes.
  • Tests

    • Added tests covering cgroup detection and configuration validation across Linux and macOS.

Detect cgroup v2 misconfiguration before gateway startup and provide
fix instructions for Linux (setup-spark), Docker Desktop (settings UI),
and Colima (restart with --cgroupns-mode host).

Closes #136
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 22, 2026

📝 Walkthrough

Walkthrough

Adds cgroup v2 detection and Docker cgroupns validation to onboarding. The preflight library exposes isCgroupV2, readDaemonJson, and checkCgroupConfig; onboarding invokes checkCgroupConfig({ runtime }), attempts runtime-specific fixes when needed, and exits on unresolved misconfiguration.

Changes

Cohort / File(s) Summary
Preflight cgroup validation
bin/lib/preflight.js
Added exports: isCgroupV2(opts) (detects cgroup v2 via Linux stat or docker info), readDaemonJson(opts) (reads/parses Docker daemon.json from platform-appropriate path or provided content), and checkCgroupConfig(opts) (evaluates cgroup v2 compatibility and returns {ok, runtime, reason, fix} with runtime-specific guidance).
Onboarding integration
bin/lib/onboard.js
Imported checkCgroupConfig; added a preflight step that checks cgroup config, prints detailed failure message, attempts runtime-specific fixes (colima stop && colima start --cgroupns-mode host, GUI instructions for Docker Desktop, or runs setup-spark.sh), re-checks, and exits with code 1 on persistent failure; prints success and continues on pass.
Tests
test/preflight.test.js
Expanded unit tests to cover isCgroupV2, readDaemonJson, and checkCgroupConfig across Linux and macOS detection paths, valid/invalid/missing daemon.json, and decision branches for Docker, Colima, and Docker Desktop scenarios.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant User as User/Shell
participant Onboard as onboard.js
participant Preflight as preflight.js
participant Runtime as Runtime (Docker / Colima / Docker Desktop)
participant System as /etc/docker/daemon.json

User->>Onboard: start onboard
Onboard->>Preflight: checkCgroupConfig({ runtime })
Preflight->>System: readDaemonJson() (if Linux/macOS)
Preflight->>Runtime: query docker info / stat / injected outputs
Runtime-->>Preflight: cgroup v2 detected? + runtime id
alt cgroup v2 OK (default-cgroupns-mode: host)
    Preflight-->>Onboard: { ok: true }
    Onboard->>Onboard: print "✓ cgroup configuration OK"
    Onboard->>Preflight: continue other preflight checks (ports, CLI)
else cgroup v2 misconfigured
    Preflight-->>Onboard: { ok: false, runtime, reason, fix }
    Onboard->>User: print failure message and suggested fix
    alt runtime == colima
        Onboard->>Runtime: run "colima stop && colima start --cgroupns-mode host"
    else runtime == docker-desktop
        Onboard->>User: print GUI instructions (no automatic fix)
    else
        Onboard->>System: run "setup-spark.sh" (sudo)
    end
    Onboard->>Preflight: re-run checkCgroupConfig
    alt still failing
        Onboard->>User: log failure and exit(1)
    else success
        Onboard->>Onboard: continue onboarding
    end
end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I sniff the cgroups, small and spry,
Two versions hop where processes lie.
If Docker’s cage won’t let k3s play,
I nudge the defaults, hop, and say:
Fix it, retry — and onboard away! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main change: adding a cgroup v2 preflight check with platform-specific guidance, which is the primary objective of this PR.
Linked Issues check ✅ Passed The code changes fully address the requirements from issue #136: detecting cgroup v2 misconfiguration, providing actionable platform-specific instructions, and implementing automated remediation where appropriate.
Out of Scope Changes check ✅ Passed All code changes are directly aligned with the PR objectives and issue #136 requirements; no out-of-scope modifications were detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cgroup-v2-preflight

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/preflight.test.js (1)

180-186: Test result depends on host environment.

The test at lines 180-186 checks if the return is null or object, which accommodates environments where /etc/docker/daemon.json exists. For deterministic test results, consider using a non-existent platform value.

💡 More deterministic test
-  it("returns null when no content provided and no file exists", () => {
-    // Use a platform where the daemon.json path won't exist
-    const result = readDaemonJson({ platform: "linux" });
-    // On CI/dev machines /etc/docker/daemon.json may or may not exist,
-    // but we can at least verify the return type is object or null.
-    assert.ok(result === null || typeof result === "object");
+  it("returns null when no content provided and no file exists", () => {
+    // Use a non-existent platform to ensure no path is tried
+    const result = readDaemonJson({ platform: "freebsd" });
+    assert.equal(result, null);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/preflight.test.js` around lines 180 - 186, The test is non-deterministic
because it relies on the host's /etc/docker/daemon.json; update the test for
readDaemonJson to use a guaranteed-nonexistent platform string (e.g.,
"nonexistent-platform") or otherwise mock/stub filesystem access so the function
cannot find a file, then assert the deterministic expected return (null) instead
of allowing object|null — change the call to readDaemonJson({ platform:
"nonexistent-platform" }) and assert result === null.
bin/lib/preflight.js (1)

146-152: Consider moving require statements to the top of the file.

The inline require("fs") and require("path").join calls work but are inconsistent with the file's pattern where net is required at the top. Moving these to the module header improves readability and follows Node.js conventions.

♻️ Proposed fix

Add at the top of the file (after line 7):

 const net = require("net");
 const { runCapture } = require("./runner");
+const fs = require("fs");
+const path = require("path");

Then update the function:

-  const fs = require("fs");
   const paths = [];
   if (platform === "linux") {
     paths.push("/etc/docker/daemon.json");
   } else if (platform === "darwin") {
     const home = process.env.HOME || "/tmp";
-    paths.push(require("path").join(home, ".docker", "daemon.json"));
+    paths.push(path.join(home, ".docker", "daemon.json"));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/preflight.js` around lines 146 - 152, Move the inline requires to the
module header: add top-level requires for fs and path (e.g., const fs =
require("fs") and const path = require("path")). Then update the function that
currently does const fs = require("fs") and uses require("path").join(...) to
use the top-level fs and path variables (replace require("path").join with
path.join). This keeps the existing variable names (fs, path) and preserves the
logic that pushes "/etc/docker/daemon.json" for platform === "linux" and
path.join(home, ".docker", "daemon.json") for platform === "darwin".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bin/lib/preflight.js`:
- Around line 146-152: Move the inline requires to the module header: add
top-level requires for fs and path (e.g., const fs = require("fs") and const
path = require("path")). Then update the function that currently does const fs =
require("fs") and uses require("path").join(...) to use the top-level fs and
path variables (replace require("path").join with path.join). This keeps the
existing variable names (fs, path) and preserves the logic that pushes
"/etc/docker/daemon.json" for platform === "linux" and path.join(home,
".docker", "daemon.json") for platform === "darwin".

In `@test/preflight.test.js`:
- Around line 180-186: The test is non-deterministic because it relies on the
host's /etc/docker/daemon.json; update the test for readDaemonJson to use a
guaranteed-nonexistent platform string (e.g., "nonexistent-platform") or
otherwise mock/stub filesystem access so the function cannot find a file, then
assert the deterministic expected return (null) instead of allowing object|null
— change the call to readDaemonJson({ platform: "nonexistent-platform" }) and
assert result === null.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4c665242-6682-4fbd-937a-8e7b889714ca

📥 Commits

Reviewing files that changed from the base of the PR and between 5eb4b71 and 335ef85.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • bin/lib/preflight.js
  • test/preflight.test.js

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

392-397: Recheck should tolerate daemon startup lag after remediation.

At Line 393, checkCgroupConfig() runs immediately after restart/reconfigure. This can fail transiently before Docker/Colima is fully ready, causing a false hard failure.

Proposed hardening
-    // Re-verify after auto-fix
-    const recheck = checkCgroupConfig({ runtime });
-    if (!recheck.ok) {
-      console.error(`  !! Fix applied but cgroup check still failing: ${recheck.reason}`);
-      process.exit(1);
-    }
+    // Re-verify after auto-fix (allow daemon startup time)
+    let recheck = { ok: false, reason: "not checked" };
+    for (let attempt = 0; attempt < 15; attempt += 1) {
+      recheck = checkCgroupConfig({ runtime });
+      if (recheck.ok) break;
+      sleep(2);
+    }
+    if (!recheck.ok) {
+      console.error(`  !! Fix applied but cgroup check still failing: ${recheck.reason}`);
+      process.exit(1);
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 392 - 397, The immediate call to
checkCgroupConfig({ runtime }) after remediation can fail transiently while the
daemon (Docker/Colima) is still starting; replace the single immediate recheck
(recheck and recheck.ok handling) with a short retry loop that polls
checkCgroupConfig({ runtime }) with a small sleep/backoff (e.g., try every 1–3s
up to ~5–10 attempts) and only call process.exit(1) if all retries still return
!recheck.ok; keep the existing error message (including recheck.reason) on the
final failure so the behavior around checkCgroupConfig, runtime, recheck.ok and
process.exit is preserved but tolerant to startup lag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 377-390: The fallback branch that runs spawnSync("sudo", ["-E",
"bash", path.join(SCRIPTS, "setup-spark.sh")]) is assuming
non-"colima"/non-"docker-desktop" runtimes are Linux; update the logic in the
block that currently uses the else branch (the Docker runtime check handling) to
first assert process.platform === "linux" before executing the setup-spark
remediation, and for non-Linux platforms (e.g., macOS where runtime may be
"unknown") print a clear message that automatic remediation is not available and
advise the user to run platform-specific steps; adjust the condition around the
spawnSync call and the messages (references: the variables/runtime-check, the
spawnSync call, SCRIPTS, setup-spark.sh) accordingly.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 392-397: The immediate call to checkCgroupConfig({ runtime })
after remediation can fail transiently while the daemon (Docker/Colima) is still
starting; replace the single immediate recheck (recheck and recheck.ok handling)
with a short retry loop that polls checkCgroupConfig({ runtime }) with a small
sleep/backoff (e.g., try every 1–3s up to ~5–10 attempts) and only call
process.exit(1) if all retries still return !recheck.ok; keep the existing error
message (including recheck.reason) on the final failure so the behavior around
checkCgroupConfig, runtime, recheck.ok and process.exit is preserved but
tolerant to startup lag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9ba52ea1-cce8-491d-9415-2096e4a72146

📥 Commits

Reviewing files that changed from the base of the PR and between 335ef85 and c430582.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

Comment on lines +377 to +390
} else {
// Linux — auto-fix via setup-spark
console.log(" Fixing: running setup-spark to configure Docker for cgroupns=host...");
const sparkResult = spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")], {
stdio: "inherit",
timeout: 120000,
});
if (sparkResult.status !== 0) {
console.error(" Failed to configure Docker. Please run manually:");
console.error(" sudo nemoclaw setup-spark");
process.exit(1);
}
console.log(" ✓ Docker configured for cgroupns=host");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard Linux-only remediation by OS, not just runtime label.

At Line 377, the fallback branch treats every non-colima/non-docker-desktop runtime as Linux. If runtime detection returns "unknown" on macOS, this will run setup-spark.sh and show incorrect remediation.

Proposed fix
-    } else {
-      // Linux — auto-fix via setup-spark
-      console.log("  Fixing: running setup-spark to configure Docker for cgroupns=host...");
-      const sparkResult = spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")], {
-        stdio: "inherit",
-        timeout: 120000,
-      });
-      if (sparkResult.status !== 0) {
-        console.error("  Failed to configure Docker. Please run manually:");
-        console.error("    sudo nemoclaw setup-spark");
-        process.exit(1);
-      }
-      console.log("  ✓ Docker configured for cgroupns=host");
+    } else if (process.platform === "linux") {
+      // Linux — auto-fix via setup-spark
+      console.log("  Fixing: running setup-spark to configure Docker for cgroupns=host...");
+      const sparkResult = spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")], {
+        stdio: "inherit",
+        timeout: 120000,
+      });
+      if (sparkResult.status !== 0) {
+        console.error("  Failed to configure Docker. Please run manually:");
+        console.error("    sudo nemoclaw setup-spark");
+        process.exit(1);
+      }
+      console.log("  ✓ Docker configured for cgroupns=host");
+    } else {
+      console.error("  Could not auto-fix cgroup configuration for this platform/runtime.");
+      console.error("  Please configure Docker to use cgroupns=host and re-run onboard.");
+      process.exit(1);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 377 - 390, The fallback branch that runs
spawnSync("sudo", ["-E", "bash", path.join(SCRIPTS, "setup-spark.sh")]) is
assuming non-"colima"/non-"docker-desktop" runtimes are Linux; update the logic
in the block that currently uses the else branch (the Docker runtime check
handling) to first assert process.platform === "linux" before executing the
setup-spark remediation, and for non-Linux platforms (e.g., macOS where runtime
may be "unknown") print a clear message that automatic remediation is not
available and advise the user to run platform-specific steps; adjust the
condition around the spawnSync call and the messages (references: the
variables/runtime-check, the spawnSync call, SCRIPTS, setup-spark.sh)
accordingly.

@ericksoa
Copy link
Contributor Author

Note: OpenShell fixed this server-side in v0.0.7+ (NVIDIA/OpenShell#329) — the gateway container now explicitly requests cgroupns_mode: host regardless of Docker daemon config.

Users on OpenShell 0.0.7+ should not hit this. The preflight check in this PR is belt-and-suspenders for users on older OpenShell versions. Upgrading is the simplest fix:

bash scripts/install-openshell.sh
openshell --version  # should be 0.0.7+
nemoclaw onboard

@ericksoa
Copy link
Contributor Author

Closing — the root fix is upgrading OpenShell to 0.0.7+ which handles cgroupns server-side. Replacing with a PR that upgrades openshell when the installed version is too old.

@ericksoa ericksoa closed this Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway failure on cgroup v2: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory

1 participant