Skip to content

fix: install failed#288

Closed
kagura-agent wants to merge 4 commits intoNVIDIA:mainfrom
kagura-agent:fix/280-install-failed
Closed

fix: install failed#288
kagura-agent wants to merge 4 commits intoNVIDIA:mainfrom
kagura-agent:fix/280-install-failed

Conversation

@kagura-agent
Copy link
Contributor

@kagura-agent kagura-agent commented Mar 18, 2026

Fixes #280

Automated PR via GoGetAJob

Summary by CodeRabbit

  • New Features

    • Added cgroup v2 compatibility detection with automated and manual remediation guidance.
    • Extended GPU detection messaging for DGX Spark configurations.
  • Tests

    • Added comprehensive test coverage for CLI connect, logs, and configuration features.
    • Introduced test helper utilities for improved test infrastructure.

Kagura Chen added 4 commits March 18, 2026 11:08
Add 35 new tests across 4 test files covering critical paths
that previously had zero test coverage:

- onboard/validate: API key validation (success, HTTP errors,
  timeouts, network failures) and key masking
- onboard/config: save/load round-trip, missing file handling,
  config clearing
- commands/logs: sandbox state detection, spawn argument
  construction, --follow/--lines/--run-id flags, error handling
- commands/connect: spawn arguments, openshell errors, exit codes

All tests follow the existing status.test.ts conventions (vitest,
vi.mock, captureLogger). Total suite: 57 tests (22 existing + 35 new).

No source files modified.
Address code review feedback:
- Extract captureLogger and mockSpawnProc into src/test-helpers.ts
  to eliminate duplication between connect.test.ts and logs.test.ts
- Replace realistic API key fixture ('nvapi-...') in validate.test.ts
  with 'test-api-key-placeholder' to avoid secret scanner triggers
Address second CodeRabbit review: spawn errors resolve with exit
code 1, which triggers the follow-up 'exited with code' guidance.
Tests now assert both the error message and the exit guidance.

Also explain why mockExec's callback shape is correct — Node's
promisify(exec) resolves with { stdout, stderr }, matching the
existing status.test.ts pattern.
…tibility

On cgroup v2 hosts (Ubuntu 24.04, DGX Spark), k3s-in-Docker requires
cgroupns=host in the Docker daemon config. Without this, kubelet fails
to initialize, the 'openshell' namespace is never created, and gateway
startup times out with:
  'timed out waiting for namespace openshell to exist'

This was previously caught by a preflight check (NVIDIA#62) that was removed
in NVIDIA#248. This commit restores the check in the onboard preflight:

- Detects cgroup v2 by reading /proc/mounts for 'cgroup2'
- Checks /etc/docker/daemon.json for 'default-cgroupns-mode: host'
- On mismatch, exits with a clear error and remediation steps
  (run 'sudo nemoclaw setup-spark' or manually configure daemon.json)
- Shows a success message when cgroup v2 is properly configured

Also improves GPU display for DGX Spark: shows 'unified memory' instead
of 'VRAM' when the Spark GPU is detected (since GB10 uses unified memory
architecture where nvidia-smi cannot report dedicated VRAM).

Fixes NVIDIA#280
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough

Walkthrough

The changes introduce Docker cgroup v2 compatibility detection within the onboarding preflight checks, extend GPU detection messaging to handle DGX Spark systems, and add comprehensive test coverage for the CLI connect and logs commands alongside configuration and validation functionality testing.

Changes

Cohort / File(s) Summary
Onboarding Cgroup & GPU Detection
bin/lib/onboard.js
Adds checkDockerCgroupCompat() to detect cgroup v2 and missing cgroupns=host configuration. Integrates check into preflight() to exit on incompatibility or log success. Extends GPU detection to log DGX Spark-specific messaging with GPU count and unified memory info.
CLI Command Tests
nemoclaw/src/commands/connect.test.ts, nemoclaw/src/commands/logs.test.ts
Comprehensive test suites for connect and logs CLI functionality. Mocks child_process.spawn, validates correct invocation with arguments, stdio inheritance, error handling (ENOENT and non-ENOENT paths), exit code reporting, and logging behavior.
Onboarding Configuration & Validation Tests
nemoclaw/src/onboard/config.test.ts, nemoclaw/src/onboard/validate.test.ts
Test modules for config persistence (load/save/clear roundtrips) and API key validation with fetch mocking, error handling (401/403/500), timeout/network failures, and masking logic for various key formats.
Test Utilities
nemoclaw/src/test-helpers.ts
Exports captureLogger() for capturing log messages and mockSpawnProc() for simulating child process behavior in tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Hooray for tests and Docker checks so fine,
Cgroups aligned in perfect line,
DGX Spark now shines so bright,
Logs and configs pass the test's light,
NemoClaw hops forward with delight!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title 'fix: install failed' is vague and generic. It doesn't convey the specific root cause (cgroup v2 preflight check) or the actual fix implemented. Use a more descriptive title that reflects the actual fix, such as 'fix: add cgroup v2 compatibility check for Docker namespace configuration' or 'fix: restore cgroup preflight validation to prevent gateway timeout.'
Linked Issues check ❓ Inconclusive The linked issue #280 describes a namespace timeout during gateway initialization. The PR changes include cgroup v2 checks and test coverage, but it's unclear whether the cgroup fix directly resolves the namespace timeout root cause. Clarify in the PR description how the cgroup v2 preflight check prevents the 'timed out waiting for namespace' error, or verify that the Docker environment prerequisite was the actual cause of the failure.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Out of Scope Changes check ✅ Passed All changes are scoped to onboarding infrastructure: cgroup checks, GPU detection messaging refinements, and comprehensive test coverage for onboard/logs/connect modules. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
nemoclaw/src/onboard/validate.test.ts (1)

54-58: Make the headers matcher less brittle.

headers is currently asserted as an exact object value. If request options later include extra headers, this test can fail for a non-functional change. Prefer nested partial matching.

♻️ Proposed change
     expect(mockFetch).toHaveBeenCalledWith(
       "https://integrate.api.nvidia.com/v1/models",
       expect.objectContaining({
-        headers: { Authorization: `Bearer ${apiKey}` },
+        headers: expect.objectContaining({
+          Authorization: `Bearer ${apiKey}`,
+        }),
       }),
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/onboard/validate.test.ts` around lines 54 - 58, The test
currently asserts headers as an exact object which is brittle; update the
assertion in the expect(mockFetch).toHaveBeenCalledWith call to use nested
partial matching so only the Authorization header is required — replace the
headers value with expect.objectContaining({ Authorization: `Bearer ${apiKey}`
}) (i.e., use expect.objectContaining for the overall options and for the
headers field) so additional headers won't break the test while still verifying
Authorization; locate the call referencing mockFetch and adjust the matcher
accordingly.
nemoclaw/src/commands/logs.test.ts (1)

260-264: Strengthen sandbox-resolution verification by asserting the spawn target too.

This test currently verifies exec contains my-sandbox, but it can still miss regressions where spawn("openshell", ["sandbox","connect", ...]) uses the default sandbox.

Suggested assertion addition
       expect(exec).toHaveBeenCalledWith(
         expect.stringContaining("my-sandbox"),
         expect.anything(),
         expect.anything(),
       );
+      expect(spawn).toHaveBeenCalledWith(
+        "openshell",
+        expect.arrayContaining(["sandbox", "connect", "my-sandbox"]),
+        expect.anything(),
+      );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/commands/logs.test.ts` around lines 260 - 264, The test
currently only checks exec was called with a string containing "my-sandbox";
also assert the spawn target/args to ensure the correct sandbox is passed: add
an assertion on the exec call arguments (the array/args parameter passed to the
spawn-like call) to include "sandbox", "connect", and "my-sandbox" (e.g., using
expect.arrayContaining on the second argument to exec), referencing the existing
exec call and the expected spawn invocation ("openshell", ["sandbox","connect",
...]) to catch regressions that use the default sandbox.
nemoclaw/src/test-helpers.ts (1)

11-11: Update the docstring to match actual logger capture behavior.

The helper captures info, warn, and error (not only info). A quick wording tweak will avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/test-helpers.ts` at line 11, Docstring mismatch: the comment
"Create a logger that captures all info() calls into an array." is inaccurate
because the test logger captures info(), warn(), and error(); update that
docstring in nemoclaw/src/test-helpers.ts to explicitly state it captures
info(), warn(), and error() calls into the array (i.e., replace "info() calls"
with "info(), warn(), and error() calls") so the comment matches the actual
behavior of the test logger helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 45-52: The current logic infers isCgroupV2 by reading /proc/mounts
via mountInfo and readFileSync("/proc/mounts"), which gives false positives on
hybrid hosts; replace that heuristic in the isCgroupV2 determination with
Docker's reported cgroup mode by invoking docker info --format
'{{.CgroupVersion}}' and treating "2" as v2, and additionally inspect daemon
config files (/etc/docker/daemon.json and ~/.config/docker/daemon.json) for
cgroupns/cgroup_version settings to respect explicit daemon config; update the
code paths that set isCgroupV2 and the returned needsFix flag (symbols:
isCgroupV2, the block that currently reads mountInfo and the surrounding
try/catch) to use Docker's output and daemon.json checks and preserve existing
error handling and fallback behavior.

---

Nitpick comments:
In `@nemoclaw/src/commands/logs.test.ts`:
- Around line 260-264: The test currently only checks exec was called with a
string containing "my-sandbox"; also assert the spawn target/args to ensure the
correct sandbox is passed: add an assertion on the exec call arguments (the
array/args parameter passed to the spawn-like call) to include "sandbox",
"connect", and "my-sandbox" (e.g., using expect.arrayContaining on the second
argument to exec), referencing the existing exec call and the expected spawn
invocation ("openshell", ["sandbox","connect", ...]) to catch regressions that
use the default sandbox.

In `@nemoclaw/src/onboard/validate.test.ts`:
- Around line 54-58: The test currently asserts headers as an exact object which
is brittle; update the assertion in the expect(mockFetch).toHaveBeenCalledWith
call to use nested partial matching so only the Authorization header is required
— replace the headers value with expect.objectContaining({ Authorization:
`Bearer ${apiKey}` }) (i.e., use expect.objectContaining for the overall options
and for the headers field) so additional headers won't break the test while
still verifying Authorization; locate the call referencing mockFetch and adjust
the matcher accordingly.

In `@nemoclaw/src/test-helpers.ts`:
- Line 11: Docstring mismatch: the comment "Create a logger that captures all
info() calls into an array." is inaccurate because the test logger captures
info(), warn(), and error(); update that docstring in
nemoclaw/src/test-helpers.ts to explicitly state it captures info(), warn(), and
error() calls into the array (i.e., replace "info() calls" with "info(), warn(),
and error() calls") so the comment matches the actual behavior of the test
logger helper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: add686c4-8bfc-41a1-bf82-615e9ad24081

📥 Commits

Reviewing files that changed from the base of the PR and between 1e23347 and 6ba3337.

📒 Files selected for processing (6)
  • bin/lib/onboard.js
  • nemoclaw/src/commands/connect.test.ts
  • nemoclaw/src/commands/logs.test.ts
  • nemoclaw/src/onboard/config.test.ts
  • nemoclaw/src/onboard/validate.test.ts
  • nemoclaw/src/test-helpers.ts

Comment on lines +45 to +52
const mountInfo = fs.readFileSync("/proc/mounts", "utf-8");
// cgroup v2 mounts as "cgroup2" type on /sys/fs/cgroup
isCgroupV2 = mountInfo.includes("cgroup2");
} catch {
return { needsFix: false, isCgroupV2: false };
}

if (!isCgroupV2) return { needsFix: false, isCgroupV2: false };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Docker-reported cgroup version:"
docker info --format '{{.CgroupVersion}}' || echo "docker_info_unavailable"

echo
echo "cgroup2 entries in /proc/mounts:"
grep -n 'cgroup2' /proc/mounts || true

echo
echo "default-cgroupns-mode in common daemon configs:"
for f in /etc/docker/daemon.json "$HOME/.config/docker/daemon.json"; do
  if [ -r "$f" ]; then
    val=$(jq -r '."default-cgroupns-mode" // "unset"' "$f" 2>/dev/null || echo "parse_error")
    echo "$f -> $val"
  else
    echo "$f -> missing_or_unreadable"
  fi
done

Repository: NVIDIA/NemoClaw

Length of output: 375


🏁 Script executed:

# First, locate and examine the onboard.js file
fd -t f "onboard.js" --exec cat -n {} \;

Repository: NVIDIA/NemoClaw

Length of output: 50371


🏁 Script executed:

# Find and examine runner.js
fd -t f "runner.js" --exec head -100 {} \;

Repository: NVIDIA/NemoClaw

Length of output: 1576


🏁 Script executed:

# Search for docker info usage
rg "docker info" --type js --type ts -B 2 -A 2

Repository: NVIDIA/NemoClaw

Length of output: 274


Use Docker-reported cgroup mode to avoid false preflight failures.

Line 47 infers cgroup v2 from /proc/mounts (includes("cgroup2")), which can be true on hybrid hosts even when Docker is not using cgroup v2. This causes a false-positive needsFix=true that incorrectly blocks onboarding on systems like Ubuntu 24.04 or DGX Spark where the kernel supports cgroup v2 but Docker is configured to use v1.

Replace the /proc/mounts heuristic with docker info --format '{{.CgroupVersion}}' to detect Docker's actual cgroup mode. Also check both /etc/docker/daemon.json and ~/.config/docker/daemon.json for the cgroupns setting, as daemon configs can be in multiple standard locations.

Suggested patch
 function checkDockerCgroupCompat() {
   if (process.platform !== "linux") return { needsFix: false, isCgroupV2: false };

-  // Check if cgroup v2 (unified hierarchy)
-  let isCgroupV2 = false;
-  try {
-    const mountInfo = fs.readFileSync("/proc/mounts", "utf-8");
-    // cgroup v2 mounts as "cgroup2" type on /sys/fs/cgroup
-    isCgroupV2 = mountInfo.includes("cgroup2");
-  } catch {
-    return { needsFix: false, isCgroupV2: false };
-  }
+  // Prefer Docker's actual runtime cgroup mode over host mount heuristics.
+  const cgroupVersion = runCapture(`docker info --format '{{.CgroupVersion}}'`, {
+    ignoreError: true,
+  }).trim();
+  const isCgroupV2 = cgroupVersion === "2";

   if (!isCgroupV2) return { needsFix: false, isCgroupV2: false };

   // Check Docker daemon config for default-cgroupns-mode
   try {
-    // Check daemon.json directly for the cgroupns setting
     let cgroupnsHost = false;
-    try {
-      const daemonJson = JSON.parse(fs.readFileSync("/etc/docker/daemon.json", "utf-8"));
-      cgroupnsHost = daemonJson["default-cgroupns-mode"] === "host";
-    } catch {
-      // No daemon.json or parse error — not configured
-    }
+    const daemonConfigPaths = ["/etc/docker/daemon.json"];
+    if (process.env.HOME) {
+      daemonConfigPaths.push(path.join(process.env.HOME, ".config", "docker", "daemon.json"));
+    }
+    for (const configPath of daemonConfigPaths) {
+      try {
+        const daemonJson = JSON.parse(fs.readFileSync(configPath, "utf-8"));
+        cgroupnsHost = daemonJson["default-cgroupns-mode"] === "host";
+        if (cgroupnsHost) break;
+      } catch {
+        // Try next known daemon config location
+      }
+    }

     if (cgroupnsHost) return { needsFix: false, isCgroupV2: true };

     // Cgroup v2 without cgroupns=host — this will break k3s-in-Docker
     return { needsFix: true, isCgroupV2: true };

Also applies to: 59-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 45 - 52, The current logic infers isCgroupV2
by reading /proc/mounts via mountInfo and readFileSync("/proc/mounts"), which
gives false positives on hybrid hosts; replace that heuristic in the isCgroupV2
determination with Docker's reported cgroup mode by invoking docker info
--format '{{.CgroupVersion}}' and treating "2" as v2, and additionally inspect
daemon config files (/etc/docker/daemon.json and ~/.config/docker/daemon.json)
for cgroupns/cgroup_version settings to respect explicit daemon config; update
the code paths that set isCgroupV2 and the returned needsFix flag (symbols:
isCgroupV2, the block that currently reads mountInfo and the surrounding
try/catch) to use Docker's output and daemon.json checks and preserve existing
error handling and fallback behavior.

@kagura-agent
Copy link
Contributor Author

Closing — this issue was already resolved by #248 (merged). The cgroup v2 preflight check has been removed entirely, making this fix redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

install failed

1 participant