fix: install failed by kagura-agent · Pull Request #288 · NVIDIA/NemoClaw

kagura-agent · 2026-03-18T05:04:50Z

Fixes #280

Automated PR via GoGetAJob

Summary by CodeRabbit

New Features
- Added cgroup v2 compatibility detection with automated and manual remediation guidance.
- Extended GPU detection messaging for DGX Spark configurations.
Tests
- Added comprehensive test coverage for CLI connect, logs, and configuration features.
- Introduced test helper utilities for improved test infrastructure.

Add 35 new tests across 4 test files covering critical paths that previously had zero test coverage: - onboard/validate: API key validation (success, HTTP errors, timeouts, network failures) and key masking - onboard/config: save/load round-trip, missing file handling, config clearing - commands/logs: sandbox state detection, spawn argument construction, --follow/--lines/--run-id flags, error handling - commands/connect: spawn arguments, openshell errors, exit codes All tests follow the existing status.test.ts conventions (vitest, vi.mock, captureLogger). Total suite: 57 tests (22 existing + 35 new). No source files modified.

Address code review feedback: - Extract captureLogger and mockSpawnProc into src/test-helpers.ts to eliminate duplication between connect.test.ts and logs.test.ts - Replace realistic API key fixture ('nvapi-...') in validate.test.ts with 'test-api-key-placeholder' to avoid secret scanner triggers

Address second CodeRabbit review: spawn errors resolve with exit code 1, which triggers the follow-up 'exited with code' guidance. Tests now assert both the error message and the exit guidance. Also explain why mockExec's callback shape is correct — Node's promisify(exec) resolves with { stdout, stderr }, matching the existing status.test.ts pattern.

…tibility On cgroup v2 hosts (Ubuntu 24.04, DGX Spark), k3s-in-Docker requires cgroupns=host in the Docker daemon config. Without this, kubelet fails to initialize, the 'openshell' namespace is never created, and gateway startup times out with: 'timed out waiting for namespace openshell to exist' This was previously caught by a preflight check (NVIDIA#62) that was removed in NVIDIA#248. This commit restores the check in the onboard preflight: - Detects cgroup v2 by reading /proc/mounts for 'cgroup2' - Checks /etc/docker/daemon.json for 'default-cgroupns-mode: host' - On mismatch, exits with a clear error and remediation steps (run 'sudo nemoclaw setup-spark' or manually configure daemon.json) - Shows a success message when cgroup v2 is properly configured Also improves GPU display for DGX Spark: shows 'unified memory' instead of 'VRAM' when the Spark GPU is detected (since GB10 uses unified memory architecture where nvidia-smi cannot report dedicated VRAM). Fixes NVIDIA#280

coderabbitai · 2026-03-18T05:05:07Z

📝 Walkthrough

Walkthrough

The changes introduce Docker cgroup v2 compatibility detection within the onboarding preflight checks, extend GPU detection messaging to handle DGX Spark systems, and add comprehensive test coverage for the CLI connect and logs commands alongside configuration and validation functionality testing.

Changes

Cohort / File(s)	Summary
Onboarding Cgroup & GPU Detection `bin/lib/onboard.js`	Adds checkDockerCgroupCompat() to detect cgroup v2 and missing cgroupns=host configuration. Integrates check into preflight() to exit on incompatibility or log success. Extends GPU detection to log DGX Spark-specific messaging with GPU count and unified memory info.
CLI Command Tests `nemoclaw/src/commands/connect.test.ts`, `nemoclaw/src/commands/logs.test.ts`	Comprehensive test suites for connect and logs CLI functionality. Mocks child_process.spawn, validates correct invocation with arguments, stdio inheritance, error handling (ENOENT and non-ENOENT paths), exit code reporting, and logging behavior.
Onboarding Configuration & Validation Tests `nemoclaw/src/onboard/config.test.ts`, `nemoclaw/src/onboard/validate.test.ts`	Test modules for config persistence (load/save/clear roundtrips) and API key validation with fetch mocking, error handling (401/403/500), timeout/network failures, and masking logic for various key formats.
Test Utilities `nemoclaw/src/test-helpers.ts`	Exports captureLogger() for capturing log messages and mockSpawnProc() for simulating child process behavior in tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Hooray for tests and Docker checks so fine,
Cgroups aligned in perfect line,
DGX Spark now shines so bright,
Logs and configs pass the test's light,
NemoClaw hops forward with delight!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title 'fix: install failed' is vague and generic. It doesn't convey the specific root cause (cgroup v2 preflight check) or the actual fix implemented.	Use a more descriptive title that reflects the actual fix, such as 'fix: add cgroup v2 compatibility check for Docker namespace configuration' or 'fix: restore cgroup preflight validation to prevent gateway timeout.'
Linked Issues check	❓ Inconclusive	The linked issue `#280` describes a namespace timeout during gateway initialization. The PR changes include cgroup v2 checks and test coverage, but it's unclear whether the cgroup fix directly resolves the namespace timeout root cause.	Clarify in the PR description how the cgroup v2 preflight check prevents the 'timed out waiting for namespace' error, or verify that the Docker environment prerequisite was the actual cause of the failure.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Out of Scope Changes check	✅ Passed	All changes are scoped to onboarding infrastructure: cgroup checks, GPU detection messaging refinements, and comprehensive test coverage for onboard/logs/connect modules. No unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

nemoclaw/src/onboard/validate.test.ts (1)

54-58: Make the headers matcher less brittle.

headers is currently asserted as an exact object value. If request options later include extra headers, this test can fail for a non-functional change. Prefer nested partial matching.

♻️ Proposed change

     expect(mockFetch).toHaveBeenCalledWith(
       "https://integrate.api.nvidia.com/v1/models",
       expect.objectContaining({
-        headers: { Authorization: `Bearer ${apiKey}` },
+        headers: expect.objectContaining({
+          Authorization: `Bearer ${apiKey}`,
+        }),
       }),
     );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/onboard/validate.test.ts` around lines 54 - 58, The test
currently asserts headers as an exact object which is brittle; update the
assertion in the expect(mockFetch).toHaveBeenCalledWith call to use nested
partial matching so only the Authorization header is required — replace the
headers value with expect.objectContaining({ Authorization: `Bearer ${apiKey}`
}) (i.e., use expect.objectContaining for the overall options and for the
headers field) so additional headers won't break the test while still verifying
Authorization; locate the call referencing mockFetch and adjust the matcher
accordingly.

nemoclaw/src/commands/logs.test.ts (1)

260-264: Strengthen sandbox-resolution verification by asserting the spawn target too.

This test currently verifies exec contains my-sandbox, but it can still miss regressions where spawn("openshell", ["sandbox","connect", ...]) uses the default sandbox.

Suggested assertion addition

       expect(exec).toHaveBeenCalledWith(
         expect.stringContaining("my-sandbox"),
         expect.anything(),
         expect.anything(),
       );
+      expect(spawn).toHaveBeenCalledWith(
+        "openshell",
+        expect.arrayContaining(["sandbox", "connect", "my-sandbox"]),
+        expect.anything(),
+      );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/commands/logs.test.ts` around lines 260 - 264, The test
currently only checks exec was called with a string containing "my-sandbox";
also assert the spawn target/args to ensure the correct sandbox is passed: add
an assertion on the exec call arguments (the array/args parameter passed to the
spawn-like call) to include "sandbox", "connect", and "my-sandbox" (e.g., using
expect.arrayContaining on the second argument to exec), referencing the existing
exec call and the expected spawn invocation ("openshell", ["sandbox","connect",
...]) to catch regressions that use the default sandbox.

nemoclaw/src/test-helpers.ts (1)

11-11: Update the docstring to match actual logger capture behavior.

The helper captures info, warn, and error (not only info). A quick wording tweak will avoid confusion.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/test-helpers.ts` at line 11, Docstring mismatch: the comment
"Create a logger that captures all info() calls into an array." is inaccurate
because the test logger captures info(), warn(), and error(); update that
docstring in nemoclaw/src/test-helpers.ts to explicitly state it captures
info(), warn(), and error() calls into the array (i.e., replace "info() calls"
with "info(), warn(), and error() calls") so the comment matches the actual
behavior of the test logger helper.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 45-52: The current logic infers isCgroupV2 by reading /proc/mounts
via mountInfo and readFileSync("/proc/mounts"), which gives false positives on
hybrid hosts; replace that heuristic in the isCgroupV2 determination with
Docker's reported cgroup mode by invoking docker info --format
'{{.CgroupVersion}}' and treating "2" as v2, and additionally inspect daemon
config files (/etc/docker/daemon.json and ~/.config/docker/daemon.json) for
cgroupns/cgroup_version settings to respect explicit daemon config; update the
code paths that set isCgroupV2 and the returned needsFix flag (symbols:
isCgroupV2, the block that currently reads mountInfo and the surrounding
try/catch) to use Docker's output and daemon.json checks and preserve existing
error handling and fallback behavior.

---

Nitpick comments:
In `@nemoclaw/src/commands/logs.test.ts`:
- Around line 260-264: The test currently only checks exec was called with a
string containing "my-sandbox"; also assert the spawn target/args to ensure the
correct sandbox is passed: add an assertion on the exec call arguments (the
array/args parameter passed to the spawn-like call) to include "sandbox",
"connect", and "my-sandbox" (e.g., using expect.arrayContaining on the second
argument to exec), referencing the existing exec call and the expected spawn
invocation ("openshell", ["sandbox","connect", ...]) to catch regressions that
use the default sandbox.

In `@nemoclaw/src/onboard/validate.test.ts`:
- Around line 54-58: The test currently asserts headers as an exact object which
is brittle; update the assertion in the expect(mockFetch).toHaveBeenCalledWith
call to use nested partial matching so only the Authorization header is required
— replace the headers value with expect.objectContaining({ Authorization:
`Bearer ${apiKey}` }) (i.e., use expect.objectContaining for the overall options
and for the headers field) so additional headers won't break the test while
still verifying Authorization; locate the call referencing mockFetch and adjust
the matcher accordingly.

In `@nemoclaw/src/test-helpers.ts`:
- Line 11: Docstring mismatch: the comment "Create a logger that captures all
info() calls into an array." is inaccurate because the test logger captures
info(), warn(), and error(); update that docstring in
nemoclaw/src/test-helpers.ts to explicitly state it captures info(), warn(), and
error() calls into the array (i.e., replace "info() calls" with "info(), warn(),
and error() calls") so the comment matches the actual behavior of the test
logger helper.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: add686c4-8bfc-41a1-bf82-615e9ad24081

📥 Commits

Reviewing files that changed from the base of the PR and between 1e23347 and 6ba3337.

📒 Files selected for processing (6)

bin/lib/onboard.js
nemoclaw/src/commands/connect.test.ts
nemoclaw/src/commands/logs.test.ts
nemoclaw/src/onboard/config.test.ts
nemoclaw/src/onboard/validate.test.ts
nemoclaw/src/test-helpers.ts

coderabbitai · 2026-03-18T05:10:06Z

bin/lib/onboard.js

+    const mountInfo = fs.readFileSync("/proc/mounts", "utf-8");
+    // cgroup v2 mounts as "cgroup2" type on /sys/fs/cgroup
+    isCgroupV2 = mountInfo.includes("cgroup2");
+  } catch {
+    return { needsFix: false, isCgroupV2: false };
+  }
+
+  if (!isCgroupV2) return { needsFix: false, isCgroupV2: false };


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Docker-reported cgroup version:" docker info --format '{{.CgroupVersion}}' || echo "docker_info_unavailable" echo echo "cgroup2 entries in /proc/mounts:" grep -n 'cgroup2' /proc/mounts || true echo echo "default-cgroupns-mode in common daemon configs:" for f in /etc/docker/daemon.json "$HOME/.config/docker/daemon.json"; do if [ -r "$f" ]; then val=$(jq -r '."default-cgroupns-mode" // "unset"' "$f" 2>/dev/null || echo "parse_error") echo "$f -> $val" else echo "$f -> missing_or_unreadable" fi done

Repository: NVIDIA/NemoClaw

Length of output: 375

🏁 Script executed:

# First, locate and examine the onboard.js file fd -t f "onboard.js" --exec cat -n {} \;

Repository: NVIDIA/NemoClaw

Length of output: 50371

🏁 Script executed:

# Find and examine runner.js fd -t f "runner.js" --exec head -100 {} \;

Repository: NVIDIA/NemoClaw

Length of output: 1576

🏁 Script executed:

# Search for docker info usage rg "docker info" --type js --type ts -B 2 -A 2

Repository: NVIDIA/NemoClaw

Length of output: 274

Use Docker-reported cgroup mode to avoid false preflight failures.

Line 47 infers cgroup v2 from /proc/mounts (includes("cgroup2")), which can be true on hybrid hosts even when Docker is not using cgroup v2. This causes a false-positive needsFix=true that incorrectly blocks onboarding on systems like Ubuntu 24.04 or DGX Spark where the kernel supports cgroup v2 but Docker is configured to use v1.

Replace the /proc/mounts heuristic with docker info --format '{{.CgroupVersion}}' to detect Docker's actual cgroup mode. Also check both /etc/docker/daemon.json and ~/.config/docker/daemon.json for the cgroupns setting, as daemon configs can be in multiple standard locations.

Suggested patch

function checkDockerCgroupCompat() { if (process.platform !== "linux") return { needsFix: false, isCgroupV2: false }; - // Check if cgroup v2 (unified hierarchy) - let isCgroupV2 = false; - try { - const mountInfo = fs.readFileSync("/proc/mounts", "utf-8"); - // cgroup v2 mounts as "cgroup2" type on /sys/fs/cgroup - isCgroupV2 = mountInfo.includes("cgroup2"); - } catch { - return { needsFix: false, isCgroupV2: false }; - } + // Prefer Docker's actual runtime cgroup mode over host mount heuristics. + const cgroupVersion = runCapture(`docker info --format '{{.CgroupVersion}}'`, { + ignoreError: true, + }).trim(); + const isCgroupV2 = cgroupVersion === "2"; if (!isCgroupV2) return { needsFix: false, isCgroupV2: false }; // Check Docker daemon config for default-cgroupns-mode try { - // Check daemon.json directly for the cgroupns setting let cgroupnsHost = false; - try { - const daemonJson = JSON.parse(fs.readFileSync("/etc/docker/daemon.json", "utf-8")); - cgroupnsHost = daemonJson["default-cgroupns-mode"] === "host"; - } catch { - // No daemon.json or parse error — not configured - } + const daemonConfigPaths = ["/etc/docker/daemon.json"]; + if (process.env.HOME) { + daemonConfigPaths.push(path.join(process.env.HOME, ".config", "docker", "daemon.json")); + } + for (const configPath of daemonConfigPaths) { + try { + const daemonJson = JSON.parse(fs.readFileSync(configPath, "utf-8")); + cgroupnsHost = daemonJson["default-cgroupns-mode"] === "host"; + if (cgroupnsHost) break; + } catch { + // Try next known daemon config location + } + } if (cgroupnsHost) return { needsFix: false, isCgroupV2: true }; // Cgroup v2 without cgroupns=host — this will break k3s-in-Docker return { needsFix: true, isCgroupV2: true };

Also applies to: 59-60

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 45 - 52, The current logic infers isCgroupV2 by reading /proc/mounts via mountInfo and readFileSync("/proc/mounts"), which gives false positives on hybrid hosts; replace that heuristic in the isCgroupV2 determination with Docker's reported cgroup mode by invoking docker info --format '{{.CgroupVersion}}' and treating "2" as v2, and additionally inspect daemon config files (/etc/docker/daemon.json and ~/.config/docker/daemon.json) for cgroupns/cgroup_version settings to respect explicit daemon config; update the code paths that set isCgroupV2 and the returned needsFix flag (symbols: isCgroupV2, the block that currently reads mountInfo and the surrounding try/catch) to use Docker's output and daemon.json checks and preserve existing error handling and fallback behavior.

kagura-agent · 2026-03-18T05:23:24Z

Closing — this issue was already resolved by #248 (merged). The cgroup v2 preflight check has been removed entirely, making this fix redundant.

Kagura Chen added 4 commits March 18, 2026 11:08

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

kagura-agent closed this Mar 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: install failed#288

fix: install failed#288
kagura-agent wants to merge 4 commits intoNVIDIA:mainfrom
kagura-agent:fix/280-install-failed

kagura-agent commented Mar 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 18, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 18, 2026

Uh oh!

kagura-agent commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kagura-agent commented Mar 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

kagura-agent commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kagura-agent commented Mar 18, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 18, 2026 •

edited

Loading