Skip to content

fix: remove cgroup v2 preflight check#248

Merged
ericksoa merged 1 commit intoNVIDIA:mainfrom
brevdev:remove-cgroup-preflight-check
Mar 18, 2026
Merged

fix: remove cgroup v2 preflight check#248
ericksoa merged 1 commit intoNVIDIA:mainfrom
brevdev:remove-cgroup-preflight-check

Conversation

@theFong
Copy link
Contributor

@theFong theFong commented Mar 17, 2026

Summary

  • Remove the cgroup v2 / cgroupns=host preflight check from onboarding — this issue has been fixed on the OpenShell side
  • Delete bin/lib/preflight.js and test/preflight.test.js
  • Remove the cgroup check block and import from bin/lib/onboard.js

Per @drew feedback: the cgroup fix is handled upstream, so NemoClaw doesn't need to check for it.

The cgroup v2 / cgroupns=host issue has been fixed on the OpenShell
side, so NemoClaw no longer needs to check for it during onboarding.

Removes bin/lib/preflight.js, its tests, and the check in onboard.js.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This change removes the cgroup v2 and Docker cgroupns validation logic from the onboarding system. The preflight.js module containing cgroup detection and daemon.json parsing is deleted entirely, along with its integration in onboard.js and corresponding test suite.

Changes

Cohort / File(s) Summary
Preflight Module Deletion
bin/lib/preflight.js
Entire module deleted, eliminating cgroup v2 detection (isCgroupV2), daemon.json reading (readDaemonJson), and cgroup configuration validation (checkCgroupConfig) functions and their exports.
Onboarding Integration
bin/lib/onboard.js
Removed import of checkCgroupConfig from preflight; removed the entire cgroup v2 + Docker cgroupns validation block including console.error messages and process.exit(1) handling.
Test Suite Cleanup
test/preflight.test.js
Entire test file deleted, eliminating unit tests for cgroup detection, daemon.json parsing, and cgroup configuration validation scenarios.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 Hopping past the cgroup checks we go,
No more daemon.json strings to slow,
Preflight fades like morning dew,
Simpler onboarding, fresh and new! 🌱

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title 'fix: remove cgroup v2 preflight check' accurately describes the main change: removing the cgroup v2 preflight validation logic from the codebase.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can customize the tone of the review comments and chat replies.

Configure the tone_instructions setting to customize the tone of the review comments and chat replies. For example, you can set the tone to Act like a strict teacher, Act like a pirate and more.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good thanks!

How do we enforce folks have a newer openshell? Otherwise they will just run into the old bug if we remove this. We should add a preflight check for openshell version.

@drew
Copy link

drew commented Mar 17, 2026

PR looks good thanks!

How do we enforce folks have a newer openshell? Otherwise they will just run into the old bug if we remove this. We should add a preflight check for openshell version.

Checking Nemoclaw users are running the latest OpenShell would be awesome. That's good blanket guidance for basically any OpenShell type error 😄

ericksoa
ericksoa previously approved these changes Mar 17, 2026
Copy link
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — upstream fix confirmed by Brev team.

@ericksoa ericksoa dismissed their stale review March 18, 2026 00:01

Needs to also remove or update setup-spark references — if cgroup is handled upstream, setup-spark.sh and its CLI command are dead code.

Copy link
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to update to version check on this PR

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need a version check on openshell as part of this PR.

Copy link
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — removing the check to force upgrades to openshell v0.0.8+. Still needs setup-spark cleanup per my earlier comment, or we can do that in a follow-up.

@ericksoa ericksoa dismissed jacobtomlinson’s stale review March 18, 2026 01:00

If we decide to add the version check, we can reopen 258 :)

@ericksoa ericksoa merged commit 0d1d2d8 into NVIDIA:main Mar 18, 2026
1 check passed
@brianwtaylor brianwtaylor mentioned this pull request Mar 18, 2026
kagura-agent pushed a commit to kagura-agent/NemoClaw that referenced this pull request Mar 18, 2026
…tibility

On cgroup v2 hosts (Ubuntu 24.04, DGX Spark), k3s-in-Docker requires
cgroupns=host in the Docker daemon config. Without this, kubelet fails
to initialize, the 'openshell' namespace is never created, and gateway
startup times out with:
  'timed out waiting for namespace openshell to exist'

This was previously caught by a preflight check (NVIDIA#62) that was removed
in NVIDIA#248. This commit restores the check in the onboard preflight:

- Detects cgroup v2 by reading /proc/mounts for 'cgroup2'
- Checks /etc/docker/daemon.json for 'default-cgroupns-mode: host'
- On mismatch, exits with a clear error and remediation steps
  (run 'sudo nemoclaw setup-spark' or manually configure daemon.json)
- Shows a success message when cgroup v2 is properly configured

Also improves GPU display for DGX Spark: shows 'unified memory' instead
of 'VRAM' when the Spark GPU is detected (since GB10 uses unified memory
architecture where nvidia-smi cannot report dedicated VRAM).

Fixes NVIDIA#280
@drew
Copy link

drew commented Mar 18, 2026

Heads up — #280 was filed ~2 hours after this merged. The reporter is on a DGX Spark with openshell 0.0.10-dev.1 and hit the exact cgroup v2 / cgroupns failure this check was catching. Looks like the upstream fix hasn't shipped to all OpenShell versions yet.

What makes you say that @brianwtaylor? I don't see anything related to cgroups in that issue. We should ask for openshell doctor logs before making a final assesment.

@kagura-agent kagura-agent mentioned this pull request Mar 18, 2026
@brianwtaylor
Copy link
Contributor

@drew totally fair. got a little ahead when I noticed the reporter also on a spark. openshell logs is the right call, i can explore a bit

@drew
Copy link

drew commented Mar 18, 2026 via email

@brianwtaylor
Copy link
Contributor

brianwtaylor commented Mar 18, 2026

Thanks for linking #329 — that confirmed the timeline.

So I went and reproduced it. Pulled "default-cgroupns-mode": "host" from
daemon.json on a second DGX Spark (Ubuntu 24.04, cgroup v2, OpenShell 0.0.6),
ran nemoclaw onboard, got the identical namespace timeout. Gateway container
logs:

E cgroup_manager_linux.go:406 "Unhandled Error"
err="cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/pids.max: no
such file or directory"

E kubelet.go:1745 "Failed to start ContainerManager"
err="failed to initialize top level QOS containers: error validating root
container [kubepods] :
cgroup ["kubepods"] has some missing controllers: cpu, cpuset, hugetlb,
memory, pids"

The reporter's on openshell 0.0.10-dev.1+g13f13c2 — a dev snapshot from the
devel channel (cut March 5). Your cgroup fix
(NVIDIA/OpenShell#329) shipped in v0.0.8 on March 17,
so their build predates it despite the version number looking newer. Upgrading
to v0.0.10 should resolve it — I've pointed them there in
#280.

If you think it's worth the guardrail I can put something together.

tldr: just update openshell

-brian

@brianwtaylor
Copy link
Contributor

Thanks for the link to #329 — that's exactly the fix the reporter's dev build is missing.

Here's the output from reproduction on my Spark (with cgroupns-mode: host removed from daemon.json):

openshell doctor check:

Checking system prerequisites...

  Docker ............. ok (version 29.1.3)
  DOCKER_HOST ........ (not set, using default socket)

All checks passed.

Notably, doctor check passes clean — it doesn't inspect cgroup config.

docker info (relevant bits):

Cgroup Driver: systemd
Cgroup Version: 2
Operating System: Ubuntu 24.04.4 LTS
Architecture: aarch64
Kernel Version: 6.17.0-1008-nvidia

cgroup v2 confirmed, which is the trigger — k3s inside the gateway container can't find v1 cgroup paths without cgroupns=host.

Ryuketsukami pushed a commit to Ryuketsukami/NemoClaw that referenced this pull request Mar 24, 2026
The cgroup v2 / cgroupns=host issue has been fixed on the OpenShell
side, so NemoClaw no longer needs to check for it during onboarding.

Removes bin/lib/preflight.js, its tests, and the check in onboard.js.
jessesanford pushed a commit to jessesanford/NemoClaw that referenced this pull request Mar 24, 2026
The cgroup v2 / cgroupns=host issue has been fixed on the OpenShell
side, so NemoClaw no longer needs to check for it during onboarding.

Removes bin/lib/preflight.js, its tests, and the check in onboard.js.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants