fix: remove cgroup v2 preflight check#248
Conversation
The cgroup v2 / cgroupns=host issue has been fixed on the OpenShell side, so NemoClaw no longer needs to check for it during onboarding. Removes bin/lib/preflight.js, its tests, and the check in onboard.js.
📝 WalkthroughWalkthroughThis change removes the cgroup v2 and Docker cgroupns validation logic from the onboarding system. The preflight.js module containing cgroup detection and daemon.json parsing is deleted entirely, along with its integration in onboard.js and corresponding test suite. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip You can customize the tone of the review comments and chat replies.Configure the |
jacobtomlinson
left a comment
There was a problem hiding this comment.
PR looks good thanks!
How do we enforce folks have a newer openshell? Otherwise they will just run into the old bug if we remove this. We should add a preflight check for openshell version.
Checking Nemoclaw users are running the latest OpenShell would be awesome. That's good blanket guidance for basically any OpenShell type error 😄 |
ericksoa
left a comment
There was a problem hiding this comment.
LGTM — upstream fix confirmed by Brev team.
Needs to also remove or update setup-spark references — if cgroup is handled upstream, setup-spark.sh and its CLI command are dead code.
jacobtomlinson
left a comment
There was a problem hiding this comment.
I think we still need a version check on openshell as part of this PR.
ericksoa
left a comment
There was a problem hiding this comment.
LGTM — removing the check to force upgrades to openshell v0.0.8+. Still needs setup-spark cleanup per my earlier comment, or we can do that in a follow-up.
If we decide to add the version check, we can reopen 258 :)
…tibility On cgroup v2 hosts (Ubuntu 24.04, DGX Spark), k3s-in-Docker requires cgroupns=host in the Docker daemon config. Without this, kubelet fails to initialize, the 'openshell' namespace is never created, and gateway startup times out with: 'timed out waiting for namespace openshell to exist' This was previously caught by a preflight check (NVIDIA#62) that was removed in NVIDIA#248. This commit restores the check in the onboard preflight: - Detects cgroup v2 by reading /proc/mounts for 'cgroup2' - Checks /etc/docker/daemon.json for 'default-cgroupns-mode: host' - On mismatch, exits with a clear error and remediation steps (run 'sudo nemoclaw setup-spark' or manually configure daemon.json) - Shows a success message when cgroup v2 is properly configured Also improves GPU display for DGX Spark: shows 'unified memory' instead of 'VRAM' when the Spark GPU is detected (since GB10 uses unified memory architecture where nvidia-smi cannot report dedicated VRAM). Fixes NVIDIA#280
What makes you say that @brianwtaylor? I don't see anything related to cgroups in that issue. We should ask for |
|
@drew totally fair. got a little ahead when I noticed the reporter also on a spark. openshell logs is the right call, i can explore a bit |
|
No worries. Here is the upstream fix in OpenShell for your reference, NVIDIA/OpenShell#329. We have tested this fix on a Spark. Would also be useful to get the output of `openshell doctor check` and maybe `docker info` as well.
|
|
Thanks for linking #329 — that confirmed the timeline. So I went and reproduced it. Pulled "default-cgroupns-mode": "host" from E cgroup_manager_linux.go:406 "Unhandled Error" E kubelet.go:1745 "Failed to start ContainerManager" The reporter's on openshell 0.0.10-dev.1+g13f13c2 — a dev snapshot from the If you think it's worth the guardrail I can put something together. tldr: just update openshell -brian |
|
Thanks for the link to #329 — that's exactly the fix the reporter's dev build is missing. Here's the output from reproduction on my Spark (with
Notably, doctor check passes clean — it doesn't inspect cgroup config.
cgroup v2 confirmed, which is the trigger — k3s inside the gateway container can't find v1 cgroup paths without |
The cgroup v2 / cgroupns=host issue has been fixed on the OpenShell side, so NemoClaw no longer needs to check for it during onboarding. Removes bin/lib/preflight.js, its tests, and the check in onboard.js.
The cgroup v2 / cgroupns=host issue has been fixed on the OpenShell side, so NemoClaw no longer needs to check for it during onboarding. Removes bin/lib/preflight.js, its tests, and the check in onboard.js.
Summary
bin/lib/preflight.jsandtest/preflight.test.jsbin/lib/onboard.jsPer @drew feedback: the cgroup fix is handled upstream, so NemoClaw doesn't need to check for it.