fix(gpu): add WSL2 GPU support and ollama provider on Linux #254

tyeth-ai-assisted wants to merge 1 commit into NVIDIA:main from
Conversation
WSL2 GPU support:
- Add wsl2-gpu-fix.sh that applies CDI mode, libdxcore.so injection, and node labeling after gateway start (workaround until OpenShell ships native WSL2 support via NVIDIA/OpenShell#411)
- Hook it into both onboard.js (interactive wizard) and setup.sh (legacy script) so it runs automatically after gateway creation
- Writes a complete CDI spec from scratch instead of fragile sed patching of the nvidia-ctk generated spec

Ollama on Linux:
- setup.sh only created the ollama-local provider on macOS (Darwin)
- Now detects ollama on any platform (Linux/WSL2 included)
- Enables local GPU inference via ollama for WSL2 users

Closes NVIDIA/NemoClaw#TBD

See also: NVIDIA/OpenShell#404, NVIDIA/OpenShell#411
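For reference, a hand-written CDI spec of the kind described above has roughly this shape. This is a sketch only: the real script discovers library and driver paths at runtime, and the exact mounts it emits may differ.

```shell
# Write a minimal WSL2-style CDI spec by hand (illustrative paths only).
spec="$(mktemp)"
cat > "$spec" <<'EOF'
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/dxg                       # WSL2 GPU paravirt interface
      mounts:
        - hostPath: /usr/lib/wsl/lib/libdxcore.so
          containerPath: /usr/lib/wsl/lib/libdxcore.so
          options: ["ro", "nosuid", "nodev", "bind"]
EOF
echo "wrote CDI spec to $spec"
```

Writing the full spec avoids the sed-patching failure mode: there is no dependency on the shape of whatever nvidia-ctk happened to generate.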
📝 Walkthrough

The changes add WSL2 GPU support to the system by detecting GPU capability indicators and /dev/dxg device presence, then executing configuration logic to set up the NVIDIA CDI runtime, label Kubernetes nodes, and verify GPU readiness. Additionally, the Ollama local-inference setup was refactored for cross-platform compatibility.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GW as Gateway Startup
    participant Script as wsl2-gpu-fix.sh
    participant Kubectl as kubectl
    participant NvidiaCTK as nvidia-ctk
    participant K8s as Kubernetes
    participant Plugin as nvidia-device-plugin
    GW->>GW: Detect /dev/dxg & nimCapable GPU
    GW->>Script: Execute wsl2-gpu-fix.sh
    Script->>Script: Validate gateway connectivity
    Script->>Script: Confirm WSL2 environment
    Script->>NvidiaCTK: Generate CDI YAML spec for /dev/dxg
    Script->>Script: Write CDI configuration
    Script->>NvidiaCTK: Switch NVIDIA runtime to CDI mode
    Script->>Kubectl: Label node with NVIDIA PCI capability
    K8s->>Kubectl: Update node labels
    Script->>Kubectl: Poll nvidia-device-plugin status (60 iterations)
    Plugin->>Kubectl: Report GPU devices available
    Kubectl-->>Script: GPU ready
    Script->>Script: Log completion & success message
```
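The detection gate at the top of this flow can be sketched as a small shell helper. The `/dev/dxg` path is the real WSL2 convention; the function name and parameter are illustrative only.

```shell
# Hypothetical helper mirroring the detection step: WSL2 exposes the GPU
# through the /dev/dxg paravirtualization device, so its presence is the
# signal to apply the CDI workaround.
is_wsl2_gpu_host() {
  local dxg_dev="${1:-/dev/dxg}"   # parameterized here only for testability
  [ -e "$dxg_dev" ]                # real flow would also probe nvidia-smi
}

if is_wsl2_gpu_host; then
  echo "WSL2 GPU host: applying CDI fix"
fi
```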
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 153-156: The WSL2 GPU fix helper is being run with ignoreError:
true so failures are swallowed; change the call that invokes run with fixScript
(the run(`bash "${fixScript}" nemoclaw`, { ignoreError: true }) usage) to
propagate errors instead of ignoring them—remove or set ignoreError to false and
ensure the caller surfaces a non-zero exit (throw or return error) so onboarding
fails fast when the wsl2-gpu-fix.sh (fixScript) step fails.
In `@scripts/setup.sh`:
- Around line 94-99: The script currently checks executability with the WSL2_FIX
variable using [ -x "$WSL2_FIX" ] which prevents running the helper via bash on
files without execute bits; change the guard to check existence (e.g., [ -f
"$WSL2_FIX" ] or [ -e "$WSL2_FIX" ]) so the block will call bash "$WSL2_FIX"
nemoclaw when the file is present, otherwise emit the same warn message; update
the conditional that references WSL2_FIX accordingly.
In `@wsl2-gpu-fix.sh`:
- Around line 31-34: The script currently uses DXCORE_PATH (and derived
DXCORE_DIR) without validation which can produce blank CDI mounts and still
switch the runtime; update the logic to check that DXCORE_PATH is non-empty (and
readable) right after discovery (the block setting DXCORE_PATH and DXCORE_DIR)
and, if not found, print a clear error including what was searched for and exit
non-zero before any CDI mount generation or runtime change (the code that later
references DXCORE_DIR/DRIVER_DIR and flips to cdi must not run); apply the same
validation/early-exit pattern to the later discovery block around lines 83-86
where DXCORE_PATH/DXCORE_DIR are used so the script fails fast instead of
producing invalid mounts.
- Around line 125-141: The readiness check currently treats only exactly "1" as
ready; update the loop that assigns GPU (variable GPU from the openshell doctor
exec -- kubectl command) to consider any positive integer as ready by testing
numeric value > 0 (e.g., convert GPU to an integer and use a numeric comparison)
instead of string equality to "1", and keep the existing success path that
echoes "GPU ready: nvidia.com/gpu=$GPU" and breaks; ensure the failure branch
remains unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01389603-1c9b-4f76-a589-b0ba23caf2f2
📒 Files selected for processing (3)
- `bin/lib/onboard.js`
- `scripts/setup.sh`
- `wsl2-gpu-fix.sh`
```js
      console.log("  WSL2 detected — applying GPU CDI fixes...");
      const fixScript = path.join(ROOT, "wsl2-gpu-fix.sh");
      if (fs.existsSync(fixScript)) {
        run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
```
Surface failures from the WSL2 fix helper.
The legacy setup path fails fast on this helper, but onboarding ignores a non-zero exit here and keeps going. That makes later WSL2 GPU failures look unrelated to the actual root cause.
Suggested fix

```diff
   if (fs.existsSync(fixScript)) {
-    run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
+    try {
+      run(`bash "${fixScript}" nemoclaw`, { ignoreError: false });
+    } catch {
+      console.log("  Warning: WSL2 GPU fix failed; GPU sandbox creation may fail on WSL2.");
+    }
   } else {
```
```sh
    WSL2_FIX="${REPO_DIR}/wsl2-gpu-fix.sh"
    if [ -x "$WSL2_FIX" ]; then
      bash "$WSL2_FIX" nemoclaw
    else
      warn "wsl2-gpu-fix.sh not found at $WSL2_FIX — GPU sandbox may fail on WSL2"
    fi
```
Don't require the helper to be executable here.
This block already runs the file with bash, so -x is stricter than needed. On WSL2 checkouts from /mnt/c, execute bits are often not preserved, which would skip the workaround on the exact platform this PR targets.
Suggested fix

```diff
-  if [ -x "$WSL2_FIX" ]; then
+  if [ -f "$WSL2_FIX" ]; then
     bash "$WSL2_FIX" nemoclaw
   else
```
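The distinction is easy to demonstrate: a readable file with no execute bit (the common state on a `/mnt/c` drvfs checkout) fails `-x` but still runs fine under `bash`.

```shell
# Simulate a checkout where the execute bit was not preserved.
script=$(mktemp)
printf 'echo hello-from-helper\n' > "$script"
chmod 644 "$script"   # readable, but no execute bit

[ -f "$script" ] && echo "-f guard: would run the helper"
[ -x "$script" ] || echo "-x guard: would skip it"
bash "$script"        # bash reads the file itself; no execute bit needed
```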
```sh
GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)
```
Fail fast when libdxcore.so is not discovered.
DXCORE_PATH is written into the CDI mounts without any validation. If discovery returns empty, the generated spec contains blank mount paths and the script still flips the runtime to cdi, which can leave the gateway in a worse state than before.
Suggested fix

```diff
 GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
 DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
-DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
+if [ -z "$DXCORE_PATH" ]; then
+  echo "Error: libdxcore.so not found inside gateway"
+  exit 1
+fi
+DXCORE_DIR=$(dirname "$DXCORE_PATH")
 DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)
```

Also applies to: 83-86
```sh
echo "[4/4] Waiting for nvidia-device-plugin..."
for i in $(seq 1 60); do
  GPU=$(openshell doctor exec -- kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null || true)
  if [ "$GPU" = "1" ]; then
    echo "GPU ready: nvidia.com/gpu=$GPU"
    break
  fi
  [ "$((i % 10))" = "0" ] && echo "  still waiting ($i/60)..."
  sleep 2
done

if [ "$GPU" != "1" ]; then
  echo "Warning: GPU resource not yet advertised after 120s"
  echo "Checking device plugin pods..."
  openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
  exit 1
fi
```
Treat any positive GPU count as ready.
The success check is hard-coded to "1". On WSL2 hosts that expose 2+ GPUs, this loop will hit the timeout and fail even though nvidia.com/gpu is already advertised.
Suggested fix

```diff
-  if [ "$GPU" = "1" ]; then
+  if [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
     echo "GPU ready: nvidia.com/gpu=$GPU"
     break
   fi
@@
-if [ "$GPU" != "1" ]; then
+if ! [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
   echo "Warning: GPU resource not yet advertised after 120s"
   echo "Checking device plugin pods..."
   openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
   exit 1
 fi
```
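The `^[1-9][0-9]*$` pattern in this suggestion accepts any positive integer count while still rejecting empty output, zero, and non-numeric values, which a quick check confirms:

```shell
# Wrap the suggested readiness test in a helper for illustration.
is_ready() { [[ "$1" =~ ^[1-9][0-9]*$ ]]; }

for v in 1 2 16; do is_ready "$v" && echo "ready: $v"; done
for v in "" 0 abc; do is_ready "$v" || echo "not ready: '$v'"; done
```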
Not sure this is warranted; the upstream openshell fix will hopefully solve this at the root, but it unblocked me (@tyeth), so here it is.
Summary
- Adds `wsl2-gpu-fix.sh`, which auto-configures CDI-based GPU injection on WSL2 after gateway creation
- Hooks it into `onboard.js` (interactive wizard) and `setup.sh` (legacy script)

Problem
On WSL2, `nemoclaw onboard` with GPU fails because:
- GPU injection is broken (the generated CDI spec omits `libdxcore.so`, and NFD can't see the NVIDIA PCI device)
- `setup.sh` only created the ollama provider on macOS (`uname -s = Darwin` check), so Linux/WSL2 users can't use local inference via ollama

Full root cause analysis: NVIDIA/OpenShell#404
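The platform-gate change can be sketched like this (`setup_ollama_provider` is a stand-in name for illustration, not the script's actual function):

```shell
setup_ollama_provider() { echo "provider: ollama-local"; }  # stub for illustration

# Before: provider creation was gated on macOS only
if [ "$(uname -s)" = "Darwin" ]; then
  setup_ollama_provider
fi

# After: gate on the ollama binary itself, so Linux/WSL2 qualify too
if command -v ollama >/dev/null 2>&1; then
  setup_ollama_provider
fi
```

Gating on `command -v ollama` makes the check follow the actual capability rather than the OS name.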
Changes
`wsl2-gpu-fix.sh` (new)

Runs after gateway start on WSL2. Writes a complete CDI spec with:
- `/dev/dxg` device node (WSL2's GPU interface)
- `libdxcore.so` mount (an nvidia-ctk bug omits it — `nvidia-ctk cdi generate`: libdxcore.so not found on WSL2 despite being present, nvidia-container-toolkit#1739)

It then switches the NVIDIA runtime from `auto` to `cdi` mode and labels the node with `pci-10de.present=true` (NFD can't see NVIDIA PCI on WSL2).

`bin/lib/onboard.js`

After the gateway health check, detects WSL2 (`/dev/dxg`) and runs `wsl2-gpu-fix.sh`.

`scripts/setup.sh`

Runs `wsl2-gpu-fix.sh` after gateway creation and creates the ollama provider on any platform where ollama is detected (previously macOS-only).

Testing
Tested end-to-end on:
- `nemoclaw onboard` -> gateway with GPU -> WSL2 fix auto-applied -> sandbox created -> local inference via ollama nemotron 70B working

Related
Agent Investigation
Diagnosed and tested using `openshell doctor` commands. Iteratively debugged CDI spec generation, NVML init failures, and pod runtime errors.

🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes