fix(gpu): add WSL2 GPU support and ollama provider on Linux #254

tyeth-ai-assisted wants to merge 1 commit into NVIDIA:main from
Conversation
WSL2 GPU support:
- Add wsl2-gpu-fix.sh that applies CDI mode, libdxcore.so injection, and node labeling after gateway start (workaround until OpenShell ships native WSL2 support via NVIDIA/OpenShell#411)
- Hook it into both onboard.js (interactive wizard) and setup.sh (legacy script) so it runs automatically after gateway creation
- Writes a complete CDI spec from scratch instead of fragile sed patching of the nvidia-ctk generated spec

Ollama on Linux:
- setup.sh only created the ollama-local provider on macOS (Darwin)
- Now detects ollama on any platform (Linux/WSL2 included)
- Enables local GPU inference via ollama for WSL2 users

Closes NVIDIA/NemoClaw#TBD

See also: NVIDIA/OpenShell#404, NVIDIA/OpenShell#411
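For reference, a hand-written CDI spec of the kind described above has roughly this shape. This is a sketch only: the real script discovers library and driver paths at runtime, and the exact mounts it emits may differ.

```shell
# Write a minimal WSL2-style CDI spec by hand (illustrative paths only).
spec="$(mktemp)"
cat > "$spec" <<'EOF'
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/dxg                       # WSL2 GPU paravirt interface
      mounts:
        - hostPath: /usr/lib/wsl/lib/libdxcore.so
          containerPath: /usr/lib/wsl/lib/libdxcore.so
          options: ["ro", "nosuid", "nodev", "bind"]
EOF
echo "wrote CDI spec to $spec"
```

Writing the full spec avoids the sed-patching failure mode: there is no dependency on the shape of whatever nvidia-ctk happened to generate.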
📝 Walkthrough

The changes add WSL2 GPU support to the system by detecting GPU capability indicators and /dev/dxg device presence, then executing configuration logic to set up the NVIDIA CDI runtime, label Kubernetes nodes, and verify GPU readiness. Additionally, the Ollama local-inference setup was refactored for cross-platform compatibility.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GW as Gateway Startup
    participant Script as wsl2-gpu-fix.sh
    participant Kubectl as kubectl
    participant NvidiaCTK as nvidia-ctk
    participant K8s as Kubernetes
    participant Plugin as nvidia-device-plugin
    GW->>GW: Detect /dev/dxg & nimCapable GPU
    GW->>Script: Execute wsl2-gpu-fix.sh
    Script->>Script: Validate gateway connectivity
    Script->>Script: Confirm WSL2 environment
    Script->>NvidiaCTK: Generate CDI YAML spec for /dev/dxg
    Script->>Script: Write CDI configuration
    Script->>NvidiaCTK: Switch NVIDIA runtime to CDI mode
    Script->>Kubectl: Label node with NVIDIA PCI capability
    K8s->>Kubectl: Update node labels
    Script->>Kubectl: Poll nvidia-device-plugin status (60 iterations)
    Plugin->>Kubectl: Report GPU devices available
    Kubectl-->>Script: GPU ready
    Script->>Script: Log completion & success message
```
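The detection gate at the top of this flow can be sketched as a small shell helper. The `/dev/dxg` path is the real WSL2 convention; the function name and parameter are illustrative only.

```shell
# Hypothetical helper mirroring the detection step: WSL2 exposes the GPU
# through the /dev/dxg paravirtualization device, so its presence is the
# signal to apply the CDI workaround.
is_wsl2_gpu_host() {
  local dxg_dev="${1:-/dev/dxg}"   # parameterized here only for testability
  [ -e "$dxg_dev" ]                # real flow would also probe nvidia-smi
}

if is_wsl2_gpu_host; then
  echo "WSL2 GPU host: applying CDI fix"
fi
```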
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 153-156: The WSL2 GPU fix helper is being run with ignoreError:
true so failures are swallowed; change the call that invokes run with fixScript
(the run(`bash "${fixScript}" nemoclaw`, { ignoreError: true }) usage) to
propagate errors instead of ignoring them—remove or set ignoreError to false and
ensure the caller surfaces a non-zero exit (throw or return error) so onboarding
fails fast when the wsl2-gpu-fix.sh (fixScript) step fails.
In `@scripts/setup.sh`:
- Around line 94-99: The script currently checks executability with the WSL2_FIX
variable using [ -x "$WSL2_FIX" ] which prevents running the helper via bash on
files without execute bits; change the guard to check existence (e.g., [ -f
"$WSL2_FIX" ] or [ -e "$WSL2_FIX" ]) so the block will call bash "$WSL2_FIX"
nemoclaw when the file is present, otherwise emit the same warn message; update
the conditional that references WSL2_FIX accordingly.
In `@wsl2-gpu-fix.sh`:
- Around line 31-34: The script currently uses DXCORE_PATH (and derived
DXCORE_DIR) without validation which can produce blank CDI mounts and still
switch the runtime; update the logic to check that DXCORE_PATH is non-empty (and
readable) right after discovery (the block setting DXCORE_PATH and DXCORE_DIR)
and, if not found, print a clear error including what was searched for and exit
non-zero before any CDI mount generation or runtime change (the code that later
references DXCORE_DIR/DRIVER_DIR and flips to cdi must not run); apply the same
validation/early-exit pattern to the later discovery block around lines 83-86
where DXCORE_PATH/DXCORE_DIR are used so the script fails fast instead of
producing invalid mounts.
- Around line 125-141: The readiness check currently treats only exactly "1" as
ready; update the loop that assigns GPU (variable GPU from the openshell doctor
exec -- kubectl command) to consider any positive integer as ready by testing
numeric value > 0 (e.g., convert GPU to an integer and use a numeric comparison)
instead of string equality to "1", and keep the existing success path that
echoes "GPU ready: nvidia.com/gpu=$GPU" and breaks; ensure the failure branch
remains unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01389603-1c9b-4f76-a589-b0ba23caf2f2
📒 Files selected for processing (3)
- `bin/lib/onboard.js`
- `scripts/setup.sh`
- `wsl2-gpu-fix.sh`
```js
      console.log("  WSL2 detected — applying GPU CDI fixes...");
      const fixScript = path.join(ROOT, "wsl2-gpu-fix.sh");
      if (fs.existsSync(fixScript)) {
        run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
```
Surface failures from the WSL2 fix helper.
The legacy setup path fails fast on this helper, but onboarding ignores a non-zero exit here and keeps going. That makes later WSL2 GPU failures look unrelated to the actual root cause.
Suggested fix

```diff
   if (fs.existsSync(fixScript)) {
-    run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
+    try {
+      run(`bash "${fixScript}" nemoclaw`, { ignoreError: false });
+    } catch {
+      console.log("  Warning: WSL2 GPU fix failed; GPU sandbox creation may fail on WSL2.");
+    }
   } else {
```
```sh
    WSL2_FIX="${REPO_DIR}/wsl2-gpu-fix.sh"
    if [ -x "$WSL2_FIX" ]; then
      bash "$WSL2_FIX" nemoclaw
    else
      warn "wsl2-gpu-fix.sh not found at $WSL2_FIX — GPU sandbox may fail on WSL2"
    fi
```
Don't require the helper to be executable here.
This block already runs the file with bash, so -x is stricter than needed. On WSL2 checkouts from /mnt/c, execute bits are often not preserved, which would skip the workaround on the exact platform this PR targets.
Suggested fix

```diff
-  if [ -x "$WSL2_FIX" ]; then
+  if [ -f "$WSL2_FIX" ]; then
     bash "$WSL2_FIX" nemoclaw
   else
```
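The distinction is easy to demonstrate: a readable file with no execute bit (the common state on a `/mnt/c` drvfs checkout) fails `-x` but still runs fine under `bash`.

```shell
# Simulate a checkout where the execute bit was not preserved.
script=$(mktemp)
printf 'echo hello-from-helper\n' > "$script"
chmod 644 "$script"   # readable, but no execute bit

[ -f "$script" ] && echo "-f guard: would run the helper"
[ -x "$script" ] || echo "-x guard: would skip it"
bash "$script"        # bash reads the file itself; no execute bit needed
```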
```sh
GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)
```
Fail fast when libdxcore.so is not discovered.
DXCORE_PATH is written into the CDI mounts without any validation. If discovery returns empty, the generated spec contains blank mount paths and the script still flips the runtime to cdi, which can leave the gateway in a worse state than before.
Suggested fix

```diff
 GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
 DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
-DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
+if [ -z "$DXCORE_PATH" ]; then
+  echo "Error: libdxcore.so not found inside gateway"
+  exit 1
+fi
+DXCORE_DIR=$(dirname "$DXCORE_PATH")
 DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)
```

Also applies to: 83-86
```sh
echo "[4/4] Waiting for nvidia-device-plugin..."
for i in $(seq 1 60); do
  GPU=$(openshell doctor exec -- kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null || true)
  if [ "$GPU" = "1" ]; then
    echo "GPU ready: nvidia.com/gpu=$GPU"
    break
  fi
  [ "$((i % 10))" = "0" ] && echo "  still waiting ($i/60)..."
  sleep 2
done

if [ "$GPU" != "1" ]; then
  echo "Warning: GPU resource not yet advertised after 120s"
  echo "Checking device plugin pods..."
  openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
  exit 1
fi
```
Treat any positive GPU count as ready.
The success check is hard-coded to "1". On WSL2 hosts that expose 2+ GPUs, this loop will hit the timeout and fail even though nvidia.com/gpu is already advertised.
Suggested fix

```diff
-  if [ "$GPU" = "1" ]; then
+  if [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
     echo "GPU ready: nvidia.com/gpu=$GPU"
     break
   fi
@@
-if [ "$GPU" != "1" ]; then
+if ! [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
   echo "Warning: GPU resource not yet advertised after 120s"
   echo "Checking device plugin pods..."
   openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
   exit 1
 fi
```
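The `^[1-9][0-9]*$` pattern in this suggestion accepts any positive integer count while still rejecting empty output, zero, and non-numeric values, which a quick check confirms:

```shell
# Wrap the suggested readiness test in a helper for illustration.
is_ready() { [[ "$1" =~ ^[1-9][0-9]*$ ]]; }

for v in 1 2 16; do is_ready "$v" && echo "ready: $v"; done
for v in "" 0 abc; do is_ready "$v" || echo "not ready: '$v'"; done
```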
Not sure this is warranted; the upstream openshell fix will hopefully solve this at the root, but it unblocked me (@tyeth), so here it is.
Summary
- Adds `wsl2-gpu-fix.sh`, which auto-configures CDI-based GPU injection on WSL2 after gateway creation
- Hooks it into `onboard.js` (interactive wizard) and `setup.sh` (legacy script)

Problem
On WSL2, `nemoclaw onboard` with GPU fails because:
- GPU injection is broken (the generated CDI spec omits `libdxcore.so`, and NFD can't see the NVIDIA PCI device)
- `setup.sh` only created the ollama provider on macOS (`uname -s = Darwin` check), so Linux/WSL2 users can't use local inference via ollama

Full root cause analysis: NVIDIA/OpenShell#404
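The platform-gate change can be sketched like this (`setup_ollama_provider` is a stand-in name for illustration, not the script's actual function):

```shell
setup_ollama_provider() { echo "provider: ollama-local"; }  # stub for illustration

# Before: provider creation was gated on macOS only
if [ "$(uname -s)" = "Darwin" ]; then
  setup_ollama_provider
fi

# After: gate on the ollama binary itself, so Linux/WSL2 qualify too
if command -v ollama >/dev/null 2>&1; then
  setup_ollama_provider
fi
```

Gating on `command -v ollama` makes the check follow the actual capability rather than the OS name.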
Changes
`wsl2-gpu-fix.sh` (new)

Runs after gateway start on WSL2. Writes a complete CDI spec with:
- `/dev/dxg` device node (WSL2's GPU interface)
- `libdxcore.so` mount (an nvidia-ctk bug omits it — `nvidia-ctk cdi generate`: libdxcore.so not found on WSL2 despite being present, nvidia-container-toolkit#1739)

It then switches the NVIDIA runtime from `auto` to `cdi` mode and labels the node with `pci-10de.present=true` (NFD can't see NVIDIA PCI on WSL2).

`bin/lib/onboard.js`

After the gateway health check, detects WSL2 (`/dev/dxg`) and runs `wsl2-gpu-fix.sh`.

`scripts/setup.sh`

Runs `wsl2-gpu-fix.sh` after gateway creation and creates the ollama provider on any platform where ollama is detected (previously macOS-only).

Testing
Tested end-to-end on:
- `nemoclaw onboard` -> gateway with GPU -> WSL2 fix auto-applied -> sandbox created -> local inference via ollama nemotron 70B working

Related
Agent Investigation
Diagnosed and tested using `openshell doctor` commands. Iteratively debugged CDI spec generation, NVML init failures, and pod runtime errors.

🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes