fix(gpu): add WSL2 GPU support and ollama provider on Linux #254

Draft
tyeth-ai-assisted wants to merge 1 commit into NVIDIA:main from tyeth-ai-assisted:fix/wsl2-gpu-support

@tyeth-ai-assisted tyeth-ai-assisted commented Mar 17, 2026

Not sure this is warranted; the upstream OpenShell fix will hopefully solve this at the root, but this unblocked me (@tyeth), so here it is.

Summary

  • Add wsl2-gpu-fix.sh that auto-configures CDI-based GPU injection on WSL2 after gateway creation
  • Hook it into both onboard.js (interactive wizard) and setup.sh (legacy script)
  • Fix ollama provider creation to work on Linux, not just macOS

Problem

On WSL2, nemoclaw onboard with GPU fails because:

  1. The OpenShell gateway's nvidia-device-plugin can't detect GPUs (NVML fails without libdxcore.so, NFD can't see PCI)
  2. The ollama-local provider is only created on macOS (uname -s = Darwin check), so Linux/WSL2 users can't use local inference via ollama

Full root cause analysis: NVIDIA/OpenShell#404

Changes

wsl2-gpu-fix.sh (new)

Runs after gateway start on WSL2. Writes a complete CDI spec with:

bin/lib/onboard.js

After gateway health check, detects WSL2 (/dev/dxg) and runs wsl2-gpu-fix.sh.
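The detection step can be sketched in shell, roughly as setup.sh would need to do it. This is a minimal sketch, not the PR's code: `is_wsl2_gpu_host` is a hypothetical helper name, and the device path is a parameter only so the check can be tried on machines that are not WSL2.

```shell
# Sketch only: is_wsl2_gpu_host is a hypothetical helper, not taken from the PR.
# The device path defaults to /dev/dxg, the indicator the PR checks for.
is_wsl2_gpu_host() {
  dxg_path="${1:-/dev/dxg}"
  [ -e "$dxg_path" ]
}

if is_wsl2_gpu_host; then
  echo "WSL2 detected: /dev/dxg is present"
else
  echo "/dev/dxg is missing: skipping the WSL2 GPU fix"
fi
```

Checking for the device node rather than parsing the kernel version string keeps the test tied to what the fix actually needs, which is GPU paravirtualization being exposed.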

scripts/setup.sh

  • Same WSL2 fix hook after gateway start
  • Ollama provider creation now works on any platform where ollama is installed or running
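The "installed or running" condition could be checked along these lines. This is a sketch under assumptions, not the code in setup.sh: `ollama_available` is a hypothetical helper, and the URL assumes Ollama's default listen address of http://localhost:11434.

```shell
# Sketch only: ollama_available is a hypothetical helper, not the code in setup.sh.
# Assumes Ollama's default listen address, http://localhost:11434.
ollama_available() {
  ollama_url="${1:-http://localhost:11434}"
  # Installed: the CLI is on PATH.
  if command -v ollama >/dev/null 2>&1; then
    return 0
  fi
  # Running: something answers on the Ollama port (requires curl).
  if command -v curl >/dev/null 2>&1; then
    curl -sf --max-time 2 "$ollama_url" >/dev/null 2>&1
    return $?
  fi
  return 1
}

if ollama_available; then
  echo "ollama detected on $(uname -s); the ollama-local provider can be created"
else
  echo "ollama not detected; skipping the ollama-local provider"
fi
```

Nothing in the check depends on `uname -s`, which is the point of the change: the platform name is only reported, not used as a gate.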

Testing

Tested end-to-end on:

  • Hardware: Framework 16, AMD Ryzen AI 7 350, NVIDIA RTX 5070 (8GB VRAM), 96GB DDR5
  • OS: WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
  • Flow: nemoclaw onboard -> gateway with GPU -> WSL2 fix auto-applied -> sandbox created -> local inference via ollama nemotron 70B working

Related

Agent Investigation

Diagnosed and tested using openshell doctor commands. Iteratively debugged CDI spec generation, NVML init failures, and pod runtime errors.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added WSL2 GPU acceleration support with automatic NVIDIA GPU detection and configuration for WSL2 environments.
    • Extended Ollama local inference setup to work cross-platform with improved service startup handling.
  • Bug Fixes

    • Improved GPU detection and workarounds for WSL2 environment compatibility issues.

WSL2 GPU support:
- Add wsl2-gpu-fix.sh that applies CDI mode, libdxcore.so injection,
  and node labeling after gateway start (workaround until OpenShell
  ships native WSL2 support via NVIDIA/OpenShell#411)
- Hook it into both onboard.js (interactive wizard) and setup.sh
  (legacy script) so it runs automatically after gateway creation
- Writes a complete CDI spec from scratch instead of fragile sed
  patching of the nvidia-ctk generated spec
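
For illustration, writing a CDI spec from scratch could look roughly like the sketch below. It is not the spec that wsl2-gpu-fix.sh writes: `write_cdi_spec`, its arguments, the `cdiVersion` value, the device name, and the mount options are all assumptions chosen to show the shape of a CDI document with a /dev/dxg device node and libdxcore.so and driver mounts.

```shell
# Sketch only: write_cdi_spec and its arguments are hypothetical, and the
# field values are illustrative rather than copied from wsl2-gpu-fix.sh.
write_cdi_spec() {
  dxcore_path="$1"   # path of the discovered libdxcore.so
  driver_dir="$2"    # discovered driver directory under /usr/lib/wsl/drivers
  cat <<EOF
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/dxg
      mounts:
        - hostPath: ${dxcore_path}
          containerPath: ${dxcore_path}
          options: [ro, nosuid, nodev, bind]
        - hostPath: ${driver_dir}
          containerPath: ${driver_dir}
          options: [ro, nosuid, nodev, bind]
EOF
}

# Placeholder paths for demonstration only.
write_cdi_spec /usr/lib/x86_64-linux-gnu/libdxcore.so /usr/lib/wsl/drivers/example
```

Generating the whole document from known inputs means an empty input shows up as an obviously broken line, instead of a sed substitution that silently matches nothing.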

Ollama on Linux:
- setup.sh only created the ollama-local provider on macOS (Darwin)
- Now detects ollama on any platform (Linux/WSL2 included)
- Enables local GPU inference via ollama for WSL2 users

Closes NVIDIA/NemoClaw#TBD
See also: NVIDIA/OpenShell#404, NVIDIA/OpenShell#411

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

The changes add WSL2 GPU support to the system by detecting GPU capability indicators and /dev/dxg device presence, then executing configuration logic to set up NVIDIA CDI runtime, label Kubernetes nodes, and verify GPU readiness. Additionally, the Ollama local-inference setup was refactored for cross-platform compatibility.

Changes

| Cohort / File(s) | Summary |
|---|---|
| WSL2 GPU Fix Implementation: wsl2-gpu-fix.sh | New bash script handling WSL2 GPU detection, CDI specification generation, NVIDIA runtime configuration, Kubernetes node labeling, and GPU device readiness polling with timeout and diagnostic error handling. |
| Startup Flow Integration: bin/lib/onboard.js, scripts/setup.sh | Added conditional WSL2 GPU fix invocation in gateway startup; introduced WSL2 device detection and script execution with fallback warning handling; refactored Ollama setup for cross-platform operation with conditional service startup. |

Sequence Diagram(s)

sequenceDiagram
    participant GW as Gateway Startup
    participant Script as wsl2-gpu-fix.sh
    participant Kubectl as kubectl
    participant NvidiaCTK as nvidia-ctk
    participant K8s as Kubernetes
    participant Plugin as nvidia-device-plugin

    GW->>GW: Detect /dev/dxg & nimCapable GPU
    GW->>Script: Execute wsl2-gpu-fix.sh
    Script->>Script: Validate gateway connectivity
    Script->>Script: Confirm WSL2 environment
    Script->>NvidiaCTK: Generate CDI YAML spec for /dev/dxg
    Script->>Script: Write CDI configuration
    Script->>NvidiaCTK: Switch NVIDIA runtime to CDI mode
    Script->>Kubectl: Label node with NVIDIA PCI capability
    K8s->>Kubectl: Update node labels
    Script->>Kubectl: Poll nvidia-device-plugin status (60 iterations)
    Plugin->>Kubectl: Report GPU devices available
    Kubectl-->>Script: GPU ready
    Script->>Script: Log completion & success message
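
The polling step in the diagram (60 iterations) is a bounded retry loop. A generic version can be sketched as follows; `poll_until` is a hypothetical helper and not part of wsl2-gpu-fix.sh, and the readiness check here is a stand-in that succeeds on its third call rather than a real kubectl query.

```shell
# Sketch only: poll_until is a hypothetical helper, not part of wsl2-gpu-fix.sh.
# Usage: poll_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
poll_until() {
  poll_attempts="$1"
  poll_delay="$2"
  shift 2
  poll_i=1
  while [ "$poll_i" -le "$poll_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Progress line every 10 attempts, like the loop in the PR.
    if [ $((poll_i % 10)) -eq 0 ]; then
      echo "  still waiting ($poll_i/$poll_attempts)..."
    fi
    sleep "$poll_delay"
    poll_i=$((poll_i + 1))
  done
  return 1
}

# Stand-in readiness check: succeeds on the third call.
attempts_file=$(mktemp)
ready_on_third_try() {
  echo x >> "$attempts_file"
  [ $(( $(wc -l < "$attempts_file") )) -ge 3 ]
}

if poll_until 60 0 ready_on_third_try; then
  echo "ready"
else
  echo "timed out"
fi
rm -f "$attempts_file"
```

With a delay of 2 seconds instead of 0, 60 attempts give the 120 second budget that the script's warning message refers to.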

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰✨ In WSL's valleys where GPUs hide,
A rabbit traced paths with NVIDIA pride,
CDI specs and labels, all set in place,
Now NVIDIA drivers give compute a grace! 🚀🎮

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: adding WSL2 GPU support and enabling the ollama provider on Linux, matching the core objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 153-156: The WSL2 GPU fix helper is being run with ignoreError:
true so failures are swallowed; change the call that invokes run with fixScript
(the run(`bash "${fixScript}" nemoclaw`, { ignoreError: true }) usage) to
propagate errors instead of ignoring them—remove or set ignoreError to false and
ensure the caller surfaces a non-zero exit (throw or return error) so onboarding
fails fast when the wsl2-gpu-fix.sh (fixScript) step fails.

In `@scripts/setup.sh`:
- Around line 94-99: The script currently checks executability with the WSL2_FIX
variable using [ -x "$WSL2_FIX" ] which prevents running the helper via bash on
files without execute bits; change the guard to check existence (e.g., [ -f
"$WSL2_FIX" ] or [ -e "$WSL2_FIX" ]) so the block will call bash "$WSL2_FIX"
nemoclaw when the file is present, otherwise emit the same warn message; update
the conditional that references WSL2_FIX accordingly.

In `@wsl2-gpu-fix.sh`:
- Around line 31-34: The script currently uses DXCORE_PATH (and derived
DXCORE_DIR) without validation which can produce blank CDI mounts and still
switch the runtime; update the logic to check that DXCORE_PATH is non-empty (and
readable) right after discovery (the block setting DXCORE_PATH and DXCORE_DIR)
and, if not found, print a clear error including what was searched for and exit
non-zero before any CDI mount generation or runtime change (the code that later
references DXCORE_DIR/DRIVER_DIR and flips to cdi must not run); apply the same
validation/early-exit pattern to the later discovery block around lines 83-86
where DXCORE_PATH/DXCORE_DIR are used so the script fails fast instead of
producing invalid mounts.
- Around line 125-141: The readiness check currently treats only exactly "1" as
ready; update the loop that assigns GPU (variable GPU from the openshell doctor
exec -- kubectl command) to consider any positive integer as ready by testing
numeric value > 0 (e.g., convert GPU to an integer and use a numeric comparison)
instead of string equality to "1", and keep the existing success path that
echoes "GPU ready: nvidia.com/gpu=$GPU" and breaks; ensure the failure branch
remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01389603-1c9b-4f76-a589-b0ba23caf2f2

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9afbc and 3322c66.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • scripts/setup.sh
  • wsl2-gpu-fix.sh

Comment on lines +153 to +156
console.log(" WSL2 detected — applying GPU CDI fixes...");
const fixScript = path.join(ROOT, "wsl2-gpu-fix.sh");
if (fs.existsSync(fixScript)) {
  run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
⚠️ Potential issue | 🟠 Major

Surface failures from the WSL2 fix helper.

The legacy setup path fails fast on this helper, but onboarding ignores a non-zero exit here and keeps going. That makes later WSL2 GPU failures look unrelated to the actual root cause.

Suggested fix
     if (fs.existsSync(fixScript)) {
-      run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
+      try {
+        run(`bash "${fixScript}" nemoclaw`, { ignoreError: false });
+      } catch {
+        console.log("  Warning: WSL2 GPU fix failed; GPU sandbox creation may fail on WSL2.");
+      }
     } else {
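
The two policies in play here, failing fast versus warning and continuing, can be compared in a small shell sketch. `run_step` is a hypothetical wrapper, not the `run()` helper from onboard.js, and `false` stands in for a failing wsl2-gpu-fix.sh.

```shell
# Sketch only: run_step is a hypothetical wrapper, not the run() helper in onboard.js.
# Usage: run_step strict|lenient COMMAND [ARGS...]
run_step() {
  step_policy="$1"
  shift
  step_status=0
  "$@" || step_status=$?
  if [ "$step_status" -eq 0 ]; then
    return 0
  fi
  echo "Warning: '$*' failed with exit code $step_status" >&2
  if [ "$step_policy" = "strict" ]; then
    return "$step_status"   # fail fast: the caller sees the failure
  fi
  return 0                  # lenient: warn, then keep going
}

run_step lenient false && echo "lenient: onboarding continues after a warning"
run_step strict false || echo "strict: onboarding stops here"
```

Either way the failure is printed at the point where it happens, which is the part the review comment asks for; whether onboarding should then stop is a separate decision.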
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-console.log(" WSL2 detected — applying GPU CDI fixes...");
-const fixScript = path.join(ROOT, "wsl2-gpu-fix.sh");
-if (fs.existsSync(fixScript)) {
-  run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
+console.log(" WSL2 detected — applying GPU CDI fixes...");
+const fixScript = path.join(ROOT, "wsl2-gpu-fix.sh");
+if (fs.existsSync(fixScript)) {
+  try {
+    run(`bash "${fixScript}" nemoclaw`, { ignoreError: false });
+  } catch {
+    console.log(" Warning: WSL2 GPU fix failed; GPU sandbox creation may fail on WSL2.");
+  }
Comment on lines +94 to +99
WSL2_FIX="${REPO_DIR}/wsl2-gpu-fix.sh"
if [ -x "$WSL2_FIX" ]; then
  bash "$WSL2_FIX" nemoclaw
else
  warn "wsl2-gpu-fix.sh not found at $WSL2_FIX — GPU sandbox may fail on WSL2"
fi
⚠️ Potential issue | 🟠 Major

Don't require the helper to be executable here.

This block already runs the file with bash, so -x is stricter than needed. On WSL2 checkouts from /mnt/c, execute bits are often not preserved, which would skip the workaround on the exact platform this PR targets.

Suggested fix
-  if [ -x "$WSL2_FIX" ]; then
+  if [ -f "$WSL2_FIX" ]; then
     bash "$WSL2_FIX" nemoclaw
   else
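
The difference between the two guards can be tried locally with a helper file that has no execute bits. This is a sketch only: `run_helper_if_present` is a hypothetical name, and the temporary file stands in for wsl2-gpu-fix.sh.

```shell
# Sketch only: run_helper_if_present is a hypothetical name; the guard mirrors
# the suggested [ -f ... ] check.
run_helper_if_present() {
  helper="$1"
  shift
  if [ -f "$helper" ]; then
    bash "$helper" "$@"
  else
    echo "helper not found at $helper" >&2
    return 1
  fi
}

# A stand-in helper without execute bits.
demo_helper=$(mktemp)
echo 'echo "helper ran with: $1"' > "$demo_helper"
chmod a-x "$demo_helper"

if [ -x "$demo_helper" ]; then
  echo "-x guard: would run the helper"
else
  echo "-x guard: would skip the helper"
fi
run_helper_if_present "$demo_helper" nemoclaw
rm -f "$demo_helper"
```

Because the file is passed to `bash` as an argument, the interpreter only needs to read it, so the execute bit plays no part in whether the helper can run.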
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-WSL2_FIX="${REPO_DIR}/wsl2-gpu-fix.sh"
-if [ -x "$WSL2_FIX" ]; then
-  bash "$WSL2_FIX" nemoclaw
-else
-  warn "wsl2-gpu-fix.sh not found at $WSL2_FIX — GPU sandbox may fail on WSL2"
-fi
+WSL2_FIX="${REPO_DIR}/wsl2-gpu-fix.sh"
+if [ -f "$WSL2_FIX" ]; then
+  bash "$WSL2_FIX" nemoclaw
+else
+  warn "wsl2-gpu-fix.sh not found at $WSL2_FIX — GPU sandbox may fail on WSL2"
+fi
Comment on lines +31 to +34
GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)
⚠️ Potential issue | 🔴 Critical

Fail fast when libdxcore.so is not discovered.

DXCORE_PATH is written into the CDI mounts without any validation. If discovery returns empty, the generated spec contains blank mount paths and the script still flips the runtime to cdi, which can leave the gateway in a worse state than before.

Suggested fix
 GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
 DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
-DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
+if [ -z "$DXCORE_PATH" ]; then
+    echo "Error: libdxcore.so not found inside gateway"
+    exit 1
+fi
+DXCORE_DIR=$(dirname "$DXCORE_PATH")
 DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)

Also applies to: 83-86
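
A discovery step that refuses to continue on an empty result might look like the sketch below. `find_dxcore` is a hypothetical helper, and the search root is a parameter only so the check can be tried outside a WSL2 gateway. The sketch also prints what `dirname` returns for an empty string: it typically prints `.` and exits successfully, so the `|| echo` fallback in the original line would not be reached.

```shell
# Sketch only: find_dxcore is a hypothetical helper, not code from wsl2-gpu-fix.sh.
find_dxcore() {
  search_root="${1:-/usr/lib}"
  found=$(find "$search_root" -name "libdxcore.so" 2>/dev/null | head -n 1)
  if [ -z "$found" ]; then
    echo "Error: libdxcore.so not found under $search_root" >&2
    return 1
  fi
  printf '%s\n' "$found"
}

# dirname does not fail on empty input, so a "|| fallback" after it never runs.
echo "dirname of empty string: '$(dirname "")'"

if DXCORE_PATH=$(find_dxcore); then
  DXCORE_DIR=$(dirname "$DXCORE_PATH")
  echo "libdxcore.so directory: $DXCORE_DIR"
else
  echo "stopping before any CDI spec is written or the runtime mode is changed"
fi
```

In the real script the `else` branch would be an `exit 1`, placed before the spec is generated.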

Comment on lines +125 to +141
echo "[4/4] Waiting for nvidia-device-plugin..."
for i in $(seq 1 60); do
    GPU=$(openshell doctor exec -- kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null || true)
    if [ "$GPU" = "1" ]; then
        echo "GPU ready: nvidia.com/gpu=$GPU"
        break
    fi
    [ "$((i % 10))" = "0" ] && echo " still waiting ($i/60)..."
    sleep 2
done

if [ "$GPU" != "1" ]; then
    echo "Warning: GPU resource not yet advertised after 120s"
    echo "Checking device plugin pods..."
    openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
    exit 1
fi
⚠️ Potential issue | 🟠 Major

Treat any positive GPU count as ready.

The success check is hard-coded to "1". On WSL2 hosts that expose 2+ GPUs, this loop will hit the timeout and fail even though nvidia.com/gpu is already advertised.

Suggested fix
-    if [ "$GPU" = "1" ]; then
+    if [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
         echo "GPU ready: nvidia.com/gpu=$GPU"
         break
     fi
@@
-if [ "$GPU" != "1" ]; then
+if ! [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
     echo "Warning: GPU resource not yet advertised after 120s"
     echo "Checking device plugin pods..."
     openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
     exit 1
 fi
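
The readiness test can be isolated and tried against a few values. The sketch below uses a portable `case` pattern in place of the bash-only `[[ =~ ]]` from the suggestion; it is meant to accept the same strings as `^[1-9][0-9]*$`. `gpu_count_ready` is a hypothetical helper name.

```shell
# Sketch only: gpu_count_ready is a hypothetical helper. It accepts the same
# strings as the suggested regex ^[1-9][0-9]*$ using a portable case pattern.
gpu_count_ready() {
  case "$1" in
    ''|0*|*[!0-9]*) return 1 ;;  # empty, leading zero, or any non-digit
    *) return 0 ;;               # positive integer
  esac
}

for value in 1 2 16 0 "" "<none>"; do
  if gpu_count_ready "$value"; then
    echo "'$value' -> ready"
  else
    echo "'$value' -> not ready"
  fi
done
```

An empty value is the case to watch: the jsonpath query prints nothing until the device plugin advertises the resource, so the check has to reject it instead of treating it as zero.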
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-echo "[4/4] Waiting for nvidia-device-plugin..."
-for i in $(seq 1 60); do
-    GPU=$(openshell doctor exec -- kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null || true)
-    if [ "$GPU" = "1" ]; then
-        echo "GPU ready: nvidia.com/gpu=$GPU"
-        break
-    fi
-    [ "$((i % 10))" = "0" ] && echo " still waiting ($i/60)..."
-    sleep 2
-done
-if [ "$GPU" != "1" ]; then
-    echo "Warning: GPU resource not yet advertised after 120s"
-    echo "Checking device plugin pods..."
-    openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
-    exit 1
-fi
+echo "[4/4] Waiting for nvidia-device-plugin..."
+for i in $(seq 1 60); do
+    GPU=$(openshell doctor exec -- kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null || true)
+    if [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
+        echo "GPU ready: nvidia.com/gpu=$GPU"
+        break
+    fi
+    [ "$((i % 10))" = "0" ] && echo " still waiting ($i/60)..."
+    sleep 2
+done
+if ! [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
+    echo "Warning: GPU resource not yet advertised after 120s"
+    echo "Checking device plugin pods..."
+    openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
+    exit 1
+fi
@wscurran wscurran added the Platform: Windows/WSL Support for Windows Subsystem for Linux label Mar 18, 2026

3 participants