docs(spark): add local Ollama inference setup section#678

Open
paritoshd-nv wants to merge 4 commits into NVIDIA:main from paritoshd-nv:docs/spark-local-ollama-setup

Conversation

paritoshd-nv (Contributor) commented Mar 23, 2026

Add step-by-step instructions for setting up local inference with Ollama on DGX Spark, covering NVIDIA runtime verification, Ollama install and model pre-load, OLLAMA_HOST=0.0.0.0 configuration, and sandbox connection with verification.

Fixes #314, #385

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • make check passes.
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • make format applied (TypeScript and Python).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide. Try running the update-docs agent skill to draft compliant changes. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."
  • New pages include the SPDX license header and frontmatter.
  • Cross-references and links verified.

Summary by CodeRabbit

  • Documentation
    • Added a step-by-step guide to set up local GPU inference with Ollama: validate GPU/container runtime, install and verify Ollama, pre-load a large model, configure the service for network access, install client tools, connect to the local HTTPS inference endpoint with verification, and run an agent test using a sample prompt and session.


coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 00cac537-810f-4ccb-b012-7ece5ad893eb

📥 Commits

Reviewing files that changed from the base of the PR and between 0ff8614 and 8d02c4d.

📒 Files selected for processing (1)
  • spark-install.md
✅ Files skipped from review due to trivial changes (1)
  • spark-install.md

📝 Walkthrough

Walkthrough

Added a new "Setup Local Inference (Ollama)" documentation section to spark-install.md that provides an end-to-end local GPU inference setup: validate NVIDIA Docker runtime, install and configure Ollama (including pulling a model and enabling external binding), install OpenShell and NemoClaw, and verify/connect to inference.local.

Changes

Cohort / File(s): Documentation — Local inference workflow (docs/spark-install.md)
Summary: Added "Setup Local Inference (Ollama)" section (+92 lines). Describes steps to test the NVIDIA Docker runtime and configure it with nvidia-ctk, install and verify Ollama (curl to :11434), pull nemotron-3-super:120b and preload it, set OLLAMA_HOST=0.0.0.0 via a systemd override, install OpenShell and NemoClaw, and run nemoclaw ... connect / openclaw agent ... --local with explicit curl checks to https://inference.local/v1/models.
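For orientation, the walkthrough's flow condenses to the following hedged shell outline. The CUDA image tag and the Ollama install URL are assumptions, and the commands are echoed rather than executed here, since running them needs DGX hardware and a live sandbox:

```shell
# Condensed outline of the documented setup flow; illustrative only.
# The CUDA tag and install URL below are assumptions, not from the PR diff.
STEPS='docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
curl -fsSL https://ollama.com/install.sh | sh
curl -sf http://127.0.0.1:11434
ollama pull nemotron-3-super:120b
sudo systemctl restart ollama   # after setting OLLAMA_HOST=0.0.0.0 via override
curl -sf https://inference.local/v1/models'
echo "$STEPS"
```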

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant Host
  participant Docker
  participant Ollama
  participant Systemd
  participant OpenShell
  participant NemoClaw

  User->>Host: run `docker run --gpus all --rm nvidia/cuda:... nvidia-smi`
  Host->>Docker: attempt GPU runtime
  alt GPU runtime missing
    Host->>Host: run `nvidia-ctk` to configure runtime & restart Docker
  end
  User->>Host: install Ollama (official script)
  Host->>Ollama: start service (default localhost:11434)
  User->>Ollama: curl http://127.0.0.1:11434 (verify)
  User->>Ollama: `ollama pull nemotron-3-super:120b` (preload)
  User->>Systemd: create override to set `OLLAMA_HOST=0.0.0.0`
  Systemd->>Ollama: restart service (listen 0.0.0.0:11434)
  User->>OpenShell: install (script) & choose "Local Ollama" + model
  User->>NemoClaw: install (script), run `nemoclaw ... connect`
  NemoClaw->>Host: curl -sf https://inference.local/v1/models (validate routing)
  User->>OpenShell: run `openclaw agent ... --local` (start agent using local inference)

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

"I hopped through docs with eager paws—
tested GPUs and fixed the claws.
Pulled a model, opened the door,
now local inference roams the floor.
— 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Linked Issues check: ❓ Inconclusive
Explanation: The PR adds documentation for local Ollama setup on DGX Spark, addressing issue #314's need to document reliable in-sandbox routing to local Ollama. However, it is purely documentation and does not implement fixes for the underlying 403 Forbidden routing issue itself.
Resolution: Clarify whether this documentation PR is intended to supplement a code fix for issue #314's routing problem, or if code changes addressing the 403 error are expected in a separate PR.
✅ Passed checks (4 passed)
Description check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title concisely and accurately describes the main change: adding local Ollama inference setup documentation to spark-install.md.
Out of Scope Changes check: ✅ Passed. All changes are within scope: the PR adds step-by-step setup documentation for local Ollama inference on DGX Spark, which aligns with documenting reliable in-sandbox routing mentioned in issue #314.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@spark-install.md`:
- Around line 169-177: Add an explicit check for the local inference proxy in
Step 6: after running the "nemoclaw my-assistant connect" command and before
"openclaw agent --agent main --local ...", run a curl GET against
http://inference.local/api/tags, capture the response body to
/tmp/inference_tags.json and assert the HTTP status is 200 so the documentation
verifies the non-403 fallback path is working; reference the existing step
commands ("nemoclaw my-assistant connect" and "openclaw agent --agent main
--local -m ...") so the check is placed inside the sandbox and fails the doc
verification if inference.local returns 403 or non-200.
- Line 157: Replace the netstat-based listener check "sudo netstat -nap | grep
11434" with an ss-based check: update the line that mentions netstat to use ss
to list listening TCP sockets with numeric ports and process info (for example
using ss with listen, tcp, numeric and process flags and filtering for port
11434) so the doc uses the standard iproute2 tool present on Ubuntu 24.04.
- Line 105: Replace the failing runtime verification command that uses the plain
"ubuntu" image; update the Docker command string "docker run --rm
--runtime=nvidia --gpus all ubuntu nvidia-smi" to use an NVIDIA CUDA image (for
example an nvidia/cuda:<tag>-runtime image such as
nvidia/cuda:11.8-runtime-ubuntu20.04) so that nvidia-smi is present in the
container; keep the same flags (--rm --runtime=nvidia --gpus all) and the final
command (nvidia-smi) but swap the image name to a CUDA runtime image.
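The two command-level fixes above can be sketched as follows. The ss output line is canned (the pid and fd values are made up) and the CUDA tag is illustrative, so the snippet runs without GPU hardware:

```shell
# 1) Listener check with ss (iproute2 ships by default on Ubuntu 24.04).
#    On the host you would run:  sudo ss -tlnp | grep ':11434'
#    Here the filter is demonstrated against canned ss output:
sample='LISTEN 0 4096 *:11434 *:* users:(("ollama",pid=1234,fd=3))'
echo "$sample" | grep ':11434'

# 2) Runtime verification with a CUDA image so nvidia-smi exists inside the
#    container; echoed rather than executed since it needs an NVIDIA GPU.
CMD="docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi"
echo "$CMD"
```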

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ab3c5b3-d63d-4252-9a95-e07ac7cddea3

📥 Commits

Reviewing files that changed from the base of the PR and between ffa1283 and 321962f.

📒 Files selected for processing (1)
  • spark-install.md

netstat requires net-tools which is not installed by default on
Ubuntu 24.04. ss from iproute2 is available by default and is
more reliable for verifying listening sockets.

Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com>
Add explicit curl to https://inference.local/v1/models inside the
sandbox to validate the proxy route before running the agent. This
prevents fallback paths from masking regressions in the fix for NVIDIA#314.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
spark-install.md (1)

142-152: Add a hardening note when binding Ollama to 0.0.0.0.

Line 148 intentionally exposes Ollama on all interfaces; add a short warning to restrict network access (trusted LAN only / firewall), since Ollama is typically unauthenticated by default.

Suggested wording
 ### Step 4: Configure Ollama to Listen on All Interfaces

 By default Ollama binds to `127.0.0.1`, which is not reachable from inside the sandbox container. Configure it to listen on all interfaces:
+> Security note: `OLLAMA_HOST=0.0.0.0` exposes Ollama on your network. Restrict access with host firewall rules or trusted-network isolation.
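For reference, the override this step relies on is typically a systemd drop-in like the sketch below (the path follows systemd's standard drop-in convention; apply it with sudo systemctl daemon-reload followed by sudo systemctl restart ollama):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Binds Ollama to all interfaces; pair with the firewall restrictions noted above.
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```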
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spark-install.md` around lines 142 - 152, Add a short hardening warning to
Step 4 near the OLLAMA_HOST=0.0.0.0 instruction: note that binding Ollama to
0.0.0.0 exposes the service to all network interfaces and should only be done on
a trusted LAN or behind a firewall, and recommend restricting access via
firewall rules or local network-only interfaces if Ollama is unauthenticated by
default; reference the OLLAMA_HOST=0.0.0.0 override.conf instruction so readers
know where to apply the caution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@spark-install.md`:
- Around line 176-181: Update the probe command that currently reads `curl -s
https://inference.local/v1/models` so it fails fast on non-2xx responses;
replace it with a curl invocation that returns non-zero on non-success (for
example `curl -sSf https://inference.local/v1/models`) or explicitly assert HTTP
200 (for example `curl -s -o /dev/null -w '%{http_code}'
https://inference.local/v1/models | grep -q '^200$'`) so the step gates success
when `inference.local` does not return 200.
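The status-gating variant suggested above can be factored into a small helper so the logic is visible without a live proxy; the real call against https://inference.local/v1/models is shown only in comments:

```shell
# Fail-fast pattern: succeed only on HTTP 200. On the sandbox you would run:
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://inference.local/v1/models)
#   assert_http_200 "$code" || { echo "proxy returned $code" >&2; exit 1; }
assert_http_200() { [ "$1" = "200" ]; }   # non-zero exit for any other status
assert_http_200 200 && echo "proxy reachable"
assert_http_200 403 || echo "a 403 would fail the check"
```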

---

Nitpick comments:
In `@spark-install.md`:
- Around line 142-152: Add a short hardening warning to Step 4 near the
OLLAMA_HOST=0.0.0.0 instruction: note that binding Ollama to 0.0.0.0 exposes the
service to all network interfaces and should only be done on a trusted LAN or
behind a firewall, and recommend restricting access via firewall rules or local
network-only interfaces if Ollama is unauthenticated by default; reference the
OLLAMA_HOST=0.0.0.0 override.conf instruction so readers know where to apply the
caution.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3569242d-fbcd-4439-8540-752cb0c458c5

📥 Commits

Reviewing files that changed from the base of the PR and between 321962f and a9dbc13.

📒 Files selected for processing (1)
  • spark-install.md


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
spark-install.md (3)

157-157: Add sudo to the ss command for complete process information.

The -p flag requires elevated privileges to display process information. While the command will work without sudo, it won't show the full process details that help verify Ollama is the service listening on port 11434.

📝 Suggested fix
-ss -tlnp | grep 11434
+sudo ss -tlnp | grep 11434
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spark-install.md` at line 157, Update the command that checks listeners to
run with elevated privileges so process info is shown; change the existing "ss
-tlnp | grep 11434" invocation to run under sudo (i.e., prefix with sudo) so the
-p flag can return full process details and confirm Ollama is the process on
port 11434.

138-140: Consider providing a non-interactive alternative for model preloading.

The current step requires users to manually type /bye to exit, which breaks automation. Consider adding a note about a non-interactive approach or mention that this step is optional (the model will be loaded on first actual use).

📝 Suggested documentation improvement
 Run it briefly to pre-load weights into unified memory, then exit:

 ollama run nemotron-3-super:120b
 # type /bye to exit

+> Note: This step is optional. The model will be loaded automatically on first use, but pre-loading can reduce initial inference latency.
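One non-interactive alternative (an assumption worth noting in the doc, based on Ollama's documented API behavior): the generate endpoint loads a model into memory when given an empty prompt, so no interactive /bye is needed. The command is echoed rather than executed here because it requires a running Ollama on :11434:

```shell
# Non-interactive preload via the Ollama HTTP API; an empty prompt makes
# the server load the model without generating any output.
PRELOAD="curl -s http://127.0.0.1:11434/api/generate -d '{\"model\": \"nemotron-3-super:120b\", \"prompt\": \"\"}'"
echo "$PRELOAD"
```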

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @spark-install.md around lines 138 - 140, Update the "ollama run
nemotron-3-super:120b" step to note that it is optional and that the model will
be loaded on first use, and add a short non-interactive alternative so
automation isn't blocked; reference the interactive shutdown token "/bye" and
describe using a one-shot or piped/timeout-based invocation as the
non-interactive approach and include a brief example sentence explaining reduced
latency from preloading.



---

173-173: Clarify the sandbox name reference.

The command references `my-assistant` as the sandbox name, but this name isn't defined in the "Setup Local Inference (Ollama)" section. Consider adding a note that this is the default sandbox name created during onboarding, or reference where users should have created this sandbox.


📝 Suggested clarification
+Connect to your sandbox (the default name is `my-assistant` if created during onboarding):
+
 # Connect to the sandbox
 nemoclaw my-assistant connect

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @spark-install.md at line 173, Clarify that the sandbox name "my-assistant"
used in the command "nemoclaw my-assistant connect" is the default sandbox
created during onboarding (or point to where users should create it). Update the
"Setup Local Inference (Ollama)" section to either mention that onboarding
creates a sandbox named "my-assistant" or add a brief note/instruction telling
users how to create/choose a sandbox before running "nemoclaw my-assistant
connect" so the reference is explicit and not ambiguous.



🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @spark-install.md:


Nitpick comments:
In @spark-install.md:

  • Line 157: Update the command that checks listeners to run with elevated
    privileges so process info is shown; change the existing "ss -tlnp | grep 11434"
    invocation to run under sudo (i.e., prefix with sudo) so the -p flag can return
    full process details and confirm Ollama is the process on port 11434.
  • Around line 138-140: Update the "ollama run nemotron-3-super:120b" step to
    note that it is optional and that the model will be loaded on first use, and add
    a short non-interactive alternative so automation isn't blocked; reference the
    interactive shutdown token "/bye" and describe using a one-shot or
    piped/timeout-based invocation as the non-interactive approach and include a
    brief example sentence explaining reduced latency from preloading.
  • Line 173: Clarify that the sandbox name "my-assistant" used in the command
    "nemoclaw my-assistant connect" is the default sandbox created during onboarding
    (or point to where users should create it). Update the "Setup Local Inference
    (Ollama)" section to either mention that onboarding creates a sandbox named
    "my-assistant" or add a brief note/instruction telling users how to
    create/choose a sandbox before running "nemoclaw my-assistant connect" so the
    reference is explicit and not ambiguous.


---

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 216e11da-1053-4127-b4a3-815b93174dc0

📥 Commits

Reviewing files that changed from the base of the PR and between a9dbc13 and 909f98e.

📒 Files selected for processing (1)
  • spark-install.md


paritoshd-nv force-pushed the docs/spark-local-ollama-setup branch from 909f98e to 0ff8614 on March 23, 2026 at 01:53
Use curl -sf so the check exits non-zero on HTTP errors (403, 503,
etc.), preventing a silent 403 from masking a proxy routing regression.

Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com>
paritoshd-nv force-pushed the docs/spark-local-ollama-setup branch from 0ff8614 to 8d02c4d on March 23, 2026 at 01:58


Development

Successfully merging this pull request may close these issues.

[Bug] inference.local returns HTTP 403 inside sandbox when using Ollama local inference on DGX Spark

1 participant