
feat: expand provider onboarding and validation#648

Open
kjw3 wants to merge 4 commits into main from feat/openai-anthropic-gemini-providers

Conversation


@kjw3 kjw3 commented Mar 22, 2026

Summary

Expand nemoclaw onboard from an NVIDIA-only path into provider-first onboarding with live validation for hosted and local inference runtimes. This makes the sandbox model identity match the selected provider/model, adds compatible endpoint support, and updates the docs to reflect the current host-routed inference architecture.

Gemini note: the compatible-endpoint path here was informed by the debugging thread in #368. Credit to @andy-ratsirarson for surfacing the supportsStore: false direction that unblocked Gemini, and to @alannguyen1 for adjacent investigation that helped narrow the fix path.

Related Issue

Closes #24
Closes #567
Closes #628

Related:

Changes

  • Add provider-first onboarding for NVIDIA Endpoints, OpenAI, Other OpenAI-compatible endpoint, Anthropic, Other Anthropic-compatible endpoint, Google Gemini, and Local Ollama, with experimental local NIM/vLLM still gated separately.
  • Move provider/model selection before sandbox creation and bake the selected OpenClaw model config into the sandbox image as an immutable default.
  • Validate the selected provider before leaving onboarding step 2:
    • OpenAI-compatible runtimes probe /responses first and fall back to /chat/completions
    • Anthropic-compatible runtimes validate via /v1/messages
  • Add curated model pickers plus Other... flows with retry logic:
    • NVIDIA Endpoints manual models validate against /v1/models
    • compatible proxy flows validate by live inference instead of depending on /models
    • manual entry reprompts on bad syntax/model failures and returns to provider selection for auth/base URL/transport failures
  • Fix local Ollama onboarding to offer starter models when none are installed, pull/warm the selected model, and validate it before continuing.
  • Route sandboxed OpenClaw through inference.local with the correct selected model identity so in-sandbox reporting matches the active provider instead of a baked Nemotron default.
  • Avoid passing provider secrets in openshell argv during onboarding/provider creation.
  • Refresh README and inference docs for the new multi-provider developer workflow.
  • Add/refresh onboarding and inference tests, and align the maintained package migration-state tests with the recent hardened env checks from main.
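The step-2 validation order described in the bullets above can be sketched as a small helper. The names here (validationPlan, probeWithFallback) are illustrative, not the actual helpers in bin/lib/onboard.js:

```javascript
// Hypothetical sketch of the provider validation order: OpenAI-compatible
// runtimes try /responses first and fall back to /chat/completions, while
// Anthropic-compatible runtimes validate directly against /v1/messages.
function validationPlan(kind) {
  if (kind === "openai-compatible") {
    return ["/responses", "/chat/completions"];
  }
  if (kind === "anthropic-compatible") {
    return ["/v1/messages"];
  }
  throw new Error(`unknown runtime kind: ${kind}`);
}

function probeWithFallback(kind, probe) {
  // probe(path) returns true when the endpoint answers a live request.
  for (const path of validationPlan(kind)) {
    if (probe(path)) return path;
  }
  return null; // nothing validated; onboarding would reprompt here
}
```

Validating by live inference rather than a /models listing matters for proxies that do not implement the discovery endpoint, which is why the compatible flows probe the inference paths directly.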

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • make check passes.
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Additional validation:

  • cd nemoclaw && npm run build
  • node --test test/onboard-selection.test.js test/onboard.test.js test/inference-config.test.js
  • Manual validation for non-experimental providers on the feature branch, including OpenAI, Anthropic, Google Gemini, NVIDIA Endpoints, compatible proxy flows, and Ollama

Checklist

General

Code Changes

  • make format applied (TypeScript and Python).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide. Try running the update-docs agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Summary by CodeRabbit

  • New Features

    • Multi-provider inference support enabling selection from NVIDIA Endpoints, OpenAI, Anthropic, Google Gemini, and compatible endpoints.
    • Enhanced onboarding flow with provider validation and configuration.
    • Improved secure credential input handling for sensitive data.
    • Local Ollama support with improved model selection and validation.
  • Documentation

    • Updated guides for provider selection, inference routing, and configuration across supported providers.


coderabbitai bot commented Mar 22, 2026

📝 Walkthrough

Walkthrough

The PR introduces multi-provider inference support, replacing hardcoded NVIDIA defaults with provider-agnostic routing. Users now select from supported providers (NVIDIA Endpoints, OpenAI, Anthropic, Gemini, Ollama) during onboarding. Provider configuration is dynamically injected into the sandbox image at build time, credentials are securely prompted via TTY, and the OpenShell gateway routes inference requests to the selected provider endpoint.

Changes

Cohort / File(s) Summary
Docker Build & Provider Injection
Dockerfile
Added build arguments for provider selection (NEMOCLAW_PROVIDER_KEY, NEMOCLAW_PRIMARY_MODEL_REF, etc.). Updated openclaw.json generation to dynamically build a providers map and embed the selected provider/model instead of hardcoded NVIDIA defaults.
Credentials & Secret Handling
bin/lib/credentials.js
Introduced promptSecret() for TTY-based secret input with backspace support and SIGINT handling. Extended prompt() to route secret inputs through the new handler when opts.secret is true, replacing non-secret credential prompts.
Inference Configuration
bin/lib/inference-config.js
Extended provider selection config to handle multiple providers: added nvidia-prod and custom endpoint mappings for OpenAI, Anthropic, Gemini, and compatible endpoints. Each provider specifies credential environment variable and default model.
Local Inference Validation
bin/lib/local-inference.js
Added provider-specific validation URL resolution (getLocalProviderValidationBaseUrl), Ollama bootstrap model selection logic (getBootstrapOllamaModelOptions), and updated model defaults to use bootstrap options when no models exist.
Onboarding Orchestration
bin/lib/onboard.js
Restructured onboarding from streaming sandbox creation to argument-based OpenShell execution. Added OpenShell CLI discovery, gateway/sandbox reconciliation, provider/model validation via endpoint probing, Dockerfile patching with provider/model injection, and inference route setup via OpenShell commands. Significantly expanded provider catalog and retry/validation flows.
NIM Container Management
bin/lib/nim.js
Refactored NIM operations to use explicit container names. Added *ByName variants (startNimContainerByName, stopNimContainerByName, nimStatusByName); existing functions now delegate to these.
CLI & Registry Updates
bin/nemoclaw.js
Updated sandbox status and destroy flows to resolve NIM container name from sandbox registry metadata when available, falling back to derived names if absent.
Documentation Updates
README.md, docs/about/how-it-works.md, docs/inference/switch-inference-providers.md, docs/reference/commands.md, docs/reference/inference-profiles.md, docs/reference/troubleshooting.md
Rewrote inference guidance from NVIDIA-centric to provider-agnostic. Added supported provider matrix, provider selection/validation flow explanation, credential storage location (~/.nemoclaw/credentials.json), and runtime switching via OpenShell routing.
Unit Tests
test/inference-config.test.js, test/local-inference.test.js
Updated provider assertion from nvidia-nim to nvidia-prod, added test for Anthropic-compatible endpoint config, and corrected Ollama model list behavior when empty.
Onboarding Selection Tests
test/onboard-selection.test.js
Comprehensive test suite for new onboarding flows: provider/model selection, endpoint validation via mocked curl, Ollama bootstrap/warmup, model-specific preferredInferenceApi assertions, retry/reprompt scenarios, and validation failure handling.
Onboarding Helper Tests
test/onboard.test.js
Added tests for Dockerfile patching (patchStagedDockerfile), sandbox inference config mapping (getSandboxInferenceConfig), stale sandbox cleanup (pruneStaleSandboxEntry), and subprocess-driven credential/provider updates via OpenShell CLI.
End-to-End Tests
test/e2e/test-full-e2e.sh
Refactored command success checks to explicit if/then/else blocks, updated provider assertion from nvidia-nim to nvidia-prod, and improved test readability with shellcheck source directives.
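The masked-input behavior that promptSecret() is described as adding (backspace support, SIGINT handling) can be sketched as a pure key handler. This is an assumption about its shape; the real implementation in bin/lib/credentials.js reads raw TTY keystrokes and may differ in detail:

```javascript
// Illustrative per-keystroke reducer for a masked secret prompt.
// Returns the updated buffer plus flags for completion and abort.
function applySecretKey(buf, key) {
  if (key === "\u0003") {
    // Ctrl-C: abort, mirroring SIGINT handling
    return { buf, done: true, aborted: true };
  }
  if (key === "\r" || key === "\n") {
    // Enter: finish input
    return { buf, done: true, aborted: false };
  }
  if (key === "\u007f" || key === "\b") {
    // Backspace: drop the last character
    return { buf: buf.slice(0, -1), done: false, aborted: false };
  }
  // Any other key: append (the terminal would echo "*" instead)
  return { buf: buf + key, done: false, aborted: false };
}
```

A real prompt would put the TTY into raw mode, echo a mask character per keystroke, and restore the terminal state on exit.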

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as nemoclaw CLI
    participant Onboard as Onboarding<br/>Handler
    participant Provider as Provider<br/>Validation
    participant Docker as Docker<br/>Build
    participant OpenShell as OpenShell<br/>Gateway
    participant Sandbox as Sandbox<br/>Container
    participant Registry as Registry
    
    User->>CLI: nemoclaw onboard
    CLI->>Onboard: Initialize onboarding
    
    Onboard->>User: Prompt provider selection
    User-->>Onboard: Select provider (e.g., OpenAI)
    
    Onboard->>Provider: Validate provider endpoint
    Provider->>Provider: Probe /chat/completions or<br/>provider-specific endpoint
    Provider-->>Onboard: Endpoint available
    
    Onboard->>User: Prompt for model/credentials
    User-->>Onboard: Provide model + credential
    
    Onboard->>Docker: Patch staged Dockerfile<br/>(inject provider/model args)
    Docker-->>Onboard: Dockerfile patched
    
    Onboard->>Docker: Build sandbox image<br/>(NEMOCLAW_PROVIDER_KEY,<br/>NEMOCLAW_PRIMARY_MODEL_REF, etc.)
    Docker->>Docker: Generate openclaw.json<br/>with provider map
    Docker-->>Onboard: Image built
    
    Onboard->>OpenShell: Create sandbox from image
    OpenShell->>Sandbox: Launch container
    Sandbox-->>OpenShell: Ready
    
    Onboard->>OpenShell: Configure inference route<br/>(upsert provider + model)
    OpenShell-->>Onboard: Route configured
    
    Onboard->>Provider: Verify inference route
    Provider->>Sandbox: Test inference.local
    Sandbox->>OpenShell: Forward to provider endpoint
    OpenShell->>Provider: Call actual provider
    Provider-->>Sandbox: Response
    Sandbox-->>Onboard: Verified
    
    Onboard->>Registry: Store sandbox metadata<br/>(provider, model, nimContainer)
    Registry-->>Onboard: Metadata saved
    
    Onboard-->>User: Onboarding complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hops of joy through provider lands!
No more hardcoded NVIDIA brands—
OpenAI, Anthropic, Ollama too,
Each tunnel routed, shiny and new.
Credentials safe in TTY hands,
While sandbox dreams in Docker strands! 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 12.94%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'feat: expand provider onboarding and validation' accurately reflects the main change: expanding onboarding to support multiple providers and adding validation logic for provider selection.
  • Linked Issues Check — ✅ Passed: The PR comprehensively addresses all three linked issues: #24 (in-sandbox model identity now reflects the configured provider), #567 (onboarding now allows provider choice before any API key prompt), and #628 (Local Ollama selection is recorded in the sandbox openclaw.json).
  • Out of Scope Changes Check — ✅ Passed: All changes are scoped to the provider onboarding and validation objectives. Docker build arguments, credential handling, inference routing, onboarding flows, documentation updates, and related tests are all directly aligned with the PR goals.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Mar 22, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/NemoClaw/pr-preview/pr-648/

Built to branch gh-pages at 2026-03-22 16:06 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bin/lib/credentials.js (1)

172-179: ⚠️ Potential issue | 🔴 Critical

execSync is used but not imported — runtime ReferenceError.

The functions isRepoPrivate() (line 174) and ensureGithubToken() (line 189) both call execSync, but this function is not imported from child_process. This will cause a ReferenceError when either function is executed.

Proposed fix: restore the execSync import
 const fs = require("fs");
 const path = require("path");
 const readline = require("readline");
+const { execSync } = require("child_process");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 172 - 179, The file uses execSync in
functions isRepoPrivate and ensureGithubToken but never imports it; add an
import for execSync from the child_process module at the top of the file so both
functions can call execSync without causing a ReferenceError (locate the
top-level imports and add the execSync import alongside them so isRepoPrivate
and ensureGithubToken work correctly).
🧹 Nitpick comments (2)
docs/inference/switch-inference-providers.md (1)

94-94: Keep one sentence per source line here.

This bullet packs two sentences onto one source line. Split it so diffs stay readable.

As per coding guidelines, “One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/inference/switch-inference-providers.md` at line 94, The bullet "Runtime
switching changes the OpenShell route. It does not rewrite your stored
credentials." contains two sentences on one source line; split it into two
source lines so each sentence is on its own line (e.g., place "Runtime switching
changes the OpenShell route." and "It does not rewrite your stored credentials."
on separate lines) to satisfy the "one sentence per source line" guideline and
keep diffs readable.
docs/reference/inference-profiles.md (1)

6-6: Use NemoClaw casing in the keywords entries.

Both new keywords use nemoclaw, which breaks the project word list and leaks inconsistent casing into search metadata.

As per coding guidelines, “NemoClaw, OpenClaw, and OpenShell must use correct casing.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/inference-profiles.md` at line 6, The keywords array uses
incorrect lowercase "nemoclaw" tokens which violate the casing rule; update the
two entries in the keywords array to use the exact "NemoClaw" casing (e.g.,
change "nemoclaw inference profiles" -> "NemoClaw inference profiles" and
"nemoclaw provider routing" -> "NemoClaw provider routing") so the keywords list
and search metadata follow the required NemoClaw capitalization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1347-1356: The setupNim(gpu) function currently assumes
sandboxName exists which causes a ReferenceError; move any code that starts or
registers a NIM container (calls to nim.startNimContainer(...) and
registry.updateSandbox(...)) out of setupNim and defer it until after
createSandbox(...) returns in onboard(), so that you call
nim.startNimContainer(sandboxName, model) and then
registry.updateSandbox(sandboxName, { nimContainer }) using the real
sandboxName; keep setupNim responsible only for selecting
model/provider/preferredInferenceApi and return a flag or the model info (e.g.,
nimPending or model) to onboard(), then perform the actual container creation
and sandbox update there.

In `@docs/reference/inference-profiles.md`:
- Around line 41-50: The docs incorrectly mark Local Ollama as experimental and
omit it from the non-experimental provider table; update
docs/reference/inference-profiles.md to add a row for "Local Ollama" in the
provider table (e.g., Provider: Local Ollama, Endpoint Type: OpenAI-compatible
or appropriate type, Notes: local Ollama runtime) and remove or adjust the text
that wraps Local Ollama behind NEMOCLAW_EXPERIMENTAL (the gating mentioned later
in the file) so Local Ollama is listed alongside the other non-experimental
providers (ensure references to NEMOCLAW_EXPERIMENTAL only apply to local NVIDIA
NIM and local vLLM as intended).

In `@README.md`:
- Around line 184-193: The README's "Supported non-experimental onboarding
paths" table omits "Local Ollama" while the surrounding text still labels Ollama
as experimental; update the table row entries to include a "Local Ollama"
provider (with a short note like "Local Ollama — local model serving via
Ollama") or alternatively narrow the heading to explicitly state which providers
are listed (e.g., "Supported non-experimental onboarding paths (cloud
providers)"). Ensure the change aligns with the PR that moves Ollama into the
standard flow and update any adjacent note that still calls Ollama experimental
so wording is consistent.

In `@test/onboard.test.js`:
- Around line 196-215: The test fixture uses the old provider id "nvidia-nim" in
both the mocked inference get output and the setup call; update the mock
response lines that include "Provider: nvidia-nim" and the setupInference
invocation that passes "nvidia-nim" to instead use "nvidia-prod" so the test
aligns with the rest of the PR (look for the mocked block that returns "Gateway
inference" and the call to setupInference("test-box", ..., "nvidia-nim") and
replace the provider string in both places).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 99f3efbe-a2ad-41d0-85ad-76ed5ea3acbd

📥 Commits

Reviewing files that changed from the base of the PR and between b6e0fc1 and 32c9436.

📒 Files selected for processing (15)
  • Dockerfile
  • README.md
  • bin/lib/credentials.js
  • bin/lib/inference-config.js
  • bin/lib/local-inference.js
  • bin/lib/onboard.js
  • docs/about/how-it-works.md
  • docs/inference/switch-inference-providers.md
  • docs/reference/commands.md
  • docs/reference/inference-profiles.md
  • docs/reference/troubleshooting.md
  • nemoclaw/src/commands/migration-state.test.ts
  • test/inference-config.test.js
  • test/onboard-selection.test.js
  • test/onboard.test.js

@kjw3 added the enhancement: provider label on Mar 22, 2026
@kjw3 kjw3 requested a review from krmurph March 22, 2026 15:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bin/lib/onboard.js (1)

1220-1233: ⚠️ Potential issue | 🟠 Major

Reusing an existing sandbox can leave the baked model identity stale.

By this point the gateway has already been switched to the newly selected provider/model, but this early return skips Dockerfile patching and rebuild. If the user changed either value, the existing sandbox keeps the old baked OpenClaw identity while the host route now points somewhere else.

Also applies to: 1988-1990

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1220 - 1233, The early return after the user
chooses to "Keep existing sandbox" (`if (recreate.toLowerCase() !== "y") { ...
return sandboxName; }`) causes Dockerfile patching and rebuild to be skipped,
leaving a stale baked model identity; instead, change the flow so that choosing
to keep the existing sandbox does NOT return early — run the subsequent
Dockerfile patching and rebuild steps (the code that updates the baked OpenClaw
identity) unless you can detect that neither provider nor model changed; modify
the branch around liveExists / prompt (symbols: liveExists, isNonInteractive(),
prompt(), sandboxName) to only short-circuit when it is safe (e.g.,
provider/model unchanged), and apply the same fix at the other occurrence
mentioned (the logic around lines 1988-1990) so both code paths perform
Dockerfile patching/rebuild when necessary.
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

1779-1780: ⚠️ Potential issue | 🟠 Major

Don't write sandbox metadata under GATEWAY_NAME.

setupInference() is invoked with "nemoclaw" before createSandbox() returns the real sandbox name, but it still updates the registry with model/provider. That either creates a stray nemoclaw entry or leaves the real sandbox showing unknown in nemoclaw list/status.

Suggested fix
-  await setupInference(GATEWAY_NAME, model, provider, endpointUrl, credentialEnv);
+  await setupInference(model, provider, endpointUrl, credentialEnv);
   const sandboxName = await createSandbox(gpu, model, provider, preferredInferenceApi);
-  if (nimContainer) {
-    registry.updateSandbox(sandboxName, { nimContainer });
-  }
+  registry.updateSandbox(sandboxName, {
+    model,
+    provider,
+    ...(nimContainer ? { nimContainer } : {}),
+  });

and remove the registry write from setupInference(), since that step is gateway-scoped, not sandbox-scoped.

Also applies to: 1986-1993

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1779 - 1780, The code writes sandbox
metadata under the GATEWAY_NAME because setupInference() (called with
"nemoclaw") calls registry.updateSandbox(sandboxName, { model, provider })
before the real sandbox name is known; remove the registry.updateSandbox call
from setupInference() so gateway-scoped setup doesn't write sandbox-scoped
metadata, and instead perform registry.updateSandbox(sandboxName, { model,
provider }) only after createSandbox() returns the actual sandboxName (or in the
function that has the real sandboxName), keeping verifyInferenceRoute(...) and
provider/model handling in setupInference() but not the registry write.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 531-552: The non-interactive validation path is using process.env
directly and can miss credentials; update validateOpenAiLikeSelection to resolve
credentials via getCredential(credentialEnv) (not raw process.env lookup) before
calling probeOpenAiLikeEndpoint so the probe gets the same credential resolution
as interactive flows (use the same pattern for other validators mentioned);
specifically, ensure the apiKey is derived via getCredential when credentialEnv
is provided, pass that apiKey into probeOpenAiLikeEndpoint, and replicate this
fix for the other functions that call probeOpenAiLikeEndpoint/Anthropic probes
so non-interactive CI runs use the resolved credential.
- Around line 1265-1275: The code is appending secrets (e.g., NVIDIA_API_KEY,
DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) into envArgs which get serialized into the
openshell sandbox create command and exposed on process argv; instead stop
adding secret values via formatEnvAssignment into envArgs and pass them via a
secure channel: set these values on the spawned process environment
(child_process.spawn/exec options.env) or deliver via stdin/temporary file with
restrictive permissions before calling runOpenshell, updating the logic around
envArgs, formatEnvAssignment and the call site that invokes runOpenshell so
non-secret flags remain in envArgs while actual token values are injected into
the child process env (or a secure file) rather than the command line.
- Around line 436-445: The curl invocations built into the cmd array (the code
that constructs cmd using probe.body and probe.url and then calls
spawnSync("bash", ["-c", cmd])) need connection and overall timeouts added so a
stalled endpoint doesn't hang onboarding; modify the cmd construction to include
--connect-timeout 5 and --max-time 15 (or use configurable constants) before the
URL and data args, and apply the same change to the other identical curl
builders (the other cmd/spawnSync occurrences) so all synchronous probes time
out reliably.
- Around line 1351-1355: When the nim-local fallback path triggers inside
selectionLoop it only logs and breaks without choosing cloud inference, leaving
model and preferredInferenceApi unset; change that flow to assign the cloud
provider values before breaking: set provider, endpointUrl, credentialEnv from
the remote/cloud configuration (e.g., REMOTE_PROVIDER_CONFIG.build or the
appropriate cloud entry), and set preferredInferenceApi to the cloud inference
API name so later inference setup/Dockerfile patching has valid routing data;
update the branch that currently only logs the fallback to perform these
assignments (affecting variables provider, endpointUrl, credentialEnv,
preferredInferenceApi and ensuring model is set/cleared appropriately) before
exiting selectionLoop.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1468b6ba-e6ed-433c-9846-91527e84a444

📥 Commits

Reviewing files that changed from the base of the PR and between 32c9436 and 98a8da9.

📒 Files selected for processing (6)
  • README.md
  • bin/lib/nim.js
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/e2e/test-full-e2e.sh
  • test/local-inference.test.js
✅ Files skipped from review due to trivial changes (1)
  • README.md

Comment on lines +436 to +445
const cmd = [
  "curl -sS",
  `-o ${shellQuote(bodyFile)}`,
  "-w '%{http_code}'",
  "-H 'Content-Type: application/json'",
  ...(apiKey ? ['-H "Authorization: Bearer $NEMOCLAW_PROBE_API_KEY"'] : []),
  `-d ${shellQuote(probe.body)}`,
  shellQuote(probe.url),
].join(" ");
const result = spawnSync("bash", ["-c", cmd], {

⚠️ Potential issue | 🟠 Major

Add time limits to the live validation curls.

All of these probes are synchronous spawnSync(...curl...) calls without --connect-timeout or --max-time. A blackholed hostname or stalled endpoint will hang onboarding indefinitely instead of reaching the retry/reprompt flow.

Minimal hardening
-        "curl -sS",
+        "curl -sS --connect-timeout 5 --max-time 15",

Also applies to: 479-493, 641-649, 694-701, 729-737


Comment on lines 1265 to 1275
 if (process.env.NVIDIA_API_KEY) {
-  envArgs.push(`NVIDIA_API_KEY=${shellQuote(process.env.NVIDIA_API_KEY)}`);
+  envArgs.push(formatEnvAssignment("NVIDIA_API_KEY", process.env.NVIDIA_API_KEY));
 }
 const discordToken = getCredential("DISCORD_BOT_TOKEN") || process.env.DISCORD_BOT_TOKEN;
 if (discordToken) {
-  envArgs.push(`DISCORD_BOT_TOKEN=${shellQuote(discordToken)}`);
+  envArgs.push(formatEnvAssignment("DISCORD_BOT_TOKEN", discordToken));
 }
 const slackToken = getCredential("SLACK_BOT_TOKEN") || process.env.SLACK_BOT_TOKEN;
 if (slackToken) {
-  envArgs.push(`SLACK_BOT_TOKEN=${shellQuote(slackToken)}`);
+  envArgs.push(formatEnvAssignment("SLACK_BOT_TOKEN", slackToken));
 }

⚠️ Potential issue | 🟠 Major

NVIDIA_API_KEY still leaks through the sandbox-create argv.

envArgs is serialized into openshell sandbox create ... -- env ..., and runOpenshell() shells that through bash -c, so the provider key is still exposed in process arguments during onboarding. The same argv path also affects any other tokens appended here.

Also applies to: 1281-1284

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1265 - 1275, The code is appending secrets
(e.g., NVIDIA_API_KEY, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) into envArgs which
get serialized into the openshell sandbox create command and exposed on process
argv; instead stop adding secret values via formatEnvAssignment into envArgs and
pass them via a secure channel: set these values on the spawned process
environment (child_process.spawn/exec options.env) or deliver via
stdin/temporary file with restrictive permissions before calling runOpenshell,
updating the logic around envArgs, formatEnvAssignment and the call site that
invokes runOpenshell so non-secret flags remain in envArgs while actual token
values are injected into the child process env (or a secure file) rather than
the command line.

@kjw3
Contributor Author

kjw3 commented Mar 22, 2026

Reviewer summary for this PR:

This branch changes NemoClaw from a mostly NVIDIA-hosted onboarding flow into a provider-first routed inference flow.

What changed

  • Onboarding now starts with provider and model selection before sandbox creation.
  • Supported non-experimental onboarding paths now include:
    • NVIDIA Endpoints
    • OpenAI
    • Other OpenAI-compatible endpoint
    • Anthropic
    • Other Anthropic-compatible endpoint
    • Google Gemini
    • Local Ollama
  • Credentials remain host-side. The sandbox does not receive raw provider API keys.
  • The sandbox image now bakes the selected model/provider identity into immutable ~/.openclaw/openclaw.json during build instead of relying on post-create config mutation.
  • OpenShell continues to route sandbox inference through inference.local to the selected host-side provider.

Validation behavior

  • Onboarding does not move past step 2 unless the selected provider/model/key or local runtime actually validates.
  • OpenAI-compatible providers try POST /responses first, then fall back to POST /chat/completions if needed.
  • Anthropic providers validate with a live POST /v1/messages request.
  • NVIDIA Endpoints manual Other... model entry validates against https://integrate.api.nvidia.com/v1/models.
  • Custom compatible endpoints no longer depend on a /models endpoint. They validate by trying real inference on the entered model.
  • Ollama now pulls, warms, and validates the selected model before onboarding continues. If no models are installed, onboarding offers starter models and an Other... path.
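The probe-then-fallback order above can be sketched as a small pure helper. `probeFn` stands in for the real HTTP request and the function name is illustrative, not the branch's actual code:

```javascript
// Illustrative only: decide which API shape a provider supports based on
// probe status codes. probeFn(path) -> { status } is an injected stand-in.
function validateOpenAiLike(baseUrl, probeFn) {
  const first = probeFn(`${baseUrl}/responses`);
  if (first.status >= 200 && first.status < 300) {
    return { ok: true, api: "responses" };
  }
  // A 404/405 here typically means the endpoint lacks the Responses API.
  const second = probeFn(`${baseUrl}/chat/completions`);
  if (second.status >= 200 && second.status < 300) {
    return { ok: true, api: "chat-completions" };
  }
  return { ok: false, api: null };
}
```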

Notable compatibility work

  • Gemini now works through the routed OpenAI-compatible path after baking compat.supportsStore=false on the sandbox model entry.
  • Credit to @andy-ratsirarson and @alannguyen1 for the issue-thread investigation that helped narrow the working Gemini path.
  • NVIDIA-hosted and compatible-provider sandbox model identities were cleaned up so OpenClaw uses the routed inference/... model path instead of native provider aliases that do not match NemoClaw routing.

Docs and tests

  • README and developer docs were updated to describe routed multi-provider inference instead of NVIDIA-only onboarding.
  • Unit coverage was updated for the new provider flows, migration restore hardening, and local runtime handling.

Validation completed on this branch

  • Live validated by hand on the non-experimental providers during development:
    • NVIDIA Endpoints
    • OpenAI
    • Anthropic
    • Google Gemini
    • Local Ollama
  • Package checks and coverage now pass again:
    • cd nemoclaw && npm run check
    • cd nemoclaw && npm test -- --coverage
  • Root provider/onboarding tests exercised:
    • node --test test/onboard.test.js test/onboard-selection.test.js test/inference-config.test.js test/credentials.test.js

Still intentionally experimental

  • Local NVIDIA NIM
  • Local vLLM

If you are reviewing behavior, the highest-value things to check are:

  • step-2 validation and retry behavior for provider/model/key mistakes
  • that the selected provider/model is what OpenClaw reports inside the sandbox
  • that compatible endpoint flows recover cleanly from model vs auth/base-URL failures

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
bin/lib/onboard.js (3)

1264-1275: ⚠️ Potential issue | 🟠 Major

Secrets still passed through sandbox-create argv.

NVIDIA_API_KEY, DISCORD_BOT_TOKEN, and SLACK_BOT_TOKEN values are serialized into envArgs and passed to runOpenshell(["sandbox", "create", ...]). This exposes secrets in process arguments during onboarding.

Consider passing secrets via the spawned process's environment or a temporary file with restricted permissions instead of command-line arguments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1264 - 1275, The diff is serializing secrets
into envArgs (via formatEnvAssignment) which are passed to
runOpenshell(["sandbox", "create", ...]) exposing sensitive values; change the
logic so NVIDIA_API_KEY, DISCORD_BOT_TOKEN and SLACK_BOT_TOKEN are NOT appended
to envArgs but are injected via the child process environment or a secure
temporary file instead: detect values using getCredential or process.env as now,
remove any pushes to envArgs for those keys, and when invoking
runOpenshell("sandbox", "create", ...) pass an env object that includes these
secrets (or pass a path to a file created with restrictive permissions) so
secrets never appear in the command-line; update any callers that build
envArgs/formatEnvAssignment and ensure runOpenshell accepts and forwards the env
map to the spawned process.

436-452: ⚠️ Potential issue | 🟠 Major

Add timeouts to synchronous curl probes.

These spawnSync curl calls lack --connect-timeout and --max-time flags. A blackholed or stalled endpoint will hang onboarding indefinitely instead of reaching the retry/reprompt flow.

Suggested fix
       const cmd = [
-        "curl -sS",
+        "curl -sS --connect-timeout 5 --max-time 30",
         `-o ${shellQuote(bodyFile)}`,

Apply this pattern to all curl invocations: lines 436, 479, 641, 694, and 729.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 436 - 452, The curl command built into the
cmd array before the spawnSync call (where variables like probe, bodyFile,
shellQuote, and ROOT are used) must include connection and total timeouts to
avoid hanging; add --connect-timeout and --max-time flags with appropriate
values to the array construction that creates cmd so the spawned bash curl call
receives them, and apply the same change to the other identical curl invocations
in this file (the other places that construct a curl cmd and call spawnSync with
the same pattern).

531-552: ⚠️ Potential issue | 🟡 Minor

Credential resolution works correctly for interactive flows.

The validateOpenAiLikeSelection function properly uses getCredential(credentialEnv) to resolve credentials. This addresses the concern about credential lookup for validation.

However, for non-interactive mode, if credentials are only in process.env and not stored via saveCredential, there's a gap since getCredential reads from the credentials file, not process.env.

Suggested fix to support env-only credentials
 async function validateOpenAiLikeSelection(
   label,
   endpointUrl,
   model,
   credentialEnv = null,
   retryMessage = "Please choose a provider/model again."
 ) {
-  const apiKey = credentialEnv ? getCredential(credentialEnv) : "";
+  const apiKey = credentialEnv ? (getCredential(credentialEnv) || process.env[credentialEnv] || "") : "";
   const probe = probeOpenAiLikeEndpoint(endpointUrl, model, apiKey);

Apply the same pattern to validateAnthropicSelection (line 555), validateAnthropicSelectionWithRetryMessage (line 578), validateCustomOpenAiLikeSelection (line 595), and validateCustomAnthropicSelection (line 617).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 531 - 552, The non-interactive validation
functions (validateOpenAiLikeSelection, validateAnthropicSelection,
validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection,
validateCustomAnthropicSelection) currently only use
getCredential(credentialEnv) which reads the credentials file and thus misses
credentials provided solely via process.env; update each function to resolve
apiKey by falling back to process.env when getCredential returns falsy (e.g.
const apiKey = credentialEnv ? (getCredential(credentialEnv) ||
process.env[credentialEnv] || "") : "";), then pass that apiKey into the
existing probe* calls so non-interactive flows accept env-only credentials.
🧹 Nitpick comments (3)
bin/lib/onboard.js (2)

1615-1643: Duplicate Ollama validation logic.

The Ollama model selection and validation loop appears twice: once in the ollama branch (lines 1615-1643) and again in the install-ollama branch (lines 1655-1683). Consider extracting this into a helper function.

Extract shared Ollama validation loop
+async function selectAndValidateOllamaModel(gpu, requestedModel) {
+  while (true) {
+    const installedModels = getOllamaModelOptions(runCapture);
+    let model;
+    if (isNonInteractive()) {
+      model = requestedModel || getDefaultOllamaModel(runCapture, gpu);
+    } else {
+      model = await promptOllamaModel(gpu);
+    }
+    const probe = prepareOllamaModel(model, installedModels);
+    if (!probe.ok) {
+      console.error(`  ${probe.message}`);
+      if (isNonInteractive()) process.exit(1);
+      console.log("  Choose a different Ollama model or select Other.");
+      console.log("");
+      continue;
+    }
+    const api = await validateOpenAiLikeSelection(
+      "Local Ollama",
+      getLocalProviderValidationBaseUrl("ollama-local"),
+      model,
+      null,
+      "Choose a different Ollama model or select Other."
+    );
+    if (!api) continue;
+    return { model, api };
+  }
+}

Also applies to: 1655-1683

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1615 - 1643, The Ollama model
selection/validation loop (using getOllamaModelOptions, isNonInteractive,
getDefaultOllamaModel, promptOllamaModel, prepareOllamaModel,
validateOpenAiLikeSelection, getLocalProviderValidationBaseUrl, model,
preferredInferenceApi) is duplicated; extract it into a single helper function
(for example selectAndValidateOllamaModel(runCapture, provider, gpu,
requestedModel)) that encapsulates the while(true) loop and returns the
validated preferredInferenceApi (and selected model if needed), then replace
both duplicated blocks in the ollama and install-ollama branches by calling this
helper and handling its return value.

1568-1570: Cloud fallback relies on implicit defaults—consider making this explicit.

When no NIM models fit GPU VRAM, the code logs "Falling back to cloud API" but then falls through without explicitly setting cloud provider values. This works because the defaults at lines 1351-1355 are already set to NVIDIA Endpoints (nvidia-prod), but the implicit behavior is subtle.

Make the fallback explicit
       if (models.length === 0) {
         console.log("  No NIM models fit your GPU VRAM. Falling back to cloud API.");
+        // Defaults already set to NVIDIA Endpoints (nvidia-prod)
+        break;
       } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1568 - 1570, The branch that handles
models.length === 0 only logs a message but relies on earlier implicit defaults
for the cloud fallback; explicitly set the cloud fallback variables when no
local NIM models fit GPU VRAM by assigning the cloud provider/endpoint and any
flag that selects cloud inference (e.g., set provider/endpoint to the NVIDIA
production endpoint value used as the default and ensure the cloud selection
flag is true). Update the models.length === 0 block to explicitly set the same
provider/endpoint variables used elsewhere (the variables that previously
defaulted to "nvidia-prod"/NVIDIA) so the fallback behavior is clear and not
reliant on implicit earlier defaults.
test/onboard.test.js (1)

174-177: Subprocess tests don't clean up temporary directories.

The subprocess tests create temporary directories (e.g., line 176) but don't clean them up after completion, unlike the earlier tests that use try/finally with fs.rmSync. Consider adding cleanup or using a shared test fixture helper.

Add cleanup to subprocess tests
+    try {
       const result = spawnSync(process.execPath, [scriptPath], {
         // ...
       });
       // assertions...
+    } finally {
+      fs.rmSync(tmpDir, { recursive: true, force: true });
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 174 - 177, The subprocess test starting
with the "passes credential names to openshell without embedding secret values
in argv" it-block creates temporary directories (tmpDir, fakeBin) but never
removes them; wrap the test body in a try/finally (or add an afterEach/fixture)
and in the finally call fs.rmSync(tmpDir, { recursive: true, force: true }) to
ensure tmpDir and fakeBin are removed, referencing tmpDir/fakeBin variables in
test/onboard.test.js to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1264-1275: The diff is serializing secrets into envArgs (via
formatEnvAssignment) which are passed to runOpenshell(["sandbox", "create",
...]) exposing sensitive values; change the logic so NVIDIA_API_KEY,
DISCORD_BOT_TOKEN and SLACK_BOT_TOKEN are NOT appended to envArgs but are
injected via the child process environment or a secure temporary file instead:
detect values using getCredential or process.env as now, remove any pushes to
envArgs for those keys, and when invoking runOpenshell("sandbox", "create", ...)
pass an env object that includes these secrets (or pass a path to a file created
with restrictive permissions) so secrets never appear in the command-line;
update any callers that build envArgs/formatEnvAssignment and ensure
runOpenshell accepts and forwards the env map to the spawned process.
- Around line 436-452: The curl command built into the cmd array before the
spawnSync call (where variables like probe, bodyFile, shellQuote, and ROOT are
used) must include connection and total timeouts to avoid hanging; add
--connect-timeout and --max-time flags with appropriate values to the array
construction that creates cmd so the spawned bash curl call receives them, and
apply the same change to the other identical curl invocations in this file (the
other places that construct a curl cmd and call spawnSync with the same
pattern).
- Around line 531-552: The non-interactive validation functions
(validateOpenAiLikeSelection, validateAnthropicSelection,
validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection,
validateCustomAnthropicSelection) currently only use
getCredential(credentialEnv) which reads the credentials file and thus misses
credentials provided solely via process.env; update each function to resolve
apiKey by falling back to process.env when getCredential returns falsy (e.g.
const apiKey = credentialEnv ? (getCredential(credentialEnv) ||
process.env[credentialEnv] || "") : "";), then pass that apiKey into the
existing probe* calls so non-interactive flows accept env-only credentials.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1615-1643: The Ollama model selection/validation loop (using
getOllamaModelOptions, isNonInteractive, getDefaultOllamaModel,
promptOllamaModel, prepareOllamaModel, validateOpenAiLikeSelection,
getLocalProviderValidationBaseUrl, model, preferredInferenceApi) is duplicated;
extract it into a single helper function (for example
selectAndValidateOllamaModel(runCapture, provider, gpu, requestedModel)) that
encapsulates the while(true) loop and returns the validated
preferredInferenceApi (and selected model if needed), then replace both
duplicated blocks in the ollama and install-ollama branches by calling this
helper and handling its return value.
- Around line 1568-1570: The branch that handles models.length === 0 only logs a
message but relies on earlier implicit defaults for the cloud fallback;
explicitly set the cloud fallback variables when no local NIM models fit GPU
VRAM by assigning the cloud provider/endpoint and any flag that selects cloud
inference (e.g., set provider/endpoint to the NVIDIA production endpoint value
used as the default and ensure the cloud selection flag is true). Update the
models.length === 0 block to explicitly set the same provider/endpoint variables
used elsewhere (the variables that previously defaulted to "nvidia-prod"/NVIDIA)
so the fallback behavior is clear and not reliant on implicit earlier defaults.

In `@test/onboard.test.js`:
- Around line 174-177: The subprocess test starting with the "passes credential
names to openshell without embedding secret values in argv" it-block creates
temporary directories (tmpDir, fakeBin) but never removes them; wrap the test
body in a try/finally (or add an afterEach/fixture) and in the finally call
fs.rmSync(tmpDir, { recursive: true, force: true }) to ensure tmpDir and fakeBin
are removed, referencing tmpDir/fakeBin variables in test/onboard.test.js to
locate the code to change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e2c3775-838b-4bd9-84f5-170136252f4e

📥 Commits

Reviewing files that changed from the base of the PR and between 98a8da9 and 69adc37.

📒 Files selected for processing (8)
  • README.md
  • bin/lib/credentials.js
  • bin/lib/onboard.js
  • docs/inference/switch-inference-providers.md
  • docs/reference/inference-profiles.md
  • nemoclaw/src/commands/migration-state.test.ts
  • nemoclaw/src/commands/migration-state.ts
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (1)
  • nemoclaw/src/commands/migration-state.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemoclaw/src/commands/migration-state.test.ts
  • README.md
  • bin/lib/credentials.js

@kjw3 kjw3 force-pushed the feat/openai-anthropic-gemini-providers branch from 69adc37 to cca052e Compare March 22, 2026 16:06
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bin/lib/onboard.js (1)

1565-1604: ⚠️ Potential issue | 🟠 Major

nim-local fallback leaves model unset when no NIM models fit GPU.

If a user selects nim-local but no models fit their GPU VRAM (line 1568), the code logs a fallback message and breaks. However, model remains null from line 1350, causing downstream createSandbox and setupInference to receive invalid data.

Since this path is already gated by NEMOCLAW_EXPERIMENTAL=1, consider prompting the user to select a different provider instead of silently continuing with null values.

Suggested fix
       if (models.length === 0) {
-        console.log("  No NIM models fit your GPU VRAM. Falling back to cloud API.");
+        console.log("  No NIM models fit your GPU VRAM. Please choose a different provider.");
+        continue selectionLoop;
       } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1565 - 1604, When the "nim-local" branch
finds no compatible models (models.length === 0), the code currently logs a
fallback and breaks while leaving the shared variable model null, causing
downstream callers like createSandbox and setupInference to receive invalid
input; update this branch so that instead of just logging and breaking it
prompts the user to choose a different provider (or programmatically falls back
to a cloud provider) and sets provider and model to valid values before
continuing, or exits with an explicit error—modify the nim-local handling around
the models.length === 0 check to either re-run the provider selection flow (so
model gets assigned) or set provider to the cloud fallback and assign a sane
default to model, while preserving the NEMOCLAW_EXPERIMENTAL gate.
♻️ Duplicate comments (2)
bin/lib/onboard.js (2)

531-552: ⚠️ Potential issue | 🟠 Major

Non-interactive validation may fail due to credential resolution mismatch.

In non-interactive mode, lines 1460-1477 check process.env[credentialEnv] to verify credentials are present, but then validateOpenAiLikeSelection and similar functions resolve credentials via getCredential() (line 538). If getCredential() reads from ~/.nemoclaw/credentials.json rather than process.env, the validation probe will run without the API key even though the env var was provided.

Suggested fix
+function resolveCredentialValue(envName) {
+  return envName ? (process.env[envName] || getCredential(envName) || "") : "";
+}
+
 async function validateOpenAiLikeSelection(
   label,
   endpointUrl,
   model,
   credentialEnv = null,
   retryMessage = "Please choose a provider/model again."
 ) {
-  const apiKey = credentialEnv ? getCredential(credentialEnv) : "";
+  const apiKey = resolveCredentialValue(credentialEnv);

Apply the same pattern to validateAnthropicSelection, validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection, and validateCustomAnthropicSelection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 531 - 552, The validation functions
(validateOpenAiLikeSelection, validateAnthropicSelection,
validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection,
validateCustomAnthropicSelection) resolve credentials via getCredential(), which
can read from the credentials file and thus miss an API key provided in
process.env[credentialEnv]; update each function so when credentialEnv is passed
it first checks process.env[credentialEnv] for the API key and uses that value
if present, otherwise fall back to getCredential(credentialEnv), then pass that
resolved apiKey into probeOpenAiLikeEndpoint / probeAnthropicEndpoint calls to
ensure non-interactive validation uses environment-provided secrets.

436-445: ⚠️ Potential issue | 🟠 Major

Add timeouts to validation curl commands.

These synchronous spawnSync(...curl...) calls lack --connect-timeout and --max-time. A blackholed hostname or stalled endpoint will hang onboarding indefinitely instead of reaching the retry/reprompt flow.

Suggested fix
       const cmd = [
-        "curl -sS",
+        "curl -sS --connect-timeout 5 --max-time 30",
         `-o ${shellQuote(bodyFile)}`,

Apply the same fix to the other curl invocations at lines 480, 642, 695, and 730.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 436 - 445, The curl invocations built into
the cmd array (used with spawnSync in the onboarding flow) lack timeouts and can
hang; update the cmd construction to include safe timeout flags (e.g., add
'--connect-timeout 5' and '--max-time 30' as separate elements in the array
before the URL) while preserving existing headers and shellQuote usage (refer to
the cmd, probe.body, probe.url, shellQuote, and spawnSync occurrences); apply
the same change to the other curl command constructions/spawnSync calls in this
file (the other cmd/spawnSync blocks) so every synchronous curl includes the
connect and overall timeouts.
🧹 Nitpick comments (2)
docs/reference/inference-profiles.md (1)

23-25: Split compound sentence for clarity.

Line 24 is a long compound sentence. Consider splitting for readability and to follow the one-sentence-per-line guideline.

Suggested fix
 NemoClaw configures inference through the OpenShell gateway.
-The agent inside the sandbox talks to `inference.local`, and OpenShell routes that traffic to the provider you selected during onboarding.
+The agent inside the sandbox talks to `inference.local`.
+OpenShell routes that traffic to the provider you selected during onboarding.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/inference-profiles.md` around lines 23 - 25, The sentence "The
agent inside the sandbox talks to `inference.local`, and OpenShell routes that
traffic to the provider you selected during onboarding." is a long compound
sentence — split it into two clearer sentences; for example, keep "The agent
inside the sandbox talks to `inference.local`." as one sentence and make
"OpenShell routes that traffic to the provider you selected during onboarding."
a separate sentence so the document follows the one-sentence-per-line guideline
and improves readability.
docs/inference/switch-inference-providers.md (1)

90-95: Use a MyST admonition for the Notes section.

The "Notes" section uses a plain H2 header with bullet points. Consider using a :::{note} admonition for callout content, which better signals informational context to readers. As per coding guidelines, "Use MyST admonitions (:::{tip}, :::{note}, :::{warning}) for callouts, not bold text or blockquotes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/inference/switch-inference-providers.md` around lines 90 - 95, Replace
the plain H2 "## Notes" and its bullet list with a MyST admonition block by
converting the header into a `:::{note}` block, placing the four bullet lines
("- The host keeps provider credentials.", "- The sandbox continues to use
`inference.local`.", "- Runtime switching changes the OpenShell route.", "- It
does not rewrite your stored credentials.") inside the admonition, and closing
it with `:::` so the content becomes a proper MyST note callout instead of a
normal section.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/credentials.js`:
- Around line 110-124: The outer Promise in prompt() only accepts resolve so the
silent branch's catch never settles callers; update the Promise constructor to
(resolve, reject) and in the promptSecret().catch handler call reject(err) for
both error paths instead of throwing (and for the SIGINT path call reject(err)
then process.kill(process.pid, "SIGINT") to ensure the Promise is settled before
exiting). This change should be made in the prompt function and touches the
promptSecret error handling block so callers awaiting prompt() won't hang.

In `@bin/lib/onboard.js`:
- Around line 1264-1275: The envArgs array currently includes sensitive values
via formatEnvAssignment for NVIDIA_API_KEY, DISCORD_BOT_TOKEN, and
SLACK_BOT_TOKEN which causes secret leakage in command-line arguments; change
the onboarding flow so these keys are not appended to envArgs and instead inject
them into the spawned process environment (opts.env) or send them over stdin/a
securely permissioned temp file when calling the sandbox create command;
specifically, stop pushing formatEnvAssignment("NVIDIA_API_KEY", ...),
formatEnvAssignment("DISCORD_BOT_TOKEN", ...), and
formatEnvAssignment("SLACK_BOT_TOKEN", ...) into envArgs and set process spawn
options.env (or write/read from stdin/temp file) so the secrets are available to
the child process without appearing in the argv.

---

Outside diff comments:
In `@bin/lib/onboard.js`:
- Around line 1565-1604: When the "nim-local" branch finds no compatible models
(models.length === 0), the code currently logs a fallback and breaks while
leaving the shared variable model null, causing downstream callers like
createSandbox and setupInference to receive invalid input; update this branch so
that instead of just logging and breaking it prompts the user to choose a
different provider (or programmatically falls back to a cloud provider) and sets
provider and model to valid values before continuing, or exits with an explicit
error—modify the nim-local handling around the models.length === 0 check to
either re-run the provider selection flow (so model gets assigned) or set
provider to the cloud fallback and assign a sane default to model, while
preserving the NEMOCLAW_EXPERIMENTAL gate.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 531-552: The validation functions (validateOpenAiLikeSelection,
validateAnthropicSelection, validateAnthropicSelectionWithRetryMessage,
validateCustomOpenAiLikeSelection, validateCustomAnthropicSelection) resolve
credentials via getCredential(), which can read from the credentials file and
thus miss an API key provided in process.env[credentialEnv]; update each
function so when credentialEnv is passed it first checks
process.env[credentialEnv] for the API key and uses that value if present,
otherwise fall back to getCredential(credentialEnv), then pass that resolved
apiKey into probeOpenAiLikeEndpoint / probeAnthropicEndpoint calls to ensure
non-interactive validation uses environment-provided secrets.
- Around line 436-445: The curl invocations built into the cmd array (used with
spawnSync in the onboarding flow) lack timeouts and can hang; update the cmd
construction to include safe timeout flags (e.g., add '--connect-timeout 5' and
'--max-time 30' as separate elements in the array before the URL) while
preserving existing headers and shellQuote usage (refer to the cmd, probe.body,
probe.url, shellQuote, and spawnSync occurrences); apply the same change to the
other curl command constructions/spawnSync calls in this file (the other
cmd/spawnSync blocks) so every synchronous curl includes the connect and overall
timeouts.

---

Nitpick comments:
In `@docs/inference/switch-inference-providers.md`:
- Around line 90-95: Replace the plain H2 "## Notes" and its bullet list with a
MyST admonition block by converting the header into a `:::{note}` block, placing
the four bullet lines ("- The host keeps provider credentials.", "- The sandbox
continues to use `inference.local`.", "- Runtime switching changes the OpenShell
route.", "- It does not rewrite your stored credentials.") inside the
admonition, and closing it with `:::` so the content becomes a proper MyST note
callout instead of a normal section.

In `@docs/reference/inference-profiles.md`:
- Around line 23-25: The sentence "The agent inside the sandbox talks to
`inference.local`, and OpenShell routes that traffic to the provider you
selected during onboarding." is a long compound sentence — split it into two
clearer sentences; for example, keep "The agent inside the sandbox talks to
`inference.local`." as one sentence and make "OpenShell routes that traffic to
the provider you selected during onboarding." a separate sentence so the
document follows the one-sentence-per-line guideline and improves readability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 12d15cc3-576c-4c90-99bf-e5ba1679442a

📥 Commits

Reviewing files that changed from the base of the PR and between 69adc37 and cca052e.

📒 Files selected for processing (18)
  • Dockerfile
  • README.md
  • bin/lib/credentials.js
  • bin/lib/inference-config.js
  • bin/lib/local-inference.js
  • bin/lib/nim.js
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • docs/about/how-it-works.md
  • docs/inference/switch-inference-providers.md
  • docs/reference/commands.md
  • docs/reference/inference-profiles.md
  • docs/reference/troubleshooting.md
  • test/e2e/test-full-e2e.sh
  • test/inference-config.test.js
  • test/local-inference.test.js
  • test/onboard-selection.test.js
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (3)
  • docs/reference/troubleshooting.md
  • docs/reference/commands.md
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (6)
  • test/local-inference.test.js
  • docs/about/how-it-works.md
  • test/inference-config.test.js
  • bin/lib/nim.js
  • README.md
  • test/onboard-selection.test.js

Comment on lines +110 to +124
```js
function prompt(question, opts = {}) {
  return new Promise((resolve) => {
    const silent = opts.secret === true && process.stdin.isTTY && process.stderr.isTTY;
    if (silent) {
      promptSecret(question)
        .then(resolve)
        .catch((err) => {
          if (err && err.code === "SIGINT") {
            process.kill(process.pid, "SIGINT");
            return;
          }
          throw err;
        });
      return;
    }
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify unresolved Promise paths in prompt() silent branch (read-only).
awk 'NR>=110 && NR<=124 {printf "%4d  %s\n", NR, $0}' bin/lib/credentials.js
rg -nP 'new Promise\(\(resolve\)\s*=>\s*\{' bin/lib/credentials.js -n -A12 -B2
```

Repository: NVIDIA/NemoClaw

Length of output: 1081


🏁 Script executed:

```bash
sed -n '110,200p' bin/lib/credentials.js | cat -n
```

Repository: NVIDIA/NemoClaw

Length of output: 3624


Settle the outer prompt() Promise on silent-branch failures.

The Promise at line 111 only accepts resolve, but the silent branch (lines 114–123) has error paths that never settle it. When promptSecret() rejects with SIGINT (line 118), process.kill() executes and returns, leaving the outer Promise unresolved. For other errors (line 121), throw creates an unhandled rejection while the outer Promise still hangs. Callers awaiting prompt() can hang indefinitely.

Add reject to the Promise constructor and call reject(err) in both error cases:

Suggested fix
```diff
-function prompt(question, opts = {}) {
-  return new Promise((resolve) => {
+function prompt(question, opts = {}) {
+  return new Promise((resolve, reject) => {
     const silent = opts.secret === true && process.stdin.isTTY && process.stderr.isTTY;
     if (silent) {
       promptSecret(question)
         .then(resolve)
         .catch((err) => {
           if (err && err.code === "SIGINT") {
             process.kill(process.pid, "SIGINT");
-            return;
+            reject(err);
+            return;
           }
-          throw err;
+          reject(err);
         });
       return;
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 110 - 124, The outer Promise in prompt()
only accepts resolve so the silent branch's catch never settles callers; update
the Promise constructor to (resolve, reject) and in the promptSecret().catch
handler call reject(err) for both error paths instead of throwing (and for the
SIGINT path call reject(err) then process.kill(process.pid, "SIGINT") to ensure
the Promise is settled before exiting). This change should be made in the prompt
function and touches the promptSecret error handling block so callers awaiting
prompt() won't hang.

Comment on lines +1264 to 1275
```diff
   const envArgs = [formatEnvAssignment("CHAT_UI_URL", chatUiUrl)];
   if (process.env.NVIDIA_API_KEY) {
-    envArgs.push(`NVIDIA_API_KEY=${shellQuote(process.env.NVIDIA_API_KEY)}`);
+    envArgs.push(formatEnvAssignment("NVIDIA_API_KEY", process.env.NVIDIA_API_KEY));
   }
   const discordToken = getCredential("DISCORD_BOT_TOKEN") || process.env.DISCORD_BOT_TOKEN;
   if (discordToken) {
-    envArgs.push(`DISCORD_BOT_TOKEN=${shellQuote(discordToken)}`);
+    envArgs.push(formatEnvAssignment("DISCORD_BOT_TOKEN", discordToken));
   }
   const slackToken = getCredential("SLACK_BOT_TOKEN") || process.env.SLACK_BOT_TOKEN;
   if (slackToken) {
-    envArgs.push(`SLACK_BOT_TOKEN=${shellQuote(slackToken)}`);
+    envArgs.push(formatEnvAssignment("SLACK_BOT_TOKEN", slackToken));
   }
```

⚠️ Potential issue | 🟠 Major

Secrets still leak through sandbox-create command arguments.

NVIDIA_API_KEY, DISCORD_BOT_TOKEN, and SLACK_BOT_TOKEN are serialized into envArgs and passed as command-line arguments to openshell sandbox create ... -- env NVIDIA_API_KEY=.... These values are visible in process listings (ps aux).

The PR objectives state "provider secrets are not passed in openshell argv during onboarding/provider creation," but this code path still exposes them.

Consider passing secrets via the spawned process's environment (opts.env) or via stdin/temporary file with restrictive permissions instead of command-line arguments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1264 - 1275, The envArgs array currently
includes sensitive values via formatEnvAssignment for NVIDIA_API_KEY,
DISCORD_BOT_TOKEN, and SLACK_BOT_TOKEN which causes secret leakage in
command-line arguments; change the onboarding flow so these keys are not
appended to envArgs and instead inject them into the spawned process environment
(opts.env) or send them over stdin/a securely permissioned temp file when
calling the sandbox create command; specifically, stop pushing
formatEnvAssignment("NVIDIA_API_KEY", ...),
formatEnvAssignment("DISCORD_BOT_TOKEN", ...), and
formatEnvAssignment("SLACK_BOT_TOKEN", ...) into envArgs and set process spawn
options.env (or write/read from stdin/temp file) so the secrets are available to
the child process without appearing in the argv.

@andy-ratsirarson

andy-ratsirarson commented Mar 22, 2026

> The sandbox image now bakes the selected model/provider identity into immutable ~/.openclaw/openclaw.json during build instead of relying on post-create config mutation.

@kjw3 Does baking the model into openclaw.json matter, given that the gateway rewrites the model in flight? Since `openshell inference set` handles routing regardless of what openclaw.json says, even random model names get rewritten.

Since `.nemoclaw/config.json` actually has the inference model, why not make it override openclaw.json and be controlled by the user on the host via `~/.nemoclaw/config.json`?

In my current sandbox:

~/.nemoclaw/config.json
```json
{
  "endpointType": "custom",
  "endpointUrl": "https://inference.local/v1",
  "ncpPartner": null,
  "model": "gemini-2.5-flash",
  "profile": "inference-local",
  "credentialEnv": "OPENAI_API_KEY",
  "provider": "custom",
  "providerLabel": "Custom Provider",
  "onboardedAt": "2026-03-22T16:48:25.562Z"
}
```


Labels

enhancement: provider Use this label to identify requests to add a new AI provider to NemoClaw.
