
feat: expand provider onboarding and validation#648

Open
kjw3 wants to merge 4 commits into main from feat/openai-anthropic-gemini-providers

Conversation


@kjw3 kjw3 commented Mar 22, 2026

Summary

Expand nemoclaw onboard from an NVIDIA-only path into provider-first onboarding with live validation for hosted and local inference runtimes. This makes the sandbox model identity match the selected provider/model, adds compatible endpoint support, and updates the docs to reflect the current host-routed inference architecture.

Gemini note: the compatible-endpoint path here was informed by the debugging thread in #368. Credit to @andy-ratsirarson for surfacing the supportsStore: false direction that unblocked Gemini, and to @alannguyen1 for adjacent investigation that helped narrow the fix path.

Related Issue

Closes #24
Closes #567
Closes #628

Related:

Changes

  • Add provider-first onboarding for NVIDIA Endpoints, OpenAI, Other OpenAI-compatible endpoint, Anthropic, Other Anthropic-compatible endpoint, Google Gemini, and Local Ollama, with experimental local NIM/vLLM still gated separately.
  • Move provider/model selection before sandbox creation and bake the selected OpenClaw model config into the sandbox image as an immutable default.
  • Validate the selected provider before leaving onboarding step 2:
    • OpenAI-compatible runtimes probe /responses first and fall back to /chat/completions
    • Anthropic-compatible runtimes validate via /v1/messages
  • Add curated model pickers plus Other... flows with retry logic:
    • NVIDIA Endpoints manual models validate against /v1/models
    • compatible proxy flows validate by live inference instead of depending on /models
    • manual entry reprompts on bad syntax/model failures and returns to provider selection for auth/base URL/transport failures
  • Fix local Ollama onboarding to offer starter models when none are installed, pull/warm the selected model, and validate it before continuing.
  • Route sandboxed OpenClaw through inference.local with the correct selected model identity so in-sandbox reporting matches the active provider instead of a baked Nemotron default.
  • Avoid passing provider secrets in openshell argv during onboarding/provider creation.
  • Refresh README and inference docs for the new multi-provider developer workflow.
  • Add/refresh onboarding and inference tests, and align the maintained package migration-state tests with the recent hardened env checks from main.
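The step-2 validation order described in the bullets above can be sketched as a small helper. The names here (validationPlan, probeWithFallback) are illustrative, not the actual helpers in bin/lib/onboard.js:

```javascript
// Hypothetical sketch of the provider validation order: OpenAI-compatible
// runtimes try /responses first and fall back to /chat/completions, while
// Anthropic-compatible runtimes validate directly against /v1/messages.
function validationPlan(kind) {
  if (kind === "openai-compatible") {
    return ["/responses", "/chat/completions"];
  }
  if (kind === "anthropic-compatible") {
    return ["/v1/messages"];
  }
  throw new Error(`unknown runtime kind: ${kind}`);
}

function probeWithFallback(kind, probe) {
  // probe(path) returns true when the endpoint answers a live request.
  for (const path of validationPlan(kind)) {
    if (probe(path)) return path;
  }
  return null; // nothing validated; onboarding would reprompt here
}
```

Validating by live inference rather than a /models listing matters for proxies that do not implement the discovery endpoint, which is why the compatible flows probe the inference paths directly.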

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • make check passes.
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Additional validation:

  • cd nemoclaw && npm run build
  • node --test test/onboard-selection.test.js test/onboard.test.js test/inference-config.test.js
  • Manual validation for non-experimental providers on the feature branch, including OpenAI, Anthropic, Google Gemini, NVIDIA Endpoints, compatible proxy flows, and Ollama

Checklist

General

Code Changes

  • make format applied (TypeScript and Python).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide. Try running the update-docs agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Summary by CodeRabbit

  • New Features

    • Multi-provider inference support enabling selection from NVIDIA Endpoints, OpenAI, Anthropic, Google Gemini, and compatible endpoints.
    • Enhanced onboarding flow with provider validation and configuration.
    • Improved secure credential input handling for sensitive data.
    • Local Ollama support with improved model selection and validation.
  • Documentation

    • Updated guides for provider selection, inference routing, and configuration across supported providers.


coderabbitai bot commented Mar 22, 2026

📝 Walkthrough

Walkthrough

The PR introduces multi-provider inference support, replacing hardcoded NVIDIA defaults with provider-agnostic routing. Users now select from supported providers (NVIDIA Endpoints, OpenAI, Anthropic, Gemini, Ollama) during onboarding. Provider configuration is dynamically injected into the sandbox image at build time, credentials are securely prompted via TTY, and the OpenShell gateway routes inference requests to the selected provider endpoint.

Changes

Cohort / File(s) Summary
Docker Build & Provider Injection
Dockerfile
Added build arguments for provider selection (NEMOCLAW_PROVIDER_KEY, NEMOCLAW_PRIMARY_MODEL_REF, etc.). Updated openclaw.json generation to dynamically build a providers map and embed the selected provider/model instead of hardcoded NVIDIA defaults.
Credentials & Secret Handling
bin/lib/credentials.js
Introduced promptSecret() for TTY-based secret input with backspace support and SIGINT handling. Extended prompt() to route secret inputs through the new handler when opts.secret is true, replacing non-secret credential prompts.
Inference Configuration
bin/lib/inference-config.js
Extended provider selection config to handle multiple providers: added nvidia-prod and custom endpoint mappings for OpenAI, Anthropic, Gemini, and compatible endpoints. Each provider specifies credential environment variable and default model.
Local Inference Validation
bin/lib/local-inference.js
Added provider-specific validation URL resolution (getLocalProviderValidationBaseUrl), Ollama bootstrap model selection logic (getBootstrapOllamaModelOptions), and updated model defaults to use bootstrap options when no models exist.
Onboarding Orchestration
bin/lib/onboard.js
Restructured onboarding from streaming sandbox creation to argument-based OpenShell execution. Added OpenShell CLI discovery, gateway/sandbox reconciliation, provider/model validation via endpoint probing, Dockerfile patching with provider/model injection, and inference route setup via OpenShell commands. Significantly expanded provider catalog and retry/validation flows.
NIM Container Management
bin/lib/nim.js
Refactored NIM operations to use explicit container names. Added *ByName variants (startNimContainerByName, stopNimContainerByName, nimStatusByName); existing functions now delegate to these.
CLI & Registry Updates
bin/nemoclaw.js
Updated sandbox status and destroy flows to resolve NIM container name from sandbox registry metadata when available, falling back to derived names if absent.
Documentation Updates
README.md, docs/about/how-it-works.md, docs/inference/switch-inference-providers.md, docs/reference/commands.md, docs/reference/inference-profiles.md, docs/reference/troubleshooting.md
Rewrote inference guidance from NVIDIA-centric to provider-agnostic. Added supported provider matrix, provider selection/validation flow explanation, credential storage location (~/.nemoclaw/credentials.json), and runtime switching via OpenShell routing.
Unit Tests
test/inference-config.test.js, test/local-inference.test.js
Updated provider assertion from nvidia-nim to nvidia-prod, added test for Anthropic-compatible endpoint config, and corrected Ollama model list behavior when empty.
Onboarding Selection Tests
test/onboard-selection.test.js
Comprehensive test suite for new onboarding flows: provider/model selection, endpoint validation via mocked curl, Ollama bootstrap/warmup, model-specific preferredInferenceApi assertions, retry/reprompt scenarios, and validation failure handling.
Onboarding Helper Tests
test/onboard.test.js
Added tests for Dockerfile patching (patchStagedDockerfile), sandbox inference config mapping (getSandboxInferenceConfig), stale sandbox cleanup (pruneStaleSandboxEntry), and subprocess-driven credential/provider updates via OpenShell CLI.
End-to-End Tests
test/e2e/test-full-e2e.sh
Refactored command success checks to explicit if/then/else blocks, updated provider assertion from nvidia-nim to nvidia-prod, and improved test readability with shellcheck source directives.
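The masked-input behavior that promptSecret() is described as adding (backspace support, SIGINT handling) can be sketched as a pure key handler. This is an assumption about its shape; the real implementation in bin/lib/credentials.js reads raw TTY keystrokes and may differ in detail:

```javascript
// Illustrative per-keystroke reducer for a masked secret prompt.
// Returns the updated buffer plus flags for completion and abort.
function applySecretKey(buf, key) {
  if (key === "\u0003") {
    // Ctrl-C: abort, mirroring SIGINT handling
    return { buf, done: true, aborted: true };
  }
  if (key === "\r" || key === "\n") {
    // Enter: finish input
    return { buf, done: true, aborted: false };
  }
  if (key === "\u007f" || key === "\b") {
    // Backspace: drop the last character
    return { buf: buf.slice(0, -1), done: false, aborted: false };
  }
  // Any other key: append (the terminal would echo "*" instead)
  return { buf: buf + key, done: false, aborted: false };
}
```

A real prompt would put the TTY into raw mode, echo a mask character per keystroke, and restore the terminal state on exit.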

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as nemoclaw CLI
    participant Onboard as Onboarding<br/>Handler
    participant Provider as Provider<br/>Validation
    participant Docker as Docker<br/>Build
    participant OpenShell as OpenShell<br/>Gateway
    participant Sandbox as Sandbox<br/>Container
    participant Registry as Registry
    
    User->>CLI: nemoclaw onboard
    CLI->>Onboard: Initialize onboarding
    
    Onboard->>User: Prompt provider selection
    User-->>Onboard: Select provider (e.g., OpenAI)
    
    Onboard->>Provider: Validate provider endpoint
    Provider->>Provider: Probe /chat/completions or<br/>provider-specific endpoint
    Provider-->>Onboard: Endpoint available
    
    Onboard->>User: Prompt for model/credentials
    User-->>Onboard: Provide model + credential
    
    Onboard->>Docker: Patch staged Dockerfile<br/>(inject provider/model args)
    Docker-->>Onboard: Dockerfile patched
    
    Onboard->>Docker: Build sandbox image<br/>(NEMOCLAW_PROVIDER_KEY,<br/>NEMOCLAW_PRIMARY_MODEL_REF, etc.)
    Docker->>Docker: Generate openclaw.json<br/>with provider map
    Docker-->>Onboard: Image built
    
    Onboard->>OpenShell: Create sandbox from image
    OpenShell->>Sandbox: Launch container
    Sandbox-->>OpenShell: Ready
    
    Onboard->>OpenShell: Configure inference route<br/>(upsert provider + model)
    OpenShell-->>Onboard: Route configured
    
    Onboard->>Provider: Verify inference route
    Provider->>Sandbox: Test inference.local
    Sandbox->>OpenShell: Forward to provider endpoint
    OpenShell->>Provider: Call actual provider
    Provider-->>Sandbox: Response
    Sandbox-->>Onboard: Verified
    
    Onboard->>Registry: Store sandbox metadata<br/>(provider, model, nimContainer)
    Registry-->>Onboard: Metadata saved
    
    Onboard-->>User: Onboarding complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hops of joy through provider lands!
No more hardcoded NVIDIA brands—
OpenAI, Anthropic, Ollama too,
Each tunnel routed, shiny and new.
Credentials safe in TTY hands,
While sandbox dreams in Docker strands! 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 12.94%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'feat: expand provider onboarding and validation' accurately reflects the main change: expanding onboarding to support multiple providers and adding validation logic for provider selection.
  • Linked Issues Check — ✅ Passed: The PR comprehensively addresses all three linked issues: #24 (in-sandbox model identity now reflects the configured provider), #567 (onboarding now allows provider choice before any API key prompt), and #628 (Local Ollama selection is recorded in the sandbox openclaw.json).
  • Out of Scope Changes Check — ✅ Passed: All changes are scoped to the provider onboarding and validation objectives. Docker build arguments, credential handling, inference routing, onboarding flows, documentation updates, and related tests are all directly aligned with the PR goals.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Mar 22, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/NemoClaw/pr-preview/pr-648/

Built to branch gh-pages at 2026-03-22 16:06 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bin/lib/credentials.js (1)

172-179: ⚠️ Potential issue | 🔴 Critical

execSync is used but not imported — runtime ReferenceError.

The functions isRepoPrivate() (line 174) and ensureGithubToken() (line 189) both call execSync, but this function is not imported from child_process. This will cause a ReferenceError when either function is executed.

Proposed fix: restore the execSync import
 const fs = require("fs");
 const path = require("path");
 const readline = require("readline");
+const { execSync } = require("child_process");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 172 - 179, The file uses execSync in
functions isRepoPrivate and ensureGithubToken but never imports it; add an
import for execSync from the child_process module at the top of the file so both
functions can call execSync without causing a ReferenceError (locate the
top-level imports and add the execSync import alongside them so isRepoPrivate
and ensureGithubToken work correctly).
🧹 Nitpick comments (2)
docs/inference/switch-inference-providers.md (1)

94-94: Keep one sentence per source line here.

This bullet packs two sentences onto one source line. Split it so diffs stay readable.

As per coding guidelines, “One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/inference/switch-inference-providers.md` at line 94, The bullet "Runtime
switching changes the OpenShell route. It does not rewrite your stored
credentials." contains two sentences on one source line; split it into two
source lines so each sentence is on its own line (e.g., place "Runtime switching
changes the OpenShell route." and "It does not rewrite your stored credentials."
on separate lines) to satisfy the "one sentence per source line" guideline and
keep diffs readable.
docs/reference/inference-profiles.md (1)

6-6: Use NemoClaw casing in the keywords entries.

Both new keywords use nemoclaw, which breaks the project word list and leaks inconsistent casing into search metadata.

As per coding guidelines, “NemoClaw, OpenClaw, and OpenShell must use correct casing.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/inference-profiles.md` at line 6, The keywords array uses
incorrect lowercase "nemoclaw" tokens which violate the casing rule; update the
two entries in the keywords array to use the exact "NemoClaw" casing (e.g.,
change "nemoclaw inference profiles" -> "NemoClaw inference profiles" and
"nemoclaw provider routing" -> "NemoClaw provider routing") so the keywords list
and search metadata follow the required NemoClaw capitalization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1347-1356: The setupNim(gpu) function currently assumes
sandboxName exists which causes a ReferenceError; move any code that starts or
registers a NIM container (calls to nim.startNimContainer(...) and
registry.updateSandbox(...)) out of setupNim and defer it until after
createSandbox(...) returns in onboard(), so that you call
nim.startNimContainer(sandboxName, model) and then
registry.updateSandbox(sandboxName, { nimContainer }) using the real
sandboxName; keep setupNim responsible only for selecting
model/provider/preferredInferenceApi and return a flag or the model info (e.g.,
nimPending or model) to onboard(), then perform the actual container creation
and sandbox update there.

In `@docs/reference/inference-profiles.md`:
- Around line 41-50: The docs incorrectly mark Local Ollama as experimental and
omit it from the non-experimental provider table; update
docs/reference/inference-profiles.md to add a row for "Local Ollama" in the
provider table (e.g., Provider: Local Ollama, Endpoint Type: OpenAI-compatible
or appropriate type, Notes: local Ollama runtime) and remove or adjust the text
that wraps Local Ollama behind NEMOCLAW_EXPERIMENTAL (the gating mentioned later
in the file) so Local Ollama is listed alongside the other non-experimental
providers (ensure references to NEMOCLAW_EXPERIMENTAL only apply to local NVIDIA
NIM and local vLLM as intended).

In `@README.md`:
- Around line 184-193: The README's "Supported non-experimental onboarding
paths" table omits "Local Ollama" while the surrounding text still labels Ollama
as experimental; update the table row entries to include a "Local Ollama"
provider (with a short note like "Local Ollama — local model serving via
Ollama") or alternatively narrow the heading to explicitly state which providers
are listed (e.g., "Supported non-experimental onboarding paths (cloud
providers)"). Ensure the change aligns with the PR that moves Ollama into the
standard flow and update any adjacent note that still calls Ollama experimental
so wording is consistent.

In `@test/onboard.test.js`:
- Around line 196-215: The test fixture uses the old provider id "nvidia-nim" in
both the mocked inference get output and the setup call; update the mock
response lines that include "Provider: nvidia-nim" and the setupInference
invocation that passes "nvidia-nim" to instead use "nvidia-prod" so the test
aligns with the rest of the PR (look for the mocked block that returns "Gateway
inference" and the call to setupInference("test-box", ..., "nvidia-nim") and
replace the provider string in both places).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 99f3efbe-a2ad-41d0-85ad-76ed5ea3acbd

📥 Commits

Reviewing files that changed from the base of the PR and between b6e0fc1 and 32c9436.

📒 Files selected for processing (15)
  • Dockerfile
  • README.md
  • bin/lib/credentials.js
  • bin/lib/inference-config.js
  • bin/lib/local-inference.js
  • bin/lib/onboard.js
  • docs/about/how-it-works.md
  • docs/inference/switch-inference-providers.md
  • docs/reference/commands.md
  • docs/reference/inference-profiles.md
  • docs/reference/troubleshooting.md
  • nemoclaw/src/commands/migration-state.test.ts
  • test/inference-config.test.js
  • test/onboard-selection.test.js
  • test/onboard.test.js

@kjw3 added the enhancement: provider label on Mar 22, 2026
@kjw3 kjw3 requested a review from krmurph March 22, 2026 15:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bin/lib/onboard.js (1)

1220-1233: ⚠️ Potential issue | 🟠 Major

Reusing an existing sandbox can leave the baked model identity stale.

By this point the gateway has already been switched to the newly selected provider/model, but this early return skips Dockerfile patching and rebuild. If the user changed either value, the existing sandbox keeps the old baked OpenClaw identity while the host route now points somewhere else.

Also applies to: 1988-1990

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1220 - 1233, The early return after the user
chooses to "Keep existing sandbox" (`if (recreate.toLowerCase() !== "y") { ...
return sandboxName; }`) causes Dockerfile patching and rebuild to be skipped,
leaving a stale baked model identity; instead, change the flow so that choosing
to keep the existing sandbox does NOT return early — run the subsequent
Dockerfile patching and rebuild steps (the code that updates the baked OpenClaw
identity) unless you can detect that neither provider nor model changed; modify
the branch around liveExists / prompt (symbols: liveExists, isNonInteractive(),
prompt(), sandboxName) to only short-circuit when it is safe (e.g.,
provider/model unchanged), and apply the same fix at the other occurrence
mentioned (the logic around lines 1988-1990) so both code paths perform
Dockerfile patching/rebuild when necessary.
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

1779-1780: ⚠️ Potential issue | 🟠 Major

Don't write sandbox metadata under GATEWAY_NAME.

setupInference() is invoked with "nemoclaw" before createSandbox() returns the real sandbox name, but it still updates the registry with model/provider. That either creates a stray nemoclaw entry or leaves the real sandbox showing unknown in nemoclaw list/status.

Suggested fix
-  await setupInference(GATEWAY_NAME, model, provider, endpointUrl, credentialEnv);
+  await setupInference(model, provider, endpointUrl, credentialEnv);
   const sandboxName = await createSandbox(gpu, model, provider, preferredInferenceApi);
-  if (nimContainer) {
-    registry.updateSandbox(sandboxName, { nimContainer });
-  }
+  registry.updateSandbox(sandboxName, {
+    model,
+    provider,
+    ...(nimContainer ? { nimContainer } : {}),
+  });

and remove the registry write from setupInference(), since that step is gateway-scoped, not sandbox-scoped.

Also applies to: 1986-1993

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1779 - 1780, The code writes sandbox
metadata under the GATEWAY_NAME because setupInference() (called with
"nemoclaw") calls registry.updateSandbox(sandboxName, { model, provider })
before the real sandbox name is known; remove the registry.updateSandbox call
from setupInference() so gateway-scoped setup doesn't write sandbox-scoped
metadata, and instead perform registry.updateSandbox(sandboxName, { model,
provider }) only after createSandbox() returns the actual sandboxName (or in the
function that has the real sandboxName), keeping verifyInferenceRoute(...) and
provider/model handling in setupInference() but not the registry write.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 531-552: The non-interactive validation path is using process.env
directly and can miss credentials; update validateOpenAiLikeSelection to resolve
credentials via getCredential(credentialEnv) (not raw process.env lookup) before
calling probeOpenAiLikeEndpoint so the probe gets the same credential resolution
as interactive flows (use the same pattern for other validators mentioned);
specifically, ensure the apiKey is derived via getCredential when credentialEnv
is provided, pass that apiKey into probeOpenAiLikeEndpoint, and replicate this
fix for the other functions that call probeOpenAiLikeEndpoint/Anthropic probes
so non-interactive CI runs use the resolved credential.
- Around line 1265-1275: The code is appending secrets (e.g., NVIDIA_API_KEY,
DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) into envArgs which get serialized into the
openshell sandbox create command and exposed on process argv; instead stop
adding secret values via formatEnvAssignment into envArgs and pass them via a
secure channel: set these values on the spawned process environment
(child_process.spawn/exec options.env) or deliver via stdin/temporary file with
restrictive permissions before calling runOpenshell, updating the logic around
envArgs, formatEnvAssignment and the call site that invokes runOpenshell so
non-secret flags remain in envArgs while actual token values are injected into
the child process env (or a secure file) rather than the command line.
- Around line 436-445: The curl invocations built into the cmd array (the code
that constructs cmd using probe.body and probe.url and then calls
spawnSync("bash", ["-c", cmd])) need connection and overall timeouts added so a
stalled endpoint doesn't hang onboarding; modify the cmd construction to include
--connect-timeout 5 and --max-time 15 (or use configurable constants) before the
URL and data args, and apply the same change to the other identical curl
builders (the other cmd/spawnSync occurrences) so all synchronous probes time
out reliably.
- Around line 1351-1355: When the nim-local fallback path triggers inside
selectionLoop it only logs and breaks without choosing cloud inference, leaving
model and preferredInferenceApi unset; change that flow to assign the cloud
provider values before breaking: set provider, endpointUrl, credentialEnv from
the remote/cloud configuration (e.g., REMOTE_PROVIDER_CONFIG.build or the
appropriate cloud entry), and set preferredInferenceApi to the cloud inference
API name so later inference setup/Dockerfile patching has valid routing data;
update the branch that currently only logs the fallback to perform these
assignments (affecting variables provider, endpointUrl, credentialEnv,
preferredInferenceApi and ensuring model is set/cleared appropriately) before
exiting selectionLoop.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1468b6ba-e6ed-433c-9846-91527e84a444

📥 Commits

Reviewing files that changed from the base of the PR and between 32c9436 and 98a8da9.

📒 Files selected for processing (6)
  • README.md
  • bin/lib/nim.js
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/e2e/test-full-e2e.sh
  • test/local-inference.test.js
✅ Files skipped from review due to trivial changes (1)
  • README.md

Comment on lines +436 to +445
const cmd = [
  "curl -sS",
  `-o ${shellQuote(bodyFile)}`,
  "-w '%{http_code}'",
  "-H 'Content-Type: application/json'",
  ...(apiKey ? ['-H "Authorization: Bearer $NEMOCLAW_PROBE_API_KEY"'] : []),
  `-d ${shellQuote(probe.body)}`,
  shellQuote(probe.url),
].join(" ");
const result = spawnSync("bash", ["-c", cmd], {

⚠️ Potential issue | 🟠 Major

Add time limits to the live validation curls.

All of these probes are synchronous spawnSync(...curl...) calls without --connect-timeout or --max-time. A blackholed hostname or stalled endpoint will hang onboarding indefinitely instead of reaching the retry/reprompt flow.

Minimal hardening
-        "curl -sS",
+        "curl -sS --connect-timeout 5 --max-time 15",

Also applies to: 479-493, 641-649, 694-701, 729-737


Comment on lines 1265 to 1275
 if (process.env.NVIDIA_API_KEY) {
-  envArgs.push(`NVIDIA_API_KEY=${shellQuote(process.env.NVIDIA_API_KEY)}`);
+  envArgs.push(formatEnvAssignment("NVIDIA_API_KEY", process.env.NVIDIA_API_KEY));
 }
 const discordToken = getCredential("DISCORD_BOT_TOKEN") || process.env.DISCORD_BOT_TOKEN;
 if (discordToken) {
-  envArgs.push(`DISCORD_BOT_TOKEN=${shellQuote(discordToken)}`);
+  envArgs.push(formatEnvAssignment("DISCORD_BOT_TOKEN", discordToken));
 }
 const slackToken = getCredential("SLACK_BOT_TOKEN") || process.env.SLACK_BOT_TOKEN;
 if (slackToken) {
-  envArgs.push(`SLACK_BOT_TOKEN=${shellQuote(slackToken)}`);
+  envArgs.push(formatEnvAssignment("SLACK_BOT_TOKEN", slackToken));
 }

⚠️ Potential issue | 🟠 Major

NVIDIA_API_KEY still leaks through the sandbox-create argv.

envArgs is serialized into openshell sandbox create ... -- env ..., and runOpenshell() shells that through bash -c, so the provider key is still exposed in process arguments during onboarding. The same argv path also affects any other tokens appended here.

Also applies to: 1281-1284

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1265 - 1275, The code is appending secrets
(e.g., NVIDIA_API_KEY, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) into envArgs which
get serialized into the openshell sandbox create command and exposed on process
argv; instead stop adding secret values via formatEnvAssignment into envArgs and
pass them via a secure channel: set these values on the spawned process
environment (child_process.spawn/exec options.env) or deliver via
stdin/temporary file with restrictive permissions before calling runOpenshell,
updating the logic around envArgs, formatEnvAssignment and the call site that
invokes runOpenshell so non-secret flags remain in envArgs while actual token
values are injected into the child process env (or a secure file) rather than
the command line.

@kjw3
Contributor Author

kjw3 commented Mar 22, 2026

Reviewer summary for this PR:

This branch changes NemoClaw from a mostly NVIDIA-hosted onboarding flow into a provider-first routed inference flow.

What changed

  • Onboarding now starts with provider and model selection before sandbox creation.
  • Supported non-experimental onboarding paths now include:
    • NVIDIA Endpoints
    • OpenAI
    • Other OpenAI-compatible endpoint
    • Anthropic
    • Other Anthropic-compatible endpoint
    • Google Gemini
    • Local Ollama
  • Credentials remain host-side. The sandbox does not receive raw provider API keys.
  • The sandbox image now bakes the selected model/provider identity into immutable ~/.openclaw/openclaw.json during build instead of relying on post-create config mutation.
  • OpenShell continues to route sandbox inference through inference.local to the selected host-side provider.

Validation behavior

  • Onboarding does not move past step 2 unless the selected provider/model/key or local runtime actually validates.
  • OpenAI-compatible providers try POST /responses first, then fall back to POST /chat/completions if needed.
  • Anthropic providers validate with a live POST /v1/messages request.
  • NVIDIA Endpoints manual Other... model entry validates against https://integrate.api.nvidia.com/v1/models.
  • Custom compatible endpoints no longer depend on a /models endpoint. They validate by trying real inference on the entered model.
  • Ollama now pulls, warms, and validates the selected model before onboarding continues. If no models are installed, onboarding offers starter models and an Other... path.
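The probe-then-fallback order above can be sketched as a small pure helper. `probeFn` stands in for the real HTTP request and the function name is illustrative, not the branch's actual code:

```javascript
// Illustrative only: decide which API shape a provider supports based on
// probe status codes. probeFn(path) -> { status } is an injected stand-in.
function validateOpenAiLike(baseUrl, probeFn) {
  const first = probeFn(`${baseUrl}/responses`);
  if (first.status >= 200 && first.status < 300) {
    return { ok: true, api: "responses" };
  }
  // A 404/405 here typically means the endpoint lacks the Responses API.
  const second = probeFn(`${baseUrl}/chat/completions`);
  if (second.status >= 200 && second.status < 300) {
    return { ok: true, api: "chat-completions" };
  }
  return { ok: false, api: null };
}
```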

Notable compatibility work

  • Gemini now works through the routed OpenAI-compatible path after baking compat.supportsStore=false on the sandbox model entry.
  • Credit to @andy-ratsirarson and @alannguyen1 for the issue-thread investigation that helped narrow the working Gemini path.
  • NVIDIA-hosted and compatible-provider sandbox model identities were cleaned up so OpenClaw uses the routed inference/... model path instead of native provider aliases that do not match NemoClaw routing.

Docs and tests

  • README and developer docs were updated to describe routed multi-provider inference instead of NVIDIA-only onboarding.
  • Unit coverage was updated for the new provider flows, migration restore hardening, and local runtime handling.

Validation completed on this branch

  • Live validated by hand on the non-experimental providers during development:
    • NVIDIA Endpoints
    • OpenAI
    • Anthropic
    • Google Gemini
    • Local Ollama
  • Package checks and coverage now pass again:
    • cd nemoclaw && npm run check
    • cd nemoclaw && npm test -- --coverage
  • Root provider/onboarding tests exercised:
    • node --test test/onboard.test.js test/onboard-selection.test.js test/inference-config.test.js test/credentials.test.js

Still intentionally experimental

  • Local NVIDIA NIM
  • Local vLLM

If you are reviewing behavior, the highest-value things to check are:

  • step-2 validation and retry behavior for provider/model/key mistakes
  • that the selected provider/model is what OpenClaw reports inside the sandbox
  • that compatible endpoint flows recover cleanly from model vs auth/base-URL failures

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
bin/lib/onboard.js (3)

1264-1275: ⚠️ Potential issue | 🟠 Major

Secrets still passed through sandbox-create argv.

NVIDIA_API_KEY, DISCORD_BOT_TOKEN, and SLACK_BOT_TOKEN values are serialized into envArgs and passed to runOpenshell(["sandbox", "create", ...]). This exposes secrets in process arguments during onboarding.

Consider passing secrets via the spawned process's environment or a temporary file with restricted permissions instead of command-line arguments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1264 - 1275, The diff is serializing secrets
into envArgs (via formatEnvAssignment) which are passed to
runOpenshell(["sandbox", "create", ...]) exposing sensitive values; change the
logic so NVIDIA_API_KEY, DISCORD_BOT_TOKEN and SLACK_BOT_TOKEN are NOT appended
to envArgs but are injected via the child process environment or a secure
temporary file instead: detect values using getCredential or process.env as now,
remove any pushes to envArgs for those keys, and when invoking
runOpenshell("sandbox", "create", ...) pass an env object that includes these
secrets (or pass a path to a file created with restrictive permissions) so
secrets never appear in the command-line; update any callers that build
envArgs/formatEnvAssignment and ensure runOpenshell accepts and forwards the env
map to the spawned process.

436-452: ⚠️ Potential issue | 🟠 Major

Add timeouts to synchronous curl probes.

These spawnSync curl calls lack --connect-timeout and --max-time flags. A blackholed or stalled endpoint will hang onboarding indefinitely instead of reaching the retry/reprompt flow.

Suggested fix
       const cmd = [
-        "curl -sS",
+        "curl -sS --connect-timeout 5 --max-time 30",
         `-o ${shellQuote(bodyFile)}`,

Apply this pattern to all curl invocations: lines 436, 479, 641, 694, and 729.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 436 - 452, The curl command built into the
cmd array before the spawnSync call (where variables like probe, bodyFile,
shellQuote, and ROOT are used) must include connection and total timeouts to
avoid hanging; add --connect-timeout and --max-time flags with appropriate
values to the array construction that creates cmd so the spawned bash curl call
receives them, and apply the same change to the other identical curl invocations
in this file (the other places that construct a curl cmd and call spawnSync with
the same pattern).

531-552: ⚠️ Potential issue | 🟡 Minor

Credential resolution works correctly for interactive flows.

The validateOpenAiLikeSelection function properly uses getCredential(credentialEnv) to resolve credentials. This addresses the concern about credential lookup for validation.

However, for non-interactive mode, if credentials are only in process.env and not stored via saveCredential, there's a gap since getCredential reads from the credentials file, not process.env.

Suggested fix to support env-only credentials
 async function validateOpenAiLikeSelection(
   label,
   endpointUrl,
   model,
   credentialEnv = null,
   retryMessage = "Please choose a provider/model again."
 ) {
-  const apiKey = credentialEnv ? getCredential(credentialEnv) : "";
+  const apiKey = credentialEnv ? (getCredential(credentialEnv) || process.env[credentialEnv] || "") : "";
   const probe = probeOpenAiLikeEndpoint(endpointUrl, model, apiKey);

Apply the same pattern to validateAnthropicSelection (line 555), validateAnthropicSelectionWithRetryMessage (line 578), validateCustomOpenAiLikeSelection (line 595), and validateCustomAnthropicSelection (line 617).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 531 - 552, The non-interactive validation
functions (validateOpenAiLikeSelection, validateAnthropicSelection,
validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection,
validateCustomAnthropicSelection) currently only use
getCredential(credentialEnv) which reads the credentials file and thus misses
credentials provided solely via process.env; update each function to resolve
apiKey by falling back to process.env when getCredential returns falsy (e.g.
const apiKey = credentialEnv ? (getCredential(credentialEnv) ||
process.env[credentialEnv] || "") : "";), then pass that apiKey into the
existing probe* calls so non-interactive flows accept env-only credentials.
🧹 Nitpick comments (3)
bin/lib/onboard.js (2)

1615-1643: Duplicate Ollama validation logic.

The Ollama model selection and validation loop appears twice: once in the ollama branch (lines 1615-1643) and again in the install-ollama branch (lines 1655-1683). Consider extracting this into a helper function.

Extract shared Ollama validation loop
+async function selectAndValidateOllamaModel(gpu, requestedModel) {
+  while (true) {
+    const installedModels = getOllamaModelOptions(runCapture);
+    let model;
+    if (isNonInteractive()) {
+      model = requestedModel || getDefaultOllamaModel(runCapture, gpu);
+    } else {
+      model = await promptOllamaModel(gpu);
+    }
+    const probe = prepareOllamaModel(model, installedModels);
+    if (!probe.ok) {
+      console.error(`  ${probe.message}`);
+      if (isNonInteractive()) process.exit(1);
+      console.log("  Choose a different Ollama model or select Other.");
+      console.log("");
+      continue;
+    }
+    const api = await validateOpenAiLikeSelection(
+      "Local Ollama",
+      getLocalProviderValidationBaseUrl("ollama-local"),
+      model,
+      null,
+      "Choose a different Ollama model or select Other."
+    );
+    if (!api) continue;
+    return { model, api };
+  }
+}

Also applies to: 1655-1683

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1615 - 1643, The Ollama model
selection/validation loop (using getOllamaModelOptions, isNonInteractive,
getDefaultOllamaModel, promptOllamaModel, prepareOllamaModel,
validateOpenAiLikeSelection, getLocalProviderValidationBaseUrl, model,
preferredInferenceApi) is duplicated; extract it into a single helper function
(for example selectAndValidateOllamaModel(runCapture, provider, gpu,
requestedModel)) that encapsulates the while(true) loop and returns the
validated preferredInferenceApi (and selected model if needed), then replace
both duplicated blocks in the ollama and install-ollama branches by calling this
helper and handling its return value.

1568-1570: Cloud fallback relies on implicit defaults—consider making this explicit.

When no NIM models fit GPU VRAM, the code logs "Falling back to cloud API" but then falls through without explicitly setting cloud provider values. This works because the defaults at lines 1351-1355 are already set to NVIDIA Endpoints (nvidia-prod), but the implicit behavior is subtle.

Make the fallback explicit
       if (models.length === 0) {
         console.log("  No NIM models fit your GPU VRAM. Falling back to cloud API.");
+        // Defaults already set to NVIDIA Endpoints (nvidia-prod)
+        break;
       } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1568 - 1570, The branch that handles
models.length === 0 only logs a message but relies on earlier implicit defaults
for the cloud fallback; explicitly set the cloud fallback variables when no
local NIM models fit GPU VRAM by assigning the cloud provider/endpoint and any
flag that selects cloud inference (e.g., set provider/endpoint to the NVIDIA
production endpoint value used as the default and ensure the cloud selection
flag is true). Update the models.length === 0 block to explicitly set the same
provider/endpoint variables used elsewhere (the variables that previously
defaulted to "nvidia-prod"/NVIDIA) so the fallback behavior is clear and not
reliant on implicit earlier defaults.
test/onboard.test.js (1)

174-177: Subprocess tests don't clean up temporary directories.

The subprocess tests create temporary directories (e.g., line 176) but don't clean them up after completion, unlike the earlier tests that use try/finally with fs.rmSync. Consider adding cleanup or using a shared test fixture helper.

Add cleanup to subprocess tests
+    try {
       const result = spawnSync(process.execPath, [scriptPath], {
         // ...
       });
       // assertions...
+    } finally {
+      fs.rmSync(tmpDir, { recursive: true, force: true });
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 174 - 177, The subprocess test starting
with the "passes credential names to openshell without embedding secret values
in argv" it-block creates temporary directories (tmpDir, fakeBin) but never
removes them; wrap the test body in a try/finally (or add an afterEach/fixture)
and in the finally call fs.rmSync(tmpDir, { recursive: true, force: true }) to
ensure tmpDir and fakeBin are removed, referencing tmpDir/fakeBin variables in
test/onboard.test.js to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1264-1275: The diff is serializing secrets into envArgs (via
formatEnvAssignment) which are passed to runOpenshell(["sandbox", "create",
...]) exposing sensitive values; change the logic so NVIDIA_API_KEY,
DISCORD_BOT_TOKEN and SLACK_BOT_TOKEN are NOT appended to envArgs but are
injected via the child process environment or a secure temporary file instead:
detect values using getCredential or process.env as now, remove any pushes to
envArgs for those keys, and when invoking runOpenshell("sandbox", "create", ...)
pass an env object that includes these secrets (or pass a path to a file created
with restrictive permissions) so secrets never appear in the command-line;
update any callers that build envArgs/formatEnvAssignment and ensure
runOpenshell accepts and forwards the env map to the spawned process.
- Around line 436-452: The curl command built into the cmd array before the
spawnSync call (where variables like probe, bodyFile, shellQuote, and ROOT are
used) must include connection and total timeouts to avoid hanging; add
--connect-timeout and --max-time flags with appropriate values to the array
construction that creates cmd so the spawned bash curl call receives them, and
apply the same change to the other identical curl invocations in this file (the
other places that construct a curl cmd and call spawnSync with the same
pattern).
- Around line 531-552: The non-interactive validation functions
(validateOpenAiLikeSelection, validateAnthropicSelection,
validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection,
validateCustomAnthropicSelection) currently only use
getCredential(credentialEnv) which reads the credentials file and thus misses
credentials provided solely via process.env; update each function to resolve
apiKey by falling back to process.env when getCredential returns falsy (e.g.
const apiKey = credentialEnv ? (getCredential(credentialEnv) ||
process.env[credentialEnv] || "") : "";), then pass that apiKey into the
existing probe* calls so non-interactive flows accept env-only credentials.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1615-1643: The Ollama model selection/validation loop (using
getOllamaModelOptions, isNonInteractive, getDefaultOllamaModel,
promptOllamaModel, prepareOllamaModel, validateOpenAiLikeSelection,
getLocalProviderValidationBaseUrl, model, preferredInferenceApi) is duplicated;
extract it into a single helper function (for example
selectAndValidateOllamaModel(runCapture, provider, gpu, requestedModel)) that
encapsulates the while(true) loop and returns the validated
preferredInferenceApi (and selected model if needed), then replace both
duplicated blocks in the ollama and install-ollama branches by calling this
helper and handling its return value.
- Around line 1568-1570: The branch that handles models.length === 0 only logs a
message but relies on earlier implicit defaults for the cloud fallback;
explicitly set the cloud fallback variables when no local NIM models fit GPU
VRAM by assigning the cloud provider/endpoint and any flag that selects cloud
inference (e.g., set provider/endpoint to the NVIDIA production endpoint value
used as the default and ensure the cloud selection flag is true). Update the
models.length === 0 block to explicitly set the same provider/endpoint variables
used elsewhere (the variables that previously defaulted to "nvidia-prod"/NVIDIA)
so the fallback behavior is clear and not reliant on implicit earlier defaults.

In `@test/onboard.test.js`:
- Around line 174-177: The subprocess test starting with the "passes credential
names to openshell without embedding secret values in argv" it-block creates
temporary directories (tmpDir, fakeBin) but never removes them; wrap the test
body in a try/finally (or add an afterEach/fixture) and in the finally call
fs.rmSync(tmpDir, { recursive: true, force: true }) to ensure tmpDir and fakeBin
are removed, referencing tmpDir/fakeBin variables in test/onboard.test.js to
locate the code to change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e2c3775-838b-4bd9-84f5-170136252f4e

📥 Commits

Reviewing files that changed from the base of the PR and between 98a8da9 and 69adc37.

📒 Files selected for processing (8)
  • README.md
  • bin/lib/credentials.js
  • bin/lib/onboard.js
  • docs/inference/switch-inference-providers.md
  • docs/reference/inference-profiles.md
  • nemoclaw/src/commands/migration-state.test.ts
  • nemoclaw/src/commands/migration-state.ts
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (1)
  • nemoclaw/src/commands/migration-state.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemoclaw/src/commands/migration-state.test.ts
  • README.md
  • bin/lib/credentials.js

@kjw3 kjw3 force-pushed the feat/openai-anthropic-gemini-providers branch from 69adc37 to cca052e Compare March 22, 2026 16:06
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bin/lib/onboard.js (1)

1565-1604: ⚠️ Potential issue | 🟠 Major

nim-local fallback leaves model unset when no NIM models fit GPU.

If a user selects nim-local but no models fit their GPU VRAM (line 1568), the code logs a fallback message and breaks. However, model remains null from line 1350, causing downstream createSandbox and setupInference to receive invalid data.

Since this path is already gated by NEMOCLAW_EXPERIMENTAL=1, consider prompting the user to select a different provider instead of silently continuing with null values.

Suggested fix
       if (models.length === 0) {
-        console.log("  No NIM models fit your GPU VRAM. Falling back to cloud API.");
+        console.log("  No NIM models fit your GPU VRAM. Please choose a different provider.");
+        continue selectionLoop;
       } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1565 - 1604, When the "nim-local" branch
finds no compatible models (models.length === 0), the code currently logs a
fallback and breaks while leaving the shared variable model null, causing
downstream callers like createSandbox and setupInference to receive invalid
input; update this branch so that instead of just logging and breaking it
prompts the user to choose a different provider (or programmatically falls back
to a cloud provider) and sets provider and model to valid values before
continuing, or exits with an explicit error—modify the nim-local handling around
the models.length === 0 check to either re-run the provider selection flow (so
model gets assigned) or set provider to the cloud fallback and assign a sane
default to model, while preserving the NEMOCLAW_EXPERIMENTAL gate.
♻️ Duplicate comments (2)
bin/lib/onboard.js (2)

531-552: ⚠️ Potential issue | 🟠 Major

Non-interactive validation may fail due to credential resolution mismatch.

In non-interactive mode, lines 1460-1477 check process.env[credentialEnv] to verify credentials are present, but then validateOpenAiLikeSelection and similar functions resolve credentials via getCredential() (line 538). If getCredential() reads from ~/.nemoclaw/credentials.json rather than process.env, the validation probe will run without the API key even though the env var was provided.

Suggested fix
+function resolveCredentialValue(envName) {
+  return envName ? (process.env[envName] || getCredential(envName) || "") : "";
+}
+
 async function validateOpenAiLikeSelection(
   label,
   endpointUrl,
   model,
   credentialEnv = null,
   retryMessage = "Please choose a provider/model again."
 ) {
-  const apiKey = credentialEnv ? getCredential(credentialEnv) : "";
+  const apiKey = resolveCredentialValue(credentialEnv);

Apply the same pattern to validateAnthropicSelection, validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection, and validateCustomAnthropicSelection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 531 - 552, The validation functions
(validateOpenAiLikeSelection, validateAnthropicSelection,
validateAnthropicSelectionWithRetryMessage, validateCustomOpenAiLikeSelection,
validateCustomAnthropicSelection) resolve credentials via getCredential(), which
can read from the credentials file and thus miss an API key provided in
process.env[credentialEnv]; update each function so when credentialEnv is passed
it first checks process.env[credentialEnv] for the API key and uses that value
if present, otherwise fall back to getCredential(credentialEnv), then pass that
resolved apiKey into probeOpenAiLikeEndpoint / probeAnthropicEndpoint calls to
ensure non-interactive validation uses environment-provided secrets.

436-445: ⚠️ Potential issue | 🟠 Major

Add timeouts to validation curl commands.

These synchronous spawnSync(...curl...) calls lack --connect-timeout and --max-time. A blackholed hostname or stalled endpoint will hang onboarding indefinitely instead of reaching the retry/reprompt flow.

Suggested fix
       const cmd = [
-        "curl -sS",
+        "curl -sS --connect-timeout 5 --max-time 30",
         `-o ${shellQuote(bodyFile)}`,

Apply the same fix to the other curl invocations at lines 480, 642, 695, and 730.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 436 - 445, The curl invocations built into
the cmd array (used with spawnSync in the onboarding flow) lack timeouts and can
hang; update the cmd construction to include safe timeout flags (e.g., add
'--connect-timeout 5' and '--max-time 30' as separate elements in the array
before the URL) while preserving existing headers and shellQuote usage (refer to
the cmd, probe.body, probe.url, shellQuote, and spawnSync occurrences); apply
the same change to the other curl command constructions/spawnSync calls in this
file (the other cmd/spawnSync blocks) so every synchronous curl includes the
connect and overall timeouts.
🧹 Nitpick comments (2)
docs/reference/inference-profiles.md (1)

23-25: Split compound sentence for clarity.

Line 24 is a long compound sentence. Consider splitting for readability and to follow the one-sentence-per-line guideline.

Suggested fix
 NemoClaw configures inference through the OpenShell gateway.
-The agent inside the sandbox talks to `inference.local`, and OpenShell routes that traffic to the provider you selected during onboarding.
+The agent inside the sandbox talks to `inference.local`.
+OpenShell routes that traffic to the provider you selected during onboarding.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/inference-profiles.md` around lines 23 - 25, The sentence "The
agent inside the sandbox talks to `inference.local`, and OpenShell routes that
traffic to the provider you selected during onboarding." is a long compound
sentence — split it into two clearer sentences; for example, keep "The agent
inside the sandbox talks to `inference.local`." as one sentence and make
"OpenShell routes that traffic to the provider you selected during onboarding."
a separate sentence so the document follows the one-sentence-per-line guideline
and improves readability.
docs/inference/switch-inference-providers.md (1)

90-95: Use a MyST admonition for the Notes section.

The "Notes" section uses a plain H2 header with bullet points. Consider using a :::{note} admonition for callout content, which better signals informational context to readers. As per coding guidelines, "Use MyST admonitions (:::{tip}, :::{note}, :::{warning}) for callouts, not bold text or blockquotes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/inference/switch-inference-providers.md` around lines 90 - 95, Replace
the plain H2 "## Notes" and its bullet list with a MyST admonition block by
converting the header into a `:::{note}` block, placing the four bullet lines
("- The host keeps provider credentials.", "- The sandbox continues to use
`inference.local`.", "- Runtime switching changes the OpenShell route.", "- It
does not rewrite your stored credentials.") inside the admonition, and closing
it with `:::` so the content becomes a proper MyST note callout instead of a
normal section.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/credentials.js`:
- Around line 110-124: The outer Promise in prompt() only accepts resolve so the
silent branch's catch never settles callers; update the Promise constructor to
(resolve, reject) and in the promptSecret().catch handler call reject(err) for
both error paths instead of throwing (and for the SIGINT path call reject(err)
then process.kill(process.pid, "SIGINT") to ensure the Promise is settled before
exiting). This change should be made in the prompt function and touches the
promptSecret error handling block so callers awaiting prompt() won't hang.

In `@bin/lib/onboard.js`:
- Around line 1264-1275: The envArgs array currently includes sensitive values
via formatEnvAssignment for NVIDIA_API_KEY, DISCORD_BOT_TOKEN, and
SLACK_BOT_TOKEN which causes secret leakage in command-line arguments; change
the onboarding flow so these keys are not appended to envArgs and instead inject
them into the spawned process environment (opts.env) or send them over stdin/a
securely permissioned temp file when calling the sandbox create command;
specifically, stop pushing formatEnvAssignment("NVIDIA_API_KEY", ...),
formatEnvAssignment("DISCORD_BOT_TOKEN", ...), and
formatEnvAssignment("SLACK_BOT_TOKEN", ...) into envArgs and set process spawn
options.env (or write/read from stdin/temp file) so the secrets are available to
the child process without appearing in the argv.

---

Outside diff comments:
In `@bin/lib/onboard.js`:
- Around line 1565-1604: When the "nim-local" branch finds no compatible models
(models.length === 0), the code currently logs a fallback and breaks while
leaving the shared variable model null, causing downstream callers like
createSandbox and setupInference to receive invalid input; update this branch so
that instead of just logging and breaking it prompts the user to choose a
different provider (or programmatically falls back to a cloud provider) and sets
provider and model to valid values before continuing, or exits with an explicit
error—modify the nim-local handling around the models.length === 0 check to
either re-run the provider selection flow (so model gets assigned) or set
provider to the cloud fallback and assign a sane default to model, while
preserving the NEMOCLAW_EXPERIMENTAL gate.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 531-552: The validation functions (validateOpenAiLikeSelection,
validateAnthropicSelection, validateAnthropicSelectionWithRetryMessage,
validateCustomOpenAiLikeSelection, validateCustomAnthropicSelection) resolve
credentials via getCredential(), which can read from the credentials file and
thus miss an API key provided in process.env[credentialEnv]; update each
function so when credentialEnv is passed it first checks
process.env[credentialEnv] for the API key and uses that value if present,
otherwise fall back to getCredential(credentialEnv), then pass that resolved
apiKey into probeOpenAiLikeEndpoint / probeAnthropicEndpoint calls to ensure
non-interactive validation uses environment-provided secrets.
- Around line 436-445: The curl invocations built into the cmd array (used with
spawnSync in the onboarding flow) lack timeouts and can hang; update the cmd
construction to include safe timeout flags (e.g., add '--connect-timeout 5' and
'--max-time 30' as separate elements in the array before the URL) while
preserving existing headers and shellQuote usage (refer to the cmd, probe.body,
probe.url, shellQuote, and spawnSync occurrences); apply the same change to the
other curl command constructions/spawnSync calls in this file (the other
cmd/spawnSync blocks) so every synchronous curl includes the connect and overall
timeouts.

---

Nitpick comments:
In `@docs/inference/switch-inference-providers.md`:
- Around line 90-95: Replace the plain H2 "## Notes" and its bullet list with a
MyST admonition block by converting the header into a `:::{note}` block, placing
the four bullet lines ("- The host keeps provider credentials.", "- The sandbox
continues to use `inference.local`.", "- Runtime switching changes the OpenShell
route.", "- It does not rewrite your stored credentials.") inside the
admonition, and closing it with `:::` so the content becomes a proper MyST note
callout instead of a normal section.

In `@docs/reference/inference-profiles.md`:
- Around line 23-25: The sentence "The agent inside the sandbox talks to
`inference.local`, and OpenShell routes that traffic to the provider you
selected during onboarding." is a long compound sentence — split it into two
clearer sentences; for example, keep "The agent inside the sandbox talks to
`inference.local`." as one sentence and make "OpenShell routes that traffic to
the provider you selected during onboarding." a separate sentence so the
document follows the one-sentence-per-line guideline and improves readability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 12d15cc3-576c-4c90-99bf-e5ba1679442a

📥 Commits

Reviewing files that changed from the base of the PR and between 69adc37 and cca052e.

📒 Files selected for processing (18)
  • Dockerfile
  • README.md
  • bin/lib/credentials.js
  • bin/lib/inference-config.js
  • bin/lib/local-inference.js
  • bin/lib/nim.js
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • docs/about/how-it-works.md
  • docs/inference/switch-inference-providers.md
  • docs/reference/commands.md
  • docs/reference/inference-profiles.md
  • docs/reference/troubleshooting.md
  • test/e2e/test-full-e2e.sh
  • test/inference-config.test.js
  • test/local-inference.test.js
  • test/onboard-selection.test.js
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (3)
  • docs/reference/troubleshooting.md
  • docs/reference/commands.md
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (6)
  • test/local-inference.test.js
  • docs/about/how-it-works.md
  • test/inference-config.test.js
  • bin/lib/nim.js
  • README.md
  • test/onboard-selection.test.js

Comment on lines +110 to +124
```js
function prompt(question, opts = {}) {
  return new Promise((resolve) => {
    const silent = opts.secret === true && process.stdin.isTTY && process.stderr.isTTY;
    if (silent) {
      promptSecret(question)
        .then(resolve)
        .catch((err) => {
          if (err && err.code === "SIGINT") {
            process.kill(process.pid, "SIGINT");
            return;
          }
          throw err;
        });
      return;
    }
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify unresolved Promise paths in prompt() silent branch (read-only).
awk 'NR>=110 && NR<=124 {printf "%4d  %s\n", NR, $0}' bin/lib/credentials.js
rg -nP 'new Promise\(\(resolve\)\s*=>\s*\{' bin/lib/credentials.js -n -A12 -B2
```

Repository: NVIDIA/NemoClaw

Length of output: 1081


🏁 Script executed:

```bash
sed -n '110,200p' bin/lib/credentials.js | cat -n
```

Repository: NVIDIA/NemoClaw

Length of output: 3624


Settle the outer prompt() Promise on silent-branch failures.

The Promise at line 111 only accepts resolve, but the silent branch (lines 114–123) has error paths that never settle it. When promptSecret() rejects with SIGINT (line 118), process.kill() executes and returns, leaving the outer Promise unresolved. For other errors (line 121), throw creates an unhandled rejection while the outer Promise still hangs. Callers awaiting prompt() can hang indefinitely.

Add reject to the Promise constructor and call reject(err) in both error cases:

Suggested fix
```diff
-function prompt(question, opts = {}) {
-  return new Promise((resolve) => {
+function prompt(question, opts = {}) {
+  return new Promise((resolve, reject) => {
     const silent = opts.secret === true && process.stdin.isTTY && process.stderr.isTTY;
     if (silent) {
       promptSecret(question)
         .then(resolve)
         .catch((err) => {
           if (err && err.code === "SIGINT") {
             process.kill(process.pid, "SIGINT");
-            return;
+            reject(err);
+            return;
           }
-          throw err;
+          reject(err);
         });
       return;
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 110 - 124, The outer Promise in prompt()
only accepts resolve so the silent branch's catch never settles callers; update
the Promise constructor to (resolve, reject) and in the promptSecret().catch
handler call reject(err) for both error paths instead of throwing (and for the
SIGINT path call reject(err) then process.kill(process.pid, "SIGINT") to ensure
the Promise is settled before exiting). This change should be made in the prompt
function and touches the promptSecret error handling block so callers awaiting
prompt() won't hang.

Comment on lines +1264 to 1275
```diff
   const envArgs = [formatEnvAssignment("CHAT_UI_URL", chatUiUrl)];
   if (process.env.NVIDIA_API_KEY) {
-    envArgs.push(`NVIDIA_API_KEY=${shellQuote(process.env.NVIDIA_API_KEY)}`);
+    envArgs.push(formatEnvAssignment("NVIDIA_API_KEY", process.env.NVIDIA_API_KEY));
   }
   const discordToken = getCredential("DISCORD_BOT_TOKEN") || process.env.DISCORD_BOT_TOKEN;
   if (discordToken) {
-    envArgs.push(`DISCORD_BOT_TOKEN=${shellQuote(discordToken)}`);
+    envArgs.push(formatEnvAssignment("DISCORD_BOT_TOKEN", discordToken));
   }
   const slackToken = getCredential("SLACK_BOT_TOKEN") || process.env.SLACK_BOT_TOKEN;
   if (slackToken) {
-    envArgs.push(`SLACK_BOT_TOKEN=${shellQuote(slackToken)}`);
+    envArgs.push(formatEnvAssignment("SLACK_BOT_TOKEN", slackToken));
   }
```

⚠️ Potential issue | 🟠 Major

Secrets still leak through sandbox-create command arguments.

NVIDIA_API_KEY, DISCORD_BOT_TOKEN, and SLACK_BOT_TOKEN are serialized into envArgs and passed as command-line arguments to openshell sandbox create ... -- env NVIDIA_API_KEY=.... These values are visible in process listings (ps aux).

The PR objectives state "provider secrets are not passed in openshell argv during onboarding/provider creation," but this code path still exposes them.

Consider passing secrets via the spawned process's environment (opts.env) or via stdin/temporary file with restrictive permissions instead of command-line arguments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1264 - 1275, The envArgs array currently
includes sensitive values via formatEnvAssignment for NVIDIA_API_KEY,
DISCORD_BOT_TOKEN, and SLACK_BOT_TOKEN which causes secret leakage in
command-line arguments; change the onboarding flow so these keys are not
appended to envArgs and instead inject them into the spawned process environment
(opts.env) or send them over stdin/a securely permissioned temp file when
calling the sandbox create command; specifically, stop pushing
formatEnvAssignment("NVIDIA_API_KEY", ...),
formatEnvAssignment("DISCORD_BOT_TOKEN", ...), and
formatEnvAssignment("SLACK_BOT_TOKEN", ...) into envArgs and set process spawn
options.env (or write/read from stdin/temp file) so the secrets are available to
the child process without appearing in the argv.

@andy-ratsirarson

andy-ratsirarson commented Mar 22, 2026

> The sandbox image now bakes the selected model/provider identity into immutable ~/.openclaw/openclaw.json during build instead of relying on post-create config mutation.

@kjw3 Does baking the model into openclaw.json matter, given that the gateway rewrites the model in flight? Since `openshell inference set` handles routing regardless of what openclaw.json says, even random model names get rewritten.

Since `.nemoclaw/config.json` actually has the inference model, why not make it override openclaw.json and be controlled by the user on the host via `~/.nemoclaw/config.json`?

In my current sandbox:

~/.nemoclaw/config.json
```json
{
  "endpointType": "custom",
  "endpointUrl": "https://inference.local/v1",
  "ncpPartner": null,
  "model": "gemini-2.5-flash",
  "profile": "inference-local",
  "credentialEnv": "OPENAI_API_KEY",
  "provider": "custom",
  "providerLabel": "Custom Provider",
  "onboardedAt": "2026-03-22T16:48:25.562Z"
}
```


Labels

enhancement: provider Use this label to identify requests to add a new AI provider to NemoClaw.
