345 changes: 345 additions & 0 deletions .agents/skills/debug-inference/SKILL.md
@@ -0,0 +1,345 @@
---
name: debug-inference
description: Debug why inference.local or external inference setup is failing. Use when the user cannot reach a local model server, has provider base URL issues, sees inference verification failures, hits protocol mismatches, or needs to diagnose inference on local vs remote gateways. Trigger keywords - debug inference, inference.local, local inference, ollama, vllm, sglang, trtllm, NIM, inference failing, model server unreachable, failed to verify inference endpoint, host.openshell.internal.
---

# Debug Inference

Diagnose why OpenShell inference is failing and recommend exact fix commands.

Use `openshell` CLI commands to inspect the active gateway, provider records, managed inference config, and sandbox behavior. Use a short sandbox probe when needed to confirm end-to-end routing.

## Overview

OpenShell supports two different inference paths. Diagnose the correct one first.

1. **Managed inference** through `https://inference.local`
   - Configured by `openshell inference set`
   - Shared by every sandbox on the active gateway
   - Credentials and model are injected by OpenShell
2. **Direct external inference** to hosts like `api.openai.com`
   - Controlled by `network_policies`
   - Requires the application to call the external host directly
   - Requires provider attachment and network access to be configured separately

For local or self-hosted engines such as Ollama, vLLM, SGLang, TRT-LLM, and many NIM deployments, the most common managed inference pattern is an `openai` provider with `OPENAI_BASE_URL` pointing at a host the gateway can reach.

## Prerequisites

- `openshell` is on the PATH
- The active gateway is running
- You know the failing setup, or can infer it from commands and config

## Tools Available

Use these commands first:

```bash
# Which gateway is active, and can the CLI reach it?
openshell status

# Show managed inference config for inference.local
openshell inference get

# Inspect the provider record referenced by inference.local
openshell provider get <provider-name>

# Inspect gateway topology details when remote/local confusion is suspected
openshell gateway info

# Run a minimal end-to-end probe from a sandbox
openshell sandbox create -- curl https://inference.local/v1/chat/completions --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'
```

## Workflow

When the user asks to debug inference, run diagnostics automatically in this order. Stop and report findings as soon as a root cause is identified.

### Determine Context

Establish these facts first:

1. Is the application calling `https://inference.local` or a direct external host?
2. Which gateway is active, and is it local, remote, or cloud?
3. Which provider and model are configured for managed inference?
4. Is the upstream local to the gateway host, or somewhere else?
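
The first question above can often be answered from the client's configured base URL alone. The following is a minimal sketch, not part of the OpenShell CLI; `inference_path_for_url` is a hypothetical helper name, and it inspects only the hostname.

```bash
# Hypothetical helper (not an openshell command): classify a client base URL
# as the managed path (inference.local) or a direct external call.
inference_path_for_url() {
  local url="$1"
  local rest="${url#*://}"      # drop the scheme, if any
  local hostport="${rest%%/*}"  # keep host[:port]
  local host="${hostport%%:*}"  # drop the port, if any
  if [ "$host" = "inference.local" ]; then
    echo "managed"
  else
    echo "direct"
  fi
}

inference_path_for_url "https://inference.local/v1"   # prints: managed
inference_path_for_url "https://api.openai.com/v1"    # prints: direct
```

A `managed` result means Steps 0 through 6 apply; a `direct` result points at Step 7.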

### Step 0: Check the Active Gateway

Run:

```bash
openshell status
openshell gateway info
```

Look for:

- Active gateway name and endpoint
- Whether the gateway is local or remote
- Whether `host.openshell.internal` would point to the local machine or a remote host

Common mistake:

- **Laptop-local model + remote gateway**: `host.openshell.internal` points to the remote gateway host, not your laptop. A laptop-local Ollama or vLLM server will not be reachable without a tunnel or shared reachable network path.

### Step 1: Check Whether Managed Inference Is Configured

Run:

```bash
openshell inference get
```

Interpretation:

- **`Not configured`**: `inference.local` has no backend yet. Fix by configuring it:

```bash
openshell inference set --provider <name> --model <id>
```

- **Provider and model shown**: Continue to provider inspection.

### Step 2: Inspect the Provider Record

Run:

```bash
openshell provider get <provider-name>
```

Check:

- Provider type matches the client API shape
  - `openai` for OpenAI-compatible engines such as Ollama, vLLM, SGLang, TRT-LLM, and many NIM deployments
  - `anthropic` for Anthropic Messages API
  - `nvidia` for NVIDIA-hosted OpenAI-compatible endpoints
- Required credential key exists
- `*_BASE_URL` override is correct when using a self-hosted endpoint
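
The type-versus-API-shape check can be sketched as a small lookup. This is an illustration only: the helper names are hypothetical, and the mapping of paths to API families is an assumption based on the provider types listed above, not a statement of how OpenShell routes internally.

```bash
# Hypothetical helpers: decide whether a provider type and a request path
# belong to the same API family. The path-to-family mapping is an assumption.
api_family_for_path() {
  case "$1" in
    /v1/messages) echo "anthropic" ;;
    /v1/chat/completions|/v1/completions|/v1/responses) echo "openai-compatible" ;;
    *) echo "unknown" ;;
  esac
}

api_family_for_provider() {
  case "$1" in
    anthropic) echo "anthropic" ;;
    openai|nvidia) echo "openai-compatible" ;;
    *) echo "unknown" ;;
  esac
}

# Succeeds (exit 0) only when both families are known and equal.
provider_matches_path() {
  local provider_family path_family
  provider_family="$(api_family_for_provider "$1")"
  path_family="$(api_family_for_path "$2")"
  [ "$provider_family" != "unknown" ] && [ "$provider_family" = "$path_family" ]
}
```

For example, `provider_matches_path openai /v1/messages` fails, which corresponds to an Anthropic client talking to an `openai` provider.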

Fix examples:

```bash
openshell provider create --name ollama --type openai --credential OPENAI_API_KEY=empty --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1

openshell provider update ollama --type openai --credential OPENAI_API_KEY=empty --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1
```

### Step 3: Check Local Host Reachability

For host-backed local inference, confirm the upstream server:

- Binds to `0.0.0.0`, not only `127.0.0.1`
- Runs on the same machine as the gateway
- Is reachable through `host.openshell.internal`, the host's LAN IP, or another reachable hostname

Common mistakes:

- **Base URL uses `127.0.0.1` or `localhost`**: usually wrong for managed inference. Replace with `host.openshell.internal` or the host's LAN IP.
- **Server binds only to loopback**: reconfigure it to bind to `0.0.0.0`.
- **Inference engine runs as a system service**: changing the bind address may require updating the service configuration and restarting the service before the new listener becomes reachable.
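
A quick way to catch the first mistake is to test the provider's base URL for a loopback host before saving it. The sketch below assumes a hypothetical helper name, `is_loopback_base_url`, and only recognizes `localhost`, `127.x.x.x`, and `[::1]`.

```bash
# Hypothetical helper: succeed (exit 0) when a base URL points at loopback,
# which is usually wrong for managed inference.
is_loopback_base_url() {
  local url="$1"
  local rest="${url#*://}"      # drop the scheme, if any
  local hostport="${rest%%/*}"  # keep host[:port]
  case "$hostport" in
    "[::1]"*) return 0 ;;       # bracketed IPv6 loopback
  esac
  local host="${hostport%%:*}"  # drop the port, if any
  case "$host" in
    localhost|127.*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_loopback_base_url "http://127.0.0.1:11434/v1"; then
  echo "base URL uses loopback; use host.openshell.internal or the host's LAN IP"
fi
```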

### Step 4: Check Request Shape

Managed inference applies only to requests sent to `https://inference.local` on a supported inference API path.

Supported patterns include:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `POST /v1/responses`
- `POST /v1/messages`
- `GET /v1/models`

Common mistakes:

- **Wrong scheme**: `http://inference.local` instead of `https://inference.local`
- **Unsupported path**: request does not match a known inference API
- **Protocol mismatch**: Anthropic client against an `openai` provider, or vice versa

Fix guidance:

- Use a supported path and provider type
- Point OpenAI-compatible SDKs at `https://inference.local/v1`
- If the SDK requires an API key, pass any non-empty placeholder such as `test`
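
The scheme and path checks above can be combined into one pre-flight test. This is a sketch under assumptions: `check_inference_request` is a hypothetical name, and it treats the listed patterns as the full set even though the list above is introduced with "include" and may not be exhaustive.

```bash
# Hypothetical helper: validate method + URL against the patterns listed above.
# Prints "ok" on success; prints a reason and returns 1 otherwise.
check_inference_request() {
  local method="$1" url="$2"
  case "$url" in
    https://inference.local|https://inference.local/*) ;;
    http://inference.local|http://inference.local/*)
      echo "wrong scheme: use https://inference.local"; return 1 ;;
    *)
      echo "not a managed inference URL"; return 1 ;;
  esac
  local path="${url#https://inference.local}"
  path="${path%%\?*}"   # ignore any query string
  case "$method $path" in
    "POST /v1/chat/completions"|"POST /v1/completions"|"POST /v1/responses"|"POST /v1/messages"|"GET /v1/models")
      echo "ok" ;;
    *)
      echo "unsupported method or path: $method $path"; return 1 ;;
  esac
}
```

Calling `check_inference_request GET http://inference.local/v1/models` reports the wrong-scheme case before the path is even examined.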

### Step 5: Probe from a Sandbox

Run a minimal request from inside a sandbox:

```bash
openshell sandbox create -- curl https://inference.local/v1/chat/completions --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'
```

Interpretation:

- **`cluster inference is not configured`**: set the managed route with `openshell inference set`
- **`connection not allowed by policy`** on `inference.local`: unsupported method or path
- **`no compatible route`**: provider type and client API shape do not match
- **Connection refused / upstream unavailable / verification failures**: base URL, bind address, topology, or credentials are wrong
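
The interpretation table can be applied mechanically to captured probe output. In this sketch, `classify_probe_output` and its labels are hypothetical; the first three match strings come from the list above, while the exact wording of refused, unavailable, and timeout errors is an assumption.

```bash
# Hypothetical helper: map sandbox probe output to a diagnosis label.
# Matching is case-insensitive; the last group of substrings is assumed wording.
classify_probe_output() {
  local lower
  lower="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$lower" in
    *"cluster inference is not configured"*) echo "route-not-configured" ;;
    *"connection not allowed by policy"*)    echo "unsupported-method-or-path" ;;
    *"no compatible route"*)                 echo "provider-type-mismatch" ;;
    *"connection refused"*|*"upstream unavailable"*|*"failed to verify inference endpoint"*|*"timed out"*)
      echo "upstream-unreachable" ;;
    *) echo "unknown" ;;
  esac
}
```

Capture the probe output into a variable, pass it to `classify_probe_output`, and use the label to pick the matching fix from the list above.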

### Step 6: Reapply or Repair the Managed Route

After fixing the provider, repoint `inference.local`:

```bash
openshell inference set --provider <name> --model <id>
```

If the endpoint is intentionally offline and you only want to save the config:

```bash
openshell inference set --provider <name> --model <id> --no-verify
```

Inference updates are hot-reloaded to all sandboxes on the active gateway within about 5 seconds by default.

### Step 7: Diagnose Direct External Inference

If the application calls `api.openai.com`, `api.anthropic.com`, or another external host directly, this is not a managed inference issue.

Check instead:

1. The application is configured to call the external hostname directly
2. A provider with the needed credentials exists
3. The sandbox is launched with that provider attached
4. `network_policies` allow that host and port, and the HTTP rules permit the request

Use the `generate-sandbox-policy` skill when the user needs help authoring policy YAML.

## Fix: Local Host Inference Timeouts (Firewall)

Use this fix when a sandbox can reach `https://inference.local`, but OpenShell reports an upstream timeout against a host-local backend such as Ollama.

Example symptom:

```json
{"error":"request to http://host.docker.internal:11434/v1/models timed out"}
```

### When This Happens

This failure commonly appears on Linux hosts that:

- Run the OpenShell gateway in Docker
- Route `inference.local` to a host-local OpenAI-compatible endpoint such as Ollama
- Have a host firewall or networking configuration that denies container-to-host traffic by default

In this case, OpenShell routing is usually working correctly. The failing hop is container-to-host traffic on the backend port.

### Why CoreDNS Is Not the Cause

This is not the same issue as the Colima CoreDNS fix.

OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox pods with `hostAliases`. That path bypasses cluster DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not CoreDNS.

### Verify the Problem

1. Confirm the model server works on the host:

```bash
curl -sS http://127.0.0.1:11434/v1/models
```

2. Confirm the host gateway address also works on the host:

```bash
curl -sS http://172.17.0.1:11434/v1/models
```

3. Test the same endpoint from the OpenShell cluster container:

```bash
docker exec openshell-cluster-<gateway> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
```

If steps 1 and 2 succeed but step 3 times out, the host firewall or network configuration is blocking the container-to-host path.
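
The three checks narrow the failing hop in order. The sketch below encodes that reasoning; `interpret_reachability` is a hypothetical helper that takes `ok` or `fail` for each step. The second branch (loopback-only binding) is an inference from Step 3 of the workflow, not something this section states.

```bash
# Hypothetical helper: interpret the results of verification steps 1-3.
# Each argument is "ok" or "fail".
interpret_reachability() {
  local host_loopback="$1" host_gateway_ip="$2" from_container="$3"
  if [ "$host_loopback" != "ok" ]; then
    echo "model server is not answering on the host"
  elif [ "$host_gateway_ip" != "ok" ]; then
    # Assumption: works on 127.0.0.1 but not on the bridge IP
    # suggests the server binds to loopback only.
    echo "server is probably bound to loopback only"
  elif [ "$from_container" != "ok" ]; then
    echo "container-to-host traffic is blocked"
  else
    echo "backend path is reachable"
  fi
}

interpret_reachability ok ok fail   # prints: container-to-host traffic is blocked
```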

### Fix

Allow the Docker bridge network used by the OpenShell cluster to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:

- **Source**: the Docker bridge subnet used by the OpenShell cluster container (commonly `172.18.0.0/16`)
- **Destination**: the host gateway IP injected into sandbox pods for `host.docker.internal` (commonly `172.17.0.1`)
- **Port**: the inference server port (e.g. `11434/tcp` for Ollama)

To find the actual values on your system:

```bash
# Docker bridge subnet for the OpenShell cluster network
docker network inspect $(docker network ls --filter name=openshell -q) --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'

# Host gateway IP visible from inside the container
docker exec openshell-cluster-<gateway> cat /etc/hosts | grep host.docker.internal
```

Adjust the source subnet, destination IP, or port to match your local Docker network layout.
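
As an illustration of what such a rule can look like, the sketch below prints a candidate command for UFW or iptables without running it. `print_allow_rule` is a hypothetical helper; whether the `INPUT` chain is the right place depends on your firewall layout, so treat the output as a starting point to review.

```bash
# Hypothetical helper: print (do not execute) a candidate allow rule.
# Arguments: firewall tool, source subnet, destination IP, TCP port.
print_allow_rule() {
  local tool="$1" subnet="$2" dest="$3" port="$4"
  case "$tool" in
    ufw)
      echo "ufw allow proto tcp from $subnet to $dest port $port" ;;
    iptables)
      echo "iptables -I INPUT -s $subnet -d $dest -p tcp --dport $port -j ACCEPT" ;;
    *)
      echo "unsupported tool: $tool" >&2
      return 1 ;;
  esac
}

# Values are the common defaults mentioned above; substitute your own.
print_allow_rule ufw 172.18.0.0/16 172.17.0.1 11434
```

Printing the rule first keeps the step reviewable; run the printed command with root privileges only after confirming the subnet and IP match your system.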

### Verify the Fix

1. Re-run the cluster container check:

```bash
docker exec openshell-cluster-<gateway> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
```

2. Re-test from a sandbox:

```bash
curl -sS https://inference.local/v1/models
```

Both commands should return the upstream model list.

### If It Still Fails

- Confirm the backend listens on a host-reachable address: `ss -ltnp | grep ':11434\b'`
- Confirm the provider points at the host alias path you expect: `openshell provider get <provider-name>`
- Confirm the active inference route: `openshell inference get`
- Inspect sandbox logs for upstream timeout details: `openshell logs <sandbox-name> --since 10m`

## Common Failure Patterns

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `openshell inference get` shows `Not configured` | No managed inference route configured | `openshell inference set --provider <name> --model <id>` |
| `failed to verify inference endpoint` | Bad base URL, wrong credentials, wrong provider type, or upstream not reachable | Fix provider config, then rerun `openshell inference set`; use `--no-verify` only when the endpoint is intentionally offline |
| Base URL uses `127.0.0.1` | Loopback resolves inside the gateway's own runtime, not on the host running the model server | Use `host.openshell.internal` or another gateway-reachable host |
| Local engine works only when gateway is local | Gateway moved to remote host | Run the engine on the gateway host, add a tunnel, or use direct external access |
| `connection not allowed by policy` on `inference.local` | Unsupported path or method | Use a supported inference API path |
| `no compatible route` | Provider type does not match request shape | Switch provider type or change the client API |
| Direct call to external host is denied | Missing policy or provider attachment | Update `network_policies` and launch sandbox with the right provider |
| SDK fails on empty auth token | Client requires a non-empty API key even though OpenShell injects the real one | Use any placeholder token such as `test` |
| Upstream timeout from container to host-local backend | Host firewall or network config blocks container-to-host traffic | Allow the Docker bridge subnet to reach the inference port on the host gateway IP (see firewall fix section above) |

## Full Diagnostic Dump

Run this when you want a compact report before deciding on a fix:

```bash
echo "=== Gateway Status ==="
openshell status

echo "=== Gateway Info ==="
openshell gateway info

echo "=== Managed Inference ==="
openshell inference get

echo "=== Providers ==="
openshell provider list

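echo "=== Selected Provider ==="
# Substitute the provider name reported by `openshell inference get`.
# The placeholder is quoted so the shell does not parse < and > as redirections.
openshell provider get "<provider-name>"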

echo "=== Sandbox Probe ==="
openshell sandbox create -- curl https://inference.local/v1/chat/completions --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'
```

When you report back, state:

1. Which inference path is failing (`inference.local` vs direct external)
2. Whether gateway topology is part of the problem
3. The most likely root cause
4. The exact fix commands the user should run
5 changes: 3 additions & 2 deletions .agents/skills/openshell-cli/SKILL.md
@@ -208,7 +208,7 @@ openshell sandbox delete sandbox-1 sandbox-2 sandbox-3 # Multiple at once

This is the most important multi-step workflow. It enables a tight feedback cycle where sandbox policy is refined based on observed activity.

-**Key concept**: Policies have static fields (immutable after creation: `filesystem_policy`, `landlock`, `process`) and dynamic fields (hot-reloadable on a running sandbox: `network_policies`, `inference`). Only dynamic fields can be updated without recreating the sandbox.
+**Key concept**: Policies have static fields (immutable after creation: `filesystem_policy`, `landlock`, `process`) and one dynamic field (`network_policies`). Only `network_policies` can be updated without recreating the sandbox.

```
Create sandbox with initial policy
@@ -272,7 +272,7 @@ Edit `current-policy.yaml` to allow the blocked actions. **For policy content au
- Enforcement modes (`audit` vs `enforce`)
- Binary matching patterns

-Only `network_policies` and `inference` sections can be modified at runtime. If `filesystem_policy`, `landlock`, or `process` need changes, the sandbox must be recreated.
+Only `network_policies` can be modified at runtime. If `filesystem_policy`, `landlock`, or `process` need changes, the sandbox must be recreated.

### Step 5: Push the updated policy

@@ -564,4 +564,5 @@ $ openshell sandbox upload --help
|-------|------------|
| `generate-sandbox-policy` | Creating or modifying policy YAML content (network rules, L7 inspection, access presets, endpoint configuration) |
| `debug-openshell-cluster` | Diagnosing cluster startup or health failures |
| `debug-inference` | Diagnosing `inference.local`, host-backed local inference, and provider base URL issues |
| `tui-development` | Developing features for the OpenShell TUI (`openshell term`) |
2 changes: 1 addition & 1 deletion .agents/skills/openshell-cli/cli-reference.md
@@ -270,7 +270,7 @@ View sandbox logs. Supports one-shot and streaming.

### `openshell policy set <name> --policy <PATH>`

-Update the policy on a live sandbox. Only dynamic fields (`network_policies`, `inference`) can be changed at runtime.
+Update the policy on a live sandbox. Only the dynamic `network_policies` field can be changed at runtime.

| Flag | Default | Description |
|------|---------|-------------|
3 changes: 2 additions & 1 deletion .agents/skills/triage-issue/SKILL.md
@@ -91,7 +91,7 @@ Check whether the issue body contains a substantive agent diagnostic section. Lo
>
> This issue was opened without an agent investigation.
>
-> OpenShell is an agent-first project before we triage this, please point your coding agent at the repo and have it investigate. Your agent can load skills like `debug-openshell-cluster` (for cluster issues), `openshell-cli` (for usage questions), or `generate-sandbox-policy` (for policy help).
+> OpenShell is an agent-first project - before we triage this, please point your coding agent at the repo and have it investigate. Your agent can load skills like `debug-openshell-cluster` (for cluster issues), `debug-inference` (for inference setup issues), `openshell-cli` (for usage questions), or `generate-sandbox-policy` (for policy help).
>
> See [CONTRIBUTING.md](https://github.com/NVIDIA/OpenShell/blob/main/CONTRIBUTING.md#before-you-open-an-issue) for the full workflow.
>
@@ -123,6 +123,7 @@ Based on the sub-agent's analysis, also attempt to validate the report directly:
- For bug reports: check the relevant code paths, look for the described failure mode
- For feature requests: assess feasibility against the existing architecture
- For cluster/infrastructure issues: reference the `debug-openshell-cluster` skill's known failure patterns
- For inference and provider-topology issues: reference the `debug-inference` skill's known failure patterns
- For CLI/usage issues: reference the `openshell-cli` skill's command reference

## Step 5: Classify