llama-dash

llama-dash turns a self-hosted local inference box into an observable, policy-controlled AI gateway: one UI for model state, request history, API keys, routing rules, proxy metrics, and client setup. The implemented inference backend is currently llama-swap over llama.cpp.

It is the single public entrypoint for OpenAI-compatible and Anthropic-compatible clients. llama-dash owns proxy policy, logging, auth, routing, and backend normalization; your selected inference backend owns local model processes and inference when traffic is routed to local models.

OpenAI SDK / Claude Code / Continue / Open WebUI
                    │
                    ▼
              llama-dash :3000
      dashboard · auth · logs · routing · metrics
             │                     │
             ▼                     ▼
      llama-swap :8080         direct /v1 upstreams
  llama.cpp models · peers      OpenAI · Anthropic
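
As a rough illustration of the single-entrypoint layout above, an OpenAI-compatible client only needs its base URL and key pointed at llama-dash. The sketch below builds such a request by hand. The host, the key value, the model name, and the use of a Bearer `Authorization` header are all assumptions for illustration, not values taken from this project.

```typescript
// Plain description of an HTTP request, kept free of fetch types so the
// snippet runs anywhere TypeScript does.
interface ProxyRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build an OpenAI-style chat completion request aimed at llama-dash.
// baseUrl, apiKey, and model are hypothetical values supplied by the caller.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string,
): ProxyRequest {
  const trimmed = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${trimmed}/v1/chat/completions`,
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${apiKey}`, // assumed key transport
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  };
}

const chatReq = buildChatRequest("http://localhost:3000/", "sk-example", "my-local-model", "Hello");
console.log(chatReq.url); // http://localhost:3000/v1/chat/completions
```

Passing `chatReq.url` together with the remaining fields to `fetch` would send the request through the gateway.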

✨ What it does

Dashboard — live stats, sparklines, model timeline, upstream health, GPU monitoring, SSE-backed live refresh with ETag polling fallback, and update status.
Model management — load/unload models, per-model stats, load history, config snippet.
Request logging — every completed /v1/* call is queued for SQLite logging with searchable UI, histogram, detail view, credential injection audit metadata, retention controls, and token cost estimates from a startup-cached models.dev pricing catalog.
Transparent proxy — streaming SSE preserved, bounded body capture for logs, token counts scraped in-flight. Both the OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) shapes are supported. For example, point Claude Code at llama-dash via ANTHROPIC_BASE_URL to proxy and track your Claude Code usage as well.
API keys — per-key rate limits (RPM/TPM), model and MCP relay allow-lists editable from detail page, hashed at rest, per-key stats and model usage breakdown.
Dashboard auth — Better Auth username/password and passkey session gate for the UI and /api/* with first-visit signup; /v1/* proxy auth stays API-key based.
Policies — custom routing rules with real proxy enforcement for continue, model rewrite, and policy reject actions, plus explicit auth passthrough, direct HTTPS /v1 upstream targets, MCP relay endpoints with Claude Code snippets, encrypted upstream credential injection, header credential placeholder replacement, per-key system prompt injection, and global request size limits.
Attribution — configurable header mapping for client, end-user, and session metadata with setup examples for common clients.
Request auditing — per-key usage tracking across all proxied calls.
Prometheus metrics — /metrics exposes proxy request, token, credential injection, latency-window, queue, upstream, running-model, and GPU gauges.
GPU monitoring — NVIDIA, AMD, and Apple Silicon. VRAM, utilization, temp, power.
Config editor — edit llama-swap config.yaml in-browser with on-demand validation, enforced pre-save schema checks, and auto-reload.
Inference backend facade — backend health, model list/running state, lifecycle actions, logs, and config are capability-driven so future runtimes can be added without weakening the llama-swap experience.
Endpoints — copyable base URL, API key selector, code examples for curl, Python, TypeScript, Home Assistant, Claude Code, opencode, Open WebUI, and more.
Playground — supports chat, image, speech, and transcription; Speech can turn plain text or a server-extracted article URL into audio, with paragraph-chunked article playback for faster starts.
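
To make the "token counts scraped in-flight" idea from the Transparent proxy item more concrete, here is a minimal sketch of pulling usage counters out of an OpenAI-style SSE stream. It is not the project's implementation: it works on already-buffered text rather than live chunks, and the function and type names are hypothetical. It assumes the upstream includes a `usage` object in one of the `data:` events.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

// Scan SSE text for `data:` lines and keep the last usage object seen.
// The stream content itself is never modified, only inspected.
function scrapeUsage(sseText: string): Usage | null {
  let usage: Usage | null = null;
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue;
    try {
      const parsed: unknown = JSON.parse(payload);
      const u = (parsed as {
        usage?: { prompt_tokens?: unknown; completion_tokens?: unknown } | null;
      }).usage;
      if (u && typeof u.prompt_tokens === "number" && typeof u.completion_tokens === "number") {
        usage = { promptTokens: u.prompt_tokens, completionTokens: u.completion_tokens };
      }
    } catch {
      // Non-JSON data lines are ignored; they are still forwarded to the client.
    }
  }
  return usage;
}

const sampleStream = [
  'data: {"choices":[{"delta":{"content":"Hi"}}],"usage":null}',
  "",
  'data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":3,"total_tokens":15}}',
  "",
  "data: [DONE]",
  "",
].join("\n");

const scraped = scrapeUsage(sampleStream);
console.log(scraped);
```

A streaming proxy would apply the same line-level check to each chunk as it passes through, which is why scraping does not require buffering the whole response.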

Screenshots:

Dashboard: live traffic, tokens, model residency, upstream and GPU health.
Playground: chat against local endpoints with request/response inspection.
Request detail: routing, attribution, latency, tokens, and payload metadata.
Logs: raw llama-swap, proxy, and upstream streams in one viewer.
Model detail: load history, stats, recent requests, and config context.
Speech playground: article read-aloud and audio testing.
Policies: aliases, routing rules, passthrough auth, and request limits.
Requests: searchable history with filters, sorting, and histogram.

⚡ Quick start (Docker Compose)

Choose the compose file that matches your GPU vendor. Both setups use ./config/config.yaml for llama-swap config, ./models/ for model files, and expose llama-dash on http://localhost:3000.

AMD / ROCm

cp config/config.example.yaml config/config.yaml  # edit models
docker compose -f docker-compose.amd.yaml up -d

docker-compose.amd.yaml runs ghcr.io/mostlygeek/llama-swap:rocm, passes through /dev/kfd and /dev/dri, and also mounts /dev/dri into llama-dash so AMD GPU stats work in the dashboard.

NVIDIA / CUDA

cp config/config.example.yaml config/config.yaml  # edit models
docker compose -f docker-compose.nvidia.yaml up -d

docker-compose.nvidia.yaml runs ghcr.io/mostlygeek/llama-swap:cuda and requests gpus: all for the llama-swap service. This requires the NVIDIA Container Toolkit on the host.

🏗️ Manual setup

Requirements

Node 24+
pnpm
A reachable llama-swap instance

Install

cp .env.example .env   # edit INFERENCE_BASE_URL to point at your instance
pnpm install
pnpm db:migrate        # creates data/dash.db
pnpm dev               # http://localhost:5173

🏔️ Environment

Copy .env.example to .env and fill in the values.

| Variable | Default | Notes |
| --- | --- | --- |
| `INFERENCE_BACKEND` | `llama-swap` | Active inference backend. Only `llama-swap` is currently implemented. |
| `INFERENCE_BASE_URL` | `http://localhost:8080` | Inference backend base URL. No trailing slash. |
| `INFERENCE_INSECURE` | `false` | Skip TLS verification for inference backend with self-signed certs. |
| `INFERENCE_CONFIG_FILE` | (empty) | Absolute path to the backend config file. Required for the llama-swap config editor. |
| `DATABASE_PATH` | `data/dash.db` | SQLite file, relative to CWD. SQLite `:memory:` and `file:` URI paths are preserved for tests/special deployments. |
| `BETTER_AUTH_SECRET` | | Secret for signing Better Auth session data; `openssl rand -base64 33` |
| `BETTER_AUTH_URL` | inferred | Optional external base URL for Better Auth redirects/cookies. Set this to the public HTTPS origin when using passkeys outside localhost. |
| `CREDENTIAL_ENCRYPTION_KEY` | | 32+ character secret used to encrypt stored upstream provider credentials. Required before creating, placeholder-replacing, or injecting upstream credentials. |
| `UPSTREAM_HEADERS_TIMEOUT_MS` | `600000` | Upstream proxy fetch headers timeout (ms); `0` disables. Keep generous for long non-streaming jobs (image gen) that send no response headers until done. |
| `UPSTREAM_BODY_TIMEOUT_MS` | `0` | Upstream proxy fetch body timeout (ms); `0` disables. |

See docs/2026_05_03_inference_backends.md for the backend abstraction, capability model, and future Ollama notes.
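
For readers who want to see how the documented defaults fit together, the following is a minimal sketch of reading these variables from an environment map. It is not llama-dash's actual loader: the `DashConfig` shape, the `loadConfig` name, and the decision to throw on an unknown backend or a short encryption key are assumptions made for illustration.

```typescript
interface DashConfig {
  inferenceBackend: string;
  inferenceBaseUrl: string;
  inferenceInsecure: boolean;
  databasePath: string;
  upstreamHeadersTimeoutMs: number;
  upstreamBodyTimeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Apply the documented defaults and a few sanity checks (hypothetical helper).
function loadConfig(env: Env): DashConfig {
  const backend = env["INFERENCE_BACKEND"] ?? "llama-swap";
  if (backend !== "llama-swap") {
    throw new Error(`Unsupported INFERENCE_BACKEND: ${backend}`);
  }
  const key = env["CREDENTIAL_ENCRYPTION_KEY"];
  if (key !== undefined && key.length < 32) {
    throw new Error("CREDENTIAL_ENCRYPTION_KEY must be 32+ characters");
  }
  return {
    inferenceBackend: backend,
    // The table asks for no trailing slash, so strip one defensively.
    inferenceBaseUrl: (env["INFERENCE_BASE_URL"] ?? "http://localhost:8080").replace(/\/+$/, ""),
    inferenceInsecure: env["INFERENCE_INSECURE"] === "true",
    databasePath: env["DATABASE_PATH"] ?? "data/dash.db",
    upstreamHeadersTimeoutMs: Number(env["UPSTREAM_HEADERS_TIMEOUT_MS"] ?? "600000"),
    upstreamBodyTimeoutMs: Number(env["UPSTREAM_BODY_TIMEOUT_MS"] ?? "0"),
  };
}

const defaults = loadConfig({});
console.log(defaults.inferenceBaseUrl, defaults.upstreamHeadersTimeoutMs);
```

In a Node process the same function could be called with `process.env`.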

⚙️ How it's wired

src/server/proxy/* — the /v1/* pass-through: streaming SSE preserved, proxy context/body snapshots kept isolated, bounded request/response capture for logs, token counts scraped from responses as they fly by, and one queued SQLite row per completed request.
src/server/pricing.ts — startup-cached models.dev model pricing used to estimate logged request cost from upstream usage counters.
src/server/admin/* — the /api/* admin surface consumed by the UI, with grouped route modules under src/server/admin/routes/* for models, requests, config, keys, aliases, routing, MCP relays, upstream credentials, settings, and system health. JSON GET responses support conditional ETag polling, /api/events streams lightweight dashboard events, and /api/log-events streams llama-swap logs only while the Logs page is mounted.
src/server/article-extract.ts — server-side article URL fetcher for the Speech playground. It validates public HTTP(S) URLs, caps HTML fetch size/time, parses readable article text with Readability, and returns editable text for TTS.
src/server/mcp-relay/* — configured /mcp-relays/:slug reverse proxies for coding-agent MCP HTTP transports. Relays require x-llama-dash-api-key, the key must allow the relay, stored upstream credentials are injected into outbound headers, responses stream through, and the exchange is logged without exposing provider secrets. Successful relay requests are metadata-only by default; failures keep the normal bounded debug capture policy.
src/server/auth.ts — Better Auth setup for dashboard username/password and passkey sessions; protects UI and /api/*, not /v1/*. Signup is only allowed while no dashboard user exists.
src/server/gpu-poller.ts — polls nvidia-smi / rocm-smi / system_profiler every 10s, caches result in memory, and publishes GPU-change events for live dashboard refresh. AMD APUs use GTT (not VRAM) for actual usable memory; Apple shows unified memory and core count when available.
src/server/model-watcher.ts — polls the inference backend running-model capability every 15s, diffs state, writes load/unload events to model_events table, and publishes model-change events.
src/server/inference/* — selected inference backend facade plus backend-specific adapters and hints.
src/server/llama-swap/client.ts — typed client over llama-swap's HTTP API.
src/server/db/* — Drizzle schema, SQLite initialization, and request/model-event indexes for common dashboard query paths. Apply migrations explicitly with pnpm db:migrate.
src/server/request-log-maintenance.ts — hourly request-log retention cleanup plus manual prune/compact admin actions to keep the SQLite file bounded on high-frequency relay traffic.
src/server/metrics.ts — Prometheus text metrics for proxy requests, tokens, latency window gauges, queue depth/drops, upstream reachability, running models, and GPU gauges at /metrics.
Dockerfile, prod-server.mjs, docker-compose.amd.yaml, docker-compose.nvidia.yaml — production container packaging for llama-dash by itself or bundled with llama-swap.
src/routes/* — thin TanStack Start route entrypoints for /, /login, /models, /models/:id, /requests, /logs, /system, /playground, /config, /settings, /keys, /keys/:id, /attribution, /policies, /endpoints.
src/features/* — feature-local page components and helpers grouped by route area (dashboard, requests, keys, models, playground, etc.).
src/lib/queries.ts — TanStack Query hooks with SSE-driven cache invalidation for request, model, GPU, and system changes plus slow ETag polling fallback.
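
The model watcher described above polls running-model state and diffs it into load/unload events. A minimal sketch of that diff step, with hypothetical names (`ModelEvent`, `diffRunningModels`) and no claim to match the code in src/server/model-watcher.ts, might look like this:

```typescript
interface ModelEvent {
  model: string;
  kind: "load" | "unload";
}

// Compare two snapshots of running model IDs and report what changed.
function diffRunningModels(
  previous: ReadonlySet<string>,
  current: ReadonlySet<string>,
): ModelEvent[] {
  const events: ModelEvent[] = [];
  for (const model of current) {
    if (!previous.has(model)) events.push({ model, kind: "load" });
  }
  for (const model of previous) {
    if (!current.has(model)) events.push({ model, kind: "unload" });
  }
  return events;
}

// Example: model "a" was unloaded and model "c" was loaded between two polls.
const watcherEvents = diffRunningModels(new Set(["a", "b"]), new Set(["b", "c"]));
console.log(watcherEvents);
```

Each returned event would correspond to one row in an events table and one change notification; an empty array means the poll saw no change and nothing needs to be written.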

✴️ Claude Code / Anthropic passthrough

Route any Anthropic SDK client (including Claude Code) through llama-dash for logging, filtering, and per-request inspection. Anthropic subscription logins are supported. Traffic flows:

Claude Code ──► llama-dash :3000 (log + filter) ──► api.anthropic.com

Client config (~/.claude/settings.json):

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<llama-dash-host>:3000",
    "CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY": "1"
  }
}

Leave ANTHROPIC_AUTH_TOKEN unset when using subscription OAuth — Claude Code manages the bearer itself and llama-dash passes it through unchanged.

In llama-dash, configure an explicit routing rule in Policies that matches the /v1/* target path or Claude source model names. Use the continue action, passthrough auth, preserved client Authorization, and the direct target https://api.anthropic.com/v1. All Anthropic requests are then proxied transparently while llama-dash logs the traffic and applies filters.
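
Routing rules are configured in the Policies UI, so the following is only a conceptual sketch of how matching by path or model name could work. The rule shape, the field names, and the first-match-wins ordering are all assumptions, not llama-dash's stored schema.

```typescript
type RuleAction =
  | { kind: "continue"; target?: string }
  | { kind: "rewrite"; model: string }
  | { kind: "reject"; reason: string };

// Hypothetical rule shape: every condition that is set must match.
interface RoutingRule {
  label: string;
  pathPrefix?: string;
  modelPrefix?: string;
  action: RuleAction;
}

interface IncomingRequest {
  path: string;
  model: string;
}

// Return the first rule whose conditions all match (assumed ordering).
function matchRule(rules: readonly RoutingRule[], req: IncomingRequest): RoutingRule | undefined {
  return rules.find(
    (rule) =>
      (rule.pathPrefix === undefined || req.path.startsWith(rule.pathPrefix)) &&
      (rule.modelPrefix === undefined || req.model.startsWith(rule.modelPrefix)),
  );
}

const exampleRules: RoutingRule[] = [
  {
    label: "anthropic-passthrough",
    modelPrefix: "claude-",
    action: { kind: "continue", target: "https://api.anthropic.com/v1" },
  },
  { label: "default-local", pathPrefix: "/v1/", action: { kind: "continue" } },
];

const claudeMatch = matchRule(exampleRules, { path: "/v1/messages", model: "claude-example" });
console.log(claudeMatch?.label);
```

Here a request for a model whose name starts with `claude-` is sent to the direct Anthropic target, while other /v1/ traffic falls through to the local backend.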

For provider-key flows, store the upstream token in the Policies credential vault and add a credential binding to the routing rule. Bindings can either set an outbound header automatically or replace a client-sent header placeholder like {{llama-dash:credential:anthropic-prod}}; injected values are redacted from request logs. The request detail view records non-secret credential name/slug metadata for audit, and the credential vault warns before deleting credentials still referenced by routing rules or MCP relays. Credential-bearing passthrough rules still require a valid llama-dash API key; they are not evaluated as pre-auth public passthrough rules.
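
The placeholder flow above can be pictured as a header rewrite that produces two views of the same request: one sent upstream with the secret, and one stored in logs without it. This is a minimal sketch under stated assumptions: the slug character set, the `[redacted]` marker, the in-memory vault map, and the choice to leave unknown slugs untouched are illustrative and not taken from the project.

```typescript
// Assumed slug alphabet: lowercase letters, digits, and hyphens.
const PLACEHOLDER = /\{\{llama-dash:credential:([a-z0-9-]+)\}\}/g;

interface InjectionResult {
  outbound: Record<string, string>; // headers sent upstream
  logged: Record<string, string>; // headers safe to store in request logs
  usedSlugs: string[]; // non-secret audit metadata
}

// vault maps credential slug -> decrypted secret (hypothetical in-memory stand-in).
function injectCredentials(
  headers: Record<string, string>,
  vault: Record<string, string | undefined>,
): InjectionResult {
  const outbound: Record<string, string> = {};
  const logged: Record<string, string> = {};
  const usedSlugs: string[] = [];
  for (const [headerName, value] of Object.entries(headers)) {
    let touched = false;
    outbound[headerName] = value.replace(PLACEHOLDER, (match: string, slug: string) => {
      const secret = vault[slug];
      if (secret === undefined) return match; // unknown slug: left as-is in this sketch
      touched = true;
      if (!usedSlugs.includes(slug)) usedSlugs.push(slug);
      return secret;
    });
    // Any header that received a secret is redacted as a whole in the log view.
    logged[headerName] = touched ? "[redacted]" : value;
  }
  return { outbound, logged, usedSlugs };
}

const injected = injectCredentials(
  { "x-api-key": "{{llama-dash:credential:anthropic-prod}}", accept: "application/json" },
  { "anthropic-prod": "sk-ant-secret" },
);
console.log(injected.logged, injected.usedSlugs);
```

Keeping `usedSlugs` separate from the header values mirrors the idea that the audit trail records which credential was used without ever recording its value.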

For remote MCP servers, create an MCP relay in Policies instead of pointing Claude Code directly at the provider. Configure Claude with the relay URL and a separate llama-dash key header:

{
  "mcpServers": {
    "hyperline_sandbox": {
      "type": "http",
      "url": "http://<llama-dash-host>:5173/mcp-relays/hyperline-sandbox",
      "headers": {
        "x-llama-dash-api-key": "sk-..."
      }
    }
  }
}

The provider bearer token stays encrypted in llama-dash and is injected only when an API key that is explicitly allowed to use the relay forwards the MCP request upstream. The Policies page shows this Claude Code snippet for each configured relay.
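
The relay access rule described here (a llama-dash key must be presented, and that key must be allowed to use the relay) can be sketched as a small decision function. The record shape, the status codes, and the plain-text key lookup are assumptions for illustration; the project stores keys hashed at rest, so a real lookup would compare hashes.

```typescript
interface ApiKeyRecord {
  id: string;
  allowedRelays: string[]; // relay slugs this key may use
}

type RelayDecision =
  | { ok: true; keyId: string }
  | { ok: false; status: 401 | 403; reason: string };

// Decide whether a request may use the relay identified by `slug`.
// `keys` is a hypothetical lookup table keyed by the presented key string.
function authorizeRelay(
  headers: Record<string, string | undefined>,
  slug: string,
  keys: Record<string, ApiKeyRecord | undefined>,
): RelayDecision {
  const presented = headers["x-llama-dash-api-key"];
  if (!presented) {
    return { ok: false, status: 401, reason: "missing x-llama-dash-api-key" };
  }
  const record = keys[presented];
  if (!record) {
    return { ok: false, status: 401, reason: "unknown API key" };
  }
  if (!record.allowedRelays.includes(slug)) {
    return { ok: false, status: 403, reason: `key not allowed for relay ${slug}` };
  }
  return { ok: true, keyId: record.id };
}

const relayKeys: Record<string, ApiKeyRecord | undefined> = {
  "sk-example": { id: "key-1", allowedRelays: ["hyperline-sandbox"] },
};

const allowed = authorizeRelay({ "x-llama-dash-api-key": "sk-example" }, "hyperline-sandbox", relayKeys);
console.log(allowed);
```

Only after a decision with `ok: true` would the stored provider credential be attached to the outbound request.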

🤖 Acknowledgements

This project was developed with significant assistance from LLMs. Architecture decisions, implementation, and documentation were all shaped through human-AI collaboration.

📝 License

MIT
