From 2baf0f8ead0c3268cf5deb7b0968b220aaeb7006 Mon Sep 17 00:00:00 2001 From: "Luma (Enclave AI)" Date: Mon, 11 May 2026 18:50:14 +0000 Subject: [PATCH] docs: revamp README, add comprehensive docs/ structure --- README.md | 179 +------------------------- docs/cloud-models.md | 141 +++++++++++++++++++++ docs/design.md | 67 ++++++++++ docs/faq.md | 164 ++++++++++++++++++++++++ docs/getting-started.md | 156 +++++++++++++++++++++++ docs/install.md | 257 ++++++++++++++++++++++++++++++++++++++ docs/models.md | 64 ++++++++++ docs/multi-machine.md | 201 +++++++++++++++++++++++++++++ docs/overview.md | 122 ++++++++++++++++++ docs/secret-management.md | 138 ++++++++++++++++++++ docs/smart-router.md | 111 ++++++++++++++++ docs/troubleshooting.md | 10 +- 12 files changed, 1429 insertions(+), 181 deletions(-) create mode 100644 docs/cloud-models.md create mode 100644 docs/design.md create mode 100644 docs/faq.md create mode 100644 docs/getting-started.md create mode 100644 docs/install.md create mode 100644 docs/models.md create mode 100644 docs/multi-machine.md create mode 100644 docs/overview.md create mode 100644 docs/secret-management.md create mode 100644 docs/smart-router.md diff --git a/README.md b/README.md index d26cb7b..651ddd6 100644 --- a/README.md +++ b/README.md @@ -1,188 +1,15 @@ # ai-stack -A self-hosted AI stack optimised for **Intel Arc iGPU** on Linux, built around Ollama + OpenCode. Provides local LLM inference (ollama-arc), cloud API routing (LiteLLM), unified routing (Olla), and Obsidian vault RAG (retriever). The primary AI interface is **OpenCode** (CLI + Obsidian sidebar plugin). +A self-hosted, zero-cost AI development stack built around local LLMs. Run powerful language models on your own hardware — no token spend, no cloud dependency. Free-tier cloud models available as an optional augment. -Built and documented through real-world homelab experience on Intel Arc hardware. +Built around [Ollama](https://ollama.com), [OpenCode](https://opencode.ai), and a lightweight smart router that automatically selects the right model for each request. --- -## What's included - -| Component | Purpose | -|-----------|---------| -| **Ollama (ava-agentone/ollama-intel)** | LLM inference with Intel Arc iGPU acceleration via OneAPI/SYCL | -| **LiteLLM** | Cloud API gateway (Claude, Gemini) | -| **Olla** | Unified LLM router with load balancing | -| **Smart Router** | Content-based model selection (OpenCode → router → Olla) | -| **Retriever** | Lightweight Obsidian vault RAG (sqlite-vec + FTS5, hybrid search) | -| **OpenCode** | Primary AI interface — CLI tool + Obsidian sidebar plugin | - ---- - -## Hardware requirements - -| Component | Minimum | Recommended | -|-----------|---------|-------------| -| CPU | Intel Core Ultra (Meteor Lake) | Intel Core Ultra 9 185H | -| RAM | 16 GB | 32 GB | -| GPU | Intel Arc iGPU | Intel Arc iGPU (any Meteor/Arrow Lake) | -| Storage | 50 GB free | 100 GB+ free (models are large) | -| OS | Ubuntu 22.04 | Ubuntu 24.04 | - ---- - -## Quick start - -```bash -# 1. Clone the repo -git clone https://github.com/growlf/ai-stack.git -cd ai-stack - -# 2. Configure -cp .env.example .env -nano .env # set your username, paths, and API keys (or skip — install.sh can set up Bitwarden) - -# 3. Install -chmod +x install.sh scripts/check-arc-gpu.sh -./install.sh - -# The installer will: -# - Prompt to install OpenCode CLI + Bun -# - Auto-install the OpenCode Obsidian plugin (growlf/opencode-obsidian) -# - Prompt to configure Bitwarden/VaultWarden for secret management -# - Start the stack (ollama-arc, litellm, olla, router, retriever) -# - Prompt to pull models -``` - ---- - -## Project structure - -``` -ai-stack/ -├── install.sh # Main installer -├── docker-compose.yml # Full stack definition -├── .env.example # All configurable values -├── systemd/ -│ └── ai-stack.service # Systemd unit (auto-start on boot) -├── scripts/ -│ ├── check-arc-gpu.sh # GPU pre-flight (detects card0/card1 drift) -│ ├── discover-herd.sh # mDNS discovery of remote Ollama nodes -│ ├── generate-olla-config.sh # Reads .env → writes proxy/olla.yaml -│ ├── resolve-vaultwarden.sh # Resolves placeholders via bw CLI -│ └── generate-keys.sh # Generates secure keys (LITELLM_MASTER_KEY) -├── router/ -│ └── smart-model-router.py # Content-based model routing (OpenCode → router → Olla) -├── retriever/ -│ ├── main.py # FastAPI app -│ ├── search.py # Hybrid search (FTS5 + vector, RRF fusion) -│ ├── indexer.py # Vault scanner + watchdog + chunking -│ └── Dockerfile -`scripts/olla.yaml.template` # Olla config template (generated by generate-olla-config.sh) -├── proxy/ -│ └── litellm_config.yaml # LiteLLM model registry (Claude, Gemini) -├── .opencode/ -│ └── tools/ -│ └── vault-search.ts # OpenCode tool: search vault via retriever API -└── docs/ - ├── deployment-guide.md # Setup walkthrough - ├── model-guide.md # Model recommendations and routing - ├── troubleshooting.md # Common issues and fixes - └── retriever-guide.md # Obsidian vault RAG setup -``` - ---- - -## Model stack - -| Model | Use case | -|-------|----------| -| `gemma4:27b` | Heavy lifting, large context, complex analysis | -| `mistral-small3.2:24b` | Strong function calling, 128K context | -| `qwen3.5:14b` | Improved reasoning, tool calling (recommended default) | -| `qwen2.5:14b` | Tool calling, diagnostics, sysadmin (router default for diagnostics) | -| `qwen2.5-coder:14b` | Scripts, configs, code (router default for code) | -| `deepseek-r1:14b` | Complex reasoning, root cause analysis (router default for reasoning) | -| `gemma3:12b` | Log analysis, summaries, documentation (router default for longform) | -| `nomic-embed-text` | Embeddings / RAG | - -See **[docs/model-guide.md](docs/model-guide.md)** for details. - ---- - -## Known Intel Arc quirks - -- The DRI card node (`/dev/dri/card0` vs `card1`) can drift between reboots on Meteor Lake. The `check-arc-gpu.sh` pre-flight script detects and corrects this automatically. -- Intel iGPU uses shared system RAM — `runner.vram="0 B"` in Ollama logs is expected and normal. -- Use `OLLAMA_KEEP_ALIVE=-1` to keep models resident in memory between requests. -- `renderD128` is the compute node and is stable; only the `cardN` display node drifts. - ---- - -## Multi-machine setup - -Add remote Ollama nodes via `.env`: - -``` -OLLAMA_REMOTE_WORKSTATION=http://192.168.1.50:11434:75 -``` - -Then regenerate Olla config: - -```bash -bash scripts/generate-olla-config.sh -sudo systemctl restart ai-stack.service -``` - -Or auto-discover nodes on your LAN: - -```bash -bash scripts/discover-herd.sh --apply -``` - ---- - -## Secret management (optional) - -The stack can resolve API keys from Bitwarden (or self-hosted VaultWarden) at runtime using `` placeholders in `.env`: - -``` -ANTHROPIC_API_KEY= -``` - -The `install.sh` script prompts to set this up: -- Installs `bw` CLI (via npm) -- Collects your organization ID and API credentials -- Writes `BW_CLIENT_ID`, `BW_CLIENT_SECRET`, `VAULT_MASTER_PASSWORD` to `.env` -- Creates vaultwarden placeholders for `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, and `LITELLM_MASTER_KEY` -- Resolves them immediately - -On subsequent starts, `start.sh` auto-resolves any unresolved placeholders. - -Manual resolution: -```bash -./scripts/resolve-vaultwarden.sh # resolve in-place -./scripts/resolve-vaultwarden.sh --dry-run # preview only -``` - ---- - -## Updating the stack - -```bash -cd /path/to/ai-stack - -# Pull latest images -docker compose pull - -# Restart with new images -sudo systemctl restart ai-stack.service -``` +**[Get started → docs/install.md](docs/install.md)** --- ## Licence MIT — use freely, contributions welcome. - -Built with ☕ and stubbornness. diff --git a/docs/cloud-models.md b/docs/cloud-models.md new file mode 100644 index 0000000..f0dcd58 --- /dev/null +++ b/docs/cloud-models.md @@ -0,0 +1,141 @@ +# Cloud models (Claude, Gemini) + +ai-stack is designed to work entirely locally with no cloud dependency. But free-tier access to frontier cloud models (Claude and Gemini) is available as an optional augment — useful when a task genuinely needs more capability than a 14B local model provides. + +This is handled through LiteLLM, which proxies cloud API calls through a local endpoint. From OpenCode's perspective, it's just another provider. + +--- + +## What "free tier" means + +Both Anthropic (Claude) and Google (Gemini) offer free API access: + +- **Claude** (via Anthropic): Generous free tier on Claude Haiku and Claude Sonnet. Rate-limited, but sufficient for occasional use. Requires account creation at [console.anthropic.com](https://console.anthropic.com). +- **Gemini** (via Google AI Studio): Free tier on Gemini Flash and Gemini Pro. More liberal rate limits. Requires Google account at [aistudio.google.com](https://aistudio.google.com). + +Free tiers can be revoked or changed by the providers. Check current limits at their respective developer consoles. + +--- + +## Getting API keys + +### Claude (Anthropic) + +1. Go to [console.anthropic.com](https://console.anthropic.com) +2. Create an account (free) +3. Go to **API Keys** → **Create Key** +4. Copy the key (starts with `sk-ant-...`) + +### Gemini (Google AI Studio) + +1. Go to [aistudio.google.com](https://aistudio.google.com) +2. Sign in with your Google account +3. Click **Get API Key** → **Create API key in new project** +4. Copy the key + +--- + +## Adding keys to the stack + +### Via `.env` directly + +Edit `.env` and set: + +```bash +ANTHROPIC_API_KEY=sk-ant-your-key-here +GEMINI_API_KEY=your-gemini-key-here +``` + +Then restart: +```bash +sudo systemctl restart ai-stack.service +``` + +### Via Bitwarden/VaultWarden (recommended) + +If you configured Bitwarden during install, use the placeholder format instead: + +```bash +ANTHROPIC_API_KEY= +GEMINI_API_KEY= +``` + +The stack resolves these at startup. See [docs/secret-management.md](secret-management.md) for details. + +--- + +## Verifying cloud models are available + +After adding keys and restarting: + +```bash +# Check LiteLLM is healthy +curl http://localhost:4000/health/liveness + +# List available models +curl http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | python3 -m json.tool +``` + +You should see Claude and Gemini models in the list. + +--- + +## Using cloud models in OpenCode + +Cloud models are available through the LiteLLM provider (`:4000`). In OpenCode, you can switch providers or configure cloud models as a fallback. + +To direct a specific request to a cloud model, select the LiteLLM provider in OpenCode and choose the model explicitly. The smart router routes to local models by default — cloud models are not part of the automatic routing unless you configure them in the router's `MODELS` map. + +**When to use cloud models:** +- Complex reasoning that requires a frontier-scale model +- Very long documents that exceed local model context limits +- Tasks where output quality matters more than privacy/cost + +**When to stick with local:** +- Anything involving sensitive information +- Repetitive or bulk tasks (free tier has rate limits) +- When you need fast iteration (local is often faster for short tasks) + +--- + +## The LiteLLM configuration + +Cloud model definitions live in `proxy/litellm_config.yaml`. The default config includes: + +```yaml +model_list: + - model_name: claude-haiku + litellm_params: + model: anthropic/claude-haiku-20240307 + api_key: os.environ/ANTHROPIC_API_KEY + + - model_name: claude-sonnet + litellm_params: + model: anthropic/claude-sonnet-20240620 + api_key: os.environ/ANTHROPIC_API_KEY + + - model_name: gemini-flash + litellm_params: + model: gemini/gemini-1.5-flash + api_key: os.environ/GEMINI_API_KEY + + - model_name: gemini-pro + litellm_params: + model: gemini/gemini-1.5-pro + api_key: os.environ/GEMINI_API_KEY +``` + +To add more models or change which models are available, edit this file and restart the stack. + +--- + +## Rate limits and costs + +Free tiers have limits. If you hit them, LiteLLM will return a rate limit error (429). The stack does not automatically retry or fall back to another provider. + +To avoid surprises: +- Use local models for routine tasks +- Reserve cloud calls for tasks where the quality difference matters +- Watch your usage at the provider consoles + +If you start regularly hitting free tier limits and want to add paid credits, simply add credits to your Anthropic or Google AI account — no config changes needed. diff --git a/docs/design.md b/docs/design.md new file mode 100644 index 0000000..875bc4c --- /dev/null +++ b/docs/design.md @@ -0,0 +1,67 @@ +# Architecture and design decisions + +## Stack overview + +``` +OpenCode (CLI + Obsidian sidebar plugin) + | + |--- tool: retriever :42000 (vault RAG) + | FastAPI + sqlite-vec + watchdog + | hybrid search: FTS5 + vector (sqlite-vec) + | embeds via Olla → ollama (nomic-embed-text) + | vault mounted read-only at /vault + | + |--- provider: smart-router :40115 (model selection) + | |--- Olla :40114 (unified LLM router + load balancer) + | |--- ollama :11434 (local inference) + | |--- litellm :4000 (cloud API gateway: Claude, Gemini) + | |--- OLLAMA_REMOTE_* (LAN nodes, optional) + | + |--- provider: litellm :4000 (direct, optional) + +discoverer (systemd timer): mDNS scan → updates Olla config + OpenCode providers +``` + +## Why OpenCode over Open WebUI + +OpenCode provides a better development experience than Open WebUI for the target use case (code, configs, terminal work): +- Runs natively in the terminal and as an Obsidian sidebar plugin +- Tool use (vault search, shell commands) is first-class +- No persistent browser tab required + +Open WebUI was removed. Chat history volume and the admin panel were the only losses. + +## Why retriever (sqlite-vec) over Khoj + +Khoj required PostgreSQL and a full web service stack to provide Obsidian RAG. The retriever replaces it with: +- SQLite + sqlite-vec: no separate database process, file-based persistence +- FTS5 hybrid search: BM25 keyword + vector similarity, RRF fusion +- Watchdog: incremental indexing on vault file changes, no polling +- ~100MB footprint vs. Khoj's several GB + +The trade-off: no Khoj web UI or its Obsidian plugin. OpenCode's vault-search tool covers the use case directly. + +## Why Olla for routing + +Olla provides unified routing and load balancing across local Ollama nodes and the LiteLLM proxy. A single endpoint (`localhost:40114`) covers all models — local and cloud — without OpenCode needing separate provider configs per machine. + +## Smart router placement + +The smart router (`:40115`) sits in front of Olla. It classifies requests and selects models before Olla sees them. This keeps Olla's job simple (load balancing, failover) and puts routing intelligence in one place. + +The router is on the critical path for every request. Design constraint: per-request overhead must be sub-millisecond. The current implementation (compiled regex, in-memory capability registry, single JSON parse) meets this. See [docs/smart-router.md](smart-router.md) for the full latency profile. + +## Multi-machine architecture + +The stack is designed to span multiple machines. Olla aggregates local Ollama nodes and LAN-discovered nodes under one endpoint. `discover-herd.sh` handles mDNS discovery and config generation. + +The stretch goal is NetBird VPN integration — automatic WireGuard mesh between machines, allowing discovery and resource sharing over the internet, not just LAN. The foundation (Olla multi-node routing, `discover-network.sh` subnet scanning) is in place. NetBird adds the secure tunnel layer. + +## Services removed + +| Service | Replaced by | Reason | +|---------|------------|--------| +| Open WebUI | OpenCode | Better dev experience, no persistent browser tab | +| Pipelines | — | OpenWebUI-only | +| Open Terminal | — | Browser terminal, integrated with WebUI only | +| Khoj + PostgreSQL | retriever (sqlite-vec) | Heavy for single-user RAG; sqlite-vec is sufficient | diff --git a/docs/faq.md b/docs/faq.md new file mode 100644 index 0000000..ca3120a --- /dev/null +++ b/docs/faq.md @@ -0,0 +1,164 @@ +# Frequently asked questions + +--- + +## General + +### Is this really free to run? + +Yes, once you have the hardware. There are no subscription fees, no per-token costs for local models, and no usage quotas. You're running models on your own hardware using your own electricity. + +Cloud models (Claude, Gemini) via free-tier API keys are also free up to their rate limits. If you never configure cloud API keys, you'll never spend a cent. + +### Do I need a powerful GPU? + +No. You need a reasonably modern computer with enough RAM. Here's the practical breakdown: + +- **No dedicated GPU, 16–32 GB RAM**: CPU inference works. Slow (30–120 seconds per response for a 14B model), but functional. +- **Integrated GPU (Intel Arc iGPU), 16–32 GB RAM**: GPU-accelerated inference. 3–10 seconds per response for 14B models. This is the primary target of this stack. +- **Dedicated GPU (Nvidia/AMD), any VRAM**: Fast. 1–5 seconds per response depending on VRAM. + +For development work where you're not waiting on responses constantly, even CPU inference is useful. + +### Will my data leave my machine? + +Only if you explicitly configure cloud API keys (Anthropic, Google) and choose to use cloud models. All local model inference stays completely on your machine. The retriever service, smart router, and all local model calls are fully offline. + +### Can I run this on Windows or macOS? + +The stack is designed and tested on Linux. macOS may work for development purposes using Docker Desktop, but it's not officially supported and GPU acceleration won't work (Docker can't pass through GPU on macOS the same way). + +Windows is similarly untested. If you're on Windows, consider using WSL2 with Ubuntu. + +--- + +## Models + +### What's a "14B model"? What do the numbers mean? + +The number refers to parameters — roughly, how many numerical values the model has learned during training. More parameters generally means more capable but also more memory-hungry. + +- **7B–8B**: Fast, small footprint. Good for simple tasks. Fits in ~5 GB. +- **14B**: The sweet spot for most tasks on 32 GB RAM. ~8–9 GB footprint. +- **27B**: High-quality output. Needs ~16 GB GPU memory. Requires 48 GB+ total RAM to run alongside other models. + +Bigger isn't always better for every task — a 7B model fine-tuned for code can outperform a 14B general model on coding questions. + +### Why does the first response take so long? + +The model file (~8–15 GB) has to be loaded from disk into GPU memory. This happens on first use after startup or after the model has been evicted from memory (due to inactivity or loading another model). Subsequent responses in the same session are fast because the model stays loaded. + +You can force a model to stay loaded permanently with `OLLAMA_KEEP_ALIVE=-1` in `.env`. + +### Why can't I use tools with deepseek-r1? + +DeepSeek-R1 doesn't implement the OpenAI tools/function-calling API. This is a limitation of the model, not the stack. The smart router detects when a request includes tool definitions and routes it to a tools-capable model (Mistral, Qwen, etc.) instead. + +If you're calling the API directly (bypassing the smart router), you'll need to avoid sending `tools` or `functions` fields to deepseek-r1. + +### How do I know which model handled my request? + +The smart router logs every routing decision. Watch the router logs: +```bash +docker logs router --tail=20 -f +``` + +You can also query the router's capabilities endpoint to see what it would choose: +```bash +curl http://localhost:40115/v1/router/capabilities +``` + +### Can I add my own custom models? + +Yes. Pull any model that Ollama supports: +```bash +docker exec ollama ollama pull +``` + +To add it to the smart router's automatic routing, edit the `MODELS` dict in `router/smart_model_router.py` and assign it to a category. See [docs/smart-router.md](smart-router.md). + +To add a fine-tuned or custom GGUF model, place the `.gguf` file in your Ollama models directory and create a Modelfile. + +--- + +## Installation and setup + +### The installer failed partway through. How do I retry? + +The installer is designed to be re-runnable. Most steps are idempotent (safe to run multiple times). Just run `./install.sh` again. + +If a specific step is failing, check the error message and consult [docs/troubleshooting.md](troubleshooting.md). + +### I don't use Obsidian. Can I still use the stack? + +Absolutely. The retriever service is entirely optional. Skip the `RETRIEVER_VAULT_PATH` configuration and the retriever will start but remain idle. You can set `RETRIEVER_VAULT_PATH` to any folder of Markdown files — not just an Obsidian vault. + +The rest of the stack (local LLM inference, smart routing, cloud model access) works independently. + +### Can I use this without OpenCode? + +Yes. The stack exposes standard OpenAI-compatible APIs: +- Smart router: `http://localhost:40115/v1/chat/completions` +- Olla (direct): `http://localhost:40114/v1/chat/completions` +- LiteLLM: `http://localhost:4000/v1/chat/completions` + +Any tool that can connect to an OpenAI-compatible endpoint works — [Continue.dev](https://continue.dev) for VS Code, [Aider](https://aider.chat), [LM Studio](https://lmstudio.ai), or curl. + +Example with curl: +```bash +curl http://localhost:40115/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "auto", + "messages": [{"role": "user", "content": "Write a hello world in Go"}] + }' +``` + +### How do I update the stack? + +```bash +git pull +docker compose pull +sudo systemctl restart ai-stack.service +``` + +--- + +## Networking and performance + +### Responses are slow. What can I do? + +1. **Check if the model is loaded**: `curl http://localhost:11434/api/ps`. If empty, the model needs to load on first request. +2. **Check GPU is being used**: `docker logs ollama 2>&1 | grep -i "device\|gpu\|oneapi"`. If you see CPU-only, the GPU isn't being used. +3. **Check memory pressure**: If system RAM is near full, the OS may be swapping, which kills performance. +4. **Consider a smaller model**: A 7B model running on GPU is faster than a 14B model on CPU. + +### Can I run this on a headless server? + +Yes. The stack has no GUI dependencies. Everything is CLI and API-based. Run it on a headless server, access via SSH, and point OpenCode at the server's IP. + +### Does the smart router add latency? + +Negligibly — under 1ms per request. See [docs/smart-router.md](smart-router.md) for the full latency profile. + +### Can multiple people use this at once? + +Yes, with limits. Ollama queues requests — if two people send requests simultaneously, one waits while the other is processed. For a small team (2–4 people with light usage), a single local Ollama instance is usually fine. + +For heavier concurrent use, add more machines via Olla's multi-node routing — see [docs/multi-machine.md](multi-machine.md). + +--- + +## Contributing + +### How do I contribute? + +See [CONTRIBUTING.md](../CONTRIBUTING.md). Bug reports, documentation improvements, and hardware compatibility reports are all welcome. + +### I have a different GPU. Will this work? + +Intel Arc iGPU is the primary tested hardware. Nvidia works through standard Ollama (no special configuration). AMD ROCm support through Ollama is improving — check the [Ollama documentation](https://ollama.com) for current status. + +CPU-only works but is slow for 14B+ models. + +Hardware-specific guides are in [docs/hardware/](hardware/). diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..e632534 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,156 @@ +# Getting started + +You've installed ai-stack and the services are running. Now what? + +--- + +## Your first conversation + +Open a terminal and type: + +```bash +opencode +``` + +You'll see the OpenCode interface. Type a message and press Enter. + +**Try something simple first:** +``` +What's the difference between a process and a thread? +``` + +What happens: +1. Your message hits the smart router +2. The router classifies it (general question → default model) +3. The request goes to Olla, then to your local Ollama instance +4. The model generates a response on your hardware + +The first response on a freshly started model takes 5–30 seconds while the model loads into GPU memory. Subsequent responses in the same session are faster. + +--- + +## Understanding what you're talking to + +By default, OpenCode routes through the smart router at `http://localhost:40115`. The router picks the best available local model for your request. You can check what model handled your last request — the router logs its decisions. + +To see what models are available: +```bash +docker exec ollama ollama list +``` + +To see what's currently loaded in memory: +```bash +curl http://localhost:11434/api/ps | python3 -m json.tool +``` + +--- + +## The smart router in action + +The router classifies your messages automatically. You don't need to think about model selection. + +**Try these to see routing in action:** + +For a reasoning question (routes to DeepSeek-R1): +``` +Why does a TCP connection require a three-way handshake? Walk me through the reasoning. +``` + +For a code question (routes to Qwen-Coder): +``` +Write a bash script that finds all files modified in the last 24 hours and prints their sizes. +``` + +For a diagnostic question (routes to Qwen 2.5): +``` +My systemd service keeps restarting. What's the first thing I should check? +``` + +For a long-form task (routes to Gemma3): +``` +Summarize the key differences between microservices and monolithic architectures for a presentation. +``` + +See [docs/smart-router.md](smart-router.md) for the full routing logic. + +--- + +## Searching your notes (Obsidian vault RAG) + +If you set `RETRIEVER_VAULT_PATH` in `.env` and your vault has been indexed, you can ask questions about your own notes directly in OpenCode. + +The vault-search tool is automatically available in OpenCode when running from the project directory. When you ask a question that might be answered from your notes, the model calls the tool automatically. + +**Try:** +``` +What have I written about Docker networking? +``` + +Or from inside the ai-stack directory, the vault search is always available as a tool. The model will search your vault and incorporate the results into its answer. + +Check indexing status: +```bash +curl http://localhost:42000/health +``` + +Look for `indexed_files` to confirm your vault is indexed. If it's 0, the vault may still be indexing (give it a few minutes for a large vault) or the path may be wrong. + +--- + +## Using cloud models + +If you added API keys for Claude or Gemini during setup, those models are available through LiteLLM. You can switch to a cloud model when you need more capability than local models provide. + +In OpenCode, switch the provider to `litellm` to access cloud models directly. See [docs/cloud-models.md](cloud-models.md) for configuration details. + +--- + +## Useful commands + +```bash +# Check all services are healthy +curl http://localhost:40114/internal/health # Olla +curl http://localhost:11434/api/tags # Ollama +curl http://localhost:42000/health # Retriever +curl http://localhost:40115/health # Smart router + +# See the smart router's current model routing decisions +curl http://localhost:40115/v1/router/capabilities + +# Pull a new model +docker exec ollama ollama pull qwen2.5-coder:14b + +# Watch logs from all services +docker compose logs --tail=20 -f + +# Watch only the smart router (to see routing decisions) +docker logs router --tail=20 -f + +# Restart everything +sudo systemctl restart ai-stack.service +``` + +--- + +## Adding more models + +The more models you have, the better the router can match tasks to the right tool. See [docs/models.md](models.md) for the recommended stack and what each model does well. + +To pull the full recommended model set: +```bash +docker exec ollama ollama pull mistral-small3.2:24b # tool calling +docker exec ollama ollama pull qwen3.5:14b # general default +docker exec ollama ollama pull qwen2.5-coder:14b # code +docker exec ollama ollama pull qwen2.5:14b # diagnostics +docker exec ollama ollama pull deepseek-r1:14b # reasoning +docker exec ollama ollama pull gemma3:12b # long-form +docker exec ollama ollama pull nomic-embed-text:latest # embeddings +``` + +Each model is 7–15 GB. Pull only what you have storage for. + +--- + +## Something isn't working? + +→ [docs/troubleshooting.md](troubleshooting.md) — solutions to common issues diff --git a/docs/install.md b/docs/install.md new file mode 100644 index 0000000..4d5d582 --- /dev/null +++ b/docs/install.md @@ -0,0 +1,257 @@ +# Installation + +This guide walks you through setting up ai-stack from scratch. It's written for people who are new to local AI and Docker, but experienced users can skip the explanatory sections. + +**Time to complete:** 30–60 minutes (most of that is downloading model files) + +--- + +## Before you start + +### What you need + +- **Linux** — Ubuntu 22.04+ or Fedora 38+ recommended; any modern distro works +- **Docker Engine 24+** and the Docker Compose v2 plugin +- **Git** +- **32 GB RAM** recommended (16 GB minimum — limits you to smaller models) +- **50 GB free disk space** minimum (models are large; 100 GB+ is comfortable) + +Don't have Docker? See [Installing Docker](#installing-docker) below. + +### What you'll have at the end + +- A local AI assistant accessible through OpenCode (terminal CLI and Obsidian plugin) +- Automatic model routing (the stack picks the right model for each task) +- Optional: your Obsidian vault searchable via natural language +- Optional: free-tier Claude and Gemini available as cloud fallbacks + +--- + +## Step 1: Clone the repository + +```bash +git clone https://github.com/growlf/ai-stack.git +cd ai-stack +``` + +--- + +## Step 2: Configure your environment + +```bash +cp .env.example .env +``` + +Open `.env` in a text editor. The key settings to check: + +```bash +nano .env # or: code .env or: vim .env +``` + +**Required settings:** + +| Variable | What to set | Example | +|----------|------------|---------| +| `STACK_USER` | Your Linux username | `yourname` | +| `RETRIEVER_VAULT_PATH` | Full path to your Obsidian vault, or any folder of Markdown files | `/home/yourname/Documents/notes` | + +**Security settings — generate these, don't leave defaults:** + +```bash +# Generate a strong random password for the database +openssl rand -base64 32 +# Paste the result as LITELLM_MASTER_KEY in .env +``` + +Set `LITELLM_MASTER_KEY` to the generated value. The installer can also set this up via Bitwarden if you prefer centralized secret management — see [docs/secret-management.md](secret-management.md). + +**Optional — cloud API keys:** + +Leave these blank if you only want local models (which is fine): + +```bash +ANTHROPIC_API_KEY= # Claude (claude.ai/settings → API Keys) +GEMINI_API_KEY= # Gemini (aistudio.google.com → API Keys) +``` + +Free-tier accounts for both services work here. See [docs/cloud-models.md](cloud-models.md) for details. + +--- + +## Step 3: Run the installer + +```bash +chmod +x install.sh +./install.sh +``` + +The installer is interactive. It will prompt you through: + +1. **GPU detection** — checks for Intel Arc, Nvidia, or AMD GPU and configures accordingly +2. **OpenCode CLI** — installs the OpenCode terminal interface (requires Bun runtime; installer handles this) +3. **Obsidian plugin** — installs the OpenCode Obsidian sidebar plugin if you use Obsidian +4. **Secret management** — optional Bitwarden/VaultWarden setup for secure API key storage +5. **Stack startup** — brings up all services via Docker Compose +6. **Model download** — prompts you to pull recommended models + +**The model download takes a while.** A 14B model is ~8–9 GB. On a decent connection this takes 5–15 minutes per model. The installer will prompt you for which models to pull. + +--- + +## Step 4: Verify the stack is running + +Once the installer finishes, check that all services are healthy: + +```bash +# Check the LLM router (Olla) +curl http://localhost:40114/internal/health + +# Check local model runner (Ollama) +curl http://localhost:11434/api/tags + +# Check the retriever (vault search) +curl http://localhost:42000/health + +# Check the smart router +curl http://localhost:40115/health +``` + +Each should return a JSON response. If any return an error, see [Troubleshooting](#troubleshooting) below. + +Check which models are loaded: +```bash +docker exec ollama ollama list +``` + +--- + +## Step 5: Configure OpenCode + +The installer creates a config file at `~/.opencode/config.json`. Open a new terminal and try: + +```bash +opencode +``` + +If OpenCode opens and you can type a message, the stack is working. + +The default provider routes through the smart router, which sends your request to the best available local model. You should get a response within a few seconds (longer if the model isn't loaded yet — first load takes 5–30 seconds depending on model size and hardware). + +--- + +## Step 6: Test your vault search (optional) + +If you set `RETRIEVER_VAULT_PATH`, the retriever will index your notes on startup. Check the index status: + +```bash +curl http://localhost:42000/health +``` + +Look for `indexed_files` — if it's 0, indexing is either in progress or the path is wrong. Indexing a large vault can take a few minutes. + +Test a search: +```bash +curl -X POST http://localhost:42000/search \ + -H 'Content-Type: application/json' \ + -d '{"query": "something you know is in your notes", "top_k": 5}' +``` + +--- + +## Automatic startup + +The installer creates a systemd service that starts ai-stack automatically on boot: + +```bash +# Check status +sudo systemctl status ai-stack.service + +# Stop the stack +sudo systemctl stop ai-stack.service + +# Start the stack +sudo systemctl start ai-stack.service +``` + +--- + +## Installing Docker + +If you don't have Docker installed: + +**Ubuntu / Debian:** +```bash +# Add Docker's official repository +curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg +echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null + +# Install Docker Engine and Compose plugin +sudo apt update +sudo apt install docker-ce docker-ce-cli containerd.io docker-compose-plugin + +# Add your user to the docker group (avoids needing sudo for docker commands) +sudo usermod -aG docker $USER +newgrp docker +``` + +**Fedora:** +```bash +sudo dnf install docker docker-compose-plugin +sudo systemctl enable --now docker +sudo usermod -aG docker $USER +newgrp docker +``` + +Verify: +```bash +docker --version # should be 24+ +docker compose version # should be v2.x +``` + +--- + +## Troubleshooting + +### "docker: permission denied" + +You need to add your user to the docker group: +```bash +sudo usermod -aG docker $USER +``` +Then log out and back in, or run `newgrp docker`. + +### Services won't start + +Check logs for the failing service: +```bash +docker compose logs --tail=30 +``` + +### GPU not detected + +For Intel Arc: the `check-arc-gpu.sh` script runs automatically and configures the correct device node. If you get GPU errors: +```bash +ls /dev/dri/ +bash scripts/check-arc-gpu.sh +``` + +See [docs/hardware/arc.md](hardware/arc.md) for Intel Arc-specific guidance. + +### Model downloads are slow + +Model files are large (8–16 GB each). This is normal. You can monitor progress: +```bash +docker exec ollama ollama list +``` + +### OpenCode command not found + +The installer uses Bun to install OpenCode globally. If `opencode` isn't found: +```bash +export PATH="$HOME/.bun/bin:$PATH" +``` +Add this to your `~/.bashrc` or `~/.zshrc` to make it permanent. + +### More help + +→ [docs/troubleshooting.md](troubleshooting.md) — comprehensive issue guide diff --git a/docs/models.md b/docs/models.md new file mode 100644 index 0000000..928c35a --- /dev/null +++ b/docs/models.md @@ -0,0 +1,64 @@ +# Model reference + +## Recommended stack + +| Model | Size (RAM) | Tools | Use case | +|-------|-----------|-------|----------| +| `mistral-small3.2:24b` | ~15 GB | yes | Function calling, 128K context, strong instruction following | +| `qwen3.5:14b` | ~8.5 GB | yes | General default — reasoning + tool calling | +| `qwen2.5:14b` | ~8.3 GB | yes | Diagnostics, sysadmin, structured output | +| `qwen2.5-coder:14b` | ~8.3 GB | yes | Code, scripts, configs, debugging | +| `deepseek-r1:14b` | ~8.3 GB | **no** | Complex reasoning, root cause analysis — no tool support | +| `gemma3:12b` | ~7.8 GB | **no** | Long log analysis, summaries, documentation | +| `gemma4:27b` | ~16 GB | **no** | Heavy lifting, large context — load on demand | +| `nomic-embed-text` | ~274 MB | — | Embeddings for RAG (retriever service) | + +## Tool support + +Models marked **no** in the Tools column do not support the OpenAI-compatible `tools`/`functions` API. Sending a tool-equipped request to these models returns an error from Ollama. + +The smart router handles this automatically — requests with tool definitions are always routed to a tools-capable model regardless of content classification. You do not need to manually avoid these models when using OpenCode with tools configured. + +If you query Ollama directly (bypassing the router), use only tools-capable models for tool-calling requests. + +## Model selection notes + +**`deepseek-r1:14b`** — Chain-of-thought reasoning. Thinks before responding, typically adding 20–60 seconds to response time. Worth it for complex root cause analysis or architectural decisions. Not suitable for quick queries or anything requiring tool calls. + +**`gemma4:27b`** — Requires ~16 GB GPU memory. On systems with 32 GB RAM, loading this model alongside others will cause page-outs. Pull on demand rather than keeping resident. Use for tasks that need large context or complex analysis where smaller models underperform. + +**`qwen2.5-coder:14b`** — Understands YAML, Dockerfiles, systemd units, and bash idioms better than the base `qwen2.5` model. Prefer this for anything involving file structure, shell commands, or configuration. + +**`nomic-embed-text`** — Only used by the retriever service for vault embeddings. Not suitable for chat. + +## Memory management + +Ollama loads models into GPU memory on first use and keeps them resident for `OLLAMA_KEEP_ALIVE` duration (default: `-1`, meaning indefinitely). On systems with limited RAM, loading a second large model will evict the first. + +```bash +# Check what's currently loaded +curl http://localhost:11434/api/ps | python3 -m json.tool + +# Pull a model +docker exec ollama ollama pull qwen3.5:14b + +# List all installed models +docker exec ollama ollama list + +# Remove a model +docker exec ollama ollama rm llama3:8b +``` + +## Pulling the full recommended stack + +```bash +docker exec ollama ollama pull mistral-small3.2:24b +docker exec ollama ollama pull qwen3.5:14b +docker exec ollama ollama pull qwen2.5-coder:14b +docker exec ollama ollama pull qwen2.5:14b +docker exec ollama ollama pull deepseek-r1:14b +docker exec ollama ollama pull gemma3:12b +docker exec ollama ollama pull nomic-embed-text:latest +``` + +`gemma4:27b` is optional — pull only if your system has 48 GB+ RAM or you have a specific need for it. diff --git a/docs/multi-machine.md b/docs/multi-machine.md new file mode 100644 index 0000000..96734a3 --- /dev/null +++ b/docs/multi-machine.md @@ -0,0 +1,201 @@ +# Multi-machine setup + +ai-stack can span multiple computers. If you have more than one machine running Ollama, the Olla load balancer distributes requests across all of them. This gives you access to more models simultaneously and faster parallel processing. + +--- + +## How it works + +Olla maintains a list of Ollama endpoints. When a request comes in, it picks the best available endpoint based on load and health. You don't configure which machine handles which request — Olla handles that automatically. + +``` +OpenCode → Smart Router → Olla + ├── ollama (local, this machine) + ├── 192.168.1.50:11434 (workstation) + └── 192.168.1.75:11434 (server) +``` + +--- + +## Requirements for additional nodes + +Each additional machine needs: +- Ollama installed and running +- Port `11434` reachable from your main machine +- The models you want available already pulled + +Ollama does not need the full ai-stack — just the Ollama service itself. + +### Installing Ollama on a second machine + +```bash +curl -fsSL https://ollama.com/install.sh | sh +sudo systemctl enable --now ollama + +# Pull models on the remote machine +ollama pull qwen3.5:14b +ollama pull qwen2.5-coder:14b +``` + +By default, Ollama only listens on `localhost`. To allow connections from other machines: + +```bash +# Edit the Ollama service +sudo systemctl edit ollama.service +``` + +Add: +```ini +[Service] +Environment="OLLAMA_HOST=0.0.0.0" +``` + +Then: +```bash +sudo systemctl daemon-reload +sudo systemctl restart ollama +``` + +Verify from your main machine: +```bash +curl http://192.168.1.50:11434/api/tags +``` + +--- + +## Adding machines manually + +Edit `.env` on your main machine and add a line for each remote node: + +```bash +OLLAMA_REMOTE_WORKSTATION=http://192.168.1.50:11434:75 +OLLAMA_REMOTE_SERVER=http://192.168.1.75:11434:50 +``` + +The format is `http://HOST:PORT:WEIGHT`. Weight controls how often Olla routes to this node relative to others (higher = more requests). The local `ollama` service defaults to weight 100. + +Then regenerate the Olla config and restart: + +```bash +bash scripts/generate-olla-config.sh +sudo systemctl restart ai-stack.service +``` + +Verify Olla sees all nodes: +```bash +curl http://localhost:40114/internal/status/endpoints +``` + +--- + +## Auto-discovery via mDNS + +Instead of manually managing IP addresses, the `discover-herd.sh` script scans your local network for machines advertising Ollama via mDNS (`_ollama._tcp`). This works when machines are on the same local network segment. + +```bash +# Discover nodes and preview what would be added +bash scripts/discover-herd.sh + +# Discover and apply (updates olla.yaml and restarts Olla) +bash scripts/discover-herd.sh --apply +``` + +To advertise Ollama via mDNS on a remote machine, install `avahi-daemon` and register the service: + +```bash +# On the remote machine +sudo apt install avahi-daemon + +# Create an mDNS service registration +sudo tee /etc/avahi/services/ollama.service > /dev/null << 'EOF' + + + + Ollama on $HOSTNAME + + _ollama._tcp + 11434 + + +EOF + +sudo systemctl restart avahi-daemon +``` + +Once registered, `discover-herd.sh` will find the machine automatically. + +--- + +## Automatic periodic discovery + +To keep node discovery up to date without manual intervention, set up a systemd timer: + +```bash +# Copy the timer (included in ai-stack) +sudo cp systemd/ai-stack-discover.timer /etc/systemd/system/ +sudo cp systemd/ai-stack-discover.service /etc/systemd/system/ + +sudo systemctl daemon-reload +sudo systemctl enable --now ai-stack-discover.timer +``` + +This runs discovery every 5 minutes and updates Olla's config when new nodes appear. + +--- + +## Multi-machine over the internet (NetBird VPN) + +For machines on different networks — at home and at work, or distributed across locations — you need a secure tunnel between them. + +The recommended approach is [NetBird](https://netbird.io), a WireGuard-based mesh VPN that automatically connects machines without manual port forwarding or firewall configuration. + +### Setting up NetBird + +1. Create a free account at [app.netbird.io](https://app.netbird.io) +2. Install the NetBird agent on each machine: + ```bash + curl -fsSL https://pkgs.netbird.io/install.sh | sh + ``` +3. Connect each machine: + ```bash + netbird up + ``` + Follow the browser link to authenticate. + +Once connected, each machine gets a stable IP in the `100.x.x.x` range (NetBird's virtual subnet). These IPs are stable across network changes and reboots. + +### Adding NetBird nodes to Olla + +Use the NetBird IP instead of the LAN IP: + +```bash +OLLAMA_REMOTE_HOMELAB=http://100.64.0.5:11434:60 +``` + +Regenerate config and restart as usual. The smart router and Olla work the same way regardless of whether the connection is local or over VPN. + +### Network discovery over NetBird + +The `discover-network.sh` script can scan the NetBird subnet in addition to the local LAN. Set the `NETBIRD_SUBNET` variable in `.env`: + +```bash +NETBIRD_SUBNET=100.64.0.0/10 +``` + +Then discovery automatically includes NetBird-connected machines. + +--- + +## Monitoring distributed requests + +To see how requests are being distributed across nodes: + +```bash +# Real-time Olla routing +curl http://localhost:40114/internal/status/endpoints + +# Which models are available across all nodes +curl http://localhost:40115/v1/router/capabilities +``` + +The smart router only routes to models that are available on at least one connected node. If a model is pulled on the remote machine but not the local one, the router will still use it — Olla handles the routing to wherever it lives. diff --git a/docs/overview.md b/docs/overview.md new file mode 100644 index 0000000..38a1e1f --- /dev/null +++ b/docs/overview.md @@ -0,0 +1,122 @@ +# What is ai-stack? + +ai-stack is an open-source toolkit for running AI language models on your own computer — completely free, with no ongoing subscription, no per-token cost, and no data leaving your machine. + +Once set up, you have a fully functional AI development assistant that can help with code, answer questions, search your notes, and more — without ever pinging a paid API. + +--- + +## Why run AI locally? + +Most AI tools you've used (ChatGPT, Claude.ai, Gemini) are cloud services. Every message you send goes to a company's server, gets processed, and comes back as a response. This means: + +- **You pay per use** — directly (subscription) or indirectly (your data) +- **Your conversations are stored** — and may be used to train future models +- **You need internet** — no connectivity, no AI +- **You have a quota** — rate limits, tier restrictions, usage caps + +Running locally eliminates all of this. The model runs on your hardware. Your data stays on your machine. There's no usage meter. + +The trade-off: local models are smaller than frontier cloud models. GPT-4 or Claude Opus have vastly more parameters than what you can run locally. But for most development tasks — writing code, explaining concepts, answering questions about your own notes — a 14-billion parameter local model is genuinely useful and fast enough to be practical. + +--- + +## What can I do with this? + +Once the stack is running, you can: + +- **Write and debug code** — ask questions, get explanations, have the AI write and fix scripts +- **Search your notes** — if you use Obsidian, the retriever service makes your entire vault searchable via natural language +- **Run multiple models** — route different types of tasks to different models automatically (reasoning tasks to DeepSeek, code to Qwen-Coder, etc.) +- **Use cloud models as a backup** — if a task genuinely needs a frontier model (like Claude Sonnet or Gemini Pro), you can route to those via free-tier APIs without changing your workflow +- **Span multiple machines** — if you have more than one computer on your network, Olla can load-balance requests across all of them + +--- + +## How does it work? + +The stack has four main services: + +### 1. Ollama — the model runner + +[Ollama](https://ollama.com) manages downloading, storing, and running language models on your hardware. It handles the complex GPU memory management and exposes a simple API. Think of it as the engine. + +ai-stack uses a version of Ollama that supports Intel Arc GPUs (via Intel's OneAPI/SYCL framework). If you have Nvidia or AMD hardware, standard Ollama works the same way. + +### 2. LiteLLM — the cloud gateway + +[LiteLLM](https://litellm.ai) is a proxy that gives you one consistent API to talk to cloud models (Claude, Gemini, etc.). Free-tier access to Claude and Gemini works through here — you provide an API key, LiteLLM handles the translation. + +This is entirely optional. If you don't add API keys, LiteLLM just sits idle. + +### 3. Olla — the load balancer + +[Olla](https://github.com/elestio/olla) sits in front of Ollama and LiteLLM and presents them as a single unified endpoint. If you add more machines running Ollama to your network, Olla discovers them and spreads requests across all of them. + +You don't interact with Olla directly — OpenCode and the smart router use it automatically. + +### 4. Smart Router — the model selector + +The smart router (`router/smart_model_router.py`) looks at every request before it reaches Olla and picks the best model for the job. If you ask a coding question, it routes to the code model. If you ask a reasoning question, it routes to the reasoning model. If your request includes tool definitions (like searching your vault), it routes to a model that supports tool calling. + +This happens in under a millisecond and is invisible to you. + +### 5. Retriever — the note searcher + +The retriever service indexes your Obsidian vault and makes it searchable via natural language. It uses a combination of keyword search and vector similarity to find relevant notes. OpenCode can call it as a tool, so you can ask "what did I write about DNS?" and get actual results from your notes. + +No internet required — embeddings run locally through Ollama. + +### How requests flow + +``` +You type a message in OpenCode + ↓ +Smart Router (classifies your request, picks the best model) + ↓ +Olla (routes to the selected model, load-balances if multiple machines) + ↓ +Ollama (runs the model on your GPU) — or — LiteLLM (calls cloud API) + ↓ +Response comes back to you + +If your request needs your notes: +Smart Router detects tool use → routes to tool-capable model +Model calls vault-search tool → Retriever searches your vault +Results returned to model → included in response +``` + +--- + +## What hardware do I need? + +You need a reasonably modern computer with Linux. The more RAM you have, the larger models you can run. You do not need an expensive dedicated GPU — this stack is designed to work on integrated graphics and CPUs as well. + +| RAM | What you can run | +|-----|----------------| +| 16 GB | 7B models (Mistral 7B, Llama 3.1 8B) — functional but limited | +| 32 GB | 14B models (Qwen 14B, DeepSeek-R1 14B) — solid for development work | +| 48 GB+ | 27B models (Gemma4 27B) — high-quality output, handles complex tasks | + +If you have an Intel Arc iGPU (found in Intel Core Ultra processors), the stack has specific support for using it to accelerate inference. Nvidia and AMD GPUs work through standard Ollama. CPU-only also works — slower, but functional. + +See [docs/hardware/](hardware/) for hardware-specific setup guides. + +--- + +## Do I need internet access? + +For the initial setup: yes. You need to download Docker images and model files (which can be several gigabytes each). + +After setup: no. The entire stack runs offline. The only exception is if you configure cloud API keys (Claude, Gemini) — those calls go to the cloud, but they're optional. + +--- + +## Where do I start? + +→ **[docs/install.md](install.md)** — complete setup walkthrough, start here + +After install: +→ **[docs/getting-started.md](getting-started.md)** — your first conversation with a local model +→ **[docs/models.md](models.md)** — which models to use for what +→ **[docs/smart-router.md](smart-router.md)** — how automatic model selection works diff --git a/docs/secret-management.md b/docs/secret-management.md new file mode 100644 index 0000000..1c883c0 --- /dev/null +++ b/docs/secret-management.md @@ -0,0 +1,138 @@ +# Secret management + +ai-stack needs a few secrets: API keys for cloud models, a master key for LiteLLM, and database credentials. The simplest approach is storing them directly in `.env`. The more secure approach is using Bitwarden or a self-hosted VaultWarden instance. + +--- + +## The simple approach: `.env` file + +All secrets go in `.env` at the root of the project. This file is gitignored — it will never be committed to the repository. + +```bash +LITELLM_MASTER_KEY=your-generated-key +ANTHROPIC_API_KEY=sk-ant-your-key +GEMINI_API_KEY=your-gemini-key +``` + +Generate a strong `LITELLM_MASTER_KEY`: +```bash +openssl rand -base64 32 +``` + +**What to protect:** +- Never commit `.env` to git (it's gitignored, but verify with `git status`) +- Don't share your `.env` file +- Use different keys per machine + +This is sufficient for personal use on a machine you control. + +--- + +## The secure approach: Bitwarden / VaultWarden + +For teams, shared machines, or if you want centralized secret management, the stack supports resolving API keys from a Bitwarden vault at runtime. Secrets never live in plain text on disk — they're fetched when the stack starts. + +### What is VaultWarden? + +[VaultWarden](https://github.com/dani-garcia/vaultwarden) is a self-hosted Bitwarden-compatible server. If you want zero cloud dependency for your secrets, run VaultWarden on your own hardware. The official Bitwarden cloud service also works. + +### Setting up during install + +The `install.sh` script prompts you to configure Bitwarden integration. If you said no during install, you can run the setup manually: + +```bash +bash scripts/setup-bitwarden.sh +``` + +This installs the `bw` CLI (via npm) and walks you through: +1. Entering your Bitwarden server URL (or `https://vault.bitwarden.com` for cloud) +2. Creating an API key in Bitwarden (Account Settings → Security → API Key) +3. Storing the `BW_CLIENT_ID` and `BW_CLIENT_SECRET` in `.env` +4. Setting `VAULT_MASTER_PASSWORD` in `.env` + +### Using vault placeholders + +Once configured, use the `` placeholder format in `.env`: + +```bash +ANTHROPIC_API_KEY= +GEMINI_API_KEY= +LITELLM_MASTER_KEY= +``` + +The format is ``. The organization ID comes from your Bitwarden organization settings. The item name matches the name of the item you created in the vault. + +On each stack startup, `start.sh` calls `resolve-vaultwarden.sh` which: +1. Authenticates with Bitwarden using the stored credentials +2. Looks up each placeholder +3. Replaces placeholders with actual values in memory (not written to disk) +4. Starts the services with resolved values + +### Manual resolution + +```bash +# Resolve all placeholders and preview (dry run — doesn't modify .env) +./scripts/resolve-vaultwarden.sh --dry-run + +# Resolve and update .env in place +./scripts/resolve-vaultwarden.sh +``` + +### Storing secrets in Bitwarden + +For each secret you want to manage: +1. Open Bitwarden (web vault or app) +2. Create a new **Login** item +3. Name it exactly as you'll use in the placeholder (e.g., `anthropic-api-key`) +4. Store the key in the **Password** field +5. Assign it to your organization + +--- + +## Generating keys + +### LITELLM_MASTER_KEY + +The LiteLLM master key authenticates requests to the LiteLLM proxy. Generate a strong one: + +```bash +openssl rand -base64 32 +``` + +This key is used internally — you'll rarely need to type it manually. + +### Rotating keys + +To rotate `LITELLM_MASTER_KEY`: +1. Generate a new key: `openssl rand -base64 32` +2. Update it in `.env` (or in Bitwarden if using vault placeholders) +3. Restart the stack: `sudo systemctl restart ai-stack.service` + +Existing sessions using the old key will fail immediately. There are no stored session tokens to invalidate — just the key itself. + +--- + +## What's in `.env` regardless of Bitwarden + +Even with Bitwarden configured, these values live in `.env` in plain text (they're needed to authenticate with Bitwarden itself): + +```bash +BW_CLIENT_ID=user.your-client-id +BW_CLIENT_SECRET=your-client-secret +VAULT_MASTER_PASSWORD=your-master-password +``` + +Protect your `.env` file with appropriate filesystem permissions: +```bash +chmod 600 .env +``` + +--- + +## Security checklist + +- `chmod 600 .env` — restrict file permissions +- Verify `git status` before any commit — `.env` should never appear +- Use `openssl rand -base64 32` for all generated keys — not guessable strings +- Rotate API keys periodically, especially if you suspect exposure +- If using cloud APIs, monitor your API key usage at the provider console diff --git a/docs/smart-router.md b/docs/smart-router.md new file mode 100644 index 0000000..d057843 --- /dev/null +++ b/docs/smart-router.md @@ -0,0 +1,111 @@ +# Smart Router + +The smart router (`router/smart_model_router.py`) sits between OpenCode and Olla. Every request passes through it. It classifies the request and selects the most appropriate model before forwarding to Olla. + +## What it does + +1. Inspects the incoming request +2. If the request contains tool definitions, routes to the designated tools-capable model +3. Otherwise, classifies the message content using keyword patterns and routes to the model best suited for that task +4. Verifies the selected model is actually loaded and capable before routing +5. Falls back gracefully if the preferred model is unavailable + +## Routing logic + +### Tool requests (highest priority) + +If the request body contains a `tools` or `functions` field, the router bypasses content classification entirely and routes directly to the tools model (`mistral-small3.2:24b` by default). This prevents tool-call failures on reasoning models like `deepseek-r1` that don't support the tools API. + +### Content classification + +For requests without tools, the router scores the message against keyword patterns: + +| Category | Patterns match | Routes to | +|----------|---------------|-----------| +| `tools` | tool, function, schema, json, api call | `mistral-small3.2:24b` | +| `reasoning` | why, analyze, explain, root cause, decision | `deepseek-r1:14b` | +| `code` | script, bash, python, docker, yaml, config | `qwen2.5-coder:14b` | +| `diagnostic` | log, error, service, process, permission | `qwen2.5:14b` | +| `longform` | summarize, document, report, long | `gemma3:12b` | +| `default` | (no strong match) | `qwen3.5:14b` | + +The message is capped at 500 characters for classification — long documents are classified on their opening content. + +### Capability verification + +Before routing, the router checks the capability registry to confirm: +- The selected model is currently loaded in Ollama +- The model supports the required features (tools, etc.) + +If the model isn't available or capable, the router falls back to the `default` model. + +## Capability registry + +The registry is built at startup by querying Olla's `/v1/models` endpoint. It refreshes every 5 minutes in the background — never on the request path. + +Tool support is inferred from model family: + +| Supports tools | Does not support tools | +|---------------|----------------------| +| mistral, llama3.x, qwen2.5, qwen3, phi3.5, phi4, command-r, granite3 | deepseek-r1, gemma3, gemma4, nomic, mxbai, snowflake | + +If Olla is unreachable at startup, the router falls back to a static model map and retries the registry on the next refresh cycle. In-flight requests are never blocked by registry refreshes. + +## Debug endpoint + +```bash +curl http://localhost:40115/v1/router/capabilities +``` + +Returns the current registry state: which models are loaded, which support tools, and what the router would select for a tools request. Useful for diagnosing routing decisions without reading logs. + +## Health check + +```bash +curl http://localhost:40115/health +``` + +Response includes `models_loaded` count and `registry_stale` flag. + +## Latency profile + +The router adds negligible latency to each request: + +| Step | Cost | +|------|------| +| Tools/functions check | nanoseconds (dict key lookup) | +| Content classification | sub-millisecond (compiled regex on ≤500 chars) | +| Capability registry lookup | nanoseconds (in-memory dict) | +| JSON encode/decode | unavoidable proxy overhead | + +**Total per-request overhead: well under 1ms.** The capability registry refresh happens in the background every 5 minutes — never on the hot path. + +The startup registry load (one Olla API call) adds ~100ms to first startup. After that, routing decisions are pure in-memory. + +## Configuration + +All settings via environment variables in `.env`: + +| Variable | Default | Purpose | +|----------|---------|---------| +| `CAPABILITY_REFRESH_INTERVAL` | `300` | Seconds between registry refreshes | +| `ROUTER_PORT` | `40115` | Port the router listens on | +| `ENABLE_AUTO_TOOLS` | `false` | Automatically inject available tools into requests going to tool-capable models (planned feature — see below) | + +To change the model assigned to a category, edit the `MODELS` dict at the top of `router/smart_model_router.py`. + +--- + +## Planned: automatic tool injection + +The current router handles tools defensively — if a request includes tool definitions, it routes to a capable model; if the request has no tools, they're not added. + +A planned follow-on feature (`ENABLE_AUTO_TOOLS`) will change this: when routing to a tool-capable model, the router will automatically inject available tool definitions (vault search, etc.) into the request even when the client didn't ask for them. This allows the model to use tools proactively — for example, searching your vault when answering a question about your notes without you needing to explicitly invoke the tool. + +**Design decisions already made:** +- Tool schemas will be defined in `router/tools.json` (OpenAI function-calling format) +- Injection happens in the router's `handle_request()`, not at the LiteLLM layer +- The env var defaults to `false` until the feature is tested and stable +- The model decides whether to invoke tools — injection makes them available, it doesn't force their use + +This feature is tracked as a GitHub issue. It is not in the current release. diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 66d239b..47d052c 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -13,7 +13,7 @@ Lessons learned from building this stack. Check here before opening issues. **Check:** ```bash ls -la /dev/dri/ -docker logs ollama-arc 2>&1 | grep -i "device\|gpu\|arc\|oneapi" +docker logs ollama 2>&1 | grep -i "device\|gpu\|arc\|oneapi" ``` **Common causes:** @@ -53,8 +53,8 @@ docker logs retriever --tail 20 2. **Embeddings failing** — The retriever needs Olla healthy and `nomic-embed-text` pulled: ```bash curl localhost:40114/internal/health - docker exec ollama-arc ollama list | grep nomic - docker exec ollama-arc ollama pull nomic-embed-text:latest + docker exec ollama ollama list | grep nomic + docker exec ollama ollama pull nomic-embed-text:latest ``` 3. **Vault not mounted** — Check the container: @@ -123,8 +123,8 @@ olla: **Fix:** Stop containers by name: ```bash -docker stop ollama-arc litellm olla retriever -docker rm ollama-arc litellm olla retriever +docker stop ollama litellm olla retriever +docker rm ollama litellm olla retriever ``` Then fix the typo in `docker-compose.yml` and restart.