179 changes: 3 additions & 176 deletions README.md
@@ -1,188 +1,15 @@
# ai-stack

A self-hosted AI stack optimised for **Intel Arc iGPU** on Linux, built around Ollama + OpenCode. Provides local LLM inference (ollama-arc), cloud API routing (LiteLLM), unified routing (Olla), and Obsidian vault RAG (retriever). The primary AI interface is **OpenCode** (CLI + Obsidian sidebar plugin).
A self-hosted, zero-cost AI development stack built around local LLMs. Run powerful language models on your own hardware — no token spend, no cloud dependency. Free-tier cloud models are available as an optional add-on.

Built and documented through real-world homelab experience on Intel Arc hardware.
Built around [Ollama](https://ollama.com), [OpenCode](https://opencode.ai), and a lightweight smart router that automatically selects the right model for each request.

---

## What's included

| Component | Purpose |
|-----------|---------|
| **Ollama (ava-agentone/ollama-intel)** | LLM inference with Intel Arc iGPU acceleration via OneAPI/SYCL |
| **LiteLLM** | Cloud API gateway (Claude, Gemini) |
| **Olla** | Unified LLM router with load balancing |
| **Smart Router** | Content-based model selection (OpenCode → router → Olla) |
| **Retriever** | Lightweight Obsidian vault RAG (sqlite-vec + FTS5, hybrid search) |
| **OpenCode** | Primary AI interface — CLI tool + Obsidian sidebar plugin |

---

## Hardware requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | Intel Core Ultra (Meteor Lake) | Intel Core Ultra 9 185H |
| RAM | 16 GB | 32 GB |
| GPU | Intel Arc iGPU | Intel Arc iGPU (any Meteor/Arrow Lake) |
| Storage | 50 GB free | 100 GB+ free (models are large) |
| OS | Ubuntu 22.04 | Ubuntu 24.04 |

---

## Quick start

```bash
# 1. Clone the repo
git clone https://github.com/growlf/ai-stack.git
cd ai-stack

# 2. Configure
cp .env.example .env
nano .env # set your username, paths, and API keys (or skip — install.sh can set up Bitwarden)

# 3. Install
chmod +x install.sh scripts/check-arc-gpu.sh
./install.sh

# The installer will:
# - Prompt to install OpenCode CLI + Bun
# - Auto-install the OpenCode Obsidian plugin (growlf/opencode-obsidian)
# - Prompt to configure Bitwarden/VaultWarden for secret management
# - Start the stack (ollama-arc, litellm, olla, router, retriever)
# - Prompt to pull models
```
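
After the installer completes, a quick sanity check confirms the core services are up. A minimal sketch, assuming the stack's default ports (Ollama on 11434, LiteLLM on 4000, as referenced elsewhere in these docs):

```bash
# Are all containers running?
docker compose ps

# Is Ollama responding? (lists the models it has pulled)
curl http://localhost:11434/api/tags

# Is LiteLLM healthy?
curl http://localhost:4000/health/liveness
```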

---

## Project structure

```
ai-stack/
├── install.sh                    # Main installer
├── docker-compose.yml            # Full stack definition
├── .env.example                  # All configurable values
├── systemd/
│   └── ai-stack.service          # Systemd unit (auto-start on boot)
├── scripts/
│   ├── check-arc-gpu.sh          # GPU pre-flight (detects card0/card1 drift)
│   ├── discover-herd.sh          # mDNS discovery of remote Ollama nodes
│   ├── generate-olla-config.sh   # Reads .env → writes proxy/olla.yaml
│   ├── resolve-vaultwarden.sh    # Resolves <vaultwarden:...> placeholders via bw CLI
│   ├── generate-keys.sh          # Generates secure keys (LITELLM_MASTER_KEY)
│   └── olla.yaml.template        # Olla config template (input to generate-olla-config.sh)
├── router/
│   └── smart-model-router.py     # Content-based model routing (OpenCode → router → Olla)
├── retriever/
│   ├── main.py                   # FastAPI app
│   ├── search.py                 # Hybrid search (FTS5 + vector, RRF fusion)
│   ├── indexer.py                # Vault scanner + watchdog + chunking
│   └── Dockerfile
├── proxy/
│   └── litellm_config.yaml       # LiteLLM model registry (Claude, Gemini)
├── .opencode/
│   └── tools/
│       └── vault-search.ts       # OpenCode tool: search vault via retriever API
└── docs/
    ├── deployment-guide.md       # Setup walkthrough
    ├── model-guide.md            # Model recommendations and routing
    ├── troubleshooting.md        # Common issues and fixes
    └── retriever-guide.md        # Obsidian vault RAG setup
```

---

## Model stack

| Model | Use case |
|-------|----------|
| `gemma3:27b` | Heavy lifting, large context, complex analysis |
| `mistral-small3.2:24b` | Strong function calling, 128K context |
| `qwen3:14b` | Improved reasoning, tool calling (recommended default) |
| `qwen2.5:14b` | Tool calling, diagnostics, sysadmin (router default for diagnostics) |
| `qwen2.5-coder:14b` | Scripts, configs, code (router default for code) |
| `deepseek-r1:14b` | Complex reasoning, root cause analysis (router default for reasoning) |
| `gemma3:12b` | Log analysis, summaries, documentation (router default for longform) |
| `nomic-embed-text` | Embeddings / RAG |

See **[docs/model-guide.md](docs/model-guide.md)** for details.
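
Models can also be pulled or updated manually at any time. A sketch, assuming the Ollama service is named `ollama-arc` as in the installer output above:

```bash
# Pull the recommended default model inside the ollama-arc container
docker compose exec ollama-arc ollama pull qwen3:14b

# Pull the embedding model used by the retriever
docker compose exec ollama-arc ollama pull nomic-embed-text
```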

---

## Known Intel Arc quirks

- The DRI card node (`/dev/dri/card0` vs `card1`) can drift between reboots on Meteor Lake. The `check-arc-gpu.sh` pre-flight script detects and corrects this automatically (the sketch after this list shows the core idea).
- Intel iGPU uses shared system RAM — `runner.vram="0 B"` in Ollama logs is expected and normal.
- Use `OLLAMA_KEEP_ALIVE=-1` to keep models resident in memory between requests.
- `renderD128` is the compute node and is stable; only the `cardN` display node drifts.
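
The detection boils down to finding which `cardN` node belongs to the Intel GPU. A simplified sketch of the idea (not the actual script; `check-arc-gpu.sh` is the authoritative version):

```bash
#!/usr/bin/env bash
# Locate the DRM card node owned by an Intel GPU (PCI vendor ID 0x8086).
# The cardN display node can change across reboots; renderD128 stays stable.
for dev in /sys/class/drm/card[0-9]; do
  if [ "$(cat "$dev/device/vendor" 2>/dev/null)" = "0x8086" ]; then
    echo "Intel GPU display node: /dev/dri/$(basename "$dev")"
  fi
done
```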

---

## Multi-machine setup

Add remote Ollama nodes via `.env` (the number after the port sets the node's priority for Olla's load balancing):

```
OLLAMA_REMOTE_WORKSTATION=http://192.168.1.50:11434:75
```
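
Before wiring a node in, it is worth confirming it is reachable (`/api/version` is Ollama's standard version endpoint):

```bash
curl http://192.168.1.50:11434/api/version
```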

Then regenerate Olla config:

```bash
bash scripts/generate-olla-config.sh
sudo systemctl restart ai-stack.service
```

Or auto-discover nodes on your LAN:

```bash
bash scripts/discover-herd.sh --apply
```

---

## Secret management (optional)

The stack can resolve API keys from Bitwarden (or self-hosted VaultWarden) at runtime using `<vaultwarden:org-id/item-name>` placeholders in `.env`:

```
ANTHROPIC_API_KEY=<vaultwarden:abc123-xyz/anthropic-api-key>
```

The `install.sh` script prompts to set this up:
- Installs `bw` CLI (via npm)
- Collects your organization ID and API credentials
- Writes `BW_CLIENT_ID`, `BW_CLIENT_SECRET`, `VAULT_MASTER_PASSWORD` to `.env`
- Creates vaultwarden placeholders for `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, and `LITELLM_MASTER_KEY`
- Resolves them immediately

On subsequent starts, `start.sh` auto-resolves any unresolved placeholders.

Manual resolution:
```bash
./scripts/resolve-vaultwarden.sh # resolve in-place
./scripts/resolve-vaultwarden.sh --dry-run # preview only
```
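
Under the hood, resolving a placeholder boils down to roughly this `bw` CLI sequence (a simplified sketch, not the script's exact logic; it assumes the credentials that `install.sh` wrote to `.env` are exported):

```bash
# Point the CLI at a self-hosted VaultWarden, if applicable (URL is hypothetical)
bw config server https://vault.example.com

# Authenticate with the API key credentials from .env
export BW_CLIENTID="$BW_CLIENT_ID" BW_CLIENTSECRET="$BW_CLIENT_SECRET"
bw login --apikey

# Unlock the vault and capture a session token
export BW_SESSION="$(bw unlock "$VAULT_MASTER_PASSWORD" --raw)"

# Fetch the secret behind <vaultwarden:org-id/anthropic-api-key>
bw get password "anthropic-api-key"
```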

---

## Updating the stack

```bash
cd /path/to/ai-stack

# Pull latest images
docker compose pull

# Restart with new images
sudo systemctl restart ai-stack.service
```
**[Get started → docs/install.md](docs/install.md)**

---

## Licence

MIT — use freely, contributions welcome.

Built with ☕ and stubbornness.
141 changes: 141 additions & 0 deletions docs/cloud-models.md
@@ -0,0 +1,141 @@
# Cloud models (Claude, Gemini)

ai-stack is designed to work entirely locally, with no cloud dependency. Free-tier access to frontier cloud models (Claude and Gemini) is available as an optional add-on — useful when a task genuinely needs more capability than a 14B local model provides.

This is handled through LiteLLM, which proxies cloud API calls through a local endpoint. From OpenCode's perspective, it's just another provider.

---

## What "free tier" means

Both Anthropic (Claude) and Google (Gemini) offer free API access:

- **Claude** (via Anthropic): Generous free tier on Claude Haiku and Claude Sonnet. Rate-limited, but sufficient for occasional use. Requires account creation at [console.anthropic.com](https://console.anthropic.com).
- **Gemini** (via Google AI Studio): Free tier on Gemini Flash and Gemini Pro. More liberal rate limits. Requires Google account at [aistudio.google.com](https://aistudio.google.com).

Free tiers can be revoked or changed by the providers. Check current limits at their respective developer consoles.

---

## Getting API keys

### Claude (Anthropic)

1. Go to [console.anthropic.com](https://console.anthropic.com)
2. Create an account (free)
3. Go to **API Keys** → **Create Key**
4. Copy the key (starts with `sk-ant-...`)

### Gemini (Google AI Studio)

1. Go to [aistudio.google.com](https://aistudio.google.com)
2. Sign in with your Google account
3. Click **Get API Key** → **Create API key in new project**
4. Copy the key

---

## Adding keys to the stack

### Via `.env` directly

Edit `.env` and set:

```bash
ANTHROPIC_API_KEY=sk-ant-your-key-here
GEMINI_API_KEY=your-gemini-key-here
```

Then restart:
```bash
sudo systemctl restart ai-stack.service
```

### Via Bitwarden/VaultWarden (recommended)

If you configured Bitwarden during install, use the placeholder format instead:

```bash
ANTHROPIC_API_KEY=<vaultwarden:your-org-id/anthropic-api-key>
GEMINI_API_KEY=<vaultwarden:your-org-id/gemini-api-key>
```

The stack resolves these at startup. See [docs/secret-management.md](secret-management.md) for details.

---

## Verifying cloud models are available

After adding keys and restarting:

```bash
# Check LiteLLM is healthy
curl http://localhost:4000/health/liveness

# List available models
curl http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | python3 -m json.tool
```

You should see Claude and Gemini models in the list.
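
For an end-to-end test, send a small completion through the proxy (LiteLLM exposes an OpenAI-compatible API; `claude-haiku` matches the default config shown below):

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-haiku",
    "messages": [{"role": "user", "content": "Reply with one word: pong"}]
  }'
```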

---

## Using cloud models in OpenCode

Cloud models are available through the LiteLLM provider (`:4000`). In OpenCode, you can switch providers or configure cloud models as a fallback.

To direct a specific request to a cloud model, select the LiteLLM provider in OpenCode and choose the model explicitly. The smart router routes to local models by default — cloud models are not part of the automatic routing unless you configure them in the router's `MODELS` map.

**When to use cloud models:**
- Complex reasoning that requires a frontier-scale model
- Very long documents that exceed local model context limits
- Tasks where output quality matters more than privacy/cost

**When to stick with local:**
- Anything involving sensitive information
- Repetitive or bulk tasks (free tier has rate limits)
- When you need fast iteration (local is often faster for short tasks)

---

## The LiteLLM configuration

Cloud model definitions live in `proxy/litellm_config.yaml`. The default config includes:

```yaml
model_list:
- model_name: claude-haiku
litellm_params:
model: anthropic/claude-3-haiku-20240307
api_key: os.environ/ANTHROPIC_API_KEY

- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY

- model_name: gemini-flash
litellm_params:
model: gemini/gemini-1.5-flash
api_key: os.environ/GEMINI_API_KEY

- model_name: gemini-pro
litellm_params:
model: gemini/gemini-1.5-pro
api_key: os.environ/GEMINI_API_KEY
```

To add more models or change which models are available, edit this file and restart the stack.

---

## Rate limits and costs

Free tiers have limits. If you hit them, LiteLLM will return a rate limit error (429). The stack does not automatically retry or fall back to another provider.

To avoid surprises:
- Use local models for routine tasks
- Reserve cloud calls for tasks where the quality difference matters
- Watch your usage at the provider consoles

If you start regularly hitting free tier limits and want to add paid credits, simply add credits to your Anthropic or Google AI account — no config changes needed.