179 changes: 3 additions & 176 deletions README.md
@@ -1,188 +1,15 @@
# ai-stack

A self-hosted AI stack optimised for **Intel Arc iGPU** on Linux, built around Ollama + OpenCode. Provides local LLM inference (ollama-arc), cloud API routing (LiteLLM), unified routing (Olla), and Obsidian vault RAG (retriever). The primary AI interface is **OpenCode** (CLI + Obsidian sidebar plugin).
A self-hosted, zero-cost AI development stack built around local LLMs. Run powerful language models on your own hardware — no token spend, no cloud dependency. Free-tier cloud models are available as an optional add-on.

Built and documented through real-world homelab experience on Intel Arc hardware.
Built around [Ollama](https://ollama.com), [OpenCode](https://opencode.ai), and a lightweight smart router that automatically selects the right model for each request.

---

## What's included

| Component | Purpose |
|-----------|---------|
| **Ollama (ava-agentone/ollama-intel)** | LLM inference with Intel Arc iGPU acceleration via OneAPI/SYCL |
| **LiteLLM** | Cloud API gateway (Claude, Gemini) |
| **Olla** | Unified LLM router with load balancing |
| **Smart Router** | Content-based model selection (OpenCode → router → Olla) |
| **Retriever** | Lightweight Obsidian vault RAG (sqlite-vec + FTS5, hybrid search) |
| **OpenCode** | Primary AI interface — CLI tool + Obsidian sidebar plugin |

---

## Hardware requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | Intel Core Ultra (Meteor Lake) | Intel Core Ultra 9 185H |
| RAM | 16 GB | 32 GB |
| GPU | Intel Arc iGPU | Intel Arc iGPU (any Meteor/Arrow Lake) |
| Storage | 50 GB free | 100 GB+ free (models are large) |
| OS | Ubuntu 22.04 | Ubuntu 24.04 |

---

## Quick start

```bash
# 1. Clone the repo
git clone https://github.com/growlf/ai-stack.git
cd ai-stack

# 2. Configure
cp .env.example .env
nano .env # set your username, paths, and API keys (or skip — install.sh can set up Bitwarden)

# 3. Install
chmod +x install.sh scripts/check-arc-gpu.sh
./install.sh

# The installer will:
# - Prompt to install OpenCode CLI + Bun
# - Auto-install the OpenCode Obsidian plugin (growlf/opencode-obsidian)
# - Prompt to configure Bitwarden/VaultWarden for secret management
# - Start the stack (ollama-arc, litellm, olla, router, retriever)
# - Prompt to pull models
```
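
After the installer completes, a quick sanity check confirms the core services are up. A minimal sketch, assuming the stack's default ports (Ollama on 11434, LiteLLM on 4000, as referenced elsewhere in these docs):

```bash
# Are all containers running?
docker compose ps

# Is Ollama responding? (lists the models it has pulled)
curl http://localhost:11434/api/tags

# Is LiteLLM healthy?
curl http://localhost:4000/health/liveness
```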

---

## Project structure

```
ai-stack/
├── install.sh                    # Main installer
├── docker-compose.yml            # Full stack definition
├── .env.example                  # All configurable values
├── systemd/
│   └── ai-stack.service          # Systemd unit (auto-start on boot)
├── scripts/
│   ├── check-arc-gpu.sh          # GPU pre-flight (detects card0/card1 drift)
│   ├── discover-herd.sh          # mDNS discovery of remote Ollama nodes
│   ├── generate-olla-config.sh   # Reads .env → writes proxy/olla.yaml
│   ├── resolve-vaultwarden.sh    # Resolves <vaultwarden:...> placeholders via bw CLI
│   ├── generate-keys.sh          # Generates secure keys (LITELLM_MASTER_KEY)
│   └── olla.yaml.template        # Olla config template (input to generate-olla-config.sh)
├── router/
│   └── smart-model-router.py     # Content-based model routing (OpenCode → router → Olla)
├── retriever/
│   ├── main.py                   # FastAPI app
│   ├── search.py                 # Hybrid search (FTS5 + vector, RRF fusion)
│   ├── indexer.py                # Vault scanner + watchdog + chunking
│   └── Dockerfile
├── proxy/
│   └── litellm_config.yaml       # LiteLLM model registry (Claude, Gemini)
├── .opencode/
│   └── tools/
│       └── vault-search.ts       # OpenCode tool: search vault via retriever API
└── docs/
    ├── deployment-guide.md       # Setup walkthrough
    ├── model-guide.md            # Model recommendations and routing
    ├── troubleshooting.md        # Common issues and fixes
    └── retriever-guide.md        # Obsidian vault RAG setup
```

---

## Model stack

| Model | Use case |
|-------|----------|
| `gemma3:27b` | Heavy lifting, large context, complex analysis |
| `mistral-small3.2:24b` | Strong function calling, 128K context |
| `qwen3:14b` | Improved reasoning, tool calling (recommended default) |
| `qwen2.5:14b` | Tool calling, diagnostics, sysadmin (router default for diagnostics) |
| `qwen2.5-coder:14b` | Scripts, configs, code (router default for code) |
| `deepseek-r1:14b` | Complex reasoning, root cause analysis (router default for reasoning) |
| `gemma3:12b` | Log analysis, summaries, documentation (router default for longform) |
| `nomic-embed-text` | Embeddings / RAG |

See **[docs/model-guide.md](docs/model-guide.md)** for details.
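
Models can also be pulled or updated manually at any time. A sketch, assuming the Ollama service is named `ollama-arc` as in the installer output above:

```bash
# Pull the recommended default model inside the ollama-arc container
docker compose exec ollama-arc ollama pull qwen3:14b

# Pull the embedding model used by the retriever
docker compose exec ollama-arc ollama pull nomic-embed-text
```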

---

## Known Intel Arc quirks

- The DRI card node (`/dev/dri/card0` vs `card1`) can drift between reboots on Meteor Lake. The `check-arc-gpu.sh` pre-flight script detects and corrects this automatically (the sketch after this list shows the core idea).
- Intel iGPU uses shared system RAM — `runner.vram="0 B"` in Ollama logs is expected and normal.
- Use `OLLAMA_KEEP_ALIVE=-1` to keep models resident in memory between requests.
- `renderD128` is the compute node and is stable; only the `cardN` display node drifts.
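
The detection boils down to finding which `cardN` node belongs to the Intel GPU. A simplified sketch of the idea (not the actual script; `check-arc-gpu.sh` is the authoritative version):

```bash
#!/usr/bin/env bash
# Locate the DRM card node owned by an Intel GPU (PCI vendor ID 0x8086).
# The cardN display node can change across reboots; renderD128 stays stable.
for dev in /sys/class/drm/card[0-9]; do
  if [ "$(cat "$dev/device/vendor" 2>/dev/null)" = "0x8086" ]; then
    echo "Intel GPU display node: /dev/dri/$(basename "$dev")"
  fi
done
```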

---

## Multi-machine setup

Add remote Ollama nodes via `.env` (the number after the port sets the node's priority for Olla's load balancing):

```
OLLAMA_REMOTE_WORKSTATION=http://192.168.1.50:11434:75
```
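
Before wiring a node in, it is worth confirming it is reachable (`/api/version` is Ollama's standard version endpoint):

```bash
curl http://192.168.1.50:11434/api/version
```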

Then regenerate Olla config:

```bash
bash scripts/generate-olla-config.sh
sudo systemctl restart ai-stack.service
```

Or auto-discover nodes on your LAN:

```bash
bash scripts/discover-herd.sh --apply
```

---

## Secret management (optional)

The stack can resolve API keys from Bitwarden (or self-hosted VaultWarden) at runtime using `<vaultwarden:org-id/item-name>` placeholders in `.env`:

```
ANTHROPIC_API_KEY=<vaultwarden:abc123-xyz/anthropic-api-key>
```

The `install.sh` script prompts to set this up:
- Installs `bw` CLI (via npm)
- Collects your organization ID and API credentials
- Writes `BW_CLIENT_ID`, `BW_CLIENT_SECRET`, `VAULT_MASTER_PASSWORD` to `.env`
- Creates vaultwarden placeholders for `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, and `LITELLM_MASTER_KEY`
- Resolves them immediately

On subsequent starts, `start.sh` auto-resolves any unresolved placeholders.

Manual resolution:
```bash
./scripts/resolve-vaultwarden.sh # resolve in-place
./scripts/resolve-vaultwarden.sh --dry-run # preview only
```
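
Under the hood, resolving a placeholder boils down to roughly this `bw` CLI sequence (a simplified sketch, not the script's exact logic; it assumes the credentials that `install.sh` wrote to `.env` are exported):

```bash
# Point the CLI at a self-hosted VaultWarden, if applicable (URL is hypothetical)
bw config server https://vault.example.com

# Authenticate with the API key credentials from .env
export BW_CLIENTID="$BW_CLIENT_ID" BW_CLIENTSECRET="$BW_CLIENT_SECRET"
bw login --apikey

# Unlock the vault and capture a session token
export BW_SESSION="$(bw unlock "$VAULT_MASTER_PASSWORD" --raw)"

# Fetch the secret behind <vaultwarden:org-id/anthropic-api-key>
bw get password "anthropic-api-key"
```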

---

## Updating the stack

```bash
cd /path/to/ai-stack

# Pull latest images
docker compose pull

# Restart with new images
sudo systemctl restart ai-stack.service
```
**[Get started → docs/install.md](docs/install.md)**

---

## Licence

MIT — use freely, contributions welcome.

Built with ☕ and stubbornness.
141 changes: 141 additions & 0 deletions docs/cloud-models.md
@@ -0,0 +1,141 @@
# Cloud models (Claude, Gemini)

ai-stack is designed to work entirely locally, with no cloud dependency. Free-tier access to frontier cloud models (Claude and Gemini) is available as an optional add-on — useful when a task genuinely needs more capability than a 14B local model provides.

This is handled through LiteLLM, which proxies cloud API calls through a local endpoint. From OpenCode's perspective, it's just another provider.

---

## What "free tier" means

Both Anthropic (Claude) and Google (Gemini) offer free API access:

- **Claude** (via Anthropic): Generous free tier on Claude Haiku and Claude Sonnet. Rate-limited, but sufficient for occasional use. Requires account creation at [console.anthropic.com](https://console.anthropic.com).
- **Gemini** (via Google AI Studio): Free tier on Gemini Flash and Gemini Pro. More liberal rate limits. Requires Google account at [aistudio.google.com](https://aistudio.google.com).

Free tiers can be revoked or changed by the providers. Check current limits at their respective developer consoles.

---

## Getting API keys

### Claude (Anthropic)

1. Go to [console.anthropic.com](https://console.anthropic.com)
2. Create an account (free)
3. Go to **API Keys** → **Create Key**
4. Copy the key (starts with `sk-ant-...`)

### Gemini (Google AI Studio)

1. Go to [aistudio.google.com](https://aistudio.google.com)
2. Sign in with your Google account
3. Click **Get API Key** → **Create API key in new project**
4. Copy the key

---

## Adding keys to the stack

### Via `.env` directly

Edit `.env` and set:

```bash
ANTHROPIC_API_KEY=sk-ant-your-key-here
GEMINI_API_KEY=your-gemini-key-here
```

Then restart:
```bash
sudo systemctl restart ai-stack.service
```

### Via Bitwarden/VaultWarden (recommended)

If you configured Bitwarden during install, use the placeholder format instead:

```bash
ANTHROPIC_API_KEY=<vaultwarden:your-org-id/anthropic-api-key>
GEMINI_API_KEY=<vaultwarden:your-org-id/gemini-api-key>
```

The stack resolves these at startup. See [docs/secret-management.md](secret-management.md) for details.

---

## Verifying cloud models are available

After adding keys and restarting:

```bash
# Check LiteLLM is healthy
curl http://localhost:4000/health/liveness

# List available models
curl http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | python3 -m json.tool
```

You should see Claude and Gemini models in the list.
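
For an end-to-end test, send a small completion through the proxy (LiteLLM exposes an OpenAI-compatible API; `claude-haiku` matches the default config shown below):

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-haiku",
    "messages": [{"role": "user", "content": "Reply with one word: pong"}]
  }'
```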

---

## Using cloud models in OpenCode

Cloud models are available through the LiteLLM provider (`:4000`). In OpenCode, you can switch providers or configure cloud models as a fallback.

To direct a specific request to a cloud model, select the LiteLLM provider in OpenCode and choose the model explicitly. The smart router routes to local models by default — cloud models are not part of the automatic routing unless you configure them in the router's `MODELS` map.

**When to use cloud models:**
- Complex reasoning that requires a frontier-scale model
- Very long documents that exceed local model context limits
- Tasks where output quality matters more than privacy/cost

**When to stick with local:**
- Anything involving sensitive information
- Repetitive or bulk tasks (free tier has rate limits)
- When you need fast iteration (local is often faster for short tasks)

---

## The LiteLLM configuration

Cloud model definitions live in `proxy/litellm_config.yaml`. The default config includes:

```yaml
model_list:
- model_name: claude-haiku
litellm_params:
model: anthropic/claude-3-haiku-20240307
api_key: os.environ/ANTHROPIC_API_KEY

- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY

- model_name: gemini-flash
litellm_params:
model: gemini/gemini-1.5-flash
api_key: os.environ/GEMINI_API_KEY

- model_name: gemini-pro
litellm_params:
model: gemini/gemini-1.5-pro
api_key: os.environ/GEMINI_API_KEY
```

To add more models or change which models are available, edit this file and restart the stack.

---

## Rate limits and costs

Free tiers have limits. If you hit them, LiteLLM will return a rate limit error (429). The stack does not automatically retry or fall back to another provider.

To avoid surprises:
- Use local models for routine tasks
- Reserve cloud calls for tasks where the quality difference matters
- Watch your usage at the provider consoles

If you start regularly hitting free tier limits and want to add paid credits, simply add credits to your Anthropic or Google AI account — no config changes needed.