A self-hosted ChatGPT-style web app that runs on your machine using Ollama. No cloud APIs, no API keys; your data stays local.
- Chat interface: streaming responses, markdown with syntax highlighting, conversation history
- Vision / image input: attach images via button, drag-and-drop, or clipboard paste — auto-routes to a vision-capable model
- Smart model routing: "Auto" mode routes prompts containing code patterns to your configured code model, and prompts with attached images to a vision model
- System prompts: set a custom system prompt per conversation to control the model's behavior
- Grounded responses: assistant messages include confidence + citations when RAG context is used
- Persistent memory: automatic memory capture from conversation turns (`preference`, `fact`, `decision`)
- Memory transparency: assistant responses show which memory items were injected automatically
- Optional self-hosted voice mode: push-to-talk input + spoken assistant replies via local speech provider
- Configurable models: set your default, code, and embedding models from the in-app settings panel — works with whatever you have installed
- Dark mode, mobile-responsive layout, and conversation management
Knowledge Base
Upload documents or paste URLs to build a searchable knowledge base. When RAG is enabled for a conversation, relevant chunks are automatically retrieved and injected into the prompt context.
- Upload files: drag-and-drop or file picker (supports `.md`, `.txt`, `.pdf`, `.ts`, `.js`, `.py`, `.go`, `.rs`, `.java`, `.cpp`, `.c`, `.html`, `.css`, `.json`, `.yaml`, `.toml`)
- Index URLs: paste any URL to scrape and index its content
- Document management: view status, chunk count, file size; reindex or delete documents
- Test search: run queries against your indexed documents to verify retrieval quality
- Last cited visibility: knowledge base table shows when each document was last cited in chat
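Conceptually, retrieval works like the following TypeScript sketch: score chunks against the query embedding by cosine similarity, drop anything below the threshold, and keep the top-K. The `Chunk` shape and helper names here are illustrative, not the app's actual types; the real pipeline lives in `src/lib/rag/`.

```ts
// Illustrative sketch of the retrieval step; names are hypothetical.
interface Chunk {
  documentId: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk, apply the similarity threshold, keep the top-K.
function retrieve(
  queryEmbedding: number[],
  chunks: Chunk[],
  topK: number,      // "Top-K results" setting (1-20)
  threshold: number, // "Similarity threshold" setting (0-1)
): Chunk[] {
  return chunks
    .map((c) => ({ chunk: c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter((s) => s.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.chunk);
}
```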
Settings
Configure model, memory, and RAG parameters from the settings page.
- Automatic memory capture: memory extraction runs on every completed turn
- Memory token budget: max prompt budget used for memory injection (approximation: `content.length / 4`; see the sketch after this list)
- Voice provider settings: one-time setup for local speech URL + STT/TTS models + voice preset
- Voice behavior controls: enable/disable voice globally and auto-speak assistant replies
- RAG toggle: enable or disable RAG globally
- Chunk size / overlap: control how documents are split into chunks (100–2000 tokens, 0–500 overlap)
- Top-K results: number of chunks retrieved per query (1–20)
- Similarity threshold: minimum cosine similarity score for retrieved chunks (0–1)
- Embedding model: select which Ollama model generates embeddings (default: `nomic-embed-text`)
- Watched folders: add local directories for automatic file indexing via file watcher
- Supported file types: toggle which file extensions are indexed
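As a rough illustration of the memory-budget and chunking settings above, here is a sketch built on the `content.length / 4` token approximation. The function names are hypothetical; the real logic lives in `src/lib/memory/` and `src/lib/rag/chunker`.

```ts
// Approximate token count the way the settings page describes it.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Take pre-ranked memory items until the token budget is exhausted.
function selectWithinBudget<T extends { content: string }>(
  items: T[],          // assumed most-relevant-first
  tokenBudget: number, // "Memory token budget" setting
): T[] {
  const selected: T[] = [];
  let used = 0;
  for (const item of items) {
    const cost = estimateTokens(item.content);
    if (used + cost > tokenBudget) break;
    selected.push(item);
    used += cost;
  }
  return selected;
}

// Split text into fixed-size chunks with overlap (sizes in estimated tokens).
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  const chunkChars = chunkSize * 4;                       // invert length/4
  const stepChars = Math.max(1, (chunkSize - overlap) * 4); // guard overlap >= size
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stepChars) {
    chunks.push(text.slice(start, start + chunkChars));
  }
  return chunks;
}
```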
Memory
Memory is automatic with user review controls:
- Memory page (`/memory`): review, search, filter, edit, archive, and restore auto-captured memory items
- Scopes: `global` (available to all conversations) and `conversation` (only injected into a specific conversation)
- Types: `preference`, `fact`, `decision`
- Usage tracking: each memory item records `useCount` and `lastUsedAt`
- Auto-capture tags: automatically extracted memories are tagged with `auto`
- Prompt injection order: system prompt → memory → RAG → conversation history (sketched below)
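The injection order above could be assembled roughly as in this sketch. The types mirror a typical chat-message shape; the helper is hypothetical rather than the app's actual code (see `src/lib/chat/context` for the real message building).

```ts
// Sketch of the injection order: system prompt -> memory -> RAG -> history.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildMessages(
  systemPrompt: string,
  memoryBlock: string,    // selected memory items, rendered as text
  ragContext: string,     // retrieved chunks, rendered as text
  history: ChatMessage[], // prior conversation turns
): ChatMessage[] {
  const systemParts = [systemPrompt];
  if (memoryBlock) systemParts.push(`Relevant memory:\n${memoryBlock}`);
  if (ragContext) systemParts.push(`Context from knowledge base:\n${ragContext}`);
  return [{ role: "system", content: systemParts.join("\n\n") }, ...history];
}
```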
Install Ollama and pull the models you want to use:

```bash
brew install ollama   # or download from ollama.ai
ollama serve          # or open the desktop app
```

```bash
ollama pull gemma2:9b          # lightweight general model
ollama pull qwen3:14b          # larger general model
ollama pull qwen3-coder:30b    # coding specialist
ollama pull llama3.2-vision    # vision model for image input
ollama pull nomic-embed-text   # embeddings for RAG
```

Any model from ollama.ai/library works. Pull it and it appears in the app.
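Model auto-detection is possible because Ollama exposes a local HTTP API. A minimal sketch of listing installed models via its `GET /api/tags` endpoint (response shape trimmed to the one field used here):

```ts
// List locally pulled Ollama models.
async function listInstalledModels(baseUrl = "http://localhost:11434"): Promise<string[]> {
  const res = await fetch(`${baseUrl}/api/tags`);
  const data = (await res.json()) as { models: { name: string }[] };
  return data.models.map((m) => m.name);
}

// e.g. ["gemma2:9b", "qwen3-coder:30b", "nomic-embed-text", ...]
```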
Then install dependencies and start the app:

```bash
pnpm install
pnpm db:setup
pnpm dev
```

Open http://localhost:3000. The SQLite database is created automatically. `pnpm db:setup` runs Prisma migrations and sets up the vector column for RAG embeddings.
Push-to-talk input and spoken assistant replies via a local Speaches sidecar. No cloud APIs — everything runs on your machine.
```bash
docker compose -f docker-compose.voice.yml up -d   # start Speaches
```

Then open Settings → Voice and enable it. See VOICE.md for the full setup guide, model installation, troubleshooting, and GPU acceleration.
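For a sense of what the voice path does, here is a sketch of a text-to-speech request against an OpenAI-compatible `/v1/audio/speech` endpoint, assuming the sidecar listens on `localhost:8000`. The model and voice names are placeholders, not Speaches defaults; configure the real ones under Settings → Voice.

```ts
// Request spoken audio for a reply from the local speech sidecar.
async function speak(text: string): Promise<ArrayBuffer> {
  const res = await fetch("http://localhost:8000/v1/audio/speech", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "tts-model",    // placeholder TTS model id
      input: text,
      voice: "voice-preset", // placeholder voice preset
      speed: 0.92,           // VOICE_TTS_SPEED default
    }),
  });
  return res.arrayBuffer(); // audio bytes to play back in the browser
}
```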
Click the gear icon in the header to open settings. Choose which of your installed models to use for:
- Default Model — general conversations
- Code Model — used by Auto mode when code is detected in your prompt
- Embedding Model — used for RAG document embeddings
Vision-capable models are tagged with (vision) in the model selector. When images are attached in Auto mode, the app auto-routes to the first available vision model.
On first run, the app auto-detects your installed models and picks defaults.
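A simplified sketch of this routing decision follows; the actual logic is split across `src/lib/router.ts` and `src/lib/chat/resolve-model`, and the names here are illustrative:

```ts
// Sketch of "Auto" mode: images win, then code detection, then the default.
interface ModelConfig {
  defaultModel: string;
  codeModel: string;
  visionModels: string[]; // models tagged (vision) in the selector
}

function resolveModel(
  prompt: string,
  hasImages: boolean,
  config: ModelConfig,
  looksLikeCode: (text: string) => boolean,
): string {
  if (hasImages && config.visionModels.length > 0) return config.visionModels[0];
  if (looksLikeCode(prompt)) return config.codeModel;
  return config.defaultModel;
}
```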
- Routing logic: edit `src/lib/router.ts` to change which patterns trigger the code model (a sample pattern set follows this list).
- System prompt: click the "System Prompt" toggle below the chat header to set per-conversation instructions.
- Memory ranking/budget logic: edit `src/lib/memory/index.ts`.
- Remote Ollama: set `OLLAMA_BASE_URL` in `.env` to point at a GPU server running Ollama.
- Remote speech service: set `VOICE_BASE_URL` / `VOICE_API_KEY` to any OpenAI-compatible self-hosted speech API.
- Voice tone tuning: set `VOICE_TTS_SPEED` (default `0.92`). A range around `0.9–0.95` often sounds less robotic.
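For example, pattern-matching code detection in the spirit of `src/lib/router.ts` might look like this; the patterns shown are illustrative, not the shipped set:

```ts
// Illustrative code-detection patterns for the smart router.
const CODE_PATTERNS: RegExp[] = [
  /```/,                                               // fenced code in the prompt
  /\b(function|const|let|class|import|def|fn|func)\b/, // common keywords
  /[{};]\s*$/m,                                        // line endings typical of code
  /\b(npm|pnpm|pip|cargo|go)\s+(install|add|run|build)\b/, // tool invocations
];

const looksLikeCode = (text: string): boolean =>
  CODE_PATTERNS.some((p) => p.test(text));
```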
When RAG is enabled, assistant messages show an experimental confidence label (high / medium / low) and cite the source documents used. When no relevant sources are found, the model still responds but is instructed to flag uncertainty.
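One plausible way such a label could be derived is by bucketing the best retrieval similarity score, as in this hypothetical sketch; the app's actual heuristic may differ:

```ts
// Hypothetical confidence bucketing from retrieval similarity scores.
type Confidence = "high" | "medium" | "low";

function confidenceFromScores(scores: number[]): Confidence {
  const best = Math.max(0, ...scores); // empty scores -> 0 -> "low"
  if (best >= 0.8) return "high";
  if (best >= 0.6) return "medium";
  return "low";
}
```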
```bash
pnpm eval:grounding
```

Runs retrieval checks from `evals/datasets/grounding.jsonl`.

- Writes a report to `evals/reports/grounding-baseline.json`
- Skips when the app is unreachable (use `EVAL_STRICT=1` to fail instead)

```bash
pnpm eval:grounding:export
```

Exports draft low-confidence cases from local usage to `evals/datasets/grounding.draft.jsonl`.
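A grounding case might be shaped roughly like this hypothetical TypeScript model of one `grounding.jsonl` record; the real field names may differ:

```ts
// Hypothetical shape of one grounding eval case.
interface GroundingCase {
  query: string;             // question to run retrieval for
  expectedSources: string[]; // documents that should be retrieved/cited
}

const example: GroundingCase = {
  query: "What port does the app run on?",
  expectedSources: ["README.md"],
};
```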
```
src/lib/
├── chat/              # Chat pipeline helpers
│   ├── resolve-model  # Smart router wrapper (auto → code/default model)
│   ├── context        # Message building, system prompt + RAG injection
│   └── index          # Barrel exports
├── memory/            # Memory selection, ranking, injection, tag/usage helpers
├── rag/               # RAG pipeline
│   ├── embeddings     # Ollama embedding API
│   ├── vector-db      # libSQL vector search
│   ├── chunker        # Document-aware chunking
│   └── parsers/       # Markdown, PDF, code, URL parsers
├── tools/             # Agent tool definitions + executor (web_search, fetch_url)
├── voice/             # Voice STT/TTS integration (Speaches)
├── router             # Pattern-matching code detection
├── ollama             # Ollama HTTP API wrapper (chat, embeddings, model capabilities)
├── config             # App configuration (DB-backed)
└── db                 # Prisma client singleton
```
Next.js 16 · React 19 · Tailwind v4 · TypeScript · Prisma v7 + SQLite · Ollama API
