Maxkrvo/OllamaChat

OllamaChat


A self-hosted ChatGPT-style web app that runs on your machine using Ollama. No cloud APIs, no API keys; your data stays local.

Features

  • Chat interface: streaming responses, markdown with syntax highlighting, conversation history
  • Vision / image input: attach images via button, drag-and-drop, or clipboard paste — auto-routes to a vision-capable model
  • Smart model routing: "Auto" mode detects code patterns and routes to your configured code model, and detects images to route to a vision model
  • System prompts: set a custom system prompt per conversation to control the model's behavior
  • Grounded responses: assistant messages include confidence + citations when RAG context is used
  • Persistent memory: automatic memory capture from conversation turns (preference, fact, decision)
  • Memory transparency: assistant responses show which memory items were injected automatically
  • Optional self-hosted voice mode: push-to-talk input + spoken assistant replies via local speech provider
  • Configurable models: set your default, code, and embedding models from the in-app settings panel — works with whatever you have installed
  • Polished UI: dark mode, mobile-responsive layout, conversation management
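The Auto-mode routing described above can be sketched as a simple pattern check. This is an illustrative sketch only; the names and patterns below are assumptions, and the app's actual logic lives in src/lib/router.ts:

```typescript
// Hypothetical sketch of Auto-mode routing. The patterns and function
// names are illustrative, not the app's real implementation.
type ModelKind = "default" | "code" | "vision";

const CODE_PATTERNS: RegExp[] = [
  /```/,                                        // fenced code block
  /\b(function|const|class|import|def|fn)\b/,   // common keywords
  /\.(ts|js|py|go|rs|java|cpp)\b/,              // file extensions
];

function routeModel(prompt: string, hasImages: boolean): ModelKind {
  if (hasImages) return "vision"; // attached images take priority
  if (CODE_PATTERNS.some((p) => p.test(prompt))) return "code";
  return "default";
}
```

The chosen kind would then map to whichever default/code/vision model is configured in settings.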
Knowledge Base

Upload documents or paste URLs to build a searchable knowledge base. When RAG is enabled for a conversation, relevant chunks are automatically retrieved and injected into the prompt context.

  • Upload files: drag-and-drop or file picker (supports .md, .txt, .pdf, .ts, .js, .py, .go, .rs, .java, .cpp, .c, .html, .css, .json, .yaml, .toml)
  • Index URLs: paste any URL to scrape and index its content
  • Document management: view status, chunk count, file size; reindex or delete documents
  • Test search: run queries against your indexed documents to verify retrieval quality
  • Last cited visibility: knowledge base table shows when each document was last cited in chat
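At query time, retrieval like the above typically embeds the query and keeps the top-K chunks above a similarity threshold. A simplified sketch (the Chunk shape and function names are assumptions; the real pipeline lives in src/lib/rag/):

```typescript
// Simplified top-K retrieval over pre-computed embeddings.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function retrieve(
  queryEmbedding: number[],
  chunks: Chunk[],
  topK = 5,
  threshold = 0.5,
): Chunk[] {
  return chunks
    .map((c) => ({ c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter(({ score }) => score >= threshold) // drop weak matches
    .sort((a, b) => b.score - a.score)         // best first
    .slice(0, topK)
    .map(({ c }) => c);
}
```

The topK and threshold parameters correspond to the Top-K and similarity-threshold settings described below.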
Settings

Configure model, memory, and RAG parameters from the settings page.

  • Automatic memory capture: memory extraction runs on every completed turn

  • Memory token budget: max prompt budget used for memory injection (approximation: content.length / 4)

  • Voice provider settings: one-time setup for local speech URL + STT/TTS models + voice preset

  • Voice behavior controls: enable/disable voice globally and auto-speak assistant replies

  • RAG toggle: enable or disable RAG globally

  • Chunk size / overlap: control how documents are split into chunks (100–2000 tokens, 0–500 overlap)

  • Top-K results: number of chunks retrieved per query (1–20)

  • Similarity threshold: minimum cosine similarity score for retrieved chunks (0–1)

  • Embedding model: select which Ollama model generates embeddings (default: nomic-embed-text)

  • Watched folders: add local directories for automatic file indexing via file watcher

  • Supported file types: toggle which file extensions are indexed
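The chunk-size and overlap settings above can be illustrated with a sliding-window splitter, using the rough content.length / 4 token approximation mentioned earlier. This is a minimal sketch; the real chunker in src/lib/rag/chunker is document-aware:

```typescript
// Sliding-window chunker sketch. Tokens are approximated as
// content.length / 4, per the estimate quoted in this README.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function chunkText(text: string, chunkTokens = 500, overlapTokens = 50): string[] {
  const chunkChars = chunkTokens * 4;                 // invert the /4 approximation
  const stepChars = (chunkTokens - overlapTokens) * 4; // advance less than a full chunk
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stepChars) {
    chunks.push(text.slice(start, start + chunkChars));
    if (start + chunkChars >= text.length) break;     // final chunk reached the end
  }
  return chunks;
}
```

Each chunk shares overlapTokens worth of text with its neighbor, so a sentence near a boundary is retrievable from either side.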

Memory

Memory is automatic with user review controls:

  • Memory page (/memory): review, search, filter, edit, archive, and restore auto-captured memory items
  • Scopes:
    • global: available to all conversations
    • conversation: only injected into a specific conversation
  • Types: preference, fact, decision
  • Usage tracking: each memory item records useCount and lastUsedAt
  • Auto-capture tags: automatically extracted memories are tagged with auto
  • Prompt injection order: system prompt → memory → RAG → conversation history
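The injection order above can be sketched as message assembly. The shapes and section labels here are hypothetical; the actual builder is src/lib/chat/context:

```typescript
// Sketch of prompt assembly in the documented order:
// system prompt → memory → RAG context → conversation history.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildMessages(
  systemPrompt: string,
  memoryItems: string[],
  ragChunks: string[],
  history: Message[],
): Message[] {
  const systemParts = [systemPrompt];
  if (memoryItems.length) {
    systemParts.push("Relevant memory:\n" + memoryItems.join("\n"));
  }
  if (ragChunks.length) {
    systemParts.push("Context:\n" + ragChunks.join("\n---\n"));
  }
  // One combined system message, then the raw conversation history.
  return [{ role: "system", content: systemParts.join("\n\n") }, ...history];
}
```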

Setup

1. Install Ollama

brew install ollama   # or download from ollama.ai
ollama serve          # or open the desktop app

2. Pull any models you want

ollama pull gemma2:9b          # lightweight general model
ollama pull qwen3:14b          # larger general model
ollama pull qwen3-coder:30b    # coding specialist
ollama pull llama3.2-vision    # vision model for image input
ollama pull nomic-embed-text   # embeddings for RAG

Any model from ollama.ai/library works. Pull it and it appears in the app.

3. Run

pnpm install
pnpm db:setup
pnpm dev

Open http://localhost:3000. The SQLite database is created automatically.

pnpm db:setup runs Prisma migrations and sets up the vector column for RAG embeddings.

Optional: self-hosted voice mode

Push-to-talk input and spoken assistant replies via a local Speaches sidecar. No cloud APIs — everything runs on your machine.

docker compose -f docker-compose.voice.yml up -d   # start Speaches

Then open Settings → Voice and enable it. See VOICE.md for the full setup guide, model installation, troubleshooting, and GPU acceleration.

4. Configure models

Click the gear icon in the header to open settings. Choose which of your installed models to use for:

  • Default Model — general conversations
  • Code Model — used by Auto mode when code is detected in your prompt
  • Embedding Model — used for RAG document embeddings

Vision-capable models are tagged with (vision) in the model selector. When images are attached in Auto mode, the app auto-routes to the first available vision model.

On first run, the app auto-detects your installed models and picks defaults.

Customizing

  • Routing logic: edit src/lib/router.ts to change which patterns trigger the code model.
  • System prompt: click the "System Prompt" toggle below the chat header to set per-conversation instructions.
  • Memory ranking/budget logic: edit src/lib/memory/index.ts.
  • Remote Ollama: set OLLAMA_BASE_URL in .env to point at a GPU server running Ollama.
  • Remote speech service: set VOICE_BASE_URL/VOICE_API_KEY to any OpenAI-compatible self-hosted speech API.
  • Voice tone tuning: set VOICE_TTS_SPEED (default 0.92). A range around 0.9-0.95 often sounds less robotic.
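Pointing the app at a remote Ollama mostly means prefixing API paths with OLLAMA_BASE_URL. A minimal sketch of what such a wrapper might look like (illustrative only; the real wrapper is src/lib/ollama, and the helper name here is made up):

```typescript
// Illustrative sketch: resolve Ollama API URLs against OLLAMA_BASE_URL,
// falling back to the local default port.
function ollamaUrl(
  path: string,
  base = process.env.OLLAMA_BASE_URL ?? "http://localhost:11434",
): string {
  return `${base.replace(/\/$/, "")}${path}`; // tolerate a trailing slash
}

// Example call against Ollama's /api/chat endpoint (non-streaming).
async function chat(model: string, messages: { role: string; content: string }[]) {
  const res = await fetch(ollamaUrl("/api/chat"), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: false }),
  });
  return res.json();
}
```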

Grounding and Citations

When RAG is enabled, assistant messages show an experimental confidence label (high / medium / low) and cite the source documents used. When no relevant sources are found, the model still responds but is instructed to flag uncertainty.
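One plausible way such a label could be derived is by thresholding the best retrieval similarity score. This is purely an illustrative assumption; the app's actual heuristic is not documented here:

```typescript
type Confidence = "high" | "medium" | "low";

// Illustrative thresholds only; the app's real heuristic may differ.
function confidenceLabel(bestSimilarity: number): Confidence {
  if (bestSimilarity >= 0.8) return "high";
  if (bestSimilarity >= 0.6) return "medium";
  return "low";
}
```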

Evals

pnpm eval:grounding

Runs retrieval checks from evals/datasets/grounding.jsonl.

  • Writes report to evals/reports/grounding-baseline.json
  • Skips when app is unreachable (use EVAL_STRICT=1 to fail instead)
pnpm eval:grounding:export

Exports draft low-confidence cases from local usage to evals/datasets/grounding.draft.jsonl.

Architecture

src/lib/
├── chat/              # Chat pipeline helpers
│   ├── resolve-model  # Smart router wrapper (auto → code/default model)
│   ├── context        # Message building, system prompt + RAG injection
│   └── index          # Barrel exports
├── memory/            # Memory selection, ranking, injection, tag/usage helpers
├── rag/               # RAG pipeline
│   ├── embeddings     # Ollama embedding API
│   ├── vector-db      # libSQL vector search
│   ├── chunker        # Document-aware chunking
│   └── parsers/       # Markdown, PDF, code, URL parsers
├── tools/             # Agent tool definitions + executor (web_search, fetch_url)
├── voice/             # Voice STT/TTS integration (Speaches)
├── router             # Pattern-matching code detection
├── ollama             # Ollama HTTP API wrapper (chat, embeddings, model capabilities)
├── config             # App configuration (DB-backed)
└── db                 # Prisma client singleton

Stack

Next.js 16 · React 19 · Tailwind v4 · TypeScript · Prisma v7 + SQLite · Ollama API
