A self-hosted RAG application for IT helpdesk support. Users ask questions in natural language and receive answers sourced directly from uploaded helpdesk documents.
When a user asks a question, the system:
- Searches your helpdesk documents for the most relevant information
- Passes that information to a local language model
- Generates an answer based only on your documents
- Returns the answer to the user in a chat interface
This project uses RAG (Retrieval-Augmented Generation): the model answers from your documents rather than from its general training data.
| Tool | Role |
|---|---|
| Docker | Runs and connects all services |
| Ollama | Local LLM server — runs the chat and embedding models |
| Open WebUI | Chat interface, RAG pipeline, and built-in vector store (ChromaDB) |
Uploaded documents are split into chunks by Open WebUI's text splitter (chunk size: 1000, overlap: 100). Each chunk is passed to nomic-embed-text via the Ollama API, which returns a vector representation. The vector and original text are stored together in ChromaDB.
Your document
↓
Text splitter (chunk size: 1000, overlap: 100)
↓
nomic-embed-text → vector
↓
ChromaDB stores vector + original text
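The chunking step can be illustrated with a minimal sketch. It assumes the chunk size and overlap are counted in characters and uses a plain sliding window; Open WebUI's actual text splitter may choose its break points differently (for example, at paragraph or sentence boundaries). The function name `split_into_chunks` is hypothetical and is not part of Open WebUI.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # with the defaults, each window starts 900 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, the last 100 characters of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.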
At query time, the user's question is embedded using the same nomic-embed-text model. ChromaDB performs a hybrid search (vector similarity + BM25 keyword search, weighted 0.5/0.5) to retrieve the top K most relevant chunks. The chunks are injected into a RAG prompt template alongside the user's question and passed to the chat model via Ollama.
User asks a question
↓
nomic-embed-text → vector
↓
ChromaDB hybrid search (vector similarity + BM25) → top K chunks
↓
RAG prompt template: system prompt + chunks + question
↓
llama3.1:8b generates response
↓
Answer streams back to user with source citations
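The ranking and prompt-assembly steps above can be sketched as follows. This is an illustration under stated assumptions, not Open WebUI's implementation: it assumes both the vector and BM25 scores are already normalised to the same 0 to 1 range and combines them as a weighted sum, whereas the real pipeline may fuse results differently (for example, by rank). The function names and the prompt layout are hypothetical.

```python
def fuse_scores(vector_scores, bm25_scores, vector_weight=0.5, bm25_weight=0.5):
    """Weighted sum of two score dicts keyed by chunk id; a missing score counts as 0."""
    chunk_ids = set(vector_scores) | set(bm25_scores)
    return {
        cid: vector_weight * vector_scores.get(cid, 0.0)
        + bm25_weight * bm25_scores.get(cid, 0.0)
        for cid in chunk_ids
    }


def top_k(fused, k=10):
    """Return the k chunk ids with the highest fused score (ties broken by id)."""
    return sorted(fused, key=lambda cid: (-fused[cid], cid))[:k]


def build_prompt(system_prompt, chunks, question):
    """Assemble the system prompt, numbered context chunks, and the user's question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}"
```

A chunk that scores moderately on both searches can outrank one that scores highly on only one of them, which is the practical effect of the 0.5/0.5 weighting.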
| Model | Role | Size |
|---|---|---|
| nomic-embed-text | Embedding — converts text to vectors | 274MB |
| llama3.1:8b | Chat — generates responses from retrieved context | 4.7GB |
The embedding model used at ingestion time must match the one used at query time. Changing the embedding model requires re-embedding all documents.
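To make the matching requirement concrete, here is a hedged sketch that requests an embedding from Ollama and compares two vectors. It assumes Ollama is reachable at `http://localhost:11434` and exposes the `/api/embeddings` endpoint with a `prompt` field; check the Ollama API documentation for your installed version. The helper names are hypothetical. The length check only catches the obvious mismatch: vectors from two different models are not comparable even when their dimensions happen to agree.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama API address on the host


def embed(text, model="nomic-embed-text", base_url=OLLAMA_URL):
    """Request one embedding vector from Ollama (requires the service to be running)."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; refuses vectors of different length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Example (needs a running Ollama instance with the model pulled):
# score = cosine_similarity(embed("reset my password"), embed("password reset steps"))
```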
| Spec | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 32GB |
| Disk | 20GB free | 50GB free |
| OS | Windows 10/11, macOS, Linux | Windows 11 |
| Docker | Docker Desktop | Docker Desktop |
| Model | RAM Required | Quality |
|---|---|---|
| llama3.2:3b | ~4GB | Good |
| llama3.1:8b | ~8GB | Very good |
| llama3.1:70b | ~40GB | Excellent (requires 40GB+ RAM) |
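As a rough aid for choosing a model from the table, the sketch below picks the largest model whose approximate RAM requirement fits the machine. The `headroom_gb` value (memory left for the operating system, Docker, and Open WebUI) is an assumption for illustration, not a measured figure, and the function name is hypothetical.

```python
# Approximate RAM requirements copied from the table, smallest first.
MODEL_RAM_GB = [("llama3.2:3b", 4), ("llama3.1:8b", 8), ("llama3.1:70b", 40)]


def pick_chat_model(total_ram_gb, headroom_gb=4):
    """Return the largest model that fits after reserving headroom, or None if none fit."""
    usable = total_ram_gb - headroom_gb
    fitting = [name for name, required in MODEL_RAM_GB if required <= usable]
    return fitting[-1] if fitting else None
```

For example, with the assumed 4GB of headroom, `pick_chat_model(16)` selects `llama3.1:8b`, which matches the default on a machine with the minimum 16GB of RAM.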
The default chat model is llama3.1:8b. The following are tested alternatives:
Mistral 7B (mistral)
A 7B parameter model from Mistral AI. Slightly smaller than Llama 3.1 8B with similar performance. Known for strong instruction following and concise responses. Good alternative if Llama 3.1 8B is not available or underperforms on your hardware.

    docker exec -it ollama ollama pull mistral

Llama 3.1 70B (llama3.1:70b)
The 70B parameter variant of Llama 3.1. Significantly stronger instruction following and lower hallucination rate than the 8B model. Requires approximately 40GB RAM. Recommended for production use where accuracy is critical.

    docker exec -it ollama ollama pull llama3.1:70b

To switch models, update the Base Model in Workspace → Models → IT Helpdesk and click Save & Update.
The embedding model (nomic-embed-text) does not need to change when switching chat models.
Step 1 — Install WSL2 (Windows only)
Docker on Windows requires WSL2. Open PowerShell as Administrator and run:

    wsl --install

Restart your machine when prompted.
Step 2 — Install Docker Desktop
Download and install from: https://www.docker.com/products/docker-desktop
Step 3 — Clone or Download This Project
Place the project folder anywhere on your machine.
Step 4 — Configure Credentials
Create a .env file in the project root:

    WEBUI_SECRET_KEY=your_secret_key_here

Important: Never commit the .env file to GitHub. It is listed in .gitignore by default.
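One way to produce a value for `WEBUI_SECRET_KEY` is Python's standard `secrets` module. This is a minimal sketch: the key length of 32 bytes is an assumption rather than a documented requirement, and the helper names are hypothetical.

```python
import secrets


def make_env_line(num_bytes: int = 32) -> str:
    """Return a WEBUI_SECRET_KEY line with a random hex key (two hex characters per byte)."""
    return f"WEBUI_SECRET_KEY={secrets.token_hex(num_bytes)}"


def write_env_file(path: str = ".env") -> None:
    """Create the .env file; mode "x" raises FileExistsError instead of overwriting."""
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(make_env_line() + "\n")
```

Calling `write_env_file()` from the project root creates the file once. Refusing to overwrite is deliberate: replacing the key on an existing installation may invalidate existing sessions.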
Step 5 — Start All Services

    docker-compose up -d

Verify both containers are running:

    docker-compose ps

Both services should show status Up.
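In addition to checking container status, you can confirm that both services answer HTTP requests from the host. The sketch below assumes the default ports used in this README (8080 for Open WebUI, 11434 for Ollama); the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: default host ports from this README.
SERVICES = {
    "Open WebUI": "http://localhost:8080",
    "Ollama API": "http://localhost:11434",
}


def is_reachable(url, timeout=3.0):
    """Return True if the server answers at all, False if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as error:
        return error.code < 500  # the server responded, just not with a success status
    except OSError:
        return False  # connection refused, timeout, DNS failure, and similar


def down_services(results):
    """Given {service name: reachable?}, list the names that are not reachable."""
    return [name for name, ok in results.items() if not ok]

# Example (run after `docker-compose up -d`):
# print(down_services({name: is_reachable(url) for name, url in SERVICES.items()}))
```

An empty list means both services responded. Open WebUI can take a short while to start, so a retry after a few seconds is reasonable before treating a failure as real.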
Step 6 — Pull Models
Pull the chat model:

    docker exec -it ollama ollama pull llama3.1:8b

Pull the embedding model:

    docker exec -it ollama ollama pull nomic-embed-text

Step 7 — Create an Admin Account
Open http://localhost:8080 and create your admin account on first launch.
Step 8 — Configure RAG Settings
Go to Admin Settings → Documents and configure the following:
Embedding:
- Embedding Model Engine: Ollama
- Ollama URL: http://ollama:11434
- Embedding Model: nomic-embed-text

Retrieval:
- Full Context Mode: OFF
- Hybrid Search: ON
- Enrich Hybrid Search Text: OFF
- Top K: 10
Click Save, then click Reindex Knowledge Base Vectors at the bottom of the page under Danger Zone.
Step 9 — Upload Your Documents
- Go to Workspace → Knowledge
- Click + to create a new knowledge base
- Upload your helpdesk documents (PDF, Word, txt, markdown)
When documents are updated, delete the old version from the knowledge base and re-upload the updated file. Open WebUI re-embeds the file automatically on upload and stores the new vectors in ChromaDB.
Step 10 — Create the Helpdesk Model
- Go to Workspace → Models
- Click + to create a new model
- Set:
  - Name: IT Helpdesk
  - Base Model: llama3.1:8b
  - System Prompt: You are an IT helpdesk assistant. Answer questions using only the information provided in the context. Do not add or assume information that is not explicitly stated. If the answer is not in the context, say "I don't have that information — please contact IT directly."
  - Knowledge: select your knowledge base
  - Capabilities: Citations only
- Click Save & Update
Step 11 — Test
Open http://localhost:8080, select the IT Helpdesk model and ask a question related to your uploaded documents.
| Service | URL | Purpose |
|---|---|---|
| Open WebUI | http://localhost:8080 | User chat interface |
| Ollama API | http://localhost:11434 | LLM API |

    # Start all services
    docker-compose up -d

    # Stop all services
    docker-compose down

    # View running containers
    docker-compose ps

    # View logs for a specific service
    docker-compose logs -f open-webui

    # Pull a new Ollama model
    docker exec -it ollama ollama pull llama3.1:8b

    # List downloaded Ollama models
    docker exec -it ollama ollama list

    # Stop everything and delete all data (DESTRUCTIVE)
    docker-compose down -v

Containers won't start
Ensure Docker Desktop is fully running before running docker-compose up -d.
Model not responding
Check that the model has been pulled: docker exec -it ollama ollama list
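The same check can be scripted against the Ollama API. This sketch assumes the `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field, and that a model pulled without an explicit tag is listed as `name:latest`; verify both against the Ollama API documentation for your version. The helper names are hypothetical.

```python
import json
import urllib.request

REQUIRED_MODELS = ["llama3.1:8b", "nomic-embed-text"]  # models used in this README


def normalize(name):
    """Add the implicit ':latest' tag so 'nomic-embed-text' matches 'nomic-embed-text:latest'."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response, required=REQUIRED_MODELS):
    """Return the required models that do not appear in an /api/tags response."""
    installed = {normalize(model["name"]) for model in tags_response.get("models", [])}
    return [name for name in required if normalize(name) not in installed]


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of downloaded models (requires Ollama to be running)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return json.loads(response.read())

# Example: print(missing_models(fetch_tags()))
```

Any name in the returned list still needs to be pulled with `ollama pull`.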
Answers not sourced from documents
Ensure the knowledge base is selected in the model settings under Workspace → Models. Check that Bypass Embedding and Retrieval is OFF in Admin Settings → Documents.
Wrong documents being retrieved
Delete all documents from the knowledge base and re-upload them fresh. This forces Open WebUI to re-embed them with the currently configured embedding model.
Out of memory errors
Switch to a smaller model. Replace llama3.1:8b with llama3.2:3b (~4GB RAM).
Data lost after restart
Data is safe as long as you do not use the -v flag. Only docker-compose down -v deletes volumes.
The Reindex Knowledge Base Vectors button is located at the bottom of Admin Settings → Documents under the Danger Zone section.
| Scenario | Action required |
|---|---|
| Switched embedding model (e.g. SentenceTransformers → Ollama) | Reindex |
| Changed chunk size or chunk overlap settings | Reindex |
| Documents retrieving incorrectly after settings changes | Reindex |
| Added or replaced documents in the knowledge base | Re-upload only — reindex not required |
Reindexing deletes all existing vectors in ChromaDB and re-embeds every document in the knowledge base using the currently configured embedding model. This ensures the vectors are consistent with the current settings.
Reindexing does not delete your documents — only the vector representations. Documents remain in the knowledge base and are re-processed automatically.
Do not use Reindex as a general fix for bad answers. If retrieval is returning the wrong documents, the more reliable fix is to delete and re-upload the affected documents individually. Reindex is specifically for cases where the embedding model or chunking settings have changed.
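The rule in the table can be expressed as a small check: reindex only when a setting that affects the stored vectors has changed. The setting names below (`embedding_engine`, `embedding_model`, `chunk_size`, `chunk_overlap`, `top_k`) are hypothetical labels for illustration, not Open WebUI configuration keys.

```python
# Hypothetical labels for the settings that change how vectors are produced.
REINDEX_KEYS = ("embedding_engine", "embedding_model", "chunk_size", "chunk_overlap")


def reindex_reasons(old_settings, new_settings):
    """Return the vector-affecting settings that differ; an empty list means no reindex."""
    return [
        key for key in REINDEX_KEYS
        if old_settings.get(key) != new_settings.get(key)
    ]
```

Changing a retrieval-only setting such as Top K leaves the list empty, because it affects how many chunks are fetched, not how they were embedded.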
Model hallucination on broad questions
The llama3.1:8b model may occasionally supplement retrieved context with information from its training data, particularly for open-ended questions. Specific factual questions (phone numbers, step-by-step processes, specific policies) perform best. Upgrading to a larger model (70B+) or a hosted model (GPT-4o) significantly reduces this behaviour.
Phrasing sensitivity
How a question is phrased affects retrieval quality. Specific questions perform better than broad ones. For example, "what is the IT helpdesk phone number" retrieves better than "how do I contact IT".
Re-embedding required after document updates
When documents are modified, they must be deleted and re-uploaded to the knowledge base for changes to take effect. ChromaDB stores vectors from the original upload and does not automatically detect file changes.
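Because changes are not detected automatically, it can help to track which source files differ from the versions you last uploaded. The sketch below is a workaround that lives outside Open WebUI: it hashes the `.txt` files in a local folder and compares them with a manifest saved after the previous upload. The folder layout, manifest file, and function names are all assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_documents(folder, manifest_path):
    """Return (names of new or modified .txt files, current {name: digest} mapping)."""
    manifest_file = Path(manifest_path)
    previous = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current = {p.name: file_digest(p) for p in sorted(Path(folder).glob("*.txt"))}
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    return changed, current


def save_manifest(current, manifest_path):
    """Record the digests after the changed files have been re-uploaded."""
    Path(manifest_path).write_text(json.dumps(current, indent=2))
```

A typical loop is: call `changed_documents`, delete and re-upload the listed files in Workspace → Knowledge, then call `save_manifest` so the next run starts from the new state.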
User
↓
Open WebUI (http://localhost:8080)
├── ChromaDB (built-in vector store)
└── Ollama (http://ollama:11434)
├── llama3.1:8b (chat model)
└── nomic-embed-text (embedding model)
All services communicate over a private Docker bridge network (helpdesk-network).
| Document | Contents |
|---|---|
| account-management.txt | Passwords, MFA, account setup, lockouts |
| hardware-support.txt | Laptops, peripherals, mobile devices, printers |
| software-support.txt | Standard software, requests, licensing, updates |
| network-and-connectivity.txt | Wi-Fi, VPN, network drives |
| email-support.txt | Setup, mobile, shared mailboxes, security |
| remote-work.txt | Policy, home office, remote desktop |
| security-policy.txt | Acceptable use, data classification, phishing |
| it-portal-and-ticketing.txt | Portal, ticket priorities, contact details |
| onboarding-offboarding.txt | New employee setup, account termination |
| backup-and-recovery.txt | OneDrive, file recovery, disaster recovery |