A local RAG (Retrieval-Augmented Generation) system that allows you to upload PDFs and have conversations about their content using local LLMs via Ollama.
The RAG (Retrieval-Augmented Generation) pipeline processes your documents through several stages:
- Document Loading: PyMuPDF for PDF processing
- Text Splitting: Recursive character splitter (optimized chunks with overlap)
- Embeddings: Nomic embeddings via Ollama (`nomic-embed-text`)
- Vector Store: In-memory vector store with similarity search
- Retrieval: Context-aware similarity search with relevance scoring
- Generation: LangGraph workflow with query rewriting using `llama3.2:3b`
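To make the splitting stage concrete, here is a minimal sketch of fixed-size character chunking with overlap in plain Python. It is illustrative only: the pipeline actually uses LangChain's recursive character splitter, and the 500/50 sizes here are placeholders, not the project's real settings.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks, where each chunk
    repeats the last `overlap` characters of the previous one so that
    sentences cut at a boundary still appear whole in one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each chunk is later embedded with `nomic-embed-text` and stored in the in-memory vector store; the overlap is what keeps retrieval from losing context at chunk boundaries.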
- Ollama: Install from ollama.ai and have it running
- Docker & Docker Compose: For containerized backend and frontend
- System Requirements: 8GB+ RAM, 2GB free disk space
For Windows:

```bash
./start.bat
```

For Linux/macOS:

```bash
./start.sh
```

Or manually with Docker Compose:

```bash
# First, ensure Ollama is running: ollama serve
# Then pull required models: ollama pull nomic-embed-text && ollama pull llama3.2:3b
docker compose up
```

The setup process will:
- ✅ Verify system requirements and dependencies
- 🤖 Check/download required AI models (`nomic-embed-text` and `llama3.2:3b`)
- 🐳 Build and start the backend API (FastAPI)
- 🖥️ Build and start the frontend UI (Streamlit)
- 🔗 Connect everything to your native Ollama instance
- Frontend UI: http://localhost:8501
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Ollama: http://localhost:11434
⚠️ Important Note: If Streamlit shows `0.0.0.0:8501` in the console, use `http://127.0.0.1:8501` or `http://localhost:8501` in your browser instead.
- Upload Documents:
  - Open http://localhost:8501
  - Upload PDF files using the sidebar
- Ask Questions:
  - Type your questions in the chat interface
  - Get AI-powered answers based on your documents
- Manage Session:
  - Clear chat history
  - Remove documents
  - Start fresh conversations
```
.
├── backend/                  # FastAPI backend
│   ├── api/                  # API endpoints and dependencies
│   │   └── endpoints/        # Document and chat endpoints
│   ├── rag/                  # RAG pipeline components
│   │   ├── agent.py          # LangGraph workflow agent
│   │   ├── loader.py         # PDF document loader
│   │   ├── embedder.py       # Ollama embeddings
│   │   ├── splitter.py       # Text chunking
│   │   ├── vector_store.py   # In-memory vector storage
│   │   └── retriever.py      # Document retrieval
│   ├── services/             # Business logic services
│   ├── models/               # Pydantic DTOs
│   ├── config.py             # Configuration management
│   ├── main.py               # FastAPI application
│   ├── requirements.txt      # Python dependencies
│   └── Dockerfile            # Backend container
├── ui/                       # Streamlit frontend
│   ├── ui.py                 # Main UI application
│   ├── ui-requirements.txt   # UI dependencies
│   └── Dockerfile            # Frontend container
├── architecture_diagram.png  # System architecture
├── rag_diagram.png           # RAG pipeline diagram
├── docker-compose.yml        # Service orchestration
├── start.sh                  # Linux/macOS startup script
├── start.bat                 # Windows startup script
└── README.md
```
Backend:

```bash
cd backend
pip install -r requirements.txt
python main.py
```

Frontend:

```bash
cd ui
pip install -r ui-requirements.txt
streamlit run ui.py
```

Note: Ensure Ollama is running locally with the required models before starting development.
- `POST /documents` - Upload PDF document
- `GET /documents` - List uploaded documents
- `DELETE /documents` - Clear all documents
- `POST /chat` - Send message and get AI response
- `GET /messages` - Get chat history
- `DELETE /messages` - Clear chat history
- `GET /health` - Health check
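These endpoints can be exercised from any HTTP client. Below is a minimal standard-library sketch that builds a `POST /chat` request; the JSON field name `message` is an assumption, not confirmed by this README — check the live schema at http://localhost:8000/docs before relying on it.

```python
import json
import urllib.request

# Base URL of the backend API (default port from docker-compose; adjust if changed).
BASE_URL = "http://localhost:8000"

def build_chat_request(message: str) -> urllib.request.Request:
    """Build a POST /chat request object without sending it.

    NOTE: the payload field name "message" is a guess; consult the
    FastAPI docs at /docs for the actual request model.
    """
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the stack running, send it like this:
# with urllib.request.urlopen(build_chat_request("Summarize my documents")) as resp:
#     print(json.load(resp))
```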
- Ollama not running:
  - Ensure Ollama is installed: visit ollama.ai
  - Start Ollama: `ollama serve`
  - Verify it's running: `curl http://localhost:11434/api/version`
- Models not available:
  - Download required models: `ollama pull nomic-embed-text && ollama pull llama3.2:3b`
  - Check available models: `ollama list`
  - The startup scripts will help with this automatically
- Connection issues:
  - Verify Ollama is accessible: `curl http://localhost:11434/api/tags`
  - Check Docker containers: `docker ps`
  - Review logs: `docker compose logs`
- Out of memory:
  - Ensure at least 8GB of RAM is available
  - Close other applications
  - Consider using smaller models if issues persist
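The "verify Ollama is running" check above can also be done programmatically. This is a small standard-library sketch that probes the version endpoint and reports reachability instead of raising:

```python
import urllib.request
import urllib.error

def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on /api/version."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, and timeouts.
        return False
```

A startup script or the backend's health check could call this before attempting to pull models.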
```bash
# View all service logs
docker compose logs

# View specific service logs
docker compose logs backend
docker compose logs frontend

# Check Ollama status
ollama list
curl http://localhost:11434/api/version

# Stop all services
docker compose down -v

# Clean Docker system
docker system prune -f

# Restart fresh
./start.sh  # or ./start.bat on Windows
```

- Backend: FastAPI, LangChain, LangGraph, Ollama integration
- Frontend: Streamlit with intuitive file upload and chat interface
- Containerization: Docker & Docker Compose
- LLMs: Native Ollama with local models
- Vector Storage: In-memory vector store for fast retrieval
- Document Processing: PyMuPDF for robust PDF handling
The system is configured to work with:
- Embedding Model: `nomic-embed-text` (via Ollama)
- Chat Model: `llama3.2:3b` (via Ollama)
- Native Ollama: running on `localhost:11434`
- Auto-discovery: the backend automatically connects to the local Ollama instance
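One common way to centralize such settings is environment-variable overrides with sensible defaults. This is only a sketch of how the project's `config.py` might look; the class and the variable names (`OLLAMA_BASE_URL`, `EMBED_MODEL`, `CHAT_MODEL`) are assumptions, not the actual implementation.

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    """Illustrative settings holder; env var names here are hypothetical."""
    ollama_base_url: str = field(
        default_factory=lambda: os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    )
    embed_model: str = field(
        default_factory=lambda: os.getenv("EMBED_MODEL", "nomic-embed-text")
    )
    chat_model: str = field(
        default_factory=lambda: os.getenv("CHAT_MODEL", "llama3.2:3b")
    )
```

With defaults like these, `docker-compose.yml` can point containers at the host's Ollama simply by exporting one variable, while local development needs no configuration at all.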

