A local RAG (Retrieval-Augmented Generation) Q&A system built with Streamlit.
Unlike traditional RAG tools, this project integrates OCR (Optical Character Recognition) capabilities, allowing you to chat not only with text documents but also with scanned PDFs and images.
Powered by DeepSeek V3 (for high-performance reasoning) and local Ollama (for privacy-preserving embedding).
- For better Chinese understanding。
- Prerequisites:
- Please download Ollama。 And run this in terminal:
ollama pull bge-m3
- 📄 Universal Document Support:
- PDF: Handles both standard text PDFs and Scanned/Image-based PDFs (Auto-triggers OCR).
- Markdown/TXT: Supports common text formats.
- 👁️ Built-in OCR Engine:
- Integrated
RapidOCR+PyMuPDFfor local text extraction. No need for third-party OCR APIs.
- Integrated
- 🧠 Hybrid AI Architecture:
- LLM: DeepSeek API (OpenAI SDK Compatible).
- Embedding: Local Ollama (
all-minilm), zero-cost & privacy-first. - Vector DB: ChromaDB for local persistence.
- 💬 Streaming Interaction:
- Real-time typewriter effect responses.
| Component | Technology | Description |
|---|---|---|
| Frontend | Streamlit | Lightweight Python Web Framework |
| LLM | DeepSeek API | High performance, low cost reasoning model |
| Embedding | Ollama | Running all-minilm locally |
| Vector DB | ChromaDB | Local vector storage |
| OCR | RapidOCR | ONNX-based offline OCR engine |
| ETL | PyMuPDF (fitz) | PDF parsing and image extraction |
Ensure you have Python 3.8+ and Ollama installed.
# Clone the repository
git clone https://github.com/YOUR_USERNAME/local-rag-ocr-bot.git
cd local-rag-ocr-botpip install -r requirements.txtNote: The OCR libraries are relatively large, so the download might take a moment.
Pull the embedding model in your terminal:
ollama pull all-minilmMake sure the Ollama service is running in the background.
Copy the example configuration file:
# Windows
copy .env.example .env
# Mac/Linux
cp .env.example .envOpen .env and fill in your DeepSeek API Key:
# Your DeepSeek API Key
DEEPSEEK_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxx
# Keep others as default
DEEPSEEK_BASE_URL=https://api.deepseek.com
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=all-minilm
CHROMA_DB_PATH=./chroma_dbstreamlit run app.pyThe browser will automatically open at http://localhost:8501.
.
├── app.py # Main Streamlit application
├── rag_engine.py # Core logic (OCR, Vectorization, RAG)
├── requirements.txt # Python dependencies
├── .env.example # Env template (Safe to commit)
├── .gitignore # Git ignore rules
├── README.md # English Documentation
└── README_CN.md # Chinese Documentation
- OCR Speed: If you upload a scanned PDF, the system performs page-by-page recognition. Depending on your CPU, this may take longer than processing standard text. Please watch the terminal for progress.
- DeepSeek Quota: Ensure your API Key has sufficient balance.
- Reset Data: To clear the knowledge base, click the "Clear Knowledge Base" button in the sidebar or manually delete the local
chroma_dbfolder.
Special thanks to the following tools that made this project possible:
- Cursor: For the incredible AI-assisted coding experience.
- Google Gemini: For providing architectural advice and debugging help.
- DeepSeek: For the powerful reasoning API.
This project is licensed under the MIT License. Feel free to Fork and Star!