Query-Buddy is a Retrieval-Augmented Generation (RAG) pipeline for domain-specific question answering. It ingests multiple data formats (text, PDF, images, audio), cleans and segments content into chunks, converts them into dense vector embeddings, and indexes them in a vector database. User queries are embedded, matched against the index for the most relevant chunks, and those chunks are used as context so the LLM produces accurate, traceable answers grounded in your data.
| Layer | Technology |
|---|---|
| Ingestion | PyPDF2, BeautifulSoup (web), Whisper (audio), Tesseract/Pillow (images) |
| Chunking | LangChain RecursiveCharacterTextSplitter |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) |
| Vector DB | ChromaDB |
| LLM | Ollama (e.g. Llama 3) |
| UI | Streamlit |
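The pieces in this stack fit together in a straightforward query path: the question is embedded with the same SentenceTransformers model used at index time, the nearest chunks are fetched from ChromaDB, and those chunks are passed to the local LLM via Ollama as context. The sketch below only illustrates that flow; the collection name, prompt wording, and use of the `ollama` Python client are assumptions, not the exact code in `rag_core.py`.

```python
# Minimal sketch of the query-time flow; see rag_core.py for the real implementation.
# Assumes the chromadb, sentence-transformers, and ollama Python packages are installed.
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("query_buddy")  # collection name is an assumption

def answer(question: str, top_k: int = 3) -> str:
    # Embed the question and retrieve the most similar indexed chunks.
    query_embedding = model.encode(question).tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    context = "\n\n".join(hits["documents"][0])

    # Ask the local LLM to answer using only the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]
```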
- Multimodal ingestion: PDF, plain text, web pages, images (OCR/captioning), and audio (ASR/transcription)
- Unified schema: All modalities produce text + metadata in a common format for chunking and embedding
- RAG pipeline: Ingest → Chunk → Embed → Index, orchestrated via `main.py`
- Query interface: Web UI (Streamlit) and Python API to compare plain LLM vs RAG answers
- Modality-aware metadata: Chunks retain source type and file metadata for filtering and citations (see the sketch after this list)
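As a rough illustration of the unified schema and modality-aware metadata described above, each chunk can be thought of as plain text plus a small metadata record. The field names below are assumptions for illustration, not the repository's exact schema:

```python
# Illustrative chunk record; field names are assumptions, not the exact schema.
chunk = {
    "text": "Transcribed sentence from a meeting recording ...",
    "metadata": {
        "source_type": "audio",          # pdf | text | web | image | audio
        "source_file": "data/audio/meeting.mp3",
        "chunk_index": 12,               # position within the source document
    },
}
```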
- Clone the repository (or use your local copy):

  ```bash
  git clone <repository-url>
  cd Query-Buddy
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  # Windows
  venv\Scripts\activate
  # macOS/Linux
  source venv/bin/activate
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull the LLM (for local LLM answers):

  ```bash
  ollama pull llama3
  ```

- Optional – system dependencies for full multimodal support (a quick check snippet follows this list):
  - Tesseract OCR: Required for image text extraction. Install Tesseract and ensure it is on your PATH.
  - FFmpeg: Often needed for Whisper/audio processing. Install it per your OS.
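Before the first ingestion run, it can be worth confirming that these optional tools are actually visible on your PATH. The snippet below is not part of the repository; it only uses the standard library:

```python
# Quick pre-flight check for optional system dependencies (illustrative, not part of the repo).
import shutil

for tool in ("tesseract", "ffmpeg", "ollama"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```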
Run the full pipeline (ingest → chunk → embed → index):

```bash
python main.py
```

Skip ingestion (if data is already in `data/raw`):

```bash
python main.py --skip-ingest
```

Run the Streamlit app (compare RAG vs plain LLM):

```bash
streamlit run app.py
```

Query from Python:

```python
from rag_core import answer_plain_llm, answer_rag

# RAG answer (uses your indexed documents)
answer_rag("Your question here?", top_k=3)

# Plain LLM answer (no retrieval)
answer_plain_llm("Your question here?")
```

Data locations:
- Put source files in `data/raw_sources/`, or in `data/audio/` and `data/image/` for dedicated modalities. The ingestion step discovers and processes them.
```text
Query-Buddy/
├── main.py              # Pipeline orchestrator (ingest → chunk → embed → index)
├── ingest_sources.py    # Central ingestion dispatcher (PDF, text, web, audio, image)
├── load_audio.py        # Audio transcription (e.g. Whisper)
├── load_image.py        # Image OCR / captioning (e.g. Tesseract)
├── chunk_text.py        # Text chunking with modality metadata
├── embed_store.py       # Embedding generation and metadata export
├── vector_store.py      # ChromaDB indexing and retrieval
├── rag_core.py          # RAG retrieval and LLM answering
├── query.py             # Query interface (e.g. filtering, citations)
├── app.py               # Streamlit web UI
├── requirements.txt     # Python dependencies
├── data/
│   ├── raw/             # Processed text output from ingestion
│   ├── raw_sources/     # Source PDFs, text files, etc.
│   ├── audio/           # Audio files
│   ├── image/           # Image files (if used)
│   ├── chunks.json      # Chunks produced by chunk_text.py
│   └── embeddings/      # Embeddings and metadata for vector_store
├── chroma_db/           # ChromaDB persistence (created at runtime)
└── README.md
```
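To make the hand-off between `chunk_text.py`, `embed_store.py`, and `vector_store.py` concrete, here is one way the indexing side can be wired up with the stack listed above. The chunk sizes, collection name, and metadata fields are assumptions for illustration, not the repository's exact code:

```python
# Illustrative ingest-to-index sketch; the real logic is split across
# chunk_text.py, embed_store.py, and vector_store.py.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)  # assumed values
model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("query_buddy")

def index_document(text: str, source_type: str, source_file: str) -> None:
    """Split one ingested document, embed the chunks, and add them to ChromaDB."""
    chunks = splitter.split_text(text)
    if not chunks:
        return
    embeddings = model.encode(chunks).tolist()
    collection.add(
        ids=[f"{source_file}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source_type": source_type, "source_file": source_file} for _ in chunks],
    )
```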
Contributions are welcome. Suggested workflow:
- Fork the repository.
- Create a branch for your feature or fix.
- Make changes and add tests if applicable.
- Open a pull request with a short description of the change.
For larger features (e.g. new modalities or loaders), open an issue first to align with the project structure and common schema.
- Ollama must be installed and running locally for LLM answers. Ensure `ollama pull llama3` (or your chosen model) has been run. A quick connectivity check follows these notes.
- First run: Running `python main.py` without existing data will ingest from configured web URLs and any files in `data/raw_sources/`, `data/audio/`, and `data/image/`. Place your documents there before ingestion if you want to query your own content.
- Multimodal loaders: Audio and image ingestion depend on `load_audio.py` and `load_image.py` (and their system dependencies). If those modules or dependencies are missing, ingestion skips those modalities and continues with the rest.
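If RAG answers fail, the first thing to verify is that the Ollama server is reachable. Assuming Ollama's default local port (11434), a minimal check looks like this:

```python
# Minimal connectivity check for a locally running Ollama server (default port assumed).
from urllib.request import urlopen

try:
    with urlopen("http://localhost:11434/api/tags", timeout=5) as response:
        print("Ollama is reachable, status:", response.status)
except OSError as error:
    print("Ollama does not appear to be running:", error)
```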
This project is licensed under the MIT License. See the LICENSE file for the full text.
- Ollama – Local LLM runtime
- ChromaDB – Vector database
- Sentence Transformers – Embedding models
- LangChain Text Splitters – Document chunking
- Streamlit – Web app framework
