Query-Buddy is a Retrieval-Augmented Generation (RAG) pipeline for domain-specific question answering. It ingests multiple data formats (text, PDF, images, audio), cleans and segments content into chunks, converts them into dense vector embeddings, and indexes them in a vector database. User queries are embedded, matched against the index for the most relevant chunks, and those chunks are used as context so the LLM produces accurate, traceable answers grounded in your data.
| Layer | Technology |
|---|---|
| Ingestion | PyPDF2, BeautifulSoup (web), Whisper (audio), Tesseract/Pillow (images) |
| Chunking | LangChain RecursiveCharacterTextSplitter |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) |
| Vector DB | ChromaDB |
| LLM | Ollama (e.g. Llama 3) |
| UI | Streamlit |
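The pieces in this stack fit together in a straightforward query path: the question is embedded with the same SentenceTransformers model used at index time, the nearest chunks are fetched from ChromaDB, and those chunks are passed to the local LLM via Ollama as context. The sketch below only illustrates that flow; the collection name, prompt wording, and use of the `ollama` Python client are assumptions, not the exact code in `rag_core.py`.

```python
# Minimal sketch of the query-time flow; see rag_core.py for the real implementation.
# Assumes the chromadb, sentence-transformers, and ollama Python packages are installed.
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("query_buddy")  # collection name is an assumption

def answer(question: str, top_k: int = 3) -> str:
    # Embed the question and retrieve the most similar indexed chunks.
    query_embedding = model.encode(question).tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    context = "\n\n".join(hits["documents"][0])

    # Ask the local LLM to answer using only the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]
```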
- Multimodal ingestion: PDF, plain text, web pages, images (OCR/captioning), and audio (ASR/transcription)
- Unified schema: All modalities produce text + metadata in a common format for chunking and embedding
- RAG pipeline: Ingest → Chunk → Embed → Index, orchestrated via `main.py`
- Query interface: Web UI (Streamlit) and Python API to compare plain LLM vs RAG answers
- Modality-aware metadata: Chunks retain source type and file metadata for filtering and citations (see the sketch after this list)
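As a rough illustration of the unified schema and modality-aware metadata described above, each chunk can be thought of as plain text plus a small metadata record. The field names below are assumptions for illustration, not the repository's exact schema:

```python
# Illustrative chunk record; field names are assumptions, not the exact schema.
chunk = {
    "text": "Transcribed sentence from a meeting recording ...",
    "metadata": {
        "source_type": "audio",          # pdf | text | web | image | audio
        "source_file": "data/audio/meeting.mp3",
        "chunk_index": 12,               # position within the source document
    },
}
```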
- Clone the repository (or use your local copy):

  ```bash
  git clone <repository-url>
  cd Query-Buddy
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  # Windows
  venv\Scripts\activate
  # macOS/Linux
  source venv/bin/activate
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull the LLM (for local LLM answers):

  ```bash
  ollama pull llama3
  ```

- Optional – system dependencies for full multimodal support (a quick check snippet follows this list):
  - Tesseract OCR: Required for image text extraction. Install Tesseract and ensure it is on your PATH.
  - FFmpeg: Often needed for Whisper/audio processing. Install it per your OS.
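Before the first ingestion run, it can be worth confirming that these optional tools are actually visible on your PATH. The snippet below is not part of the repository; it only uses the standard library:

```python
# Quick pre-flight check for optional system dependencies (illustrative, not part of the repo).
import shutil

for tool in ("tesseract", "ffmpeg", "ollama"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```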
Run the full pipeline (ingest → chunk → embed → index):

```bash
python main.py
```

Skip ingestion (if data is already in `data/raw`):

```bash
python main.py --skip-ingest
```

Run the Streamlit app (compare RAG vs plain LLM):

```bash
streamlit run app.py
```

Query from Python:

```python
from rag_core import answer_plain_llm, answer_rag

# RAG answer (uses your indexed documents)
answer_rag("Your question here?", top_k=3)

# Plain LLM answer (no retrieval)
answer_plain_llm("Your question here?")
```

Data locations:
- Put source files in `data/raw_sources/`, or in `data/audio/` and `data/image/` for dedicated modalities. The ingestion step discovers and processes them.
```text
Query-Buddy/
├── main.py              # Pipeline orchestrator (ingest → chunk → embed → index)
├── ingest_sources.py    # Central ingestion dispatcher (PDF, text, web, audio, image)
├── load_audio.py        # Audio transcription (e.g. Whisper)
├── load_image.py        # Image OCR / captioning (e.g. Tesseract)
├── chunk_text.py        # Text chunking with modality metadata
├── embed_store.py       # Embedding generation and metadata export
├── vector_store.py      # ChromaDB indexing and retrieval
├── rag_core.py          # RAG retrieval and LLM answering
├── query.py             # Query interface (e.g. filtering, citations)
├── app.py               # Streamlit web UI
├── requirements.txt     # Python dependencies
├── data/
│   ├── raw/             # Processed text output from ingestion
│   ├── raw_sources/     # Source PDFs, text files, etc.
│   ├── audio/           # Audio files
│   ├── image/           # Image files (if used)
│   ├── chunks.json      # Chunks produced by chunk_text.py
│   └── embeddings/      # Embeddings and metadata for vector_store
├── chroma_db/           # ChromaDB persistence (created at runtime)
└── README.md
```
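To make the hand-off between `chunk_text.py`, `embed_store.py`, and `vector_store.py` concrete, here is one way the indexing side can be wired up with the stack listed above. The chunk sizes, collection name, and metadata fields are assumptions for illustration, not the repository's exact code:

```python
# Illustrative ingest-to-index sketch; the real logic is split across
# chunk_text.py, embed_store.py, and vector_store.py.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)  # assumed values
model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("query_buddy")

def index_document(text: str, source_type: str, source_file: str) -> None:
    """Split one ingested document, embed the chunks, and add them to ChromaDB."""
    chunks = splitter.split_text(text)
    if not chunks:
        return
    embeddings = model.encode(chunks).tolist()
    collection.add(
        ids=[f"{source_file}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source_type": source_type, "source_file": source_file} for _ in chunks],
    )
```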
Contributions are welcome. Suggested workflow:
- Fork the repository.
- Create a branch for your feature or fix.
- Make changes and add tests if applicable.
- Open a pull request with a short description of the change.
For larger features (e.g. new modalities or loaders), open an issue first to align with the project structure and common schema.
- Ollama must be installed and running locally for LLM answers. Ensure `ollama pull llama3` (or your chosen model) has been run. A quick connectivity check follows these notes.
- First run: Running `python main.py` without existing data will ingest from configured web URLs and any files in `data/raw_sources/`, `data/audio/`, and `data/image/`. Place your documents there before ingestion if you want to query your own content.
- Multimodal loaders: Audio and image ingestion depend on `load_audio.py` and `load_image.py` (and their system dependencies). If those modules or dependencies are missing, ingestion skips those modalities and continues with the rest.
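If RAG answers fail, the first thing to verify is that the Ollama server is reachable. Assuming Ollama's default local port (11434), a minimal check looks like this:

```python
# Minimal connectivity check for a locally running Ollama server (default port assumed).
from urllib.request import urlopen

try:
    with urlopen("http://localhost:11434/api/tags", timeout=5) as response:
        print("Ollama is reachable, status:", response.status)
except OSError as error:
    print("Ollama does not appear to be running:", error)
```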
This project is licensed under the MIT License. See the LICENSE file for the full text.
- Ollama – Local LLM runtime
- ChromaDB – Vector database
- Sentence Transformers – Embedding models
- LangChain Text Splitters – Document chunking
- Streamlit – Web app framework
