🤖 Multi-Document RAG Search Engine

A hybrid RAG (Retrieval-Augmented Generation) chatbot that answers questions from your documents and live web search - built with LangChain, FAISS, and Streamlit.

👤 Author

Name: Vimal Solanki
Email: vimal162002@email.com

🎥 Demo & Explanation

🖥️ Live Demo: multi-document-rag-ai-chatbot.streamlit.app
📹 Video Explanation: Watch here

📌 Project Overview

Organizations usually store knowledge across multiple unstructured documents like PDFs, reports, and notes.
However, these documents are static and do not contain real-time information.

This creates a gap where users need both internal knowledge and up-to-date data.

To solve this problem, I built a hybrid RAG chatbot that combines document retrieval with live web search.

This system can:

📄 Search across multiple documents (your private files) at the same time using semantic search
🌐 Fetch real-time information from the web via Tavily
🔀 Run in hybrid mode: check the documents first, then fall back to the web if needed
📌 Clearly distinguish between document-based and real-time answers
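
The hybrid mode described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `hybrid_retrieve`, the `min_score` cutoff, and the two toy retrievers are all assumptions standing in for the real FAISS and Tavily lookups.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a retriever returns (text, relevance score in [0, 1]) pairs.
Retriever = Callable[[str], List[Tuple[str, float]]]


def hybrid_retrieve(
    query: str,
    search_docs: Retriever,
    search_web: Retriever,
    min_score: float = 0.5,  # assumed cutoff, not taken from the project
) -> Tuple[str, List[str]]:
    """Return (source_label, passages): documents first, web as fallback."""
    doc_hits = [text for text, score in search_docs(query) if score >= min_score]
    if doc_hits:
        return "documents", doc_hits
    # No document passage was relevant enough, so fall back to the web.
    web_hits = [text for text, _ in search_web(query)]
    return "web", web_hits


# Toy stand-ins for the FAISS and Tavily lookups (illustration only).
def fake_docs(query: str) -> List[Tuple[str, float]]:
    if "attention" in query.lower():
        return [("Attention weighs tokens by relevance.", 0.82)]
    return [("Unrelated paragraph.", 0.12)]


def fake_web(query: str) -> List[Tuple[str, float]]:
    return [(f"Web result for: {query}", 1.0)]


print(hybrid_retrieve("Explain attention mechanism", fake_docs, fake_web))
print(hybrid_retrieve("Latest AI news", fake_docs, fake_web))
```

Returning the source label alongside the passages is one simple way to keep document-based and real-time answers visibly separate in the UI.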

⚙️ Tech Stack

Component	Tool
Language	Python
LLM	LLaMA 3.3 70B via Groq
Embeddings	sentence-transformers/all-MiniLM-L6-v2 (FREE, local)
Vector DB	FAISS
Web Search	Tavily
Orchestration	LangChain
UI	Streamlit

🏗️ Project Structure

multi-document-rag-search-engine/
│
├── config/
│   └── settings.py          # loads all .env variables
│
├── core/
│   ├── ingestion.py         # loads and chunks PDF/TXT files
│   ├── embedding.py         # HuggingFace embedding model
│   ├── vector_store.py      # FAISS index management
│   └── chain.py             # RAG pipeline (retrieve → context → answer)
│
├── tools/
│   └── tavily_search.py     # Tavily web search integration
│
├── ui/
│   ├── chat.py              # main controller connecting UI and backend
│   └── components.py        # all Streamlit UI components
│
├── data/documents/          # sample documents for testing
├── main.py                  # app entry point
├── .env                     # API keys and config (not committed)
└── requirements.txt
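
To give a feel for the chunking step in ingestion.py, here is a small sketch of overlapping character windows. It is only an approximation under stated assumptions: the helper name `chunk_text` is hypothetical, the default sizes mirror the `CHUNK_SIZE` and `CHUNK_OVERLAP` values from the sample .env, and the project itself may well use a LangChain text splitter with different splitting rules.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


pieces = chunk_text("x" * 2000)
print([len(p) for p in pieces])  # [1000, 1000, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval quality.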

🔄 How It Works — Architecture

🧠 Component Roles

main.py  ──────────────────▶  Streamlit UI (entry point)
    │
    ▼
chat.py  ──────────────────▶  Controller (connects UI ↔ backend)
    │
    ├──▶  ingestion.py    ──▶  Load & chunk documents
    │
    ├──▶  embedding.py    ──▶  Convert text to vectors
    │
    ├──▶  vector_store.py ──▶  Store & search in FAISS
    │
    ├──▶  chain.py        ──▶  RAG pipeline (retrieve → LLM → answer)
    │
    └──▶  tavily_search.py──▶  Live web search
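
The retrieve → context → answer flow can be illustrated without any external libraries. The sketch below is a stand-in, not the real pipeline: a word-count "embedding" replaces the sentence-transformers model, a sorted list replaces the FAISS index, and the names `embed`, `top_k`, and `build_prompt` are hypothetical. The default `k=6` mirrors `TOP_K_RESULTS` from the sample .env.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Hit = Tuple[float, Dict[str, str]]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[Dict[str, str]], k: int = 6) -> List[Hit]:
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    scored = [(cosine(q, embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, hits: List[Hit]) -> str:
    """Number each retrieved chunk so the answer can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, (_, c) in enumerate(hits, start=1)
    )
    return f"Answer using only the sources below and cite them.\n{context}\nQuestion: {query}"


chunks = [
    {"source": "rag.pdf", "text": "retrieval augmented generation combines search with an llm"},
    {"source": "notes.txt", "text": "streamlit builds simple web apps in python"},
]
hits = top_k("what is retrieval augmented generation", chunks, k=1)
print(build_prompt("what is retrieval augmented generation", hits))
```

In the real app the resulting prompt would be sent to the LLM; numbering the sources in the context is one common way to get citation-aware answers.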

🚀 Getting Started

1. Clone the repository

git clone https://github.com/yourusername/multi-document-rag-search-engine.git
cd multi-document-rag-search-engine

2. Install dependencies

pip install -r requirements.txt

3. Set up environment variables

cp .env.example .env
# Then add your API keys and settings to .env, for example:

EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
CHUNK_SIZE=1000
CHUNK_OVERLAP=150
TOP_K_RESULTS=6

GPT_MODEL_NAME=llama-3.3-70b-versatile
GROQ_API_KEY=your_groq_api_key
TEMPERATURE=0

TAVILY_API_KEY=your_tavily_api_key
TOP_K_WEB_RESULTS=3

4. Run the app

streamlit run main.py
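
For reference, the variables from step 3 could be read into a typed settings object along these lines. This is a hedged sketch of what config/settings.py might look like, not its actual contents: the `Settings` class and `load_settings` function are hypothetical, and the sketch reads from a plain mapping rather than parsing the .env file (the project presumably loads .env into the environment first, for example with python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    top_k_results: int
    model_name: str
    temperature: float
    top_k_web_results: int
    groq_api_key: str
    tavily_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment-style key/value pairs."""
    return Settings(
        embedding_model=env.get("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "150")),
        top_k_results=int(env.get("TOP_K_RESULTS", "6")),
        model_name=env.get("GPT_MODEL_NAME", "llama-3.3-70b-versatile"),
        temperature=float(env.get("TEMPERATURE", "0")),
        top_k_web_results=int(env.get("TOP_K_WEB_RESULTS", "3")),
        # API keys have no sensible default, so a missing key raises KeyError.
        groq_api_key=env["GROQ_API_KEY"],
        tavily_api_key=env["TAVILY_API_KEY"],
    )


# Example with an explicit mapping instead of the real environment.
settings = load_settings({
    "GROQ_API_KEY": "your_groq_api_key",
    "TAVILY_API_KEY": "your_tavily_api_key",
    "CHUNK_SIZE": "500",
})
print(settings.chunk_size, settings.top_k_results)  # 500 6
```

Failing fast on missing API keys surfaces configuration mistakes at startup rather than on the first query.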

💡 How to Use

1. Upload one or more PDF or TXT files
2. Click Process Documents
3. Select a retrieval mode: Documents, Web, or Hybrid
4. Ask your question and get answers with citations

📊 Example Queries

Query	Best Mode
"Explain attention mechanism"	📄 Documents
"Latest news about GPT-5"	🌐 Web
"How does RAG compare to current LLM tools?"	🔀 Hybrid

🔑 API Keys Required

Service	Free Tier	Link
Groq	✅ Yes	console.groq.com
Tavily	✅ Yes	tavily.com
HuggingFace Embeddings	✅ 100% Free	Runs locally

🎯 Key Learnings

Built a hybrid RAG system with document + web retrieval
Used FAISS for fast semantic vector search
Integrated Tavily for real-time web results
Implemented citation-aware answer generation
Deployed a full AI app with Streamlit

📄 License

This project is for educational and portfolio purposes.
