Multi-Modal Document Intelligence (RAG-Based QA System)

A dual-path Multi-Modal Retrieval-Augmented Generation system that combines ColPali visual retrieval with pdfplumber structured extraction, using Gemini for multimodal answer generation.

Architecture

                         ┌───────────────┐
                         │  PDF Document  │
                         └───────┬───────┘
                                 │
                 ┌───────────────┼───────────────┐
                 │                               │
         ┌───────▼───────┐               ┌───────▼───────┐
         │   Path A       │               │   Path B       │
         │  (Traditional) │               │   (ColPali)    │
         │                │               │                │
         │  pdfplumber    │               │  pdf2image     │
         │  ├─ Text       │               │  → page images │
         │  └─ Tables     │               │                │
         │       │        │               │  ColPali       │
         │       ▼        │               │  (PaliGemma)   │
         │  Semantic      │               │  multi-vector  │
         │  Chunking      │               │  embeddings    │
         │  (512 chars,   │               │       │        │
         │   64 overlap)  │               │       ▼        │
         │       │        │               │  Qdrant        │
         │       ▼        │               │  (MaxSim)      │
         │  Per-page text │               │  native        │
         │  store         │               │  multi-vector  │
         └───────┬───────┘               └───────┬───────┘
                 │                               │
                 │  ┌────────────────────────────┘
                 │  │  Query: ColPali retrieves top pages
                 │  │
                 ▼  ▼
         ┌───────────────────────────────────────┐
         │  Gemini (multimodal LLM)               │
         │                                        │
         │  Receives BOTH:                        │
         │  • Page images (visual: charts, layout)│
         │  • Extracted text/tables (structured)  │
         │                                        │
         │  → Answer with [Page X] citations      │
         └────────────────────────────────────────┘

Why dual-path?

| Path | Strength | Weakness |
| --- | --- | --- |
| Path A (pdfplumber) | Precise text/table extraction, structured data | Loses layout, misses charts/figures |
| Path B (ColPali) | Understands full visual layout, charts, figures natively | No structured text for the LLM |
| Combined | Best of both: ColPali retrieves the right pages, pdfplumber provides clean text | None |
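
To make the combination concrete, here is a minimal sketch of the query-time join: pages returned by the visual path are looked up in the per-page text store, and the result is labelled so the model can cite `[Page X]`. The names `PageHit`, `build_context`, and `text_store` are hypothetical and chosen for illustration; they are not taken from the repository, and the sample scores and text are toy values.

```python
from dataclasses import dataclass


@dataclass
class PageHit:
    """One retrieved page (hypothetical structure, not the repository's type)."""
    page: int      # 1-based page number
    score: float   # relevance score from the visual retrieval path


def build_context(hits: list[PageHit], text_store: dict[int, str]) -> str:
    """Join the extracted text of the retrieved pages, best-scoring page first."""
    blocks = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # A page with no extractable text (for example a full-page chart) still
        # gets a stub, so the page image is clearly the only evidence for it.
        text = text_store.get(hit.page, "").strip() or "(no extractable text)"
        blocks.append(f"[Page {hit.page}]\n{text}")
    return "\n\n".join(blocks)


# Toy data: page 7 scored higher but has no extracted text.
hits = [PageHit(page=3, score=11.2), PageHit(page=7, score=14.9)]
text_store = {3: "Toy text for page 3.", 7: ""}
context = build_context(hits, text_store)
```

The resulting string would be sent to the multimodal model together with the images of the same pages.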

Modules

| Module | Purpose |
| --- | --- |
| `src/ingestion/pdf_parser.py` | Dual-path parser: pdfplumber text/tables + pdf2image pages |
| `src/ingestion/chunker.py` | Semantic chunking with sentence boundaries + table preservation |
| `src/embeddings/colpali_embedder.py` | ColPali/ColSmol page + query embedding |
| `src/retrieval/colpali_retriever.py` | Qdrant index with native MaxSim multi-vector search |
| `src/generation/generator.py` | Gemini multimodal generation (images + structured text) |
| `src/evaluation/evaluator.py` | Benchmark suite with keyword recall + ROUGE scoring |
| `src/pipeline.py` | Dual-path pipeline orchestrator |
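
As an illustration of the chunking step, the sketch below packs whole sentences into chunks of roughly 512 characters and carries a 64-character tail into the next chunk. It is an assumption about how such a chunker could work, not the code in `src/ingestion/chunker.py`; the function name `chunk_text` and the regex sentence splitter are hypothetical. Tables are not handled here; in a scheme like the one described they would bypass this function and be stored as single chunks.

```python
import re


def chunk_text(text: str, max_chars: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole sentences into chunks with a target size of max_chars.

    Each new chunk starts with the last `overlap` characters of the previous
    chunk (the tail may begin mid-word). Sentences are never split, so a chunk
    can exceed max_chars when a sentence is long or the tail pushes it over.
    """
    # Split after ., ! or ? followed by whitespace (a simple heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of the chunk just closed.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, `chunk_text("aaaa. bbbb. cccc.", max_chars=11, overlap=3)` closes the first chunk after the second sentence and starts the next one with the three-character tail `bb.`.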

Setup

Prerequisites

Python 3.10+
Poppler: brew install poppler (macOS) or apt install poppler-utils (Ubuntu)
Docker (for Qdrant)
GPU recommended (falls back to ColSmol-256M on CPU)

Installation

pip install -r requirements.txt

Start Qdrant

docker run -d -p 6333:6333 qdrant/qdrant

Configuration

cp .env.example .env
# Edit .env and add your Google API key (for Gemini)
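
For readers who want to verify the configuration before starting the app, the following sketch parses `KEY=VALUE` lines and fails early when the key is absent. The variable name `GOOGLE_API_KEY` and both helper functions are assumptions made for illustration; check `.env.example` for the name the project actually expects.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def require_key(env: dict[str, str], name: str = "GOOGLE_API_KEY") -> str:
    """Return the named value, or raise with a readable message if it is missing."""
    key = env.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is missing; add it to .env")
    return key
```

A typical call would be `require_key(parse_env(open(".env").read()))`, run once at startup.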

Usage

Streamlit Web App

streamlit run app.py

CLI

# Ingest a document (both paths run automatically)
python cli.py ingest documents/report.pdf

# Ask a question
python cli.py query "What is the GDP growth rate?"

# Interactive chat
python cli.py chat

# Run evaluation
python cli.py evaluate --output eval_report.json
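
The four subcommands above map naturally onto `argparse` subparsers. The sketch below shows one way such a command-line interface could be declared; it is not the contents of `cli.py`, and the argument names `pdf_path` and `question` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the ingest / query / chat / evaluate subcommands."""
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="index a PDF through both paths")
    ingest.add_argument("pdf_path")

    query = sub.add_parser("query", help="ask a single question")
    query.add_argument("question")

    sub.add_parser("chat", help="interactive question loop")

    evaluate = sub.add_parser("evaluate", help="run the benchmark suite")
    evaluate.add_argument("--output", default=None, help="path for the JSON report")
    return parser


args = build_parser().parse_args(["evaluate", "--output", "eval_report.json"])
```

Here `args.command` is `"evaluate"` and `args.output` is `"eval_report.json"`; a dispatcher would branch on `args.command`.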

Design Choices

| Choice | Rationale |
| --- | --- |
| ColPali for retrieval | Embeds pages visually, so text, tables, and charts are handled without extraction |
| MaxSim late interaction | ColBERT-style token-level matching via Qdrant native multi-vector support |
| pdfplumber for text | Provides structured text/tables to Gemini alongside page images |
| Semantic chunking | Sentence-boundary splitting with overlap; tables preserved as single chunks |
| Gemini for generation | Multimodal LLM that takes both images and text in a single prompt |
| ColSmol fallback | Lightweight CPU-compatible model when no GPU is available |
| Keyword + ROUGE eval | Keyword recall for heuristic scoring, ROUGE when reference answers exist |
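
MaxSim late interaction scores a page by taking, for each query token vector, its best dot product against any of the page's patch vectors, and summing those maxima. In this project Qdrant computes that score natively; the pure-Python sketch below only illustrates the arithmetic on toy two-dimensional vectors (real embeddings have far more dimensions). The function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Sum over query vectors of the maximum dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy example: two query token vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.5, 0.5]],
    "page_b": [[0.0, 1.0], [0.0, 1.0]],
}
scores = {name: maxsim(query, vecs) for name, vecs in pages.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these toy vectors `page_a` scores 1.5 (1.0 + 0.5) and `page_b` scores 1.0 (0.0 + 1.0), so `page_a` ranks first: it matches both query tokens at least partially, while `page_b` matches only the second.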

Evaluation

The evaluation suite benchmarks the system across four question categories:

Text comprehension (easy/medium)
Table data extraction (easy/medium)
Image/chart interpretation (medium)
Cross-modal reasoning (hard)

Metrics:

Overall score (heuristic: answer presence + citations + keyword recall)
ROUGE-1/2/L (when reference answers provided)
Keyword recall rate
Citation rate
Extracted text usage rate
Per-query latency
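
Three of the metrics above are simple enough to sketch directly. The code below shows keyword recall, a unigram ROUGE-1 F1, and a `[Page X]` citation check. It is a simplified illustration under stated assumptions, not the logic in `src/evaluation/evaluator.py`: the function names are hypothetical, the tokenizer is a plain lowercase alphanumeric split, and the weighting used for the overall score is not reproduced because the source does not specify it.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_recall(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer, case-insensitively."""
    if not keywords:
        return 0.0
    answer_lower = answer.lower()
    found = sum(1 for k in keywords if k.lower() in answer_lower)
    return found / len(keywords)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate answer and a reference answer."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def has_citation(answer: str) -> bool:
    """True if the answer contains at least one [Page N] citation."""
    return re.search(r"\[Page \d+\]", answer) is not None
```

For instance, `rouge1_f1("the cat sat", "the cat ran")` shares two of three unigrams on each side, giving precision and recall of 2/3 and therefore an F1 of 2/3.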
