A dual-path Multi-Modal Retrieval-Augmented Generation system that combines ColPali visual retrieval with pdfplumber structured extraction, using Gemini for multimodal answer generation.

                    ┌───────────────┐
                    │ PDF Document  │
                    └───────┬───────┘
                            │
            ┌───────────────┼───────────────┐
            │                               │
    ┌───────▼───────┐               ┌───────▼───────┐
    │    Path A     │               │    Path B     │
    │ (Traditional) │               │   (ColPali)   │
    │               │               │               │
    │  pdfplumber   │               │   pdf2image   │
    │   ├─ Text     │               │ → page images │
    │   └─ Tables   │               │               │
    │       │       │               │    ColPali    │
    │       ▼       │               │  (PaliGemma)  │
    │   Semantic    │               │ multi-vector  │
    │   Chunking    │               │  embeddings   │
    │  (512 chars,  │               │       │       │
    │  64 overlap)  │               │       ▼       │
    │       │       │               │    Qdrant     │
    │       ▼       │               │   (MaxSim)    │
    │ Per-page text │               │    native     │
    │     store     │               │ multi-vector  │
    └───────┬───────┘               └───────┬───────┘
            │                               │
            │  ┌────────────────────────────┘
            │  │  Query: ColPali retrieves top pages
            │  │
            ▼  ▼
    ┌───────────────────────────────────────────────┐
    │            Gemini (multimodal LLM)            │
    │                                               │
    │  Receives BOTH:                               │
    │  • Page images (visual: charts, layout)       │
    │  • Extracted text/tables (structured)         │
    │                                               │
    │  → Answer with [Page X] citations             │
    └───────────────────────────────────────────────┘

| Path | Strength | Weakness |
|---|---|---|
| Path A (pdfplumber) | Precise text/table extraction, structured data | Loses layout, misses charts/figures |
| Path B (ColPali) | Understands full visual layout, charts, figures natively | No structured text for the LLM |
| Combined | Best of both — ColPali retrieves the right pages, pdfplumber provides clean text | — |
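
The combined path can be pictured as a lookup: ColPali returns the page numbers, and the per-page text store supplies the matching pdfplumber output. The sketch below is illustrative only; `build_context` and `page_text_store` are hypothetical names, not the project's actual API, and the placeholder for text-free pages is an assumption.

```python
from typing import Dict, List


def build_context(top_pages: List[int], page_text_store: Dict[int, str]) -> str:
    """Join the extracted text of the retrieved pages, labelled for citation."""
    sections = []
    for page in top_pages:
        # A page with no extractable text (e.g. a full-page chart) still gets
        # a labelled slot, so the LLM can cite the page image instead.
        text = page_text_store.get(page, "").strip() or "(no extracted text)"
        sections.append(f"[Page {page}]\n{text}")
    return "\n\n".join(sections)


# Hypothetical data: page 2 has a table, page 5 has no extractable text.
store = {1: "Revenue grew 4% in 2023.", 2: "Quarter | Revenue\nQ1 | 10"}
context = build_context([2, 5], store)
print(context)
```

The resulting string would be sent to Gemini alongside the corresponding page images, so each `[Page X]` label in the text lines up with an image the model can inspect.
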

| Module | Purpose |
|---|---|
| `src/ingestion/pdf_parser.py` | Dual-path parser: pdfplumber text/tables + pdf2image pages |
| `src/ingestion/chunker.py` | Semantic chunking with sentence boundaries + table preservation |
| `src/embeddings/colpali_embedder.py` | ColPali/ColSmol page + query embedding |
| `src/retrieval/colpali_retriever.py` | Qdrant index with native MaxSim multi-vector search |
| `src/generation/generator.py` | Gemini multimodal generation (images + structured text) |
| `src/evaluation/evaluator.py` | Benchmark suite with keyword recall + ROUGE scoring |
| `src/pipeline.py` | Dual-path pipeline orchestrator |
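
As a rough illustration of what sentence-boundary chunking with table preservation might look like, here is a minimal sketch. It is not the code in `chunker.py`: the function names `chunk_text` and `chunk_page` are hypothetical, the 512/64 sizes come from the architecture diagram, and taking the overlap as a raw character tail is an assumption.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 512, overlap: int = 64) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly max_chars."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the character tail of the previous one.
            # Simplification: the tail may start mid-word, and a sentence
            # longer than max_chars still yields an oversized chunk.
            current = f"{current[-overlap:]} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_page(text: str, tables: List[str]) -> List[str]:
    """Chunk the prose, then append each table whole as its own chunk."""
    return chunk_text(text) + list(tables)


demo = chunk_text("First point. Second point! Third point?", max_chars=30, overlap=8)
print(demo)
```

Keeping each table as a single chunk avoids splitting rows from their header, at the cost of occasionally exceeding the target chunk size.
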

- Python 3.10+
- Poppler: `brew install poppler` (macOS) or `apt install poppler-utils` (Ubuntu)
- Docker (for Qdrant)
- GPU recommended (falls back to ColSmol-256M on CPU)

Install the Python dependencies, start Qdrant, and create the environment file:

    pip install -r requirements.txt
    docker run -d -p 6333:6333 qdrant/qdrant
    cp .env.example .env
    # Edit .env and add your Google API key (for Gemini)

Launch the Streamlit app:

    streamlit run app.py

Or use the CLI:

    # Ingest a document (both paths run automatically)
    python cli.py ingest documents/report.pdf
    # Ask a question
    python cli.py query "What is the GDP growth rate?"
    # Interactive chat
    python cli.py chat
    # Run evaluation
    python cli.py evaluate --output eval_report.json

| Choice | Rationale |
|---|---|
| ColPali for retrieval | Embeds pages visually — handles text, tables, charts without extraction |
| MaxSim late interaction | ColBERT-style token-level matching via Qdrant native multi-vector support |
| pdfplumber for text | Provides structured text/tables to Gemini alongside page images |
| Semantic chunking | Sentence-boundary splitting with overlap; tables preserved as single chunks |
| Gemini for generation | Multimodal LLM that takes both images and text in a single prompt |
| ColSmol fallback | Lightweight CPU-compatible model when no GPU is available |
| Keyword + ROUGE eval | Keyword recall for heuristic scoring, ROUGE when reference answers exist |
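
To make the MaxSim row concrete: for each query token vector, take its best dot product against any page patch vector, then sum those maxima. Qdrant computes this natively, so the pure-Python sketch below only illustrates the scoring rule; the function names and the toy 2-dimensional vectors are made up for the example.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best match against any page patch."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data: two query token vectors, two pages with two patch vectors each.
query: List[Vector] = [[1.0, 0.0], [0.0, 1.0]]
pages: Dict[str, List[Vector]] = {
    "page_1": [[0.9, 0.1], [0.2, 0.8]],
    "page_2": [[0.5, 0.5], [0.4, 0.4]],
}
ranked = sorted(pages, key=lambda name: maxsim(query, pages[name]), reverse=True)
print(ranked)  # ['page_1', 'page_2']
```

Here `page_1` scores 0.9 + 0.8 = 1.7 and `page_2` scores 0.5 + 0.5 = 1.0, because each query token is matched independently to its most similar patch.
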

The evaluation suite benchmarks four question categories, each with a difficulty level:
- Text comprehension (easy/medium)
- Table data extraction (easy/medium)
- Image/chart interpretation (medium)
- Cross-modal reasoning (hard)

Metrics:
- Overall score (heuristic: answer presence + citations + keyword recall)
- ROUGE-1/2/L (when reference answers provided)
- Keyword recall rate
- Citation rate
- Extracted text usage rate
- Per-query latency
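
A few of these metrics are simple enough to sketch by hand. The functions below are simplified stand-ins, not the code in `evaluator.py`: the names are hypothetical, the citation pattern assumes the `[Page X]` format shown in the architecture diagram, and `rouge1_f1` is a hand-rolled unigram-overlap approximation rather than a full ROUGE implementation. How the overall heuristic score weights its components is not specified here, so it is left out.

```python
import re
from collections import Counter
from typing import List


def keyword_recall(answer: str, keywords: List[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = answer.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def has_citation(answer: str) -> bool:
    """True if the answer contains at least one [Page N] marker."""
    return re.search(r"\[Page \d+\]", answer) is not None


def rouge1_f1(answer: str, reference: str) -> float:
    """Unigram-overlap F1 between the answer and a reference answer."""
    a = Counter(re.findall(r"\w+", answer.lower()))
    r = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((a & r).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


answer = "GDP grew 2.5% in 2023 [Page 4]."
print(keyword_recall(answer, ["GDP", "2.5%", "2024"]))
print(has_citation(answer))
print(rouge1_f1(answer, "GDP grew 2.5% in 2023"))
```

Keyword recall needs only a list of expected terms per question, which is why it can serve as the heuristic when no reference answer is available; ROUGE requires a reference to compare against.
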