DSAI 413 — Assignment 2. A dual-mode system over the MIMIC-CXR dataset:
- Report Generation — chest X-ray image → radiology report
- QA Mode — natural-language question → RAG-grounded answer over reports
Mandatory models: google/medgemma-1.5-4b-it and ColPali (vidore/colpali-v1.3), each compared against a lighter baseline (OpenCLIP ViT-B/32 for report retrieval; sentence-transformers MiniLM for text RAG).
The codebase is developed locally and executed on Google Colab (T4, 16 GB VRAM) — a 4 GB laptop GPU cannot host MedGemma-1.5-4B + ColPali. Heavy scripts (indexing, evaluation) are runnable as python -m src.eval and from notebooks/run_on_colab.ipynb.
Design priorities (in order):
- Simplicity over sophistication — zero-shot, no fine-tuning.
- Small subsets — 400 image/report pairs (300 indexed, 100 test).
- Lazy loading — models load on first call.
- One config file: config.yaml owns every knob.
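The lazy-loading priority above can be illustrated with a small caching pattern. This is only a sketch of the general idea, not the repository's actual loader: `get_model`, `_expensive_load`, and `LOAD_LOG` are hypothetical names, and the dictionary stands in for real model weights.

```python
from functools import lru_cache

LOAD_LOG = []  # records which (hypothetical) models were actually loaded


def _expensive_load(name: str) -> dict:
    """Stand-in for a real weight load; the real code would load a model here."""
    LOAD_LOG.append(name)
    return {"name": name, "ready": True}


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Load the model on first call; later calls reuse the cached object."""
    return _expensive_load(name)


# Importing this module loads nothing; the first get_model() call does.
first = get_model("medgemma")
second = get_model("medgemma")  # served from the cache, no second load
```

With this pattern, a mode that never touches a given model never pays its load time or VRAM cost, which matters on a 16 GB T4.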
# Local (dev only — no inference)
git clone <this-repo>.git chest-xray-system
cd chest-xray-system
pip install -r requirements.txt
# Required: Hugging Face token (for MedGemma + ColPali weights)
cp .env.example .env
# then edit .env and set HF_TOKEN=hf_...
Note. Local install is for editing only. All inference runs on Colab; see the Colab Quickstart below.
Open notebooks/run_on_colab.ipynb in Google Colab on a T4 runtime. The notebook will:
- git clone this repo
- pip install -r requirements.txt
- Prompt for HF_TOKEN via getpass
- Run python data/download.py to pull the MIMIC-CXR subset (400 pairs)
- Launch python app/app.py with Gradio share=True → a public URL
On a fresh T4, going from an empty runtime to a working share link should take under 15 minutes.
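The token prompt in the notebook steps above could look roughly like the following. This is a minimal sketch under assumptions: `ensure_hf_token` is a hypothetical helper, and the notebook's actual cell may differ. The `prompt` and `env` parameters exist only so the function can be exercised without an interactive terminal.

```python
import os
from getpass import getpass


def ensure_hf_token(prompt=getpass, env=os.environ) -> str:
    """Return HF_TOKEN from env, prompting without echo if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        token = prompt("HF_TOKEN: ").strip()
        if not token:
            raise ValueError("HF_TOKEN is required to download the model weights")
        env["HF_TOKEN"] = token  # make it visible to later cells and libraries
    return token


# In a notebook cell, call it once before any model is loaded:
# ensure_hf_token()
```

Using `getpass` keeps the token out of the cell output, so it is not saved with the notebook.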
- Source. simhadrisadaram/mimic-cxr-dataset via kagglehub.
- Subset. 400 (image, report) pairs sampled with seed=42, split 300 index / 100 test.
- Manifest. data/sample/manifest.csv with columns id, image_path, report.
- QA dataset. data/sample/qa_dataset.json: 3 clinical QA pairs per report, generated by MedGemma (yes/no, open-ended, location/laterality).
All data lives under data/sample/ (gitignored) and is regenerated on Colab.
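The seeded 300/100 split described above can be sketched as follows. This is an illustration only: `split_pairs` and the synthetic id pool are hypothetical, and the sampling logic in data/download.py may differ in detail.

```python
import random


def split_pairs(ids, n_index=300, n_test=100, seed=42):
    """Sample n_index + n_test ids reproducibly and split them."""
    ids = sorted(ids)  # fix the order so the seed alone determines the sample
    if len(ids) < n_index + n_test:
        raise ValueError("not enough pairs to sample from")
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    chosen = rng.sample(ids, n_index + n_test)
    return chosen[:n_index], chosen[n_index:]


# Hypothetical pool of 1,000 study ids standing in for the Kaggle dataset.
pool = [f"study_{i:04d}" for i in range(1000)]
index_ids, test_ids = split_pairs(pool)
```

Sampling once and slicing guarantees that the index and test sets are disjoint, so no test report can be retrieved from the index.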
# Smoke test: 10 test images, MedGemma vs CLIP retrieval baseline
python -m src.report_mode
# Build the ColPali and text indexes (one-time, persisted to data/sample/)
python -m src.qa_mode --build-index
# Smoke test
python -m src.qa_mode
python -m src.eval
# Writes results/comparison.json + results/comparison.md
python app/app.py
# Opens local UI; on Colab, share=True from config.yaml prints a public URL.
| Model | ROUGE-L F1 | BERTScore F1 | Latency mean (s) |
|---|---|---|---|
| MedGemma 1.5-4B | 0.347 | 0.890 | 10.90 |
| OpenCLIP retrieval | 0.283 | 0.887 | 0.33 |
| Retriever | Recall@3 | Judge accuracy | correct | partial | wrong | unparseable+err | latency mean (s) |
|---|---|---|---|---|---|---|---|
| ColPali v1.3 | 0.000 | 0.400 | 6 | 3 | 5 | 1 | 70.17 |
| MiniLM-L6 text | 0.133 | 0.467 | 7 | 1 | 7 | 0 | 11.01 |
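
For readers unfamiliar with the Recall@3 column, the sketch below shows one common way to compute it. It assumes each question has exactly one relevant report (the one it was generated from); the function names and toy ids are hypothetical and src/eval.py may define the metric differently.

```python
def recall_at_k(ranked_ids, relevant_id, k=3):
    """1.0 if the relevant id appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(results, k=3):
    """Average recall@k over (ranked_ids, relevant_id) pairs."""
    if not results:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)


# Toy example with made-up ids: 2 of 3 questions hit within the top 3.
toy = [
    (["r7", "r2", "r9", "r1"], "r2"),  # hit at rank 2
    (["r4", "r5", "r6", "r3"], "r3"),  # miss: relevant report is at rank 4
    (["r1", "r8", "r0", "r2"], "r1"),  # hit at rank 1
]
score = mean_recall_at_k(toy)
```

Under this definition, a Recall@3 of 0.133 over the 15 judged questions in the table would correspond to 2 hits.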
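Headlines: For report generation, MedGemma beats CLIP retrieval by about 23% relative on ROUGE-L F1 (0.347 vs 0.283), though BERTScore F1 is nearly tied (0.890 vs 0.887) and MedGemma's mean latency is far higher (10.90 s vs 0.33 s). In QA mode, MiniLM-L6 beats ColPali v1.3 on Recall@3 (0.133 vs 0.000), narrowly on strict-correct judge accuracy (0.467 vs 0.400), and clearly on latency (6.4× faster). See report/REPORT.md for methodology, qualitative observations, and limitations.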
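The headline figures follow directly from the two tables. The short check below recomputes them from the reported values; it assumes judge accuracy is the strict-correct count divided by all judged answers, including partial, wrong, and unparseable ones.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, as a fraction."""
    return (new - base) / base


# ROUGE-L F1: MedGemma vs OpenCLIP retrieval
rouge_gain = relative_gain(0.347, 0.283)

# QA latency: ColPali mean seconds divided by MiniLM mean seconds
speedup = 70.17 / 11.01

# Strict-correct judge accuracy: correct / (correct + partial + wrong + unparseable)
colpali_acc = 6 / (6 + 3 + 5 + 1)
minilm_acc = 7 / (7 + 1 + 7 + 0)

print(f"ROUGE-L relative gain: {rouge_gain:.1%}")
print(f"MiniLM speedup over ColPali: {speedup:.1f}x")
print(f"Judge accuracy: ColPali {colpali_acc:.3f}, MiniLM {minilm_acc:.3f}")
```

With only 15 judged questions per retriever, the accuracy gap amounts to a single answer, so it should be read as indicative rather than conclusive.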
Limitations will be filled in during Phase 7. Highlights expected to include:
- Zero-shot only; no domain fine-tuning.
- 400-pair subset is not representative of full MIMIC-CXR distribution.
- LLM-as-judge introduces evaluation bias (same model family as the system under test).
- Rendered-report ColPali pages are synthetic, not native PDFs.
chest-xray-system/
├── README.md
├── requirements.txt
├── config.yaml
├── .env.example
├── data/
│ ├── download.py
│ ├── build_qa_dataset.py
│ └── sample/ # populated at runtime (gitignored)
├── src/
│ ├── models/
│ │ ├── medgemma.py
│ │ ├── colpali_retriever.py
│ │ ├── clip_retriever.py
│ │ └── text_retriever.py
│ ├── report_mode.py
│ ├── qa_mode.py
│ └── eval.py
├── app/
│ └── app.py
├── notebooks/
│ └── run_on_colab.ipynb
├── results/
└── report/
    └── REPORT.md