Knowledge Elicitation with Large Language Models for Interpretable Cancer Stage Identification from Pathology Reports
This README covers how to generate the dataset, build the RAG context from the AJCC 7th Edition staging manual, and run the four methods (ZSCOT, RAG, KEwRAG, KEwLTM).
```
.
├── per_cancer_type/             # generated by the notebook (23 CSVs + 2 summary files)
├── rag/
│   ├── context/                 # created by rag/create_context.py (23 JSONs)
│   ├── log/                     # logs from context creation
│   ├── ajcc_7thed_cancer_staging_manual.pdf
│   └── create_context.py
├── create_dataset.ipynb         # builds per-cancer CSVs
├── evaluate_results.ipynb       # not required to run
├── kew_methods.py               # implementations of ZSCOT, RAG, KEwRAG, KEwLTM
├── run_experiments_hardcoded.py # batch driver
├── utils.py
└── README.md
```
Start an OpenAI-compatible vLLM server for the model:

```bash
CUDA_VISIBLE_DEVICES="2,3,4,5" \
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --download-dir $MODEL_PATH \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --guided-decoding-backend xgrammar
```

Then open and run `create_dataset.ipynb`. It writes (a quick sanity check is sketched after this list):

- `per_cancer_type/<TCGA>_T14N03.csv` for each of the 23 cancer types,
- `per_cancer_type/ALL_T14N03_by_cancer_type.csv`,
- `per_cancer_type/SUMMARY_counts_per_type.csv`.
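A minimal sanity check with pandas, using BRCA as an example TCGA code (any of the 23 works):

```python
import pandas as pd

# Load the per-type summary produced by create_dataset.ipynb.
summary = pd.read_csv("per_cancer_type/SUMMARY_counts_per_type.csv")
print(summary.head())

# Spot-check one cancer type's label columns (T14 / N03, as the driver expects).
brca = pd.read_csv("per_cancer_type/BRCA_T14N03.csv")
print(brca["T14"].value_counts(dropna=False))
print(brca["N03"].value_counts(dropna=False))
```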
Run `rag/create_context.py` once to create the JSON contexts per cancer type:

```bash
# from repo root
python rag/create_context.py
```

What it does:
- Loads `rag/ajcc_7thed_cancer_staging_manual.pdf` with PyMuPDF and merges the page text.
- Encodes chunks with `nvidia/NV-Embed-v2` (device map = auto) and stores them in a persistent Chroma DB under `/secure/shared_data/rag_embedding_model/chroma_db`.
- For each of the 23 TCGA codes, forms two instruction-tuned queries ("rules that help predict T…" and "…N…") and retrieves the top-k passages.
- Writes `rag/context/context_<TCGA>.json` with two keys: `rag_raw_t14` (joined passages for T) and `rag_raw_n03` (joined passages for N). A sketch of the whole pipeline follows this list.
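A minimal sketch of that pipeline; the chunk size, collection name, and query wording are illustrative assumptions, not the script's exact values:

```python
import fitz  # PyMuPDF
import chromadb
from sentence_transformers import SentenceTransformer

# Extract and merge page text from the staging manual.
doc = fitz.open("rag/ajcc_7thed_cancer_staging_manual.pdf")
full_text = "\n".join(page.get_text() for page in doc)

# Naive fixed-size chunking; the actual script may split differently.
chunk_size = 1000
chunks = [full_text[i:i + chunk_size] for i in range(0, len(full_text), chunk_size)]

# NV-Embed-v2 via sentence-transformers (the script uses device map = auto).
embedder = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)
embeddings = embedder.encode(chunks)

# Persistent Chroma collection, as described above ("ajcc_7thed" name assumed).
client = chromadb.PersistentClient(path="/secure/shared_data/rag_embedding_model/chroma_db")
collection = client.get_or_create_collection("ajcc_7thed")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=[e.tolist() for e in embeddings],
)

# Retrieve top-k passages for one query; the real script uses instruction-tuned
# query formatting per TCGA code.
query = "rules that help predict T stage for breast cancer"
q_emb = embedder.encode([query])[0].tolist()
hits = collection.query(query_embeddings=[q_emb], n_results=5)
rag_raw_t14 = "\n\n".join(hits["documents"][0])
```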
All methods use the same OpenAI-compatible client (`OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")`). Prompts are cancer-type specific and task-specific (T or N). Labels are {T1, T2, T3, T4} for the T task and {N0, N1, N2, N3} for the N task.
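A minimal sketch of a call through that client; the prompt wording here is illustrative, not the repo's cancer-type-specific template:

```python
from openai import OpenAI

# Same OpenAI-compatible endpoint exposed by `vllm serve` above.
client = OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")

# Illustrative zero-shot prompt; the repo's prompts are cancer- and task-specific.
report = "The tumor measures 2.5 cm in greatest dimension ..."
resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{
        "role": "user",
        "content": f"Pathology report:\n{report}\n\n"
                   "Think step by step, then answer with one of T1, T2, T3, T4.",
    }],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```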
- **ZSCOT**: zero-shot chain-of-thought on the report only. Adds columns: `zscot_reasoning`, `zscot_stage`, and `zscot_raw_llm` (on failures).
- **RAG**: injects the retrieved manual excerpt (`context_<TCGA>.json`) into the prompt and predicts. Adds: `rag_reasoning`, `rag_stage`, `rag_raw_llm`.
- **KEwRAG**: first induces rules from the RAG excerpt (no reports), then applies those rules to each report. Adds: `kewrag_rules` (a list), `kewrag_rules_raw_llm` (if rule induction failed), `kewrag_reasoning`, `kewrag_stage`, `kewrag_raw_llm`.
- **KEwLTM**: induces a long-term memory (a list of rules) from a small training subset of reports, optionally gating updates with fuzzy similarity (default threshold 80; see the sketch after this list), then uses the memory to infer stages for the remaining reports. Adds per row: `is_train` (True for the first `train_size` rows, which are used for induction), `kewltm_memory_snapshot` (the memory list at that point), `kewltm_memory_snapshot_str`, `kewltm_reasoning`, `kewltm_stage`, `kewltm_raw_llm`/`kewltm_train_raw_llm` as needed, plus `kewltm_memory` (the final memory, repeated for convenience).
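A minimal sketch of the fuzzy-similarity gate, assuming `rapidfuzz` (the repo may use a different fuzzy-matching library) and a hypothetical `update_memory` helper:

```python
from rapidfuzz import fuzz  # assumed matcher; scores string similarity on a 0-100 scale


def update_memory(memory: list[str], candidates: list[str], threshold: int = 80) -> list[str]:
    """Append an induced rule only if no stored rule scores >= threshold."""
    for rule in candidates:
        if all(fuzz.ratio(rule, kept) < threshold for kept in memory):
            memory.append(rule)
    return memory


memory = update_memory([], ["T2: tumor >2 cm but <=5 cm in greatest dimension"])
# A near-duplicate phrasing scores above the threshold and is gated out:
memory = update_memory(memory, ["T2: tumour > 2 cm but <= 5 cm in greatest dimension"])
print(memory)  # only the first phrasing survives
```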
Evaluation: for each method, if all predictions are present, macro P/R/F1 over the four classes are logged. KEwLTM reports metrics only on the test split (the non-training rows).
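A minimal sketch of that computation with scikit-learn, shown for KEwLTM on the T task; the results file name is hypothetical, and the label/prediction columns follow §4:

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv("results.csv")  # hypothetical per-run results file

# KEwLTM is scored on the test split only; the other methods use all labeled rows.
test = df[~df["is_train"]]
p, r, f1, _ = precision_recall_fscore_support(
    test["T14"],           # gold label column for the T task
    test["kewltm_stage"],  # predicted stage column from §4
    labels=["T1", "T2", "T3", "T4"],
    average="macro",
    zero_division=0,
)
print(f"macro P={p:.3f} R={r:.3f} F1={f1:.3f}")
```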
The driver script loops over every `*_T14N03.csv` under `per_cancer_type/`, runs any subset of methods for whichever tasks are present (T and/or N), and writes one results CSV and one log file per run, plus a manifest. Run it as:

```bash
python run_experiments_hardcoded.py
```

What it does:
- Discovers all `*_T14N03.csv` files.
- For each file, infers the TCGA code and maps it to a printable cancer name for prompts.
- Checks for `T14` and `N03` columns to decide which task(s) to run.
- For RAG/KEwRAG, requires `rag/context/context_<TCGA>.json`. If the JSON is missing, it skips those methods for that cancer and prints a warning.
- For KEwLTM, chooses `train_size` dynamically as a fraction of the labeled rows for that task (after filtering out rows with missing labels); you can override it with `FORCE_TRAIN_SIZE`. See the sketch after this list.
- Writes one results CSV and one log per (method, task, TCGA) combination. File names include the model id, seed, and a timestamp.
- Writes a manifest CSV in `OUT_DIR` recording, for each run: TCGA code, printable name, dataset path, task, method, context JSON path, model id, seed, train fraction, computed train size, edit threshold, results CSV path, and log path.
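A minimal sketch of the discovery and train-size logic, assuming a hypothetical `TRAIN_FRACTION` of 0.2 (the hardcoded fraction in the script may differ):

```python
import glob
import os
import re

import pandas as pd

FORCE_TRAIN_SIZE = None  # set an int to override the dynamic choice
TRAIN_FRACTION = 0.2     # illustrative; the script's hardcoded fraction may differ

for csv_path in sorted(glob.glob("per_cancer_type/*_T14N03.csv")):
    # Infer the TCGA code from the file name, e.g. "BRCA" from BRCA_T14N03.csv.
    tcga = re.match(r"(.+)_T14N03\.csv$", os.path.basename(csv_path)).group(1)

    # RAG/KEwRAG need the retrieved context; warn and skip them if it is absent.
    ctx_path = f"rag/context/context_{tcga}.json"
    if not os.path.exists(ctx_path):
        print(f"WARNING: {ctx_path} missing; skipping RAG/KEwRAG for {tcga}")

    df = pd.read_csv(csv_path)
    for task in ("T14", "N03"):
        if task not in df.columns:
            continue  # run only the task(s) present in this file
        labeled = df[df[task].notna()]
        train_size = FORCE_TRAIN_SIZE or int(len(labeled) * TRAIN_FRACTION)
        print(tcga, task, "KEwLTM train_size =", train_size)
```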
- Per‑run CSV: the original dataset columns plus method‑specific columns described in §4.
- Logs: metrics such as macro P/R/F1 are printed at the end of each method’s run.
- Manifest: one row per executed run with file paths, parameters, and sizes. Useful when you want to audit or regroup results later.
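For auditing or regrouping, a minimal sketch that joins per-run results through the manifest; the manifest file name and column names here are hypothetical and should be matched to the actual header:

```python
import pandas as pd

# Hypothetical manifest location/columns; the driver writes the real one under OUT_DIR.
manifest = pd.read_csv("out/manifest.csv")

# Regroup per-run results across cancer types for each (method, task) pair.
for (method, task), runs in manifest.groupby(["method", "task"]):
    frames = [pd.read_csv(path) for path in runs["results_csv"]]
    combined = pd.concat(frames, ignore_index=True)
    print(method, task, f"{len(runs)} runs, {len(combined)} rows")
```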