Knowledge Elicitation with Large Language Models for Interpretable Cancer Stage Identification from Pathology Reports
This README covers how to generate the dataset, build the RAG context from the AJCC 7th Edition staging manual, and run the four methods (ZSCOT, RAG, KEwRAG, KEwLTM).
```
.
├── per_cancer_type/             # generated by the notebook (23 CSVs + 2 summary files)
├── rag/
│   ├── context/                 # created by rag/create_context.py (23 JSONs)
│   ├── log/                     # logs from context creation
│   ├── ajcc_7thed_cancer_staging_manual.pdf
│   └── create_context.py
├── create_dataset.ipynb         # builds per-cancer CSVs
├── evaluate_results.ipynb       # not required to run
├── kew_methods.py               # implementations of ZSCOT, RAG, KEwRAG, KEwLTM
├── run_experiments_hardcoded.py # batch driver
├── utils.py
└── README.md
```
Start an OpenAI-compatible vLLM server for the model:

```bash
CUDA_VISIBLE_DEVICES="2,3,4,5" \
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --download-dir $MODEL_PATH \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --guided-decoding-backend xgrammar
```

Then open and run `create_dataset.ipynb`. It writes (a quick sanity check is sketched after this list):

- `per_cancer_type/<TCGA>_T14N03.csv` for each of the 23 cancer types,
- `per_cancer_type/ALL_T14N03_by_cancer_type.csv`,
- `per_cancer_type/SUMMARY_counts_per_type.csv`.
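A minimal sanity check with pandas, using BRCA as an example TCGA code (any of the 23 works):

```python
import pandas as pd

# Load the per-type summary produced by create_dataset.ipynb.
summary = pd.read_csv("per_cancer_type/SUMMARY_counts_per_type.csv")
print(summary.head())

# Spot-check one cancer type's label columns (T14 / N03, as the driver expects).
brca = pd.read_csv("per_cancer_type/BRCA_T14N03.csv")
print(brca["T14"].value_counts(dropna=False))
print(brca["N03"].value_counts(dropna=False))
```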
Run `rag/create_context.py` once to create the JSON contexts per cancer type:

```bash
# from repo root
python rag/create_context.py
```

What it does:
- Loads `rag/ajcc_7thed_cancer_staging_manual.pdf` with PyMuPDF and merges the page text.
- Encodes chunks with `nvidia/NV-Embed-v2` (device map = auto) and stores them in a persistent Chroma DB under `/secure/shared_data/rag_embedding_model/chroma_db`.
- For each of the 23 TCGA codes, forms two instruction-tuned queries ("rules that help predict T…" and "…N…") and retrieves the top-k passages.
- Writes `rag/context/context_<TCGA>.json` with two keys: `rag_raw_t14` (joined passages for T) and `rag_raw_n03` (joined passages for N). A sketch of the whole pipeline follows this list.
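A minimal sketch of that pipeline; the chunk size, collection name, and query wording are illustrative assumptions, not the script's exact values:

```python
import fitz  # PyMuPDF
import chromadb
from sentence_transformers import SentenceTransformer

# Extract and merge page text from the staging manual.
doc = fitz.open("rag/ajcc_7thed_cancer_staging_manual.pdf")
full_text = "\n".join(page.get_text() for page in doc)

# Naive fixed-size chunking; the actual script may split differently.
chunk_size = 1000
chunks = [full_text[i:i + chunk_size] for i in range(0, len(full_text), chunk_size)]

# NV-Embed-v2 via sentence-transformers (the script uses device map = auto).
embedder = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)
embeddings = embedder.encode(chunks)

# Persistent Chroma collection, as described above ("ajcc_7thed" name assumed).
client = chromadb.PersistentClient(path="/secure/shared_data/rag_embedding_model/chroma_db")
collection = client.get_or_create_collection("ajcc_7thed")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=[e.tolist() for e in embeddings],
)

# Retrieve top-k passages for one query; the real script uses instruction-tuned
# query formatting per TCGA code.
query = "rules that help predict T stage for breast cancer"
q_emb = embedder.encode([query])[0].tolist()
hits = collection.query(query_embeddings=[q_emb], n_results=5)
rag_raw_t14 = "\n\n".join(hits["documents"][0])
```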
All methods use the same OpenAI-compatible client (`OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")`). Prompts are cancer-type specific and task-specific (T or N). Labels are {T1, T2, T3, T4} for the T task and {N0, N1, N2, N3} for the N task.
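A minimal sketch of a call through that client; the prompt wording here is illustrative, not the repo's cancer-type-specific template:

```python
from openai import OpenAI

# Same OpenAI-compatible endpoint exposed by `vllm serve` above.
client = OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")

# Illustrative zero-shot prompt; the repo's prompts are cancer- and task-specific.
report = "The tumor measures 2.5 cm in greatest dimension ..."
resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{
        "role": "user",
        "content": f"Pathology report:\n{report}\n\n"
                   "Think step by step, then answer with one of T1, T2, T3, T4.",
    }],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```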
- **ZSCOT**: zero-shot chain-of-thought on the report only. Adds columns: `zscot_reasoning`, `zscot_stage`, and `zscot_raw_llm` (on failures).
- **RAG**: injects the retrieved manual excerpt (`context_<TCGA>.json`) into the prompt and predicts. Adds: `rag_reasoning`, `rag_stage`, `rag_raw_llm`.
- **KEwRAG**: first induces rules from the RAG excerpt (no reports), then applies those rules to each report. Adds: `kewrag_rules` (a list), `kewrag_rules_raw_llm` (if rule induction failed), `kewrag_reasoning`, `kewrag_stage`, `kewrag_raw_llm`.
- **KEwLTM**: induces a long-term memory (a list of rules) from a small training subset of reports, optionally gating updates with fuzzy similarity (default threshold 80; see the sketch after this list), then uses the memory to infer stages for the remaining reports. Adds per row: `is_train` (True for the first `train_size` rows, which are used for induction), `kewltm_memory_snapshot` (the memory list at that point), `kewltm_memory_snapshot_str`, `kewltm_reasoning`, `kewltm_stage`, `kewltm_raw_llm`/`kewltm_train_raw_llm` as needed, plus `kewltm_memory` (the final memory, repeated for convenience).
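A minimal sketch of the fuzzy-similarity gate, assuming `rapidfuzz` (the repo may use a different fuzzy-matching library) and a hypothetical `update_memory` helper:

```python
from rapidfuzz import fuzz  # assumed matcher; scores string similarity on a 0-100 scale


def update_memory(memory: list[str], candidates: list[str], threshold: int = 80) -> list[str]:
    """Append an induced rule only if no stored rule scores >= threshold."""
    for rule in candidates:
        if all(fuzz.ratio(rule, kept) < threshold for kept in memory):
            memory.append(rule)
    return memory


memory = update_memory([], ["T2: tumor >2 cm but <=5 cm in greatest dimension"])
# A near-duplicate phrasing scores above the threshold and is gated out:
memory = update_memory(memory, ["T2: tumour > 2 cm but <= 5 cm in greatest dimension"])
print(memory)  # only the first phrasing survives
```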
Evaluation: for each method, if all predictions are present, macro P/R/F1 over the four classes are logged. KEwLTM reports metrics only on the test split (the non-training rows).
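A minimal sketch of that computation with scikit-learn, shown for KEwLTM on the T task; the results file name is hypothetical, and the label/prediction columns follow §4:

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv("results.csv")  # hypothetical per-run results file

# KEwLTM is scored on the test split only; the other methods use all labeled rows.
test = df[~df["is_train"]]
p, r, f1, _ = precision_recall_fscore_support(
    test["T14"],           # gold label column for the T task
    test["kewltm_stage"],  # predicted stage column from §4
    labels=["T1", "T2", "T3", "T4"],
    average="macro",
    zero_division=0,
)
print(f"macro P={p:.3f} R={r:.3f} F1={f1:.3f}")
```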
The driver script loops over every `*_T14N03.csv` under `per_cancer_type/`, runs any subset of methods for whichever tasks are present (T and/or N), and writes one results CSV and one log file per run, plus a manifest. Run it as:

```bash
python run_experiments_hardcoded.py
```

What it does:
- Discovers all `*_T14N03.csv` files.
- For each file, infers the TCGA code and maps it to a printable cancer name for prompts.
- Checks for `T14` and `N03` columns to decide which task(s) to run.
- For RAG/KEwRAG, requires `rag/context/context_<TCGA>.json`. If the JSON is missing, it skips those methods for that cancer and prints a warning.
- For KEwLTM, chooses `train_size` dynamically as a fraction of the labeled rows for that task (after filtering out rows with missing labels); you can override it with `FORCE_TRAIN_SIZE`. See the sketch after this list.
- Writes one results CSV and one log per (method, task, TCGA) combination. File names include the model id, seed, and a timestamp.
- Writes a manifest CSV in `OUT_DIR` recording, for each run: TCGA code, printable name, dataset path, task, method, context JSON path, model id, seed, train fraction, computed train size, edit threshold, results CSV path, and log path.
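A minimal sketch of the discovery and train-size logic, assuming a hypothetical `TRAIN_FRACTION` of 0.2 (the hardcoded fraction in the script may differ):

```python
import glob
import os
import re

import pandas as pd

FORCE_TRAIN_SIZE = None  # set an int to override the dynamic choice
TRAIN_FRACTION = 0.2     # illustrative; the script's hardcoded fraction may differ

for csv_path in sorted(glob.glob("per_cancer_type/*_T14N03.csv")):
    # Infer the TCGA code from the file name, e.g. "BRCA" from BRCA_T14N03.csv.
    tcga = re.match(r"(.+)_T14N03\.csv$", os.path.basename(csv_path)).group(1)

    # RAG/KEwRAG need the retrieved context; warn and skip them if it is absent.
    ctx_path = f"rag/context/context_{tcga}.json"
    if not os.path.exists(ctx_path):
        print(f"WARNING: {ctx_path} missing; skipping RAG/KEwRAG for {tcga}")

    df = pd.read_csv(csv_path)
    for task in ("T14", "N03"):
        if task not in df.columns:
            continue  # run only the task(s) present in this file
        labeled = df[df[task].notna()]
        train_size = FORCE_TRAIN_SIZE or int(len(labeled) * TRAIN_FRACTION)
        print(tcga, task, "KEwLTM train_size =", train_size)
```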
- Per‑run CSV: the original dataset columns plus method‑specific columns described in §4.
- Logs: metrics such as macro P/R/F1 are printed at the end of each method’s run.
- Manifest: one row per executed run with file paths, parameters, and sizes. Useful when you want to audit or regroup results later.
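For auditing or regrouping, a minimal sketch that joins per-run results through the manifest; the manifest file name and column names here are hypothetical and should be matched to the actual header:

```python
import pandas as pd

# Hypothetical manifest location/columns; the driver writes the real one under OUT_DIR.
manifest = pd.read_csv("out/manifest.csv")

# Regroup per-run results across cancer types for each (method, task) pair.
for (method, task), runs in manifest.groupby(["method", "task"]):
    frames = [pd.read_csv(path) for path in runs["results_csv"]]
    combined = pd.concat(frames, ignore_index=True)
    print(method, task, f"{len(runs)} runs, {len(combined)} rows")
```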