
*Purpose:* implementation of a long-term memory prompting approach for identifying T and N stages from TCGA pathology reports


CYang-CCI-Lab/selfCorrectionAgent


Knowledge Elicitation with Large Language Models for Interpretable Cancer Stage Identification from Pathology Reports

This README covers how to generate the dataset, build the RAG contexts from the AJCC 7th edition manual, and run the four methods (ZSCOT, RAG, KEwRAG, KEwLTM).

0) Repository layout

├── per_cancer_type/                     # generated by the notebook (23 CSVs + 2 summary files)
├── rag/
│   ├── context/                         # created by rag/create_context.py (23 JSONs)
│   ├── log/                             # logs from context creation
│   ├── ajcc_7thed_cancer_staging_manual.pdf
│   └── create_context.py
├── create_dataset.ipynb                 # builds per-cancer CSVs
├── evaluate_results.ipynb               # not required to run
├── kew_methods.py                       # implementations of ZSCOT, RAG, KEwRAG, KEwLTM
├── run_experiments_hardcoded.py         # batch driver
├── utils.py
└── README.md

1) Start the LLM server (vLLM)

CUDA_VISIBLE_DEVICES="2,3,4,5" \
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --download-dir $MODEL_PATH \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --guided-decoding-backend xgrammar
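
Once the server is up, you can sanity-check the endpoint with the same OpenAI-compatible client the experiment code uses (see §4). A minimal sketch:

# smoke-test the vLLM OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")

# list the served models; expect mistralai/Mixtral-8x7B-Instruct-v0.1
for model in client.models.list():
    print(model.id)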

2) Build the dataset (create_dataset.ipynb)

Open and run create_dataset.ipynb. It writes:

  • per_cancer_type/<TCGA>_T14N03.csv for 23 cancer types,
  • per_cancer_type/ALL_T14N03_by_cancer_type.csv,
  • per_cancer_type/SUMMARY_counts_per_type.csv.
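
To spot-check one of the generated files, a minimal sketch (BRCA is used here as a placeholder TCGA code; T14 and N03 are the label columns the driver checks in §5):

import pandas as pd

df = pd.read_csv("per_cancer_type/BRCA_T14N03.csv")  # placeholder TCGA code
print(len(df), "reports")
print(df["T14"].value_counts(dropna=False))  # T labels: T1..T4, possibly missing
print(df["N03"].value_counts(dropna=False))  # N labels: N0..N3, possibly missing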

3) Build RAG contexts from the AJCC manual

Run this once to create one JSON context file per cancer type.

# from repo root
python rag/create_context.py

What it does:

  • Loads rag/ajcc_7thed_cancer_staging_manual.pdf with PyMuPDF and merges page text.

  • Encodes chunks with nvidia/NV-Embed-v2 (device map = auto) and stores them in a persistent Chroma DB under /secure/shared_data/rag_embedding_model/chroma_db.

  • For each of the 23 TCGA codes, forms two instruction‑style queries (“rules that help predict T…” and “…N…”) and retrieves the top‑k passages.

  • Writes rag/context/context_<TCGA>.json with two keys (a loading sketch follows this list):

    • rag_raw_t14: joined passages for T,
    • rag_raw_n03: joined passages for N.
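
A quick way to inspect one of the generated contexts (again with a placeholder TCGA code):

import json

with open("rag/context/context_BRCA.json") as f:  # placeholder TCGA code
    ctx = json.load(f)

print(list(ctx))                 # expect: ['rag_raw_t14', 'rag_raw_n03']
print(ctx["rag_raw_t14"][:500])  # first 500 characters of the joined T passages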

4) Methods implemented in kew_methods.py (what gets run)

All methods use the same OpenAI‑compatible client (OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")). Prompts are cancer‑type specific and task‑specific (T or N). Labels are {T1,T2,T3,T4} for the T task and {N0,N1,N2,N3} for the N task.
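
For illustration, a hedged sketch of what a single constrained prediction call could look like, using vLLM's guided_choice extension (the real prompts in kew_methods.py are more elaborate, and report_text is a placeholder):

from openai import OpenAI

client = OpenAI(api_key="dummy_key", base_url="http://localhost:8000/v1")
report_text = "..."  # a pathology report taken from a per-cancer CSV

# guided_choice is a vLLM extension that constrains the completion to one label
resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user",
               "content": f"Pathology report:\n{report_text}\n\nWhat is the T stage?"}],
    extra_body={"guided_choice": ["T1", "T2", "T3", "T4"]},
)
print(resp.choices[0].message.content)  # one of T1..T4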

  • ZSCOT: Zero‑shot chain‑of‑thought on the report only. Adds columns: zscot_reasoning, zscot_stage, zscot_raw_llm (on failures).

  • RAG: Injects the retrieved manual excerpt (context_<TCGA>.json) into the prompt and predicts. Adds: rag_reasoning, rag_stage, rag_raw_llm.

  • KEwRAG: First induces rules from the RAG excerpt (no reports), then applies those rules to each report. Adds: kewrag_rules (list), kewrag_rules_raw_llm (if the rule induction failed), kewrag_reasoning, kewrag_stage, kewrag_raw_llm.

  • KEwLTM: Induces a long‑term memory (a list of rules) from a small training subset of reports, optionally gating memory updates with fuzzy string similarity (default threshold 80), then uses the memory to infer stages for the remaining reports. Adds per row: is_train (True for the first train_size rows used for induction), kewltm_memory_snapshot (the memory list at that point), kewltm_memory_snapshot_str, kewltm_reasoning, kewltm_stage, kewltm_raw_llm/kewltm_train_raw_llm as needed, plus kewltm_memory (the final memory, repeated on every row for convenience). A sketch of the update loop follows this list.
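
The memory-update loop has roughly the following shape (an illustrative reconstruction, not the actual kew_methods.py code; induce_rules stands in for the LLM rule-induction call, train_rows for the training subset, and rapidfuzz is assumed for the fuzzy gate):

from rapidfuzz import fuzz

def update_memory(memory, candidate_rules, threshold=80):
    """Keep a candidate rule only if no stored rule is fuzzy-similar above the threshold."""
    for rule in candidate_rules:
        if all(fuzz.ratio(rule, kept) < threshold for kept in memory):
            memory.append(rule)
    return memory

memory = []
for report, label in train_rows:              # first train_size labeled rows (placeholder)
    candidates = induce_rules(report, label)  # stand-in for the LLM call
    memory = update_memory(memory, candidates)
# the final memory is then injected into the prompt for the remaining (test) reports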

Evaluation: for each method, if all predictions are present, macro P/R/F1 over 4 classes are logged. KEwLTM reports metrics only on the test split (non‑training rows).
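
The logged numbers correspond to something like the following (a sketch using scikit-learn; the results filename is a placeholder, column names as in §4):

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv("results.csv")  # placeholder for a per-run results CSV (see §5 for real names)

# KEwLTM is scored on the test split only; the other methods use all rows
test = df[~df["is_train"]] if "is_train" in df.columns else df
p, r, f1, _ = precision_recall_fscore_support(
    test["T14"], test["kewltm_stage"], average="macro", zero_division=0
)
print(f"macro P={p:.3f}  R={r:.3f}  F1={f1:.3f}")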

5) Batch runs across all cancers (run_experiments_hardcoded.py)

The driver script loops over every *_T14N03.csv under per_cancer_type/, runs any subset of methods for whichever tasks are present (T and/or N), and writes one CSV and one log file per run. It also writes a manifest.

Run it as:

python run_experiments_hardcoded.py

What it does:

  • Discovers all *_T14N03.csv files.
  • For each file, infers the TCGA code and maps it to a printable cancer name for prompts.
  • Checks for T14 and N03 columns to decide which task(s) to run.
  • For RAG/KEwRAG, it requires rag/context/context_<TCGA>.json. If the JSON is missing, it skips those methods for that cancer and prints a warning.
  • For KEwLTM, it chooses train_size dynamically as a fraction of the labeled rows for that task (after filtering out rows with missing labels); you can override it with FORCE_TRAIN_SIZE. A sketch of this logic follows this list.
  • Writes one results CSV and one log per (method, task, TCGA). File names include the model id, seed, and a timestamp.
  • Writes a manifest CSV in OUT_DIR with, for each run: TCGA, printable name, dataset path, task, method, context JSON path, model id, seed, train fraction, computed train size, edit threshold, results CSV, and log path.
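
The discovery and train-size logic amounts to roughly the following (an illustrative sketch; the actual fraction and override live inside the script):

import glob
import os

TRAIN_FRACTION = 0.2     # placeholder; the real value is set in run_experiments_hardcoded.py
FORCE_TRAIN_SIZE = None  # set to an int to override the dynamic choice

for path in sorted(glob.glob("per_cancer_type/*_T14N03.csv")):
    tcga = os.path.basename(path).split("_")[0]     # e.g. BRCA from BRCA_T14N03.csv
    context_json = f"rag/context/context_{tcga}.json"
    has_context = os.path.exists(context_json)      # required for RAG and KEwRAG
    print(f"{tcga}: context={'yes' if has_context else 'missing'}")
    # load the CSV, check for T14 / N03 columns, then per task:
    #   n_labeled  = rows with a non-missing label for that task
    #   train_size = FORCE_TRAIN_SIZE or int(TRAIN_FRACTION * n_labeled)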

6) Outputs to expect

  • Per‑run CSV: the original dataset columns plus method‑specific columns described in §4.
  • Logs: metrics such as macro P/R/F1 are printed at the end of each method’s run.
  • Manifest: one row per executed run with file paths, parameters, and sizes; useful when you want to audit or regroup results later (see the sketch below).
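
For example, to regroup runs by method and task after the fact (the manifest filename and column names are assumptions based on the fields listed in §5):

import pandas as pd

manifest = pd.read_csv("manifest.csv")  # placeholder; the driver writes it into OUT_DIR
print(manifest.groupby(["method", "task"]).size())  # runs per (method, task)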
