PurcellLab · sanjaysgk · May 26, 2026 · May 26, 2026
diff --git a/docs/llm.md b/docs/llm.md
@@ -0,0 +1,178 @@
+# LLM guide — complete operating instructions
+
+> This page is a single, self-contained reference written for an LLM assistant to
+> help an end user run **every** MHC-TP use case correctly. It restates the
+> install, inputs, commands, flags, recipes, outputs, and the matching method so
+> no other page is required.
+
+## What MHC-TP does
+
+`mhc-tp` is a command-line tool. Given a **GibbsCluster** output folder, it
+correlates each peptide cluster's position-specific scoring matrix (PSSM) against
+a reference library of HLA/MHC **class I + II** binding motifs (human and mouse)
+and reports, per cluster, the best-matching allotype(s). It writes a ranked CSV
+and a standalone interactive HTML report.
+
+- Package / import name: `mhc-tp` / `mhc_tp`. Console command: `mhc-tp`.
+- Python 3.9–3.11. No internet needed at search time (only once, to fetch the reference).
+
+## Install
+
+```bash
+pip install mhc-tp
+```
+
+Editable from source (for development):
+
+```bash
+git clone https://github.com/PurcellLab/MHC-TP.git
+cd MHC-TP
+pip install -e .
+```
+
+## Reference data (required, fetched once)
+
+Reference motif parquets are downloaded from the GitHub release, not bundled.
+
+```bash
+mhc-tp fetch -s human     # human  |  mouse  |  all
+```
+
+`fetch` options: `-s/--species {human,mouse,all}` (default `all`), `-d/--dest DIR`
+(override the data dir; otherwise a per-user data dir, also settable via
+`$MHC_TP_DATA_DIR`). After fetching, `mhc-tp search` finds the parquet
+automatically — you only pass `-r/--reference` to use a custom file.
+
+## Input: the GibbsCluster folder
+
+The positional `gibbs_folder` argument is a GibbsCluster run directory. It **must
+contain a `matrices/` subdirectory** with the per-cluster matrix files
+(`gibbs.<g>of<N>.mat`). If an `images/gibbs.KLDvsClusters.tab` file is present, the
+report also shows each cluster's KLD (information content). GibbsCluster's own
+logo PNGs, if present, are reused in the report.
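+
+The layout described above can be checked before running a search. The sketch
+below is illustrative only: `check_gibbs_folder` is a hypothetical helper, not part
+of `mhc-tp`, and the file-name pattern is inferred from `gibbs.<g>of<N>.mat`.
+
+```python
+from pathlib import Path
+import re
+import tempfile
+
+# Pattern for per-cluster matrix files such as "gibbs.2of5.mat" (assumed from the naming above).
+MAT_NAME = re.compile(r"^gibbs\.(\d+)of(\d+)\.mat$")
+
+
+def check_gibbs_folder(folder):
+    """Hypothetical helper: summarise what a GibbsCluster folder offers."""
+    root = Path(folder)
+    matrices = root / "matrices"
+    if not matrices.is_dir():
+        raise FileNotFoundError(f"{root} has no matrices/ subdirectory")
+    mats = sorted(p.name for p in matrices.iterdir() if MAT_NAME.match(p.name))
+    has_kld = (root / "images" / "gibbs.KLDvsClusters.tab").is_file()
+    return {"matrix_files": mats, "has_kld": has_kld}
+
+
+# Demo on a throwaway folder that mimics the expected layout.
+with tempfile.TemporaryDirectory() as tmp:
+    run = Path(tmp) / "sampleA"
+    (run / "matrices").mkdir(parents=True)
+    for g in (1, 2):
+        (run / "matrices" / f"gibbs.{g}of2.mat").touch()
+    summary = check_gibbs_folder(run)
+    print(summary)
+```
+
+A missing `matrices/` directory raises immediately, while a missing KLD table is
+only reported, mirroring the required/optional split described above.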
+
+## Command: `mhc-tp search`
+
+```text
+mhc-tp search [-h] [-r REFERENCE] [-s SPECIES] [-c {I,II,all}]
+              [-t THRESHOLD] [--topNHits TOPNHITS] [--always-top-n]
+              [-o OUTPUT] [--no-html] [-l]
+              [--log-level {debug,info,warning,error,critical}]
+              [--threads THREADS]
+              gibbs_folder
+```
+
+| flag | meaning | default |
+|------|---------|---------|
+| `gibbs_folder` | GibbsCluster run dir (must have `matrices/`) | required |
+| `-s, --species` | `human` or `mouse` | `human` |
+| `-c, --class` | restrict reference to MHC class `I`, `II`, or `all` | `all` |
+| `-r, --reference` | path to a `<species>.parquet` (else the fetched one) | auto |
+| `-t, --threshold` | minimum Pearson correlation (PCC) to report | `0.70` |
+| `--topNHits` | allotype matches to keep per cluster | `3` |
+| `--always-top-n` | keep each cluster's top-N even below threshold (flagged "below cutoff") | off |
+| `-o, --output` | output directory | `output` |
+| `--no-html` | write only the CSV, skip the HTML report | off |
+| `-l, --log` | also save the coloured session log to the output dir | off |
+| `--log-level` | `debug`/`info`/`warning`/`error`/`critical` | `info` |
+| `--threads` | max CPU threads (also `$MHC_TP_THREADS`) | `4` |
+
+## Use-case recipes (copy-paste)
+
+```bash
+# 1. Basic human search (class I + II reference)
+mhc-tp search runs/sampleA -s human -o results/
+
+# 2. Mouse sample
+mhc-tp fetch -s mouse
+mhc-tp search runs/mouseA -s mouse -o results_mouse/
+
+# 3. Restrict to MHC class I only (faster, class-I immunopeptidome)
+mhc-tp search runs/sampleA -s human -c I -o results/
+
+# 4. Class II only
+mhc-tp search runs/sampleA -s human -c II -o results/
+
+# 5. Keep the top 5 matches per cluster instead of 3
+mhc-tp search runs/sampleA -s human --topNHits 5 -o results/
+
+# 6. Guarantee a top-3 for EVERY cluster even if below threshold
+#    (weak matches are tagged "below cutoff" in the report)
+mhc-tp search runs/sampleA -s human --always-top-n -o results/
+
+# 7. Stricter confidence cutoff (use a value below 0.70 for a looser one)
+mhc-tp search runs/sampleA -s human -t 0.80 -o results/
+
+# 8. CSV only (no HTML report)
+mhc-tp search runs/sampleA -s human --no-html -o results/
+
+# 9. Use a custom / local reference parquet
+mhc-tp search runs/sampleA -s human -r path/to/human.parquet -o results/
+
+# 10. Set the CPU thread cap (default 4) and save a log
+mhc-tp search runs/sampleA -s human --threads 8 -l -o results/
+```
+
+> [!IMPORTANT]
+> **Threshold vs top-N selection order:** per cluster, all allotypes are ranked by
+> PCC, then the threshold is applied, then the top-N are taken. So by default a
+> cluster can return fewer than `--topNHits` rows (or none). With `--always-top-n`,
+> the top-N are taken regardless of threshold and the threshold only annotates
+> confidence — every cluster keeps its best N.
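+
+The selection order in the note above can be sketched in a few lines. This is an
+illustration, not the tool's code: `select_hits` is a hypothetical function, the
+scores are made up, and treating the threshold as inclusive (`>=`) is an assumption.
+
+```python
+def select_hits(scores, threshold=0.70, top_n=3, always_top_n=False):
+    """Illustrative only; `scores` maps allotype name -> PCC for one cluster."""
+    # Step 1: rank all allotypes by PCC, best first.
+    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
+    if always_top_n:
+        # Top-N regardless of threshold; the threshold only annotates below.
+        kept = ranked[:top_n]
+    else:
+        # Step 2: apply the threshold, step 3: take the top-N of what is left.
+        kept = [kv for kv in ranked if kv[1] >= threshold][:top_n]
+    # The third field marks rows that would be tagged "below cutoff".
+    return [(name, pcc, pcc < threshold) for name, pcc in kept]
+
+
+cluster_scores = {"A": 0.91, "B": 0.74, "C": 0.62, "D": 0.40}  # made-up values
+default_hits = select_hits(cluster_scores)
+forced_hits = select_hits(cluster_scores, always_top_n=True)
+print(default_hits)
+print(forced_hits)
+```
+
+With these values the default mode keeps two rows (only two pass 0.70), while the
+forced mode keeps three and flags the third as below cutoff.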
+
+## Outputs
+
+Written to `<output>/clust_result/`:
+
+| file | description |
+|------|-------------|
+| `correlations.csv` | columns: `cluster` (e.g. `gibbs.2of5`), `hla` (canonical display name, e.g. `HLA-B39:124`), `formatted` (raw join key, e.g. `HLAB39124`), `correlation` (PCC) |
+| `mhc-tp-result.html` | standalone interactive report (open in any browser) |
+
+The report contains: a motif-comparison carousel (per cluster solution, choose via
+dropdown), a paginated + searchable Top-N table, and a correlation analysis section
+(heatmap + force-directed network, with a PCC threshold slider that filters the
+view only). Rows/motifs below the search threshold (only present with
+`--always-top-n`) are tagged **below cutoff**.
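+
+For downstream scripting, the CSV columns listed above can be read with the
+standard library. A minimal sketch, assuming the four documented columns; the rows
+are made up for illustration and are inlined so the example runs without a real
+`correlations.csv`.
+
+```python
+import csv
+import io
+
+# Made-up rows in the documented column layout; a real file would be opened from
+# <output>/clust_result/correlations.csv instead of this inline string.
+sample = """cluster,hla,formatted,correlation
+gibbs.1of5,HLA-A25:08,HLAA2508,0.88
+gibbs.2of5,HLA-B39:124,HLAB39124,0.93
+gibbs.2of5,HLA-A25:08,HLAA2508,0.71
+"""
+
+# Keep the highest-PCC allotype for each cluster.
+best = {}
+for row in csv.DictReader(io.StringIO(sample)):
+    pcc = float(row["correlation"])
+    if row["cluster"] not in best or pcc > best[row["cluster"]][1]:
+        best[row["cluster"]] = (row["hla"], pcc)
+
+for cluster, (hla, pcc) in sorted(best.items()):
+    print(f"{cluster}: {hla} (PCC {pcc:.2f})")
+```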
+
+## How matching works (method)
+
+Each motif is a PSSM (`n_positions × 20`). For a cluster motif `g` and reference
+`r`, the score is the **Pearson correlation** over the cells `V` that are
+informative in the cluster motif (`g_k ≠ 0` and not NaN):
+
+```text
+PCC(g, r) = Σ_{k∈V}(g_k − ḡ)(r_k − r̄) / ( |V| · σ_g · σ_r )    ∈ [−1, 1]
+```
+
+`1.0` = identical motif shape. It is invariant to positive scaling and offsets (scores the
+pattern of position preferences, not absolute magnitudes). Allele display names are returned
+verbatim from the reference (matching the embedded logo titles), e.g. `HLA-A25:08`.
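+
+The formula above can be reproduced in plain Python. This is a sketch under stated
+assumptions rather than the package's implementation: `masked_pcc` is a
+hypothetical name, PSSMs are passed as flattened lists, means and standard
+deviations are taken over `V` only (population form, matching the `|V|` divisor),
+and reference cells inside `V` are assumed to be finite.
+
+```python
+import math
+
+
+def masked_pcc(g, r):
+    """Pearson correlation over cells where g is non-zero and not NaN.
+
+    g and r are equally sized flat lists (a PSSM flattened row by row).
+    """
+    # V: the informative cells of the cluster motif.
+    cells = [(gk, rk) for gk, rk in zip(g, r) if gk != 0 and not math.isnan(gk)]
+    n = len(cells)
+    if n < 2:
+        return float("nan")
+    g_mean = sum(gk for gk, _ in cells) / n
+    r_mean = sum(rk for _, rk in cells) / n
+    cov = sum((gk - g_mean) * (rk - r_mean) for gk, rk in cells) / n
+    g_sd = math.sqrt(sum((gk - g_mean) ** 2 for gk, _ in cells) / n)
+    r_sd = math.sqrt(sum((rk - r_mean) ** 2 for _, rk in cells) / n)
+    if g_sd == 0 or r_sd == 0:
+        return float("nan")  # correlation is undefined for a constant vector
+    return cov / (g_sd * r_sd)
+
+
+# Made-up values: on the informative cells r equals 2*g + 0.5, so the shapes match.
+g = [0.0, 1.2, -0.5, 2.0, float("nan"), 0.3]
+r = [9.9, 2.9, -0.5, 4.5, 7.7, 1.1]
+score = masked_pcc(g, r)
+print(round(score, 6))
+```
+
+The zero and NaN cells of `g` are skipped, so the unrelated values of `r` at those
+positions do not affect the score.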
+
+## Other commands
+
+```bash
+# Export embedded reference logos to PNGs
+mhc-tp export-logos -r human.parquet -o logos/ -a "HLA-B39:124,A0201"   # -a optional; default all
+```
+
+Developer-only (end users never need these): `mhc-tp build-db` and
+`mhc-tp build-ref` rebuild the reference parquets from NetMHCpan / NetMHCIIpan
+packs; embedding Seq2Logo logos (`--with-logos`) needs a separate Python 2.7 env.
+
+## Troubleshooting
+
+- **"no class II allotypes" / 0 matches with `-c II`**: the sample is likely class I
+  (short 8–11mers); use `-c I` or `-c all`.
+- **No matches**: lower `-t` (e.g. `-t 0.5`) or use `--always-top-n` to force the
+  best N per cluster.
+- **Reference not found**: run `mhc-tp fetch -s <species>` first, or pass
+  `-r path/to/<species>.parquet`.
+- **`gibbs_folder` rejected**: ensure it contains a `matrices/` subdirectory.
+
+## Citation
+
+Munday PR, Krishna SSG, Fehring J, Croft NP, Purcell AW, Li C, Braun A.
+*Immunolyser 2.0: An advanced computational pipeline for comprehensive analysis of
+immunopeptidomic data.* Comput Struct Biotechnol J. 2025;29:296–304.
+doi:10.1016/j.csbj.2025.10.007. PMID: 41209766.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -33,6 +33,7 @@ nav:
   - Home: index.md
   - Usage: usage.md
   - API reference: api.md
+  - LLM guide: llm.md
   - Example report: example-report.html
 
 plugins: