Skip to content

sanjaysgk/ipg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

289 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

sanjaysgk/ipg

GitHub Actions CI Status nf-test Nextflow pixi run with singularity run with docker License: MIT

Immunopeptidogenomics — cryptic peptide database construction from RNA-seq. An nf-core-style port of the 31-step bash pipeline used by the Li Lab and Purcell Lab at Monash University, implementing the cryptic peptide discovery method described in Scull et al. Mol Cell Proteomics 2021.

What this pipeline does

sanjaysgk/ipg turns paired-end RNA-seq reads into a cryptic peptide search database suitable for downstream MS/MS immunopeptidomics searches. Starting from FASTQs, it:

  1. Aligns reads with STAR (two-pass) and infers strandedness with RSeQC.
  2. Assembles novel transcripts with StringTie and reconciles them with the reference annotation via gffcompare (-R -V -C for the consensus combined GTF).
  3. Builds a variant-calling-ready BAM via the GATK4 RNA-seq Best Practices chain (FastqToSam → MergeBamAlignment → MarkDuplicates → SplitNCigarReads).
  4. Recalibrates base qualities with two passes of BaseRecalibrator + ApplyBQSR.
  5. Calls somatic variants with Mutect2 in tumour-only mode against a gnomAD-style germline allele-frequency database, then CalculateContamination / FilterMutectCalls / SelectVariants for clean PASS-only sites-only VCFs.
  6. Builds the cryptic peptide FASTA database via the IPG custom C tools (curate_vcf, revert_headers, alt_liftover, triple_translate, squish) plus gff3sort and gffread from bioconda.

Optionally, a post-MS analysis step (--step post_ms) runs the two-phase db_compare + origins workflow on PEAKS (or other search engine) results to identify and annotate cryptic-only peptides.

An MS search step (--step ms_search) runs up to four open-source search engines (MSFragger, Comet, Sage, PEAKS) in parallel against the cryptic FASTA, applies mokapot FDR control per engine, rescores PSMs with MS2Rescore, and merges results at 1% peptide-level FDR. Optional immunoinformatics gates — --run_netmhcpan, --run_netmhciipan, --run_gibbscluster, --run_flashlfq, --run_blastp_host — pick up the integrated peptide table and emit a per-sample HTML report summarising HLA binding, motif clusters, quantification, and host-background hits. See docs/usage.md for the full invocation.

The 31 legacy steps span three pipeline modes (--step db_construct | ms_search | post_ms) selected per run; the metro map below colours each mode as a separate transit line.

ipg pipeline overview

Diagram source: docs/images/ipg_overview.mmd — re-render with pixi run nf-metro render docs/images/ipg_overview.mmd -o docs/images/ipg_overview.svg.

Quick start

1. Install the dev environment

The repository ships a pixi project that pins every tool (nextflow 25.10.4, nf-core 3.5.2, nf-test 0.9.5, GATK4, STAR, samtools, bcftools, stringtie, gffcompare, gffread, RSeQC, FastQC, MultiQC, OpenJDK 17, plus the linting toolchain) to bit-for-bit reproducible versions via pixi.lock.

git clone https://github.com/sanjaysgk/ipg.git
cd ipg
pixi install

If you don't have pixi: curl -fsSL https://pixi.sh/install.sh | bash.

2. Build the chr22 test bundle (~5 minutes, one time)

pixi run bash scripts/build_test_bundle.sh

By default this writes to /fs04/scratch2/xy86/sanjay/ipg-test-data/. Override on a non-Monash machine:

TEST_BUNDLE_DIR=/some/path \
REFERENCE_DIR=/path/to/GRCh38 \
VARIANT_DIR=/path/to/variant-resources \
FASTQ_R1=/path/to/sample_R1.fq.gz \
FASTQ_R2=/path/to/sample_R2.fq.gz \
pixi run bash scripts/build_test_bundle.sh

The bundle is never committed to git — only the build script is. See scripts/build_test_bundle.sh for the exact recipe.

3. Run the pipeline against the test bundle

Profile combination Container engine Use when
-profile test,pixi none — tools from local PATH Fastest iteration on a workstation/HPC compute node where pixi can run all tools natively
-profile test,singularity apptainer / singularity Standard HPC; pulls biocontainers from community.wave.seqera.io
-profile test,docker docker Laptops, cloud, GitHub Actions CI
-profile test,monash,singularity singularity Monash M3 SLURM cluster (comp partition, xy86 account)
# Fastest (no containers, all tools from pixi):
pixi run nextflow run . -profile test,pixi --outdir results

The pipeline takes ~2 minutes end-to-end against the chr22 test bundle on a single Linux box and produces the cryptic peptide database at results/squish/<sample>_cryptic.fasta.

4. Run on real data

Create a samplesheet samplesheet.csv:

sample,fastq_1,fastq_2,strandedness
D100_liver,/path/to/D100-liver_R1.fastq.gz,/path/to/D100-liver_R2.fastq.gz,reverse
D101_pancreas,/path/to/D101-pancreas_R1.fastq.gz,/path/to/D101-pancreas_R2.fastq.gz,reverse

Then run with explicit reference paths:

pixi run nextflow run . \
    -profile monash,singularity \
    --input            samplesheet.csv \
    --outdir           /fs04/scratch2/xy86/sanjay/ipg-results \
    --fasta            /path/to/GRCh38.primary_assembly.genome.fa \
    --fasta_fai        /path/to/GRCh38.primary_assembly.genome.fa.fai \
    --fasta_dict       /path/to/GRCh38.primary_assembly.genome.dict \
    --gtf              /path/to/gencode.v44.primary_assembly.annotation.gtf \
    --star_index       /path/to/human_genome_index_GRCh38 \
    --rseqc_bed        /path/to/gencode_assembly.bed \
    --dbsnp            /path/to/dbsnp138.vcf \
    --dbsnp_tbi        /path/to/dbsnp138.vcf.idx \
    --known_indels     /path/to/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --known_indels_tbi /path/to/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi \
    --mills            /path/to/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --mills_tbi        /path/to/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi \
    --germline_resource     /path/to/small_exac_common_3.hg38.vcf.gz \
    --germline_resource_tbi /path/to/small_exac_common_3.hg38.vcf.gz.tbi

The --include_variant_peptides flag

By default the cryptic peptide DB is built from the reference assembly only, matching the legacy Scull et al. 2021 / D122_Lung run (verified empirically by inspecting the legacy squish.log). Pass --include_variant_peptides true to also fold the alt-reference variant-derived peptide branches (unmasked + indel) into the final database:

pixi run nextflow run . -profile monash,singularity \
    --input samplesheet.csv \
    --include_variant_peptides true \
    [other reference args]

[!IMPORTANT] > --include_variant_peptides does not switch the variant caller to matched tumour-normal mode. Variant calling is always performed in tumour-only Mutect2 mode against a gnomAD-style germline allele-frequency database; this pipeline does not support matched tumour-normal calling. The flag controls only whether the discovered variants get folded into the final cryptic peptide DB. Set true only when the sample is expected to harbour biologically meaningful somatic variants (e.g. tumour tissue, hypermutated cell lines, MMR-deficient samples). Leave at false for normal tissue, cell lines, or any sample where variant peptides would mostly add noise. This is a pipeline-level flag — to mix modes for a heterogeneous cohort, run the pipeline twice.

Post-MS analysis (--step post_ms)

After running the DB construction pipeline, the cryptic peptide FASTA is searched against MS/MS data using PEAKS Online (or another search engine such as MSFragger, Comet, or Sage). The resulting PSM CSVs are then analysed with the two-phase db_compare + origins workflow (Scull et al. 2021):

Phase 1: db_compare_v2.R  →  cryptic_only.txt
         origins -s        →  origins_discard.txt + origins_unconventional.txt

Phase 2: db_compare_v2.R (with -j discard -u unconventional)
         →  unambiguous_unconventional.txt
         origins (full Ensembl mode)  →  deep origin annotation

Post-MS samplesheet

Create a CSV with one row per sample:

sample,cryptic_psm_csv,uniprot_psm_csv,cryptic_decoy_score,uniprot_decoy_score
D122_liver,/path/to/D122_Liver_Cryptic_DB.db.psms.csv,/path/to/D122_Liver_Uniprot_DB.db.psms.csv,44.43,36.48
D101_heart,/path/to/D101_Heart_Cryptic_DB.db.psms.csv,/path/to/D101_Heart_Uniprot_DB.db.psms.csv,3.64,3.25

The cryptic_decoy_score and uniprot_decoy_score columns are the -10lgP decoy thresholds from the respective PEAKS searches.

Running post-MS analysis

pixi run nextflow run . -profile pixi,monash \
    --step post_ms \
    --post_ms_input       post_ms_samplesheet.csv \
    --uniprot_fasta       /path/to/uniprotkb_human_canonical_isoform.fasta \
    --transcriptome_fasta /path/to/D122_liver_transcriptome.fa \
    --prefix_tracking     /path/to/prefix.tracking \
    --outdir              results_post_ms

The --transcriptome_fasta and --prefix_tracking files are outputs from the DB construction step (gffread and gffcompare respectively). You can find them in the pipeline output directory from the previous run.

Post-MS output

results_post_ms/
├── post_ms/
│   ├── phase1/
│   │   ├── db_compare/     Phase 1 cryptic-only peptide lists + plots
│   │   └── origins/        Phase 1 origins (simple mode) — discard + unconventional lists
│   └── phase2/
│       ├── db_compare/     Phase 2 refined unambiguous unconventional peptides
│       └── origins/        Phase 2 origins (full Ensembl annotation)
└── pipeline_info/

Output

results/
├── star/                    STAR alignment BAMs + .Log.final.out
├── samtools/                sorted/indexed BAMs
├── rseqc/                   strandedness inference reports
├── stringtie/               assembled transcript GTFs
├── gffcompare/              prefix.combined.gtf, prefix.tracking, .stats
├── gatk4/                   Mutect2 / BQSR / contamination tables
├── revert/                  alt-reference FASTAs (only when --include_variant_peptides=true)
├── gff3sort/                sorted assembly GTF
├── tt/                      triple_translate per-branch peptide FASTAs
├── squish/
│   └── <sample>_cryptic.fasta   ← THE DELIVERABLE
├── multiqc/
│   └── multiqc_report.html      aggregated QC report
└── pipeline_info/
    ├── execution_report_<timestamp>.html
    ├── execution_timeline_<timestamp>.html
    └── pipeline_dag_<timestamp>.html

Profiles

Profile Purpose
test Use the chr22 test bundle (built by scripts/build_test_bundle.sh)
pixi Run every process from the local pixi env, no containers
singularity / apptainer Pull biocontainers via singularity/apptainer (HPC default)
docker Pull biocontainers via docker (laptop / cloud / CI default)
monash SLURM executor on the Monash M3 comp partition under the xy86 project, with shared singularity cache

Architecture

sanjaysgk/ipg/
├── main.nf                              entry point
├── workflows/ipg.nf                     main workflow — chains the 6 subworkflows
├── subworkflows/local/
│   ├── align_qc/                        steps 1-3
│   ├── transcript_assembly/             steps 4-5
│   ├── bam_prep/                        steps 6-12
│   ├── bqsr/                            steps 13-16
│   ├── mutect_calling/                  steps 17-23
│   ├── db_construct/                    steps 24-31 (branches on --include_variant_peptides)
│   └── post_ms_analysis/               --step post_ms: 2-phase db_compare + origins
├── modules/
│   ├── nf-core/                         23 upstream nf-core modules (STAR, samtools, GATK4, etc.)
│   └── local/                           10 local modules:
│       ├── curate_vcf/                  IPG custom C tool (kescull)
│       ├── revert_headers/              IPG custom C tool (kescull)
│       ├── alt_liftover/                IPG custom C tool (kescull)
│       ├── triple_translate/            IPG custom C tool (kescull)
│       ├── squish/                      IPG custom C tool (kescull)
│       ├── origins/                     IPG custom C tool — peptide origin annotation (kescull)
│       ├── db_compare/                  R script — cryptic vs UniProt PSM comparison (kescull)
│       ├── gff3sort/                    bioconda gff3sort wrapper
│       ├── gatk4_validatesamfile/       missing-from-upstream wrapper
│       └── gatk4_fastaalternatereferencemaker/  missing-from-upstream wrapper
├── containers/
│   └── ipg-tools/                       Reproducible Docker build of the kescull C tools
│                                        from a pinned commit SHA → ghcr.io/sanjaysgk/ipg-tools
├── conf/
│   ├── base.config
│   ├── modules.config                   per-process ext.args / ext.prefix / errorStrategy
│   ├── test.config                      chr22 test bundle paths
│   └── monash.config                    Monash M3 SLURM
├── bin/                                 runtime scripts staged onto PATH by modules
├── scripts/
│   └── build_test_bundle.sh             reproducible chr22 test bundle builder (dev/CI helper)
├── tests/                               nf-test pipeline-level tests
├── pixi.toml                            dev env definition (committed)
└── pixi.lock                            dev env lockfile (committed, bit-reproducible)

Authors

  • Sanjay SG Krishna (@sanjaysgk) — pipeline port, Li Lab, Monash University
  • Kate Scull — original IPG method + custom C tools, Purcell Lab, Monash University (see kescull/immunopeptidogenomics)
  • Chen Li — supervision, Li Lab, Monash University
  • Anthony W. Purcell — supervision, Purcell Lab, Monash University

Citation

If you use sanjaysgk/ipg in your research, please cite the original method paper:

Scull KE, Pandey K, Ramarathinam SH, Purcell AW. > Immunopeptidogenomics: harnessing RNA-seq to illuminate the dark immunopeptidome. Mol Cell Proteomics. 2021;20:100143. doi: 10.1016/j.mcpro.2021.100143

A full reference list for every tool in the pipeline lives in CITATIONS.md.

This pipeline is built on top of Nextflow and the nf-core framework:

The nf-core framework for community-curated bioinformatics pipelines. Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x.

License

MIT — see LICENSE.