
Taxvamb Benchmarking Pipeline

A Snakemake workflow for benchmarking metagenomic binning tools, developed for the Taxvamb paper.

Overview

This pipeline benchmarks the following binning tools:

Taxvamb (with multiple taxonomy classifiers and databases):

  • Metabuli + GTDB
  • MMseqs2 + GTDB / TrEMBL / Kalamari
  • Centrifuge + NCBI RefSeq
  • Kraken2 + NCBI RefSeq

Other binners:

  • Vamb (default)
  • MetaBAT2
  • SemiBin2
  • COMEBin
  • MetaDecoder

All binners are assessed using CheckM2 and GUNC. Vamb and TaxVamb are evaluated both before and after applying reclustering.

Note: This workflow was originally developed and tested on the ESRUM cluster.

Installation

Prerequisites

  1. Conda/Mamba - For environment management
  2. Singularity/Apptainer - Required for MetaBAT2 (which runs as a Docker image). See the Singularity installation guide
  3. Pixi - For building SemiBin and for reclustering

Setup

Clone the repository and create the conda environment:

git clone https://github.com/las02/taxvamb_paper_benchmarks
conda env create -n Benchmark_binners --file=taxvamb_paper_benchmarks/envs/benchmark_env.yaml
conda activate Benchmark_binners

Additional Dependencies

Taxconverter must be installed in a separate conda environment named taxconv. See the Taxconverter documentation for installation instructions (link is to the commit used in the pipeline).
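
A minimal sketch of that setup (the repository URL and commit are placeholders; follow the Taxconverter documentation for the exact install commands):

# Hedged sketch: create the taxconv environment and install Taxconverter into it
conda create -n taxconv python pip
conda activate taxconv
pip install git+<taxconverter-repo-url>@<commit>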

Database Requirements

The pipeline requires several databases. Install them and configure their paths in config/config.yaml (note: installing all databases requires roughly 1 TB of storage):

  • Metabuli: GTDB v214.1 + T2T-CHM13v2.0. Default database (Complete Genome/Chromosome, CheckM completeness > 90% and contamination < 5%) plus the human genome (T2T-CHM13v2.0). Install using metabuli databases.
  • MMseqs2: GTDB v220. Install using mmseqs databases.
  • MMseqs2: TrEMBL, release 2025_01. Install using mmseqs databases.
  • MMseqs2: Kalamari v3.7. Install using mmseqs databases.
  • Centrifuge: NCBI RefSeq, release 229. See the Centrifuge manual under "Database download and index building".
  • Kraken2: RefSeq (2024-12-28). Pre-built index from Langmead AWS; includes archaea, bacteria, viral, plasmid, human, and UniVec_Core.
  • CheckM2: DIAMOND database. Install using checkm2 database --download.
  • GUNC: GUNC database. Install using gunc download_db.
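
As a rough, hedged illustration of the install commands named above (database identifiers, paths, and arguments are assumptions and may differ between tool versions; check each tool's documentation):

# MMseqs2 databases (verify available database names with: mmseqs databases --help)
mmseqs databases GTDB gtdb_db tmp/
mmseqs databases UniProtKB/TrEMBL trembl_db tmp/
mmseqs databases Kalamari kalamari_db tmp/

# Metabuli pre-built database (exact arguments depend on the Metabuli version)
metabuli databases <DB_NAME> <OUTDIR> <TMPDIR>

# CheckM2 DIAMOND database
checkm2 database --download --path checkm2_db/

# GUNC database
gunc download_db gunc_db/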

Running the Workflow

Quick Start

# Dry-run (preview what will be executed)
make benchmark_dryrun config=data_configs/example.tsv

# Run locally
make benchmark_run config=data_configs/example.tsv

# Run on SLURM cluster
make benchmark_run_slurm config=data_configs/example.tsv

Input Configuration File

The configuration file is a tab-separated file with three required columns:

  • sample: Sample identifier (the same for all BAM files from one sample)
  • bamfile: Path to a BAM file from read mapping
  • contig: Path to the concatenated contig file

Example (data_configs/example.tsv):

sample              bamfile                        contig
test                test_data/bam/sample_0.bam     test_data/contigs/contigs.fasta
test                test_data/bam/sample_1.bam     test_data/contigs/contigs.fasta

⚠️ Important:

  • Header names must be exactly: sample, bamfile, and contig
  • The contig file path must be identical for all rows belonging to the same sample

The two input file types are:

  1. Contig file: A FASTA file containing all contigs. For multi-sample datasets, concatenate all sample assemblies into a single file.
  2. BAM file(s): Alignment files from mapping short reads to the concatenated contig file.

To generate the BAM files and the contig file from assemblies and raw reads, see the next section.

Generating BAM Files from Raw Reads and Assemblies

If you have raw reads and assemblies, use the mapping pipeline to generate BAM files:

# Dry-run
make map_dryrun config=<read_assembly_config>

# Run on SLURM
make map_run_slurm config=<read_assembly_config>

The read/assembly configuration file format is the following:

sample      read1                              read2                              contig
sample_1    path/to/sample_1/read1.fastq.gz    path/to/sample_1/read2.fastq.gz    path/to/sample_1/assembly.fasta
sample_2    path/to/sample_2/read1.fastq.gz    path/to/sample_2/read2.fastq.gz    path/to/sample_2/assembly.fasta

Running Specific Tools

To run only specific tools, invoke Snakemake directly with target output files:

snakemake --snakefile snakefile.smk -c 100 \
    --software-deployment-method apptainer --use-conda \
    --config bam_contig=<config_file> <target_output_files>

For detailed Snakemake options, see the Snakemake documentation.

Available Target Rules

The main target rule (all) in snakefile.smk defines available outputs:

rule all:
    input:
        checkm_semibin = expand(OUTDIR / "{key}/checkm2/semibin", key=sample_id.keys()),
        checkm_comebin = expand(OUTDIR / "{key}/checkm2/comebin", key=sample_id.keys()),
        checkm_metadecoder = expand(OUTDIR / "{key}/checkm2/metadecoder", key=sample_id.keys()),
        checkm_metabat = expand(OUTDIR / "{key}/checkm2/metabat", key=sample_id.keys()),
        checkm_default_vamb = expand(OUTDIR / "{key}/checkm2/default_vamb", key=sample_id.keys()),
        checkm2_taxvamb = expand(OUTDIR / "{key}/tmp/checkm.done", key=sample_id.keys()),
        gunc = expand(OUTDIR / "{key}/tmp/gunc.done", key=sample_id.keys()),

Example: Running SemiBin + CheckM2 Only

For a config file with sample=test, set the target file as:

snakemake ... test/checkm2/semibin
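
Put together, a full invocation might look like the following (a sketch assuming the example config; if your configured output directory is not the working directory, prefix the target with that path):

snakemake --snakefile snakefile.smk -c 100 \
    --software-deployment-method apptainer --use-conda \
    --config bam_contig=data_configs/example.tsv \
    test/checkm2/semibin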

Running GUNC for Specific Tools

  1. Set the target rule to {sample}/tmp/gunc.done
  2. Edit snakemake_modules/gunc.smk and modify the all_bin_dirs variable to include only desired tools:
all_bin_dirs = {
    "comebin": [OUTDIR / "{key}/comebin/comebin_res/comebin_res_bins", ".fa"],
    "semibin": [OUTDIR / "{key}/semibin/bins", ".fa"],
    # Comment out other tools...
}

Running Specific Taxvamb Configurations

  1. Set the target file to {sample}/tmp/checkm.done
  2. Edit snakemake_modules/run_checkm_on_all_taxvamb.smk and modify all_bin_dirs_clas:
all_bin_dirs_clas = {
    "run_taxvamb_gtdb_w_unknown": OUTDIR / "{key}/gtdb_taxvamb_default_w_unknown/vaevae_clusters_split.tsv",
    # Comment out other configurations...
}

Running Reclustering for Specific Binners

Use the all_reclustering target and modify all_bin_dirs_recluster in snakefile.smk:

snakemake --snakefile snakefile.smk -c 100 \
    --config bam_contig=<config_file> all_reclustering

all_bin_dirs_recluster = {
    "default_vamb": OUTDIR / "{key}/vamb_default",
    "run_taxvamb_gtdb_w_unknown": OUTDIR / "{key}/gtdb_taxvamb_default_w_unknown",
    # Comment out other binners...
}

Configuration

Resource Configuration

All resources are configured in config/config.yaml. Per-rule resources override defaults:

# Default resources (used when rule-specific values are not defined)
default_walltime: "48:00:00"
default_threads: 16
default_mem_gb: 50

# Rule-specific resources
spades:
  walltime: "15-00:00:00"
  threads: 16
  mem_gb: 60

Note: If specified resources exceed available hardware, they are automatically scaled down.

GPU Configuration

Taxvamb and Vamb

  1. Set vamb_use_gpu: True in config/config.yaml
  2. Add GPU partition settings to the relevant rules:
run_taxvamb_kraken:
  walltime: "20-00:00:00"
  mem_gb: 500
  threads: 64
  gpu: " --partition=gpuqueue --gres=gpu:1 "

SemiBin

  1. Set semibin_use_gpu: True in config/config.yaml
  2. The semibinGPU rule uses HPC-specific CUDA modules (module load cuda/12.2). You may need to adjust this in snakemake_modules/semibin.smk for your cluster.

Long-Read Datasets

For long-read data:

  • Use SemiBin with --sequencing-type=long_read flag
  • Use DBSCAN algorithm instead of k-means for reclustering
  • Use minimap2 with -ax map-hifi for read mapping (see the sketch below)
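
A hedged sketch of such a mapping step (file names and thread counts are placeholders):

# Map HiFi reads against the concatenated contigs and produce a sorted, indexed BAM
minimap2 -t 16 -ax map-hifi contigs.fasta sample_0.hifi.fastq.gz | \
    samtools sort -@ 8 -o sample_0.bam -
samtools index sample_0.bam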

Running Without Predictor

For Taxvamb/Vamb runs without the predictor, add the --no_predictor flag.
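
For instance, a hedged sketch (assuming the vamb bin taxvamb entry point; in this workflow the actual command lives inside the corresponding Snakemake rule, and the other options shown are placeholders):

# Append --no_predictor to the Taxvamb invocation in the relevant rule
vamb bin taxvamb [existing options from the rule] --no_predictor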


Tools that crashed internally, alternative ways of running them, and the resources used to run them

For the benchmarking in Figure 3 of the paper, the following four runs crashed internally.

Semibin

SemiBin (v2.1.0) crashes on the Vaginal and Salvia samples.

  • For the Vaginal sample, SemiBin does not find any bins in one of the samples, which causes downstream steps to crash (/log_files_for_crashed_runs/Vaginal_SemiBin_v2.1.0.log). SemiBin version 2.2.0 should fix this issue (https://github.com/BigDataBiology/SemiBin/releases/tag/v2.2.0) without changing the performance of the tool (according to the patch notes). We therefore ran SemiBin version 2.2.0 on this dataset but still got the same error (/log_files_for_crashed_runs/Vaginal_SemiBin_v2.2.0.log). This turned out to be another bug where no bins are found when the --write-pre-reclustering-bins flag is used; removing this flag (as we do not need the pre-reclustering bins for this sample) and upgrading to the newest version of SemiBin fixes the issue.
  • For the Salvia dataset we get the error described in BigDataBiology/SemiBin#211 and BigDataBiology/SemiBin#201. There does not appear to be a fix for the issue yet, although the maintainer seems to be looking into it. See /log_files_for_crashed_runs/Salvia_SemiBin.log for the log files.

Comebin

Comebin (v1.0.3) crashes on the Human Gut (IBS) and Forest Soil samples. In both datasets the error is the one described in the GitHub issue ziyewang/COMEBin#17; the log files for these runs can be found in /log_files_for_crashed_runs/Human_gut_IBS_ComeBin.log and /log_files_for_crashed_runs/Forest_soil_Comebin.log. For these datasets we instead ran Comebin in single-sample mode, which in this pipeline is equivalent to assigning each read pair and its corresponding contig file a different sample name in the sample column of the config file (see the illustration below).
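
As an illustration (file names are hypothetical), a single-sample-mode config assigns each BAM/contig pair its own sample name:

sample      bamfile                       contig
sample_0    path/to/bam/sample_0.bam      path/to/assemblies/assembly_0.fasta
sample_1    path/to/bam/sample_1.bam      path/to/assemblies/assembly_1.fasta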


Taxonomy Formatting Scripts

The pipeline includes Python scripts to convert taxonomy classifier outputs to a standardized Taxvamb-compatible format.

format_metabuli.py

Converts Metabuli classification output to Taxvamb format.

Functionality:

  • Parses the Metabuli report file to build a hierarchical taxonomy structure (taxID → full lineage)
  • Converts classification file (contig → taxID) to full taxonomy lineage strings
  • Outputs tab-separated format: contigs\tpredictions

Usage:

python format_metabuli.py <classification_file> <report_file> > formatted_taxonomy.tsv
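
The resulting file looks roughly like this (illustrative only; contig names and lineage strings depend on your data and on how the Metabuli report encodes ranks):

contigs     predictions
S1C100      Bacteria;Bacillota;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus crispatus
S1C101      Bacteria;Pseudomonadota;Gammaproteobacteria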

Pipeline integration: Called in Metabuli-based Taxvamb runs (see snakemake_modules/taxvamb_using_mmseqs_classifications.smk)

format_trembl_kalmari.py

Standardizes MMseqs2 taxonomy output from TrEMBL or Kalamari databases.

Functionality:

  • Fixes missing taxonomic levels: Ensures all 7 standard ranks (k, p, c, o, f, g, s) are present by filling gaps with placeholder names (e.g., LEVEL_4_ADDED_FROM_g_Escherichia)
  • Filters non-standard entries: Removes intermediate taxonomy entries prefixed with -_ while preserving domain-level classifications and subspecies annotations
  • Validates output: Confirms correct number of taxonomy levels

Usage (Kalamari example; for TrEMBL, cut column 9 instead of column 5):

cut -f1,5 mmseqs_output.tsv > cut.tsv
echo -e "contigs\tpredictions" > formatted.tsv
python format_trembl_kalmari.py cut.tsv >> formatted.tsv
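
As a hedged illustration of the gap-filling behaviour (the lineage is made up, and the separator and rank prefixes are assumed from the notes above):

Input lineage, order (level 4) missing:
k_Bacteria;p_Pseudomonadota;c_Gammaproteobacteria;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli

Output lineage, gap filled with a placeholder:
k_Bacteria;p_Pseudomonadota;c_Gammaproteobacteria;LEVEL_4_ADDED_FROM_g_Escherichia;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli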

Pipeline integration: Used in snakemake_modules/taxvamb_using_mmseqs_classifications.smk:

  • run_taxvamb_kalmari: Processes MMseqs2 Kalamari output (column 5)
  • run_taxvamb_trembl: Processes MMseqs2 TrEMBL output (column 9)

Why These Scripts Are Needed

Different taxonomy classifiers produce outputs in varying formats with inconsistent hierarchical structures. These scripts standardize outputs to ensure Taxvamb receives consistent, validated taxonomy data regardless of the upstream classifier.

Note: The Taxconverter tool handles Metabuli conversions, but does not support the TrEMBL or Kalamari databases. These databases are included in the pipeline to assess how different annotations affect Taxvamb performance.
