A Snakemake workflow for benchmarking metagenomic binning tools, developed for the Taxvamb paper.
This pipeline benchmarks the following binning tools:
Taxvamb (with multiple taxonomy classifiers and databases):
- Metabuli + GTDB
- MMseqs2 + GTDB / TrEMBL / Kalamari
- Centrifuge + NCBI RefSeq
- Kraken2 + NCBI RefSeq
Other binners:
- Vamb (default)
- MetaBAT2
- SemiBin2
- COMEBin
- MetaDecoder
All binners are assessed using CheckM2 and GUNC. Vamb and Taxvamb are evaluated both before and after applying reclustering.
Note: This workflow was originally developed and tested on the ESRUM cluster.
- Conda/Mamba - For environment management
- Singularity/Apptainer - Required for MetaBAT2 (runs as Docker image). See the Singularity installation guide
- Pixi - For building SemiBin and for reclustering
Clone the repository and create the conda environment:

```bash
git clone https://github.com/las02/taxvamb_paper_benchmarks
conda env create -n Benchmark_binners --file=taxvamb_paper_benchmarks/envs/benchmark_env.yaml
conda activate Benchmark_binners
```

Taxconverter must be installed in a separate conda environment named `taxconv`. See the Taxconverter documentation for installation instructions (link is to the commit used in the pipeline).
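As a minimal sketch only (the authoritative steps are in the Taxconverter documentation), setting up that separate environment could look like:

```bash
# Hypothetical sketch — follow the Taxconverter docs for the actual install steps
conda create -n taxconv python=3.10
conda activate taxconv
# ...install Taxconverter here as described in its documentation...
```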
The pipeline requires several databases. Install them and configure their paths in `config/config.yaml` (note: installing all databases requires roughly 1 TB of storage):
| Tool | Database Version | Description / Notes |
|---|---|---|
| Metabuli | GTDB v214.1 + T2T-CHM13v2.0 | Default database: Complete Genome/Chromosome, CheckM completeness > 90 and contamination < 5, plus the human genome (T2T-CHM13v2.0). Install using `metabuli databases` |
| MMseqs2 | GTDB v220 | Install using `mmseqs databases` |
| MMseqs2 | TrEMBL Release 2025_01 | Install using `mmseqs databases` |
| MMseqs2 | Kalamari (v3.7) | Install using `mmseqs databases` |
| Centrifuge | NCBI RefSeq Release 229 | See the Centrifuge manual under "Database download and index building" |
| Kraken2 | RefSeq (2024-12-28) | Pre-built index from Langmead AWS. Includes archaea, bacteria, viral, plasmid, human, and UniVec_Core |
| CheckM2 | Diamond db | Install using `checkm2 database --download` |
| GUNC | gunc db | Install using `gunc download_db` |
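As an illustration only (the key names below are hypothetical — check the shipped `config/config.yaml` for the real ones), the database section of the config might look like:

```yaml
# Hypothetical sketch of database paths in config/config.yaml —
# the actual key names are defined by the shipped config file
metabuli_db: /databases/metabuli/gtdb214
mmseqs_gtdb_db: /databases/mmseqs/gtdb220
centrifuge_db: /databases/centrifuge/refseq229
kraken2_db: /databases/kraken2/refseq_2024-12-28
checkm2_db: /databases/checkm2/uniref100.KO.1.dmnd
gunc_db: /databases/gunc/gunc_db_progenomes2.1.dmnd
```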
```bash
# Dry-run (preview what will be executed)
make benchmark_dryrun config=data_configs/example.tsv

# Run locally
make benchmark_run config=data_configs/example.tsv

# Run on SLURM cluster
make benchmark_run_slurm config=data_configs/example.tsv
```

The configuration file is a tab-separated file with three required columns:
| Column | Description |
|---|---|
| `sample` | Sample identifier (same for all BAM files from one sample) |
| `bamfile` | Path to BAM file from read mapping |
| `contig` | Path to concatenated contig file |
Example (`data_configs/example.tsv`):

```
sample  bamfile                     contig
test    test_data/bam/sample_0.bam  test_data/contigs/contigs.fasta
test    test_data/bam/sample_1.bam  test_data/contigs/contigs.fasta
```
⚠️ Important:
- Header names must be exactly: `sample`, `bamfile`, and `contig`
- The contig file path must be identical for all rows belonging to the same sample (a validation sketch follows after this list)
- Contig file: A FASTA file containing all contigs. For multi-sample datasets, concatenate all sample assemblies into a single file.
- BAM file(s): Alignment files from mapping short reads to the concatenated contig file.
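The two constraints above are easy to get wrong in larger configs. A small hedged sketch (not part of the pipeline) that checks them could look like:

```python
# Hypothetical helper (not part of the pipeline): validate a benchmark config TSV
import csv
import sys

def validate(path: str) -> None:
    contigs: dict[str, str] = {}  # sample -> contig path seen first
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        # Header names must be exactly: sample, bamfile, contig
        assert reader.fieldnames == ["sample", "bamfile", "contig"], \
            "header must be exactly: sample, bamfile, contig"
        for row in reader:
            # All rows of one sample must point to the same contig file
            prev = contigs.setdefault(row["sample"], row["contig"])
            assert prev == row["contig"], \
                f"sample {row['sample']} has inconsistent contig paths"

if __name__ == "__main__":
    validate(sys.argv[1])
    print("config OK")
```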
To generate the BAM files and the concatenated contig file from assemblies and reads, see the next section.
If you have raw reads and assemblies, use the mapping pipeline to generate BAM files:
```bash
# Dry-run
make map_dryrun config=<read_assembly_config>

# Run on SLURM
make map_run_slurm config=<read_assembly_config>
```

The read/assembly configuration file has the following format:

```
sample    read1                              read2                              contig
sample_1  path/to/sample_1/read1.fastq.gz    path/to/sample_1/read2.fastq.gz    path/to/sample_1/assembly.fasta
sample_2  path/to/sample_2/read1.fastq.gz    path/to/sample_2/read2.fastq.gz    path/to/sample_2/assembly.fasta
```

To run only specific tools, invoke Snakemake directly with target output files:
```bash
snakemake --snakefile snakefile.smk -c 100 \
    --software-deployment-method apptainer --use-conda \
    --config bam_contig=<config_file> <target_output_files>
```

For detailed Snakemake options, see the Snakemake documentation.
The main target rule (`all`) in `snakefile.smk` defines available outputs:
```python
rule all:
    input:
        checkm_semibin = expand(OUTDIR / "{key}/checkm2/semibin", key=sample_id.keys()),
        checkm_comebin = expand(OUTDIR / "{key}/checkm2/comebin", key=sample_id.keys()),
        checkm_metadecoder = expand(OUTDIR / "{key}/checkm2/metadecoder", key=sample_id.keys()),
        checkm_metabat = expand(OUTDIR / "{key}/checkm2/metabat", key=sample_id.keys()),
        checkm_default_vamb = expand(OUTDIR / "{key}/checkm2/default_vamb", key=sample_id.keys()),
        checkm2_taxvamb = expand(OUTDIR / "{key}/tmp/checkm.done", key=sample_id.keys()),
        gunc = expand(OUTDIR / "{key}/tmp/gunc.done", key=sample_id.keys()),
```

For a config file with `sample=test`, set the target file as:
```
snakemake ... test/checkm2/semibin
```

To run only GUNC:
- Set the target file to `{sample}/tmp/gunc.done`
- Edit `snakemake_modules/gunc.smk` and modify the `all_bin_dirs` variable to include only desired tools:
```python
all_bin_dirs = {
    "comebin": [OUTDIR / "{key}/comebin/comebin_res/comebin_res_bins", ".fa"],
    "semibin": [OUTDIR / "{key}/semibin/bins", ".fa"],
    # Comment out other tools...
}
```

To run CheckM2 only on the Taxvamb runs:
- Set the target file to `{sample}/tmp/checkm.done`
- Edit `snakemake_modules/run_checkm_on_all_taxvamb.smk` and modify `all_bin_dirs_clas`:
```python
all_bin_dirs_clas = {
    "run_taxvamb_gtdb_w_unknown": OUTDIR / "{key}/gtdb_taxvamb_default_w_unknown/vaevae_clusters_split.tsv",
    # Comment out other configurations...
}
```

Use the `all_reclustering` target and modify `all_bin_dirs_recluster` in `snakefile.smk`:
```bash
snakemake --snakefile snakefile.smk -c 100 \
    --config bam_contig=<config_file> all_reclustering
```

```python
all_bin_dirs_recluster = {
    "default_vamb": OUTDIR / "{key}/vamb_default",
    "run_taxvamb_gtdb_w_unknown": OUTDIR / "{key}/gtdb_taxvamb_default_w_unknown",
    # Comment out other binners...
}
```

All resources are configured in `config/config.yaml`. Per-rule resources override defaults:
```yaml
# Default resources (used when rule-specific values are not defined)
default_walltime: "48:00:00"
default_threads: 16
default_mem_gb: 50

# Rule-specific resources
spades:
  walltime: "15-00:00:00"
  threads: 16
  mem_gb: 60
```

Note: If specified resources exceed available hardware, they are automatically scaled down.
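Conceptually (this is a sketch of the lookup behaviour described above, not the pipeline's actual code), the resolution works like:

```python
# Sketch (assumption, not the pipeline's code): per-rule values fall back to
# the default_* keys, and thread requests are capped at the machine's CPU count
import os

def get_resource(config: dict, rule: str, key: str):
    value = config.get(rule, {}).get(key, config[f"default_{key}"])
    if key == "threads":
        value = min(value, os.cpu_count() or value)
    return value

# get_resource(config, "spades", "mem_gb")    -> 60 (rule-specific)
# get_resource(config, "spades", "walltime")  -> "15-00:00:00"
# get_resource(config, "metabat", "mem_gb")   -> 50 (falls back to default)
```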
To run Vamb/Taxvamb on GPU:
- Set `vamb_use_gpu: True` in `config/config.yaml`
- Add GPU partition settings to the relevant rules:
```yaml
run_taxvamb_kraken:
  walltime: "20-00:00:00"
  mem_gb: 500
  threads: 64
  gpu: " --partition=gpuqueue --gres=gpu:1 "
```

To run SemiBin on GPU:
- Set `semibin_use_gpu: True` in `config/config.yaml`
- The `semibinGPU` rule uses HPC-specific CUDA modules (`module load cuda/12.2`). You may need to adjust this in `snakemake_modules/semibin.smk` for your cluster.
For long-read data:
- Use SemiBin with the `--sequencing-type=long_read` flag
- Use the DBSCAN algorithm instead of k-means for reclustering
- Use minimap2 with `-ax map-hifi` for read mapping (see the sketch below)
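For the mapping step, a hedged example of a HiFi long-read mapping command (file names are placeholders) is:

```bash
# Map HiFi reads to the concatenated contigs and produce a sorted, indexed BAM
minimap2 -ax map-hifi contigs.fasta sample_0.fastq.gz \
    | samtools sort -o sample_0.bam -
samtools index sample_0.bam
```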
For Taxvamb/Vamb runs without the predictor, add the `--no_predictor` flag.
The following lists tools that crashed internally and alternative ways of running them, along with the resources used for running the tools.
For the benchmarking in figure 3 of the paper, the following four runs crashed internally.
SemiBin (v2.1.0) crashes on the Vaginal and Salvia samples.
- For the Vaginal sample, SemiBin does not find any bins in one of the samples, which causes downstream steps to crash (`/log_files_for_crashed_runs/Vaginal_SemiBin_v2.1.0.log`). SemiBin version 2.2.0 should fix this issue (https://github.com/BigDataBiology/SemiBin/releases/tag/v2.2.0) but, according to the patch notes, does not change the performance of the tool. We therefore ran SemiBin version 2.2.0 on this dataset, but still got the same error (`/log_files_for_crashed_runs/Vaginal_SemiBin_v2.2.0.log`). This is due to another bug where no bins are found when using the `--write-pre-reclustering-bins` flag; removing this flag (as we don't need the pre-reclustering bins for this sample) and upgrading to the newest version of SemiBin fixes the issue.
- For the Salvia dataset we get the error described in BigDataBiology/SemiBin#211 and BigDataBiology/SemiBin#201. There does not seem to be a fix for the issue, although the maintainer seems to be looking into it. See `/log_files_for_crashed_runs/Salvia_SemiBin.log` for the log files.
COMEBin (v1.0.3) crashes on the Human Gut (IBS) and Forest Soil samples.
In both datasets the error is the one described in the following GitHub issue: ziyewang/COMEBin#17.
The log files for these runs can be found in `/log_files_for_crashed_runs/Human_gut_IBS_ComeBin.log` and `/log_files_for_crashed_runs/Forest_soil_Comebin.log`.
Here we instead ran COMEBin in single-sample mode. In the pipeline, this is equivalent to assigning each read pair and its corresponding contig file a different sample name in the `sample` column of the config file, as in the example below.
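For illustration (paths are placeholders), a single-sample-mode config gives every row its own sample name:

```
sample  bamfile                   contig
s0      path/to/bam/sample_0.bam  path/to/contigs/contigs_0.fasta
s1      path/to/bam/sample_1.bam  path/to/contigs/contigs_1.fasta
```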
The pipeline includes Python scripts to convert taxonomy classifier outputs to a standardized Taxvamb-compatible format.
Converts Metabuli classification output to Taxvamb format.
Functionality:
- Parses the Metabuli report file to build a hierarchical taxonomy structure (taxID → full lineage)
- Converts classification file (contig → taxID) to full taxonomy lineage strings
- Outputs tab-separated format: `contigs\tpredictions`
Usage:
```bash
python format_metabuli.py <classification_file> <report_file> > formatted_taxonomy.tsv
```

Pipeline integration: Called in Metabuli-based Taxvamb runs (see `snakemake_modules/taxvamb_using_mmseqs_classifications.smk`)
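As an illustration (contig names, lineage values, and rank prefixes below are made up), the resulting `formatted_taxonomy.tsv` holds one lineage string per contig:

```
contigs     predictions
contig_1    k_Bacteria;p_Bacillota;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_Lactobacillus crispatus
contig_2    k_Bacteria
```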
Standardizes MMseqs2 taxonomy output from TrEMBL or Kalamari databases.
Functionality:
- Fixes missing taxonomic levels: Ensures all 7 standard ranks (k, p, c, o, f, g, s) are present by filling gaps with placeholder names (e.g., `LEVEL_4_ADDED_FROM_g_Escherichia`)
- Filters non-standard entries: Removes intermediate taxonomy entries prefixed with `-_` while preserving domain-level classifications and subspecies annotations
- Validates output: Confirms the correct number of taxonomy levels
Usage:
```bash
cut -f1,5 mmseqs_output.tsv > cut.tsv
echo -e "contigs\tpredictions" > formatted.tsv
python format_trembl_kalmari.py cut.tsv >> formatted.tsv
```

Pipeline integration: Used in `snakemake_modules/taxvamb_using_mmseqs_classifications.smk`:
- `run_taxvamb_kalmari`: Processes MMseqs2 Kalamari output (column 5)
- `run_taxvamb_trembl`: Processes MMseqs2 TrEMBL output (column 9)
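To illustrate the gap-filling (the lineages below are hypothetical, using the placeholder naming shown above), a lineage missing its order-level entry is padded back to seven ranks:

```
# Input lineage (order rank missing) — hypothetical:
k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli
# After gap filling — seven ranks, with a placeholder at level 4 (order):
k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;LEVEL_4_ADDED_FROM_g_Escherichia;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli
```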
Different taxonomy classifiers produce outputs in varying formats with inconsistent hierarchical structures. These scripts standardize outputs to ensure Taxvamb receives consistent, validated taxonomy data regardless of the upstream classifier.
Note: The Taxconverter tool handles Metabuli conversions, but does not support TrEMBL or Kalamari databases. These are included in this pipeline to assess how different annotations affect Taxvamb performance.