A Snakemake workflow for benchmarking metagenomic binning tools, developed for the Taxvamb paper.
This pipeline benchmarks the following binning tools:
Taxvamb (with multiple taxonomy classifiers and databases):
- Metabuli + GTDB
- MMseqs2 + GTDB / TrEMBL / Kalamari
- Centrifuge + NCBI RefSeq
- Kraken2 + NCBI RefSeq
Other binners:
- Vamb (default)
- MetaBAT2
- SemiBin2
- COMEBin
- MetaDecoder
All binners are assessed using CheckM2 and GUNC. Vamb and Taxvamb are evaluated both before and after applying reclustering.
Note: This workflow was originally developed and tested on the ESRUM cluster.
- Conda/Mamba - For environment management
- Singularity/Apptainer - Required for MetaBAT2 (runs as Docker image). See the Singularity installation guide
- Pixi - For building SemiBin and for reclustering
Clone the repository and create the conda environment:

```bash
git clone https://github.com/las02/taxvamb_paper_benchmarks
conda env create -n Benchmark_binners --file=taxvamb_paper_benchmarks/envs/benchmark_env.yaml
conda activate Benchmark_binners
```

Taxconverter must be installed in a separate conda environment named `taxconv`. See the Taxconverter documentation for installation instructions (link is to the commit used in the pipeline).
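As a minimal sketch only (the authoritative steps are in the Taxconverter documentation), setting up that separate environment could look like:

```bash
# Hypothetical sketch — follow the Taxconverter docs for the actual install steps
conda create -n taxconv python=3.10
conda activate taxconv
# ...install Taxconverter here as described in its documentation...
```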
The pipeline requires several databases. Install them and configure their paths in `config/config.yaml` (note: installing all databases requires roughly 1 TB of storage):
| Tool | Database Version | Description / Notes |
|---|---|---|
| Metabuli | GTDB v214.1 + T2T-CHM13v2.0 | Default database: Complete Genome/Chromosome, CheckM completeness > 90 and contamination < 5, plus the human genome (T2T-CHM13v2.0). Install using `metabuli databases` |
| MMseqs2 | GTDB v220 | Install using `mmseqs databases` |
| MMseqs2 | TrEMBL Release 2025_01 | Install using `mmseqs databases` |
| MMseqs2 | Kalamari (v3.7) | Install using `mmseqs databases` |
| Centrifuge | NCBI RefSeq Release 229 | See the Centrifuge manual under "Database download and index building" |
| Kraken2 | RefSeq (2024-12-28) | Pre-built index from Langmead AWS. Includes archaea, bacteria, viral, plasmid, human, and UniVec_Core |
| CheckM2 | Diamond db | Install using `checkm2 database --download` |
| GUNC | gunc db | Install using `gunc download_db` |
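As an illustration only (the key names below are hypothetical — check the shipped `config/config.yaml` for the real ones), the database section of the config might look like:

```yaml
# Hypothetical sketch of database paths in config/config.yaml —
# the actual key names are defined by the shipped config file
metabuli_db: /databases/metabuli/gtdb214
mmseqs_gtdb_db: /databases/mmseqs/gtdb220
centrifuge_db: /databases/centrifuge/refseq229
kraken2_db: /databases/kraken2/refseq_2024-12-28
checkm2_db: /databases/checkm2/uniref100.KO.1.dmnd
gunc_db: /databases/gunc/gunc_db_progenomes2.1.dmnd
```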
```bash
# Dry-run (preview what will be executed)
make benchmark_dryrun config=data_configs/example.tsv

# Run locally
make benchmark_run config=data_configs/example.tsv

# Run on SLURM cluster
make benchmark_run_slurm config=data_configs/example.tsv
```

The configuration file is a tab-separated file with three required columns:
| Column | Description |
|---|---|
| `sample` | Sample identifier (same for all BAM files from one sample) |
| `bamfile` | Path to BAM file from read mapping |
| `contig` | Path to concatenated contig file |
Example (`data_configs/example.tsv`):

```
sample  bamfile                     contig
test    test_data/bam/sample_0.bam  test_data/contigs/contigs.fasta
test    test_data/bam/sample_1.bam  test_data/contigs/contigs.fasta
```
⚠️ Important:
- Header names must be exactly: `sample`, `bamfile`, and `contig`
- The contig file path must be identical for all rows belonging to the same sample (a validation sketch follows after this list)
- Contig file: A FASTA file containing all contigs. For multi-sample datasets, concatenate all sample assemblies into a single file.
- BAM file(s): Alignment files from mapping short reads to the concatenated contig file.
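The two constraints above are easy to get wrong in larger configs. A small hedged sketch (not part of the pipeline) that checks them could look like:

```python
# Hypothetical helper (not part of the pipeline): validate a benchmark config TSV
import csv
import sys

def validate(path: str) -> None:
    contigs: dict[str, str] = {}  # sample -> contig path seen first
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        # Header names must be exactly: sample, bamfile, contig
        assert reader.fieldnames == ["sample", "bamfile", "contig"], \
            "header must be exactly: sample, bamfile, contig"
        for row in reader:
            # All rows of one sample must point to the same contig file
            prev = contigs.setdefault(row["sample"], row["contig"])
            assert prev == row["contig"], \
                f"sample {row['sample']} has inconsistent contig paths"

if __name__ == "__main__":
    validate(sys.argv[1])
    print("config OK")
```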
To generate the BAM files and the concatenated contig file from assemblies and reads, see the next section.
If you have raw reads and assemblies, use the mapping pipeline to generate BAM files:
```bash
# Dry-run
make map_dryrun config=<read_assembly_config>

# Run on SLURM
make map_run_slurm config=<read_assembly_config>
```

The read/assembly configuration file has the following format:

```
sample    read1                              read2                              contig
sample_1  path/to/sample_1/read1.fastq.gz    path/to/sample_1/read2.fastq.gz    path/to/sample_1/assembly.fasta
sample_2  path/to/sample_2/read1.fastq.gz    path/to/sample_2/read2.fastq.gz    path/to/sample_2/assembly.fasta
```

To run only specific tools, invoke Snakemake directly with target output files:
```bash
snakemake --snakefile snakefile.smk -c 100 \
    --software-deployment-method apptainer --use-conda \
    --config bam_contig=<config_file> <target_output_files>
```

For detailed Snakemake options, see the Snakemake documentation.
The main target rule (`all`) in `snakefile.smk` defines available outputs:
```python
rule all:
    input:
        checkm_semibin = expand(OUTDIR / "{key}/checkm2/semibin", key=sample_id.keys()),
        checkm_comebin = expand(OUTDIR / "{key}/checkm2/comebin", key=sample_id.keys()),
        checkm_metadecoder = expand(OUTDIR / "{key}/checkm2/metadecoder", key=sample_id.keys()),
        checkm_metabat = expand(OUTDIR / "{key}/checkm2/metabat", key=sample_id.keys()),
        checkm_default_vamb = expand(OUTDIR / "{key}/checkm2/default_vamb", key=sample_id.keys()),
        checkm2_taxvamb = expand(OUTDIR / "{key}/tmp/checkm.done", key=sample_id.keys()),
        gunc = expand(OUTDIR / "{key}/tmp/gunc.done", key=sample_id.keys()),
```

For a config file with `sample=test`, set the target file as:
```
snakemake ... test/checkm2/semibin
```

To run only GUNC:
- Set the target file to `{sample}/tmp/gunc.done`
- Edit `snakemake_modules/gunc.smk` and modify the `all_bin_dirs` variable to include only desired tools:
```python
all_bin_dirs = {
    "comebin": [OUTDIR / "{key}/comebin/comebin_res/comebin_res_bins", ".fa"],
    "semibin": [OUTDIR / "{key}/semibin/bins", ".fa"],
    # Comment out other tools...
}
```

To run CheckM2 only on the Taxvamb runs:
- Set the target file to `{sample}/tmp/checkm.done`
- Edit `snakemake_modules/run_checkm_on_all_taxvamb.smk` and modify `all_bin_dirs_clas`:
```python
all_bin_dirs_clas = {
    "run_taxvamb_gtdb_w_unknown": OUTDIR / "{key}/gtdb_taxvamb_default_w_unknown/vaevae_clusters_split.tsv",
    # Comment out other configurations...
}
```

Use the `all_reclustering` target and modify `all_bin_dirs_recluster` in `snakefile.smk`:
```bash
snakemake --snakefile snakefile.smk -c 100 \
    --config bam_contig=<config_file> all_reclustering
```

```python
all_bin_dirs_recluster = {
    "default_vamb": OUTDIR / "{key}/vamb_default",
    "run_taxvamb_gtdb_w_unknown": OUTDIR / "{key}/gtdb_taxvamb_default_w_unknown",
    # Comment out other binners...
}
```

All resources are configured in `config/config.yaml`. Per-rule resources override defaults:
```yaml
# Default resources (used when rule-specific values are not defined)
default_walltime: "48:00:00"
default_threads: 16
default_mem_gb: 50

# Rule-specific resources
spades:
  walltime: "15-00:00:00"
  threads: 16
  mem_gb: 60
```

Note: If specified resources exceed available hardware, they are automatically scaled down.
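Conceptually (this is a sketch of the lookup behaviour described above, not the pipeline's actual code), the resolution works like:

```python
# Sketch (assumption, not the pipeline's code): per-rule values fall back to
# the default_* keys, and thread requests are capped at the machine's CPU count
import os

def get_resource(config: dict, rule: str, key: str):
    value = config.get(rule, {}).get(key, config[f"default_{key}"])
    if key == "threads":
        value = min(value, os.cpu_count() or value)
    return value

# get_resource(config, "spades", "mem_gb")    -> 60 (rule-specific)
# get_resource(config, "spades", "walltime")  -> "15-00:00:00"
# get_resource(config, "metabat", "mem_gb")   -> 50 (falls back to default)
```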
To run Vamb/Taxvamb on GPU:
- Set `vamb_use_gpu: True` in `config/config.yaml`
- Add GPU partition settings to the relevant rules:
```yaml
run_taxvamb_kraken:
  walltime: "20-00:00:00"
  mem_gb: 500
  threads: 64
  gpu: " --partition=gpuqueue --gres=gpu:1 "
```

To run SemiBin on GPU:
- Set `semibin_use_gpu: True` in `config/config.yaml`
- The `semibinGPU` rule uses HPC-specific CUDA modules (`module load cuda/12.2`). You may need to adjust this in `snakemake_modules/semibin.smk` for your cluster.
For long-read data:
- Use SemiBin with the `--sequencing-type=long_read` flag
- Use the DBSCAN algorithm instead of k-means for reclustering
- Use minimap2 with `-ax map-hifi` for read mapping (see the sketch below)
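For the mapping step, a hedged example of a HiFi long-read mapping command (file names are placeholders) is:

```bash
# Map HiFi reads to the concatenated contigs and produce a sorted, indexed BAM
minimap2 -ax map-hifi contigs.fasta sample_0.fastq.gz \
    | samtools sort -o sample_0.bam -
samtools index sample_0.bam
```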
For Taxvamb/Vamb runs without the predictor, add the `--no_predictor` flag.
The following lists tools that crashed internally and alternative ways of running them, along with the resources used for running the tools.
For the benchmarking in figure 3 of the paper, the following four runs crashed internally.
SemiBin (v2.1.0) crashes on the Vaginal and Salvia samples.
- For the Vaginal sample, SemiBin does not find any bins in one of the samples, which causes downstream steps to crash (`/log_files_for_crashed_runs/Vaginal_SemiBin_v2.1.0.log`). SemiBin version 2.2.0 should fix this issue (https://github.com/BigDataBiology/SemiBin/releases/tag/v2.2.0) but, according to the patch notes, does not change the performance of the tool. We therefore ran SemiBin version 2.2.0 on this dataset, but still got the same error (`/log_files_for_crashed_runs/Vaginal_SemiBin_v2.2.0.log`). This is due to another bug where no bins are found when using the `--write-pre-reclustering-bins` flag; removing this flag (as we don't need the pre-reclustering bins for this sample) and upgrading to the newest version of SemiBin fixes the issue.
- For the Salvia dataset we get the error described in BigDataBiology/SemiBin#211 and BigDataBiology/SemiBin#201. There does not seem to be a fix for the issue, although the maintainer seems to be looking into it. See `/log_files_for_crashed_runs/Salvia_SemiBin.log` for the log files.
COMEBin (v1.0.3) crashes on the Human Gut (IBS) and Forest Soil samples.
In both datasets the error is the one described in the following GitHub issue: ziyewang/COMEBin#17.
The log files for these runs can be found in `/log_files_for_crashed_runs/Human_gut_IBS_ComeBin.log` and `/log_files_for_crashed_runs/Forest_soil_Comebin.log`.
Here we instead ran COMEBin in single-sample mode. In the pipeline, this is equivalent to assigning each read pair and its corresponding contig file a different sample name in the `sample` column of the config file, as in the example below.
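For illustration (paths are placeholders), a single-sample-mode config gives every row its own sample name:

```
sample  bamfile                   contig
s0      path/to/bam/sample_0.bam  path/to/contigs/contigs_0.fasta
s1      path/to/bam/sample_1.bam  path/to/contigs/contigs_1.fasta
```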
The pipeline includes Python scripts to convert taxonomy classifier outputs to a standardized Taxvamb-compatible format.
Converts Metabuli classification output to Taxvamb format.
Functionality:
- Parses the Metabuli report file to build a hierarchical taxonomy structure (taxID → full lineage)
- Converts classification file (contig → taxID) to full taxonomy lineage strings
- Outputs tab-separated format: `contigs\tpredictions`
Usage:
```bash
python format_metabuli.py <classification_file> <report_file> > formatted_taxonomy.tsv
```

Pipeline integration: Called in Metabuli-based Taxvamb runs (see `snakemake_modules/taxvamb_using_mmseqs_classifications.smk`)
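As an illustration (contig names, lineage values, and rank prefixes below are made up), the resulting `formatted_taxonomy.tsv` holds one lineage string per contig:

```
contigs     predictions
contig_1    k_Bacteria;p_Bacillota;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_Lactobacillus crispatus
contig_2    k_Bacteria
```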
Standardizes MMseqs2 taxonomy output from TrEMBL or Kalamari databases.
Functionality:
- Fixes missing taxonomic levels: Ensures all 7 standard ranks (k, p, c, o, f, g, s) are present by filling gaps with placeholder names (e.g., `LEVEL_4_ADDED_FROM_g_Escherichia`)
- Filters non-standard entries: Removes intermediate taxonomy entries prefixed with `-_` while preserving domain-level classifications and subspecies annotations
- Validates output: Confirms the correct number of taxonomy levels
Usage:
```bash
cut -f1,5 mmseqs_output.tsv > cut.tsv
echo -e "contigs\tpredictions" > formatted.tsv
python format_trembl_kalmari.py cut.tsv >> formatted.tsv
```

Pipeline integration: Used in `snakemake_modules/taxvamb_using_mmseqs_classifications.smk`:
- `run_taxvamb_kalmari`: Processes MMseqs2 Kalamari output (column 5)
- `run_taxvamb_trembl`: Processes MMseqs2 TrEMBL output (column 9)
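To illustrate the gap-filling (the lineages below are hypothetical, using the placeholder naming shown above), a lineage missing its order-level entry is padded back to seven ranks:

```
# Input lineage (order rank missing) — hypothetical:
k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli
# After gap filling — seven ranks, with a placeholder at level 4 (order):
k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;LEVEL_4_ADDED_FROM_g_Escherichia;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli
```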
Different taxonomy classifiers produce outputs in varying formats with inconsistent hierarchical structures. These scripts standardize outputs to ensure Taxvamb receives consistent, validated taxonomy data regardless of the upstream classifier.
Note: The Taxconverter tool handles Metabuli conversions, but does not support TrEMBL or Kalamari databases. These are included in this pipeline to assess how different annotations affect Taxvamb performance.