CheckU evaluates bacterial and archaeal genomes with the UNI56 universal single-copy marker set. The program reads amino acid FASTA files or nucleotide assemblies, calls genes with Pyrodigal when needed, and scores markers with PyHMMER. Results include raw completeness, calibrated completeness, contamination, and per-marker hit tables.
- FASTA inputs in plain or gzip form (`.faa`, `.fa`, `.fna`, and friends)
Make sure you have Pixi installed:
curl -fsSL https://pixi.sh/install.sh | sh

Install CheckU with Pixi:
pixi global install \
-c conda-forge \
-c bioconda \
-c https://repo.prefix.dev/astrogenomics \
checku

Small test data sets ship with CheckU. After installation you can confirm the pipeline by running:

checku test

See the Expected Results section below for the expected output tables.

pip install checku

If you want to download the code and develop locally:
git clone https://github.com/juanvillada/checku
cd checku
pixi install

checku --help

If you are running from the repository with Pixi:

pixi run python -m checku --help

You should see the command line help without errors.
- Provide either a single FASTA file or a directory of FASTA files.
- Protein files are used as-is. Nucleotide files trigger Pyrodigal gene calls.
- Compressed files (`.gz`) are supported; they are unpacked into the run workspace.
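The input rules above depend on telling protein FASTA from nucleotide FASTA. The sketch below shows one common heuristic for that decision: sample the first residues and measure the fraction that are nucleotide letters. It is not taken from the CheckU source. The function names, the 10,000-residue sample, and the 0.9 threshold are all assumptions made for illustration, and CheckU's own detection may work differently.

```python
import gzip
from pathlib import Path

# Letters treated as nucleotide symbols in this illustrative heuristic.
NUCLEOTIDE_LETTERS = set("ACGTUNacgtun")


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def guess_sequence_type(path, max_residues=10_000, threshold=0.9):
    """Return 'nucleotide' or 'protein' based on the first residues of a FASTA file.

    Hypothetical helper: the sample size and threshold are assumptions.
    """
    seen = 0
    nucleotide = 0
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip header lines
            for char in line.strip():
                if not char.isalpha():
                    continue  # ignore gaps, stop symbols, and digits
                seen += 1
                nucleotide += char in NUCLEOTIDE_LETTERS
            if seen >= max_residues:
                break
    if seen == 0:
        raise ValueError(f"no sequence residues found in {path}")
    return "nucleotide" if nucleotide / seen >= threshold else "protein"
```

A fraction-based test tolerates a few ambiguity codes in assemblies, while protein files, which use the full amino acid alphabet, fall well below the threshold.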
If you are running from the repository with Pixi, replace `checku` below with `pixi run python -m checku`.
The examples below use the bundled test data from a source checkout. Replace the
paths with your own FASTA inputs, or run `checku test` after installation.
The diagram below shows the main stages executed by CheckU.
graph TD
A([Start run]) --> B[Collect FASTA inputs from file or directory]
B --> C[Materialize gzipped files under `work/` when needed]
C --> D{Detect sequence type}
D -->|Protein| E[Use supplied protein FASTA]
D -->|Nucleotide| F[Predict proteins with Pyrodigal]
F --> E
E --> G[Search UNI56 HMMs with pyhmmer]
G --> H[Aggregate marker hits and completeness statistics]
H --> I[Write `checku_summary.tsv`]
H --> J[Write `details/checku_presence.tsv`]
H --> K[Write raw hit tables in `details/hits/`]
H --> L[Update checkpoint data and logs]
H -.-> M[Optional: delete predicted proteins when `--clean-intermediate`]
I --> N([Pipeline complete])
J --> N
K --> N
L --> N
M --> N
checku run \
checku/data/test_genomes/faa/IMGI2140918011.faa \
--output-dir tmp/proteome_example \
--cpus 4

checku run \
checku/data/test_genomes/faa \
--output-dir tmp/proteome_batch \
--cpus 8

For directories containing many bins, you can parallelize across genomes:
checku run \
bins_fna/ \
--output-dir tmp/high_throughput_batch \
--cpus 16 \
--genome-workers 16 \
--clean-intermediate

`--cpus` is the total CPU budget for the run. When `--genome-workers` is greater
than 1, CheckU divides that budget across concurrent genome workers and
pyhmmer threads per worker. For large directories of MAGs, this is usually much
faster than processing one genome at a time.
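To make the budget arithmetic concrete, the sketch below shows one plausible way to split a total CPU budget between genome workers and per-worker threads. The exact policy CheckU applies is not documented here. The even integer split, the cap on workers, and the function name are assumptions.

```python
def split_cpu_budget(cpus, genome_workers):
    """Split a total CPU budget into (workers, threads_per_worker).

    Hypothetical policy: never run more workers than CPUs, and give each
    worker an equal integer share of the budget (at least one thread).
    """
    if cpus < 1 or genome_workers < 1:
        raise ValueError("cpus and genome_workers must be positive")
    workers = min(genome_workers, cpus)
    threads_per_worker = max(1, cpus // workers)
    return workers, threads_per_worker


print(split_cpu_budget(16, 16))  # 16 workers, 1 thread each
print(split_cpu_budget(16, 4))   # 4 workers, 4 threads each
```

Under this assumed policy, `--cpus 16 --genome-workers 16` favors throughput across many small genomes, while fewer workers leave more threads for each HMM search.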
checku run \
checku/data/test_genomes/fna/IMG2140918011.fna \
--output-dir tmp/assembly_example \
--cpus 4 \
--clean-intermediate

Use `--clean-intermediate` if you do not need the predicted protein FASTA after the run.
Every checku_summary.tsv now contains both:
- `completeness`: the raw UNI56 marker fraction (`markers_detected / 56 * 100`)
- `completeness_calibrated`: a bundled residual-corrected estimate (CheckU-Cal)
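The raw completeness formula can be written as a small function. This is a minimal sketch of the arithmetic stated above, not CheckU's own code. The function name and the rounding to two decimals (chosen to match the precision of the expected-results tables) are assumptions.

```python
UNI56_MARKER_COUNT = 56  # size of the UNI56 marker set


def raw_completeness(markers_detected, total_markers=UNI56_MARKER_COUNT):
    """Return raw completeness in percent: markers_detected / total * 100."""
    if not 0 <= markers_detected <= total_markers:
        raise ValueError("markers_detected must be between 0 and total_markers")
    return round(markers_detected / total_markers * 100, 2)


print(raw_completeness(55))  # 98.21
print(raw_completeness(36))  # 64.29
```

The calibrated value is not reproduced here because it depends on the bundled calibration table.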
The bundled calibration table is used automatically. If you have genome-level metadata available, you can refine the lookup with:
- `--calibration-metadata <metadata.tsv>`: optional TSV/CSV with `genome_id` plus either `taxonomy_group` or GTDB fields such as `classification_gtdbtk`, `gtdbtk_domain`, and `gtdbtk_phylum`
- `--calibration-table <table.tsv>`: override the bundled manuscript-derived calibration table
When GTDB-style phylum metadata are provided, CheckU-Cal now uses exact phylum-specific residual tables where the shredded benchmark supports them. If an exact phylum/bin combination is too sparse, CheckU falls back to a coarse phylum grouping and then to domain-wide/global residual tables. This keeps the software-facing calibration more specific than the manuscript display groups while retaining safe fallbacks for poorly represented taxa.
The genome_id column must match the CheckU genome identifier, which is derived from the input filename stem. For example, bin_001.fna.gz becomes bin_001.
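When building a metadata file, it helps to derive identifiers the same way. The sketch below reproduces the example above under one assumption: an optional `.gz` suffix is dropped, then exactly one further extension is removed. The function name is hypothetical, and CheckU's own rule may handle unusual filenames differently.

```python
from pathlib import Path


def genome_id_from_path(path):
    """Derive a genome identifier from a FASTA filename.

    Assumed rule: strip an optional '.gz', then strip one file extension.
    """
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    return Path(name).stem


print(genome_id_from_path("bins/bin_001.fna.gz"))    # bin_001
print(genome_id_from_path("SXGCA_000019745.1.faa"))  # SXGCA_000019745.1
```

Removing only the last extension keeps version suffixes such as `.1` in accession-style names, which is consistent with the identifiers shown in the expected-results tables.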
Minimal example:
genome_id gtdbtk_domain gtdbtk_phylum classification_gtdbtk
bin_001 d__Bacteria p__Planctomycetota d__Bacteria;p__Planctomycetota;c__Planctomycetes;o__Planctomycetales;f__Planctomycetaceae;g__Planctomyces;s__
bin_002 d__Archaea p__Thermoproteota d__Archaea;p__Thermoproteota;c__Nitrososphaeria;o__Nitrososphaerales;f__Nitrososphaeraceae;g__Nitrososphaera;s__
bin_003 d__Bacteria p__Bacillota_A d__Bacteria;p__Bacillota_A;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;g__Roseburia;s__

If you already maintain a curated calibration stratum, you can also provide `taxonomy_group` directly:
genome_id domain taxonomy_group
bin_001 Bacteria Planctomycetota
bin_002 Archaea Thermoproteota
bin_003 Bacteria Bacillota_A

Accepted metadata columns:
- genome id: `genome_id`, `record_id`, `taxon_oid`, `genome`, `name`
- domain: `domain`, `gtdbtk_domain`, `gtdb_domain`
- taxonomy group: `taxonomy_group`
- phylum: `gtdbtk_phylum`, `gtdbtk__phylum`, `gtdb_phylum`, `phylum_gtdb`, `phylum`
- full GTDB classification: `classification_gtdbtk`, `classification`, `gtdb_classification`, `gtdb_taxonomy`
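As an illustration of how these column aliases and GTDB strings might be consumed, the sketch below picks the first non-empty alias from a metadata row and extracts domain and phylum from a GTDB classification string. The helper names are hypothetical, and the alias priority (the order listed above) is an assumption rather than documented behavior.

```python
import csv
import io

GENOME_ID_ALIASES = ("genome_id", "record_id", "taxon_oid", "genome", "name")
CLASSIFICATION_ALIASES = (
    "classification_gtdbtk",
    "classification",
    "gtdb_classification",
    "gtdb_taxonomy",
)


def first_present(row, aliases):
    """Return the first non-empty value among the alias columns, else None."""
    for column in aliases:
        value = (row.get(column) or "").strip()
        if value:
            return value
    return None


def parse_gtdb_classification(classification):
    """Return (domain, phylum) from a 'd__...;p__...' GTDB string."""
    ranks = {}
    for field in classification.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and name:  # skip empty ranks such as 's__'
            ranks[prefix] = name
    return ranks.get("d"), ranks.get("p")


example = (
    "genome_id\tclassification_gtdbtk\n"
    "bin_002\td__Archaea;p__Thermoproteota;c__Nitrososphaeria;"
    "o__Nitrososphaerales;f__Nitrososphaeraceae;g__Nitrososphaera;s__\n"
)
for row in csv.DictReader(io.StringIO(example), delimiter="\t"):
    genome = first_present(row, GENOME_ID_ALIASES)
    domain, phylum = parse_gtdb_classification(
        first_present(row, CLASSIFICATION_ALIASES)
    )
    print(genome, domain, phylum)  # bin_002 Archaea Thermoproteota
```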
Resolution order:
- Use `taxonomy_group` directly if provided.
- Otherwise derive the exact phylum from the phylum or GTDB classification fields.
- If that exact phylum is not represented in the calibration table for the relevant completeness bin, fall back to the coarser group.
- If taxonomy is absent or unsupported, fall back to domain-wide and then global calibration.
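The resolution order above amounts to a fallback chain. The sketch below expresses it as a lookup over the strata available for a completeness bin. Everything in it is hypothetical: the function name, the representation of the calibration table as a set of `(stratum, bin)` pairs, and the bin labels are assumptions for illustration and do not reflect the layout of the bundled table.

```python
def resolve_stratum(available, completeness_bin, *, taxonomy_group=None,
                    phylum=None, coarse_group=None, domain=None):
    """Return the first stratum with calibration data for the given bin.

    `available` is a set of (stratum, completeness_bin) pairs standing in
    for the calibration table (hypothetical layout).
    """
    candidates = []
    if taxonomy_group:
        candidates.append(taxonomy_group)   # a curated stratum takes priority
    else:
        if phylum:
            candidates.append(phylum)       # exact phylum
        if coarse_group:
            candidates.append(coarse_group)  # coarser phylum grouping
    if domain:
        candidates.append(domain)           # domain-wide table
    candidates.append("global")             # last resort
    for stratum in candidates:
        if (stratum, completeness_bin) in available:
            return stratum
    return None


# Hypothetical table contents and bin labels.
available = {
    ("Planctomycetota", "90-100"),
    ("Bacteria", "90-100"),
    ("global", "90-100"),
    ("global", "50-60"),
}
print(resolve_stratum(available, "90-100", phylum="Planctomycetota", domain="Bacteria"))
print(resolve_stratum(available, "50-60", phylum="Planctomycetota", domain="Bacteria"))
```

In this sketch the first call resolves to the exact phylum, while the second falls through to the global table because no more specific stratum covers that bin.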
Example run with metadata:
checku run \
/path/to/mags \
--calibration-metadata metadata.tsv \
--output-dir tmp/checku_with_calibration \
--cpus 8

The summary and presence tables also report the calibration provenance:

- `calibration_domain`
- `calibration_taxonomy_group`
- `calibration_checku_bin`
- `calibration_n_train`
- The default marker file ships with CheckU (UNI56).
- Point `--hmm` to a different GA-calibrated `.hmm` file or to a directory that holds `.hmm` or `.hmm.gz` profiles.
- Every profile must define GA cutoffs. The run stops early if a profile is missing them or if names are duplicated.
Example:
checku run \
/path/to/genomes \
--hmm /path/to/custom_markers.hmm \
--output-dir tmp/custom_markers \
--cpus 8

All outputs live in the chosen `--output-dir`.

- `checku_summary.tsv`: per-genome summary with raw completeness, calibrated completeness, contamination, duplicate counts, calibration provenance, and Pyrodigal gene statistics.
- `details/checku_presence.tsv`: marker presence/absence matrix.
- `details/hits/*.tsv`: raw pyhmmer hits with domain scores.
- `checkpoint/checku_checkpoint.json`: resume data for interrupted runs.
- `logs/checku.log`: timestamps, command line, and status messages.
- Output tables and logs record input/output locations using absolute paths for reproducibility.
- Runs resume automatically when `--resume` is left on (default).
- Use `--no-resume` to start fresh; the older checkpoint is copied aside.
- Increase `--log-level` to `DEBUG` when you need extra detail.
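For downstream analysis, `checku_summary.tsv` can be read with the Python standard library. The sketch below assumes the file is tab-delimited with a header row and uses the column names shown in the expected-results tables. The helper names are hypothetical.

```python
import csv
from pathlib import Path


def load_summary(output_dir):
    """Read checku_summary.tsv from an output directory into a list of dicts."""
    summary_path = Path(output_dir) / "checku_summary.tsv"
    with open(summary_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def completeness_gap(row):
    """Difference between raw and calibrated completeness, in percentage points."""
    return round(
        float(row["completeness"]) - float(row["completeness_calibrated"]), 2
    )
```

For example, `[(r["genome_id"], completeness_gap(r)) for r in load_summary("tmp/proteome_batch")]` lists how far the calibration moved each estimate.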
Small test data sets ship with CheckU. After installation you can confirm the pipeline by running:
checku test

The command should finish without errors and produce the summary and presence tables described above.
If you are running from the repository with Pixi:
pixi run python -m checku test

The tables below summarize the expected `checku_summary.tsv` values for the bundled FAA and FNA test sets.
Absolute paths (input/protein columns in the real table) are omitted for privacy.
The real output table also includes calibration_domain, calibration_taxonomy_group,
calibration_checku_bin, and calibration_n_train.
FAA (protein inputs):
| genome_id | markers_detected | completeness | completeness_calibrated | duplicated_markers | contamination |
|---|---|---|---|---|---|
| IMGI2140918011 | 55 | 98.21 | 89.20 | 0 | 0.00 |
| IMGI2645727657 | 56 | 100.00 | 89.09 | 0 | 0.00 |
| IMGI651324087 | 56 | 100.00 | 89.09 | 0 | 0.00 |
| IMGM3300027739_BIN74 | 36 | 64.29 | 57.07 | 0 | 0.00 |
| SCISO2808607008 | 55 | 98.21 | 89.20 | 1 | 1.79 |
| SDISOGCA_003484685.1 | 47 | 83.93 | 70.34 | 1 | 1.79 |
| SHISO2654587767 | 55 | 98.21 | 89.20 | 1 | 1.79 |
| SLISOGCF_900639865.1 | 56 | 100.00 | 89.09 | 1 | 1.79 |
| SRISO640427127 | 52 | 92.86 | 81.67 | 0 | 0.00 |
| SXGCA_000019745.1 | 55 | 98.21 | 89.20 | 0 | 0.00 |
| SXGCA_902860225.1_Azoamicus_ciliaticola | 51 | 91.07 | 79.88 | 0 | 0.00 |
| SXISO642555114 | 54 | 96.43 | 87.42 | 1 | 1.79 |
FNA (nucleotide inputs with Pyrodigal):
| genome_id | markers_detected | completeness | completeness_calibrated | duplicated_markers | contamination | pyrodigal_genes | pyrodigal_contigs |
|---|---|---|---|---|---|---|---|
| IMG2140918011 | 56 | 100.00 | 89.09 | 0 | 0.00 | 2974 | 78 |
| IMG2645727657 | 56 | 100.00 | 89.09 | 0 | 0.00 | 1516 | 1 |
| IMG2645727657_HALF | 46 | 82.14 | 68.55 | 0 | 0.00 | 821 | 1 |
| IMG651324087 | 56 | 100.00 | 89.09 | 0 | 0.00 | 2572 | 73 |
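The percentage columns in the tables above can be cross-checked against the count columns. The sketch below recomputes them for a few FAA rows copied from the table. The completeness formula is the one stated earlier. The contamination formula (`duplicated_markers / 56 * 100`) is an inference from the table values (one duplicate corresponds to 1.79), not a documented definition.

```python
TOTAL_MARKERS = 56  # UNI56 marker count

# (genome_id, markers_detected, completeness, duplicated_markers, contamination)
# Rows copied from the FAA expected-results table.
FAA_ROWS = [
    ("IMGI2140918011", 55, 98.21, 0, 0.00),
    ("IMGM3300027739_BIN74", 36, 64.29, 0, 0.00),
    ("SDISOGCA_003484685.1", 47, 83.93, 1, 1.79),
    ("SXISO642555114", 54, 96.43, 1, 1.79),
]


def percent_of_markers(count, total=TOTAL_MARKERS):
    """Express a marker count as a percentage of the marker set."""
    return round(count / total * 100, 2)


def row_is_consistent(row, tolerance=0.005):
    """Check that both percentages match their counts within rounding."""
    _, detected, completeness, duplicated, contamination = row
    return (
        abs(percent_of_markers(detected) - completeness) <= tolerance
        and abs(percent_of_markers(duplicated) - contamination) <= tolerance
    )


for row in FAA_ROWS:
    print(row[0], row_is_consistent(row))
```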