Kalamari is a database of completed and public assemblies, backed by trusted institutions. These assemblies can then be formatted into databases for tools such as Kraken or BLAST.
Using completed assemblies means that you do not have to worry about the database itself being contaminated with "rogue" contigs. Additionally, most assemblies were obtained by subject matter experts (SMEs) at the Centers for Disease Control and Prevention (CDC). Those not from CDC come from other trusted institutions or projects such as FDA-ARGOS. Most genomes are from species that are either studied or are common contaminants in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
Kalamari also comes with a custom taxonomy database that, for example, defines Shigella as a subspecies of Escherichia coli and defines the four lineages of Listeria monocytogenes. These changes have been backed by trusted SMEs in EDLB.
To start using Kalamari, you'll need to complete the following steps:
- Export variables for NCBI API
- Install Kalamari dependencies
- Choose either Conda (preferred) or manual installation.
- Download the databases
- Build and filter the taxonomy directory
NCBI edirect requests run considerably more smoothly when the following environment variables are set:
- NCBI_API_KEY
- EMAIL
Follow these instructions to obtain an NCBI API key. This key associates your edirect requests with your username. Without it, edirect requests might be buggy. After obtaining an NCBI API key, add it to your environment with
export NCBI_API_KEY=unique_api_key
where unique_api_key is a unique hexadecimal number with characters from 0-9 and a-f.
You should also set your email address in the EMAIL environment variable, because edirect otherwise tries to guess it, which is an error-prone process.
Add this variable to your environment with
export EMAIL=my@email.address
using your own email address instead of my@email.address.
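Before running any edirect-based step, it can help to confirm that both variables are set. The following is a minimal sketch and not part of Kalamari: the function name `check_ncbi_env` is hypothetical, the key check only mirrors the character set described above (0-9 and a-f), and the email check is a loose shape test rather than real validation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Kalamari): sanity-check the two
# environment variables that edirect benefits from.
check_ncbi_env() {
  local status=0

  # Only the character set is checked; the key length is not stated above.
  case "${NCBI_API_KEY:-}" in
    "")
      echo "NCBI_API_KEY is not set" >&2
      status=1
      ;;
    *[!0-9a-f]*)
      echo "NCBI_API_KEY contains characters outside 0-9 and a-f" >&2
      status=1
      ;;
  esac

  # Loose shape check: the value must contain an @ sign.
  case "${EMAIL:-}" in
    "")
      echo "EMAIL is not set" >&2
      status=1
      ;;
    *@*)
      ;;
    *)
      echo "EMAIL does not look like an email address" >&2
      status=1
      ;;
  esac

  return "$status"
}
```

Calling `check_ncbi_env` returns 0 when both variables look plausible and prints a message to standard error for each problem it finds.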
- Create the Kalamari conda environment, then activate it.
conda create -n kalamari -c conda-forge -c bioconda kalamari
conda activate kalamari
When Kalamari is installed via conda, all scripts are placed on your $PATH, and the package data directory is installed inside the conda environment.
With the environment activated, run:
echo "$CONDA_PREFIX"
to see the location of the install and the directories containing the scripts, source files, etc.
- Download the databases.
This step downloads the reference genome FASTA files for the Kalamari database. Note that this step takes a while to complete.
The databases are downloaded using the information contained in src/chromosomes.tsv and src/plasmids.tsv.
These files represent the chromosome and plasmid databases, respectively.
To download both the chromosome and plasmid databases with default settings, run:
downloadKalamari.sh
To include incomplete assemblies, please see the download section under manual installation.
Files will output to: ${CONDA_PREFIX}/share/kalamari-<version>/kalamari/
For more control over database downloads when using a conda installation, such as selecting databases, specifying an output directory, or setting download parameters, see DOWNLOAD_PL.md.
- Build and filter the taxonomy directory
The taxonomy directory contains a locally generated NCBI taxonomy dump that incorporates Kalamari-specific modifications. It includes filtered nodes.dmp and names.dmp files representing only the TaxIDs present in the Kalamari database (and their ancestors). This taxonomy is used by downstream tools such as Kraken when building formatted databases.
buildTaxonomy.sh
filterTaxonomy.sh
The taxonomy directory will be located at: ${CONDA_PREFIX}/share/kalamari-<version>/taxonomy/
- Congrats! You are done! For instructions using Kalamari with Kraken, Sepia, BLAST, ANI, etc., see database formatting instructions.
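After the conda steps above, a quick check of the resulting layout can catch a step that was skipped or interrupted. This is a sketch under stated assumptions, not a Kalamari script: the function names are hypothetical, the paths follow the `share/kalamari-<version>/` layout given above, and because the exact location of nodes.dmp and names.dmp inside the taxonomy directory is not stated here, the sketch searches for them at any depth.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the versioned Kalamari data directory under a
# prefix (defaults to $CONDA_PREFIX), i.e. <prefix>/share/kalamari-<version>.
kalamari_share_dir() {
  local prefix="${1:-${CONDA_PREFIX:-}}"
  local dir
  for dir in "$prefix"/share/kalamari-*/; do
    # An unmatched glob stays literal, so confirm the directory exists.
    if [ -d "$dir" ]; then
      printf '%s\n' "${dir%/}"
      return 0
    fi
  done
  echo "no share/kalamari-<version> directory under '$prefix'" >&2
  return 1
}

# Hypothetical helper: report whether the outputs of the download and
# taxonomy steps appear to be present.
check_kalamari_install() {
  local share status=0 name
  share="$(kalamari_share_dir "${1:-}")" || return 1

  if [ -d "$share/kalamari" ]; then
    echo "found:   $share/kalamari"
  else
    echo "missing: $share/kalamari"
    status=1
  fi

  # Search at any depth below taxonomy/ (exact placement is an assumption).
  for name in nodes.dmp names.dmp; do
    if [ -n "$(find "$share/taxonomy" -type f -name "$name" 2>/dev/null | head -n 1)" ]; then
      echo "found:   $name under $share/taxonomy"
    else
      echo "missing: $name under $share/taxonomy"
      status=1
    fi
  done

  return "$status"
}
```

With the environment activated, `check_kalamari_install` uses `$CONDA_PREFIX`; passing a directory as the first argument checks that prefix instead.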
Manual installation is viable but less preferred.
- Clone this repo locally:
git clone https://github.com/lskatz/Kalamari.git
- Install dependencies:
  - Perl (5.x)
  - wget (or curl)
    - Debian/Ubuntu: apt-get install wget
  - NCBI Entrez Direct (edirect, esearch, etc.)
    - Install via your package manager
    - Debian/Ubuntu: apt install ncbi-entrez-direct
  - taxonkit
- Add the bin/ directory to your $PATH:
cd Kalamari
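export PATH="$PWD/bin:$PATH"
Confirm with:
which downloadKalamari.sh
To make this change persistent across sessions, add the export line to your shell profile (e.g., ~/.bashrc or ~/.zshrc), replacing $PWD with the absolute path to the cloned repository, since $PWD would otherwise expand to whatever directory the shell starts in.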
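Persisting the PATH change can be scripted so that the directory is written out as an absolute path and the line is not duplicated on repeated runs. The helper below is a hypothetical sketch, not something shipped with Kalamari; `persist_path` and its two arguments are names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append an export line for DIR to PROFILE unless that
# exact line is already present, so running it twice adds nothing new.
# Usage: persist_path DIR PROFILE
persist_path() {
  local dir="$1" profile="$2"
  # $dir is expanded now (absolute path); $PATH stays literal in the file.
  local line="export PATH=\"$dir:\$PATH\""

  touch "$profile"
  if ! grep -qxF -- "$line" "$profile"; then
    printf '%s\n' "$line" >> "$profile"
  fi
}
```

Run from inside the cloned Kalamari directory, a call such as `persist_path "$PWD/bin" "$HOME/.bashrc"` records the absolute location of bin/ at the time of the call.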
- Download the databases.
This step downloads the reference genome FASTA files for the Kalamari database. Note that this step takes a while to complete.
The databases are downloaded using the information contained in src/chromosomes.tsv and src/plasmids.tsv.
These files represent the chromosome and plasmid databases, respectively.
Optionally, you can include assemblies that are not complete (i.e., more than one contig per chromosome). These are listed in src/chromosomes-incomplete.tsv and are included when KALAMARI_EXPERIMENTAL is set, as shown below.
To download both the chromosome and plasmid databases with default settings, run:
downloadKalamari.sh
To include incomplete assemblies, set KALAMARI_EXPERIMENTAL in the environment.
You can either export it or include it when you execute it like so:
# either export it, then run the download
export KALAMARI_EXPERIMENTAL=1
downloadKalamari.sh
# or set it for a single invocation
KALAMARI_EXPERIMENTAL=1 downloadKalamari.sh
Files will output to: <Kalamari_cloned_repo>/share/kalamari-<version>/kalamari/
For more control over the database download such as selecting databases, specifying an output directory, or setting download parameters, see DOWNLOAD_PL.md.
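The two ways of setting KALAMARI_EXPERIMENTAL differ in scope, which matters if you later run a download in the same shell session. The sketch below illustrates this with a stand-in child process that only reports what it sees; it does not call downloadKalamari.sh, and the `probe` variable is a name introduced here for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for downloadKalamari.sh: a child shell that only reports whether
# it can see KALAMARI_EXPERIMENTAL. Nothing is downloaded.
probe='echo "KALAMARI_EXPERIMENTAL=${KALAMARI_EXPERIMENTAL:-<unset>}"'

unset KALAMARI_EXPERIMENTAL

# Per-command form: the variable exists only for this one child process.
KALAMARI_EXPERIMENTAL=1 sh -c "$probe"   # prints KALAMARI_EXPERIMENTAL=1

# The next child process does not see it.
sh -c "$probe"                           # prints KALAMARI_EXPERIMENTAL=<unset>

# Exported form: every later child process in this session sees it ...
export KALAMARI_EXPERIMENTAL=1
sh -c "$probe"                           # prints KALAMARI_EXPERIMENTAL=1

# ... until it is unset again.
unset KALAMARI_EXPERIMENTAL
sh -c "$probe"                           # prints KALAMARI_EXPERIMENTAL=<unset>
```

The per-command form is the safer default when you only occasionally want incomplete assemblies, because a forgotten export would silently affect later downloads.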
- Build and filter the taxonomy directory.
The taxonomy directory contains a locally generated NCBI taxonomy dump that incorporates Kalamari-specific modifications. It includes filtered nodes.dmp and names.dmp files representing only the TaxIDs present in the Kalamari database (and their ancestors). This taxonomy is used by downstream tools such as Kraken when building formatted databases.
buildTaxonomy.sh
filterTaxonomy.sh
The taxonomy directory will be located at: <Kalamari_cloned_repo>/share/kalamari-<version>/taxonomy/
- Congrats! You are done! For instructions using Kalamari with Kraken, Sepia, BLAST, ANI, etc., see database formatting instructions.
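To make concrete what "only the TaxIDs present in the Kalamari database (and their ancestors)" means for a filtered nodes.dmp, the sketch below keeps a set of TaxIDs plus every node on their paths to the root. It is an illustration of the idea only and is not the code in filterTaxonomy.sh. It assumes the standard NCBI dump layout (fields separated by tab, pipe, tab, with the TaxID in column 1 and the parent TaxID in column 2); the function name and the one-TaxID-per-line input file are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: this is NOT the code inside filterTaxonomy.sh.
# Usage: keep_with_ancestors TAXIDS_FILE NODES_DMP
#   TAXIDS_FILE  hypothetical file with one TaxID per line
#   NODES_DMP    nodes.dmp in the assumed standard NCBI layout
keep_with_ancestors() {
  local taxids_file="$1" nodes_file="$2"
  awk -F '\t[|]\t' '
    # Pass 1 (nodes.dmp): record the parent of every node.
    pass == 1 { parent[$1] = $2; next }

    # Pass 2 (TaxID list): mark each TaxID, then walk up parent links until
    # reaching the root (its own parent), an unknown ID, or a marked node.
    pass == 2 {
      id = $1
      while (id != "" && !(id in keep)) {
        keep[id] = 1
        if (!(id in parent) || parent[id] == id) break
        id = parent[id]
      }
      next
    }

    # Pass 3 (nodes.dmp again): print marked rows in their original order.
    pass == 3 && ($1 in keep)
  ' pass=1 "$nodes_file" pass=2 "$taxids_file" pass=3 "$nodes_file"
}
```

The output is a subset of the input rows, unchanged and in file order. A names.dmp file could be reduced the same way by keeping only rows whose first column is a retained TaxID.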
How to format and query databases
Please see CONTRIBUTING.md
Katz LS, Griswold T, Lindsey RL, Lauer AC, Im MS, Williams G, Halpin JL, Gómez GA, Kucerova Z, Morrison S, Page A, Den Bakker HC, Carleton HA. 2025. "Kalamari: a representative set of genomes of public health concern." Microbiol Resour Announc 14:e00963-24. https://doi.org/10.1128/mra.00963-24