Viral protein annotation using structural homology
vHold annotates viral proteins by searching structural databases, enabling functional annotation of divergent sequences where BLAST fails. Protein structure is 3-10x more conserved than sequence during evolution, which makes structure-based search essential for annotating rapidly evolving viral proteins.
Traditional sequence-based tools (BLAST, DIAMOND) fail when sequence identity drops below ~30%. This affects 40-70% of viral proteins in metagenomic datasets. vHold solves this by:
- Predicting protein structure from sequence using ProstT5
- Searching viral structure databases with Foldseek
- Transferring functional annotations from structural homologs
Requirements:
- Python 3.10+
- 4 GB disk space for databases
- GPU recommended (CPU works but is slower)
git clone https://github.com/HandleyLab/vhold.git
cd vhold
pip install -e .
# conda
conda install -c conda-forge -c bioconda foldseek
# or download binary
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz
tar xzf foldseek-linux-avx2.tar.gz
export PATH="$(pwd)/foldseek/bin:$PATH"
vhold install # Download all databases (~1.1 GB)
vhold install --no-viro3d # BFVD only (smaller)
vhold install -d /custom/path # Custom location
# Annotate viral proteins
vhold run -i proteins.fasta -o results/ -t 4
# View results
cat results/vhold_results.tsv
| File | Description |
|---|---|
| `vhold_results.tsv` | Main annotation table |
| `vhold_summary.json` | Statistics and distributions |
| `vhold_dark_matter.tsv` | Unannotated proteins for follow-up |
Input (proteins.fasta):
>protein_1
MKTIIALSYIFCLVFADYKDDDDK...
>protein_2
MSDKIIHLTDDSFDTDVLKADGAI...
Output (vhold_results.tsv):
query_id description confidence category primary_evalue
protein_1 Major capsid protein high structural 1.2e-45
protein_2 RNA-dependent RNA pol medium replication 3.4e-12
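The results table can be post-processed with standard tooling. The following is a minimal sketch, assuming `vhold_results.tsv` is tab-separated and carries the header columns shown in the example output. The helper name `load_annotations` and the inline sample are illustrative and are not part of vHold.

```python
import csv
import io

# Illustrative sample mirroring the example output (assumed to be tab-separated).
SAMPLE_TSV = (
    "query_id\tdescription\tconfidence\tcategory\tprimary_evalue\n"
    "protein_1\tMajor capsid protein\thigh\tstructural\t1.2e-45\n"
    "protein_2\tRNA-dependent RNA pol\tmedium\treplication\t3.4e-12\n"
)


def load_annotations(handle, keep=("high", "medium")):
    """Read a vHold-style results table and keep rows at the given confidence levels."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["confidence"] in keep:
            # E-values arrive as text; convert so they can be sorted or compared.
            row["primary_evalue"] = float(row["primary_evalue"])
            rows.append(row)
    return rows


if __name__ == "__main__":
    # For a real run, pass open("results/vhold_results.tsv", newline="") instead.
    for hit in load_annotations(io.StringIO(SAMPLE_TSV)):
        print(hit["query_id"], hit["category"], hit["primary_evalue"])
```

Passing `keep=("high",)` restricts the output to the strongest annotations, which may be a sensible default when the table feeds downstream analyses.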
vHold searches two curated viral structure databases:
| Database | Structures | Description |
|---|---|---|
| BFVD | 351,242 | AlphaFold2 predictions of viral proteins |
| Viro3D | 85,162 | Curated structures from 4,400+ virus species |
Input FASTA
|
v
ProstT5 (sequence -> 3Di structure alphabet)
|
v
Foldseek (structural search against BFVD + Viro3D)
|
v
Consensus scoring (multi-database agreement)
|
v
Functional classification
|
v
Output: annotations + dark matter proteins
For cluster environments with separate GPU and CPU nodes:
# Step 1: GPU node - predict structures
vhold predict -i proteins.fasta -o predictions/ --device cuda
# Step 2: CPU node - search databases
vhold compare -p predictions/ -o results/ -t 32
# Run with custom parameters: stricter E-value threshold, Foldseek
# sensitivity (range 1-9.5), CPU threads, and GPU inference for ProstT5
vhold run -i proteins.fasta -o results/ \
  --evalue 1e-5 \
  --sensitivity 9.5 \
  --threads 8 \
  --device cuda
| Component | Hardware | Speed |
|---|---|---|
| ProstT5 | GPU (V100) | ~1,000 proteins/hour |
| ProstT5 | CPU | ~50 proteins/hour |
| Foldseek | 8 CPU cores | ~10,000 proteins/hour |
Memory: ~3 GB GPU or ~6 GB CPU for ProstT5
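To turn the throughput figures into a rough runtime estimate, the arithmetic can be sketched as below. This assumes the two stages run one after the other and that the quoted rates scale linearly with input size, neither of which is guaranteed. The function name `estimate_hours` is illustrative.

```python
# Approximate throughputs (proteins/hour) copied from the performance table.
PROSTT5_RATE = {"cuda": 1_000, "cpu": 50}
FOLDSEEK_RATE = 10_000  # quoted for 8 CPU cores


def estimate_hours(n_proteins, device="cuda"):
    """Rough wall-clock estimate: ProstT5 prediction followed by Foldseek search."""
    predict = n_proteins / PROSTT5_RATE[device]
    search = n_proteins / FOLDSEEK_RATE
    return predict + search


if __name__ == "__main__":
    for device in ("cuda", "cpu"):
        print(device, estimate_hours(10_000, device), "hours")
```

For 10,000 proteins this gives about 11 hours with a GPU and about 201 hours on CPU only, so the prediction stage dominates in both cases.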
vHold assigns confidence based on E-value, sequence identity, and database agreement:
| Level | Criteria |
|---|---|
| high | E-value < 1e-10, multi-database agreement |
| medium | E-value < 1e-5, single database or partial agreement |
| low | E-value < 1e-3, weak evidence |
| very_low | E-value > 1e-3, use with caution |
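The thresholds in the table can be read as a simple decision rule. The sketch below is one possible interpretation, not vHold's actual implementation: it assumes "multi-database agreement" means at least two databases support the annotation, treats an E-value of exactly 1e-3 as `very_low`, and ignores sequence identity because the table gives no thresholds for it.

```python
def confidence_level(evalue, n_databases_agreeing):
    """Map an E-value and database agreement to a confidence label (assumed rule)."""
    if evalue < 1e-10 and n_databases_agreeing >= 2:
        return "high"
    if evalue < 1e-5:
        # Includes strong hits supported by only a single database.
        return "medium"
    if evalue < 1e-3:
        return "low"
    return "very_low"


if __name__ == "__main__":
    print(confidence_level(1.2e-45, 2))  # strong hit, both databases agree
    print(confidence_level(3.4e-12, 1))  # strong hit, single database
```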
Proteins are classified into categories based on transferred annotations:
- structural - capsid, envelope, spike, tail
- replication - polymerase, helicase, primase
- protease - proteases, peptidases
- nuclease - endonuclease, integrase
- packaging - terminase, portal
- regulatory - repressor, activator
- lysis - holin, endolysin
- movement - plant virus movement proteins
- unknown - no functional annotation
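A keyword lookup is one simple way to picture how a transferred description could map onto these categories. The sketch below is an assumption for illustration only: vHold's real classifier may use different terms, priorities, or logic. The keywords are taken from the list above, and the names `CATEGORY_KEYWORDS` and `classify` are hypothetical.

```python
import re

# Keywords copied from the category list; the first matching category wins.
CATEGORY_KEYWORDS = {
    "structural": ["capsid", "envelope", "spike", "tail"],
    "replication": ["polymerase", "helicase", "primase"],
    "protease": ["protease", "peptidase"],
    "nuclease": ["endonuclease", "integrase"],
    "packaging": ["terminase", "portal"],
    "regulatory": ["repressor", "activator"],
    "lysis": ["holin", "endolysin"],
    "movement": ["movement"],
}


def classify(description):
    """Return the first category whose keyword starts a word in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # \b anchors the keyword at a word start, so "holin" does not match "choline".
            if re.search(r"\b" + re.escape(keyword), text):
                return category
    return "unknown"


if __name__ == "__main__":
    for name in ("Major capsid protein", "Large terminase subunit", "Hypothetical protein"):
        print(name, "->", classify(name))
```

Because the first match wins, a description that mentions several functions is assigned to whichever category appears earliest in the dictionary; a production classifier would likely need explicit priorities.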
Proteins without confident annotations are flagged as "dark matter" for follow-up:
| Category | Meaning |
|---|---|
| no_hits | No structural homologs found - potentially novel |
| unknown_function | Hits exist but function unknown |
| weak_hits | Low-confidence matches |
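The three dark-matter labels can be sketched as a small triage function. This is a hypothetical reading of the table: the inputs (hit count, confidence level, functional category) and the order in which the checks are applied are assumptions, and vHold may decide differently when a protein qualifies for more than one label.

```python
def dark_matter_category(n_hits, confidence, category):
    """Return a dark-matter label, or None if the protein is confidently annotated."""
    if n_hits == 0:
        return "no_hits"
    if confidence in ("low", "very_low"):
        # Assumed precedence: weak evidence is reported before unknown function.
        return "weak_hits"
    if category == "unknown":
        return "unknown_function"
    return None


if __name__ == "__main__":
    print(dark_matter_category(0, None, "unknown"))
    print(dark_matter_category(5, "high", "unknown"))
    print(dark_matter_category(3, "high", "structural"))
```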
See case_studies/ for worked examples:
- SARS-CoV-2 - Pipeline validation with well-characterized proteome
- Remote Homology - Annotation of divergent proteins at <30% sequence identity
If you use vHold in your research, please cite:
[Citation pending publication]
MIT License
vHold builds on: