vHold

Viral protein annotation using structural homology

vHold annotates viral proteins by searching structural databases, enabling functional annotation of divergent sequences where BLAST fails. Protein structure is 3-10x more conserved than sequence during evolution, making structure-based search essential for annotating rapidly-evolving viral proteins.

Why vHold?

Traditional sequence-based tools (BLAST, DIAMOND) lose sensitivity once sequence identity drops below ~30%, which affects 40-70% of viral proteins in metagenomic datasets. vHold addresses this by:

  1. Predicting protein structure from sequence using ProstT5
  2. Searching viral structure databases with Foldseek
  3. Transferring functional annotations from structural homologs

Installation

Requirements

  • Python 3.10+
  • 4 GB disk space for databases
  • GPU recommended (CPU works but is slower)

Install

git clone https://github.com/HandleyLab/vhold.git
cd vhold
pip install -e .

Install Foldseek

# conda
conda install -c conda-forge -c bioconda foldseek

# or download binary
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz
tar xzf foldseek-linux-avx2.tar.gz
export PATH="$(pwd)/foldseek/bin:$PATH"

Download Databases

vhold install                    # Download all databases (~1.1 GB)
vhold install --no-viro3d        # BFVD only (smaller)
vhold install -d /custom/path    # Custom location

Quick Start

# Annotate viral proteins
vhold run -i proteins.fasta -o results/ -t 4

# View results
cat results/vhold_results.tsv

Output Files

File                    Description
vhold_results.tsv       Main annotation table
vhold_summary.json      Statistics and distributions
vhold_dark_matter.tsv   Unannotated proteins for follow-up

Example

Input (proteins.fasta):

>protein_1
MKTIIALSYIFCLVFADYKDDDDK...
>protein_2
MSDKIIHLTDDSFDTDVLKADGAI...

Output (vhold_results.tsv):

query_id    description              confidence    category      primary_evalue
protein_1   Major capsid protein     high          structural    1.2e-45
protein_2   RNA-dependent RNA pol    medium        replication   3.4e-12
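
For downstream filtering, the results table can be loaded with Python's standard library. The sketch below assumes the file is tab-delimited with a header row and uses only the five columns shown in the example; the real file may contain more. The function names `load_results` and `filter_by_confidence` are illustrative and not part of vHold.

```python
import csv
import io

# Sample mirroring the example output above; a real run would open
# results/vhold_results.tsv instead of using this in-memory string.
SAMPLE_TSV = (
    "query_id\tdescription\tconfidence\tcategory\tprimary_evalue\n"
    "protein_1\tMajor capsid protein\thigh\tstructural\t1.2e-45\n"
    "protein_2\tRNA-dependent RNA pol\tmedium\treplication\t3.4e-12\n"
)


def load_results(handle):
    """Read a results table into a list of dicts, parsing the E-value as float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["primary_evalue"] = float(row["primary_evalue"])
        rows.append(row)
    return rows


def filter_by_confidence(rows, accepted=("high",)):
    """Keep only rows whose confidence level is in `accepted`."""
    return [r for r in rows if r["confidence"] in accepted]


rows = load_results(io.StringIO(SAMPLE_TSV))
high = filter_by_confidence(rows)
for r in high:
    print(r["query_id"], r["description"], r["primary_evalue"])
```

To read an actual run, replace the `io.StringIO(...)` handle with `open("results/vhold_results.tsv", newline="")`.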

Databases

vHold searches two curated viral structure databases:

Database   Structures   Description
BFVD       351,242      AlphaFold2 predictions of viral proteins
Viro3D     85,162       Curated structures from 4,400+ virus species

Pipeline Overview

Input FASTA
    |
    v
ProstT5 (sequence -> 3Di structure alphabet)
    |
    v
Foldseek (structural search against BFVD + Viro3D)
    |
    v
Consensus scoring (multi-database agreement)
    |
    v
Functional classification
    |
    v
Output: annotations + dark matter proteins

Advanced Usage

Two-Step Workflow

For cluster environments with separate GPU and CPU nodes:

# Step 1: GPU node - predict structures
vhold predict -i proteins.fasta -o predictions/ --device cuda

# Step 2: CPU node - search databases
vhold compare -p predictions/ -o results/ -t 32

Custom Parameters

# --evalue: stricter E-value threshold
# --sensitivity: Foldseek sensitivity (1-9.5)
# --threads: CPU threads
# --device: GPU for ProstT5
vhold run -i proteins.fasta -o results/ \
    --evalue 1e-5 \
    --sensitivity 9.5 \
    --threads 8 \
    --device cuda

Performance

Component   Hardware      Speed
ProstT5     GPU (V100)    ~1,000 proteins/hour
ProstT5     CPU           ~50 proteins/hour
Foldseek    8 CPU cores   ~10,000 proteins/hour

Memory: ~3 GB GPU or ~6 GB CPU for ProstT5
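
As a rough planning aid, the throughput figures above can be turned into a wall-clock estimate. This is a back-of-the-envelope sketch: it assumes the two stages run one after the other and that throughput scales linearly with the number of proteins, neither of which is guaranteed on real hardware. `estimate_hours` is an illustrative helper, not a vHold function.

```python
# Approximate throughputs in proteins/hour, copied from the table above.
PROSTT5_RATE = {"cuda": 1_000, "cpu": 50}  # "cuda" figure was measured on a V100
FOLDSEEK_RATE = 10_000                     # measured on 8 CPU cores


def estimate_hours(n_proteins, device="cuda"):
    """Estimate total hours, assuming prediction and search run sequentially."""
    predict_hours = n_proteins / PROSTT5_RATE[device]
    search_hours = n_proteins / FOLDSEEK_RATE
    return predict_hours + search_hours


for device in ("cuda", "cpu"):
    print(device, round(estimate_hours(5_000, device), 1), "hours")
```

For 5,000 proteins this gives about 5.5 hours with a GPU and about 100.5 hours on CPU alone, so ProstT5 prediction dominates the runtime in both cases.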

Confidence Levels

vHold assigns confidence based on E-value, sequence identity, and database agreement:

Level      Criteria
high       E-value < 1e-10, multi-database agreement
medium     E-value < 1e-5, single database or partial agreement
low        E-value < 1e-3, weak evidence
very_low   E-value >= 1e-3, use with caution
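
The thresholds in the table can be read as a cascade of checks. The sketch below is one possible reading and not vHold's actual implementation: it ignores sequence identity, represents database agreement as a simple count of databases that agree (a hypothetical parameter), assigns a hit with a very strong E-value but only one supporting database to medium, and treats an E-value of exactly 1e-3 as very_low.

```python
def assign_confidence(evalue, n_agreeing_databases):
    """Map an E-value and a database agreement count to a confidence level.

    Assumption: "multi-database agreement" means at least two databases
    support the same annotation.
    """
    if evalue < 1e-10 and n_agreeing_databases >= 2:
        return "high"
    if evalue < 1e-5:
        return "medium"
    if evalue < 1e-3:
        return "low"
    return "very_low"


print(assign_confidence(1.2e-45, 2))  # strong hit, both databases agree
print(assign_confidence(1.2e-45, 1))  # strong hit, single database
print(assign_confidence(1e-4, 2))     # weak evidence
print(assign_confidence(0.5, 1))      # above all thresholds
```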

Functional Categories

Proteins are classified into categories based on transferred annotations:

  • structural - capsid, envelope, spike, tail
  • replication - polymerase, helicase, primase
  • protease - proteases, peptidases
  • nuclease - endonuclease, integrase
  • packaging - terminase, portal
  • regulatory - repressor, activator
  • lysis - holin, endolysin
  • movement - plant virus movement proteins
  • unknown - no functional annotation
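
To illustrate how a transferred description could map onto these categories, here is a minimal keyword-matching sketch. Keyword matching is an assumption made for illustration; how vHold actually classifies annotations is not described here. The keywords are taken from the list above, and `classify` is a hypothetical helper.

```python
import re

# Keywords per category, taken from the list above. Dict order decides which
# category wins when a description matches more than one.
CATEGORY_KEYWORDS = {
    "structural": ("capsid", "envelope", "spike", "tail"),
    "replication": ("polymerase", "helicase", "primase"),
    "protease": ("protease", "peptidase"),
    "nuclease": ("endonuclease", "integrase"),
    "packaging": ("terminase", "portal"),
    "regulatory": ("repressor", "activator"),
    "lysis": ("holin", "endolysin"),
    "movement": ("movement",),
}


def classify(description):
    """Return the first category with a keyword starting a word in the text."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # \b anchors the keyword to a word start, so "holin" does not
            # match inside "choline".
            if re.search(r"\b" + re.escape(keyword), text):
                return category
    return "unknown"


for desc in ("Major capsid protein", "DNA polymerase", "Hypothetical protein"):
    print(desc, "->", classify(desc))
```

Abbreviated descriptions such as "RNA pol" would fall through to unknown under this scheme, which is one reason a real classifier likely needs more than literal keywords.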

Dark Matter Analysis

Proteins without confident annotations are flagged as "dark matter" for follow-up:

Category           Meaning
no_hits            No structural homologs found - potentially novel
unknown_function   Hits exist but function unknown
weak_hits          Low-confidence matches
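
The three dark matter categories can be thought of as a short decision procedure over a protein's hits. The sketch below is an assumption about how such a triage might work, not vHold's code: each hit is a dict with hypothetical `confidence` and `category` keys, and high and medium are assumed to be the levels that count as confident.

```python
def dark_matter_category(hits, confident_levels=("high", "medium")):
    """Return a dark matter label for a protein, or None if it is annotated.

    `hits` is a list of dicts with "confidence" and "category" keys
    (a hypothetical layout chosen for this example).
    """
    if not hits:
        return "no_hits"
    confident = [h for h in hits if h["confidence"] in confident_levels]
    if not confident:
        return "weak_hits"
    if all(h["category"] == "unknown" for h in confident):
        return "unknown_function"
    return None  # at least one confident hit with a known function


print(dark_matter_category([]))
print(dark_matter_category([{"confidence": "low", "category": "structural"}]))
print(dark_matter_category([{"confidence": "high", "category": "unknown"}]))
print(dark_matter_category([{"confidence": "high", "category": "structural"}]))
```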

Case Studies

See case_studies/ for worked examples:

  • SARS-CoV-2 - Pipeline validation with well-characterized proteome
  • Remote Homology - Annotation of divergent proteins at <30% sequence identity

Citation

If you use vHold in your research, please cite:

[Citation pending publication]

License

MIT License

Acknowledgments

vHold builds on:

  • ProstT5 - Protein language model for structure prediction
  • Foldseek - Fast structural search
  • BFVD - Big Fantastic Virus Database
  • Viro3D - Curated viral structures
