LucaVirus: Modeling the Evolutionary and Functional Landscape of Viruses with a Unified Genome-Protein Language Model
The OpenVirus datasets are now available on Hugging Face.
The corpus comprises 15.7 million non-redundant viral sequences (10.4 million nucleotide sequences and 5.2 million protein sequences). This large-scale dataset provides a robust foundation for learning viral evolutionary patterns and functional signatures.
You can access the datasets via the following links (note that the three links differ only in their suffixes):
OpenVirus (Full Corpus - Gene & Prot):
https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Gene-Prot
OpenVirus-Gene (Nucleotide only):
https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Gene
OpenVirus-Prot (Protein only):
https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Prot
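The corpora above can be loaded with the Hugging Face `datasets` library. A minimal sketch follows; the repository IDs come from the links above, but the `"train"` split name and streaming usage are assumptions about the dataset layout.

```python
# Repository IDs taken from the OpenVirus links above.
OPENVIRUS_REPOS = {
    "gene+prot": "LucaGroup/LucaVirus-OpenVirus-Gene-Prot",
    "gene": "LucaGroup/LucaVirus-OpenVirus-Gene",
    "prot": "LucaGroup/LucaVirus-OpenVirus-Prot",
}

def openvirus_repo_id(modality: str) -> str:
    """Map a modality ("gene", "prot", or "gene+prot") to its dataset repo ID."""
    return OPENVIRUS_REPOS[modality]

def stream_openvirus(modality: str = "prot"):
    """Stream records lazily instead of downloading the full 15.7M-sequence
    corpus up front. The "train" split name is an assumption."""
    from datasets import load_dataset  # requires `pip install datasets`
    return load_dataset(openvirus_repo_id(modality), split="train", streaming=True)
```

Streaming is worth the indirection here: with 25.4 billion tokens, materializing the whole corpus locally just to inspect a few records is rarely what you want.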
LucaVirus now supports the Hugging Face interface for further training.
It allows various training modes, including sequence-only inputs or injecting biological knowledge following the LucaVirus framework. You can fine-tune the model for both sequence-level and token-level classification or regression tasks.
See the Hugging Face collection at https://huggingface.co/collections/LucaGroup/lucavirus, or the `huggingface` branch of this repository.
- **Hugging Face Native**: Full support for `AutoModel`, `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoConfig`, and `AutoTokenizer`.
- **Unified Architecture**: Single model architecture handling multiple biological modalities.
- **Task-Specific Heads**:
- `LucaVirusModel`: For sequence embeddings.
- `LucaVirusForMaskedLM`: For pre-training and sequence recovery.
- `LucaVirusForSequenceClassification`: For sequence-level tasks (e.g., protein family, solubility, or promoter prediction).
- `LucaVirusForTokenClassification`: For residue-level tasks (e.g., secondary structure, binding sites, or post-translational modifications).
- **Extensible**: Easily adaptable to custom downstream tasks using the standard `transformers` API.
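The task-specific heads above map onto the standard `transformers` Auto classes. The sketch below shows one way to select the right class per task; the repo ID `"LucaGroup/LucaVirus"` is a placeholder (check the LucaGroup collection for the exact model name), and `trust_remote_code=True` is assumed to be needed for the custom architecture.

```python
# Mapping from task type to the `transformers` Auto class name,
# per the head list above.
TASK_TO_AUTOCLASS = {
    "embedding": "AutoModel",
    "masked_lm": "AutoModelForMaskedLM",
    "sequence_classification": "AutoModelForSequenceClassification",
    "token_classification": "AutoModelForTokenClassification",
}

def load_lucavirus(task: str, repo_id: str = "LucaGroup/LucaVirus"):
    """Load tokenizer and the LucaVirus head matching `task`.
    NOTE: `repo_id` is a placeholder, not a confirmed model name."""
    import transformers  # requires `pip install transformers`
    cls = getattr(transformers, TASK_TO_AUTOCLASS[task])
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        repo_id, trust_remote_code=True
    )
    model = cls.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

For a custom downstream task not covered by these heads, the same pattern applies: load the base model with `AutoModel` and attach your own head on top of the pooled or per-token embeddings.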
Hugging Face
https://huggingface.co/LucaGroup
On June 16, 2025, the preprint was released, and LucaVirus and LucaVirusTasks were open-sourced.
Fig. 1 The workflow of LucaVirus.
OpenVirus
We curated OpenVirus, a comprehensive, large-scale dataset of viral sequences used to train the LucaVirus model.
This dataset comprises 15.7 million viral sequences totaling 25.4 billion tokens, including 23.7 billion nucleotide tokens from 10.4 million sequences and 1.6 billion amino acid tokens from 5.2 million protein sequences.
Nucleotide sequences were primarily sourced from the NCBI Virus database and seven independent viral diversity studies (9, 20-24), ensuring inclusion of sequences not available in NCBI.
Protein sequences were obtained from the UniProtKB and MGnify databases.
The OpenVirus dataset covers all known viral taxa.
The major groups include double-stranded (ds) DNA viruses (27% of sequences), RNA viruses (26%), reverse-transcribing viruses (20%), single-stranded (ss) DNA viruses and others (6%), and unclassified viruses (21%).
The dsDNA, RNA, reverse-transcribing, and unclassified groups collectively account for 94% of the total sequence count.
The dataset includes viruses infecting hosts across all three domains and six kingdoms of cellular life: animals (48%), bacteria (25%), plants (12%), protists (2%), fungi (2%), and archaea (1%), as well as viruses with unknown hosts (22%).
LucaVirus employs a semi-supervised pre-training strategy, building on the framework established by LucaOne.
The model initializes its corresponding layers with weights derived from LucaOne’s latest training checkpoint at step 1,760,000.
The pre-training process integrates self-supervised masked language modeling (MLM) with seven biologically relevant supervised tasks to enhance the model’s ability to capture diverse biological features.
These tasks are categorized as follows:
Sequence-level classification tasks:
(i) Order taxonomy prediction for nucleotide sequences;
(ii) Order taxonomy prediction for protein sequences;
and (iii) UniProt functional keyword prediction for protein sequences.
Token-level classification tasks:
(i) Gene prediction for nucleotide sequences;
(ii) Protein homologous superfamily annotation;
(iii) Protein conserved domain annotation;
and (iv) Protein active-site prediction.
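The self-supervised MLM component of this multi-task setup can be sketched in a few lines. The 15% masking rate and 80/10/10 mask/random/keep split below are standard MLM practice (from BERT), assumed here rather than taken from the LucaVirus paper.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: select ~15% of positions as MLM targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    The 15% / 80-10-10 scheme is an assumption, not confirmed by the paper."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = ignored by cross-entropy loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # this position contributes to the MLM loss
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token (model must still predict it)
    return inputs, labels
```

In a multi-task run, the MLM loss on the corrupted inputs would simply be summed (possibly weighted) with the losses of the supervised sequence-level and token-level heads listed above.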
Fig. 2 LucaVirus learns interpretable representations of viral sequences that reflect genetic divergence.
2) Exploring the hidden diversity and functional proteins of viruses
Fig. 3 Exploring the hidden diversity and functional proteins of viruses.
Fig. 4 Fitting and predicting the fitness landscapes of a viral protein.
Fig. 5 Performance of LucaVirus in antibody-antigen binding prediction.
```bash
# CentOS / RHEL
sudo yum update
sudo yum install git-all

# Debian / Ubuntu
sudo apt-get update
sudo apt-get install git-all
```
```bash
# Install Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
sh Anaconda3-2022.05-Linux-x86_64.sh
source ~/.bashrc

# Create and activate the environment, then install dependencies
conda create -n lucavirus python=3.9.13
conda activate lucavirus
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
```bash
# Register the environment as a Jupyter kernel
conda activate lucavirus
conda install ipykernel
python -m ipykernel install --user --name lucavirus --display-name "Python(LucaVirus)"

# List installed kernels, or remove the kernel if it is no longer needed
jupyter kernelspec list
jupyter kernelspec uninstall lucavirus
```
TrainedCheckPoints
This project automatically downloads the LucaVirus trained checkpoint from FTP when running embedding inference
with `src/get_embedding.py` or `src/embedding/get_embedding.py`.
For usage information, refer to `src/embedding/README.md` or `src/get_embedding_guidance.md`.
- `run_multi_v1.0.sh`: train LucaVirus initialized from the LucaOne checkpoint (step=17600000 or 36000000).
- `run_multi_v1.0_continue.sh`: resume training after an interruption.
- `run_multi_mask_v1.0.sh`: train LucaVirus using only the masked language modeling pre-training task.
- `run_multi_v1.0_gene.sh`: train LucaVirus using only viral gene (DNA + RNA) data.
- `run_multi_v1.0_prot.sh`: train LucaVirus using only viral protein data.
- `run_multi_v1.0_single.sh`: train LucaVirus on a single GPU.
```bash
tensorboard --logdir tb-logs --bind_all --port 8008
```
The pre-training data will be released soon.
The downstream-task datasets and checkpoints can be accessed at: LucaVirus,
or at: Zenodo
Foundation Model: LucaVirus
Downstream Tasks: LucaVirusTasks
Yong He,
Yuan-Fei Pan,
Zhaorong Li,
Mang Shi,
Yuqi Liu
@article{LucaVirus,
  author = {Pan, Yuan-Fei* and He, Yong* and Liu, Yu-Qi and Shan, Yong-Tao and Liu, Shu-Ning and Liu, Xue and Pan, Xiaoyun and Bai, Yinqi and Xu, Zan and Wang, Zheng and Ye, Jieping and Holmes, Edward C. and Li, Bo and Chen, Yao-Qing and Li, Zhao-Rong and Shi, Mang},
  title = {Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus},
  elocation-id = {2025.06.14.659722},
  year = {2025},
  doi = {10.1101/2025.06.14.659722},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722},
  eprint = {https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722.full.pdf},
  journal = {bioRxiv}
}




