🧬xDecoder unlocks the potential of foundation models for few-shot personal gene expression prediction
This repo provides the implementation for our study, which leverages the representational power of large-scale genomic language models (gLMs) and sequence-to-function (S2F) models to predict personal gene expression.
We used the embeddings from:
- S2F models: Enformer, Borzoi
- gLMs: Evo2-7B, Nucleotide Transformer v2 (NT), Caduceus-ph

These embeddings are combined within our framework, xDecoder.
Additionally, we performed full-parameter fine-tuning of Enformer and Borzoi as a comparison.
We recommend using a separate conda environment for each model, following the setup instructions from each model's original repository. Then install the following additional packages in each environment:
```bash
pip install pyfaidx biopython pandas pybedtools
```
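For instance, setting up one such environment might look like the sketch below; the environment name and Python version are placeholders, and the model-specific dependencies come from that model's own repository:

```bash
# placeholder environment for one model (e.g. Caduceus); adjust the name and Python version as needed
conda create -n caduceus_env python=3.10 -y
conda activate caduceus_env
# ...install the model's own requirements per its repository...
pip install pyfaidx biopython pandas pybedtools
```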
Download the reference genome (hs37d5):

```bash
cd data
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
```
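Consensus generation from the reference typically expects a faidx index (.fai) next to the fasta; it is often built automatically, but you can create it explicitly, e.g. (assuming either samtools or the faidx CLI bundled with pyfaidx is available):

```bash
# build the .fai index next to the reference; either tool works, pick whichever is installed
samtools faidx hs37d5.fa
# or, using the CLI installed with pyfaidx:
faidx hs37d5.fa
```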
Decompress the partition data:

```bash
cd data
gunzip all_partition_data.csv.gz
```
Step 1. Download the VCF files, then compress and index them with `bcftools`:

```bash
cd data/vcf_files
wget -i vcf_urls.txt
```
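The 1000 Genomes VCFs are usually distributed already bgzip-compressed; a minimal sketch for making sure every downloaded file is compressed and indexed (filenames are illustrative) is:

```bash
cd data/vcf_files
# compress any plain-text VCFs (skipped if everything is already .vcf.gz)
for f in *.vcf;    do [ -e "$f" ] && bgzip "$f"; done
# index every compressed VCF so bcftools can query it
for f in *.vcf.gz; do [ -e "$f" ] && bcftools index "$f"; done
```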
Step 2. Make consensus fasta files for individuals:

- Install `bcftools` following its installation instructions, or try `conda install -c conda-forge bcftools`.
- Prepare a sample list file for generating consensus (e.g. `data/train_50_indivs.txt`).
- Run `make_consensus.py`:

```bash
cd make_data
python make_consensus.py --sample_list_file {sample name list file, default: ../data/train_50_indivs.txt} --reference {reference path, default: ../data/hs37d5.fa} --vcf_root {vcf file root, default: ../data/vcf_files} --consensus_root {consensus save root, default: ../data/consensus}
```
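For context, the core per-sample operation that make_consensus.py automates is similar to calling bcftools consensus once per haplotype; the sample name, VCF path, and output names below are illustrative assumptions, not the script's exact behavior:

```bash
# illustrative only: build both haplotype consensus sequences for one sample
bcftools consensus -f ../data/hs37d5.fa -s HG00096 -H 1 ../data/vcf_files/chr22.vcf.gz > HG00096_1.fa
bcftools consensus -f ../data/hs37d5.fa -s HG00096 -H 2 ../data/vcf_files/chr22.vcf.gz > HG00096_2.fa
```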
We train the models on embeddings from the two S2F models and three gLMs, supporting three training strategies:
- r-: Reference-only training
- p-: Individual-only training
- rp-: Reference pretraining followed by individual fine-tuning
We provide predefined configuration files for each model. If you use a custom data path, update it in the config YAML or create a new one.
For rp- models, first train with the reference-only (r-) configuration, then fine-tune on individual data with pretrained_model_path set to the reference-trained checkpoint (see the example after the commands below).
Reference-only (r-) training:

```bash
python train_gLM_based_models.py --config_file_path configurations/config_{model_name}_ref.yml

# example
# python train_gLM_based_models.py --config_file_path configurations/config_caduceus_ref.yml
# python train_gLM_based_models.py --config_file_path configurations/config_Borzoi_embed_ref.yml
```
Individual-only (p-) training / individual fine-tuning:

```bash
python train_gLM_based_models.py --config_file_path configurations/config_{model_name}_indiv.yml

# example
# python train_gLM_based_models.py --config_file_path configurations/config_caduceus_indiv.yml
# python train_gLM_based_models.py --config_file_path configurations/config_Borzoi_embed_indiv.yml
```
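For example, the rp- strategy chains the two commands above; the Caduceus configs come from the examples, and the checkpoint path is whatever your r- run produced:

```bash
# 1) reference pretraining (r-)
python train_gLM_based_models.py --config_file_path configurations/config_caduceus_ref.yml
# 2) edit configurations/config_caduceus_indiv.yml and set pretrained_model_path
#    to the checkpoint written by step 1 (the exact output path depends on your config)
# 3) individual fine-tuning (rp-)
python train_gLM_based_models.py --config_file_path configurations/config_caduceus_indiv.yml
```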
Due to the high computational cost of Evo2 embedding generation, we recommend precomputing and storing embeddings as .pt files.
- Expected folder structure:

```
preCalculatedEmbed_root/
├── sample_name/
│   ├── {gene_ID}_{allele}_1.pt
│   └── {gene_ID}_{allele}_2.pt
```
Set the preCalculatedEmbed_root path in your configuration YAML.
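A quick way to sanity-check the layout before training (the root path is whatever you set in the YAML):

```bash
# count precomputed .pt embedding files per sample directory
for d in preCalculatedEmbed_root/*/; do
  echo "$d: $(ls "$d" | grep -c '\.pt$') .pt files"
done
```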
The same three strategies apply to fine-tuned S2F models.
Reference-only (r-) training:

```bash
python ft_S2F_models.py --config_file_path configurations/config_{model_name}_ref.yml

# example
# python ft_S2F_models.py --config_file_path configurations/config_Enformer_ref.yml
```
Individual fine-tuning:

```bash
python ft_S2F_models.py --config_file_path configurations/config_{model_name}_indiv.yml

# example:
# python ft_S2F_models.py --config_file_path configurations/config_Enformer_indiv.yml
```
For rp- models, first run reference-only (r-) training, then fine-tune using the individual-only configurations, setting pretrained_model_path in the config to the reference-trained model checkpoint.