🧬 xDecoder unlocks the potential of foundation models for few-shot personal gene expression prediction

This repo provides the implementation for our study "Assessing large-scale genomic language models in predicting personal gene expression: promises and limitations", which leverages the representation power of large-scale genomic language models (gLMs) and sequence-to-function (S2F) models to predict personal gene expression.

Within the xDecoder framework, we use the embeddings from:

  • S2F models: Enformer, Borzoi
  • gLMs: Evo2-7B, Nucleotide Transformer v2 (NT), Caduceus-ph

As a comparison, we also performed full-parameter fine-tuning of Enformer and Borzoi.

📦 Installation

We recommend using a separate conda environment for each model, following the setup instructions in each model's original repository.

Then install the following additional packages in each environment:

pip install pyfaidx biopython pandas pybedtools

📁 Dataset Preparation

🔹 Download Reference Genome


cd data
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

gunzip hs37d5.fa.gz
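# optional: build a FASTA index (.fai) for fast random access; uses the
# pyfaidx package installed above (some downstream tools expect this index)
python -c "import pyfaidx; pyfaidx.Faidx('hs37d5.fa')"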

🔹 Prepare Partition File

cd data
gunzip all_partition_data.csv.gz

🔹 Generate Consensus of Individuals

  • Step 1. Download the VCF files, then compress and index them with bcftools (commands below)

    cd data/vcf_files
    wget -i vcf_urls.txt
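    # index each VCF so bcftools can query it; the files from this FTP are
    # expected to be bgzipped (*.vcf.gz) already -- if any are plain .vcf,
    # compress them first with bgzip
    for f in *.vcf.gz; do bcftools index -t "$f"; done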
    
  • Step 2. Make consensus fasta files for individuals

    • Install bcftools following its official installation instructions, or try conda install -c conda-forge bcftools

    • Prepare a sample list file for generating consensus (e.g. data/train_50_indivs.txt; see the example format after this list)

    • Run make_consensus.py

      cd make_data
      
      # defaults shown; adjust the paths to your setup
      python make_consensus.py \
          --sample_list_file ../data/train_50_indivs.txt \
          --reference ../data/hs37d5.fa \
          --vcf_root ../data/vcf_files \
          --consensus_root ../data/consensus
      
      
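The sample list is a plain-text file with one sample ID per line; the 1000 Genomes IDs below are illustrative:

  HG00096
  HG00097
  HG00099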

🧱 Train xDecoder models

We train xDecoder models on the embeddings from the two S2F models and three gLMs listed above, and support three training strategies:

  • r-: Reference-only training
  • p-: Individual-only training
  • rp-: Reference pretraining followed by individual fine-tuning

We predefined configuration files for each model; if you use a custom data path, update it in the config YAML or create a new one.

⚠️ For rp- models, first train with the reference-only (r-) strategy, then fine-tune with individual data, setting pretrained_model_path in the config (see the illustrative fragment below).
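A minimal, illustrative YAML fragment for the fine-tuning stage; the checkpoint path is hypothetical, and pretrained_model_path is the only field name taken from this README (check the provided YAML files for the actual schema):

# in configurations/config_{model_name}_indiv.yml
pretrained_model_path: checkpoints/r-caduceus/best_model.pt  # hypothetical path to the r- checkpoint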

🔹 Reference-only training


python train_gLM_based_models.py --config_file_path configurations/config_{model_name}_ref.yml

# example
# python train_gLM_based_models.py --config_file_path configurations/config_caduceus_ref.yml

# python train_gLM_based_models.py --config_file_path configurations/config_Borzoi_embed_ref.yml

🔹 Individual-only training


python train_gLM_based_models.py --config_file_path configurations/config_{model_name}_indiv.yml

# example
# python train_gLM_based_models.py --config_file_path configurations/config_caduceus_indiv.yml

# python train_gLM_based_models.py --config_file_path configurations/config_Borzoi_embed_indiv.yml

⚠️ Evo2 Embeddings

Due to the high computational cost of Evo2 embedding generation, we recommend precomputing and storing embeddings as .pt files.

  • Expected folder structure:
preCalculatedEmbed_root/
├── sample_name/
│   ├── {gene_ID}_{allele}_1.pt
│   └── {gene_ID}_{allele}_2.pt
      

Set the preCalculatedEmbed_root path in your configuration YAML.
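A minimal sketch of writing precomputed embeddings in this layout, assuming each embedding is a plain torch tensor; save_embedding and compute_evo2_embedding are hypothetical names, and only the folder/filename convention above comes from this repo:

import os
import torch

def save_embedding(root, sample_name, gene_id, allele, hap, embedding):
    """Write one tensor to {root}/{sample_name}/{gene_id}_{allele}_{hap}.pt (hap is 1 or 2)."""
    sample_dir = os.path.join(root, sample_name)
    os.makedirs(sample_dir, exist_ok=True)
    torch.save(embedding, os.path.join(sample_dir, f"{gene_id}_{allele}_{hap}.pt"))

# illustrative usage; compute_evo2_embedding stands in for your Evo2 forward pass
# emb = compute_evo2_embedding(sequence)
# save_embedding("path/to/preCalculatedEmbed_root", "HG00096", "ENSG00000139618", "A", 1, emb)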

🔧 Fine-tune S2F models

The same three training strategies apply when fine-tuning the S2F models.

🔹 Reference-only training

python ft_S2F_models.py --config_file_path configurations/config_{model_name}_ref.yml

# example
# python ft_S2F_models.py --config_file_path configurations/config_Enformer_ref.yml

🔹 Individual-only training


python ft_S2F_models.py --config_file_path configurations/config_{model_name}_indiv.yml

# example:
# python ft_S2F_models.py --config_file_path configurations/config_Enformer_indiv.yml

🔹 Reference-individual training

First run reference-only (r-) training, then fine-tune with the individual-only configuration, setting pretrained_model_path in the config to the reference-trained model checkpoint (see the config fragment under "Train xDecoder models").
