This repository contains the official implementation of CrossDNA, a DNA sequence language model designed for explicit and dynamic cross-strand representation learning. CrossDNA uses a duplex-inspired dual-branch architecture to model forward and reverse-complement DNA views, together with lightweight cross-strand communication and long-context sequence modeling modules.
The repository provides source code, configuration files, pretrained model weights, and example scripts for pretraining CrossDNA, fine-tuning it on downstream genomic benchmarks, and extracting sequence-level representations. It can be used to reproduce the experiments reported in our study, train CrossDNA on custom genomic sequences, or load pretrained CrossDNA models for downstream tasks such as genomic sequence classification, enhancer prediction, chromatin profile prediction, and embedding-based analysis.
Pretrained CrossDNA model variants are also available through Hugging Face.
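As a minimal illustration of the two strand views mentioned above, the sketch below derives the reverse complement of a forward-strand sequence. The helper name `reverse_complement` is hypothetical and is not taken from the CrossDNA codebase; it only shows how the two views relate, not how the model builds or consumes them.

```python
# Illustrative only: how the forward and reverse-complement views of one
# sequence relate. `reverse_complement` is a hypothetical helper and is not
# part of the CrossDNA API.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (N stays N)."""
    # Complement every base, then reverse the order of the bases.
    return seq.translate(_COMPLEMENT)[::-1]


forward = "ATGCCN"
reverse = reverse_complement(forward)
print("forward:", forward)
print("reverse complement:", reverse)
```

Applying the helper twice returns the original sequence, which is a quick sanity check when preparing paired inputs.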
- Scripts for Pretraining, NT & Genomic Benchmarks.
- Paper Released.
- Pretrained Weights of CrossDNA (8.1M).
- [HuggingFace 🤗] includes variants of the CrossDNA model.
- Source Code and Pretrained Weights on transformers.
git clone https://github.com/LuoGroup2023/CrossDNA.git
cd CrossDNA/crossdna
conda create -n CrossDNA python=3.11
conda activate CrossDNA
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu121 torch==2.5.0+cu121 torchvision==0.20.0+cu121 torchaudio==2.5.0+cu121
pip install -U --no-use-pep517 git+https://github.com/fla-org/flash-linear-attention --no-deps
pip install --no-cache-dir triton==3.2.0
pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install --no-deps "selene_sdk==0.6.0"
pip install -U cython plotly pytabix ruamel.yaml ruamel.yaml.clib seaborn statsmodels narwhals patsy
pip install transformers pytorch-lightning==2.5.0.post0 wandb hydra-core==1.3.2 omegaconf==2.3.0 datasets polars genomic_benchmarks liftover psutil kipoiseq pyBigWig timm
mkdir data
mkdir -p data/hg38/
curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
See the Nucleotide Transformer and Genomic Benchmarks papers for how to download and process the NT benchmark and Genomic Benchmarks datasets.
The final file structure (data directory) should look like this:
|____bert_hg38
| |____hg38.ml.fa
| |____hg38.ml.fa.fai
| |____human-sequences.bed
|____nucleotide_transformer
| |____H3K36me3
| |____......
|____genomic_benchmark
| |____dummy_mouse_enhancers_ensembl
| |____....
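Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not a script shipped with this repository: the entry names are copied from the tree above, while the `data` root and the helper name `missing_entries` are assumptions made for illustration.

```python
from pathlib import Path

# Entries copied from the directory tree above. The "data" root used at the
# bottom is an assumption; point it at your own data directory.
EXPECTED = [
    "bert_hg38/hg38.ml.fa",
    "bert_hg38/hg38.ml.fa.fai",
    "bert_hg38/human-sequences.bed",
    "nucleotide_transformer",
    "genomic_benchmark",
]


def missing_entries(data_root):
    """Return the expected files or directories that are absent under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


missing = missing_entries("data")
if missing:
    print("Missing entries:", ", ".join(missing))
else:
    print("Data directory layout looks complete.")
```

The check only tests for existence; it does not validate file contents or the per-task subdirectories.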
The recommended entry point is the pre-training script under scripts/pre_train. Before running, please update the environment-specific paths in the script, such as conda, full_path_to_root, and the output directory.
bash scripts/pre_train/CrossDNA_2k.sh
This script launches experiment=hg38-pretrain/crossdna with the default 2k pre-training setup. You can edit variables such as SEQLEN, BLOCK_SIZE, BATCH_SIZE, D_MODEL, Depth, LR, and MAX_EPOCHES directly in the script.
GenomicBenchmarks provides 8 binary- and multi-class tasks packaged as a Python library.
The recommended launch script is scripts/benchmark/gb/gb_crossdna.sh. Please update the checkpoint path and dataset root in the script, or pass them from the command line.
bash scripts/benchmark/gb/gb_crossdna.sh human_enhancers_cohn
Optional override:
bash scripts/benchmark/gb/gb_crossdna.sh human_enhancers_cohn /path/to/pretrain.ckpt /path/to/genomic_benchmark
An additional 408K/tiny-backbone variant is also provided:
bash scripts/benchmark/gb/gb_crossdna_408k.sh human_enhancers_cohn /path/to/pretrain_408k.ckpt /path/to/genomic_benchmark
Task-specific MAX_LENGTH, BATCH_SIZE, and LR are selected automatically inside the script according to DATASET_NAME.
Datasets are hosted on the Hub as InstaDeepAI/nucleotide_transformer_downstream_tasks.
The recommended launch script is scripts/benchmark/nt/nt_crossdna.sh. Please update the checkpoint path and dataset root in the script, or pass them from the command line.
bash scripts/benchmark/nt/nt_crossdna.sh H3K4me3
Optional override:
bash scripts/benchmark/nt/nt_crossdna.sh H3K4me3 /path/to/pretrain.ckpt /path/to/nucleotide_transformer
Task-specific BATCH_SIZE and LR are configured inside the script for each NT dataset.
import os
os.environ.setdefault("DISABLE_TORCH_COMPILE", "1")
import torch
if hasattr(torch, "compile"):
    def _no_compile(fn=None, *args, **kwargs):
        if fn is None:
            def deco(f): return f
            return deco
        return fn
    torch.compile = _no_compile
from transformers import AutoModelForMaskedLM, AutoTokenizer
MODEL_DIR = "/home/zhaol/projects/huggingface_crossdna/crossdna_inference"  # Replace with your local model directory; it must contain either `model.safetensors` or `pytorch_model.bin`.
tok = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True, local_files_only=True)
model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR, trust_remote_code=True, local_files_only=True)
model.eval()
# ---- Device Selection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# ---- Base mapping (logits indexed 0..4 <-> A/C/G/T/N)
labels = ["A", "C", "G", "T", "N"]
base_map = {ch: i for i, ch in enumerate(labels)}
def dna_to_base_ids(seq: str, device=None):
    t = torch.tensor([base_map.get(ch.upper(), base_map["N"]) for ch in seq], dtype=torch.long)
    return t.to(device) if device is not None else t
# ========== Test (MaskedLM Forward) ==========
x = torch.full((2, 16), base_map['N'], dtype=torch.long, device=device)
mask_id = getattr(model.config, "mask_token_id", 3)
x[:, 3] = mask_id
x[:, 9] = mask_id
with torch.no_grad():
    out = model(input_ids=x)
logits = out.logits.detach().cpu()
print("logits.shape =", tuple(logits.shape))
dna = "TGATGTGACTCACATAGGCGGTGGCGTGATATGTTGTGACTCATTTCCCGGAAACGGATGACTAATGCCATATGTTATCAGTTTCCTGGAAATTTGATCACGCCATATTGTGAAATCATGCGATTCCCGGATCACGTGACGGCCGGACGTGACAAGTATGAGTCACTAAGTGGCGTGATCTTACGAATCACGTGATGGTCAATGTCACGTGATCGGCTGGTGAGTCAGCAATATCGTGTGATTCATTC"
inp = dna_to_base_ids(dna, device=device).unsqueeze(0)
with torch.no_grad():
    out2 = model(input_ids=inp)
pred2 = out2.logits.argmax(dim=-1).squeeze(0).cpu().tolist()
print("argmax base:", "".join(labels[i] for i in pred2))
# ========== Representations (Embedding) ==========
max_len_cfg = getattr(model.config, "max_position_embeddings", 1024)
max_length = int(min(512, max_len_cfg)) # You can change it to a longer value, but do not exceed max_position_embeddings.
# 2) Construct sample DNA text with base IDs (note: not tokenizer IDs)
sequence = "ACTG" * (max_length // 4)
seq_base = dna_to_base_ids(sequence, device=device).unsqueeze(0) # [1, L] in 0..4
# 3) Temporarily switch the backbone to representation mode.
was_pretrain = getattr(model.backbone, "pretrain", False)
was_for_repr = getattr(model.backbone, "for_representation", False)
model.backbone.pretrain = False # Let Backbone accept a single sequence
model.backbone.for_representation = True # Let forward return the fused representation.
with torch.inference_mode():
    embeddings, _ = model.backbone(seq_base)  # [B, L, H]
embeddings_cpu = embeddings.detach().cpu()
print("embeddings.shape =", tuple(embeddings_cpu.shape))
# 4) Restore original settings
model.backbone.pretrain = was_pretrain
model.backbone.for_representation = was_for_repr
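The backbone call above returns per-position embeddings of shape [B, L, H]. To get one vector per sequence, a common option is mean pooling over the length dimension. The sketch below is an assumption about how you might do this, not the pooling used in our experiments; `mean_pool` and the `demo` tensor are hypothetical names introduced for illustration.

```python
from typing import Optional

import torch


def mean_pool(embeddings: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Average [B, L, H] per-position embeddings into [B, H] sequence vectors.

    `mask` is an optional [B, L] tensor with 1 for real bases and 0 for padding.
    """
    if mask is None:
        return embeddings.mean(dim=1)
    weights = mask.to(embeddings.dtype).unsqueeze(-1)  # [B, L, 1]
    summed = (embeddings * weights).sum(dim=1)         # [B, H]
    counts = weights.sum(dim=1).clamp(min=1.0)         # [B, 1], avoids division by zero
    return summed / counts


# Stand-in tensor with the same [B, L, H] layout as `embeddings_cpu` above.
demo = torch.arange(24, dtype=torch.float32).reshape(1, 4, 6)
pooled = mean_pool(demo)
print("pooled.shape =", tuple(pooled.shape))
```

With real model output you would call `mean_pool(embeddings_cpu)`, passing a mask only if the batch contains padded positions.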
All data used in this study were obtained from publicly available datasets.
For the Genomic Benchmarks tasks, we used datasets hosted on Hugging Face: https://huggingface.co/katarinagresova. Data were processed following the procedures described in the associated GitHub repositories: https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks and https://github.com/HazyResearch/hyena-dna.
The Nucleotide Transformer downstream tasks were downloaded from: https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks/tree/main, and prepared according to the data processing and loading pipeline provided in the Caduceus repository: https://github.com/kuleshov-group/caduceus.
Chromatin profile prediction data were obtained from the DeepSEA resource. Preprocessing followed the Sei framework implementation: https://github.com/FunctionLab/sei-framework, and task-specific fine-tuning was configured in accordance with the GENA-LM DeepSEA scripts: https://github.com/AIRI-Institute/GENA_LM/blob/main/downstream_tasks/DeepSea/run_deepsea_finetuning.py.
For the enhancer activity prediction task, we used the dataset available at: https://huggingface.co/datasets/GenerTeam/DeepSTARR-enhancer-activity/tree/main, and followed the data preprocessing and model fine-tuning procedures described in the associated study.
DNA long-range benchmark tasks were constructed from the dataset available at Harvard Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YUP2G5, and processed following the JanusDNA repository: https://github.com/Qihao-Duan/JanusDNA.
For the enhancer generalization experiment, mouse memory CD8 T cell enhancers and Drosophila E2-4 neural enhancers were obtained from the EnhancerAtlas database: http://www.enhanceratlas.net/scenhancer/download.php. Human K562 cell-line enhancer sequences were also retrieved from EnhancerAtlas. Ten experimentally validated, highly active developmental enhancers designed in the DREAM study were downloaded from the supplementary materials of the corresponding publication: https://academic.oup.com/nar/article/52/21/13447/7825962#supplementary-data. Our processed enhancer dataset is available at: https://doi.org/10.5281/zenodo.17995482.
To benchmark the embedding quality of DNA foundation models, we used the DNA Foundation Benchmark dataset, available at: https://huggingface.co/datasets/hfeng3/dna_foundation_benchmark_dataset/tree/main.
- Cheng Yang: yangchengyjs@hnu.edu.cn
College of Computer Science and Electronic Engineering, Hunan University, Changsha