GitHub - SeedLLM/molly: molly, an LLM designed to understand multi-omics data.

Molly

Molly is a Large Language Model composed of multiple encoders, capable of understanding multi-omics data (DNA, RNA, and protein).

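Omics-Specific Models (OSMs) are the best-performing specialized models in their respective omics domains. Enc-Head is a simple "omics encoder + classification head" architecture that connects a pretrained encoder directly to a task-specific classification head.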
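To make the Enc-Head idea concrete, here is a minimal toy sketch in plain Python. It is not Molly's implementation: `one_hot_encoder` is a hypothetical stand-in for a pretrained omics encoder, and the weights are hand-picked for illustration rather than trained. It only shows the data flow: encode the sequence, pool the per-token vectors, and apply a linear classification head.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def one_hot_encoder(sequence: str) -> List[Vector]:
    """Hypothetical stand-in for a pretrained encoder: one vector per nucleotide."""
    alphabet = "ACGT"
    return [[1.0 if base == letter else 0.0 for letter in alphabet] for base in sequence]


def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the per-token vectors into one fixed-size sequence embedding."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sequence")
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


class EncHead:
    """Encoder followed directly by a linear classification head."""

    def __init__(
        self,
        encoder: Callable[[str], List[Vector]],
        weights: List[Vector],
        bias: Vector,
    ) -> None:
        self.encoder = encoder
        self.weights = weights  # one row of weights per class
        self.bias = bias        # one bias term per class

    def logits(self, sequence: str) -> Vector:
        pooled = mean_pool(self.encoder(sequence))
        return [
            sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(self.weights, self.bias)
        ]

    def predict(self, sequence: str) -> int:
        scores = self.logits(sequence)
        return max(range(len(scores)), key=scores.__getitem__)


# Illustrative head: class 0 scores A/T content, class 1 scores G/C content.
toy_model = EncHead(
    one_hot_encoder,
    weights=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    bias=[0.0, 0.0],
)
```

In a real Enc-Head baseline the encoder would be a pretrained model such as nucleotide-transformer or ESM-2, and the head would be trained per task.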

🌟 Features

Base Model: Enhanced Qwen3 with nucleotide-transformer and ESM-2 encoders
Optimization: Supports Liger-Kernel and FlashAttention for a 100% training speedup (see the example script)

🤗 Download trained model

molly-1.7B molly-4B molly-8B

⚡ How to run inference

```bash
./scripts/infer/inference_nt_lora.sh
```

🔥 How to train

Hotfix transformers source code

```python
# transformers/modeling_utils.py
# Add these 4 lines:
    if not model._tp_plan:
        model_tp_plan = {}
    else:
        model_tp_plan = model._tp_plan

# Then replace model._tp_plan with model_tp_plan:
    tp_plan_regex = (
        re.compile("|".join([re.escape(plan) for plan in model_tp_plan]))
        if _torch_distributed_available and torch.distributed.is_initialized()
        else None
    )
```
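The reason for the hotfix can be reproduced outside of transformers. The sketch below is a standalone illustration, not library code: `build_tp_plan_regex` and `unsafe_build_tp_plan_regex` are hypothetical helpers, and the plan key is made up for the example. It assumes the failure being patched is that the plan can be `None`, which cannot be iterated when building the regex.

```python
import re
from typing import Mapping, Optional


def unsafe_build_tp_plan_regex(tp_plan: Optional[Mapping[str, str]]) -> re.Pattern:
    """Mirrors the unpatched pattern: iterates the plan without a None check."""
    return re.compile("|".join([re.escape(plan) for plan in tp_plan]))


def build_tp_plan_regex(tp_plan: Optional[Mapping[str, str]]) -> re.Pattern:
    """Mirrors the hotfix: fall back to an empty dict when the plan is missing."""
    safe_plan = tp_plan if tp_plan else {}
    return re.compile("|".join([re.escape(plan) for plan in safe_plan]))


# Illustrative plan; the key and value are assumptions, not taken from the repository.
example_plan = {"layers.*.self_attn.q_proj": "colwise"}

if __name__ == "__main__":
    try:
        unsafe_build_tp_plan_regex(None)
    except TypeError as exc:
        print(f"without the hotfix: {exc}")
    print(f"with the hotfix: pattern={build_tp_plan_regex(None).pattern!r}")
```

Note that `re.escape` makes every key a literal pattern, so an empty plan yields an empty regex rather than an error.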

Run training script

```bash
swanlab login

./scripts/train/run_train.sh

# or, for a quick test run
./scripts/train/run_train_mini.sh
```

Eval

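First, run inference on the evaluation set with the trained model.
- In the script, set experiment_name, MODEL_DIR, and CHECKPOINT to point to the trained model.
- --text-model-path, --dna-rna-model-path, and --protein-model-path are the paths to the official pretrained weights.
- --dataset-path is the path to the evaluation set.
- --json-file is the output path for the results.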

```bash
./scripts/infer/inference_nt_lora.sh /path/to/checkpoint-3594 /path/to/inference.jsonl
```

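Next, convert the inference output into the format expected by the evaluation script.
- Set src_paths and dst_path: src_paths is the path to the inference results (note that it must be a folder), and dst_path is the output path for the converted data (note that it is a JSON file).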

```bash
python3 data_tools/convert.py /path/to/inference.jsonl /path/to/convert.jsonl
```
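Before converting, it can help to confirm that the inference output is well-formed JSON Lines. The helper below is a hypothetical sanity check and not part of the repository. It assumes only that each non-empty line is a standalone JSON value and makes no assumption about the record fields.

```python
import json
from pathlib import Path


def count_jsonl_records(path: str) -> int:
    """Count the JSON records in a .jsonl file, failing on the first invalid line."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count
```

For example, `count_jsonl_records("/path/to/inference.jsonl")` returns the number of records, or raises `ValueError` naming the first malformed line.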

Finally, run the evaluation script to get the model's performance on each task.
- Here, --input_file_path is the path to the converted JSON file.

```bash
cd eval
./eval.sh modelname ../path/to/convert.jsonl
```

📌 LICENSE

This project is released under the Apache License; see the LICENSE file for details.
