Molly is a large language model that integrates multiple encoders and can understand multi-omics sequence data (DNA, RNA, and protein).
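Omics-Specific Models (OSMs) are the best-performing specialized models in their respective omics domains. Enc-Head is a simple "omics encoder + classification head" architecture that connects a pretrained encoder directly to a task-specific classification head.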
- Base Model: Enhanced Qwen3 with nucleotide-transformer and ESM-2 encoders
- Optimization: Supports Liger-Kernel and FlashAttention for a 100% training speedup; see the example script:
```bash
./scripts/infer/inference_nt_lora.sh
```
- Hotfix transformers source code
```python
## transformers/modeling_utils.py
## add 4 lines
if not model._tp_plan:
    model_tp_plan = {}
else:
    model_tp_plan = model._tp_plan
## model._tp_plan -> model_tp_plan
tp_plan_regex = (
    re.compile("|".join([re.escape(plan) for plan in model_tp_plan]))
    if _torch_distributed_available and torch.distributed.is_initialized()
    else None
)
```
- Run training script
```bash
swanlab login
./scripts/train/run_train.sh
# or for test
./scripts/train/run_train_mini.sh
```
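To see why the hotfix to `transformers/modeling_utils.py` is needed, the sketch below reproduces its logic in isolation. It assumes the failure mode is a `_tp_plan` that is `None` (or empty), which cannot be iterated when building the regex. The helper name `build_tp_plan_regex` and its `distributed_ready` argument are hypothetical and are not part of transformers.

```python
import re


def build_tp_plan_regex(tp_plan, distributed_ready):
    """Hypothetical stand-in for the patched lines in modeling_utils.py."""
    # Fall back to an empty dict when the plan is None or empty,
    # mirroring the four lines added by the hotfix.
    plan = tp_plan if tp_plan else {}
    if not distributed_ready:
        return None
    # Iterating a dict yields its keys, so each key becomes one
    # escaped alternative in the pattern.
    return re.compile("|".join(re.escape(key) for key in plan))


# Without the fallback, iterating over None would raise a TypeError.
regex = build_tp_plan_regex(None, distributed_ready=True)
print(repr(regex.pattern))
```

Note that an empty plan yields an empty pattern, which matches any string; the sketch keeps that behavior because the hotfix does the same.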
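- Run inference on the evaluation set with the trained model
- In the script, change `experiment_name`, `MODEL_DIR`, and `CHECKPOINT` to point to the trained model
- `--text-model-path`, `--dna-rna-model-path`, and `--protein-model-path` are the paths to the official pretrained weights
- `--dataset-path` is the path to the evaluation set
- `--json-file` is the output path for the results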
```bash
./scripts/infer/inference_nt_lora.sh /path/to/checkpoint-3594 /path/to/inference.jsonl
```
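The inference flags listed above can be pictured as a small argument parser. This is only an illustrative sketch: the flag names come from this README, but the parser itself, the `required=True` settings, and the help strings are assumptions, and the project's actual inference entry point may define them differently.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the flag names are taken from the README."""
    parser = argparse.ArgumentParser(description="Illustrative inference arguments")
    parser.add_argument("--text-model-path", required=True,
                        help="official pretrained text model weights")
    parser.add_argument("--dna-rna-model-path", required=True,
                        help="official pretrained DNA/RNA encoder weights")
    parser.add_argument("--protein-model-path", required=True,
                        help="official pretrained protein encoder weights")
    parser.add_argument("--dataset-path", required=True,
                        help="evaluation set to run inference on")
    parser.add_argument("--json-file", required=True,
                        help="where inference results are written")
    return parser


# argparse converts dashes to underscores in attribute names.
args = build_parser().parse_args([
    "--text-model-path", "/path/to/text-model",
    "--dna-rna-model-path", "/path/to/dna-rna-model",
    "--protein-model-path", "/path/to/protein-model",
    "--dataset-path", "/path/to/eval-set",
    "--json-file", "/path/to/inference.jsonl",
])
print(args.json_file)
```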
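- Convert the inference output into the format expected by the evaluation script
- Update `src_paths` and `dst_path`: `src_paths` is the path to the inference results (note that it must be a folder), and `dst_path` is the path of the converted output (note that it is a JSON file)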
```bash
python3 data_tools/convert.py /path/to/inference.jsonl /path/to/convert.jsonl
```
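As a rough picture of the folder-in, single-file-out shape described for the conversion step, the sketch below gathers every `.jsonl` file in a folder and writes the records to one JSON file. It is not the logic of `data_tools/convert.py`: the function name `merge_jsonl_folder` is hypothetical, records are passed through unchanged, and whatever field mapping the evaluation format requires is not shown.

```python
import json
from pathlib import Path


def merge_jsonl_folder(src_dir, dst_path):
    """Hypothetical helper: collect JSONL records from a folder into one JSON file."""
    records = []
    # Sort for a deterministic record order across runs.
    for path in sorted(Path(src_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    with open(dst_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return len(records)
```

The return value is the number of records written, which is handy as a quick sanity check against the size of the evaluation set.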
- Run the evaluation script to get the model's performance on each task
- Here `--input_file_path` is the path to the converted output JSON file
```bash
cd eval
./eval.sh modelname ../path/to/convert.jsonl
```
This project is released under the Apache License.
