
molly, an LLM designed to understand multi-omics data.

SeedLLM/molly

Molly

Molly is a Large Language Model composed of multiple encoders, capable of understanding multi-omics data (DNA, RNA, and protein).

Omics-Specific Models (OSMs) are the best-performing specialized models in their respective omics domains. Enc-Head is a simple "omics encoder + classification head" architecture that connects a pretrained encoder directly to a task-specific classification head.
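
To make the Enc-Head idea concrete, here is a toy sketch of "encoder + classification head" in plain Python. It is not taken from the Molly code base: `toy_encoder`, `ALPHABET`, `WEIGHTS`, and `BIAS` are hypothetical stand-ins (a one-hot encoder instead of a pretrained model, hand-picked weights instead of a trained head), and mean pooling is just one common way to turn per-token vectors into a single embedding.

```python
# Toy Enc-Head sketch: a stand-in "encoder" feeding a linear classification head.
# All names and weights below are illustrative assumptions, not Molly's code.
from typing import List

ALPHABET = "ACGT"  # hypothetical DNA vocabulary for the toy encoder


def toy_encoder(sequence: str) -> List[List[float]]:
    """Stand-in for a pretrained omics encoder: one one-hot vector per base."""
    return [[1.0 if base == symbol else 0.0 for symbol in ALPHABET] for base in sequence]


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average the per-token vectors into one fixed-size sequence embedding."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def classification_head(embedding: List[float], weights: List[List[float]],
                        bias: List[float]) -> List[float]:
    """Task-specific linear layer: one logit per class."""
    return [sum(w * x for w, x in zip(row, embedding)) + b for row, b in zip(weights, bias)]


def enc_head_predict(sequence: str, weights: List[List[float]], bias: List[float]) -> int:
    """Encoder output goes straight into the head; return the index of the largest logit."""
    logits = classification_head(mean_pool(toy_encoder(sequence)), weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical 2-class head: class 0 scores A/T content, class 1 scores C/G content.
WEIGHTS = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
BIAS = [0.0, 0.0]

print(enc_head_predict("GGCC", WEIGHTS, BIAS))  # C/G-rich sequence
print(enc_head_predict("AATT", WEIGHTS, BIAS))  # A/T-rich sequence
```

In a real Enc-Head baseline the encoder would be a pretrained DNA, RNA, or protein model and the head would be trained per task; only the wiring (encode, pool, classify) carries over from this sketch.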

🌟 Feature

🤗 Download trained model

⚡ How to run inference

```bash
./scripts/infer/inference_nt_lora.sh
```

🔥 How to train

  1. Hotfix the transformers source code

    ## File: transformers/modeling_utils.py
    ## Add these 4 lines so that an unset or empty `_tp_plan` falls back to an empty dict:
        if not model._tp_plan:
            model_tp_plan = {}
        else:
            model_tp_plan = model._tp_plan
    
    ## Then replace `model._tp_plan` with `model_tp_plan` in the following expression:
        tp_plan_regex = (
            re.compile("|".join([re.escape(plan) for plan in model_tp_plan]))
            if _torch_distributed_available and torch.distributed.is_initialized()
            else None
        )
  2. Run the training script

    swanlab login
    
    ./scripts/train/run_train.sh
    
    # or, for a test run
    ./scripts/train/run_train_mini.sh
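
The hotfix in step 1 can be tried in isolation. The sketch below assumes the patch is meant to cover a model whose `_tp_plan` is `None` or empty; `FakeModel`, `build_unguarded`, `build_tp_plan_regex`, the `distributed_ready` flag (standing in for the `torch.distributed` check), and the example plan entry are all hypothetical names used for illustration, not transformers APIs.

```python
import re


class FakeModel:
    """Hypothetical stand-in for a model object; only `_tp_plan` matters here."""

    def __init__(self, tp_plan):
        self._tp_plan = tp_plan


def build_unguarded(model):
    # Unpatched shape of the expression: iterating over None raises TypeError.
    return re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))


def build_tp_plan_regex(model, distributed_ready=True):
    # The guard from step 1: treat an unset (None) or empty plan as {}.
    if not model._tp_plan:
        model_tp_plan = {}
    else:
        model_tp_plan = model._tp_plan
    return (
        re.compile("|".join([re.escape(plan) for plan in model_tp_plan]))
        if distributed_ready
        else None
    )


try:
    build_unguarded(FakeModel(None))
except TypeError as err:
    print("unguarded:", err)

# With the guard, a missing plan compiles to an empty pattern instead of failing.
print("guarded:", repr(build_tp_plan_regex(FakeModel(None)).pattern))

# A non-empty plan (made-up entry) is escaped and matched literally.
plan = {"layers.*.q_proj": "colwise"}
print(build_tp_plan_regex(FakeModel(plan)).fullmatch("layers.*.q_proj") is not None)
```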

Eval

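  • Run inference on the evaluation set with the trained model
    • In the script, set experiment_name, MODEL_DIR, and CHECKPOINT to the path of the trained model
    • --text-model-path, --dna-rna-model-path, and --protein-model-path are the paths to the official pretrained weight files
    • --dataset-path is the path to the evaluation set
    • --json-file is the output path for the results
./scripts/infer/inference_nt_lora.sh /path/to/checkpoint-3594  /path/to/inference.jsonl
  • Convert the inference output into the format expected by the evaluation script
    • Set src_paths and dst_path: src_paths is the path to the inference results (note that it must be a folder), and dst_path is the output path of the converted data (note that it is a JSON file)
python3 data_tools/convert.py /path/to/inference.jsonl /path/to/convert.jsonl
  • Run the evaluation script to get the model's performance on each task
    • Here --input_file_path is the path to the converted JSON output file
cd eval
./eval.sh modelname ../path/to/convert.jsonl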
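
The three evaluation steps above can be chained from one small driver. This is a minimal sketch, assuming the scripts accept exactly the positional arguments shown in the commands above and that it is run from the repository root; `build_eval_steps`, `run_steps`, and the example paths are hypothetical names, not part of the repository. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess
from typing import List, Optional, Tuple

Step = Tuple[List[str], Optional[str]]  # (argv, working directory or None)


def build_eval_steps(checkpoint: str, inference_out: str, converted_out: str,
                     model_name: str, eval_dir: str = "eval") -> List[Step]:
    """Return the inference, conversion, and evaluation commands in order."""
    return [
        (["./scripts/infer/inference_nt_lora.sh", checkpoint, inference_out], None),
        (["python3", "data_tools/convert.py", inference_out, converted_out], None),
        # eval.sh is run from inside eval/, so its input path is made relative to that folder.
        (["./eval.sh", model_name, os.path.relpath(converted_out, start=eval_dir)], eval_dir),
    ]


def run_steps(steps: List[Step], dry_run: bool = True) -> None:
    """Print each command; execute it too unless dry_run is set."""
    for argv, cwd in steps:
        prefix = "(cd %s) " % cwd if cwd else ""
        print(prefix + " ".join(argv))
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)  # stop at the first failing step


steps = build_eval_steps(
    checkpoint="ckpt/checkpoint-3594",      # example path, adjust to your run
    inference_out="out/inference.jsonl",    # example path
    converted_out="out/convert.jsonl",      # example path
    model_name="modelname",
)
run_steps(steps, dry_run=True)
```

Passing `dry_run=False` executes the commands; `check=True` makes a failed inference or conversion step abort the run before evaluation starts.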

📌 LICENSE

This project is released under the Apache License.
