Skip to content

1anj/DeepMoLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling

Overview

DeepMoLM is a multi-modal large language model designed for comprehensive molecular understanding through the fusion of high-resolution 2D visual encodings and 3D geometric fingerprints. Unlike traditional approaches that represent molecules as simple 1D strings (SMILES) or 2D graphs, DeepMoLM employs a DeepEncoder for fine-grained visual perception and a Fusion Projector that leverages cross-attention mechanisms over molecule_fp (3D fingerprint) structural data to achieve stereoscopic molecular comprehension.

Architecture

DeepMoLM integrates the visual perception capabilities of DeepOCR with a specialized Fusion Projector designed for advanced molecular understanding.

DeepMoLM Architecture

flowchart TD
    A[Input: 2D molecular image, 1024x1024] --> B[Patch embedding + SAM-base<br/>12L, 768d, window attn]
    B --> C[SAM features, 4096 tokens]
    C --> D[Conv 16x compression]
    D --> E[Compressed tokens, 256 x 768d]
    E --> F[CLIP-large encoder<br/>24L, 1024d]
    F --> G[CLIP tokens, 256 x 1024d]
    E --> H[Token concat + proj<br/>2048d]
    G --> H
    I[Input: molecule_fp<br/>3D geometric fingerprints] --> J[molecule_fp tokens]
    H --> K[Fusion Projector<br/>Cross-Attention Transformer Decoder Layer]
    J --> K
    K --> L[Fused visual tokens]
    L --> M[LLM interface projection]
    M --> N[Qwen2-VL-7B]
Loading
πŸ—οΈ Architecture
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   DeepMoLM                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚           DeepEncoder (380M)                β”‚    β”‚
β”‚  β”‚                                             β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”β”‚    β”‚
β”‚  β”‚  β”‚  SAM-base    │──→│ Conv    │──→│ CLIP   β”‚β”‚    β”‚
β”‚  β”‚  β”‚  (80M)       β”‚   β”‚ 16Γ—     β”‚   β”‚ (300M) β”‚β”‚    β”‚
β”‚  β”‚  β”‚ Window Attn  β”‚   β”‚Compress β”‚   β”‚ Global β”‚β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚    β”‚
β”‚  β”‚                     ↓                       β”‚    β”‚
β”‚  β”‚             [Concatenation] (256 tokens)    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                        β”‚                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       ↓                           β”‚
β”‚  β”‚ molecule_fp  │──→ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚    (3D)      β”‚    β”‚       Fusion Projector    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ (Image <-> molecule_fp)   β”‚  β”‚
β”‚                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                        ↓                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚               Qwen2-VL-7B                   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1. DeepEncoder (Visual Perception)

We adopt the DeepEncoder architecture from DeepOCR (380M parameters) to efficiently process high-resolution molecular images (1024Γ—1024 pixels).

  • Components:

    • SAM-base (80M): 12 transformer layers, 768-dimensional embeddings, utilizing windowed attention for efficient local feature extraction
    • CLIP-large (300M): 24 transformer layers, 1024-dimensional embeddings, providing global semantic understanding
    • Conv 16Γ—: A compression module that reduces SAM feature dimensionality by a factor of 16
  • Token Flow:

    Input (1024Γ—1024) β†’ SAM (4096 tokens) β†’ Conv16Γ— (256 tokens)
    CLIP (256 tokens) β†’ Concatenation β†’ 2048-dimensional features
    

2. Fusion Projector

Unlike DeepOCR, which employs a simple linear projector for text processing, DeepMoLM utilizes a Fusion Projector to directly integrate 2D visual and 3D geometric modalities.

  • Inputs:

    • DeepEncoder Features: 256 visual tokens from the 2D molecular image
    • molecule_fp Fingerprints: 3D geometric fingerprint tokens
  • Fusion Mechanism: Cross-attention transformer decoder layer that attends image tokens to molecule_fp tokens, enabling effective multimodal integration

  • Output: Fused visual tokens at the LLM's hidden dimension, ready for downstream processing

3. LLM Interface

The fused multimodal tokens are projected into the Qwen2-VL-7B embedding space, enabling the generation of comprehensive molecular descriptions and accurate property predictions.

Installation

1. Clone the Repository

git clone https://github.com/1anj/DeepMoLM.git
cd DeepMoLM

2. Environment Setup

./environment_setup.sh deepmolm
conda activate deepmolm

3. Download Model Checkpoints

πŸ“’ Important: Model checkpoints are currently being organized and will be made available to the community soon. Please stay tuned for updates!

# SAM and CLIP checkpoints (combined checkpoint file)
# Target location: checkpoints/sam_clip_ckpt/model_cache/model-00001-of-000001.safetensors
huggingface-cli download pkulium/sam_clip_ckpt --local-dir ./checkpoints/sam_clip_ckpt

# Base LLM (Qwen2-VL-7B-Instruct)
huggingface-cli download Efficient-Large-Model/Qwen2-VL-7B-Instruct --local-dir ./checkpoints/Qwen2-VL-7B-Instruct

Inference

Molecular Description

python scripts/infer.py \
    --model-path ./checkpoints/deepmolm \
    --conv-mode auto \
    --text "Describe the input molecule." \
    --image demo/000561.png \
    --structure demo/000561.sdf

# Or

python scripts/infer.py \
    --model-path ./checkpoints/deepmolm \
    --conv-mode auto \
    --text "Describe the input molecule." \
    --image demo/000561.png \
    --temperature 0

Data Preparation

DeepMoLM requires a molecular dataset containing both 2D representations (images or SMILES strings for image generation) and 3D atomic coordinates. The datasets are primarily sourced from PubChem (Kim et al., 2021) and CheBI-20 (Edwards et al., 2022).

This project references and utilizes the following Hugging Face datasets:

  • 3D-MoIT: 3D Molecular Instruction Tuning dataset
  • Vis-CheBI20: Visual CheBI-20 dataset for molecular understanding
  • ChEBI-20-MM: ChEBI-20 multi-modal dataset

Training

Stage 1: Alignment (Vision-Language Pretraining)

This stage aligns the Fusion Projector with the LLM while keeping both the LLM and vision tower frozen.

bash scripts/run_stage1_align.sh

Stage 2: Visual Instruction Tuning

This stage fine-tunes both the Fusion Projector and the LLM for task-specific performance.

PubChem Dataset

Generalist Model:

bash scripts/run_stage2_finetune_generalist.sh

Specialist Models:

bash scripts/run_stage2_finetune_specialist_pubchem_cap.sh
bash scripts/run_stage2_finetune_specialist_pubchem_com.sh
bash scripts/run_stage2_finetune_specialist_pubchem_des.sh
bash scripts/run_stage2_finetune_specialist_pubchemqc_prop.sh

Note: Specialist model evaluation results are saved in ./eval_outputs/specialist_* with separate test/ and valid/ subdirectories.

CheBI-20 Dataset

bash scripts/run_stage2_finetune_chebi20.sh

Evaluation

Running Evaluation

bash scripts/eval/run_eval.sh \
    ./checkpoints/deepmolm-7b \
    ./data/3d-mol-dataset \
    ./eval_outputs/predictions.jsonl

Evaluation Metrics

python scripts/eval/parse_results.py \
    --file_path ./eval_outputs/predictions.jsonl 

GUI

DeepMoLM GUI

Acknowledgements

This project builds upon the following excellent works:

  • DeepOCR: DeepEncoder architecture and efficiency concepts
  • 3D-MoLM: 3D molecular language modeling foundations

References

  1. DeepSeek-AI. (2025). DeepSeek-OCR: Context-Aware Optical Compression
  2. VILA Framework. NVIDIA Research
  3. Segment Anything Model (SAM). Meta AI Research
  4. CLIP. OpenAI
  5. Qwen2-VL. Qwen Team

About

Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors