DeepMoLM is a multi-modal large language model designed for comprehensive molecular understanding through the fusion of high-resolution 2D visual encodings and 3D geometric fingerprints. Unlike traditional approaches that represent molecules as simple 1D strings (SMILES) or 2D graphs, DeepMoLM employs a DeepEncoder for fine-grained visual perception and a Fusion Projector that leverages cross-attention mechanisms over molecule_fp (3D fingerprint) structural data to achieve stereoscopic molecular comprehension.
DeepMoLM integrates the visual perception capabilities of DeepOCR with a specialized Fusion Projector designed for advanced molecular understanding.
flowchart TD
A[Input: 2D molecular image, 1024x1024] --> B[Patch embedding + SAM-base<br/>12L, 768d, window attn]
B --> C[SAM features, 4096 tokens]
C --> D[Conv 16x compression]
D --> E[Compressed tokens, 256 x 768d]
E --> F[CLIP-large encoder<br/>24L, 1024d]
F --> G[CLIP tokens, 256 x 1024d]
E --> H[Token concat + proj<br/>2048d]
G --> H
I[Input: molecule_fp<br/>3D geometric fingerprints] --> J[molecule_fp tokens]
H --> K[Fusion Projector<br/>Cross-Attention Transformer Decoder Layer]
J --> K
K --> L[Fused visual tokens]
L --> M[LLM interface projection]
M --> N[Qwen2-VL-7B]
Architecture
┌─────────────────────────────────────────────────────┐
│                      DeepMoLM                       │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌─────────────────────────────────────────────┐    │
│  │             DeepEncoder (380M)              │    │
│  │                                             │    │
│  │ ┌──────────────┐   ┌─────────┐   ┌─────────┐│    │
│  │ │   SAM-base   │──▶│  Conv   │──▶│  CLIP   ││    │
│  │ │    (80M)     │   │   16×   │   │ (300M)  ││    │
│  │ │ Window Attn  │   │Compress │   │ Global  ││    │
│  │ └──────────────┘   └─────────┘   └─────────┘│    │
│  │        │                              │     │    │
│  │        [Concatenation] (256 tokens)         │    │
│  └─────────────────────┬───────────────────────┘    │
│                        │                            │
│  ┌──────────────┐      ▼                            │
│  │ molecule_fp  │──▶┌───────────────────────────┐   │
│  │     (3D)     │   │     Fusion Projector      │   │
│  └──────────────┘   │  (Image <-> molecule_fp)  │   │
│                     └─────────────┬─────────────┘   │
│                                   ▼                 │
│  ┌─────────────────────────────────────────────┐    │
│  │                 Qwen2-VL-7B                 │    │
│  └─────────────────────────────────────────────┘    │
│                                                     │
└─────────────────────────────────────────────────────┘
We adopt the DeepEncoder architecture from DeepOCR (380M parameters) to efficiently process high-resolution molecular images (1024×1024 pixels).
- Components:
  - SAM-base (80M): 12 transformer layers, 768-dimensional embeddings, using windowed attention for efficient local feature extraction
  - CLIP-large (300M): 24 transformer layers, 1024-dimensional embeddings, providing global semantic understanding
  - Conv 16×: A compression module that reduces the number of SAM tokens by a factor of 16 (4096 → 256)
- Token Flow:
  Input (1024×1024) → SAM (4096 tokens) → Conv 16× (256 tokens) → CLIP (256 tokens) → Concatenation → 2048-dimensional features
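The token counts in this flow follow from simple arithmetic. The sketch below reproduces them under one assumption that is not stated above: SAM-base splits the image into 16×16 patches, which is consistent with the 4096-token figure. The function name is illustrative and is not part of the repository.

```python
def deepencoder_token_counts(image_size: int = 1024,
                             patch_size: int = 16,   # assumption: 16x16 patches
                             compression: int = 16) -> dict:
    """Return the number of tokens at each DeepEncoder stage."""
    patches_per_side = image_size // patch_size      # 1024 / 16 = 64
    sam_tokens = patches_per_side ** 2               # 64 * 64 = 4096
    compressed_tokens = sam_tokens // compression    # 4096 / 16 = 256
    return {
        "sam_tokens": sam_tokens,
        "compressed_tokens": compressed_tokens,
        # CLIP runs on the compressed sequence, so the count is unchanged.
        "clip_tokens": compressed_tokens,
    }


counts = deepencoder_token_counts()
print(counts)
```

With the default 1024×1024 input this gives 4096 SAM tokens and 256 tokens after compression, matching the flow described above.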
Unlike DeepOCR, which employs a simple linear projector for text processing, DeepMoLM utilizes a Fusion Projector to directly integrate 2D visual and 3D geometric modalities.
- Inputs:
  - DeepEncoder Features: 256 visual tokens from the 2D molecular image
  - molecule_fp Fingerprints: 3D geometric fingerprint tokens
- Fusion Mechanism: A cross-attention transformer decoder layer in which the image tokens attend to the molecule_fp tokens, enabling effective multimodal integration
- Output: Fused visual tokens at the LLM's hidden dimension, ready for downstream processing
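To make the attention direction concrete, here is a minimal, dependency-free sketch of single-head cross-attention in which image tokens act as queries and molecule_fp tokens act as keys and values. It is not the repository's Fusion Projector: a real transformer decoder layer also has learned query/key/value projections, multiple heads, residual connections, normalization, and a feed-forward block, all omitted here. The toy sizes and variable names are assumptions chosen for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: image tokens (list of d-dimensional vectors)
    keys, values: molecule_fp tokens (lists of d-dimensional vectors)
    Returns one fused vector per query token.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this image token to every fingerprint token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the fingerprint value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy sizes for illustration only: 3 image tokens, 2 fingerprint tokens, d = 4.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
fp_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
fused = cross_attention(image_tokens, fp_tokens, fp_tokens)
print(len(fused), len(fused[0]))
```

The output keeps one vector per image token, which is why the fused sequence can stand in for the visual tokens passed on to the LLM.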
The fused multimodal tokens are projected into the Qwen2-VL-7B embedding space, enabling the generation of comprehensive molecular descriptions and accurate property predictions.
git clone https://github.com/1anj/DeepMoLM.git
cd DeepMoLM
./environment_setup.sh deepmolm
conda activate deepmolm
Important: Model checkpoints are currently being organized and will be made available to the community soon. Please stay tuned for updates!
# SAM and CLIP checkpoints (combined checkpoint file)
# Target location: checkpoints/sam_clip_ckpt/model_cache/model-00001-of-000001.safetensors
huggingface-cli download pkulium/sam_clip_ckpt --local-dir ./checkpoints/sam_clip_ckpt
# Base LLM (Qwen2-VL-7B-Instruct)
huggingface-cli download Efficient-Large-Model/Qwen2-VL-7B-Instruct --local-dir ./checkpoints/Qwen2-VL-7B-Instruct
python scripts/infer.py \
--model-path ./checkpoints/deepmolm \
--conv-mode auto \
--text "Describe the input molecule." \
--image demo/000561.png \
--structure demo/000561.sdf
# Or
python scripts/infer.py \
--model-path ./checkpoints/deepmolm \
--conv-mode auto \
--text "Describe the input molecule." \
--image demo/000561.png \
--temperature 0
DeepMoLM requires a molecular dataset containing both 2D representations (images, or SMILES strings from which images are generated) and 3D atomic coordinates. The datasets are primarily sourced from PubChem (Kim et al., 2021) and ChEBI-20 (Edwards et al., 2022).
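As an illustration of the 3D side of the data, the sketch below reads atomic coordinates from the first mol block of an SDF string, assuming the fixed-width V2000 layout. How the repository converts coordinates into molecule_fp tokens is not described here, so this only covers reading the raw coordinates; the function name and the water example are hypothetical, and in practice a cheminformatics toolkit would usually handle the parsing.

```python
def read_sdf_coordinates(sdf_text: str):
    """Parse the first V2000 mol block into (symbol, x, y, z) tuples."""
    lines = sdf_text.splitlines()
    counts_line = lines[3]                 # 4th line holds the atom and bond counts
    num_atoms = int(counts_line[0:3])      # first 3 columns: number of atoms
    atoms = []
    for line in lines[4:4 + num_atoms]:
        # V2000 atom lines use three 10-character coordinate fields.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        symbol = line[31:34].strip()
        atoms.append((symbol, x, y, z))
    return atoms


# Hand-written water molecule used purely as test input.
WATER_SDF = "\n".join([
    "water",
    "  example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.1173 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
    "$$$$",
])

atoms = read_sdf_coordinates(WATER_SDF)
print(atoms)
```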
This project references and utilizes the following Hugging Face datasets:
- 3D-MoIT: 3D Molecular Instruction Tuning dataset
- Vis-CheBI20: Visual CheBI-20 dataset for molecular understanding
- ChEBI-20-MM: ChEBI-20 multi-modal dataset
This stage aligns the Fusion Projector with the LLM while keeping both the LLM and vision tower frozen.
bash scripts/run_stage1_align.sh
This stage fine-tunes both the Fusion Projector and the LLM for task-specific performance.
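The difference between the two stages comes down to which components receive gradient updates. The sketch below summarizes that schedule as plain data. The module names are hypothetical and may not match the repository's attribute names, and treating the vision tower as frozen in stage 2 is an assumption, since the text only says that the Fusion Projector and the LLM are fine-tuned.

```python
# Hypothetical module names; the repository's actual names may differ.
MODULES = ("vision_tower", "fusion_projector", "llm")

TRAINABLE_BY_STAGE = {
    1: {"fusion_projector"},          # alignment: LLM and vision tower frozen
    2: {"fusion_projector", "llm"},   # fine-tuning: vision tower assumed frozen
}


def trainable_flags(stage: int) -> dict:
    """Map each module name to True if it is updated in the given stage."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}


for stage in (1, 2):
    print(stage, trainable_flags(stage))
```

In a PyTorch training script, flags like these would typically be applied by setting `requires_grad` on each module's parameters.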
Generalist Model:
bash scripts/run_stage2_finetune_generalist.sh
Specialist Models:
bash scripts/run_stage2_finetune_specialist_pubchem_cap.sh
bash scripts/run_stage2_finetune_specialist_pubchem_com.sh
bash scripts/run_stage2_finetune_specialist_pubchem_des.sh
bash scripts/run_stage2_finetune_specialist_pubchemqc_prop.sh
Note: Specialist model evaluation results are saved in ./eval_outputs/specialist_* with separate test/ and valid/ subdirectories.
bash scripts/run_stage2_finetune_chebi20.sh
bash scripts/eval/run_eval.sh \
./checkpoints/deepmolm-7b \
./data/3d-mol-dataset \
./eval_outputs/predictions.jsonl
python scripts/eval/parse_results.py \
--file_path ./eval_outputs/predictions.jsonl
This project builds upon the following excellent works:
- DeepOCR: DeepEncoder architecture and efficiency concepts
- 3D-MoLM: 3D molecular language modeling foundations
- DeepSeek-AI. (2025). DeepSeek-OCR: Contexts Optical Compression
- VILA Framework. NVIDIA Research
- Segment Anything Model (SAM). Meta AI Research
- CLIP. OpenAI
- Qwen2-VL. Qwen Team

