mjslee0921/ncflow

Codebase for NCFlow

This repository contains the training and inference code for NCFlow, a flow matching model that predicts the 3D conformations of arbitrary non-canonical amino acids (ncAAs) in proteins.

Reference: Jin Sub Lee and Philip M. Kim. “Design of peptides with non-canonical amino acids using flow matching”. bioRxiv (2025). https://www.biorxiv.org/content/10.1101/2025.07.31.667780v1

Inference

First, download checkpoints here and place them in the checkpoints folder.

Small toy datasets can also be downloaded here and placed under the data folder.

Sampling can be done on a preprocessed test set with the following command:

python -m samplers.sampler ncaa

where ncaa refers to the YAML file in config/inference/ncaa.yaml.
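The naming convention above (a short name on the command line, a YAML file under config/) can be sketched as follows. This is only an illustration of the convention, not the repository's actual config loader; `resolve_config` is a hypothetical helper.

```python
from pathlib import Path


def resolve_config(name: str, mode: str = "inference", root: str = ".") -> Path:
    """Map a short config name such as 'ncaa' to config/<mode>/<name>.yaml.

    Hypothetical helper: it mirrors the convention described in this README
    and does not call into the NCFlow code.
    """
    if mode not in ("inference", "training"):
        raise ValueError(f"unknown mode: {mode!r}")
    return Path(root) / "config" / mode / f"{name}.yaml"


print(resolve_config("ncaa").as_posix())  # config/inference/ncaa.yaml
```

Checking that the resolved path exists before launching a sampler can catch a mistyped config name early.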

For design via deep mutational scanning, you can run the following command:

python design_peptides_smiles.py <config> <cif file> <comma-separated receptor chain IDs> <peptide chain> <name for labeling>

For instance:

python design_peptides_smiles.py ncaa ./data/mdm2.cif A,C D mdm2

By default, the ncAA pool used for sampling is taken from WuXi AppTec, but a custom ncAA list can be generated by adapting the script process_ncaa_pool_for_design.py.

Add the --cyclic flag if the peptide is head-to-tail cyclic and the --disulfide flag if it contains disulfide bridges. These flags ensure that the corresponding bonds are included in the output SDF files.
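When designing against many targets, it can help to assemble the design command programmatically. The sketch below builds the argument list in the order shown above. `build_design_command` is a hypothetical helper, and appending the optional flags after the positional arguments is an assumption about how the script parses its arguments.

```python
import shlex


def build_design_command(config, cif_path, receptor_chains, peptide_chain, label,
                         cyclic=False, disulfide=False):
    """Return the argv list for design_peptides_smiles.py (hypothetical helper)."""
    if peptide_chain in receptor_chains:
        raise ValueError("peptide chain must not also be a receptor chain")
    cmd = [
        "python", "design_peptides_smiles.py",
        config,
        str(cif_path),
        ",".join(receptor_chains),  # receptor chain IDs are comma-separated
        peptide_chain,
        label,
    ]
    # Assumption: optional flags may follow the positional arguments.
    if cyclic:
        cmd.append("--cyclic")
    if disulfide:
        cmd.append("--disulfide")
    return cmd


cmd = build_design_command("ncaa", "./data/mdm2.cif", ["A", "C"], "D", "mdm2")
print(shlex.join(cmd))
# python design_peptides_smiles.py ncaa ./data/mdm2.cif A,C D mdm2
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.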

For PubChem and Plinder sampling:

python -m samplers.sampler_pubchem pubchem

python -m samplers.sampler_plinder plinder

Scoring

AEV-PLIG and ATM are used to score the designed variants and identify the strongest predicted binders. Please refer to the respective codebases for installation and inference.

Note that we use the scripts from Quantumbind RBFE to run ATM, since they provide a more intuitive Python wrapper.

The modified scripts for running AEV-PLIG and ATM directly on designed peptide variants will be uploaded shortly.

Training

Stage 1: PubChem

First, download the PubChem3D SDF files from here: https://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/01_conf_per_cmpd/SDF/
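The download can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the server returns an HTML directory index whose links point at the .sdf.gz archives. `list_sdf_archives` and `download_archives` are hypothetical helpers, not part of this repository.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/01_conf_per_cmpd/SDF/"


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def list_sdf_archives(index_html):
    """Return the sorted, de-duplicated .sdf.gz links found in an index page."""
    parser = _LinkCollector()
    parser.feed(index_html)
    return sorted({h for h in parser.hrefs if h.endswith(".sdf.gz")})


def download_archives(out_dir, limit=None):
    """Download up to `limit` archives into out_dir, skipping existing files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urlopen(BASE_URL) as resp:  # assumption: the response is an HTML index
        names = list_sdf_archives(resp.read().decode("utf-8", errors="replace"))
    selected = names[:limit]
    for name in selected:
        target = out / Path(name).name
        if not target.exists():
            urlretrieve(urljoin(BASE_URL, name), target)
    return selected
```

Calling `download_archives("pubchem_sdf", limit=2)` first is a cheap way to test the pipeline on a couple of archives before fetching the full set.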

Then, you can run the following command:

python -m preprocessing.process_pubchem <path to folder with .sdf.gz> <outpath>

This will preprocess the SDF files into .pth files containing features necessary for pretraining.
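Before preprocessing, it can be useful to check that the downloaded archives are intact. The sketch below counts the molecule records in each .sdf.gz file by counting the `$$$$` record terminators defined by the SDF format. `count_sdf_records` and `summarize_folder` are hypothetical helpers and do not use any NCFlow code.

```python
import gzip
from pathlib import Path


def count_sdf_records(path):
    """Count molecule records in a gzipped SDF file.

    Each record in an SDF file ends with a line containing only '$$$$'.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip() == "$$$$":
                count += 1
    return count


def summarize_folder(folder):
    """Map each .sdf.gz file name in `folder` to its record count."""
    return {p.name: count_sdf_records(p) for p in sorted(Path(folder).glob("*.sdf.gz"))}
```

A truncated download typically raises an error while decompressing, so running `summarize_folder` on the input folder can surface corrupt archives before a long preprocessing run.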

Then, you can run the command:

python -m trainers.trainer_pubchem pubchem

where pubchem refers to the config/training/pubchem.yaml file.

Stage 2: Plinder

First, download the Plinder dataset (https://www.plinder.sh/); NCFlow uses the 2024/06 release. Only the systems folder from Plinder is needed to process the dataset.

Then, you can run the following command:

python -m preprocessing.process_plinder

Note that the input/output paths in the processing script are hard-coded, so you will need to edit them to match your setup.

Then, you can train on the processed pth files with the command:

python -m trainers.trainer_plinder plinder

where plinder refers to the config/training/plinder.yaml file.

To fine-tune from the PubChem checkpoint, add the path to the checkpoint under ckpt in the YAML file and pass the --resume flag.
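A small pre-flight check can confirm that the checkpoint path is set before launching a fine-tuning run. The sketch below assumes that `ckpt` appears in the YAML file as a plain `ckpt: <path>` entry on a single line; the real layout of config/training/plinder.yaml may differ. `find_ckpt_path` and `finetune_command` are hypothetical helpers.

```python
def find_ckpt_path(yaml_text):
    """Return the value of the first `ckpt:` entry, or None if unset.

    Assumption: the key is a simple scalar written as `ckpt: <path>` on one line.
    """
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("ckpt:"):
            value = line[len("ckpt:"):].strip().strip("'\"")
            return value if value and value != "null" else None
    return None


def finetune_command(config_name, yaml_text):
    """Build the Stage 2 command with --resume, refusing to run without ckpt."""
    if find_ckpt_path(yaml_text) is None:
        raise ValueError("set `ckpt` in the YAML file before passing --resume")
    return ["python", "-m", "trainers.trainer_plinder", config_name, "--resume"]


example = "lr: 1e-4\nckpt: checkpoints/pubchem.pth  # hypothetical path\n"
print(" ".join(finetune_command("plinder", example)))
# python -m trainers.trainer_plinder plinder --resume
```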

Stage 3: NCAAs

First, download a snapshot of the PDB using conventional tools (e.g. Biotite's biotite.database.rcsb module).
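As one possible way to fetch structures with Biotite, the sketch below downloads a list of PDB entries in batches using `biotite.database.rcsb.fetch`. Which entries make up the snapshot and which file format the processing script expects are not stated here, so the ID list and the "cif" format are assumptions. `chunked` and `download_entries` are hypothetical helpers.

```python
from pathlib import Path


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def download_entries(pdb_ids, out_dir, batch_size=100):
    """Download the given PDB entries into out_dir and return the file paths."""
    # Imported lazily so the batching helper works without Biotite installed.
    import biotite.database.rcsb as rcsb

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    for batch in chunked(list(pdb_ids), batch_size):
        # Assumption: mmCIF ("cif") is the format the processing script expects.
        paths.extend(rcsb.fetch(batch, "cif", target_path=str(out_dir)))
    return paths


print(chunked(["1abc", "2def", "3ghi"], 2))  # [['1abc', '2def'], ['3ghi']]
```

For example, `download_entries(["1ycr"], "pdb_snapshot")` would fetch a single entry. For a full snapshot, the ID list would need to come from an RCSB search or a published entry list.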

Then, you can run the processing script:

python -m preprocessing.process_ncaa

Training is performed similarly to previous stages:

python -m trainers.trainer ncaa
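The three training stages can be chained with a small driver. The following is a sketch under assumptions, not a script shipped with this repository: it assumes preprocessing is already done and that `ckpt` is set in the Plinder config. Whether Stage 3 also needs --resume is not stated above, so the flag is only added for Stage 2. `run_stages` is a hypothetical helper.

```python
import subprocess

# Training commands in the order described above.
STAGES = [
    ("pubchem", ["python", "-m", "trainers.trainer_pubchem", "pubchem"]),
    ("plinder", ["python", "-m", "trainers.trainer_plinder", "plinder", "--resume"]),
    ("ncaa", ["python", "-m", "trainers.trainer", "ncaa"]),
]


def run_stages(stages=STAGES, dry_run=True):
    """Run each stage in order, or only return the commands when dry_run is True."""
    planned = []
    for _name, cmd in stages:
        planned.append(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(cmd, check=True)
    return planned


for line in run_stages(dry_run=True):
    print(line)
```

Keeping `dry_run=True` as the default makes it harder to start a multi-day training run by accident.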

Contact

If there are any comments or questions, feel free to email me at mjslee0921@gmail.com.
