Codebase for NCFlow

This repository contains the training and inference code for NCFlow, a flow matching model that predicts the 3D conformations of arbitrary non-canonical amino acids (ncAAs) in proteins.

Reference: Jin Sub Lee and Philip M. Kim. “Design of peptides with non-canonical amino acids using flow matching”. bioRxiv (2025). https://www.biorxiv.org/content/10.1101/2025.07.31.667780v1

Inference

First, download checkpoints here and place them in the checkpoints folder.

Small toy datasets can also be downloaded here and placed under the data folder.

Sampling can be done on a preprocessed test set with the following command:

python -m samplers.sampler ncaa

where ncaa refers to the YAML file in config/inference/ncaa.yaml.
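
As a rough illustration of the naming convention above, the sketch below maps a short config name to a YAML path. The helper resolve_config and its stage and root parameters are hypothetical and are not taken from the repository. Only the config/inference/ and config/training/ layout comes from this README.

```python
from pathlib import Path


def resolve_config(name: str, stage: str = "inference", root: str = "config") -> Path:
    """Map a short config name to its YAML path (hypothetical helper)."""
    if stage not in {"inference", "training"}:
        raise ValueError(f"unknown stage: {stage!r}")
    return Path(root) / stage / f"{name}.yaml"


print(resolve_config("ncaa").as_posix())                 # config/inference/ncaa.yaml
print(resolve_config("pubchem", "training").as_posix())  # config/training/pubchem.yaml
```

Checking that the resolved path exists before launching a long sampling run is a cheap way to catch a mistyped config name.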

For design via deep mutational scanning, you can run the following command:

python design_peptides_smiles.py <config> <cif file> <comma-separated receptor chain IDs> <peptide chain> <name for labeling>

For instance:

python design_peptides_smiles.py ncaa ./data/mdm2.cif A,C D mdm2

By default, the ncAA pool used for sampling is taken from WuXi AppTec, but a custom ncAA list can be generated by adapting the script process_ncaa_pool_for_design.py.

The flag --cyclic should be added if the peptide is head-to-tail cyclic, and --disulfide if there are disulfide bridges. This ensures that these bonds are included in the output SDF files.
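
To make the argument order concrete, here is a minimal argparse sketch that mirrors the documented command line. It is not the parser from design_peptides_smiles.py. The destination names (cif_file, receptor_chains, and so on) and the helper parse_chain_ids are assumptions chosen for illustration, and the real script may handle its arguments differently.

```python
import argparse


def parse_chain_ids(value: str) -> list:
    """Split a comma-separated string such as 'A,C' into ['A', 'C']."""
    chains = [c.strip() for c in value.split(",") if c.strip()]
    if not chains:
        raise argparse.ArgumentTypeError("at least one receptor chain ID is required")
    return chains


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical mirror of the documented usage; names are illustrative only.
    parser = argparse.ArgumentParser(prog="design_peptides_smiles.py")
    parser.add_argument("config", help="config name, e.g. ncaa")
    parser.add_argument("cif_file", help="path to the complex structure in CIF format")
    parser.add_argument("receptor_chains", type=parse_chain_ids,
                        help="comma-separated receptor chain IDs, e.g. A,C")
    parser.add_argument("peptide_chain", help="chain ID of the peptide")
    parser.add_argument("name", help="label used for the outputs")
    parser.add_argument("--cyclic", action="store_true",
                        help="peptide is head-to-tail cyclic")
    parser.add_argument("--disulfide", action="store_true",
                        help="peptide contains disulfide bridges")
    return parser


args = build_parser().parse_args(["ncaa", "./data/mdm2.cif", "A,C", "D", "mdm2", "--cyclic"])
print(args.receptor_chains, args.peptide_chain, args.cyclic, args.disulfide)
# ['A', 'C'] D True False
```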

For PubChem and Plinder sampling:

python -m samplers.sampler_pubchem pubchem

python -m samplers.sampler_plinder plinder

Scoring

AEV-PLIG and ATM are used to score the designed variants and identify the strongest predicted binders. Please use the respective codebases for installation and inference.

Note that we run ATM using the scripts from Quantumbind RBFE, which provides a more intuitive Python wrapper.

The modified scripts for running AEV-PLIG and ATM directly on designed peptide variants will be uploaded shortly.

Training

Stage 1: PubChem

First, download the PubChem3D SDF files from here: https://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/01_conf_per_cmpd/SDF/

Then, you can run the following command:

python -m preprocessing.process_pubchem <path to folder with .sdf.gz> <outpath>

This will preprocess the SDF files into .pth files containing features necessary for pretraining.
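
Before starting a long preprocessing run, it can help to confirm that the downloaded archives are readable. The sketch below uses only the standard library and relies on the SDF convention that each molecule record ends with a line containing $$$$. The function names count_sdf_records and summarize_folder are hypothetical and are not part of preprocessing.process_pubchem.

```python
import gzip
from pathlib import Path


def count_sdf_records(path: Path) -> int:
    """Count molecule records in a gzipped SDF file (each record ends with '$$$$')."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip() == "$$$$")


def summarize_folder(folder: str) -> dict:
    """Return {file name: record count} for every .sdf.gz file in the folder."""
    files = sorted(Path(folder).glob("*.sdf.gz"))
    return {p.name: count_sdf_records(p) for p in files}
```

Calling summarize_folder on the download folder returns a per-file record count. A truncated download typically fails while being decompressed, and a file that reports zero records is also worth re-downloading.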

Then, you can run the command:

python -m trainers.trainer_pubchem pubchem

where pubchem refers to the config/training/pubchem.yaml file.

Stage 2: Plinder

First, download the Plinder dataset (https://www.plinder.sh/); NCFlow uses the 2024/06 release. Only the systems folder from Plinder is needed to process the dataset.

Then, you can run the following command:

python -m preprocessing.process_plinder

Note that the input/output paths in the processing script are hard-coded, so you will need to edit them to match your setup.

Then, you can train on the processed pth files with the command:

python -m trainers.trainer_plinder plinder

where plinder refers to the config/training/plinder.yaml file.

To finetune from the PubChem checkpoint, add the path to the checkpoint under ckpt in the YAML file and add the --resume flag to the training command.
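
For scripted runs, the training command can be assembled programmatically. The sketch below only builds and prints the command. trainer_command is a hypothetical helper, and it assumes that --resume is appended after the config name, which this README does not state explicitly. The ckpt entry in the YAML file still has to be set by hand.

```python
import shlex


def trainer_command(module: str, config: str, resume: bool = False) -> list:
    """Build the argument list for one training stage (hypothetical helper)."""
    cmd = ["python", "-m", module, config]
    if resume:
        # Assumed placement of the flag; check the trainer's own argument handling.
        cmd.append("--resume")
    return cmd


cmd = trainer_command("trainers.trainer_plinder", "plinder", resume=True)
print(shlex.join(cmd))
# python -m trainers.trainer_plinder plinder --resume
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the stage.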

Stage 3: NCAAs

First, you must download a snapshot of the PDB using conventional tools (e.g., Biotite's rcsb module, biotite.database.rcsb).
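
One possible way to fetch structures with Biotite is sketched below. It assumes Biotite is installed and network access is available, and it downloads mmCIF files in batches. The helpers chunked and download_entries, the batch size, and the output folder are illustrative assumptions. The directory layout expected by preprocessing.process_ncaa is not specified here, so adjust the output location accordingly.

```python
from pathlib import Path


def chunked(items: list, size: int) -> list:
    """Split a list of PDB IDs into batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def download_entries(pdb_ids: list, out_dir: str, batch_size: int = 100) -> None:
    """Download mmCIF files for the given PDB IDs into out_dir."""
    # Imported lazily so that chunked() can be used without Biotite installed.
    import biotite.database.rcsb as rcsb

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for batch in chunked(pdb_ids, batch_size):
        # With the default overwrite=False, files that already exist are not re-downloaded.
        rcsb.fetch(batch, "cif", target_path=out_dir)


# Example call (hypothetical output folder):
# download_entries(["1YCR"], "./pdb_snapshot")
```

Downloading in batches makes it easy to resume an interrupted run, since files already present in the target folder are skipped.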

Then you can run the processing script with:

python -m preprocessing.process_ncaa

Training is performed similarly to previous stages:

python -m trainers.trainer ncaa

Contact

If there are any comments or questions, feel free to email me at mjslee0921@gmail.com.
