B3BPFN

B3BPFN is a reproducible repository for blood-brain barrier-penetrating peptide (BBBP) prediction. The current pipeline combines ESM2 sequence embeddings, iFeatureOmega physicochemical descriptors, mutual-information-based feature selection, and a TabPFN classifier.

The repository includes:

the curated benchmark dataset used in the current experiments
the training script for rebuilding the model pipeline
the inference script for scoring new peptide sequences
the exported preprocessing objects and trained classifier

Overview

The released workflow follows four stages:

Encode peptide sequences with facebook/esm2_t33_650M_UR50D.
Extract handcrafted descriptors with iFeatureOmegaCLI.
Fuse features, remove constant dimensions, standardize them, and keep the top 700 features with mutual information.
Train a TabPFNClassifier on an undersampled training set and apply the saved threshold for final prediction.

Current benchmark scale in this repository:

Positive peptides: 426
Negative peptides: 6865
Feature space before selection: 2121
Retained features: 700

Reported performance from the current internal benchmark:

Sensitivity: 0.9294
Specificity: 0.8824
Accuracy: 0.9059
MCC: 0.8127
AUROC: 0.9460

Repository Layout

B3BPFN/
├── data/
│   ├── all_negative.fasta
│   └── all_positive.fasta
├── models/
│   ├── kbest_selector.pkl
│   ├── optimal_threshold.txt
│   ├── standard_scaler.pkl
│   ├── tabpfn_classifier.pkl
│   └── variance_selector.pkl
├── tests/
│   └── test_predict_cli.py
├── predict_peptide.py
├── train_peptide_final.py
└── README.md

Environment

This project is intended to run on a Linux server or a compute node with a configured Python environment. At minimum, you will need:

Python 3.10 or newer
numpy
pandas
torch
joblib
scikit-learn
transformers
tabpfn
iFeatureOmegaCLI

Notes:

The ESM2 backbone is loaded from Hugging Face on first use.
Training and large-batch inference are best run on a GPU-enabled server.
iFeatureOmegaCLI installation can vary by environment; follow your server setup conventions if you already maintain a shared bioinformatics stack.

Data

The benchmark data are stored in FASTA format:

data/all_positive.fasta: curated BBB-penetrating peptides
data/all_negative.fasta: curated non-BBB-penetrating peptides

The current training script builds a balanced test set by holding out 20% of the positive peptides and sampling the same number of negatives.

Training

Run training from the repository root:

python3 train_peptide_final.py

The script will:

read the FASTA files under data/
extract ESM2 and iFeatureOmega features
perform variance filtering, scaling, and top-700 feature selection
undersample the negative class to a 1:5 positive-to-negative ratio
train the TabPFN classifier
write the resulting artifacts to models/

Inference

Prepare an input FASTA file such as:

>pep_1
RRWQWRMKKLG
>pep_2
ACDEFGHIKLMNP

Then run:

python3 predict_peptide.py -i input.fasta -o bbb_predictions.csv

Optional arguments:

-m, --model_dir: path to the saved model directory, default is models
-o, --output: output CSV path, default is bbb_predictions.csv

The output CSV contains:

ID
Sequence
Probability
Prediction (BBB+ or BBB-)

Reproducibility Notes

The training script uses a fixed random seed of 42.
The exported objects in models/ are the exact preprocessing and classification artifacts used by the inference script.
tabpfn_classifier.pkl is tracked with Git LFS because of its size.

Verification

This repository currently includes a lightweight regression test to ensure that the default inference model path stays aligned with the published repository layout:

python3 -m unittest tests/test_predict_cli.py

Public Release Status

This repository is being prepared as a code release first. Paper metadata, citation information, and author details are intentionally omitted at this stage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

B3BPFN

Overview

Repository Layout

Environment

Data

Training

Inference

Reproducibility Notes

Verification

Public Release Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
models		models
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
predict_peptide.py		predict_peptide.py
train_peptide_final.py		train_peptide_final.py

Folders and files

Latest commit

History

Repository files navigation

B3BPFN

Overview

Repository Layout

Environment

Data

Training

Inference

Reproducibility Notes

Verification

Public Release Status

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages