This repository contains a deep learning framework for identifying chromatin-associated RNA-binding proteins (caRBPs) using protein language model (pLM) embeddings and a hybrid CNN-BiLSTM architecture.
- pLM Integration: Leverages ProtT5-XL-UniRef50 embeddings to capture high-dimensional evolutionary and biophysical information.
- Hybrid Loss Function: Implements a Hybrid Dice-Focal Loss to specifically address class imbalance in biological datasets.
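As a rough illustration of how a Dice term and a Focal term can be combined for an imbalanced binary task, the sketch below uses plain Python. The function name `hybrid_dice_focal_loss`, the mixing weight `mix`, and the defaults `gamma=2.0`, `alpha=0.25`, and `smooth=1.0` are assumptions made for illustration; they are not taken from this repository's training scripts.

```python
import math


def hybrid_dice_focal_loss(y_true, y_prob, mix=0.5, gamma=2.0, alpha=0.25,
                           smooth=1.0, eps=1e-7):
    """Weighted sum of a soft Dice loss and a binary focal loss.

    All defaults are illustrative assumptions, not the repository's settings.
    """
    # Clip probabilities so the logarithm below is always defined.
    probs = [min(max(p, eps), 1.0 - eps) for p in y_prob]

    # Soft Dice loss: 1 - (2 * overlap + smooth) / (sum of both + smooth).
    intersection = sum(t * p for t, p in zip(y_true, probs))
    dice = 1.0 - (2.0 * intersection + smooth) / (sum(y_true) + sum(probs) + smooth)

    # Binary focal loss: cross-entropy down-weighted for easy examples.
    focal_terms = []
    for t, p in zip(y_true, probs):
        p_t = p if t == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha  # class-balancing weight
        focal_terms.append(-a_t * (1.0 - p_t) ** gamma * math.log(p_t))
    focal = sum(focal_terms) / len(focal_terms)

    return mix * dice + (1.0 - mix) * focal


# Hypothetical toy labels: one positive among three negatives.
labels = [1, 0, 0, 0]
good = hybrid_dice_focal_loss(labels, [0.9, 0.1, 0.2, 0.1])
bad = hybrid_dice_focal_loss(labels, [0.1, 0.9, 0.8, 0.9])
print(good < bad)
```

The Dice term rewards overlap with the rare positive class, while the focal term keeps the many easy negatives from dominating the gradient; the mixing weight trades one against the other.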
- Python 3.8
- PyTorch 2.4.1
- TensorFlow 2.13.1
- Transformers 4.46.3
- Scikit-learn 1.3.2
- Pandas 2.0.3
- NumPy 1.24.3
- Matplotlib 3.7.5
- data/: Directory containing the datasets in .csv format.
- inference/: Directory containing datasets in .csv format for inference and scripts to predict potential caRBPs.
- model_training/: Directory containing training scripts for peptide and protein models, with or without pLM features.
- model/: Directory containing the trained pLM-CNN-BiLSTM model used for inference.
- README.md: Project documentation.
- inference.py: A script for high-throughput prediction on new protein sequences supplied as a .csv file.
- gen_T5_feature.py: Extracts features from protein sequences using the T5EncoderModel.
- plm_protein_model.py: The final training script for full-length proteins, incorporating pLM features and a dual-pathway CNN-BiLSTM architecture.
- peptide_model.py: A specialized model for short peptide fragments without pLM, used for classifying chromatin-binding peptides.
- protein_model.py: A specialized model for full-length proteins without pLM, used for classifying caRBPs.
- plm_peptide_model.py: A specialized model for short peptide fragments with pLM, used for classifying chromatin-binding peptides.
git clone https://github.com/callmeracle/caRBP-Pred.git
cd caRBP-Pred
conda create -n carbp_env python=3.8
conda activate carbp_env
The ProtT5-XL model can be downloaded from Hugging Face (https://huggingface.co/Rostlab/prot_t5_xl_uniref50/tree/main).
Embedding files can be downloaded from Google Drive (https://drive.google.com/drive/folders/1TrboNK1oMGsd03l1G9CyPhQWVOuyt2Gt?usp=sharing).
1. Feature Extraction
Use gen_T5_feature.py to extract pre-trained language model features from protein sequences:
python gen_T5_feature.py
- Input format:
protein_id,sequence
P12345,MKTLLILGL...
P67890,GSHMASSNA...
- Output:
protein_plm_residue.npy
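Before sequences reach a ProtT5 tokenizer they are usually normalized as the ProtT5 model card describes: rare residues (U, Z, O, B) are mapped to X and residues are separated by spaces. The sketch below applies that step to input in the `protein_id,sequence` format. Whether gen_T5_feature.py performs exactly this preprocessing is an assumption, and the helper name `load_sequences` and the toy sequences are hypothetical.

```python
import csv
import io
import re


def load_sequences(csv_text):
    """Parse 'protein_id,sequence' rows into (id, ProtT5-ready string) pairs."""
    prepared = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seq = row["sequence"].strip().upper()
        # Map rare or ambiguous residues to X, as the ProtT5 model card recommends.
        seq = re.sub(r"[UZOB]", "X", seq)
        # ProtT5 tokenizers expect residues separated by single spaces.
        prepared.append((row["protein_id"].strip(), " ".join(seq)))
    return prepared


# Hypothetical toy input; real sequences would come from the files in data/.
example = "protein_id,sequence\nP00001,MKTUL\nP00002,GSHB\n"
pairs = load_sequences(example)
print(pairs)
```

The resulting strings can then be passed to the tokenizer and encoder to obtain per-residue embeddings.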
2. Model Training
Run plm_protein_model.py for 5-fold cross-validation training:
python plm_protein_model.py
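To make the 5-fold procedure concrete, here is a minimal sketch of a stratified split written in plain Python. Stratification, the fixed seed, and the helper name `stratified_kfold` are assumptions; the training script may instead rely on scikit-learn's splitters or on a non-stratified split.

```python
import random


def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs with class ratios preserved per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the k folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


labels = [1] * 10 + [0] * 40  # hypothetical imbalanced label vector
splits = list(stratified_kfold(labels))
for fold, (train, val) in enumerate(splits, start=1):
    positives = sum(labels[i] for i in val)
    print(f"fold {fold}: {len(train)} train, {len(val)} val, {positives} positive")
```

Each sample appears in exactly one validation fold, so averaging the five validation scores gives an estimate that uses every labeled protein once.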
3. Model Inference
Use the previously trained model to predict new sequences:
python inference/inference.py
- Configuration (modify in main):
task = {
"input_csv": "inference/all_fasta.csv", # Input sequences
"ptm_npy_path": "./inference_protein_plm_residue.npy", # Pre-computed features
"model_path": "../model/model.h5", # Model path
"output_csv": "Final_BiLSTM_pLM_Inference.csv" # Output results
}
- Output results:
Gene_Name,Prediction_Prob,Decision,Threshold
Protein_A,0.8923,Positive,0.7
Protein_B,0.2341,Negative,0.7
...
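The output columns above suggest a simple thresholding step on the predicted probability. A minimal sketch of that step follows; the helper name `write_predictions` is hypothetical, and treating a probability equal to the threshold as Positive is an assumption about how inference.py behaves.

```python
import csv
import io


def write_predictions(names, probs, threshold=0.7):
    """Return CSV text with the columns shown above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Gene_Name", "Prediction_Prob", "Decision", "Threshold"])
    for name, prob in zip(names, probs):
        # Assumption: a probability equal to the threshold counts as Positive.
        decision = "Positive" if prob >= threshold else "Negative"
        writer.writerow([name, f"{prob:.4f}", decision, threshold])
    return buffer.getvalue()


csv_text = write_predictions(["Protein_A", "Protein_B"], [0.8923, 0.2341])
print(csv_text)
```

Recording the threshold in every row keeps each result file self-describing if the cutoff is later tuned.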