callmeracle/caRBP-Pred

Deep Learning-based Prediction of Chromatin-Associated RNA-Binding Proteins

This repository contains a deep learning framework for identifying chromatin-associated RNA-binding proteins (caRBPs) using protein language model (pLM) embeddings and a hybrid CNN-BiLSTM architecture.

Key Features

  • pLM Integration: Leverages ProtT5-XL-UniRef50 embeddings to capture high-dimensional evolutionary and biophysical information.
  • Hybrid Loss Function: Implements a Hybrid Dice-Focal Loss to specifically address class imbalance in biological datasets.
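
As a rough illustration of how a combined Dice and focal loss can counter class imbalance, the NumPy sketch below mixes the two terms with a single weight. It is not taken from the repository's training scripts: the function names, the mixing weight `dice_weight`, and the defaults for `gamma`, `alpha`, and `smooth` are all assumptions chosen for illustration.

```python
import numpy as np


def dice_loss(y_true, y_prob, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum_true + sum_prob + smooth)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    overlap = np.sum(y_true * y_prob)
    denom = np.sum(y_true) + np.sum(y_prob) + smooth
    return float(1.0 - (2.0 * overlap + smooth) / denom)


def focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights examples the model already classifies well."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # p_t is the probability assigned to the true class of each example.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def hybrid_dice_focal_loss(y_true, y_prob, dice_weight=0.5):
    """Weighted sum of the two terms; dice_weight is a hypothetical hyperparameter."""
    return dice_weight * dice_loss(y_true, y_prob) + (1.0 - dice_weight) * focal_loss(y_true, y_prob)


# One positive among four samples, mimicking an imbalanced batch.
labels = [1, 0, 0, 0]
good = hybrid_dice_focal_loss(labels, [0.9, 0.1, 0.1, 0.1])
bad = hybrid_dice_focal_loss(labels, [0.1, 0.9, 0.9, 0.9])
```

The Dice term scores overlap on the positive class as a whole, so it is not dominated by the many easy negatives, while the focal term concentrates the per-sample gradient on hard examples.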

Dependencies

  • Python 3.8
  • PyTorch 2.4.1
  • TensorFlow 2.13.1
  • Transformers 4.46.3
  • Scikit-learn 1.3.2
  • Pandas 2.0.3
  • Numpy 1.24.3
  • Matplotlib 3.7.5

Contents

  • data/: Directory containing the datasets in .csv format.
  • inference/: Directory containing datasets in .csv format for inference and scripts to predict potential caRBPs.
  • model_training/: Directory containing training scripts for peptide and protein models, with and without pLM features.
  • model/: Directory containing the trained pLM-CNN-BiLSTM model used for inference.
  • README.md: Project documentation.

Code Details

  • inference.py: A script for high-throughput prediction on new FASTA sequences.
  • gen_T5_feature.py: Extracts features from protein sequences using the T5EncoderModel.
  • plm_protein_model.py: The final training script for full-length proteins, incorporating pLM features and a dual-pathway CNN-BiLSTM architecture.
  • peptide_model.py: A specialized model for short peptide fragments without pLM, used for classifying chromatin-binding peptides.
  • protein_model.py: A specialized model for full-length proteins without pLM, used for classifying caRBPs.
  • plm_peptide_model.py: A specialized model for short peptide fragments with pLM, used for classifying chromatin-binding peptides.

Installation

git clone https://github.com/callmeracle/caRBP-Pred.git    
cd caRBP-Pred  
conda create -n carbp_env python=3.8  
conda activate carbp_env

The ProtT5-XL model can be downloaded from Hugging Face (https://huggingface.co/Rostlab/prot_t5_xl_uniref50/tree/main).

Embedding files can be downloaded from Google Drive (https://drive.google.com/drive/folders/1TrboNK1oMGsd03l1G9CyPhQWVOuyt2Gt?usp=sharing).

Usage

1. Feature Extraction

Use gen_T5_feature.py to extract pre-trained language model features from protein sequences:

python gen_T5_feature.py
  • Input Format
protein_id,sequence
P12345,MKTLLILGL...  
P67890,GSHMASSNA...  
  • Output:
protein_plm_residue.npy
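
For orientation, per-residue ProtT5 embeddings are commonly extracted with the Hugging Face `T5EncoderModel` roughly as sketched below. This is not the contents of gen_T5_feature.py: the helper names `prepare_sequence` and `embed_sequences` are hypothetical, and how the script batches sequences or lays out protein_plm_residue.npy is not shown here. The preprocessing (mapping rare residues to X and space-separating amino acids) follows the published ProtT5 usage.

```python
import re

MODEL_NAME = "Rostlab/prot_t5_xl_uniref50"


def prepare_sequence(seq):
    """Map rare residues (U, Z, O, B) to X and put a space between residues."""
    return " ".join(re.sub(r"[UZOB]", "X", seq.upper()))


def embed_sequences(sequences, model_name=MODEL_NAME, device="cpu"):
    """Return one (length, 1024) array of residue embeddings per input sequence."""
    # Heavy imports are deferred so prepare_sequence can be used without them.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).to(device).eval()

    batch = tokenizer(
        [prepare_sequence(s) for s in sequences],
        add_special_tokens=True,
        padding="longest",
        return_tensors="pt",
    )
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)

    with torch.no_grad():
        hidden = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    # Keep the first len(seq) positions: this drops the end-of-sequence token and padding.
    return [hidden[i, : len(s)].cpu().numpy() for i, s in enumerate(sequences)]
```

Calling `embed_sequences(["MKTLLILGL"])` downloads the model weights on first use, so a GPU and sufficient disk space are advisable for large inputs.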

2. Model Training

Run plm_protein_model.py for 5-fold cross-validation training:

python plm_protein_model.py
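
The 5-fold split itself can be pictured with scikit-learn as in the sketch below. Whether plm_protein_model.py stratifies by label, shuffles, or fixes a seed is not stated above, so the use of `StratifiedKFold`, the `seed` value, and the helper name `five_fold_splits` are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def five_fold_splits(labels, seed=42):
    """Return five (train_idx, val_idx) pairs that preserve the class ratio per fold."""
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # StratifiedKFold only needs X for its length, so a placeholder array is enough.
    return list(skf.split(np.zeros(len(labels)), labels))


# Toy imbalanced label vector: 10 positives and 40 negatives.
labels = np.array([1] * 10 + [0] * 40)
folds = five_fold_splits(labels)
for fold_id, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_id}: train={len(train_idx)} val={len(val_idx)} "
          f"val_positives={int(labels[val_idx].sum())}")
```

Stratification keeps the rare positive class present in every validation fold, which matters when the dataset is imbalanced.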

3. Model Inference

Use the previously trained model to predict new sequences:

python inference/inference.py
  • Configuration (modify in main):
task = {  
    "input_csv": "inference/all_fasta.csv",           # Input sequences  
    "ptm_npy_path": "./inference_protein_plm_residue.npy",  # Pre-computed features  
    "model_path": "../model/model.h5",                # Model path  
    "output_csv": "Final_BiLSTM_pLM_Inference.csv"      # Output results  
}  
  • Output results:
Gene_Name,Prediction_Prob,Decision,Threshold  
Protein_A,0.8923,Positive,0.7  
Protein_B,0.2341,Negative,0.7  
...
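
The relationship between the Prediction_Prob, Decision, and Threshold columns can be sketched as follows. The column names and the 0.7 threshold come from the example output above; the helper names and the choice of `>=` (rather than `>`) at the boundary are assumptions, not confirmed behavior of inference.py.

```python
import csv
import io


def decide(prob, threshold=0.7):
    """Label a probability as Positive when it reaches the threshold (assumed inclusive)."""
    return "Positive" if prob >= threshold else "Negative"


def write_results(rows, threshold=0.7):
    """Format (name, probability) pairs as CSV text with the columns shown above."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Gene_Name", "Prediction_Prob", "Decision", "Threshold"])
    for name, prob in rows:
        writer.writerow([name, f"{prob:.4f}", decide(prob, threshold), threshold])
    return buf.getvalue()


report = write_results([("Protein_A", 0.8923), ("Protein_B", 0.2341)])
print(report)
```

Raising the threshold trades recall for precision, so it is worth recording alongside each prediction, as the output format does.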

About

Predicting chromatin-associated RNA-binding proteins from short sequencing using deep learning
