Deep Learning-based Prediction of Chromatin-Associated RNA-Binding Proteins

This repository contains a deep learning framework for identifying chromatin-associated RNA-binding proteins (caRBPs) using protein language model (pLM) embeddings and a hybrid CNN-BiLSTM architecture.

Key Features

pLM Integration: Leverages ProtT5-XL-UniRef50 embeddings to capture high-dimensional evolutionary and biophysical information.
Hybrid Loss Function: Implements a Hybrid Dice-Focal Loss to specifically address class imbalance in biological datasets.
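
As a rough illustration of how a Dice term and a focal term can be combined for imbalanced binary labels, the sketch below computes both in plain Python. The mixing weight, the focal parameters gamma and alpha, and the smoothing constant are assumptions chosen for illustration; the loss defined in the repository's training scripts may differ.

```python
import math

def dice_loss(y_true, y_prob, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(y_true) + sum(y_prob) + smooth)."""
    overlap = sum(t * p for t, p in zip(y_true, y_prob))
    denom = sum(y_true) + sum(y_prob)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)

def focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss averaged over samples; down-weights easy examples."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)          # clip to avoid log(0)
        p_t = p if t == 1 else 1.0 - p           # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha   # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)

def hybrid_dice_focal(y_true, y_prob, weight=0.5):
    """Weighted sum of the two terms (weight is a hypothetical parameter)."""
    return weight * dice_loss(y_true, y_prob) + (1.0 - weight) * focal_loss(y_true, y_prob)
```

The Dice term scores overlap on the positive class as a whole, while the focal term concentrates on hard individual samples, which is why the two are often paired when positives are rare.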

Dependencies

Python 3.8
PyTorch 2.4.1
TensorFlow 2.13.1
Transformers 4.46.3
Scikit-learn 1.3.2
Pandas 2.0.3
NumPy 1.24.3
Matplotlib 3.7.5

Contents

data/: Directory containing the datasets in .csv format.
inference/: Directory containing the .csv datasets used for inference and the scripts that predict potential caRBPs.
model_training/: Directory containing the training scripts for peptide and protein models, with and without pLM features.
model/: Directory containing the trained pLM-CNN-BiLSTM model used for inference.
README.md: Project documentation.

Code Details

inference.py: A script for high-throughput prediction on new FASTA sequences.
gen_T5_feature.py: Extracts features from protein sequences using the T5EncoderModel.
plm_protein_model.py: The final training script for full-length proteins, incorporating pLM features and a dual-pathway CNN-BiLSTM architecture.
peptide_model.py: A specialized model for short peptide fragments without pLM, used for classifying chromatin-binding peptides.
protein_model.py: A specialized model for full-length proteins without pLM, used for classifying caRBPs.
plm_peptide_model.py: A specialized model for short peptide fragments with pLM, used for classifying chromatin-binding peptides.

Installation

git clone https://github.com/callmeracle/caRBP-Pred.git    
cd caRBP-Pred  
conda create -n carbp_env python=3.8  
conda activate carbp_env

The ProtT5-XL model can be downloaded from Hugging Face (https://huggingface.co/Rostlab/prot_t5_xl_uniref50/tree/main).
Embedding files can be downloaded from Google Drive (https://drive.google.com/drive/folders/1TrboNK1oMGsd03l1G9CyPhQWVOuyt2Gt?usp=sharing).

Usage

1. Feature Extraction

Use gen_T5_feature.py to extract pre-trained language model features from protein sequences:

python gen_T5_feature.py

Input Format

protein_id,sequence
P12345,MKTLLILGL...  
P67890,GSHMASSNA...

Output:

protein_plm_residue.npy
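
ProtT5 tokenizers expect residues separated by spaces, with the rare amino acids U, Z, O and B mapped to X. The sketch below shows one way to read the input CSV and prepare sequences accordingly before they are passed to the encoder. The function names and the demo records are hypothetical, and gen_T5_feature.py may handle this step differently.

```python
import csv
import io
import re

def read_sequences(csv_text):
    """Parse 'protein_id,sequence' rows into (id, sequence) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["protein_id"].strip(), row["sequence"].strip().upper())
            for row in reader]

def to_prott5_input(sequence):
    """Replace rare residues with X, then put a space between residues."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))

# Hypothetical demo records, not taken from the repository's datasets.
demo_csv = "protein_id,sequence\ndemo_1,MKTAU\ndemo_2,gshmb  \n"
records = read_sequences(demo_csv)
prepared = [(pid, to_prott5_input(seq)) for pid, seq in records]
```

The prepared strings are what would typically be handed to the tokenizer; the per-residue encoder outputs could then be saved with NumPy to a file such as protein_plm_residue.npy.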

2. Model Training

Run plm_protein_model.py for 5-fold cross-validation training:

python plm_protein_model.py
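
For readers unfamiliar with the procedure, 5-fold cross-validation splits the data into five parts and trains five times, each time holding out a different part for validation. The sketch below builds such splits in plain Python. Keeping the class ratio equal across folds (stratification) and the seed value are assumptions made here because of the class imbalance noted under Key Features; this README does not state how plm_protein_model.py builds its folds.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs with similar class ratios per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, val

# Hypothetical imbalanced label vector: 10 positives, 40 negatives.
labels = [1] * 10 + [0] * 40
splits = list(stratified_kfold(labels))
```

Each sample appears in exactly one validation fold, so averaging the five validation scores uses every sample once.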

3. Model Inference

Use the previously trained model to predict new sequences:

python inference/inference.py

Configuration (modify in main):

task = {  
    "input_csv": "inference/all_fasta.csv",           # Input sequences  
    "ptm_npy_path": "./inference_protein_plm_residue.npy",  # Pre-computed features  
    "model_path": "../model/model.h5",                # Model path  
    "output_csv": "Final_BiLSTM_pLM_Inference.csv"      # Output results  
}

Output results:

Gene_Name,Prediction_Prob,Decision,Threshold  
Protein_A,0.8923,Positive,0.7  
Protein_B,0.2341,Negative,0.7  
...
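
The output columns above suggest that each predicted probability is compared with a fixed threshold of 0.7. A minimal sketch of that final step follows. Treating a probability equal to the threshold as Positive, and the helper names, are assumptions; inference.py may implement the decision differently.

```python
import csv
import io

def decide(prob, threshold=0.7):
    """Label a probability; >= is an assumed tie-breaking rule."""
    return "Positive" if prob >= threshold else "Negative"

def write_results(rows, threshold=0.7):
    """Format (gene_name, probability) pairs as CSV text with the columns shown above."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Gene_Name", "Prediction_Prob", "Decision", "Threshold"])
    for name, prob in rows:
        writer.writerow([name, f"{prob:.4f}", decide(prob, threshold), threshold])
    return buf.getvalue()

report = write_results([("Protein_A", 0.8923), ("Protein_B", 0.2341)])
```

Raising the threshold trades recall for precision, which may be preferable when predicted caRBPs are to be followed up experimentally.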
