Sliding Window INteraction Grammar
An interaction language model for protein-peptide and protein-protein interaction contexts.
- pMHC
- Method to predict class I and class II MHC peptide binding.
- Missense mutation perturbations
- Method to predict whether a protein-protein interaction would occur in the presence of a missense mutation.
For the SCV, only the Epitope, MHC, Hit, and Sequence columns are necessary.
To cross-predict on new peptides/alleles, concatenate the training.csv for the model of your choice (found in the Data folder) with the new data, in the format shown below. For cross-prediction, the 'Set' column should be 'Train' (rows from the training dataset) or 'Test' (new data). To predict peptides with no known labels, use the nolabel_prediction.py file: the 'Set' column should again be 'Train' or 'Test', and the 'Hit' column should be left empty for Test epitopes.
| Epitope | MHC | Set | Hit | Sequence |
|---|---|---|---|---|
| AAALIIHHV | HLA-A02:11 | Train | 1 | MAVMAPRTLVLLLSGALAL... |
| AGFAGDDAPR | HLA-A02:11 | Test | 0 | MAVMAPRTLVLLLSGALAL... |

For no-label prediction, the 'Hit' cell is left empty for Test rows:

| Epitope | MHC | Set | Hit | Sequence |
|---|---|---|---|---|
| AAALIIHHV | HLA-A02:11 | Train | 1 | MAVMAPRTLVLLLSGALAL... |
| AGFAGDDAPR | HLA-A02:11 | Test | | MAVMAPRTLVLLLSGALAL... |
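
The concatenation step described above can be sketched with pandas. This is a minimal illustration, not part of SWING: the helper `build_cross_pred_table` and its `labeled` flag are hypothetical, and the sequences are the truncated prefixes from the tables rather than full MHC sequences.

```python
import pandas as pd

COLUMNS = ["Epitope", "MHC", "Set", "Hit", "Sequence"]

def build_cross_pred_table(train_df, new_df, labeled=True):
    """Hypothetical helper: stack training rows and new rows into one table."""
    train = train_df.copy()
    train["Set"] = "Train"
    train["Hit"] = train["Hit"].astype("Int64")  # nullable integer dtype
    new = new_df.copy()
    new["Set"] = "Test"
    if labeled:
        new["Hit"] = new["Hit"].astype("Int64")
    else:
        # No-label prediction: Hit is left empty for Test epitopes.
        new["Hit"] = pd.array([pd.NA] * len(new), dtype="Int64")
    return pd.concat([train, new], ignore_index=True)[COLUMNS]

# Toy rows; the sequences are truncated placeholders, not full MHC sequences.
train_df = pd.DataFrame({
    "Epitope": ["AAALIIHHV"],
    "MHC": ["HLA-A02:11"],
    "Hit": [1],
    "Sequence": ["MAVMAPRTLVLLLSGALAL"],
})
new_df = pd.DataFrame({
    "Epitope": ["AGFAGDDAPR"],
    "MHC": ["HLA-A02:11"],
    "Sequence": ["MAVMAPRTLVLLLSGALAL"],
})
combined = build_cross_pred_table(train_df, new_df, labeled=False)
```

Writing `combined` with `to_csv(..., index=False)` leaves the missing 'Hit' values as empty cells, which matches the empty-cell layout in the table.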
The Mutated sequence (unless WT), the position of the mutation on the Mutated sequence (1-indexed; the adjustment to Python's 0-based indexing is handled in the code), the Interactor sequence, and the label (Y2H_score) are the bare minimum needed to run SWING. We highly recommend setting up your training data as shown in SWING_MutInt_Notebook.ipynb.
| Mutated_Seq (unless WT) | Interactor_Seq | Position | Y2H_score |
|---|---|---|---|
| MALDGPEQMELEEGKA... | MTSSYSSSSCPLGCTMA... | 60 | 0 |
| MARLALSPVPSHWMVA... | MDNKKRLAYAIIQFLHD... | 137 | 1 |
To cross-predict on new missense mutations, concatenate Mutation_pertubration_model.csv (found in the Data folder), or the data of your choice, with the unlabeled data in the format shown below. For no-label prediction:
1. The 'Set' column should be 'Train' (rows from the training dataset) or 'Test' (new, unlabeled data), and Y2H_score should be left empty for 'Test' mutations.
2. Columns for the amino acids before and after the mutation ('Before_AA' and 'After_AA') should be added.
3. The 'Type' column should be set to 'Mutant' or 'WildType'.
4. Each mutation should be given a unique 'MutationID' so that predictions can easily be mapped back to the original dataframe.
Note: Only mutant data should be added for no label prediction, not wild type. Corresponding wild type interactions will be added in the background.
| Mutated_Seq (unless WT) | Interactor_Seq | Before_AA | Position | After_AA | Y2H_score | Set | Type | MutationID |
|---|---|---|---|---|---|---|---|---|
| MALDGPEQMELEEGKA... | MTSSYSSSSCPLGCTMA... | R | 60 | Q | 0 | Train | WildType | |
| MARLALSPVPSHWMVA... | MDNKKRLAYAIIQFLHD... | G | 137 | S | | Test | Mutant | 1 |
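
Because 'Position' is 1-indexed while Python strings are 0-indexed, a quick sanity check before running can catch off-by-one errors. The sketch below assumes (this is not stated in the source) that for a 'Mutant' row the residue found at 'Position' in the mutated sequence equals 'After_AA'. Both helper names are hypothetical, and the sequence is a toy example rather than SWING data.

```python
def residue_at(sequence, position):
    """Return the residue at a 1-indexed position, as used in the Position column."""
    if not 1 <= position <= len(sequence):
        raise ValueError(
            f"position {position} is outside a sequence of length {len(sequence)}"
        )
    return sequence[position - 1]  # shift to Python's 0-based indexing

def mutation_is_consistent(mutated_seq, position, after_aa):
    """Assumed convention: a Mutant sequence carries After_AA at Position."""
    return residue_at(mutated_seq, position) == after_aa

# Toy sequence for illustration only (not taken from the SWING data).
toy_mutant = "MALDGSEQ"
ok = mutation_is_consistent(toy_mutant, 6, "S")
```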
- pandas (v 1.2.4)
- numpy (v 1.20.1)
- scikit-learn (v 1.3.2)
- gensim (v 4.2.0)
- xgboost (v 1.6.1)
- matplotlib (v 3.6.3)
- python-Levenshtein (v 0.25.1)
A vignette with a step by step explanation of the method has been provided here.
To run no label prediction on mutation data, the following line of code can be used:
python3 MutInt_nolabel_prediction.py --data_set 'data.csv' --output 'MutInt_nolabel_preds' --k 7 --L 1 --metric 'polarity' --padding_score 9 --w 6 --dm 1 --dim 128 --epochs 52 --min_count 1 --alpha 0.08711 --save_embeddings True --n_estimators 375 --max_depth 6 --learning_rate 0.08966
To run the standard cross validation on the Class I datasets, the following line of code can be used:
python3 scv.py --data_set ClassI_training_210.csv --output 'ClassI_SCV_210' --save_embeddings True
--metric 'polarity' --classifier 'XGBoost' --loops 10 --k 7 --dim 583 --dm 0 --w 11 --min_count 1
--alpha 0.02349139979145104 --epochs 13 --n_estimators 232 --max_depth 6
--learning_rate 0.9402316101150048
To run the cross prediction on the Class I datasets, the following line of code can be used:
python3 cross_pred.py --data_set ClassI_crossval_HLA-A02:02_210.csv --output 'ClassI_HLA-A02:02_210'
--save_embeddings True --metric 'polarity' --loops 10 --classifier 'XGBoost' --cross_pred_set 'HLA-A02:02'
--k 7 --dim 583 --dm 0 --w 11 --min_count 1 --alpha 0.02349139979145104 --epochs 13 --n_estimators 232
--max_depth 6 --learning_rate 0.9402316101150048
The hyperparameters for the Class II model are:
--k 7 --dim 146 --dm 0 --w 12 --min_count 1 --alpha 0.03887032752085429 --epochs 13 --n_estimators 341 --max_depth 9 --learning_rate 0.6534638199102993
The hyperparameters for the Mixed Class model are:
--k 7 --dim 74 --dm 0 --w 12 --min_count 1 --alpha 0.03783042872771851 --epochs 10 --n_estimators 269 --max_depth 9 --learning_rate 0.6082359422582875
Note: ~45G of memory is needed to run the Class I model and ~30G for the Mixed Model.
Takes a pandas dataframe where each row represents a protein-protein/peptide-protein interaction.
Customization includes setting the interactor protein and the peptide window. In the pMHC context, the epitope defines the peptide window. In the missense mutation perturbation context, the window_k parameter defines the size of the window and the mutation defines its position. Additionally, the scale used to calculate the score can be altered.
The function returns a list of score-encoding strings, each representing one PPI. The ends of the encodings include padding from the sliding window process. These encodings are later broken into k-mers for the embedding model.
- df: a string path to the location of the file
- The file must have columns for the interactor protein sequence and the target protein sequence. For the mutation context, the position of the mutation must also be provided.
- padding_score: int
- Defines the number assigned to the padding. This number should be outside of the range of the scores given to AA pairs. default=9
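
To make the sliding-window idea concrete, here is a rough sketch of one plausible reading of the description above. It is not SWING's implementation. The function name `encode_window`, the toy scale values, the use of a rounded absolute difference as the pair score, and padding only at the trailing end are all assumptions made for illustration.

```python
# Illustrative per-residue values for a few amino acids (hypothetical, not SWING's scale).
TOY_SCALE = {"A": 8, "L": 5, "K": 11, "D": 13, "M": 6, "V": 6}

def encode_window(window, interactor, scale=TOY_SCALE, padding_score=9):
    """Slide `window` along `interactor`, scoring each aligned residue pair.

    Assumed scoring: rounded absolute difference of the two scale values.
    Positions where the window hangs past the end of the interactor
    receive `padding_score`.
    """
    digits = []
    for start in range(len(interactor)):
        for offset, window_aa in enumerate(window):
            idx = start + offset
            if idx < len(interactor):
                diff = abs(scale[window_aa] - scale[interactor[idx]])
                digits.append(str(round(diff)))
            else:
                digits.append(str(padding_score))
    return "".join(digits)

encoding = encode_window("AK", "MAL")  # → "230639"
```

The largest difference in the toy scale is 8, so a padding score of 9 stays outside the range of real pair scores, which is the property the `padding_score` parameter is meant to guarantee.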
Takes the encoding scores from get_window_encodings().
Customization includes setting the size of the k-mers (k), a shuffle option, and the integer defining the padding score.
This function returns a list of lists of overlapping k-mers of the specified size k, with k-mers consisting only of padding removed. Each list of k-mers is specific to one PPI. This output is compatible with gensim.
- encoding_scores: a list of lists
- The list contains a list for each PPI. Each PPI list is composed of one string with the encodings
- k: int
- Defines the size of the k-mers, default=7
- shuffle:
- Whether the k-mers are shuffled. Shuffling may prevent overfitting based on position of the k-mers. default=False
- padding_score: int
- Defines the number assigned to the padding. This number should be outside of the range of the scores given to AA pairs. default=9
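
The k-mer step can be sketched in plain Python as follows. This is an illustration of the behaviour described above, not the package's code: `kmers_from_encoding` and its `seed` argument are hypothetical, and the sketch assumes a single-digit padding score.

```python
import random

def kmers_from_encoding(encoding, k=7, shuffle=False, padding_score=9, seed=None):
    """Split one encoding string into overlapping k-mers, dropping padding-only k-mers."""
    pad_only = str(padding_score) * k  # assumes a single-digit padding score
    kmers = [encoding[i:i + k] for i in range(len(encoding) - k + 1)]
    kmers = [kmer for kmer in kmers if kmer != pad_only]
    if shuffle:
        # Shuffling discards positional order within a PPI.
        random.Random(seed).shuffle(kmers)
    return kmers

# One PPI holding one encoding string (toy values).
encodings = [["9912345999"]]
corpus = [kmers_from_encoding(ppi[0], k=3) for ppi in encodings]
# corpus == [["991", "912", "123", "234", "345", "459", "599"]]
```

The trailing "999" k-mer is removed because it contains only padding, while k-mers that mix padding with real scores are kept.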
Takes in the k-mers created by the get_kmers_str() function.
Returns Doc2Vec TaggedDocument objects, one for each PPI, to be used in a Doc2Vec model.
- matrix: a list of lists
- The list that contains a list of k-mers for each PPI
- tokens_only:
- default=False
Siwek, J. C., Omelchenko, A. A., Chhibbar, P., Arshad, S., Rosengart, A., Nazarali, I., ... & Das, J. (2025). Sliding Window Interaction Grammar (SWING): a generalized interaction language model for peptide and protein interactions. Nature Methods, 1-13.