By: Paul Kolbeck
Proteins are large molecules inside of organisms that fulfill a huge variety of functions within organisms. All of biological life hinges on the functions of proteins within organisms. From viruses to whales and everything in between, living beings are basically a collection of proteins that, through complex interactions and mechanisms, ensure the continuation of their own existence. As proteins are an underlying part of all biological processes, understanding proteins and their interactions is an important portion research in biology.
Proteins are created when a ribosome (a type of protein complex) takes in a gene sequence in the from of a strand of messenger RNA (mRNA), which was transcripted from the DNA in the nucleus of the cell, and starts an amino acid chain. Amino acids are the building blocks of proteins, and are small organic compounds that have binding sites on them that allow them to be bound together in a chain. As the ribosome travels along the mRNA strand, it encounters a sequence of triplets of bases, known as codons. Each codon corresponds to an amino acid, which the ribosome then takes from the surrounding supply and attaches to the budding amino acid chain in the order in which the codons appear. Thus, the mRNA strand is used to build a new sequence of amino acids. This sequence of amino acids then folds into a 3d shape, which then has complex properties based on the sequence of amino acids and how it folded.
Not only are the proteins themselves complex, but they exist within a sea of other proteins with which they may interact, giving rise to a network of interactions which further infuence the functionality of proteins. As the environments change, so does the function of a protein, so a protein that is a part of one set of interactions in a liver that allow the liver to grow back, may be part of a different set of interactions in skeletal muscle that allow the muscle to contract. Mapping these interactions is thus another integral part in understanding the biological processes in an organism.
In this tutorial, our goal is to investigate possible relationships between several protein characteristics, particularly their role in an interaction network and their structure. For example, one might predict that the number of pathways a protein is a part of may correlate to the number of active sites on a protein, which further correlates with the mass (i.e. size). TODO
For this analysis, we use Python 3, along with external packages numpy, pandas, matplotlib, networkx, and statsmodel. TODO
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import re
import requests
from bs4 import BeautifulSoup
import statsmodels.api as sm
import io
import time
import warnings
warnings.filterwarnings("ignore")For this tutorial, we chose to use the interaction network data provided by Innatedb. Innatedb is a database of genes, interactions, and signaling pathways related to the innate immune response of human beings. This was chosen so as to provide a sample of the protein interaction network in the human body with a fleshed out interaction network. As this is a biased sample of the population of all proteins in the human body, results derived from the analysis will only definitively apply to this subset of data, but may inform further investigation in other portions of the human protein interaction network.
The InnateDB database is only on the order of a hundred Megabytes, so downloading and importing the entire database was taken as the easiest and cleanest way to work with the database. The data from Innatedb was downloaded as a gz zipped file, extracted, and imported using pandas read_csv method.
# Making a dataframe out of the innate immune system data file, and getting characteristics from it.
interactionDF = pd.read_csv("./all.mitab/all.mitab", sep = "\t")
print(interactionDF.info())<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418677 entries, 0 to 418676
Data columns (total 42 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 #unique_identifier_A 418677 non-null object
1 unique_identifier_B 418677 non-null object
2 alt_identifier_A 418677 non-null object
3 alt_identifier_B 418677 non-null object
4 alias_A 418677 non-null object
5 alias_B 418677 non-null object
6 interaction_detection_method 418677 non-null object
7 author 418677 non-null object
8 pmid 418677 non-null object
9 ncbi_taxid_A 418677 non-null object
10 ncbi_taxid_B 418677 non-null object
11 interaction_type 417823 non-null object
12 source_database 418677 non-null object
13 idinteraction_in_source_db 418677 non-null object
14 confidence_score 418677 non-null object
15 expansion_method 418677 non-null object
16 biological_role_A 418677 non-null object
17 biological_role_B 418677 non-null object
18 exp_role_A 418677 non-null object
19 exp_role_B 418677 non-null object
20 interactor_type_A 418677 non-null object
21 interactor_type_B 418677 non-null object
22 xrefs_A 418677 non-null object
23 xrefs_B 418677 non-null object
24 xrefs_interaction 418677 non-null object
25 annotations_A 418677 non-null object
26 annotations_B 418677 non-null object
27 annotations_interaction 418677 non-null object
28 ncbi_taxid_host_organism 418677 non-null object
29 parameters_interaction 418677 non-null object
30 creation_date 418677 non-null object
31 update_date 418677 non-null object
32 checksum_A 418677 non-null object
33 checksum_B 418677 non-null object
34 checksum_interaction 418677 non-null object
35 negative 418677 non-null bool
36 features_A 418677 non-null object
37 features_B 418677 non-null object
38 stoichiometry_A 418677 non-null object
39 stoichiometry_B 418677 non-null object
40 participant_identification_method_A 418677 non-null object
41 participant_identification_method_B 418677 non-null object
dtypes: bool(1), object(41)
memory usage: 131.4+ MB
None
We see that we get a 131 MB dataframe, where each entry is an interaction. Each entry features a unique identifier created by InnateDB for proteins A and B, as well as several other identifiers. First, we'd like to check that we have a network. To do this, we are going to make a graph out of the first 2000 entries in the dataframe, as the function that draw the graph from matplotlib and networkx can't handle a graph with 400,000+ edges.
# Taking a look at the network
edgeList = interactionDF[["#unique_identifier_A", "unique_identifier_B"]]
# removing self interaction edges, to reduce clutter
edgeListNoSelf = edgeList[edgeList["#unique_identifier_A"] != edgeList["unique_identifier_B"]]
G = nx.Graph(list(np.array(edgeListNoSelf.head(2000))))
plt.figure(figsize = [30,15])
subax1 = plt.subplot(121)
nx.draw(G, node_size = 8)
plt.show()Looking at the graph, we can see that there is one main component with a majority of the proteins, and several isolated components. These isolated components may be connected to the main component if we were to include the rest of the edges, but checking using the method of graphing is unreasonable. However, we can take a closer look at the main component, isolating it using networkx's connected_components algorithm. We change the node color to green and the edge color to red for better contrast.
# getting the largest component
S = [G.subgraph(c).copy() for c in nx.connected_components(G)]
Sl = max(S, key=len)
plt.figure(figsize = [30,15])
subax1 = plt.subplot(121)
nx.draw(Sl, node_size = 10, node_color = "green", edge_color = "red")
plt.show()Just from looking at this graph, we see that we have a variety of positions in the graph - measures of centrality should be varied. We can see that there are some nodes from which there are several edges (i.e. they have a high degree), while there are other nodes that only have one edge. Of course, this is only a component of whole graph, but it suggests that we should expect variance.
We have confirmed that we have a network of protein interactions, which means that we have a means of getting measures of centrality for our proteins. However, we would like to compare these measures of centrality to other characteristics of the proteins, and the InnateDB database does not have any characteristics on the proteins themselves. However, the dataframe does supply additional IDs for the proteins, which can be used to query other databases for more information on the proteins. We chose to use the Uniprot database for our information on the proteins. The uniprot IDs are listed under the "alias_(A,B)" columns in the InnateDB data, along with a few other IDs. To acquire the Uniprot ID, we use a regular expression. Note that the "alias_(A,B)" actually contains both the Uniprot ID and the Uniprot Entry Name, so we add the '^_' to our regular expression to ensure we match on the ID.
# Defining function to get the uniprotkb protein id, if it exists.
def getUniprotkb(str):
r = re.search("uniprotkb:([^_|]+)[|]", str + "|")
if r:
return r.group(1)
else:
return "NONE"
# get mapping from innatedb unique identifier to uniprot ID
aProts = (interactionDF["#unique_identifier_A"] + "->" + interactionDF["alias_A"]).unique()
bProts = (interactionDF["unique_identifier_B"] + "->" + interactionDF["alias_B"]).unique()
allProts = np.unique(np.concatenate((aProts, bProts), axis = 0))
protDF = pd.DataFrame([i.split("->") for i in allProts], columns = ["innateDB_ID","alias"])
protDF["UniprotID"] = protDF["alias"].apply(getUniprotkb)
# Display head, get the number of entries without a uniprotID
print(f"Number of unique innatedb ids: {len(protDF['innateDB_ID'].unique())}")
print(f"{pd.isna(protDF['UniprotID']).sum()} entries are missing a uniprotID out of {protDF.shape[0]}")
display(protDF.head())Number of unique innatedb ids: 25271
0 entries are missing a uniprotID out of 25271
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| innateDB_ID | alias | UniprotID | |
|---|---|---|---|
| 0 | innatedb:IDBG-1 | refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... | Q05066 |
| 1 | innatedb:IDBG-100000 | uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... | Q9UBY5 |
| 2 | innatedb:IDBG-100012 | uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... | Q8IWG1 |
| 3 | innatedb:IDBG-10002 | refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... | Q9Y3A4 |
| 4 | innatedb:IDBG-100021 | uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... | Q8TDD5 |
We have a Uniprot ID for every InnateDB ID! This means we don't have to handle any missing values yet.
Now we need to get the protein data from the Uniprot database. Uniprot is popular enough to luckily have an extensive API (TODO), which we take advantage of to query their database for information each protein. There is an upper limit on the number of individual ids that can be queried at the same time, so we divide the list of 25000+ IDs into blocks of length 100. The API also specifies that queries can be limited to certain columns, so we choose a set of columns relevant to our investigation. We chose the id again, the entry name, the organism (to confirm this is a human protein), the mass, the length, the tissue specificity, the disease involvement, the subcellular location, the gene ontology, the domains, the family and the domain extent.
# breaking the queries into blocks
def makeQueryList(n, ids):
numBlocks = int((len(ids)/n + 1) - len(ids)/n % 1)
queryList = []
for i in range(numBlocks - 1):
entry = "id:" + ids[i*n]
for j in range(n-1):
entry = entry + "+OR+id:" + ids[i*n + 1 + j]
queryList.append(entry)
entry = ids[(numBlocks-1)*n]
for j in range(len(ids) - (numBlocks-1)*n - 1):
entry = entry + "+OR+" + ids[(numBlocks-1)*n + 1 + j]
return queryList
proteinQueryList = makeQueryList(100, np.array(protDF[~pd.isna(protDF['UniprotID'])]['UniprotID']))
columns = "columns=id,entry name,organism,mass,length,comment(TISSUE SPECIFICITY),comment(DISEASE),comment(SUBCELLULAR LOCATION),go,comment(DOMAIN),families,feature(DOMAIN EXTENT)"
fmt = "format=tab"# querying the database for information on the proteins
colQuery = requests.get("https://www.uniprot.org/uniprot/?query=" + protDF[~pd.isna(protDF['UniprotID'])]['UniprotID'][0] + "&" + columns + "&" + fmt)
temp = pd.read_csv(io.StringIO(colQuery.text), sep = "\t")
uniprotDF = pd.DataFrame(columns = temp.columns)
print("Progress\n" + "_"*100)
for i in range(len(proteinQueryList)):
# make request
r = requests.get("https://www.uniprot.org/uniprot/?query=" + proteinQueryList[i] + "&" + columns + "&" + fmt)
temp = pd.read_csv(io.StringIO(r.text), sep = "\t")
uniprotDF = uniprotDF.merge(temp, how = "outer")
if i/len(proteinQueryList) % .01 < (i-1)/len(proteinQueryList) % .01: print("=", end = "")
time.sleep(1)Progress
____________________________________________________________________________________________________
====================================================================================================
print(uniprotDF.info())
display(uniprotDF.head())<class 'pandas.core.frame.DataFrame'>
Int64Index: 23655 entries, 0 to 23654
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Entry 23655 non-null object
1 Entry name 23655 non-null object
2 Organism 23617 non-null object
3 Mass 23617 non-null object
4 Length 23617 non-null float64
5 Tissue specificity 12509 non-null object
6 Involvement in disease 4452 non-null object
7 Subcellular location [CC] 20868 non-null object
8 Gene ontology (GO) 23253 non-null object
9 Domain [CC] 5417 non-null object
10 Protein families 17655 non-null object
11 Domain [FT] 10999 non-null object
dtypes: float64(1), object(11)
memory usage: 2.3+ MB
None
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Entry | Entry name | Organism | Mass | Length | Tissue specificity | Involvement in disease | Subcellular location [CC] | Gene ontology (GO) | Domain [CC] | Protein families | Domain [FT] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P19320 | VCAM1_HUMAN | Homo sapiens (Human) | 81,276 | 739.0 | TISSUE SPECIFICITY: Expressed on inflamed vasc... | NaN | SUBCELLULAR LOCATION: Membrane; Single-pass ty... | alpha9-beta1 integrin-vascular cell adhesion m... | DOMAIN: Either the first or the fourth Ig-like... | NaN | DOMAIN 25..105; /note="Ig-like C2-type 1"; DO... |
| 1 | P36543 | VATE1_HUMAN | Homo sapiens (Human) | 26,145 | 226.0 | TISSUE SPECIFICITY: Kidney; localizes to early... | DISEASE: Cutis laxa, autosomal recessive, 2C (... | SUBCELLULAR LOCATION: Apical cell membrane {EC... | apical plasma membrane [GO:0016324]; cytosol [... | NaN | V-ATPase E subunit family | NaN |
| 2 | Q9UKA9 | PTBP2_HUMAN | Homo sapiens (Human) | 57,491 | 531.0 | TISSUE SPECIFICITY: Mainly expressed in brain ... | NaN | SUBCELLULAR LOCATION: Nucleus {ECO:0000250|Uni... | growth cone [GO:0030426]; neuronal cell body [... | NaN | NaN | DOMAIN 59..133; /note="RRM 1"; /evidence="EC... |
| 3 | Q96NR8 | RDH12_HUMAN | Homo sapiens (Human) | 35,094 | 316.0 | TISSUE SPECIFICITY: Widely expressed, mostly i... | DISEASE: Leber congenital amaurosis 13 (LCA13)... | SUBCELLULAR LOCATION: Endoplasmic reticulum me... | endoplasmic reticulum membrane [GO:0005789]; p... | NaN | Short-chain dehydrogenases/reductases (SDR) fa... | NaN |
| 4 | Q52LW3 | RHG29_HUMAN | Homo sapiens (Human) | 142,064 | 1261.0 | TISSUE SPECIFICITY: Widely expressed. Highly e... | NaN | NaN | cytoplasm [GO:0005737]; cytosol [GO:0005829]; ... | NaN | NaN | DOMAIN 192..462; /note="F-BAR"; /evidence="E... |
We have successfully queried the database! We see that the columns can vary with the number of missing values. Further, by looking at some of the entries, we can already guess that they will be difficult to impossible to extract useful information from, as they are hand-written comments that use varied human language. The tissue specificity column, for example, has entries that read "not expressed in ..." This will be further investigated later in this tutorial.
Now, we would like to combine the InnateDB unique IDS with the Uniprot protein information. To do this, we do a left merge on the "UniprotID" column and the "Entry" column of the InnateDB IDs dataframe and the Uniprot dataframe.
print(protDF.shape)
print(uniprotDF.shape)
display(protDF.head())
display(uniprotDF.head())(25271, 3)
(23655, 12)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| innateDB_ID | alias | UniprotID | |
|---|---|---|---|
| 0 | innatedb:IDBG-1 | refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... | Q05066 |
| 1 | innatedb:IDBG-100000 | uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... | Q9UBY5 |
| 2 | innatedb:IDBG-100012 | uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... | Q8IWG1 |
| 3 | innatedb:IDBG-10002 | refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... | Q9Y3A4 |
| 4 | innatedb:IDBG-100021 | uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... | Q8TDD5 |
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Entry | Entry name | Organism | Mass | Length | Tissue specificity | Involvement in disease | Subcellular location [CC] | Gene ontology (GO) | Domain [CC] | Protein families | Domain [FT] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P19320 | VCAM1_HUMAN | Homo sapiens (Human) | 81,276 | 739.0 | TISSUE SPECIFICITY: Expressed on inflamed vasc... | NaN | SUBCELLULAR LOCATION: Membrane; Single-pass ty... | alpha9-beta1 integrin-vascular cell adhesion m... | DOMAIN: Either the first or the fourth Ig-like... | NaN | DOMAIN 25..105; /note="Ig-like C2-type 1"; DO... |
| 1 | P36543 | VATE1_HUMAN | Homo sapiens (Human) | 26,145 | 226.0 | TISSUE SPECIFICITY: Kidney; localizes to early... | DISEASE: Cutis laxa, autosomal recessive, 2C (... | SUBCELLULAR LOCATION: Apical cell membrane {EC... | apical plasma membrane [GO:0016324]; cytosol [... | NaN | V-ATPase E subunit family | NaN |
| 2 | Q9UKA9 | PTBP2_HUMAN | Homo sapiens (Human) | 57,491 | 531.0 | TISSUE SPECIFICITY: Mainly expressed in brain ... | NaN | SUBCELLULAR LOCATION: Nucleus {ECO:0000250|Uni... | growth cone [GO:0030426]; neuronal cell body [... | NaN | NaN | DOMAIN 59..133; /note="RRM 1"; /evidence="EC... |
| 3 | Q96NR8 | RDH12_HUMAN | Homo sapiens (Human) | 35,094 | 316.0 | TISSUE SPECIFICITY: Widely expressed, mostly i... | DISEASE: Leber congenital amaurosis 13 (LCA13)... | SUBCELLULAR LOCATION: Endoplasmic reticulum me... | endoplasmic reticulum membrane [GO:0005789]; p... | NaN | Short-chain dehydrogenases/reductases (SDR) fa... | NaN |
| 4 | Q52LW3 | RHG29_HUMAN | Homo sapiens (Human) | 142,064 | 1261.0 | TISSUE SPECIFICITY: Widely expressed. Highly e... | NaN | NaN | cytoplasm [GO:0005737]; cytosol [GO:0005829]; ... | NaN | NaN | DOMAIN 192..462; /note="F-BAR"; /evidence="E... |
# merging the dataframes using the equivalent to SQL's left join, on the uniprot ID.
fullProtDF = protDF.merge(uniprotDF, how = "left", left_on = "UniprotID", right_on = "Entry")
print(fullProtDF.info())
display(fullProtDF.head())<class 'pandas.core.frame.DataFrame'>
Int64Index: 25271 entries, 0 to 25270
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 innateDB_ID 25271 non-null object
1 alias 25271 non-null object
2 UniprotID 25271 non-null object
3 Entry 23725 non-null object
4 Entry name 23725 non-null object
5 Organism 23676 non-null object
6 Mass 23676 non-null object
7 Length 23676 non-null float64
8 Tissue specificity 12533 non-null object
9 Involvement in disease 4462 non-null object
10 Subcellular location [CC] 20921 non-null object
11 Gene ontology (GO) 23312 non-null object
12 Domain [CC] 5426 non-null object
13 Protein families 17706 non-null object
14 Domain [FT] 11014 non-null object
dtypes: float64(1), object(14)
memory usage: 3.1+ MB
None
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| innateDB_ID | alias | UniprotID | Entry | Entry name | Organism | Mass | Length | Tissue specificity | Involvement in disease | Subcellular location [CC] | Gene ontology (GO) | Domain [CC] | Protein families | Domain [FT] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | innatedb:IDBG-1 | refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... | Q05066 | Q05066 | SRY_HUMAN | Homo sapiens (Human) | 23,884 | 204.0 | NaN | DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... | SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... | chromatin [GO:0000785]; cytoplasm [GO:0005737]... | DOMAIN: DNA binding and bending properties of ... | SRY family | NaN |
| 1 | innatedb:IDBG-100000 | uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... | Q9UBY5 | Q9UBY5 | LPAR3_HUMAN | Homo sapiens (Human) | 40,128 | 353.0 | TISSUE SPECIFICITY: Most abundantly expressed ... | NaN | SUBCELLULAR LOCATION: Cell membrane; Multi-pas... | axon [GO:0030424]; cytoplasm [GO:0005737]; int... | NaN | G-protein coupled receptor 1 family | NaN |
| 2 | innatedb:IDBG-100012 | uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... | Q8IWG1 | Q8IWG1 | DNAI3_HUMAN | Homo sapiens (Human) | 102,935 | 891.0 | NaN | DISEASE: Note=A rare heterozygous in-frame DNA... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... | axonemal dynein complex [GO:0005858]; cytoplas... | NaN | NaN | NaN |
| 3 | innatedb:IDBG-10002 | refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... | Q9Y3A4 | Q9Y3A4 | RRP7A_HUMAN | Homo sapiens (Human) | 32,334 | 280.0 | NaN | NaN | NaN | CURI complex [GO:0032545]; cytoplasm [GO:00057... | NaN | RRP7 family | DOMAIN 61..90; /note="RRM" |
| 4 | innatedb:IDBG-100021 | uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... | Q8TDD5 | Q8TDD5 | MCLN3_HUMAN | Homo sapiens (Human) | 64,248 | 553.0 | NaN | NaN | SUBCELLULAR LOCATION: Cell membrane {ECO:00002... | autophagosome membrane [GO:0000421]; early end... | DOMAIN: The most N-terminal extracellular/lume... | Transient receptor (TC 1.A.4) family, Polycyst... | NaN |
While each InnateDB had listed a UniprotID, not every UniprotID was successfully queried from the Uniprot database. This means some of our InnateDB IDs (the nodes on our network) will not have ay information from Uniprot database.
Now that we have Uniprot data associated with most of the nodes, we would like to make this data usable. Most of the columns are strings that are impossible to use in data analysis. However, some of the columns can be converted into usable information. The "Involvement in disease" entries have Mendelian Inheritance in Man (MIM) Ids, which are IDs of known genetic diseases and can be extracted via a regular expression. The same goes for The "Domain [FT]" column have a set number of descriptors, and the "Gene Ontology (GO)" column, which has Gene Ontology (GO) Ids that describe subcellular locations. The Ids are extracted, put into a list, and into new columns.
# get the mass as a float
fullProtDF["flMass"] = fullProtDF["Mass"].apply(lambda x: float(str(x).replace(",","")) if ~pd.isna(x) else None)
# process into list of MIM ids
def getMIM(s):
if pd.isna(s):
return None
else:
return re.findall("\[MIM:(\d+)\]", s)
fullProtDF["MIM"] = fullProtDF["Involvement in disease"].apply(getMIM)
# process domain column into list of IDs
def getDOM(s):
if pd.isna(s):
return None
else:
return re.findall('/note="([\w\s-]+)"', s)
fullProtDF["Domain_ID"] = fullProtDF["Domain [FT]"].apply(getDOM)
# process to get GO IDs
def getGO(s):
if pd.isna(s):
return None
else:
return re.findall("\[GO:(\d+)\]", s)
fullProtDF["GO_ID"] = fullProtDF["Gene ontology (GO)"].apply(getGO)
display(fullProtDF.head()).dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| innateDB_ID | alias | UniprotID | Entry | Entry name | Organism | Mass | Length | Tissue specificity | Involvement in disease | Subcellular location [CC] | Gene ontology (GO) | Domain [CC] | Protein families | Domain [FT] | Degree_Centrality | flMass | MIM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | innatedb:IDBG-1 | refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... | Q05066 | Q05066 | SRY_HUMAN | Homo sapiens (Human) | 23,884 | 204.0 | NaN | DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... | SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... | chromatin [GO:0000785]; cytoplasm [GO:0005737]... | DOMAIN: DNA binding and bending properties of ... | SRY family | NaN | 0.000953 | 23884.0 | [400044, 400045] |
| 1 | innatedb:IDBG-100000 | uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... | Q9UBY5 | Q9UBY5 | LPAR3_HUMAN | Homo sapiens (Human) | 40,128 | 353.0 | TISSUE SPECIFICITY: Most abundantly expressed ... | NaN | SUBCELLULAR LOCATION: Cell membrane; Multi-pas... | axon [GO:0030424]; cytoplasm [GO:0005737]; int... | NaN | G-protein coupled receptor 1 family | NaN | 0.000079 | 40128.0 | None |
| 2 | innatedb:IDBG-100012 | uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... | Q8IWG1 | Q8IWG1 | DNAI3_HUMAN | Homo sapiens (Human) | 102,935 | 891.0 | NaN | DISEASE: Note=A rare heterozygous in-frame DNA... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... | axonemal dynein complex [GO:0005858]; cytoplas... | NaN | NaN | NaN | 0.000040 | 102935.0 | [] |
| 3 | innatedb:IDBG-10002 | refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... | Q9Y3A4 | Q9Y3A4 | RRP7A_HUMAN | Homo sapiens (Human) | 32,334 | 280.0 | NaN | NaN | NaN | CURI complex [GO:0032545]; cytoplasm [GO:00057... | NaN | RRP7 family | DOMAIN 61..90; /note="RRM" | 0.001033 | 32334.0 | None |
| 4 | innatedb:IDBG-100021 | uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... | Q8TDD5 | Q8TDD5 | MCLN3_HUMAN | Homo sapiens (Human) | 64,248 | 553.0 | NaN | NaN | SUBCELLULAR LOCATION: Cell membrane {ECO:00002... | autophagosome membrane [GO:0000421]; early end... | DOMAIN: The most N-terminal extracellular/lume... | Transient receptor (TC 1.A.4) family, Polycyst... | NaN | 0.000079 | 64248.0 | None |
Now that we have useful forms of the protein characteristics, we want to get measures of centrality on the nodes in our network. To do this, we make use of the networkx centrality algorithms. We compute the measure of centrality, and add it as a column to our dataframe. For our measures of centrality, we chose degree, closeness, and eigenvector, as these are some of the most common forms of measuring centrality. Betweenness was left out due to computational limitations.
# getting the graph from the edge list
G = nx.Graph(list(np.array(edgeListNoSelf)))
# adding measures of centrality to the data frame
# degree centrality
a = nx.degree_centrality(G)
fullProtDF["Degree_Centrality"] = fullProtDF["innateDB_ID"].apply(lambda x: a.get(x))
# closeness centrality
a = nx.closeness_centrality(G)
fullProtDF["Closeness_Centrality"] = fullProtDF["innateDB_ID"].apply(lambda x: a.get(x))
# eigenvector centrality
a = nx.eigenvector_centrality(G)
fullProtDF["Eigenvector_Centrality"] = fullProtDF["innateDB_ID"].apply(lambda x: a.get(x))display(fullProtDF.head()).dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| innateDB_ID | alias | UniprotID | Entry | Entry name | Organism | Mass | Length | Tissue specificity | Involvement in disease | Subcellular location [CC] | Gene ontology (GO) | Domain [CC] | Protein families | Domain [FT] | Degree_Centrality | flMass | MIM | Domain_ID | GO_ID | Closeness_Centrality | Eigenvector_Centrality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | innatedb:IDBG-1 | refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... | Q05066 | Q05066 | SRY_HUMAN | Homo sapiens (Human) | 23,884 | 204.0 | NaN | DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... | SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... | chromatin [GO:0000785]; cytoplasm [GO:0005737]... | DOMAIN: DNA binding and bending properties of ... | SRY family | NaN | 0.000953 | 23884.0 | [400044, 400045] | None | [0000785, 0005737, 0016607, 0005654, 0005634, ... | 0.321910 | 0.001644 |
| 1 | innatedb:IDBG-100000 | uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... | Q9UBY5 | Q9UBY5 | LPAR3_HUMAN | Homo sapiens (Human) | 40,128 | 353.0 | TISSUE SPECIFICITY: Most abundantly expressed ... | NaN | SUBCELLULAR LOCATION: Cell membrane; Multi-pas... | axon [GO:0030424]; cytoplasm [GO:0005737]; int... | NaN | G-protein coupled receptor 1 family | NaN | 0.000079 | 40128.0 | None | None | [0030424, 0005737, 0005887, 0005886, 0045202, ... | 0.297747 | 0.000734 |
| 2 | innatedb:IDBG-100012 | uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... | Q8IWG1 | Q8IWG1 | DNAI3_HUMAN | Homo sapiens (Human) | 102,935 | 891.0 | NaN | DISEASE: Note=A rare heterozygous in-frame DNA... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... | axonemal dynein complex [GO:0005858]; cytoplas... | NaN | NaN | NaN | 0.000040 | 102935.0 | [] | None | [0005858, 0005737, 0036156, 0071933, 0045504, ... | 0.264544 | 0.000114 |
| 3 | innatedb:IDBG-10002 | refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... | Q9Y3A4 | Q9Y3A4 | RRP7A_HUMAN | Homo sapiens (Human) | 32,334 | 280.0 | NaN | NaN | NaN | CURI complex [GO:0032545]; cytoplasm [GO:00057... | NaN | RRP7 family | DOMAIN 61..90; /note="RRM" | 0.001033 | 32334.0 | None | [RRM] | [0032545, 0005737, 0005654, 0034456, 0003723, ... | 0.349394 | 0.006516 |
| 4 | innatedb:IDBG-100021 | uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... | Q8TDD5 | Q8TDD5 | MCLN3_HUMAN | Homo sapiens (Human) | 64,248 | 553.0 | NaN | NaN | SUBCELLULAR LOCATION: Cell membrane {ECO:00002... | autophagosome membrane [GO:0000421]; early end... | DOMAIN: The most N-terminal extracellular/lume... | Transient receptor (TC 1.A.4) family, Polycyst... | NaN | 0.000079 | 64248.0 | None | None | [0000421, 0031901, 0016021, 0031902, 0005765, ... | 0.346118 | 0.002906 |
Having gotten measures of centrality, we would like to compare the protein characteristics with the measures of centrality via a linear regression. Our list of ids is insufficient for this, as a linear regression does not work with categorical data, to make the data usable, we take the length of the lists, so that we can compare the centrality with the number of domains, genetic diseases, and subcellular locations.
# Get list of all attributes for each category
# Domains
DomList = []
for i in fullProtDF["Domain_ID"]:
if i:
for j in i:
if not j in DomList:
DomList.append(j)
print(len(DomList))
# MIM Ids
MIMList = []
for i in fullProtDF["MIM"]:
if i:
for j in i:
if not j in MIMList:
MIMList.append(j)
print(len(MIMList))
# Gene Ontology Ids
GOList = []
for i in fullProtDF["GO_ID"]:
if i:
for j in i:
if not j in GOList:
GOList.append(j)
print(len(GOList))2072
5293
18911
# add list lengths to dataframe
fullProtDF["numMIM"] = fullProtDF["MIM"].apply(lambda x: len(x) if x else None)
fullProtDF["numDOM"] = fullProtDF["Domain_ID"].apply(lambda x: len(x) if x else None)
fullProtDF["numGO"] = fullProtDF["GO_ID"].apply(lambda x: len(x) if x else None)
display(fullProtDF.head()).dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| innateDB_ID | alias | UniprotID | Entry | Entry name | Organism | Mass | Length | Tissue specificity | Involvement in disease | Subcellular location [CC] | Gene ontology (GO) | Domain [CC] | Protein families | Domain [FT] | Degree_Centrality | flMass | MIM | Domain_ID | GO_ID | Closeness_Centrality | Eigenvector_Centrality | numMIM | numDOM | numGO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | innatedb:IDBG-1 | refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... | Q05066 | Q05066 | SRY_HUMAN | Homo sapiens (Human) | 23,884 | 204.0 | NaN | DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... | SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... | chromatin [GO:0000785]; cytoplasm [GO:0005737]... | DOMAIN: DNA binding and bending properties of ... | SRY family | NaN | 0.000953 | 23884.0 | [400044, 400045] | None | [0000785, 0005737, 0016607, 0005654, 0005634, ... | 0.321910 | 0.001644 | 2.0 | NaN | 20.0 |
| 1 | innatedb:IDBG-100000 | uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... | Q9UBY5 | Q9UBY5 | LPAR3_HUMAN | Homo sapiens (Human) | 40,128 | 353.0 | TISSUE SPECIFICITY: Most abundantly expressed ... | NaN | SUBCELLULAR LOCATION: Cell membrane; Multi-pas... | axon [GO:0030424]; cytoplasm [GO:0005737]; int... | NaN | G-protein coupled receptor 1 family | NaN | 0.000079 | 40128.0 | None | None | [0030424, 0005737, 0005887, 0005886, 0045202, ... | 0.297747 | 0.000734 | NaN | NaN | 17.0 |
| 2 | innatedb:IDBG-100012 | uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... | Q8IWG1 | Q8IWG1 | DNAI3_HUMAN | Homo sapiens (Human) | 102,935 | 891.0 | NaN | DISEASE: Note=A rare heterozygous in-frame DNA... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... | axonemal dynein complex [GO:0005858]; cytoplas... | NaN | NaN | NaN | 0.000040 | 102935.0 | [] | None | [0005858, 0005737, 0036156, 0071933, 0045504, ... | 0.264544 | 0.000114 | NaN | NaN | 12.0 |
| 3 | innatedb:IDBG-10002 | refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... | Q9Y3A4 | Q9Y3A4 | RRP7A_HUMAN | Homo sapiens (Human) | 32,334 | 280.0 | NaN | NaN | NaN | CURI complex [GO:0032545]; cytoplasm [GO:00057... | NaN | RRP7 family | DOMAIN 61..90; /note="RRM" | 0.001033 | 32334.0 | None | [RRM] | [0032545, 0005737, 0005654, 0034456, 0003723, ... | 0.349394 | 0.006516 | NaN | 1.0 | 8.0 |
| 4 | innatedb:IDBG-100021 | uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... | Q8TDD5 | Q8TDD5 | MCLN3_HUMAN | Homo sapiens (Human) | 64,248 | 553.0 | NaN | NaN | SUBCELLULAR LOCATION: Cell membrane {ECO:00002... | autophagosome membrane [GO:0000421]; early end... | DOMAIN: The most N-terminal extracellular/lume... | Transient receptor (TC 1.A.4) family, Polycyst... | NaN | 0.000079 | 64248.0 | None | None | [0000421, 0031901, 0016021, 0031902, 0005765, ... | 0.346118 | 0.002906 | NaN | NaN | 13.0 |
Having added the list lengths to our dataframe, we would like to start by doing linear regressions to compare individual parameters. To do this, we make a list of each possible pair of parameters, and then make a list of linear regression solutions for each of them. Finally, we make a histogram showing the distribution in r-squared values, which are indicators of a good fit.
linearRegressList = ["flMass", "Length",
"Degree_Centrality", "Closeness_Centrality", "Eigenvector_Centrality",
"numMIM", "numDOM", "numGO"]
pairs = []
for i in range(len(linearRegressList)-1):
for j in range(i+1):
pairs.append((linearRegressList[i+1], linearRegressList[j]))regressSols = []
for i in pairs:
a = ~pd.isna(fullProtDF[i[0]])
b = ~pd.isna(fullProtDF[i[1]])
x = np.array(fullProtDF[a & b][i[0]]).reshape(-1,1)
y = np.array(fullProtDF[a & b][i[1]]).reshape(-1,1)
x = sm.add_constant(x)
s = sm.OLS(y, x).fit()
regressSols.append(s)# making a histogram of the r-squared values
fnt = 15
rSq = np.array([i.rsquared for i in regressSols])
plt.figure(figsize = [10,8])
plt.hist(rSq, bins = 30)
plt.grid(True, linestyle = "--")
plt.xlabel(r"$R^2$", fontsize = fnt)
plt.ylabel("Frequency", fontsize = fnt)
plt.title(r"$R^2$ Distribution", fontsize = fnt)
plt.show()We see that some of the parameters have an r squared of greater than .2. We expect the r squared close to 1 to be the Mass Length relationship, as the longer the protein chain, the heavier it is. We plot the Fits on the data for the parameter pairs with an r squared of greater than .2.
rsqCutoff = .2
nrsq = (rSq > rsqCutoff).sum()
print(f"Number of pairs with an r squared greater than .4: {nrsq}")
# print out the pairs, along with their coefficients and r squared values
for i in range(len(regressSols)):
if regressSols[i].rsquared > rsqCutoff:
print(pairs[i])
Number of pairs with an r squared greater than .4: 5
('Length', 'flMass')
('Eigenvector_Centrality', 'Degree_Centrality')
('Eigenvector_Centrality', 'Closeness_Centrality')
('numDOM', 'flMass')
('numDOM', 'Length')
plt.figure(figsize = [10,60])
j = 1
for i in range(len(regressSols)):
if regressSols[i].rsquared > rsqCutoff:
plt.subplot(nrsq,1,j)
j += 1
a = ~pd.isna(fullProtDF[pairs[i][0]])
b = ~pd.isna(fullProtDF[pairs[i][1]])
x = np.array(fullProtDF[a & b][pairs[i][0]]).reshape(-1,1)
y = np.array(fullProtDF[a & b][pairs[i][1]]).reshape(-1,1)
plt.scatter(x, y, label = "Data")
plt.plot(x, regressSols[i].params[1]*x + regressSols[i].params[0], color = "orange", label = "Fit")
plt.xlabel(pairs[i][0], fontsize = 15)
plt.ylabel(pairs[i][1], fontsize = 15)
plt.grid(True, linestyle = "--")
plt.legend(fontsize = 13)
plt.tight_layout()
plt.show()It seems that significant correlations presented are lacking in new information. The number of domains is also expected to correspond with the mass/length of the protein, as a larger protein means more subregions in the protein. In addition, the relation between the measures of centrality is also expected, as they do all measure the same idea of a node being central or important in the graph.
In conclusion, there are no obvious relationships between the location of a protein in an interaction network and the strucutural/locational properties of the protein. In hindsight, that makes a certain amount of sense. A protein may be a part of many pathways, not due to its size or number of domains, but due to how often the function of the protein is necessary. To further investigate this, it would be prudent to observe whether certain pathways or functions correlate with protein centrality. Furthermore, getting details that correspond with the Domain labels, MIM ids, or GO ids from the proteins (i.e. is the centrality for proteins involved in this genetic disease different from proteins involved in that genetic disease). Finally, as proteins have several labels, machine learning may be useful as a predictor of any one label using all the other data.



