Skip to content

Latest commit

 

History

History
1505 lines (1349 loc) · 51.8 KB

File metadata and controls

1505 lines (1349 loc) · 51.8 KB

Investigating the Properties of Central Proteins in an Interaction Network

By: Paul Kolbeck

Proteins are large molecules inside of organisms that fulfill a huge variety of functions within organisms. All of biological life hinges on the functions of proteins within organisms. From viruses to whales and everything in between, living beings are basically a collection of proteins that, through complex interactions and mechanisms, ensure the continuation of their own existence. As proteins are an underlying part of all biological processes, understanding proteins and their interactions is an important portion research in biology.

Proteins are created when a ribosome (a type of protein complex) takes in a gene sequence in the from of a strand of messenger RNA (mRNA), which was transcripted from the DNA in the nucleus of the cell, and starts an amino acid chain. Amino acids are the building blocks of proteins, and are small organic compounds that have binding sites on them that allow them to be bound together in a chain. As the ribosome travels along the mRNA strand, it encounters a sequence of triplets of bases, known as codons. Each codon corresponds to an amino acid, which the ribosome then takes from the surrounding supply and attaches to the budding amino acid chain in the order in which the codons appear. Thus, the mRNA strand is used to build a new sequence of amino acids. This sequence of amino acids then folds into a 3d shape, which then has complex properties based on the sequence of amino acids and how it folded.

Not only are the proteins themselves complex, but they exist within a sea of other proteins with which they may interact, giving rise to a network of interactions which further infuence the functionality of proteins. As the environments change, so does the function of a protein, so a protein that is a part of one set of interactions in a liver that allow the liver to grow back, may be part of a different set of interactions in skeletal muscle that allow the muscle to contract. Mapping these interactions is thus another integral part in understanding the biological processes in an organism.

In this tutorial, our goal is to investigate possible relationships between several protein characteristics, particularly their role in an interaction network and their structure. For example, one might predict that the number of pathways a protein is a part of may correlate to the number of active sites on a protein, which further correlates with the mass (i.e. size). TODO

Tools

For this analysis, we use Python 3, along with external packages numpy, pandas, matplotlib, networkx, and statsmodel. TODO

# imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import re
import requests
from bs4 import BeautifulSoup
import statsmodels.api as sm
import io
import time
import warnings
warnings.filterwarnings("ignore")

Gathering and Cleaning Data

For this tutorial, we chose to use the interaction network data provided by Innatedb. Innatedb is a database of genes, interactions, and signaling pathways related to the innate immune response of human beings. This was chosen so as to provide a sample of the protein interaction network in the human body with a fleshed out interaction network. As this is a biased sample of the population of all proteins in the human body, results derived from the analysis will only definitively apply to this subset of data, but may inform further investigation in other portions of the human protein interaction network.

Acquiring the Innatedb Data

The InnateDB database is only on the order of a hundred Megabytes, so downloading and importing the entire database was taken as the easiest and cleanest way to work with the database. The data from Innatedb was downloaded as a gz zipped file, extracted, and imported using pandas read_csv method.

# Making a dataframe out of the innate immune system data file, and getting characteristics from it. 
interactionDF = pd.read_csv("./all.mitab/all.mitab", sep = "\t")
print(interactionDF.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418677 entries, 0 to 418676
Data columns (total 42 columns):
 #   Column                               Non-Null Count   Dtype 
---  ------                               --------------   ----- 
 0   #unique_identifier_A                 418677 non-null  object
 1   unique_identifier_B                  418677 non-null  object
 2   alt_identifier_A                     418677 non-null  object
 3   alt_identifier_B                     418677 non-null  object
 4   alias_A                              418677 non-null  object
 5   alias_B                              418677 non-null  object
 6   interaction_detection_method         418677 non-null  object
 7   author                               418677 non-null  object
 8   pmid                                 418677 non-null  object
 9   ncbi_taxid_A                         418677 non-null  object
 10  ncbi_taxid_B                         418677 non-null  object
 11  interaction_type                     417823 non-null  object
 12  source_database                      418677 non-null  object
 13  idinteraction_in_source_db           418677 non-null  object
 14  confidence_score                     418677 non-null  object
 15  expansion_method                     418677 non-null  object
 16  biological_role_A                    418677 non-null  object
 17  biological_role_B                    418677 non-null  object
 18  exp_role_A                           418677 non-null  object
 19  exp_role_B                           418677 non-null  object
 20  interactor_type_A                    418677 non-null  object
 21  interactor_type_B                    418677 non-null  object
 22  xrefs_A                              418677 non-null  object
 23  xrefs_B                              418677 non-null  object
 24  xrefs_interaction                    418677 non-null  object
 25  annotations_A                        418677 non-null  object
 26  annotations_B                        418677 non-null  object
 27  annotations_interaction              418677 non-null  object
 28  ncbi_taxid_host_organism             418677 non-null  object
 29  parameters_interaction               418677 non-null  object
 30  creation_date                        418677 non-null  object
 31  update_date                          418677 non-null  object
 32  checksum_A                           418677 non-null  object
 33  checksum_B                           418677 non-null  object
 34  checksum_interaction                 418677 non-null  object
 35  negative                             418677 non-null  bool  
 36  features_A                           418677 non-null  object
 37  features_B                           418677 non-null  object
 38  stoichiometry_A                      418677 non-null  object
 39  stoichiometry_B                      418677 non-null  object
 40  participant_identification_method_A  418677 non-null  object
 41  participant_identification_method_B  418677 non-null  object
dtypes: bool(1), object(41)
memory usage: 131.4+ MB
None

We see that we get a 131 MB dataframe, where each entry is an interaction. Each entry features a unique identifier created by InnateDB for proteins A and B, as well as several other identifiers. First, we'd like to check that we have a network. To do this, we are going to make a graph out of the first 2000 entries in the dataframe, as the function that draw the graph from matplotlib and networkx can't handle a graph with 400,000+ edges.

# Taking a look at the network
edgeList = interactionDF[["#unique_identifier_A", "unique_identifier_B"]]
# removing self interaction edges, to reduce clutter
edgeListNoSelf = edgeList[edgeList["#unique_identifier_A"] != edgeList["unique_identifier_B"]]
G = nx.Graph(list(np.array(edgeListNoSelf.head(2000))))
plt.figure(figsize = [30,15])
subax1 = plt.subplot(121)
nx.draw(G, node_size = 8)
plt.show()

png

Looking at the graph, we can see that there is one main component with a majority of the proteins, and several isolated components. These isolated components may be connected to the main component if we were to include the rest of the edges, but checking using the method of graphing is unreasonable. However, we can take a closer look at the main component, isolating it using networkx's connected_components algorithm. We change the node color to green and the edge color to red for better contrast.

# getting the largest component
S = [G.subgraph(c).copy() for c in nx.connected_components(G)]
Sl = max(S, key=len)
plt.figure(figsize = [30,15])
subax1 = plt.subplot(121)
nx.draw(Sl, node_size = 10, node_color = "green", edge_color = "red")
plt.show()

png

Just from looking at this graph, we see that we have a variety of positions in the graph - measures of centrality should be varied. We can see that there are some nodes from which there are several edges (i.e. they have a high degree), while there are other nodes that only have one edge. Of course, this is only a component of whole graph, but it suggests that we should expect variance.

Getting More Protein Characteristics - Uniprot Database

We have confirmed that we have a network of protein interactions, which means that we have a means of getting measures of centrality for our proteins. However, we would like to compare these measures of centrality to other characteristics of the proteins, and the InnateDB database does not have any characteristics on the proteins themselves. However, the dataframe does supply additional IDs for the proteins, which can be used to query other databases for more information on the proteins. We chose to use the Uniprot database for our information on the proteins. The uniprot IDs are listed under the "alias_(A,B)" columns in the InnateDB data, along with a few other IDs. To acquire the Uniprot ID, we use a regular expression. Note that the "alias_(A,B)" actually contains both the Uniprot ID and the Uniprot Entry Name, so we add the '^_' to our regular expression to ensure we match on the ID.

# Defining function to get the uniprotkb protein id, if it exists. 
def getUniprotkb(str):
    r = re.search("uniprotkb:([^_|]+)[|]", str + "|")
    if r:
        return r.group(1)
    else:
        return "NONE"

# get mapping from innatedb unique identifier to uniprot ID
aProts = (interactionDF["#unique_identifier_A"] + "->" + interactionDF["alias_A"]).unique()
bProts = (interactionDF["unique_identifier_B"] + "->" + interactionDF["alias_B"]).unique()
allProts = np.unique(np.concatenate((aProts, bProts), axis = 0))

protDF  = pd.DataFrame([i.split("->") for i in allProts], columns = ["innateDB_ID","alias"])
protDF["UniprotID"] = protDF["alias"].apply(getUniprotkb)

# Display head, get the number of entries without a uniprotID
print(f"Number of unique innatedb ids: {len(protDF['innateDB_ID'].unique())}")
print(f"{pd.isna(protDF['UniprotID']).sum()} entries are missing a uniprotID out of {protDF.shape[0]}")
display(protDF.head())
Number of unique innatedb ids: 25271
0 entries are missing a uniprotID out of 25271
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
innateDB_ID alias UniprotID
0 innatedb:IDBG-1 refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... Q05066
1 innatedb:IDBG-100000 uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... Q9UBY5
2 innatedb:IDBG-100012 uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... Q8IWG1
3 innatedb:IDBG-10002 refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... Q9Y3A4
4 innatedb:IDBG-100021 uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... Q8TDD5

We have a Uniprot ID for every InnateDB ID! This means we don't have to handle any missing values yet.

Querying

Now we need to get the protein data from the Uniprot database. Uniprot is popular enough to luckily have an extensive API (TODO), which we take advantage of to query their database for information each protein. There is an upper limit on the number of individual ids that can be queried at the same time, so we divide the list of 25000+ IDs into blocks of length 100. The API also specifies that queries can be limited to certain columns, so we choose a set of columns relevant to our investigation. We chose the id again, the entry name, the organism (to confirm this is a human protein), the mass, the length, the tissue specificity, the disease involvement, the subcellular location, the gene ontology, the domains, the family and the domain extent.

# breaking the queries into blocks
def makeQueryList(n, ids):
    numBlocks = int((len(ids)/n + 1) - len(ids)/n % 1)
    queryList = []
    for i in range(numBlocks - 1):
        entry = "id:" + ids[i*n]
        for j in range(n-1):
            entry = entry + "+OR+id:" + ids[i*n + 1 + j]
        queryList.append(entry)
    entry = ids[(numBlocks-1)*n]
    for j in range(len(ids) - (numBlocks-1)*n - 1):
        entry = entry + "+OR+" + ids[(numBlocks-1)*n + 1 + j]
    return queryList

proteinQueryList = makeQueryList(100, np.array(protDF[~pd.isna(protDF['UniprotID'])]['UniprotID']))
columns = "columns=id,entry name,organism,mass,length,comment(TISSUE SPECIFICITY),comment(DISEASE),comment(SUBCELLULAR LOCATION),go,comment(DOMAIN),families,feature(DOMAIN EXTENT)"
fmt = "format=tab"
# querying the database for information on the proteins
colQuery = requests.get("https://www.uniprot.org/uniprot/?query=" + protDF[~pd.isna(protDF['UniprotID'])]['UniprotID'][0] + "&" + columns + "&" + fmt)
temp = pd.read_csv(io.StringIO(colQuery.text), sep = "\t")
uniprotDF = pd.DataFrame(columns = temp.columns)
print("Progress\n" + "_"*100)
for i in range(len(proteinQueryList)):
    # make request
    r = requests.get("https://www.uniprot.org/uniprot/?query=" + proteinQueryList[i] + "&" + columns + "&" + fmt)
    temp = pd.read_csv(io.StringIO(r.text), sep = "\t")
    uniprotDF = uniprotDF.merge(temp, how = "outer")
    if i/len(proteinQueryList) % .01 < (i-1)/len(proteinQueryList) % .01: print("=", end = "")
    time.sleep(1)
Progress
____________________________________________________________________________________________________
====================================================================================================
print(uniprotDF.info())
display(uniprotDF.head())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23655 entries, 0 to 23654
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Entry                      23655 non-null  object 
 1   Entry name                 23655 non-null  object 
 2   Organism                   23617 non-null  object 
 3   Mass                       23617 non-null  object 
 4   Length                     23617 non-null  float64
 5   Tissue specificity         12509 non-null  object 
 6   Involvement in disease     4452 non-null   object 
 7   Subcellular location [CC]  20868 non-null  object 
 8   Gene ontology (GO)         23253 non-null  object 
 9   Domain [CC]                5417 non-null   object 
 10  Protein families           17655 non-null  object 
 11  Domain [FT]                10999 non-null  object 
dtypes: float64(1), object(11)
memory usage: 2.3+ MB
None
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Entry Entry name Organism Mass Length Tissue specificity Involvement in disease Subcellular location [CC] Gene ontology (GO) Domain [CC] Protein families Domain [FT]
0 P19320 VCAM1_HUMAN Homo sapiens (Human) 81,276 739.0 TISSUE SPECIFICITY: Expressed on inflamed vasc... NaN SUBCELLULAR LOCATION: Membrane; Single-pass ty... alpha9-beta1 integrin-vascular cell adhesion m... DOMAIN: Either the first or the fourth Ig-like... NaN DOMAIN 25..105; /note="Ig-like C2-type 1"; DO...
1 P36543 VATE1_HUMAN Homo sapiens (Human) 26,145 226.0 TISSUE SPECIFICITY: Kidney; localizes to early... DISEASE: Cutis laxa, autosomal recessive, 2C (... SUBCELLULAR LOCATION: Apical cell membrane {EC... apical plasma membrane [GO:0016324]; cytosol [... NaN V-ATPase E subunit family NaN
2 Q9UKA9 PTBP2_HUMAN Homo sapiens (Human) 57,491 531.0 TISSUE SPECIFICITY: Mainly expressed in brain ... NaN SUBCELLULAR LOCATION: Nucleus {ECO:0000250|Uni... growth cone [GO:0030426]; neuronal cell body [... NaN NaN DOMAIN 59..133; /note="RRM 1"; /evidence="EC...
3 Q96NR8 RDH12_HUMAN Homo sapiens (Human) 35,094 316.0 TISSUE SPECIFICITY: Widely expressed, mostly i... DISEASE: Leber congenital amaurosis 13 (LCA13)... SUBCELLULAR LOCATION: Endoplasmic reticulum me... endoplasmic reticulum membrane [GO:0005789]; p... NaN Short-chain dehydrogenases/reductases (SDR) fa... NaN
4 Q52LW3 RHG29_HUMAN Homo sapiens (Human) 142,064 1261.0 TISSUE SPECIFICITY: Widely expressed. Highly e... NaN NaN cytoplasm [GO:0005737]; cytosol [GO:0005829]; ... NaN NaN DOMAIN 192..462; /note="F-BAR"; /evidence="E...

We have successfully queried the database! We see that the columns can vary with the number of missing values. Further, by looking at some of the entries, we can already guess that they will be difficult to impossible to extract useful information from, as they are hand-written comments that use varied human language. The tissue specificity column, for example, has entries that read "not expressed in ..." This will be further investigated later in this tutorial.

Merging

Now, we would like to combine the InnateDB unique IDS with the Uniprot protein information. To do this, we do a left merge on the "UniprotID" column and the "Entry" column of the InnateDB IDs dataframe and the Uniprot dataframe.

print(protDF.shape)
print(uniprotDF.shape)
display(protDF.head())
display(uniprotDF.head())
(25271, 3)
(23655, 12)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
innateDB_ID alias UniprotID
0 innatedb:IDBG-1 refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... Q05066
1 innatedb:IDBG-100000 uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... Q9UBY5
2 innatedb:IDBG-100012 uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... Q8IWG1
3 innatedb:IDBG-10002 refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... Q9Y3A4
4 innatedb:IDBG-100021 uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... Q8TDD5
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Entry Entry name Organism Mass Length Tissue specificity Involvement in disease Subcellular location [CC] Gene ontology (GO) Domain [CC] Protein families Domain [FT]
0 P19320 VCAM1_HUMAN Homo sapiens (Human) 81,276 739.0 TISSUE SPECIFICITY: Expressed on inflamed vasc... NaN SUBCELLULAR LOCATION: Membrane; Single-pass ty... alpha9-beta1 integrin-vascular cell adhesion m... DOMAIN: Either the first or the fourth Ig-like... NaN DOMAIN 25..105; /note="Ig-like C2-type 1"; DO...
1 P36543 VATE1_HUMAN Homo sapiens (Human) 26,145 226.0 TISSUE SPECIFICITY: Kidney; localizes to early... DISEASE: Cutis laxa, autosomal recessive, 2C (... SUBCELLULAR LOCATION: Apical cell membrane {EC... apical plasma membrane [GO:0016324]; cytosol [... NaN V-ATPase E subunit family NaN
2 Q9UKA9 PTBP2_HUMAN Homo sapiens (Human) 57,491 531.0 TISSUE SPECIFICITY: Mainly expressed in brain ... NaN SUBCELLULAR LOCATION: Nucleus {ECO:0000250|Uni... growth cone [GO:0030426]; neuronal cell body [... NaN NaN DOMAIN 59..133; /note="RRM 1"; /evidence="EC...
3 Q96NR8 RDH12_HUMAN Homo sapiens (Human) 35,094 316.0 TISSUE SPECIFICITY: Widely expressed, mostly i... DISEASE: Leber congenital amaurosis 13 (LCA13)... SUBCELLULAR LOCATION: Endoplasmic reticulum me... endoplasmic reticulum membrane [GO:0005789]; p... NaN Short-chain dehydrogenases/reductases (SDR) fa... NaN
4 Q52LW3 RHG29_HUMAN Homo sapiens (Human) 142,064 1261.0 TISSUE SPECIFICITY: Widely expressed. Highly e... NaN NaN cytoplasm [GO:0005737]; cytosol [GO:0005829]; ... NaN NaN DOMAIN 192..462; /note="F-BAR"; /evidence="E...
# merging the dataframes using the equivalent to SQL's left join, on the uniprot ID.
fullProtDF = protDF.merge(uniprotDF, how = "left", left_on = "UniprotID", right_on = "Entry")
print(fullProtDF.info())
display(fullProtDF.head())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25271 entries, 0 to 25270
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   innateDB_ID                25271 non-null  object 
 1   alias                      25271 non-null  object 
 2   UniprotID                  25271 non-null  object 
 3   Entry                      23725 non-null  object 
 4   Entry name                 23725 non-null  object 
 5   Organism                   23676 non-null  object 
 6   Mass                       23676 non-null  object 
 7   Length                     23676 non-null  float64
 8   Tissue specificity         12533 non-null  object 
 9   Involvement in disease     4462 non-null   object 
 10  Subcellular location [CC]  20921 non-null  object 
 11  Gene ontology (GO)         23312 non-null  object 
 12  Domain [CC]                5426 non-null   object 
 13  Protein families           17706 non-null  object 
 14  Domain [FT]                11014 non-null  object 
dtypes: float64(1), object(14)
memory usage: 3.1+ MB
None
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
innateDB_ID alias UniprotID Entry Entry name Organism Mass Length Tissue specificity Involvement in disease Subcellular location [CC] Gene ontology (GO) Domain [CC] Protein families Domain [FT]
0 innatedb:IDBG-1 refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... Q05066 Q05066 SRY_HUMAN Homo sapiens (Human) 23,884 204.0 NaN DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... chromatin [GO:0000785]; cytoplasm [GO:0005737]... DOMAIN: DNA binding and bending properties of ... SRY family NaN
1 innatedb:IDBG-100000 uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... Q9UBY5 Q9UBY5 LPAR3_HUMAN Homo sapiens (Human) 40,128 353.0 TISSUE SPECIFICITY: Most abundantly expressed ... NaN SUBCELLULAR LOCATION: Cell membrane; Multi-pas... axon [GO:0030424]; cytoplasm [GO:0005737]; int... NaN G-protein coupled receptor 1 family NaN
2 innatedb:IDBG-100012 uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... Q8IWG1 Q8IWG1 DNAI3_HUMAN Homo sapiens (Human) 102,935 891.0 NaN DISEASE: Note=A rare heterozygous in-frame DNA... SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... axonemal dynein complex [GO:0005858]; cytoplas... NaN NaN NaN
3 innatedb:IDBG-10002 refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... Q9Y3A4 Q9Y3A4 RRP7A_HUMAN Homo sapiens (Human) 32,334 280.0 NaN NaN NaN CURI complex [GO:0032545]; cytoplasm [GO:00057... NaN RRP7 family DOMAIN 61..90; /note="RRM"
4 innatedb:IDBG-100021 uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... Q8TDD5 Q8TDD5 MCLN3_HUMAN Homo sapiens (Human) 64,248 553.0 NaN NaN SUBCELLULAR LOCATION: Cell membrane {ECO:00002... autophagosome membrane [GO:0000421]; early end... DOMAIN: The most N-terminal extracellular/lume... Transient receptor (TC 1.A.4) family, Polycyst... NaN

While each InnateDB had listed a UniprotID, not every UniprotID was successfully queried from the Uniprot database. This means some of our InnateDB IDs (the nodes on our network) will not have ay information from Uniprot database.

Tidying Data

Now that we have Uniprot data associated with most of the nodes, we would like to make this data usable. Most of the columns are strings that are impossible to use in data analysis. However, some of the columns can be converted into usable information. The "Involvement in disease" entries have Mendelian Inheritance in Man (MIM) Ids, which are IDs of known genetic diseases and can be extracted via a regular expression. The same goes for The "Domain [FT]" column have a set number of descriptors, and the "Gene Ontology (GO)" column, which has Gene Ontology (GO) Ids that describe subcellular locations. The Ids are extracted, put into a list, and into new columns.

# get the mass as a float
fullProtDF["flMass"] = fullProtDF["Mass"].apply(lambda x: float(str(x).replace(",","")) if ~pd.isna(x) else None)

# process into list of MIM ids
def getMIM(s):
    if pd.isna(s):
        return None
    else:
        return re.findall("\[MIM:(\d+)\]", s)
fullProtDF["MIM"] = fullProtDF["Involvement in disease"].apply(getMIM)

# process domain column into list of IDs
def getDOM(s):
    if pd.isna(s):
        return None
    else:
        return re.findall('/note="([\w\s-]+)"', s)
fullProtDF["Domain_ID"] = fullProtDF["Domain [FT]"].apply(getDOM)

# process to get GO IDs
def getGO(s):
    if pd.isna(s):
        return None
    else:
        return re.findall("\[GO:(\d+)\]", s)
fullProtDF["GO_ID"] = fullProtDF["Gene ontology (GO)"].apply(getGO)

display(fullProtDF.head())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
innateDB_ID alias UniprotID Entry Entry name Organism Mass Length Tissue specificity Involvement in disease Subcellular location [CC] Gene ontology (GO) Domain [CC] Protein families Domain [FT] Degree_Centrality flMass MIM
0 innatedb:IDBG-1 refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... Q05066 Q05066 SRY_HUMAN Homo sapiens (Human) 23,884 204.0 NaN DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... chromatin [GO:0000785]; cytoplasm [GO:0005737]... DOMAIN: DNA binding and bending properties of ... SRY family NaN 0.000953 23884.0 [400044, 400045]
1 innatedb:IDBG-100000 uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... Q9UBY5 Q9UBY5 LPAR3_HUMAN Homo sapiens (Human) 40,128 353.0 TISSUE SPECIFICITY: Most abundantly expressed ... NaN SUBCELLULAR LOCATION: Cell membrane; Multi-pas... axon [GO:0030424]; cytoplasm [GO:0005737]; int... NaN G-protein coupled receptor 1 family NaN 0.000079 40128.0 None
2 innatedb:IDBG-100012 uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... Q8IWG1 Q8IWG1 DNAI3_HUMAN Homo sapiens (Human) 102,935 891.0 NaN DISEASE: Note=A rare heterozygous in-frame DNA... SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... axonemal dynein complex [GO:0005858]; cytoplas... NaN NaN NaN 0.000040 102935.0 []
3 innatedb:IDBG-10002 refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... Q9Y3A4 Q9Y3A4 RRP7A_HUMAN Homo sapiens (Human) 32,334 280.0 NaN NaN NaN CURI complex [GO:0032545]; cytoplasm [GO:00057... NaN RRP7 family DOMAIN 61..90; /note="RRM" 0.001033 32334.0 None
4 innatedb:IDBG-100021 uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... Q8TDD5 Q8TDD5 MCLN3_HUMAN Homo sapiens (Human) 64,248 553.0 NaN NaN SUBCELLULAR LOCATION: Cell membrane {ECO:00002... autophagosome membrane [GO:0000421]; early end... DOMAIN: The most N-terminal extracellular/lume... Transient receptor (TC 1.A.4) family, Polycyst... NaN 0.000079 64248.0 None

Graph Centrality Data

Now that we have useful forms of the protein characteristics, we want to get measures of centrality on the nodes in our network. To do this, we make use of the networkx centrality algorithms. We compute the measure of centrality, and add it as a column to our dataframe. For our measures of centrality, we chose degree, closeness, and eigenvector, as these are some of the most common forms of measuring centrality. Betweenness was left out due to computational limitations.

# getting the graph from the edge list
G = nx.Graph(list(np.array(edgeListNoSelf)))

# adding measures of centrality to the data frame
# degree centrality
a = nx.degree_centrality(G)
fullProtDF["Degree_Centrality"] = fullProtDF["innateDB_ID"].apply(lambda x: a.get(x))


# closeness centrality
a = nx.closeness_centrality(G)
fullProtDF["Closeness_Centrality"] = fullProtDF["innateDB_ID"].apply(lambda x: a.get(x))


# eigenvector centrality
a = nx.eigenvector_centrality(G)
fullProtDF["Eigenvector_Centrality"] = fullProtDF["innateDB_ID"].apply(lambda x: a.get(x))
display(fullProtDF.head())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
innateDB_ID alias UniprotID Entry Entry name Organism Mass Length Tissue specificity Involvement in disease Subcellular location [CC] Gene ontology (GO) Domain [CC] Protein families Domain [FT] Degree_Centrality flMass MIM Domain_ID GO_ID Closeness_Centrality Eigenvector_Centrality
0 innatedb:IDBG-1 refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... Q05066 Q05066 SRY_HUMAN Homo sapiens (Human) 23,884 204.0 NaN DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... chromatin [GO:0000785]; cytoplasm [GO:0005737]... DOMAIN: DNA binding and bending properties of ... SRY family NaN 0.000953 23884.0 [400044, 400045] None [0000785, 0005737, 0016607, 0005654, 0005634, ... 0.321910 0.001644
1 innatedb:IDBG-100000 uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... Q9UBY5 Q9UBY5 LPAR3_HUMAN Homo sapiens (Human) 40,128 353.0 TISSUE SPECIFICITY: Most abundantly expressed ... NaN SUBCELLULAR LOCATION: Cell membrane; Multi-pas... axon [GO:0030424]; cytoplasm [GO:0005737]; int... NaN G-protein coupled receptor 1 family NaN 0.000079 40128.0 None None [0030424, 0005737, 0005887, 0005886, 0045202, ... 0.297747 0.000734
2 innatedb:IDBG-100012 uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... Q8IWG1 Q8IWG1 DNAI3_HUMAN Homo sapiens (Human) 102,935 891.0 NaN DISEASE: Note=A rare heterozygous in-frame DNA... SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... axonemal dynein complex [GO:0005858]; cytoplas... NaN NaN NaN 0.000040 102935.0 [] None [0005858, 0005737, 0036156, 0071933, 0045504, ... 0.264544 0.000114
3 innatedb:IDBG-10002 refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... Q9Y3A4 Q9Y3A4 RRP7A_HUMAN Homo sapiens (Human) 32,334 280.0 NaN NaN NaN CURI complex [GO:0032545]; cytoplasm [GO:00057... NaN RRP7 family DOMAIN 61..90; /note="RRM" 0.001033 32334.0 None [RRM] [0032545, 0005737, 0005654, 0034456, 0003723, ... 0.349394 0.006516
4 innatedb:IDBG-100021 uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... Q8TDD5 Q8TDD5 MCLN3_HUMAN Homo sapiens (Human) 64,248 553.0 NaN NaN SUBCELLULAR LOCATION: Cell membrane {ECO:00002... autophagosome membrane [GO:0000421]; early end... DOMAIN: The most N-terminal extracellular/lume... Transient receptor (TC 1.A.4) family, Polycyst... NaN 0.000079 64248.0 None None [0000421, 0031901, 0016021, 0031902, 0005765, ... 0.346118 0.002906

Linear Regression

Having gotten measures of centrality, we would like to compare the protein characteristics with the measures of centrality via a linear regression. Our list of ids is insufficient for this, as a linear regression does not work with categorical data, to make the data usable, we take the length of the lists, so that we can compare the centrality with the number of domains, genetic diseases, and subcellular locations.

# Get list of all attributes for each category

# Domains
DomList = []
for i in fullProtDF["Domain_ID"]:
    if i:
        for j in i:
            if not j in DomList:
                DomList.append(j)
print(len(DomList))

# MIM Ids
MIMList = []
for i in fullProtDF["MIM"]:
    if i:
        for j in i:
            if not j in MIMList:
                MIMList.append(j)
print(len(MIMList))

# Gene Ontology Ids
GOList = []
for i in fullProtDF["GO_ID"]:
    if i:
        for j in i:
            if not j in GOList:
                GOList.append(j)
print(len(GOList))
2072
5293
18911
# add list lengths to dataframe
fullProtDF["numMIM"] = fullProtDF["MIM"].apply(lambda x: len(x) if x else None)
fullProtDF["numDOM"] = fullProtDF["Domain_ID"].apply(lambda x: len(x) if x else None)
fullProtDF["numGO"] = fullProtDF["GO_ID"].apply(lambda x: len(x) if x else None)
display(fullProtDF.head())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
innateDB_ID alias UniprotID Entry Entry name Organism Mass Length Tissue specificity Involvement in disease Subcellular location [CC] Gene ontology (GO) Domain [CC] Protein families Domain [FT] Degree_Centrality flMass MIM Domain_ID GO_ID Closeness_Centrality Eigenvector_Centrality numMIM numDOM numGO
0 innatedb:IDBG-1 refseq:NP_003131|uniprotkb:Q05066|uniprotkb:SR... Q05066 Q05066 SRY_HUMAN Homo sapiens (Human) 23,884 204.0 NaN DISEASE: 46,XY sex reversal 1 (SRXY1) [MIM:400... SUBCELLULAR LOCATION: Nucleus speckle {ECO:000... chromatin [GO:0000785]; cytoplasm [GO:0005737]... DOMAIN: DNA binding and bending properties of ... SRY family NaN 0.000953 23884.0 [400044, 400045] None [0000785, 0005737, 0016607, 0005654, 0005634, ... 0.321910 0.001644 2.0 NaN 20.0
1 innatedb:IDBG-100000 uniprotkb:LPAR3_HUMAN|refseq:NP_036284|uniprot... Q9UBY5 Q9UBY5 LPAR3_HUMAN Homo sapiens (Human) 40,128 353.0 TISSUE SPECIFICITY: Most abundantly expressed ... NaN SUBCELLULAR LOCATION: Cell membrane; Multi-pas... axon [GO:0030424]; cytoplasm [GO:0005737]; int... NaN G-protein coupled receptor 1 family NaN 0.000079 40128.0 None None [0030424, 0005737, 0005887, 0005886, 0045202, ... 0.297747 0.000734 NaN NaN 17.0
2 innatedb:IDBG-100012 uniprotkb:Q8IWG1|uniprotkb:WDR63_HUMAN|refseq:... Q8IWG1 Q8IWG1 DNAI3_HUMAN Homo sapiens (Human) 102,935 891.0 NaN DISEASE: Note=A rare heterozygous in-frame DNA... SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... axonemal dynein complex [GO:0005858]; cytoplas... NaN NaN NaN 0.000040 102935.0 [] None [0005858, 0005737, 0036156, 0071933, 0045504, ... 0.264544 0.000114 NaN NaN 12.0
3 innatedb:IDBG-10002 refseq:NP_056518|uniprotkb:Q9Y3A4|uniprotkb:RR... Q9Y3A4 Q9Y3A4 RRP7A_HUMAN Homo sapiens (Human) 32,334 280.0 NaN NaN NaN CURI complex [GO:0032545]; cytoplasm [GO:00057... NaN RRP7 family DOMAIN 61..90; /note="RRM" 0.001033 32334.0 None [RRM] [0032545, 0005737, 0005654, 0034456, 0003723, ... 0.349394 0.006516 NaN 1.0 8.0
4 innatedb:IDBG-100021 uniprotkb:MCLN3_HUMAN|uniprotkb:Q8TDD5|refseq:... Q8TDD5 Q8TDD5 MCLN3_HUMAN Homo sapiens (Human) 64,248 553.0 NaN NaN SUBCELLULAR LOCATION: Cell membrane {ECO:00002... autophagosome membrane [GO:0000421]; early end... DOMAIN: The most N-terminal extracellular/lume... Transient receptor (TC 1.A.4) family, Polycyst... NaN 0.000079 64248.0 None None [0000421, 0031901, 0016021, 0031902, 0005765, ... 0.346118 0.002906 NaN NaN 13.0

Having added the list lengths to our dataframe, we would like to start by doing linear regressions to compare individual parameters. To do this, we make a list of each possible pair of parameters, and then make a list of linear regression solutions for each of them. Finally, we make a histogram showing the distribution in r-squared values, which are indicators of a good fit.

linearRegressList = ["flMass", "Length", 
                     "Degree_Centrality", "Closeness_Centrality", "Eigenvector_Centrality", 
                     "numMIM", "numDOM", "numGO"]
pairs = []
for i in range(len(linearRegressList)-1):
    for j in range(i+1):
        pairs.append((linearRegressList[i+1], linearRegressList[j]))
regressSols = []
for i in pairs:
    a = ~pd.isna(fullProtDF[i[0]])
    b = ~pd.isna(fullProtDF[i[1]])

    x = np.array(fullProtDF[a & b][i[0]]).reshape(-1,1)
    y = np.array(fullProtDF[a & b][i[1]]).reshape(-1,1)
    x = sm.add_constant(x)
    s = sm.OLS(y, x).fit()
    regressSols.append(s)
# making a histogram of the r-squared values
fnt = 15
rSq = np.array([i.rsquared for i in regressSols])
plt.figure(figsize = [10,8])
plt.hist(rSq, bins = 30)
plt.grid(True, linestyle = "--")
plt.xlabel(r"$R^2$", fontsize = fnt)
plt.ylabel("Frequency", fontsize = fnt)
plt.title(r"$R^2$ Distribution", fontsize = fnt)
plt.show()

png

We see that some of the parameters have an r squared of greater than .2. We expect the r squared close to 1 to be the Mass Length relationship, as the longer the protein chain, the heavier it is. We plot the Fits on the data for the parameter pairs with an r squared of greater than .2.

rsqCutoff = .2
nrsq = (rSq > rsqCutoff).sum()
print(f"Number of pairs with an r squared greater than .4: {nrsq}")
# print out the pairs, along with their coefficients and r squared values
for i in range(len(regressSols)):
    if regressSols[i].rsquared > rsqCutoff:
        print(pairs[i])
        
Number of pairs with an r squared greater than .4: 5
('Length', 'flMass')
('Eigenvector_Centrality', 'Degree_Centrality')
('Eigenvector_Centrality', 'Closeness_Centrality')
('numDOM', 'flMass')
('numDOM', 'Length')
plt.figure(figsize = [10,60])
j = 1
for i in range(len(regressSols)):
    if regressSols[i].rsquared > rsqCutoff:
        plt.subplot(nrsq,1,j)
        j += 1
        a = ~pd.isna(fullProtDF[pairs[i][0]])
        b = ~pd.isna(fullProtDF[pairs[i][1]])

        x = np.array(fullProtDF[a & b][pairs[i][0]]).reshape(-1,1)
        y = np.array(fullProtDF[a & b][pairs[i][1]]).reshape(-1,1)
        plt.scatter(x, y, label = "Data")
        plt.plot(x, regressSols[i].params[1]*x + regressSols[i].params[0], color = "orange", label = "Fit")
        plt.xlabel(pairs[i][0], fontsize = 15)
        plt.ylabel(pairs[i][1], fontsize = 15)
        plt.grid(True, linestyle = "--")
        plt.legend(fontsize = 13)
plt.tight_layout()
plt.show()

png

It seems that significant correlations presented are lacking in new information. The number of domains is also expected to correspond with the mass/length of the protein, as a larger protein means more subregions in the protein. In addition, the relation between the measures of centrality is also expected, as they do all measure the same idea of a node being central or important in the graph.

Conclusion/Future Work

In conclusion, there are no obvious relationships between the location of a protein in an interaction network and the strucutural/locational properties of the protein. In hindsight, that makes a certain amount of sense. A protein may be a part of many pathways, not due to its size or number of domains, but due to how often the function of the protein is necessary. To further investigate this, it would be prudent to observe whether certain pathways or functions correlate with protein centrality. Furthermore, getting details that correspond with the Domain labels, MIM ids, or GO ids from the proteins (i.e. is the centrality for proteins involved in this genetic disease different from proteins involved in that genetic disease). Finally, as proteins have several labels, machine learning may be useful as a predictor of any one label using all the other data.