hypa is an open source python package implementing Higher-Order Hypergeometric Path Anomaly Detection, described in the following paper:
Larock, T., Nanumyan, V., Scholtes, I., Casiraghi, G., Eliassi-Rad, T., Schweitzer, F. (2019) Detecting Path Anomalies in Time Series Data on Networks. arXiv preprint arXiv:1905.10580
In the reproduce/ directory of this repository you can find a Python file called anomalous_random.py, which can reproduce Figure 3. You can also find a Jupyter notebook called flight_analysis.ipynb that can reproduce Figure 4.
hypa is written for python 3+ and requires pathpy, a python package for analyzing sequential data using higher-order network models. You can install pathpy via pip using the command pip install pathpy2.
The python implementation of the Hypergeometric distribution, scipy.stats.hypergeom, is not as precise as either the Distributions.jl or R.stats versions, and the hypergeom.logcdf calculation, the most important for HYPA, is very slow.
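To make the quantity concrete, here is a minimal pure-Python sketch of a hypergeometric log-CDF. It is not the code used by hypa or by any of the three backends; the function names and the parameter names (population, successes, draws) are chosen for illustration only. It sums the probability mass terms in log space with the log-sum-exp trick, which is the usual way to keep very small probabilities from underflowing.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_logcdf(k, population, successes, draws):
    """Return log P(X <= k) when `draws` items are sampled without
    replacement from `population` items, `successes` of which are marked."""
    lo = max(0, draws - (population - successes))  # smallest possible X
    hi = min(draws, successes)                     # largest possible X
    if k < lo:
        return -math.inf
    if k >= hi:
        return 0.0
    log_denom = log_comb(population, draws)
    # Log of each probability mass term P(X = i) for i = lo..k.
    terms = [
        log_comb(successes, i)
        + log_comb(population - successes, draws - i)
        - log_denom
        for i in range(lo, k + 1)
    ]
    # Log-sum-exp: factor out the largest term before exponentiating.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Calling math.exp on the result recovers the ordinary CDF value whenever that value is large enough to be represented as a float.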
Because of these precision and speed issues, I have made all three implementations accessible in this package. Once the implementation you want is installed, all you need to do is pass its name to the constructor, e.g. Hypa(paths, implementation='julia'). It should not be necessary to interface directly with the specific implementations. Details are as follows:
- implementation='julia': The Julia implementation in Distributions.jl is the default because it is the fastest and slightly simpler to install and work with (in my experience) than rpy2. We use PyJulia to access Julia from Python. You can follow the instructions in the PyJulia documentation to set it up, but in general you will need to have Julia installed on your machine along with Distributions.jl and PyCall.jl.
- implementation='rpy2': The R.stats implementation is as effective as (if slightly slower than) Distributions.jl, but installing rpy2 to access R from Python tends to be more finicky (in my experience).
- implementation='scipy': Using scipy.stats.hypergeom is the simplest, but also the worst performing. Any installation of scipy should include scipy.stats.hypergeom, so there should be no trouble with imports. We recommend not computing the CDF in log space (e.g. set log=False in Hypa.construct_hypa_network), since the logcdf function is very slow. For reasons still mysterious to me, the cdf function in this version seems to have some numerical issues that the others do not.
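If you are unsure which of the three options is usable on a given machine, a small check like the following may help. This is a sketch, not part of hypa: pick_implementation, module_installed and BACKEND_MODULES are hypothetical names, and the check only looks for the Python-side packages (julia is the import name of PyJulia). It cannot confirm that Julia, R, Distributions.jl or PyCall.jl are themselves installed and working.

```python
import importlib.util

# Implementation string -> import name of the Python package behind it.
# Assumption: PyJulia is imported as `julia`, rpy2 as `rpy2`, SciPy as `scipy`.
BACKEND_MODULES = {"julia": "julia", "rpy2": "rpy2", "scipy": "scipy"}


def module_installed(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_implementation(preferred=("julia", "rpy2", "scipy"),
                        is_installed=module_installed):
    """Return the first implementation string whose Python package is found."""
    for impl in preferred:
        if is_installed(BACKEND_MODULES[impl]):
            return impl
    raise RuntimeError("None of the backend packages could be found.")
```

The returned string can then be passed as the implementation argument. The default order follows the preference described above, with scipy as the last resort.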
You can install hypa as follows:
- Install and test the above requirements.
- Clone this repository using git clone https://github.com/tlarock/hypa.git from a terminal session. This command will download this repository to your computer in the current directory. Where you clone does not matter in principle, but it is best to store it somewhere where it will be safe from being deleted and where you can easily find it again later (e.g. your default ~/Downloads directory may not be the best place).
- Enter the cloned repository (cd hypa) and run the following command, which will install the package locally: pip install -e . (if you do not have root access where you are making the installation, try pip install --user -e . instead).
- Test the package by starting a python session in a different directory (e.g. cd ~) and typing import hypa.
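The final check can also be scripted. The sketch below starts a fresh interpreter in a temporary directory and tries the import there, so that a success cannot be explained by the repository folder happening to be the current directory. The helper name importable_from_elsewhere is hypothetical and not part of hypa.

```python
import subprocess
import sys
import tempfile


def importable_from_elsewhere(module_name):
    """Try `import module_name` in a new interpreter whose working
    directory is an empty temporary folder; return True on success."""
    with tempfile.TemporaryDirectory() as scratch:
        result = subprocess.run(
            [sys.executable, "-c", f"import {module_name}"],
            cwd=scratch,
            capture_output=True,
            text=True,
        )
    return result.returncode == 0
```

After installation, importable_from_elsewhere("hypa") should return True. If it returns False, the most likely cause is that pip installed into a different Python environment than the one running the check.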
If you have installation issues that you believe are directly related to hypa, please feel free to open an issue on this GitHub repository. We do not maintain any of the other dependencies and so are probably not able to help with installation issues, though you are free to ask.
A simple test (based on the toy example in the paper) to make sure things are working is to paste the following code block into an ipython session or run it as a script:
import numpy as np
import pathpy as pp
import hypa
paths = pp.Paths()
paths.add_path(('A','X','C'), frequency=30)
paths.add_path(('B','X','D'), frequency=100)
paths.add_path(('B','X','C'), frequency=105)
print(paths)
hy = hypa.HypaPP.from_paths(paths, k=2, implementation='julia') # Insert your desired implementation (out of 'julia', 'rpy2', 'scipy') here!
print(hy.hypa_net)
print(hy.hypa_net.edges)
for edge, edge_data in hy.hypa_net.edges.items():
    print("edge: {} hypa score: {}".format(edge, np.exp(edge_data['pval'])))
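To read the printed scores, one option is to threshold them. The sketch below assumes that a score is a cumulative probability in [0, 1], that values near 0 indicate under-represented transitions and that values near 1 indicate over-represented ones. The function name classify_score and the default alpha=0.05 are illustrative choices, not part of hypa; consult the paper for the exact definition.

```python
def classify_score(score, alpha=0.05):
    """Label a score in [0, 1] using a symmetric threshold alpha.

    Assumption: scores below alpha are treated as under-represented,
    scores above 1 - alpha as over-represented.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < alpha:
        return "under-represented"
    if score > 1.0 - alpha:
        return "over-represented"
    return "not anomalous"
```

In the loop above, this could be applied as classify_score(np.exp(edge_data['pval'])), since the example exponentiates the stored value before printing it.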