Skip to content

aezarebski/derp-simulation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

244 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DERP Simulation

This repository contains the code to simulate a database of phylogenetic trees that will be used in a machine learning project.

Usage

Timelines

Figure fig:timelines shows a schematic of the timelines involved in a simulation. There are two main time points that we use to define times. The origin is the start time of the simulation and the present is the time of the last sequenced sample. Times may be measured either forward increasing into the future, or backward increasing into the past.

Note that the measurement times are randomly sampled between the origin and the present and are measured in forwards time (i.e. zero at the origin increasing towards the present.)

./timelines.png

Getting started generating a database

To generate the database, run the main script:

python main.py <path/to/config.json>

If you want a small example to test this out, try using the config/debugging.json file. The whole simulation is configured by the JSON file provided at the command line.

Optional monitoring progress

Making a large database takes a while. There is a tool in src/monitor.py which you can run and it will report on the progress of a current database construction and give a (rough) estimate of the remaining time for each stage of the simulation.

python src/monitor.py <path/to/config.json>

Configuring a simulation

The way in which a dataset is simulated is configured with a JSON file. There are some example configurations provided:

Additional information about these datasets is given here.

Conditioning

The simulations are conditioned to stop at a maximum time, or as soon as either the prevalence reaches 50,000 or the number of sequences reaches 1,000. Simulations are also conditioned to have more than one sequence, and for the pathogen not to have gone extinct.

This conditioning is coded through the following attributes in the remaster XML template.

<trajectory id="trajectory"
            spec="StochasticTrajectory"
            maxTime="50"
	      endsWhen="X==50000 || Psi==1000"
            mustHave="Psi&lt;1001 &amp;&amp; Psi>1 &amp;&amp; X>1">

Prior distributions

As part of the simulation hyperparameters in the configuration file, the prior distribution of some parameters need to be specified. The following describes how you can specify these prior distributions in the configuration.

Note that since there are acceptance criteria on the simulations, the marginal distributions of the parameters will not exactly match the distribution specified in the configuration. This is because we are effectively conditioning the prior samples on the resulting epidemic simulations conforming to conditions regarding extinction and maximum number of observed cases.

Discrete uniform

If \(n∼\text{Uniform}([3, 7])\) where \(n∈\mathbb{Z}\), then you specify this as

{
    "dist": "uniform_int",
    "lower_bound": 3,
    "upper_bound": 7
}

Uniform

If \(n∼\text{Uniform}([3.3, 6.8])\) where \(n∈\mathbb{R}\), then you specify this as

{
    "dist": "uniform",
    "lower_bound": 3.3,
    "upper_bound": 6.8
}

Lognormal

\(log(x)∼\text{Normal}(0.1, 1.0)\), then you specify this as

{
    "dist": "lognormal",
    "LN_mean": 0.1,
    "LN_sigma": 1.0
}

Beta

If \(p∼\text{Beta}(α=1.1,β=3.0)\) where \(p∈[0,1]\), then you specify this as

{
    "dist": "beta",
    "alpha": 1.1,
    "beta": 3.0
}

Visualising the data

Two scripts, visualisation.py and visualisation_temporal.py in src/ can be used to visualise the output of a simulation.

python src/visualisation_temporal.py <path/to/config.json>

Note the src/visualisation_temporal.py script only applies for simulations which are configured to report temporal data (that is, report_temporal_data is set to true in the config).

Database structure

The database is an HDF5 file. Each simulation is represented with a group with a name of the form record_xxxxxx, e.g. record_000123. The data from each simulation is split into two groups: input and output.

Input

The input group has the following datasets:

present
the time since the origin of the last sequenced sample
tree_height
the time between the $T\text{MRCA}$ and the present
tree
a binary blob which is the pickled reconstructed tree of the sequenced samples in the simulation.

Output

The output group contains a lot of measurements, but the most important is the temporal_measurements dataset. The temporal_measurements dataset has the following columns:

measurement_times (float)
the (forward) time since the origin of the measurements
prevalence (int)
the number of infected individuals
cumulative (int)
the cumulative number of infections
reproductive_number (float)
the reproduction number

Using the database

The following demonstrates how to use the database in Python. Don’t forget to close the database connection after using it! The following script reads in the tree and measurements from a simulation and produces this CSV file and the figure below.

from Bio import Phylo
import h5py
import pickle
import matplotlib.pyplot as plt
import numpy as np

hdf5_file = "./out/sim-charmander/dataset-charmander.hdf5"

db_conn = h5py.File(hdf5_file)

demo_tree = pickle.loads(db_conn['record_000001/input/tree'][...].tobytes())
fig, ax = plt.subplots()
Phylo.draw(demo_tree, do_show=False, axes=ax)
fig.savefig('./out/sim-charmander/plots/demo-tree.png')

measurements = db_conn['record_000001/output/parameters/temporal_measurements'][...]
column_names = measurements.dtype.names
np.savetxt('./out/sim-charmander/demo-measurements.csv',
           measurements, delimiter=',',
           header=','.join(column_names))

db_conn.close()

./out/sim-charmander/plots/demo-tree.png

If you want a GUI to inspect the output HDF5 file, the HDFCompass tool provides a simple way to inspect the data that has been generated. There is some basic information about the simulation stored as attributes in the HDF5 file. This includes the date of creation and the size of the dataset.

FAQs

How can I include a period of time without sampling?

If you include "limited_time_sampling": true in your simulation hyperparameters in the configuration JSON, this will sample a random time throughout the duration of the simulation, and only turn the sampling on after that time (after which it is constant). The random time at which sampling is activated is a proportion of the total epidemic duration drawn from a \(\text{Uniform}(0.3,0.7)\) distribution. You can see an example of this in the following figure.

./out/sim-psyduck/plots/demo-tree-change-times.png

from Bio import Phylo
import h5py
import pickle
import matplotlib.pyplot as plt
import numpy as np

HDF5_FILE = "./out/sim-psyduck/dataset-psyduck.hdf5"
RECORD_ID = "record_000003"

with h5py.File(HDF5_FILE, "r") as db_conn:
    rec = db_conn[RECORD_ID]
    demo_tree = pickle.loads(rec["input/tree"][...].tobytes())
    tree_height = rec["input/tree_height"][()]
    present = rec["input/present"][()]
    epidemic_duration = rec["output/parameters/epidemic_duration"][()]
    sampling_prop_change_times = rec["output/parameters/sampling_prop/change_times"][()]

mr_ca_time = present - tree_height
change_times_tree = sampling_prop_change_times - mr_ca_time

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
Phylo.draw(demo_tree, do_show=False, axes=ax1)

for change_time in change_times_tree:
    ax1.axvline(change_time, color="tomato", linestyle="--", linewidth=1.0, alpha=0.8)

ax1.set_title("Sampling proportion change times")
ax1.set_xlim((-mr_ca_time) * 1.1, (epidemic_duration - mr_ca_time) * 1.1)
ax1.axvline(-mr_ca_time, color="red", linestyle="-", linewidth=1.2, alpha=0.9)
ax1.axvline(epidemic_duration - mr_ca_time, color="red", linestyle="-", linewidth=1.2, alpha=0.9)

x = np.linspace(0.0, 1.0, 200)
uniform_pdf = np.where((x >= 0.3) & (x <= 0.7), 1.0 / 0.6, 0.0)
ax2.plot(x, uniform_pdf, color="blue")
ax2.set_title("Uniform(0.3, 0.7) distribution")
ax2.set_xlabel("x")
ax2.set_ylabel("density")
if sampling_prop_change_times.size > 0 and epidemic_duration != 0:
    normalized_change_time = sampling_prop_change_times[0] / epidemic_duration
    ax2.axvline(
        normalized_change_time, color="blue", linestyle="--", linewidth=1.2, alpha=0.9
    )

fig.tight_layout()
fig.savefig("./out/sim-psyduck/plots/demo-tree-change-times.png")

Why is the Uniform(0.3,0.7) distribution used in limited time sampling?

A \(\text{Uniform}(0.3,0.7)\) distribution is used for the proportion of the epidemic that occurs without sampling because this prevents the activation time falling too close to the start or the end of the epidemic which makes the simulations very time consuming to perform. While pushing the times away from the edges, it still allows for relatively extreme values of this time.

How do I communicate the prior distribution?

There is a script src/make-prior-latex-table.R that might be helpful: it produces a table in LaTeX explaining the prior. This produces a table like the one shown below.

./out/sim-charizard/demo-prior-table.png

Where did the prior distribution come from?

Douglas et al (2021).

Note that because Douglas et al (2021) uses years as their unit of time, the net becoming uninfectious rate needed to be scaled (to days which is what we use here).

How are the change times of the parameters selected?

If parameters change in a simulation, then the times at which this happens are selected uniformly at random between times 0.0 and the end of the epidemic duration in the simulation.

You may be able to find additional information about this in the random_remaster_parameters() function.

How do I get the simulation wall times out of the HDF5 file?

import h5py
import matplotlib.pyplot as plt
import squarify

def wall_time_and_label(db, path):
    label = str(int(path.split("_")[-1]))
    wall_time = db[path].attrs["simulation_wall_time"].item()
    return (label, wall_time)

with h5py.File("out/sim-charmander/dataset-charmander.hdf5", "r") as db:
    times_and_labels = [wall_time_and_label(db, path) for path in db.keys()]
    times_and_labels.sort(key=lambda x: x[1])
    labels, times = zip(*times_and_labels)

plt.figure(figsize=(8, 7), dpi=96)
squarify.plot(sizes=times, color=len(times)*["#1b9e77"], pad=True)
plt.axis("off")
plt.title("Simulation Wall Times")
plt.savefig("out/sim-charmander/plots/walltimes.png")

./out/sim-charmander/plots/walltimes.png

How do I get the configuration out of an HDF5 file?

The configuration used to generate a dataset is stored as an attribute in the HDF5 file. You can recover a copy of the configuration with the following snippet of code.

import h5py
import json
db_conn = h5py.File("out/debugging/dataset-demo.h5py", "r")
config_str = db_conn.attrs["config_json"]
with open("recovered-config.json", "w") as f:
    f.write(config_str)

How do I set up a reproducible environment?

There is a requirements file to install the relevant python packages

python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirments.txt

How do I install BEAST2?

BEAST2 is used to simulate the data. If you don’t have BEAST2 installed, there is a script scr/setupbeast2.sh which will download and install this for you. This script will install remaster. If you don’t want to use the script, once you have BEAST2 installed, you will need to install remaster through BEAUti.

How do I install the Bio module?

pip install biopython

How do I install the <python_package> package?

pip install biopython h5py lxml pandas plotnine

About

Code to simulate phylogenetic trees that can be used to train neural networks

Resources

License

Stars

Watchers

Forks

Contributors