This repository contains the code to simulate a database of phylogenetic trees that will be used in a machine learning project.
Figure fig:timelines shows a schematic of the timelines involved in a simulation. There are two main time points that we use to define times. The origin is the start time of the simulation and the present is the time of the last sequenced sample. Times may be measured either forward increasing into the future, or backward increasing into the past.
Note that the measurement times are randomly sampled between the origin and the present and are measured in forwards time (i.e. zero at the origin increasing towards the present.)
To generate the database, run the main script:
python main.py <path/to/config.json>If you want a small example to test this out, try using the
config/debugging.json file. The whole simulation is configured by
the JSON file provided at the command line.
Making a large database takes a while. There is a tool in
src/monitor.py which you can run and it will report on the progress
of a current database construction and give a (rough) estimate of the
remaining time for each stage of the simulation.
python src/monitor.py <path/to/config.json>The way in which a dataset is simulated is configured with a JSON file. There are some example configurations provided:
Additional information about these datasets is given here.
The simulations are conditioned to stop at a maximum time, or as soon as either the prevalence reaches 50,000 or the number of sequences reaches 1,000. Simulations are also conditioned to have more than one sequence, and for the pathogen not to have gone extinct.
This conditioning is coded through the following attributes in the remaster XML template.
<trajectory id="trajectory"
spec="StochasticTrajectory"
maxTime="50"
endsWhen="X==50000 || Psi==1000"
mustHave="Psi<1001 && Psi>1 && X>1">As part of the simulation hyperparameters in the configuration file, the prior distribution of some parameters need to be specified. The following describes how you can specify these prior distributions in the configuration.
Note that since there are acceptance criteria on the simulations, the marginal distributions of the parameters will not exactly match the distribution specified in the configuration. This is because we are effectively conditioning the prior samples on the resulting epidemic simulations conforming to conditions regarding extinction and maximum number of observed cases.
If \(n∼\text{Uniform}([3, 7])\) where \(n∈\mathbb{Z}\), then you specify this as
{
"dist": "uniform_int",
"lower_bound": 3,
"upper_bound": 7
}If \(n∼\text{Uniform}([3.3, 6.8])\) where \(n∈\mathbb{R}\), then you specify this as
{
"dist": "uniform",
"lower_bound": 3.3,
"upper_bound": 6.8
}\(log(x)∼\text{Normal}(0.1, 1.0)\), then you specify this as
{
"dist": "lognormal",
"LN_mean": 0.1,
"LN_sigma": 1.0
}If \(p∼\text{Beta}(α=1.1,β=3.0)\) where \(p∈[0,1]\), then you specify this as
{
"dist": "beta",
"alpha": 1.1,
"beta": 3.0
}Two scripts, visualisation.py and visualisation_temporal.py in
src/ can be used to visualise the output of a simulation.
python src/visualisation_temporal.py <path/to/config.json>Note the src/visualisation_temporal.py script only applies for
simulations which are configured to report temporal data (that is,
report_temporal_data is set to true in the config).
The database is an HDF5 file. Each simulation is represented with a
group with a name of the form record_xxxxxx, e.g. record_000123.
The data from each simulation is split into two groups: input and
output.
The input group has the following datasets:
present- the time since the origin of the last sequenced sample
tree_height- the time between the $T\text{MRCA}$ and the present
tree- a binary blob which is the pickled reconstructed tree of the sequenced samples in the simulation.
The output group contains a lot of measurements, but the most
important is the temporal_measurements dataset. The
temporal_measurements dataset has the following columns:
measurement_times(float)- the (forward) time since the origin of the measurements
prevalence(int)- the number of infected individuals
cumulative(int)- the cumulative number of infections
reproductive_number(float)- the reproduction number
The following demonstrates how to use the database in Python. Don’t forget to close the database connection after using it! The following script reads in the tree and measurements from a simulation and produces this CSV file and the figure below.
from Bio import Phylo
import h5py
import pickle
import matplotlib.pyplot as plt
import numpy as np
hdf5_file = "./out/sim-charmander/dataset-charmander.hdf5"
db_conn = h5py.File(hdf5_file)
demo_tree = pickle.loads(db_conn['record_000001/input/tree'][...].tobytes())
fig, ax = plt.subplots()
Phylo.draw(demo_tree, do_show=False, axes=ax)
fig.savefig('./out/sim-charmander/plots/demo-tree.png')
measurements = db_conn['record_000001/output/parameters/temporal_measurements'][...]
column_names = measurements.dtype.names
np.savetxt('./out/sim-charmander/demo-measurements.csv',
measurements, delimiter=',',
header=','.join(column_names))
db_conn.close()If you want a GUI to inspect the output HDF5 file, the HDFCompass tool provides a simple way to inspect the data that has been generated. There is some basic information about the simulation stored as attributes in the HDF5 file. This includes the date of creation and the size of the dataset.
If you include "limited_time_sampling": true in your simulation
hyperparameters in the configuration JSON, this will sample a random
time throughout the duration of the simulation, and only turn the
sampling on after that time (after which it is constant). The random
time at which sampling is activated is a proportion of the total
epidemic duration drawn from a \(\text{Uniform}(0.3,0.7)\) distribution. You can see an
example of this in the following figure.
from Bio import Phylo
import h5py
import pickle
import matplotlib.pyplot as plt
import numpy as np
HDF5_FILE = "./out/sim-psyduck/dataset-psyduck.hdf5"
RECORD_ID = "record_000003"
with h5py.File(HDF5_FILE, "r") as db_conn:
rec = db_conn[RECORD_ID]
demo_tree = pickle.loads(rec["input/tree"][...].tobytes())
tree_height = rec["input/tree_height"][()]
present = rec["input/present"][()]
epidemic_duration = rec["output/parameters/epidemic_duration"][()]
sampling_prop_change_times = rec["output/parameters/sampling_prop/change_times"][()]
mr_ca_time = present - tree_height
change_times_tree = sampling_prop_change_times - mr_ca_time
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
Phylo.draw(demo_tree, do_show=False, axes=ax1)
for change_time in change_times_tree:
ax1.axvline(change_time, color="tomato", linestyle="--", linewidth=1.0, alpha=0.8)
ax1.set_title("Sampling proportion change times")
ax1.set_xlim((-mr_ca_time) * 1.1, (epidemic_duration - mr_ca_time) * 1.1)
ax1.axvline(-mr_ca_time, color="red", linestyle="-", linewidth=1.2, alpha=0.9)
ax1.axvline(epidemic_duration - mr_ca_time, color="red", linestyle="-", linewidth=1.2, alpha=0.9)
x = np.linspace(0.0, 1.0, 200)
uniform_pdf = np.where((x >= 0.3) & (x <= 0.7), 1.0 / 0.6, 0.0)
ax2.plot(x, uniform_pdf, color="blue")
ax2.set_title("Uniform(0.3, 0.7) distribution")
ax2.set_xlabel("x")
ax2.set_ylabel("density")
if sampling_prop_change_times.size > 0 and epidemic_duration != 0:
normalized_change_time = sampling_prop_change_times[0] / epidemic_duration
ax2.axvline(
normalized_change_time, color="blue", linestyle="--", linewidth=1.2, alpha=0.9
)
fig.tight_layout()
fig.savefig("./out/sim-psyduck/plots/demo-tree-change-times.png")A \(\text{Uniform}(0.3,0.7)\) distribution is used for the proportion of the epidemic that occurs without sampling because this prevents the activation time falling too close to the start or the end of the epidemic which makes the simulations very time consuming to perform. While pushing the times away from the edges, it still allows for relatively extreme values of this time.
There is a script src/make-prior-latex-table.R that might be
helpful: it produces a table in LaTeX explaining the prior. This
produces a table like the one shown below.
Note that because Douglas et al (2021) uses years as their unit of time, the net becoming uninfectious rate needed to be scaled (to days which is what we use here).
If parameters change in a simulation, then the times at which this happens are selected uniformly at random between times 0.0 and the end of the epidemic duration in the simulation.
You may be able to find additional information about this in the
random_remaster_parameters() function.
import h5py
import matplotlib.pyplot as plt
import squarify
def wall_time_and_label(db, path):
label = str(int(path.split("_")[-1]))
wall_time = db[path].attrs["simulation_wall_time"].item()
return (label, wall_time)
with h5py.File("out/sim-charmander/dataset-charmander.hdf5", "r") as db:
times_and_labels = [wall_time_and_label(db, path) for path in db.keys()]
times_and_labels.sort(key=lambda x: x[1])
labels, times = zip(*times_and_labels)
plt.figure(figsize=(8, 7), dpi=96)
squarify.plot(sizes=times, color=len(times)*["#1b9e77"], pad=True)
plt.axis("off")
plt.title("Simulation Wall Times")
plt.savefig("out/sim-charmander/plots/walltimes.png")The configuration used to generate a dataset is stored as an attribute in the HDF5 file. You can recover a copy of the configuration with the following snippet of code.
import h5py
import json
db_conn = h5py.File("out/debugging/dataset-demo.h5py", "r")
config_str = db_conn.attrs["config_json"]
with open("recovered-config.json", "w") as f:
f.write(config_str)There is a requirements file to install the relevant python packages
python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirments.txtBEAST2 is used to simulate the data. If you don’t have BEAST2
installed, there is a script scr/setupbeast2.sh which will download
and install this for you. This script will install remaster. If you
don’t want to use the script, once you have BEAST2 installed, you will
need to install remaster through BEAUti.
pip install biopythonpip install biopython h5py lxml pandas plotnine




