Read and write Pulsar objects to HDF5 files #341
The CI will still require high coverage percentages, but it is able to combine multiple runs (tempo2-only, pint-only, both) to cover everything.
I'm not quite sure what documentation is expected for Enterprise code.
Docstrings for all new things are the best way to do it. But there is a ton of existing code that doesn't yet have docstrings... Certainly, any function/method that is user-facing should have a docstring. For something that is more for internal use and is pretty clear from the code, I wouldn't fret too much about it. Something like
@aarchiba, with an HDF5 file format, it seems pretty straightforward to also allow creation of mock pulsar objects. The reason why you created the HDF5 pulsar class is that we sometimes need independence from PINT or Tempo2. This is even more so with simulations. The other day I needed to generate an array of 10k pulsars. With tempo2 or PINT that would take up too much memory and would take ages. So I wrote this MockPulsar Enterprise class that is super efficient and still lets me run the many Enterprise models that I needed to test: most Enterprise functions only require things defined in BasePulsar. The HDF5 pulsar format seems like the best place to put this kind of functionality, and then it can be saved in a proper file format. Do you see any hiccups right away?
The code in this PR has been converted to a separate package by @AaronDJohnson and myself, which can be found here. There will be a new Enterprise PR that needs to be merged in order for that package to work. Until that is ready, the branch can be found on my repo. I am closing this PR.
This PR allows Enterprise to write Pulsar objects out to HDF5 files in a well-documented format, and to read them back into a new FilePulsar (name suggestions welcome) which should be a drop-in replacement for either PintPulsar or T2Pulsar. The format is flexible enough that downstream users can add their own information in the file (and have these extras included in the documentation); these files can be loaded without needing to understand these extras.
The HDF5 format includes compression for all large-ish entries (in particular the mostly-zero DMX derivatives); the example B1855 data set comes out to about 1.2 MB.
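To illustrate why mostly-zero entries compress so well, here is a minimal sketch using only the Python standard library. It assumes DEFLATE-style compression (the algorithm behind HDF5's gzip filter); the PR text does not say which filter is actually used, and the column contents and sizes below are hypothetical.

```python
import struct
import zlib

n_toas = 10_000

# Hypothetical DMX-style derivative column: nonzero only for the TOAs
# that fall inside a single DMX window, zero everywhere else.
column = [0.0] * n_toas
for i in range(4_000, 4_050):
    column[i] = 1.0e-3 * (i - 3_999)

# Pack as little-endian float64 (8 bytes per value), then compress.
raw = struct.pack(f"<{n_toas}d", *column)
packed = zlib.compress(raw, 4)

# Compression is lossless: the round trip restores every value exactly.
restored = list(struct.unpack(f"<{n_toas}d", zlib.decompress(packed)))

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

The long runs of zero bytes collapse to almost nothing, so the compressed size is dominated by the few nonzero values.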
To do:
Note: this PR depends on #340; many of the apparent changes here are drawn from that, and this PR may well merge in any changes that PR needs.
The file format allows extensions for project-specific information. Here is what the current description file for the NANOGrav 15-year v1.1 data set looks like (it is Markdown; GitHub uses a weird flavour of Markdown that preserves line breaks, whereas normal viewers will reflow paragraphs):
NANOGrav 15-year data release derivative data
The NANOGrav project is releasing its 15-year data set;
this is described in an upcoming paper, but it includes
long-term timing data for 68 pulsars.
Pulsar timing begins with a set of pulse arrival times
and fits a model to those arrival times. The usual output
from this process is the best-fit model parameters and
their uncertainties, and the residuals - the difference in time or
phase between the predicted zero phase and the observed zero phase.
For some applications, for example searching for a
gravitational-wave background, it is vital to include not just
these residuals but their derivative with respect to each of the
fit parameters. This allows construction of a linearized version
of the timing model, which can often be analytically marginalized,
resulting in tremendous speedups. Other applications for such
linearized models include parameter searches in photon data.
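The linearization and marginalization mentioned above can be written out in the standard textbook form. This is a sketch, not something taken from these files: the symbols ($\delta t$ for the residuals, $M$ for the matrix of derivatives, $\epsilon$ for the offsets of the $m$ fit parameters from their best-fit values $p_0$, $C$ for the covariance of the $n$ residuals) and the flat prior on $\epsilon$ are assumptions introduced here.

```latex
% Linearized timing model around the best-fit parameters p_0
\delta t(\epsilon) \approx \delta t + M \epsilon,
\qquad
M_{ij} = \left. \frac{\partial\, \delta t_i}{\partial p_j} \right|_{p_0}

% Gaussian likelihood for the residuals given the offsets
p(\delta t \mid \epsilon) =
  \frac{\exp\!\left[ -\tfrac{1}{2} (\delta t + M\epsilon)^{T} C^{-1} (\delta t + M\epsilon) \right]}
       {\sqrt{(2\pi)^{n} \det C}}

% Marginalizing over \epsilon with a flat prior is a Gaussian integral
\int p(\delta t \mid \epsilon)\, d^{m}\epsilon =
  \frac{\exp\!\left[ -\tfrac{1}{2}\, \delta t^{T}
        \left( C^{-1} - C^{-1} M (M^{T} C^{-1} M)^{-1} M^{T} C^{-1} \right) \delta t \right]}
       {\sqrt{(2\pi)^{n-m} \det C \, \det(M^{T} C^{-1} M)}}
```

Because the integral has a closed form, the timing-model parameters never need to be sampled explicitly, which is where the speedup comes from.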
The purpose of this file is to provide the derivatives needed
to construct this linear model, plus all other supporting data.
It is stored in HDF5, a widely portable binary format that is
extensible enough to permit project-specific information to be
stored alongside standard values.
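As a rough illustration of how such derivatives can be used, the sketch below fits a set of residuals as a linear combination of design-matrix columns using unweighted least squares. The two-column design matrix, the residual values, and the helper names are all hypothetical; a real analysis would weight by the TOA uncertainties and use a linear-algebra library rather than a hand-written 2x2 solve.

```python
def normal_equations(design, residuals):
    """Build M^T M and M^T r for a design matrix given as a list of rows."""
    n_par = len(design[0])
    mtm = [[sum(row[a] * row[b] for row in design) for b in range(n_par)]
           for a in range(n_par)]
    mtr = [sum(row[a] * r for row, r in zip(design, residuals))
           for a in range(n_par)]
    return mtm, mtr


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


# Hypothetical design matrix: one row per TOA, one column per fit parameter
# (here a constant offset and a linear trend).
design = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
# Hypothetical residuals in seconds, built from offsets (0.5, -0.25).
residuals = [0.5, 0.25, 0.0, -0.25]

mtm, mtr = normal_equations(design, residuals)
offsets = solve_2x2(mtm, mtr)

# Residuals left over once the fitted linear model has been subtracted.
post_fit = [r - sum(m * o for m, o in zip(row, offsets))
            for row, r in zip(design, residuals)]

print(offsets, post_fit)
```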
This text should accompany a collection of such files in
plain-text form, and it should also be included in all such
files as a dataset called "README".
This data
Timing results as of: 2022-03-14 21:56:13 +0000
Git hash: 78afc7978e267ae9d11ab5daf57e6438a56c528b
Generated: 2023-02-28 10:10:43
Generated by: Anne Archibald <Anne.Archibald@nanograv.org>
File contents
format_name (attribute, optional, constant value='derivative_file'): The name of this particular HDF5 format.
format_version (attribute, optional, constant value='0.6.0'): Version number indicating the compatibility of this file with
other readers of this format.
Name (dataset): Pulsar name.
RAJ (dataset, units=rad): Right ascension in the Julian system. In radians.
DECJ (dataset, units=rad): Declination in the Julian system. In radians.
DM (dataset, units=pc/cm3): Best-fit dispersion measure, in pc/cm^3.
Estimated distance (dataset, units=kpc): Estimated distance and uncertainty in kiloparsecs.
TOA integer part (dataset, optional, units=day): This is the exact TOA, converted to TDB (barycentric dynamical time)
but not corrected for travel time in any way. In order to retain
nanosecond accuracy, this is split into two arrays: the integer
and the fractional parts of the MJD. This dataset contains the
integer part.
TOA fractional part (dataset, optional, units=day): This is the exact TOA, converted to TDB (barycentric dynamical time)
but not corrected for travel time in any way. In order to retain
nanosecond accuracy, this is split into two arrays: the integer
and the fractional parts of the MJD. This dataset contains the
fractional part.
TOAs in seconds (dataset, units=s): Pulse time-of-arrival data, as Modified Julian Dates converted to seconds. These
values are barycentered, that is, converted to times
that the pulses would have reached the solar system barycenter.
(This depends on the pulsar sky position.) Note
that this array has only about microsecond resolution
and so is insufficient to do precision timing.
Raw TOAs in seconds (dataset, units=s): TOAs at the observatory; this is corrected for observatory
clock drift but not converted to any other time system or
adjusted to when the pulses would have reached the solar
system barycenter. This has also been converted to seconds,
that is, the Modified Julian Date has been multiplied by 86400.
This array too has only about microsecond precision.
TOA uncertainties (dataset, units=s): Uncertainties on pulse time-of-arrival data (and thus on
residuals), in seconds.
Residuals (dataset, units=s): Residuals (model minus data, in seconds).
Radio frequencies (dataset, units=MHz): Radio frequency at which each TOA is observed, in MHz. This
frequency is corrected for Doppler shift due to the
observatory's motion around the Sun.
Telescope names (dataset): The name of the telescope at which each TOA was observed.
These names are PINT- (or TEMPO2-)style telescope names (for
example `arecibo`).
Fit parameters (dataset): Fitted parameters.
Design matrix (dataset): Design matrix. This is an array that is (number of TOAs) by
(number of fit parameters). Each column is the derivative of
the residual (in seconds) with respect to the corresponding
fit parameter. This dataset has an attribute `labels` that
indicates the labels of the design matrix entries (which will
be identical to the fit parameters) and `units` giving the units
of the design matrix entries. These units are stored in Astropy's
"generic" string format for units, which is based on that used in
FITS files.
Set parameters (dataset): Parameters of the timing model that were fixed during fitting.
Not all of these even have numeric values.
Par file (dataset, optional): A `.par` file describing the timing model, as a string.
This can be quite long if the model has many DMX parameters.
The value is stored as an array of UTF-8 byte strings, one
per line.
Tim file (dataset, optional): A `.tim` file recording the full TOA information. This is
in the form of an array of strings (UTF-8 encoded), one per
line. The file is in TEMPO2 format, so will normally contain
more lines than there are TOAs.
Pulsar sky position (dataset): Unit vector pointing to the pulsar's sky position, in equatorial
coordinates.
Pulsar sky position as a function of time (dataset): Unit vector pointing to the pulsar's sky position, in equatorial
coordinates, as a function of time (three values per TOA).
Sun positions (dataset, units=ls): Sun positions (and possibly velocities) relative to
the solar system barycenter, in light-seconds. This array
will be (number of TOAs) by 6. If the Sun velocities are
unavailable they will be set to zero.
Planet positions (dataset, optional, units=ls): Planet positions (and possibly velocities) relative to
the solar system barycenter, in light-seconds. This array
will be (number of TOAs) by 9 by 6. The planets are in order
outward from the Sun, including Pluto. If not all planet
positions or velocities are available, the unknown entries will
contain NaNs. PINT generally computes only positions and only
for the Earth, Jupiter, Saturn, Uranus, and Neptune.
DMX (dataset): DMX information. This describes a time-variable dispersion
measure to the pulsar using a piecewise-constant model.
Each piece covers a specified range of TOA times and specifies
a delta-DM that should be added to the pulsar's overall DM value
within the corresponding time interval. This will be recorded in
the HDF5 file as a group, with a sub-group for each DMX piece;
the relevant values are recorded as attributes of this
sub-group.
Flags (dataset): Flags associated with TOAs. The tempo2 format allows a flexible
list of flags to be associated with each TOA; these often record
details like the observing frontend and backend. There is a
list of flags recommended by the International Pulsar Timing
Array. This entry is an HDF5 group, which contains an HDF5
dataset for each flag that occurs in the file; the dataset
contains UTF-8-encoded string values for that flag for each TOA.
yaml (attribute, optional): Name of configuration file (in yaml format) used to
generate this data.
git_hash (attribute, constant value='78afc7978e267ae9d11ab5daf57e6438a56c528b'): Hash selecting specific version of the git repository
(including configurations and par files) used to generate
the data.
git_date (attribute, constant value='2022-03-14 21:56:13 +0000'): Last modification date of the git repository (including
configurations and par files) used to generate the data.
generated_date (attribute, constant value='2023-02-28 10:10:43 '): Date this file was generated.
generated_by (attribute, constant value='Anne Archibald <Anne.Archibald@nanograv.org>'): Person who generated this file (if not automatic).
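The split of the exact TOA into integer and fractional parts, and the microsecond-level limits noted for the arrays in seconds, both follow from the resolution of 64-bit floats. The following is a minimal sketch using only the Python standard library; the TOA value is hypothetical, and it assumes all of these values are stored as float64.

```python
import math
from fractions import Fraction

SECONDS_PER_DAY = 86400

# Hypothetical TOA, held exactly as a rational number of days (MJD).
exact_mjd = Fraction(59000) + Fraction(123456789012345678, 10**18)

# Spacing between adjacent float64 values near MJD 59000, in seconds.
single_step_s = math.ulp(59000.0) * SECONDS_PER_DAY

# Spacing between adjacent float64 values when the MJD is held in seconds.
seconds_step_s = math.ulp(59000.0 * SECONDS_PER_DAY)

# Split storage: integer and fractional parts as separate float64 values.
whole_days = math.floor(exact_mjd)
int_part = float(whole_days)
frac_part = float(exact_mjd - whole_days)

# Fraction(float) is exact, so these errors are the true rounding errors.
split_error_s = float(
    abs(Fraction(int_part) + Fraction(frac_part) - exact_mjd) * SECONDS_PER_DAY
)
single_error_s = float(
    abs(Fraction(float(exact_mjd)) - exact_mjd) * SECONDS_PER_DAY
)

print(f"float64 MJD step:        {single_step_s:.2e} s")
print(f"float64 seconds step:    {seconds_step_s:.2e} s")
print(f"single-value error:      {single_error_s:.2e} s")
print(f"split-value error:       {split_error_s:.2e} s")
```

Near MJD 59000 a single float64 can only step in increments of roughly 0.6 microseconds (about 1 microsecond when the value is held in seconds), while the fractional part on its own is resolved far below a nanosecond.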
Referencing
If you do use this data for something, please reference
both the NANOGrav 15-year data release paper and the DOI
for this data set.