Read and write Pulsar objects to HDF5 files by aarchiba · Pull Request #341 · nanograv/enterprise

aarchiba · 2023-02-15T14:56:56Z

This PR allows Enterprise to write Pulsar objects out to HDF5 files in a well-documented format, and to read them back into a new FilePulsar (name suggestions welcome) which should be a drop-in replacement for either PintPulsar or T2Pulsar. The format is flexible enough that downstream users can add their own information in the file (and have these extras included in the documentation); these files can be loaded without needing to understand these extras.

The HDF5 format includes compression for all large-ish entries (in particular the mostly-zero DMX derivatives); the example B1855 data set comes out to about 1.2 MB.
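To illustrate why compression pays off for mostly-zero blocks such as the DMX derivatives, here is a minimal sketch using `h5py`. The array is synthetic, and the `gzip` and `shuffle` settings are assumptions chosen for illustration; the PR text does not say which filters it actually uses.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
n_toas, n_dmx = 2000, 100

# Synthetic stand-in for DMX derivative columns: each column is nonzero
# only for the TOAs that fall in its epoch, so the array is mostly zeros.
dmx_block = np.zeros((n_toas, n_dmx))
epoch = np.arange(n_toas) * n_dmx // n_toas
dmx_block[np.arange(n_toas), epoch] = rng.normal(size=n_toas)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.hdf5")  # hypothetical file name

    # Write the array with compression enabled (filter choice is an assumption).
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "Design matrix", data=dmx_block, compression="gzip", shuffle=True
        )

    # Reopen the file and compare in-memory size with on-disk storage.
    with h5py.File(path, "r") as f:
        dset = f["Design matrix"]
        raw_bytes = dset.size * dset.dtype.itemsize
        stored_bytes = dset.id.get_storage_size()
        round_trip_ok = np.array_equal(dset[()], dmx_block)

print(f"raw: {raw_bytes} bytes, stored: {stored_bytes} bytes")
```

Compression is transparent to readers: the dataset reads back as an ordinary array, so code loading the file does not need to know which filter was applied.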

To do:

- Test likelihood computation or other non-trivial use of FilePulsar objects
- Add saving par and tim files for T2Pulsar
- Determine whether any additional data from T2Pulsar should be included
- Determine what to do with unrecognized entries (should they go into a dictionary or something so the user doesn't need to poke around in the HDF5 file themselves? what if they are huge? maybe we should accept open HDF5 files as well as filenames?)
- Test the file format machinery by using it to produce derivative files for the NANOGrav 15yr data set
- Determine whether there are additional things that should go in the file even if they aren't needed for Enterprise
- Write appropriate documentation

Note: this PR depends on #340; many of the apparent changes here are drawn from that PR, and this PR may well merge in any changes that PR needs.

The file format allows extensions for project-specific information. Here is what the current description file for the NANOGrav 15-year v1.1 data set looks like (it is Markdown; GitHub uses a weird flavour of Markdown that preserves line breaks, whereas normal viewers will reflow paragraphs):

NANOGrav 15-year data release derivative data

The NANOGrav project is releasing its 15-year data set;
this is described in an upcoming paper, but it includes
long-term timing data for 68 pulsars.

Pulsar timing begins with a set of pulse arrival times
and fits a model to those arrival times. The usual output
from this process is the best-fit model parameters and
their uncertainties, and the residuals - the difference in time or
phase between the predicted zero phase and the observed zero phase.

For some applications, for example searching for a
gravitational-wave background, it is vital to include not just
these residuals but their derivative with respect to each of the
fit parameters. This allows construction of a linearized version
of the timing model, which can often be analytically marginalized,
resulting in tremendous speedups. Other applications for such
linearized models include parameter searches in photon data.
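As a rough illustration of the linearization described above, the sketch below treats the residuals as a linear function of small parameter offsets, `r ≈ M @ delta + noise`, and solves for the offsets by weighted least squares. Everything in it is an assumption made for illustration: the design matrix `M` is a synthetic polynomial basis, the noise is white, and the variable names are hypothetical. It does not show the analytic marginalization itself, only the linear model that makes it possible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toas, n_params = 200, 3

# Synthetic design matrix: column j is d(residual)/d(parameter j) at each TOA.
t = np.linspace(-1.0, 1.0, n_toas)
M = np.column_stack([np.ones(n_toas), t, t**2])

# Synthetic TOA uncertainties (seconds) and parameter offsets.
sigma = np.full(n_toas, 1e-6)
delta_true = np.array([2e-6, -1e-6, 3e-6])

# Linearized model: residuals are the design matrix times the offsets, plus noise.
residuals = M @ delta_true + sigma * rng.standard_normal(n_toas)

# Weighted least squares: whiten by the uncertainties, then solve.
M_white = M / sigma[:, None]
r_white = residuals / sigma
delta_hat, *_ = np.linalg.lstsq(M_white, r_white, rcond=None)

# Parameter covariance under the white-noise assumption.
covariance = np.linalg.inv(M_white.T @ M_white)

# Residuals after removing the best-fit linear correction.
post_fit = residuals - M @ delta_hat
```

Because the model is linear in the offsets, the solution is a single matrix operation rather than an iterative nonlinear fit, which is where the speedup comes from.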

The purpose of this file is to provide the derivatives needed
to construct this linear model, plus all other supporting data.
It is stored in HDF5, a widely portable binary format that is
extensible enough to permit project-specific information to be
stored alongside standard values.

This text should accompany a collection of such files in
plain-text form, and it should also be included in all such
files as a dataset called "README".

This data

Timing results as of: 2022-03-14 21:56:13 +0000

Git hash: 78afc7978e267ae9d11ab5daf57e6438a56c528b

Generated: 2023-02-28 10:10:43

Generated by: Anne Archibald <Anne.Archibald@nanograv.org>

File contents

format_name (attribute, optional, constant value='derivative_file')
The name of this particular HDF5 format.
format_version (attribute, optional, constant value='0.6.0')
Version number indicating the compatibility of this file with
other readers of this format.
Name (dataset)
Pulsar name.
RAJ (dataset, units=rad)
Right ascension in the Julian system. In radians.
DECJ (dataset, units=rad)
Declination in the Julian system. In radians.
DM (dataset, units=pc/cm3)
Best-fit dispersion measure, in pc/cm^3.
Estimated distance (dataset, units=kpc)
Estimated distance and uncertainty in kiloparsecs.
TOA integer part (dataset, optional, units=day)
This is the exact TOA, converted to TDB (barycentric dynamical time)
but not corrected for travel time in any way. In order to retain
nanosecond accuracy, this is split into two arrays: the integer
and the fractional parts of the MJD. This dataset contains the
integer part.
TOA fractional part (dataset, optional, units=day)
This is the exact TOA, converted to TDB (barycentric dynamical time)
but not corrected for travel time in any way. In order to retain
nanosecond accuracy, this is split into two arrays: the integer
and the fractional parts of the MJD. This dataset contains the
fractional part.
TOAs in seconds (dataset, units=s)
Pulse time-of-arrival data, in seconds (Modified Julian Date multiplied by 86400). These
values are barycentered, that is, converted to times
that the pulses would have reached the solar system barycenter.
(This depends on the pulsar sky position.) Note
that this array has only about microsecond resolution
and so is insufficient to do precision timing.
Raw TOAs in seconds (dataset, units=s)
TOAs at the observatory; this is corrected for observatory
clock drift but not converted to any other time system or
adjusted to when the pulses would have reached the solar
system barycenter. This has also been converted to seconds,
that is, the Modified Julian Date has been multiplied by 86400.
This array too has only about microsecond precision.
TOA uncertainties (dataset, units=s)
Uncertainties on pulse time-of-arrival data (and thus on
residuals), in seconds.
Residuals (dataset, units=s)
Residuals (model minus data, in seconds).
Radio frequencies (dataset, units=MHz)
Radio frequency at which each TOA is observed, in MHz. This
frequency is corrected for Doppler shift due to the
observatory's motion around the Sun.
Telescope names (dataset)
The name of the telescope at which each TOA was observed.
These names are PINT- (or TEMPO2-)style telescope names (for
example arecibo).
Fit parameters (dataset)
Fitted parameters.
Design matrix (dataset)
Design matrix. This is an array that is (number of TOAs) by
(number of fit parameters). Each column is the derivative of
the residual (in seconds) with respect to the corresponding
fit parameter. This dataset has an attribute labels that
indicates the labels of the design matrix entries (which will
be identical to the fit parameters) and units giving the units
of the design matrix entries. These units are stored in Astropy's
"generic" string format for units, which is based on that used in
FITS files.
Set parameters (dataset)
Parameters of the timing model that were fixed during fitting.
Not all of these even have numeric values.
Par file (dataset, optional)
A .par file describing the timing model, as a string.
This can be quite long if the model has many DMX parameters.
The value is stored as an array of UTF-8 byte strings, one
per line.
Tim file (dataset, optional)
A .tim file recording the full TOA information. This is
in the form of an array of strings (UTF-8 encoded), one per
line. The file is in TEMPO2 format, so will normally contain
more lines than there are TOAs.
Pulsar sky position (dataset)
Unit vector pointing to the pulsar's sky position, in equatorial
coordinates.
Pulsar sky position as a function of time (dataset)
Unit vector pointing to the pulsar's sky position, in equatorial
coordinates, as a function of time (three values per TOA).
Sun positions (dataset, units=ls)
Sun positions (and possibly velocities) relative to
the solar system barycenter, in light-seconds. This array
will be (number of TOAs) by 6. If the Sun velocities are
unavailable they will be set to zero.
Planet positions (dataset, optional, units=ls)
Planet positions (and possibly velocities) relative to
the solar system barycenter, in light-seconds. This array
will be (number of TOAs) by 9 by 6. The planets are in order
outward from the Sun, including Pluto. If not all planet
positions or velocities are available, the unknown entries will
contain NaNs. PINT generally computes only positions and only
for the Earth, Jupiter, Saturn, Uranus, and Neptune.
DMX (dataset)
DMX information. This describes a time-variable dispersion
measure to the pulsar using a piecewise-constant model.
Each piece covers a specified range of TOA times and specifies
a delta-DM that should be added to the pulsar's overall DM value
within the corresponding time interval. This will be recorded in
the HDF5 file as a group, with a sub-group for each DMX piece;
the relevant values are recorded as attributes of this
sub-group.
Flags (dataset)
Flags associated with TOAs. The tempo2 format allows a flexible
list of flags to be associated with each TOA; these often record
details like the observing frontend and backend. There is a
list of flags recommended by the International Pulsar Timing
Array. This entry is an HDF5 group, which contains an HDF5
dataset for each flag that occurs in the file; the dataset
contains UTF-8-encoded string values for that flag for each TOA.
yaml (attribute, optional)
Name of configuration file (in yaml format) used to
generate this data.
git_hash (attribute, constant value='78afc7978e267ae9d11ab5daf57e6438a56c528b')
Hash selecting specific version of the git repository
(including configurations and par files) used to generate
the data.
git_date (attribute, constant value='2022-03-14 21:56:13 +0000')
Last modification date of the git repository (including
configurations and par files) used to generate the data.
generated_date (attribute, constant value='2023-02-28 10:10:43 ')
Date this file was generated.
generated_by (attribute, constant value='Anne Archibald <Anne.Archibald@nanograv.org>')
Person who generated this file (if not automatic).
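The precision remarks in the TOA entries above can be checked with a short NumPy sketch. It measures the smallest step a 64-bit float can represent near a typical MJD, near the same epoch expressed in seconds, and near a fractional day on its own. The example MJD is arbitrary and the variable names are hypothetical.

```python
import numpy as np

SECONDS_PER_DAY = 86400.0

mjd_int = 59000.0        # integer part of an example MJD
mjd_frac = 0.123456789   # fractional part of the same MJD

# Smallest representable float64 step near each value, in nanoseconds.
step_single_mjd_ns = np.spacing(mjd_int + mjd_frac) * SECONDS_PER_DAY * 1e9
step_seconds_ns = np.spacing((mjd_int + mjd_frac) * SECONDS_PER_DAY) * 1e9
step_frac_ns = np.spacing(mjd_frac) * SECONDS_PER_DAY * 1e9

print(f"MJD in one float64:      {step_single_mjd_ns:.1f} ns")
print(f"seconds in one float64:  {step_seconds_ns:.1f} ns")
print(f"fractional day alone:    {step_frac_ns:.6f} ns")
```

A single float64 resolves an MJD, or the same epoch in seconds, only to a sizeable fraction of a microsecond, while the fractional day stored separately is resolved far below a nanosecond. This is why the exact TOA is split into integer and fractional arrays.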

Referencing

If you do use this data for something, please reference
both the NANOGrav 15-year data release paper and the DOI
for this data set.


aarchiba · 2023-02-23T16:44:28Z

I'm not quite sure what documentation is expected for Enterprise code.

paulthebaker · 2023-02-23T21:22:16Z

> I'm not quite sure what documentation is expected for Enterprise code.

Docstrings for all new things is the best way to do it. But there is a ton of existing code that doesn't yet have docstrings...

Certainly, any function/method that is user facing should have a docstring. For something that is more for internal uses and is pretty clear from the code, I wouldn't fret too much about it.

Something like write_dict_to_hdf5 probably doesn't need a docstring, but if you wanted to add one I wouldn't stop you.

vhaasteren · 2023-04-23T06:16:31Z

@aarchiba, with an HDF5 file format, it seems pretty straightforward to also allow creation of mock pulsar objects. The reason why you created the HDF5 pulsar class is that we sometimes need independence from PINT or Tempo2. This is even more so with simulations.

The other day I needed to generate an array of 10k pulsars. With tempo2 or PINT that would take up too much memory and would take ages. So I wrote this MockPulsar Enterprise class that is super efficient and still allows all of the many Enterprise models that I needed to test to be run: most Enterprise functions only require things defined in BasePulsar.

The HDF5 pulsar format seems like the best place to put this kind of functionality, and then it can be saved in a proper file format. Do you see any hiccups right away?

vhaasteren · 2023-11-06T12:23:43Z

The code in this PR has been converted to a separate package by @AaronDJohnson and myself, which can be found here. There will be a new Enterprise PR that needs to be merged in order for that package to work. Until that is ready, the branch can be found on my repo.

I am closing this PR

Anne Archibald added 30 commits February 10, 2023 15:58

Make test suite work without tempo2

29cc262

Tidy code

1eabe47

Separate T2-model tests so others can be run

3302266

Parametrize tests so they run on PINT and tempo2

ea0b698

Make code analysis tools happier

8ca41be

Arrange for no-tempo test run

85aa6d2

Try to work around absence of tempo2 for no-tempo

a23d08a

Undo attempt to avoid installing libstempo

a66f7d0

Try just uninstallibg libstempo

c097248

Add -y to pip uninstall

dc24e64

Add tests to cover whole patch

d62f136

Tidy and test PINT-interfacing code

9bec1d8

Fix wrong input test

16edb95

Note about over-specific test

c309d35

Less specific test

61cf782

Includr no_pint test

25874aa

Try to make no_pint run pass

a5f0ec3

Handle missing and old PINT

ce45bcf

Lower required coverage percentage for local tests

8658f17

The CI will still require high coverage percentages, but it is able to combine multiple runs (tempo2-only, pint-only, both) to cover everything.

Fix timing_package defaulting

d9bdc32

Skip more tests if PINT not available

393d12a

Better message when arguments missing

8bb1294

Tidy argument processing

3411b35

Improve error handling in Pulsar

b774ad6

Ensure all PINT tests are conditional on PINT

97a9c6e

Delete leftover file

f630468

Skip pulsar tests if no PINT

189a32e

Ensure .pkl file is removed

177b3c5

Framework for storing things in HDF5 files

5854c0d

Tidy code a little

ac2f957

Anne Archibald added 18 commits February 16, 2023 08:34

Ensure hdf5 gets installed for testing

8bc412b

Support format_version and constant entries

4f84110

Make derivative_file more flexible

c85b43d

Implement format versioning

8669a55

Save large strings in datasets

817eee5

Handle tim files in a more obvious way

a654443

Bump format version

e3bbe36

Improve strings and attach labels to Mmat

cffd918

Tidy code

60518e1

HDF5 files keep insertion order now

0b6beb8

Attach unit information; use datasets more

a3f1b48

Use repr for constant values

660484d

Fix designmatrix for tempo2

9dd4b8b

Change black version to py37 - we test that

7fed57a

fix 3.7-incompatible syntax

81a4a8d

Read design matrix units correctly

5f0d8d2

Get likelihood tests working

7b47ea4

Fix tests in no-pint case

c8631f7

aarchiba changed the title ~~WIP: Read and write Pulsar objects to HDF5 files~~ Read and write Pulsar objects to HDF5 files Feb 22, 2023

Anne Archibald added 2 commits February 23, 2023 16:34

Allow tempo2 pulsars to keep around par and tim

f09f268

test par/tim preservation

aa94140

Anne Archibald added 4 commits February 24, 2023 16:02

Documentation.

b4e4ca1

Add descriptions to datasets

a38da30

Ensure FilePulsar can be written out too

47d0c1c

Fix up tests

26b53b8

vhaasteren closed this Nov 6, 2023

aarchiba wants to merge 63 commits into nanograv:master from aarchiba:derivative_file
