
Read and write Pulsar objects to HDF5 files #341

Closed
aarchiba wants to merge 63 commits into nanograv:master from aarchiba:derivative_file


@aarchiba
Contributor

@aarchiba aarchiba commented Feb 15, 2023

This PR allows Enterprise to write Pulsar objects out to HDF5 files in a well-documented format, and to read them back into a new FilePulsar (name suggestions welcome) which should be a drop-in replacement for either PintPulsar or T2Pulsar. The format is flexible enough that downstream users can add their own information in the file (and have these extras included in the documentation); these files can be loaded without needing to understand these extras.

The HDF5 format includes compression for all large-ish entries (in particular the mostly-zero DMX derivatives); the example B1855 data set comes out to about 1.2 MB.
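As a rough illustration of why compression helps with mostly-zero arrays, here is a minimal sketch using h5py and NumPy. It is not the code in this PR: the array sizes, the choice of the gzip and shuffle filters, and the in-memory file are all assumptions made for the example; only the dataset name "Design matrix" and the attribute `format_name` are taken from the format description below.

```python
import h5py
import numpy as np

rng = np.random.default_rng(0)
n_toas, n_dmx = 2000, 50  # hypothetical sizes, not the B1855 values

# Mostly-zero block: each TOA has a nonzero derivative in exactly one DMX bin.
dmx_block = np.zeros((n_toas, n_dmx))
bins = rng.integers(0, n_dmx, size=n_toas)
dmx_block[np.arange(n_toas), bins] = rng.normal(size=n_toas)

# In-memory HDF5 file (nothing is written to disk) so the sketch leaves no files behind.
with h5py.File("sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f.attrs["format_name"] = "derivative_file"
    dset = f.create_dataset(
        "Design matrix",
        data=dmx_block,
        compression="gzip",  # assumption: the PR may use a different filter
        shuffle=True,
    )
    f.flush()  # make sure compressed chunks are allocated before measuring
    stored_bytes = dset.id.get_storage_size()
    recovered = dset[...]

raw_bytes = dmx_block.nbytes
print(f"raw: {raw_bytes} bytes, stored: {stored_bytes} bytes")
```

The compression is lossless, so `recovered` is identical to the original array; readers do not need to know that a filter was applied.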

To do:

  • Test likelihood computation or other non-trivial use of FilePulsar objects
  • Add saving par and tim files for T2Pulsar
  • Determine whether any additional data from T2Pulsar should be included
  • Determine what to do with unrecognized entries (should they go into a dictionary or something so the user doesn't need to poke around in the HDF5 file themselves? what if they are huge? maybe we should accept open HDF5 files as well as filenames?)
  • Test the file format machinery by using it to produce derivative files for the NANOGrav 15yr data set
  • Determine whether there are additional things that should go in the file even if they aren't needed for Enterprise
  • Write appropriate documentation

Note: this PR depends on #340; many of the apparent changes here are drawn from that PR, and this PR may well merge in any changes that PR needs.

The file format allows extensions for project-specific information. Here is what the current description file for the NANOGrav 15-year v1.1 data set looks like (it is Markdown; GitHub uses a weird flavour of Markdown that preserves line breaks, whereas normal viewers will reflow paragraphs):

NANOGrav 15-year data release derivative data

The NANOGrav project is releasing its 15-year data set;
this is described in an upcoming paper, but it includes
long-term timing data for 68 pulsars.

Pulsar timing begins with a set of pulse arrival times
and fits a model to those arrival times. The usual output
from this process is the best-fit model parameters and
their uncertainties, and the residuals - the difference in time or
phase between the predicted zero phase and the observed zero phase.

For some applications, for example searching for a
gravitational-wave background, it is vital to include not just
these residuals but their derivative with respect to each of the
fit parameters. This allows construction of a linearized version
of the timing model, which can often be analytically marginalized,
resulting in tremendous speedups. Other applications for such
linearized models include parameter searches in photon data.
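A minimal NumPy sketch of that linearization, under stated assumptions: white Gaussian noise with known per-TOA uncertainties, flat priors on the parameter offsets, and the convention (matching the design matrix described below) that each column is the derivative of the residual with respect to a fit parameter, so r(eps) ≈ r0 + M eps. The toy design matrix and numbers are invented for illustration; with a real file, `r0`, `M`, and `sigma` would come from the "Residuals", "Design matrix", and "TOA uncertainties" datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
n_toas = 200
t = np.linspace(-1.0, 1.0, n_toas)

# Hypothetical design matrix: d(residual)/d(parameter) for an offset,
# a linear term, and a quadratic term.
M = np.column_stack([np.ones(n_toas), t, t**2])
sigma = np.full(n_toas, 1e-6)              # TOA uncertainties, in seconds
true_eps = np.array([2e-6, -1e-6, 3e-6])   # parameter errors baked into r0
r0 = M @ true_eps + sigma * rng.normal(size=n_toas)

# Whiten by the uncertainties so that N^-1 becomes the identity.
Mw = M / sigma[:, None]
rw = r0 / sigma

# Best-fit offsets minimize |rw + Mw eps|^2 in the linearized model.
eps_hat, *_ = np.linalg.lstsq(Mw, -rw, rcond=None)
post_fit = r0 + M @ eps_hat
chi2_post = np.sum((post_fit / sigma) ** 2)

# Analytic marginalization over eps leaves the quadratic form
# r0^T [N^-1 - N^-1 M (M^T N^-1 M)^-1 M^T N^-1] r0 in the exponent.
chi2_marg = rw @ rw - rw @ Mw @ np.linalg.solve(Mw.T @ Mw, Mw.T @ rw)
```

The two chi-squared values agree: marginalizing over the linear parameters projects the residuals onto the space orthogonal to the design-matrix columns, which is why no nonlinear refit is needed at each likelihood evaluation.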

The purpose of this file is to provide the derivatives needed
to construct this linear model, plus all other supporting data.
It is stored in HDF5, a widely portable binary format that is
extensible enough to permit project-specific information to be
stored alongside standard values.

This text should accompany a collection of such files in
plain-text form, and it should also be included in all such
files as a dataset called "README".

This data

Timing results as of: 2022-03-14 21:56:13 +0000

Git hash: 78afc7978e267ae9d11ab5daf57e6438a56c528b

Generated: 2023-02-28 10:10:43

Generated by: Anne Archibald <Anne.Archibald@nanograv.org>

File contents

  • format_name (attribute, optional, constant value='derivative_file')
    The name of this particular HDF5 format.
  • format_version (attribute, optional, constant value='0.6.0')
    Version number indicating the compatibility of this file with
    other readers of this format.
  • Name (dataset)
    Pulsar name.
  • RAJ (dataset, units=rad)
    Right ascension in the Julian system. In radians.
  • DECJ (dataset, units=rad)
    Declination in the Julian system. In radians.
  • DM (dataset, units=pc/cm3)
    Best-fit dispersion measure, in pc/cm^3.
  • Estimated distance (dataset, units=kpc)
    Estimated distance and uncertainty in kiloparsecs.
  • TOA integer part (dataset, optional, units=day)
    This is the exact TOA, converted to TDB (barycentric dynamical time)
    but not corrected for travel time in any way. In order to retain
    nanosecond accuracy, this is split into two arrays: the integer
    and the fractional parts of the MJD. This dataset contains the
    integer part.
  • TOA fractional part (dataset, optional, units=day)
    This is the exact TOA, converted to TDB (barycentric dynamical time)
    but not corrected for travel time in any way. In order to retain
    nanosecond accuracy, this is split into two arrays: the integer
    and the fractional parts of the MJD. This dataset contains the
    fractional part.
  • TOAs in seconds (dataset, units=s)
    Pulse time-of-arrival data, in seconds (that is, the Modified
    Julian Date multiplied by 86400). These values are barycentered,
    that is, converted to the times at which the pulses would have
    reached the solar system barycenter. (This depends on the pulsar
    sky position.) Note that this array has only about microsecond
    resolution and so is insufficient to do precision timing.
  • Raw TOAs in seconds (dataset, units=s)
    TOAs at the observatory; this is corrected for observatory
    clock drift but not converted to any other time system or
    adjusted to when the pulses would have reached the solar
    system barycenter. This has also been converted to seconds,
    that is, the Modified Julian Date has been multiplied by 86400.
    This array too has only about microsecond precision.
  • TOA uncertainties (dataset, units=s)
    Uncertainties on pulse time-of-arrival data (and thus on
    residuals), in seconds.
  • Residuals (dataset, units=s)
    Residuals (model minus data, in seconds).
  • Radio frequencies (dataset, units=MHz)
    Radio frequency at which each TOA is observed, in MHz. This
    frequency is corrected for Doppler shift due to the
    observatory's motion around the Sun.
  • Telescope names (dataset)
    The name of the telescope at which each TOA was observed.
    These names are PINT- (or TEMPO2-)style telescope names (for
    example arecibo).
  • Fit parameters (dataset)
    Fitted parameters.
  • Design matrix (dataset)
    Design matrix. This is an array that is (number of TOAs) by
    (number of fit parameters). Each column is the derivative of
    the residual (in seconds) with respect to the corresponding
    fit parameter. This dataset has an attribute labels that
    indicates the labels of the design matrix entries (which will
    be identical to the fit parameters) and units giving the units
    of the design matrix entries. These units are stored in Astropy's
    "generic" string format for units, which is based on that used in
    FITS files.
  • Set parameters (dataset)
    Parameters of the timing model that were fixed during fitting.
    Not all of these even have numeric values.
  • Par file (dataset, optional)
    A .par file describing the timing model, as a string.
    This can be quite long if the model has many DMX parameters.
    The value is stored as an array of UTF-8 byte strings, one
    per line.
  • Tim file (dataset, optional)
    A .tim file recording the full TOA information. This is
    in the form of an array of strings (UTF-8 encoded), one per
    line. The file is in TEMPO2 format, so will normally contain
    more lines than there are TOAs.
  • Pulsar sky position (dataset)
    Unit vector pointing to the pulsar's sky position, in equatorial
    coordinates.
  • Pulsar sky position as a function of time (dataset)
    Unit vector pointing to the pulsar's sky position, in equatorial
    coordinates, as a function of time (three values per TOA).
  • Sun positions (dataset, units=ls)
    Sun positions (and possibly velocities) relative to
    the solar system barycenter, in light-seconds. This array
    will be (number of TOAs) by 6. If the Sun velocities are
    unavailable they will be set to zero.
  • Planet positions (dataset, optional, units=ls)
    Planet positions (and possibly velocities) relative to
    the solar system barycenter, in light-seconds. This array
    will be (number of TOAs) by 9 by 6. The planets are in order
    outward from the Sun, including Pluto. If not all planet
    positions or velocities are available, the unknown entries will
    contain NaNs. PINT generally computes only positions and only
    for the Earth, Jupiter, Saturn, Uranus, and Neptune.
  • DMX (dataset)
    DMX information. This describes a time-variable dispersion
    measure to the pulsar using a piecewise-constant model.
    Each piece covers a specified range of TOA times and specifies
    a delta-DM that should be added to the pulsar's overall DM value
    within the corresponding time interval. This will be recorded in
    the HDF5 file as a group, with a sub-group for each DMX piece;
    the relevant values are recorded as attributes of this
    sub-group.
  • Flags (dataset)
    Flags associated with TOAs. The tempo2 format allows a flexible
    list of flags to be associated with each TOA; these often record
    details like the observing frontend and backend. There is a
    list of flags recommended by the International Pulsar Timing
    Array. This entry is an HDF5 group, which contains an HDF5
    dataset for each flag that occurs in the file; the dataset
    contains UTF-8-encoded string values for that flag for each TOA.
  • yaml (attribute, optional)
    Name of configuration file (in yaml format) used to
    generate this data.
  • git_hash (attribute, constant value='78afc7978e267ae9d11ab5daf57e6438a56c528b')
    Hash selecting specific version of the git repository
    (including configurations and par files) used to generate
    the data.
  • git_date (attribute, constant value='2022-03-14 21:56:13 +0000')
    Last modification date of the git repository (including
    configurations and par files) used to generate the data.
  • generated_date (attribute, constant value='2023-02-28 10:10:43 ')
    Date this file was generated.
  • generated_by (attribute, constant value='Anne Archibald <Anne.Archibald@nanograv.org>')
    Person who generated this file (if not automatic).
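
The resolution figures quoted in the TOA entries above can be checked with a short standard-library sketch. The MJD value is hypothetical, and the sketch assumes the datasets hold ordinary 64-bit floats, which the description does not state explicitly.

```python
import math

SECONDS_PER_DAY = 86400.0

# Hypothetical TOA, split the way the "TOA integer part" and
# "TOA fractional part" datasets are described.
mjd_int, mjd_frac = 59000.0, 0.123456789012345

# Spacing between adjacent float64 values, converted to seconds.
res_single_mjd = math.ulp(mjd_int + mjd_frac) * SECONDS_PER_DAY
res_seconds = math.ulp((mjd_int + mjd_frac) * SECONDS_PER_DAY)
res_frac = math.ulp(mjd_frac) * SECONDS_PER_DAY

print(f"single float64 MJD:      {res_single_mjd:.2e} s")
print(f"single float64 seconds:  {res_seconds:.2e} s")
print(f"fractional part alone:   {res_frac:.2e} s")
```

A single float near MJD 59000 (or the same time in seconds) resolves a bit under a microsecond, while the fractional part stored on its own resolves far better than a nanosecond. `math.ulp` requires Python 3.9 or later.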

Referencing

If you do use this data for something, please reference
both the NANOGrav 15-year data release paper and the DOI
for this data set.

@aarchiba aarchiba changed the title from "WIP: Read and write Pulsar objects to HDF5 files" to "Read and write Pulsar objects to HDF5 files" Feb 22, 2023
@aarchiba
Contributor Author

I'm not quite sure what documentation is expected for Enterprise code.

@paulthebaker
Member

> I'm not quite sure what documentation is expected for Enterprise code.

Docstrings for all new things are the best way to do it. But there is a ton of existing code that doesn't yet have docstrings...

Certainly, any function/method that is user facing should have a docstring. For something that is more for internal uses and is pretty clear from the code, I wouldn't fret too much about it.

Something like write_dict_to_hdf5 probably doesn't need a docstring, but if you wanted to add one I wouldn't stop you.

@vhaasteren
Member

@aarchiba, with an HDF5 file format, it seems pretty straightforward to also allow creation of mock pulsar objects. The reason why you created the HDF5 pulsar class is that we sometimes need independence from PINT or Tempo2. This is even more so with simulations.

The other day I needed to generate an array of 10k pulsars. With tempo2 or PINT that would take up too much memory and would take ages. So I wrote this MockPulsar Enterprise class that is super efficient and still allows the many Enterprise models that I needed to test to be run: most Enterprise functions only require things defined in BasePulsar.

The HDF5 pulsar format seems like the best place to put this kind of functionality, and then it can be saved in a proper file format. Do you see any hiccups right away?

@vhaasteren
Member

The code in this PR has been converted to a separate package by @AaronDJohnson and myself, which can be found here. There will be a new Enterprise PR that needs to be merged in order for that package to work. Until that is ready, the branch can be found on my repo.

I am closing this PR.

@vhaasteren vhaasteren closed this Nov 6, 2023