Reproducibility is the hallmark of the scientific method.
Modern science has become so complex that many science projects rely on multiple software packages to work in unison, resulting in networks of data products along the analyses. Versioning and managing these data products are essential in making modern data- and computation-intensive science reproducible.
Motivated by the Event Horizon Telescope (EHT)'s
observational data calibration pipelines and theory data analyses
tools, hallmark is a lightweight package designed to version
control and manage data products in a complex workflow.
It provides a simple abstraction and a uniform Application Programming
Interface (API) on top of different backend technologies such as
POSIX file system,
object storage,
globus,
iRODS,
stream,
etc.
By using hallmark with other packages such as yukon and
banyan in Project Laniakea, researchers can utilize
computing infrastructures in a global scale to accelerate their
science.
When performing large scale parameter surveys and constructing
simulation libraries, it is common to encode parameter values in the
file paths.
Example include Ma+0.94_i70/sed_Rh160.h5.
hallmark provides a subclassed pandas DataFrame, called
ParaFrame, to decode file paths back to proper parameters, and put
the result into a pandas DataFrame.
ParaFrame uses python parse to parse the file paths.
Because parse is the opposite of format, this means the format string
used to generate the surveys and libraries in the first place can be
reused.
In addition, ParaFrame has a nice interface to perform filter, which
makes parameter selection much easier than pure pandas.
Examples of using ParaFrame can be found in the Jupyter Notebook demos/ParaFrame.ipynb.