
Update I/O for parquet/etc. using pyarrow#190

Open
sidneymau wants to merge 3 commits into rmjarvis:main from sidneymau:main

Conversation


@sidneymau sidneymau commented Oct 29, 2025

I implemented a Reader that leverages pyarrow datasets. There are a few benefits to this:

  • More direct handling of parquet files than with pandas
  • Potential support for files in arrow, csv, json, orc in addition to parquet (I didn't test this but it should work in principle)
  • Support for "files" that are actually directories full of many parquet files, which you want to treat as one large source of data, including directory partitioning (e.g., if there is a directory for each healpixel, the directory structure can be parsed as a column for performing selections). Note that TreeCorr does not otherwise support this at the moment, so some changes would be needed in catalog.py to leverage it

I mostly copied over the parquet reader tests for the arrow reader tests. Running test_reader.py, I get the following output:

time for test_fits_reader = 0.23
time for test_hdf_reader = 0.03
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
time for test_parquet_reader = 0.27
time for test_ascii_reader = 0.01
time for test_pandas_reader = 0.05
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
time for test_arrow_reader = 0.03

Comparing the arrow reader to the current parquet reader, the performance appears to be much better, though this is of course not a systematic comparison.

edit: the performance disparity is in part a result of caching behavior. Running the arrow reader before the parquet reader results in both taking ~0.06 seconds (which does suggest the arrow reader still performs much better on unseen data).

@sidneymau sidneymau changed the base branch from releases/5.1 to main October 29, 2025 19:01
@sidneymau sidneymau marked this pull request as draft October 29, 2025 23:54
Author

sidneymau commented Oct 30, 2025

A few more changes:

  • update ParquetReader to use pyarrow as well
  • implement ParquetWriter
  • add support for reading/writing pyarrow tensors (e.g., for covariance matrices) -- this is a little awkward at present but it does work ok; e.g.,
>>> cov = np.eye(2)
>>> with ParquetWriter("test.arrow") as w:
...     for i in range(3):
...         w.write_array(i * cov)
...
>>> with ParquetReader("test.arrow") as r:
...     for i in range(3):
...         r.read_array()
...
array([[0., 0.],
       [0., 0.]])
array([[1., 0.],
       [0., 1.]])
array([[2., 0.],
       [0., 2.]])

The other awkward aspect of this is that these aren't actually parquet files -- maybe it would be worth renaming them to arrow.

@sidneymau sidneymau changed the title Implement Arrow Reader Update I/O for parquet/etc. using pyarrow Oct 30, 2025
Owner

rmjarvis commented Jan 27, 2026

Hi Sid, this is still marked as Draft. Did you want to do anything else on it before I take a look at it?

If not, my quick high-level review is that I like the bulk of this change. The one thing I would prefer is that when reading parquet files, the code would check if pyarrow is available and use it if possible, but if not, fall back to the old pandas implementation. See the code in catalog.py where it decides which reader class to use for ascii files. I think we could do the same for parquet files, using either the existing ParquetReader (which might want to be renamed) or, when possible, the new ArrowReader.

@sidneymau
Author

Hi Mike, apologies, I had forgotten about this PR with everything else going on lately. If I recall correctly where this PR stood, there were some awkward bits about using Arrow's own tensor format for the covariance files. While that is technically the correct way to do it in an "arrow-native" way, it seems unnecessary here, especially because I don't know if you can actually work with native arrow tensors in Python without converting them to numpy objects first. I think it all works ok; it just feels a bit too format-shifty.

I'll mark it as ready to review so we can look at things in more detail.

@sidneymau sidneymau marked this pull request as ready for review January 27, 2026 17:13
Author

sidneymau commented Jan 27, 2026

The one thing I would prefer is that when reading parquet files, the code would check if pyarrow is available and use it if possible, but if not, fall back to the old pandas implementation.

Do you have a preferred method of checking this? I know there was a recent PEP regarding lazy imports that may affect this type of import checking, but I haven't looked too closely.

edit: it's PEP 810, which isn't planned to be available until Python 3.15

Owner

rmjarvis commented Jan 27, 2026

For that kind of thing, I usually just try the import and catch the ImportError to do the other thing. cf. https://github.com/rmjarvis/TreeCorr/blob/releases/5.1/treecorr/catalog.py#L892

So this would look much the same a couple of lines up in the file, in the file_type="PARQUET" branch.
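A minimal sketch of that pattern (ArrowReader and PandasParquetReader are hypothetical stand-ins here, not TreeCorr's actual class names):

```python
# Dummy placeholder classes so the sketch is self-contained.
class ArrowReader: pass
class PandasParquetReader: pass

def parquet_reader_class():
    """Prefer the pyarrow-backed reader; fall back to pandas otherwise."""
    try:
        import pyarrow  # noqa: F401  (only checking availability)
        return ArrowReader
    except ImportError:
        return PandasParquetReader

print(parquet_reader_class().__name__)
```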

Author

sidneymau commented Feb 4, 2026

Following up on the pandas vs. pyarrow dependency: pandas lists pyarrow as an optional dependency for reading parquet files (see https://pandas.pydata.org/docs/getting_started/install.html#other-data-sources). The conda feedstock seems to include this under "run_constraints", but I'm not sure what that means (link).

Here is what the pandas documentation says:

engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'

Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.

Author

sidneymau commented Feb 4, 2026

I'm not sure which ascii/csv function you were using from pandas, but here's what they have to say there:

engine : {'c', 'python', 'pyarrow'}, optional

Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine. Some features of the "pyarrow" engine are unsupported or may not work correctly.

Owner

rmjarvis commented Feb 4, 2026

I'm using whatever is the default engine in each case. I don't specify it manually.

This is the function that PandasReader uses to read the file:

df = pandas.read_csv(self.file, comment=self.comment_marker,
                     sep=self.sep, usecols=icols, header=None,
                     skiprows=skiprows, nrows=nrows)

Last time I benchmarked it (which admittedly is many years ago now), this was much faster than the numpy genfromtxt function.

Author

I'm using whatever is the default engine in each case. I don't specify it manually.

In this case, it probably is using pyarrow, but there's a nonzero chance that someone is working in a more bespoke environment that uses fastparquet instead... I think it would probably be ok to assume pyarrow is already being used in most cases. Maybe pandas could be deprecated with a warning in favor of pyarrow for now, and then removed in a future update.

At the same time, I'll have to check the pyarrow csv parser again to see whether it supports the options that the pandas parser adds on top, in case any of those are relied on.
