
Update I/O for parquet/etc. using pyarrow#190

Open
sidneymau wants to merge 3 commits into rmjarvis:main from sidneymau:main

Conversation


@sidneymau sidneymau commented Oct 29, 2025

I implemented a Reader that leverages pyarrow datasets. There are a few benefits to this:

  • More direct handling of parquet files than with pandas
  • Potential support for files in arrow, csv, json, orc in addition to parquet (I didn't test this but it should work in principle)
  • Support for "files" that are actually directories full of many parquet files, which you want to treat as one large source of data, including directory partitioning (e.g., if there is a directory for each healpixel, the directory structure can be parsed as a column for performing selections). Note that TreeCorr does not otherwise support this at the moment, so some changes would be needed in catalog.py to leverage it

I mostly copied over the parquet reader tests for the arrow reader tests. Running test_reader.py, I get the following output:

time for test_fits_reader = 0.23
time for test_hdf_reader = 0.03
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
time for test_parquet_reader = 0.27
time for test_ascii_reader = 0.01
time for test_pandas_reader = 0.05
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
time for test_arrow_reader = 0.03

Comparing the arrow reader to the current parquet reader, the performance appears to be much better, though this is of course not a systematic comparison.

edit: the performance disparity is in part a result of caching behavior. Running the arrow reader before the parquet reader results in both taking ~0.06 seconds (which does suggest the arrow reader still performs much better on unseen data).

@sidneymau sidneymau changed the base branch from releases/5.1 to main October 29, 2025 19:01
@sidneymau sidneymau marked this pull request as draft October 29, 2025 23:54
Author

sidneymau commented Oct 30, 2025

A few more changes:

  • update ParquetReader to use pyarrow as well
  • implement ParquetWriter
  • add support for reading/writing pyarrow tensors (e.g., for covariance matrices) -- this is a little awkward at present but it does work ok; e.g.,
>>> cov = np.eye(2)
>>> with ParquetWriter("test.arrow") as w:
...     for i in range(3):
...         w.write_array(i * cov)
...
>>> with ParquetReader("test.arrow") as r:
...     for i in range(3):
...         r.read_array()
...
array([[0., 0.],
       [0., 0.]])
array([[1., 0.],
       [0., 1.]])
array([[2., 0.],
       [0., 2.]])

The other awkward aspect of this is that these aren't actually parquet files -- maybe it would be worth renaming them to arrow.

@sidneymau sidneymau changed the title Implement Arrow Reader Update I/O for parquet/etc. using pyarrow Oct 30, 2025
Owner

rmjarvis commented Jan 27, 2026

Hi Sid, this is still marked as Draft. Did you want to do anything else on it before I take a look at it?

If not, my quick high-level review is that I like the bulk of this change. The one thing I would prefer is that when reading parquet files, the code would check if pyarrow is available and use it if possible, but if not, fall back to the old pandas implementation. See the code in catalog.py where it decides which reader class to use for ascii files. I think we could do the same for parquet files, using either the existing ParquetReader (which might want to be renamed) or, when possible, the new ArrowReader.

@sidneymau
Author

Hi Mike, apologies, I had forgotten about this PR with everything else going on lately. If I recall correctly where this PR stood, there were some awkward bits about using Arrow's own tensor format for the covariance files. While that is technically the correct way to do it in an "arrow-native" way, it seems unnecessary here, especially because I don't know if you can actually work with native arrow tensors in Python without converting them to numpy objects first. I think it all works ok; it just feels a bit too format-shifty.

I'll mark it as ready to review so we can look at things in more detail.

@sidneymau sidneymau marked this pull request as ready for review January 27, 2026 17:13
Author

sidneymau commented Jan 27, 2026

The one thing I would prefer is that when reading parquet files, the code would check if pyarrow is available and use it if possible, but if not, fall back to the old pandas implementation.

Do you have a preferred method of checking this? I know there was a recent PEP regarding lazy imports that may affect this type of import checking, but I haven't looked too closely.

edit: it's PEP 810, which isn't planned to be available until Python 3.15

Owner

rmjarvis commented Jan 27, 2026

For that kind of thing, I usually just try the import and catch the ImportError to do the other thing. cf. https://github.com/rmjarvis/TreeCorr/blob/releases/5.1/treecorr/catalog.py#L892

So this would look much the same a couple of lines up in the file, in the file_type="PARQUET" branch.
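A minimal sketch of that pattern (ArrowReader and PandasParquetReader are hypothetical stand-ins here, not TreeCorr's actual class names):

```python
# Dummy placeholder classes so the sketch is self-contained.
class ArrowReader: pass
class PandasParquetReader: pass

def parquet_reader_class():
    """Prefer the pyarrow-backed reader; fall back to pandas otherwise."""
    try:
        import pyarrow  # noqa: F401  (only checking availability)
        return ArrowReader
    except ImportError:
        return PandasParquetReader

print(parquet_reader_class().__name__)
```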

Author

sidneymau commented Feb 4, 2026

Following up on the pandas vs. pyarrow dependency: pandas lists pyarrow as an optional dependency for reading parquet files (see https://pandas.pydata.org/docs/getting_started/install.html#other-data-sources). The conda feedstock seems to include this under "run_constraints", but I'm not sure what that means (link).

Here is what the pandas documentation says:

engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'

Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.

Author

sidneymau commented Feb 4, 2026

I'm not sure which ascii/csv function you were using from pandas, but here's what they have to say there:

engine : {'c', 'python', 'pyarrow'}, optional

Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine. Some features of the "pyarrow" engine are unsupported or may not work correctly.

Owner

rmjarvis commented Feb 4, 2026

I'm using whatever is the default engine in each case. I don't specify it manually.

This is the function that PandasReader uses to read the file:

df = pandas.read_csv(self.file, comment=self.comment_marker,
                     sep=self.sep, usecols=icols, header=None,
                     skiprows=skiprows, nrows=nrows)

Last time I benchmarked it (which admittedly is many years ago now), this was much faster than the numpy genfromtxt function.

Author

I'm using whatever is the default engine in each case. I don't specify it manually.

In this case, it probably is using pyarrow, but there's a nonzero chance that someone is working in a more bespoke environment that uses fastparquet instead... I think it would probably be ok to assume pyarrow is already being used in most cases. Maybe pandas could be deprecated with a warning in favor of pyarrow for now, and then removed in a future update.

At the same time, I'll have to check the pyarrow csv parser again to see whether it supports the options that the pandas parser adds on top, in case any of those are relied on.
