Update I/O for parquet/etc. using pyarrow #190
sidneymau wants to merge 3 commits into rmjarvis:main
Conversation
---
A few more changes:

>>> cov = np.eye(2)
>>> with ParquetWriter("test.arrow") as w:
...     for i in range(3):
...         w.write_array(i * cov)
...
>>> with ParquetReader("test.arrow") as r:
...     for i in range(3):
...         r.read_array()
...
array([[0., 0.],
       [0., 0.]])
array([[1., 0.],
       [0., 1.]])
array([[2., 0.],
       [0., 2.]])

The other awkward aspect about this is that these aren't parquet files -- maybe it would be worth renaming to arrow.
---
Hi Sid, this is still marked as Draft. Did you want to do anything else on it before I take a look at it? If not, my quick high-level review is that I like the bulk of this change. The one thing I would prefer is that when reading parquet files, the code would check if pyarrow is available and use it if possible, but if not, fall back to the old pandas implementation. See the code in catalog.py where it decides which reader class to use for ascii files. I think we could do the same for parquet files, using either the existing ParquetReader (which might want to be renamed) or, when possible, the new ArrowReader.
---
Hi Mike, apologies, I had forgotten about this PR with everything else going on lately. If I recall correctly where this PR left off, there were some awkward bits about using Arrow to write its own tensor format for covariance files. While that is technically the correct way to do it in an "arrow-native" way, it seems unnecessary here, especially because I don't know if you can actually work with the native arrow tensors in python without converting them to numpy objects first. I think it all works ok; it just feels a bit too format-shifty. I'll mark it as ready to review so we can look at things in more detail.
Do you have a preferred method of checking this? I know there was a recent PEP regarding lazy imports that may impact this type of import checking, but I haven't looked too closely. (edit: it's PEP 810, and it isn't planned to be available until Python 3.15.)
---
For that kind of thing, I usually just try the import and catch the ImportError to do the other thing. cf. https://github.com/rmjarvis/TreeCorr/blob/releases/5.1/treecorr/catalog.py#L892 So this would look the same, a couple lines up in the file, in the file_type="PARQUET" option.
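That pattern might look roughly like the following sketch; the class names here are placeholders standing in for the real reader classes, not the actual catalog.py code:

```python
class ArrowReader:
    """Placeholder for the pyarrow-based reader."""

class PandasParquetReader:
    """Placeholder for the existing pandas-based reader."""

def choose_parquet_reader():
    # Prefer pyarrow when it is importable; otherwise fall back to the
    # pandas implementation, mirroring how the ascii readers are chosen.
    try:
        import pyarrow  # noqa: F401
        return ArrowReader
    except ImportError:
        return PandasParquetReader

reader_cls = choose_parquet_reader()
```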
---
Following up on the pandas vs. pyarrow dependency: pandas lists pyarrow as an optional dependency. Here is what the pandas documentation says: [...]
---
I'm not sure which ascii/csv function you were using from pandas, but here's what they have to say there: [...]
---
I'm using whatever is the default engine in each case; I don't specify it manually. This is the function that PandasReader uses to read the file: [...] Last time I benchmarked it (which admittedly is many years ago now), this was much faster than the numpy [...]
At the same time, I'll have to check the [...]
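As context for the default-engine point, here is a minimal sketch of reading whitespace-delimited ascii via pandas and handing back a numpy array; this is not the actual PandasReader code, and the kwargs are assumptions:

```python
import io
import numpy as np
import pandas as pd

data = io.StringIO("1.0 2.0\n3.0 4.0\n")
# With a plain separator (or the special-cased r"\s+"), pandas picks its
# fast C engine by default; no engine= needs to be specified.
df = pd.read_csv(data, sep=r"\s+", header=None)
arr = df.to_numpy()
print(arr.shape)  # (2, 2)
```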
I implemented a `Reader` that leverages pyarrow datasets. There are a few benefits to this:
- it supports `arrow`, `csv`, `json`, and `orc` in addition to `parquet` (I didn't test this but it should work in principle)
- `catalog.py` can leverage this

I mostly copied over the parquet reader tests for the arrow reader tests. Running `test_reader.py`, I get the following output: [...]

Comparing the arrow reader to the current parquet reader, the performance appears to be much better, though this is of course not a systematic comparison.
edit: the performance disparity is in part a result of caching behavior. Running the arrow reader before the parquet reader results in both taking ~0.06 seconds (which does suggest the arrow reader still performs much better on unseen data)