WIP: Add rechunking example#197
Conversation
Great to see you trying this! First time someone has tried to use these two
libraries together!
I don't think you should import the rechunk primitive from Cubed. I think
instead you should open the kerchunked dataset as an Xarray dataset using
cubed-xarray, then call Xarray's chunk method with the desired chunks. This
might smooth out low-level chunk type considerations for you, and if it
doesn't that's a bug.
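A hedged sketch of that suggested workflow. The function name, paths, and the refs argument are illustrative, and it assumes xarray, kerchunk, and cubed-xarray are all installed, so the import is deferred until the function is called:

```python
def rechunk_via_xarray(refs_path, target_chunks, out_store):
    """Sketch of the high-level route: open the kerchunk references as
    an Xarray dataset backed by Cubed arrays, then let Xarray's own
    .chunk() handle the rechunk spec instead of calling Cubed's
    low-level rechunk primitive directly.
    """
    import xarray as xr  # deferred: needs xarray + kerchunk + cubed-xarray

    # chunked_array_type="cubed" asks Xarray to build Cubed arrays
    # (registered by cubed-xarray) rather than the default Dask arrays.
    ds = xr.open_dataset(
        refs_path,
        engine="kerchunk",
        chunks={},
        chunked_array_type="cubed",
    )
    # Xarray normalizes the chunk mapping, smoothing over the
    # per-dimension chunk-size tuples that the low-level API rejects.
    ds = ds.chunk(target_chunks)
    ds.to_zarr(out_store)  # Cubed plans and executes the rechunk
    return ds
```

On stacks without the kerchunk backend, the same reference files can be opened through the zarr engine with a reference filesystem; the shape of the workflow is the same.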
On Fri, Jul 19, 2024, 12:24 PM Timothy Hodson wrote:
This PR adds an example script demonstrating how to rechunk a VirtualiZarr
dataset with Cubed.
However, this is still a WIP. I'm creating the PR to elicit feedback about
what changes might be necessary in order for the script to run as intended.
@TomNicholas <https://github.com/TomNicholas> and @norlandrhagen
<https://github.com/norlandrhagen> might have some thoughts.
After creating the combined virtual dataset, I specify the source chunking
before passing it off to Cubed for rechunking:
source_chunks = {'Time': 1, 'south_north': 250, 'west_east': 320}
combined_chunked = combined_ds.chunk(chunks=source_chunks)
combined_chunked.chunks
returns
Frozen({'Time': (1, 1, 1, 1), 'south_north': (250,), 'west_east': (320,), 'interp_levels': (9,), 'soil_layers_stag': (4,)})
The virtual dataset contains four files, indicated by 'Time': (1, 1, 1, 1).
Then I attempt to rechunk:
from cubed.primitive.rechunk import rechunk

target_chunks = {'Time': 5, 'south_north': 25, 'west_east': 32}
rechunk(
    combined_chunked['TMAX'],  # requires a shape attr, so can't pass the full Dataset
    target_chunks=target_chunks,
    source_array_name='virtual',
    int_array_name='temp',
    allowed_mem=2000,
    reserved_mem=1000,
    target_store="test.zarr",
    # temp_store="s3://cubed-thodson-temp",
)
which errors with
TypeError: can't multiply sequence by non-int of type 'tuple'
Apparently, Cubed won't tolerate the Time chunk tuple 'Time': (1, 1, 1, 1).
Is there a simple way to convert it to 'Time': (1,)? Alternatively, I could
prepare a PR to Cubed that would set the memory constraint around the
largest chunk size when chunks are variable.
kerchunk
I also tested this workflow with kerchunk, but I ran into a bug while
following the Pythia cookbook example
<https://projectpythia.org/kerchunk-cookbook/notebooks/foundations/02_kerchunk_multi_file.html>:
/home/runner/miniconda3/envs/kerchunk-cookbook/lib/python3.10/site-packages/kerchunk/combine.py:370: UserWarning: Concatenated coordinate 'Time' contains less than expected number of values across the datasets: [0]
warnings.warn(
Commit Summary
- 5b3a12f Add rechunking example
File Changes (4 files)
- A examples/rechunking/Dockerfile_virtualizarr (59)
- A examples/rechunking/README.md (15)
- A examples/rechunking/cubed-rechunk.py (81)
- A examples/rechunking/requirements.txt (9)
Yes, that seems to work, but I'm still working through several errors when I write out to Zarr. I'll report more in a day or two.
Great - very curious to see the details. I think what you're doing here should live in the Cubed repo though - once you have the kerchunk reference files on disk, VirtualiZarr is out of the picture; all of the rechunking is about using Cubed. I do think this use case would make an important example to have in the Cubed docs, though, as it's basically showing how the original …
Sounds great. Happy for this to be added as a Cubed example.
Closing this PR and opening one on Cubed: cubed-dev/cubed#520.