[SEDONA-756] feat: raster Python serde and with_bands() support#2956

Open: prantogg wants to merge 1 commit into apache:master from prantogg:pranav/feature/raster-python-serde

prantogg commented May 15, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

What changes were proposed in this PR?

Returning raster data from Python UDFs currently requires .tolist() + RS_MakeRaster, which forces Float64 promotion, creates 262K Python float objects per 512×512 tile, and loses all raster metadata (CRS, nodata, affine transform).
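To make the cost of the old hand-off concrete, here is a minimal sketch using only NumPy. The tile is a hypothetical all-zero uint8 band, not data from this PR, and the call to RS_MakeRaster itself is left out; the sketch only shows what the list conversion does to the pixel data.

```python
import sys
import numpy as np

# Hypothetical single-band uint8 tile standing in for a UDF result.
tile = np.zeros((512, 512), dtype=np.uint8)

# Old-style hand-off: promote to float64 and flatten to a Python list.
flat = tile.astype(np.float64).ravel().tolist()

object_count = len(flat)                          # one Python float per pixel
native_bytes = tile.nbytes                        # 1 byte per pixel in the native dtype
promoted_bytes = tile.astype(np.float64).nbytes   # 8 bytes per pixel after promotion

print(object_count, type(flat[0]).__name__)
print(native_bytes, promoted_bytes)
print(sys.getsizeof(flat[0]))  # per-object size of a Python float on this interpreter
```

The list holds one Python float object per pixel, which is where the 262K figure for a 512×512 band comes from, and the original uint8 dtype is no longer recoverable from the list alone.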

This PR adds:

  • raster_serde.serialize() — Writes InDbSedonaRaster to Sedona's binary format, byte-compatible with JVM Serde.deserialize(). Uses cache-and-replay for opaque Kryo blobs (categories, properties, colorModel).
  • InDbSedonaRaster.with_bands() — Creates a new raster with replaced pixel data (NumPy array) but preserved spatial metadata. Band count and dtype may differ from the source.
  • RasterType.serialize() — Delegates to raster_serde.serialize() instead of raising NotImplementedError.
  • DeepCopiedRenderedImage.reconcileColorModel() (JVM) — Fixes colorModel/sampleModel mismatches at deserialization time when Python UDFs change band count or dtype.
  • KryoUtil.skipUTF8String() and GridSampleDimensionSerializer.skip() — Utility methods for navigating Kryo streams without full deserialization.
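The with_bands() idea from the list above can be illustrated with a small stand-in class. This is a sketch, not Sedona's implementation: `TileSketch`, its fields, and the rule that width and height must match the source are all assumptions made for illustration, and the real InDbSedonaRaster signature may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

import numpy as np


@dataclass(frozen=True, eq=False)
class TileSketch:
    """Hypothetical stand-in for a raster; not Sedona's InDbSedonaRaster."""

    bands: np.ndarray  # shape (band_count, height, width)
    crs: str
    nodata: Optional[float]
    affine: Tuple[float, float, float, float, float, float]

    def with_bands(self, new_bands: np.ndarray) -> "TileSketch":
        """Return a copy with new pixel data and unchanged spatial metadata."""
        arr = np.asarray(new_bands)
        if arr.ndim == 2:
            # Treat a 2-D array as a single band.
            arr = arr[np.newaxis, ...]
        # Assumption: the grid size must stay the same so the affine still applies.
        if arr.ndim != 3 or arr.shape[1:] != self.bands.shape[1:]:
            raise ValueError("new bands must keep the source height and width")
        return replace(self, bands=arr)


# Usage: derive a single float32 band from a 4-band uint16 source.
src = TileSketch(
    bands=np.ones((4, 8, 8), dtype=np.uint16),
    crs="EPSG:4326",
    nodata=0.0,
    affine=(1.0, 0.0, 0.0, 0.0, -1.0, 0.0),
)
red = src.bands[0].astype(np.float32)
nir = src.bands[3].astype(np.float32)
ndvi = (nir - red) / (nir + red)
out = src.with_bands(ndvi)
print(out.bands.shape, out.bands.dtype, out.crs)
```

Returning a new object rather than mutating the source keeps the input raster usable after the UDF runs, and it makes the band count and dtype change explicit at one call site.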

Benchmarked on Apple M2 Pro, 4-band rasters, median of 50 iterations:

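| Tile size | Old .tolist() (ms) | New serialize() (ms) | Speedup | Old mem (KB) | New mem (KB) | Mem ratio |
|---|---|---|---|---|---|---|
| 64×64 | 0.16 | 0.04 | 3.6× | 384 | 66 | 5.8× |
| 256×256 | 2.56 | 0.11 | 23× | 6,144 | 1,026 | 6.0× |
| 512×512 | 11.63 | 0.66 | 18× | 24,576 | 4,098 | 6.0× |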
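The memory columns can be sanity-checked with simple arithmetic. The sketch below rests on assumptions not stated in the PR: 24 bytes per Python float object (typical for 64-bit CPython), 4 bytes per sample in the serialized form, and KB meaning 1024 bytes. Under those assumptions the reported figures line up, with roughly 2 KB of fixed overhead left over in the new format.

```python
PY_FLOAT_BYTES = 24  # assumption: size of one Python float on 64-bit CPython
SAMPLE_BYTES = 4     # assumption: 4-byte samples in the serialized raster
BANDS = 4            # from the benchmark description


def list_kib(side: int) -> float:
    """Memory held by the Python float objects alone, in KiB."""
    return side * side * BANDS * PY_FLOAT_BYTES / 1024


def packed_kib(side: int) -> float:
    """Memory of the raw samples alone, in KiB."""
    return side * side * BANDS * SAMPLE_BYTES / 1024


# (old KB, new KB) as reported in the table above.
reported = {64: (384, 66), 256: (6144, 1026), 512: (24576, 4098)}

for side, (old_kb, new_kb) in reported.items():
    overhead = new_kb - packed_kib(side)
    ratio = old_kb / new_kb
    print(
        f"{side}x{side}: list={list_kib(side):.0f} packed={packed_kib(side):.0f} "
        f"overhead={overhead:.0f} ratio={ratio:.1f}x"
    )
```

The ratio stays just under 6× because 24 bytes per float object against 4 bytes per sample is exactly 6×, and the fixed overhead matters less as tiles grow.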

How was this patch tested?

  • 8 with_bands() tests (band count changes, dtype changes, metadata survival)
  • 2 serialize round-trip tests
  • 1 JVM serde test (colorModel mismatch handling)
  • Passes all existing tests

Did this PR include necessary documentation updates?

  • Yes, updated the "Writing Python UDF" section in docs/tutorial/raster.md to show the new raster-to-raster UDF pattern using with_bands().

prantogg force-pushed the pranav/feature/raster-python-serde branch from ed31473 to 782a8e1 on May 16, 2026 00:11
prantogg changed the title from "feat: raster Python serde and with_bands() support" to "[SEDONA-756] feat: raster Python serde and with_bands() support" on May 16, 2026
prantogg force-pushed the pranav/feature/raster-python-serde branch 2 times, most recently from 3a9890f to 782a8e1, on May 16, 2026 01:33
Add Python-side serialize() for InDbSedonaRaster, enabling Python UDFs
to return raster objects directly instead of the lossy .tolist() +
RS_MakeRaster workaround. Rasters now round-trip as contiguous bytes
preserving native dtypes and all metadata (CRS, nodata, affine, etc.).

Add with_bands() to InDbSedonaRaster for replacing pixel data (NumPy
array) while preserving spatial metadata. Band count and dtype may
differ from the source raster.

Add reconcileColorModel() to DeepCopiedRenderedImage (JVM) to fix
colorModel/sampleModel mismatches at deserialization when Python UDFs
change band count or dtype.

Cherry-picked from wherobots/wherobots-compute@e08bde1da08 with
vectorized UDF wiring excluded.
prantogg force-pushed the pranav/feature/raster-python-serde branch from 782a8e1 to 288bfe7 on May 16, 2026 01:34
prantogg marked this pull request as ready for review on May 16, 2026 02:22
prantogg requested a review from jiayuasu as a code owner on May 16, 2026 02:22