[GH-2939] Box2D spatial join: ST_BoxIntersects / ST_BoxContains #2953
jiayuasu wants to merge 2 commits
Wire `ST_BoxIntersects(box_a, box_b)` and `ST_BoxContains(box_a, box_b)` through the existing spatial join planner (broadcast index, range, and distance-capable join executors). No new partitioner, no new index, no new refine path.

Approach: at the executor boundary, dispatch on the shape column's `dataType`. If it's `Box2DUDT`, read the four doubles out of the serialized InternalRow and materialise the implied rectangular Polygon via `Constructors.polygonFromEnvelope`. The materialised Polygon flows through `SpatialRDD<T extends Geometry>`, the partitioner sample step, the R-tree `IndexBuilder`, and the refine evaluator with no API change. JTS already short-circuits axis-aligned rectangle-rectangle predicates through `RectangleIntersects` / `RectangleContains` (gated on `Polygon.isRectangle()`), so the refine step pays only the four-double envelope comparison.

Changes:
- `TraitJoinQueryBase.shapeToGeometry`: centralised dispatch used by `toSpatialRDD` and `toExpandedEnvelopeRDD`. `Box2DUDT` → Polygon; `GeometryUDT` → existing `GeometrySerializer.deserialize`.
- `BroadcastIndexJoinExec.createStreamShapes`: same dispatch at the stream-side eval sites for both the distance and non-distance paths (raster path unchanged).
- `JoinQueryDetector`: recognise `ST_BoxIntersects` (→ INTERSECTS) and `ST_BoxContains` (→ COVERS). `COVERS` is intentional: `ST_BoxContains` is closed-interval, matching JTS `covers`; JTS `contains` would reject edge-touching pairs.
- `OptimizableJoinCondition.isOptimizablePredicate`: accept both predicates so the condition is forwarded to the detector.
- Storage and the `Box2D` class hierarchy are untouched.

Tests: new `Box2DJoinSuite` covering broadcast index join, range join, the symmetric form of `ST_BoxIntersects`, the COVERS semantics for `ST_BoxContains` (incl. edge-touching closed-interval case), and an equivalence check that the Box2D path yields the same row pairs as `ST_Intersects` over the boxes materialised as polygons.
Closes apache#2939.
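For readers who want the closed-interval semantics spelled out, here is a minimal sketch in plain Scala with no Sedona or JTS dependency. The `Box` case class and the `BoxPredicates` / `BoxPredicatesDemo` objects are hypothetical names used only for illustration; they are not Sedona's `Box2D` type or its `Predicates` implementation, and the sketch assumes both predicates treat every bound as inclusive.

```scala
// Hypothetical stand-in for a Box2D value: four doubles, closed on all sides.
final case class Box(xmin: Double, ymin: Double, xmax: Double, ymax: Double)

object BoxPredicates {
  // Closed-interval intersection: boxes sharing only an edge or a corner still intersect.
  def intersects(a: Box, b: Box): Boolean =
    a.xmin <= b.xmax && b.xmin <= a.xmax &&
      a.ymin <= b.ymax && b.ymin <= a.ymax

  // Closed-interval containment: b may touch a's boundary and still count as contained.
  def contains(a: Box, b: Box): Boolean =
    a.xmin <= b.xmin && b.xmax <= a.xmax &&
      a.ymin <= b.ymin && b.ymax <= a.ymax
}

object BoxPredicatesDemo {
  val outer: Box = Box(0, 0, 10, 10)
  val touchingInside: Box = Box(0, 0, 5, 5) // shares two edges with outer
  val neighbour: Box = Box(10, 0, 20, 10)   // shares only the x = 10 edge with outer
  val disjoint: Box = Box(11, 0, 20, 10)    // starts strictly to the right of outer
}
```

With these fixtures, `contains(outer, touchingInside)` and `intersects(outer, neighbour)` both hold, while `intersects(outer, disjoint)` does not. Each predicate is four comparisons on doubles, which is the cost the rectangle fast path is meant to approach.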
Pull request overview
Extends Sedona Spark SQL’s spatial-join planning/execution pipeline to recognize Box2D predicates (ST_BoxIntersects, ST_BoxContains) as join conditions and route them through existing range joins and broadcast index joins (partitioner + R-tree + refine evaluator), rather than falling back to non-optimized join strategies.
Changes:
- Adds `ST_BoxIntersects` / `ST_BoxContains` to join-condition detection and optimization eligibility.
- Introduces Box2D→JTS-geometry materialization (as rectangles) so existing join machinery can operate on Box2D keys.
- Adds a dedicated Scala test suite covering broadcast and non-broadcast joins for Box2D predicates.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| spark/common/src/test/scala/org/apache/sedona/sql/Box2DJoinSuite.scala | New join-planner/executor coverage for Box2D predicates across broadcast index and range join plans. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala | Centralizes shape materialization and adds Box2D handling for join inputs. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/OptimizableJoinCondition.scala | Marks Box2D predicates as optimizable join predicates. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala | Detects Box2D predicates as spatial join conditions and maps them to SpatialPredicate. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/BroadcastIndexJoinExec.scala | Uses the centralized shape materialization logic for stream-side shape extraction (non-raster paths). |
Comments suppressed due to low confidence (1)
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala:131
`toExpandedEnvelopeRDD` now materializes shapes via `shapeToGeometry`, which may return null for NULL inputs; passing a null `shape` into `geometryToExpandedEnvelope` will throw. The previous implementation deserialized with `GeometrySerializer`, which handles null bytes. Add null handling here (skip rows or substitute an empty geometry) to avoid crashing distance joins on NULL shapes.
```scala
spatialRdd.setRawSpatialRDD(
  rdd
    .map { x =>
      val shape = TraitJoinQueryBase.shapeToGeometry(shapeExpression, x)
      val distance = boundRadius.eval(x).asInstanceOf[Double]
      val expandedEnvelope =
        JoinedGeometry.geometryToExpandedEnvelope(shape, distance, isGeography)
      expandedEnvelope.setUserData(x.copy)
```
```diff
 rdd
   .map { x =>
-    val shape =
-      GeometrySerializer.deserialize(shapeExpression.eval(x).asInstanceOf[Array[Byte]])
+    val shape = TraitJoinQueryBase.shapeToGeometry(shapeExpression, x)
     shape.setUserData(x.copy)
     shape
```
Fixed in 9144fb8 — the helper now keeps returning null on null inputs, and toSpatialRDD/toExpandedEnvelopeRDD wrap it in a shapeToGeometryOrEmpty that substitutes an empty GeometryCollection. This matches GeometrySerializer.deserialize(null)'s legacy behaviour. The KnnJoinSuite null-geom regression is gone in local runs.
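The null-fallback pattern described in this reply can be sketched in isolation. This is a hedged illustration only: `NullSafeShapes`, `Shape`, `EmptyShape`, `Rect`, `decode`, and `decodeOrEmpty` are hypothetical names, not the PR's `shapeToGeometry` / `shapeToGeometryOrEmpty` code, and the input is modelled as a plain array of doubles rather than a Spark `InternalRow`.

```scala
object NullSafeShapes {
  // Hypothetical stand-ins for decoded shapes.
  sealed trait Shape
  case object EmptyShape extends Shape
  final case class Rect(xmin: Double, ymin: Double, xmax: Double, ymax: Double) extends Shape

  // Inner helper: keeps the "null in, null out" contract.
  def decode(raw: Array[Double]): Shape =
    if (raw == null) null else Rect(raw(0), raw(1), raw(2), raw(3))

  // Wrapper: never returns null; a null input becomes an empty shape instead,
  // so downstream calls such as setUserData cannot hit a null reference.
  def decodeOrEmpty(raw: Array[Double]): Shape =
    Option(decode(raw)).getOrElse(EmptyShape)
}
```

Keeping the null-returning helper separate from the wrapper lets call sites that want to skip null rows do so, while the RDD-building paths get a value that is always safe to dereference.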
```scala
Constructors
  .polygonFromEnvelope(
    box.getDouble(0),
    box.getDouble(1),
    box.getDouble(2),
    box.getDouble(3))
```
Fixed in 9144fb8 — the Box2D branch now validates xmin <= xmax and ymin <= ymax and throws the same IllegalArgumentException raised by Predicates.boxIntersects/boxContains. Kept polygonFromEnvelope as the materialiser; JTS handles degenerate boxes (zero-area polygon) correctly in both index and refine paths, and switching to geomFromBox2D would introduce shape polymorphism (Point/Line/Polygon) at the join boundary for what's nominally a single rectangle type.
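A minimal sketch of the validate-then-materialise step, in plain Scala without JTS: `BoxMaterializer` and `envelopeRing` are hypothetical names, the exception message is invented for illustration (the real message is whatever `Predicates.boxIntersects` / `boxContains` raise), and the vertex order is an arbitrary choice rather than a statement about `polygonFromEnvelope`.

```scala
object BoxMaterializer {
  // Returns the closed ring of the rectangle implied by the bounds:
  // five (x, y) vertices, with the first repeated as the last.
  def envelopeRing(
      xmin: Double,
      ymin: Double,
      xmax: Double,
      ymax: Double): Vector[(Double, Double)] = {
    // Reject inverted bounds up front instead of building a rectangle
    // whose extent would be meaningless to an index.
    if (xmin > xmax || ymin > ymax) {
      throw new IllegalArgumentException(
        s"Inverted box bounds: x [$xmin, $xmax], y [$ymin, $ymax]") // hypothetical message
    }
    Vector((xmin, ymin), (xmin, ymax), (xmax, ymax), (xmax, ymin), (xmin, ymin))
  }
}
```

A degenerate box such as `envelopeRing(1, 1, 1, 1)` passes validation and yields five identical vertices, which corresponds to the zero-area polygon case mentioned above; only strictly inverted bounds are rejected.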
```scala
describe("Box2D spatial join") {

  it("ST_BoxIntersects: broadcast index join produces correct pairs") {
    val df = leftBoxes
      .alias("L")
      .join(broadcast(rightBoxes.alias("R")), expr("ST_BoxIntersects(L.box, R.box)"))
    val plan = df.queryExecution.sparkPlan
```
Added in 9144fb8: 'Null Box2D rows are safe and produce no matches' (covers both broadcast and range paths) and 'Inverted Box2D bounds in a join throw IllegalArgumentException' (verifies the exception type and message).
- Restore null-shape fallback. `shapeToGeometry` returns null on null inputs; `toSpatialRDD` / `toExpandedEnvelopeRDD` substitute an empty GeometryCollection via the new `shapeToGeometryOrEmpty` wrapper, matching the pre-existing `GeometrySerializer.deserialize(null)` behaviour. Fixes the `KnnJoinSuite` null-geom NPE regression.
- Reject inverted Box2D bounds (`xmin > xmax` / `ymin > ymax`) with an `IllegalArgumentException` whose message matches `Predicates.boxIntersects` / `boxContains`. Inverted bounds have no defined planar meaning and would silently mis-prune the R-tree.
- Add two edge-case tests to `Box2DJoinSuite`: null Box2D rows survive the broadcast and range paths without crashing and without contributing matches; inverted Box2D bounds surface the same `IllegalArgumentException` as the scalar predicate.

Verified locally against Spark 3.5 / Scala 2.12: `Box2DJoinSuite` 8/8; regression run across `KnnJoinSuite`, `BroadcastIndexJoinSuite`, `SpatialJoinSuite`, `Box2DJoinSuite`: 262/262.
```scala
val ex = intercept[org.apache.spark.SparkException] {
  invertedLeft
    .alias("L")
    .join(broadcast(rightBoxes.alias("R")), expr("ST_BoxIntersects(L.box, R.box)"))
    .collect()
```
Did you read the Contributor Guide?
Is this PR related to a ticket?
`[GH-XXX] my subject`. Closes "Spatial join planner: recognize ST_BoxIntersects / ST_BoxContains as join predicates" (#2939).

What changes were proposed in this PR?
Teaches Sedona's spatial join planner about the Box2D predicates from #2926. After this PR, both broadcast index joins and distributed range joins handle:
- `ST_BoxIntersects(box_a, box_b)`
- `ST_BoxContains(box_a, box_b)` (both argument orders)

…using the same machinery (partitioner, R-tree, refine evaluator) that already powers `ST_Intersects` / `ST_Contains` joins. No new physical operator, no new index implementation, no new partitioning code.

How it works
At every executor boundary where a shape column is materialised (`TraitJoinQueryBase.toSpatialRDD`, `TraitJoinQueryBase.toExpandedEnvelopeRDD`, and `BroadcastIndexJoinExec.createStreamShapes`), dispatch on the shape expression's `dataType`. If it is `Box2DUDT`, read the four doubles out of the serialized `InternalRow` and materialise the implied closed rectangular Polygon via `Constructors.polygonFromEnvelope`. Geometry columns continue through `GeometrySerializer.deserialize` as before.

The materialised Polygon flows through `SpatialRDD<T extends Geometry>`, the spatial partitioner's sample step, `IndexBuilder`'s R-tree, and `SpatialPredicateEvaluators` unchanged. JTS already short-circuits axis-aligned rectangle predicates via `RectangleIntersects` / `RectangleContains` (gated on `Polygon.isRectangle()`), and `polygonFromEnvelope` produces exactly such a rectangle. The refine step therefore pays only a four-double envelope comparison per pair, which is the saving the user expects from "we know the data is a box".

Predicate-kind mapping
| Predicate | `SpatialPredicate` |
|---|---|
| `ST_BoxIntersects(a, b)` | `INTERSECTS` |
| `ST_BoxContains(a, b)` | `COVERS` |

`ST_BoxContains` deliberately maps to `COVERS`, not `CONTAINS`. PostGIS-style closed-interval containment counts edge-touching pairs as contained; JTS `Geometry.contains` excludes shared-edge pairs (strict interior), whereas `Geometry.covers` accepts them, which is what we want.

What's left untouched
- `Box2D` class hierarchy (still a value class).
- `Box2DUDT`.
- `SpatialRDD`, the partitioners, the R-tree, the refine evaluator.

Scope notes
`ST_DWithin` (distance join) only has `(Geometry, Geometry, distance)` and `(Geography, Geography, distance)` overloads today, so Box2D × Box2D distance joins are out of scope for this PR. Adding a `(Box2D, Box2D, distance)` overload is a follow-up.

How was this patch tested?
New `Box2DJoinSuite` under `spark/common/src/test/scala/org/apache/sedona/sql/`:
- `ST_BoxIntersects` broadcast index join produces the expected 4 pairs from a 3×3 fixture, with `BroadcastIndexJoinExec` in the plan.
- `ST_BoxIntersects(R, L)` produces the same 4 pairs (argument order symmetric).
- `ST_BoxContains` broadcast index join produces the expected 2 closed-interval containment pairs.
- The edge-touching closed-interval case exercises the `COVERS`-not-`CONTAINS` mapping.
- The range join path runs with `RangeJoinExec` in the plan.
- The Box2D path is checked for equivalence with `ST_Intersects` on the same data materialised as polygons via `ST_GeomFromBox2D`.

Run locally against Spark 3.5 / Scala 2.12. Regression runs: `BroadcastIndexJoinSuite` 65/65, `SpatialJoinSuite` 160/160, `Box2DUDTSuite` 5/5, `Box2DCastResolutionRuleSuite` 3/3, `GeoParquetSpatialFilterPushDownSuite` 25/25; all still pass.

Did this PR include necessary documentation updates?
`ST_BoxIntersects` / `ST_BoxContains` is covered by the consolidated Phase 1+2+3 Box2D docs update.