Add GeoParquet read support #285

Merged: robinskil merged 2 commits into main from features/geoparquet-source-reader on Jun 18, 2026
@robinskil (Collaborator)

Summary

Refactors the previously write-only beacon-arrow-geoparquet crate into a full read + write GeoParquet format, and wires reading into the data lake and SQL layer.

Read path

  • infer_schema reads each file's GeoArrow schema concurrently and super-types them. Geometry columns described in the file's geo metadata are decoded to their native GeoArrow representation (CoordType::Separated); files without a geo key fall back to the plain Arrow schema, so ordinary Parquet is still readable.
  • GeoParquetSource (FileSource) + GeoParquetOpener (FileOpener) stream files through the async Parquet reader and the geoparquet crate's GeoParquetRecordBatchStream, applying column projection via a BatchAdapterFactory.
  • infer_stats returns Statistics::new_unknown; create_physical_plan builds a DataSourceExec.
  • FileFormatFactoryExt::discover_datasets registers .geoparquet files. The factory is registered in beacon-data-lake, so auto-discovery and CREATE EXTERNAL TABLE ... STORED AS GEOPARQUET both work.
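The schema-inference behaviour described in the first bullet can be pictured with a small, std-only sketch. This is not the crate's code: `FieldType`, `FileMeta`, `file_schema`, `supertype`, and `merge_schemas` are hypothetical stand-ins for the Arrow/GeoArrow types, and the widening rule (Int32 and Int64 widen to Int64, everything else must match) is assumed purely for illustration. It shows two ideas: a file without a `geo` key keeps its plain schema, and per-file schemas are merged column by column.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified column types (stand-ins for Arrow/GeoArrow types).
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int32,
    Int64,
    Utf8,
    Binary,   // raw bytes, as a plain Parquet reader would see a geometry column
    Geometry, // decoded native geometry
}

/// Hypothetical per-file description: column types plus the geometry column
/// names listed in the file's `geo` metadata (`None` when the key is absent).
struct FileMeta {
    columns: Vec<(String, FieldType)>,
    geo_columns: Option<Vec<String>>,
}

/// Schema for one file: columns named in `geo` become `Geometry`; without a
/// `geo` key the plain schema is returned unchanged.
fn file_schema(meta: &FileMeta) -> BTreeMap<String, FieldType> {
    meta.columns
        .iter()
        .map(|(name, ty)| {
            let is_geo = meta
                .geo_columns
                .as_ref()
                .map_or(false, |geo| geo.contains(name));
            let ty = if is_geo { FieldType::Geometry } else { ty.clone() };
            (name.clone(), ty)
        })
        .collect()
}

/// Assumed widening rule: equal types stay as they are, Int32/Int64 widen to
/// Int64, and any other pair is treated as incompatible.
fn supertype(a: &FieldType, b: &FieldType) -> Option<FieldType> {
    use FieldType::*;
    match (a, b) {
        _ if a == b => Some(a.clone()),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        _ => None,
    }
}

/// Merge per-file schemas column by column; a column that appears in only
/// some of the files is kept with the type it has there.
fn merge_schemas(
    schemas: &[BTreeMap<String, FieldType>],
) -> Result<BTreeMap<String, FieldType>, String> {
    let mut merged = BTreeMap::new();
    for schema in schemas {
        for (name, ty) in schema {
            let next = match merged.get(name) {
                Some(existing) => supertype(existing, ty)
                    .ok_or_else(|| format!("incompatible types for column `{name}`"))?,
                None => ty.clone(),
            };
            merged.insert(name.clone(), next);
        }
    }
    Ok(merged)
}
```

In this sketch, mixing a decoded geometry column with a plain binary column of the same name is rejected; how the real super-typing step treats that case is not stated in this PR.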

SQL

  • New read_geoparquet() table function mirroring read_parquet.

Tests

5 unit tests in the crate (all passing):

  • native-geometry schema inference
  • full round-trip read (rows + geometry coordinates)
  • column projection
  • dataset discovery filtered by extension
  • plain-Parquet fallback (no geo metadata)
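As a rough illustration of what "dataset discovery filtered by extension" exercises, here is a std-only sketch. The helper names `has_extension` and `filter_geoparquet` are hypothetical, and exact, case-sensitive matching on `geoparquet` is an assumption of this sketch rather than something the PR states.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: true when the path's extension is exactly `wanted`
/// (case-sensitive, which is an assumption of this sketch).
fn has_extension(path: &Path, wanted: &str) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| ext == wanted)
}

/// Hypothetical discovery filter: keep only `.geoparquet` files, preserving
/// the input order.
fn filter_geoparquet(paths: &[PathBuf]) -> Vec<PathBuf> {
    paths
        .iter()
        .filter(|p| has_extension(p, "geoparquet"))
        .cloned()
        .collect()
}
```

Only the final extension is considered here, so a name such as `cities.geoparquet.tmp` would be skipped, as would a plain `.parquet` file.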

Docs

GeoParquet sections added to Supported Formats and External Tables, plus a read_geoparquet entry in the table-function reference (v1.7.2).

Not included

Spatial bbox row-group pruning (skipping row groups via the GeoParquet bbox covering) is not implemented; reads are a full scan with column projection. This is documented as a planned enhancement.
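For readers unfamiliar with the idea, the following std-only sketch shows what bbox row-group pruning generally means. It is not part of this PR and not the planned design: `BBox` and `row_groups_to_read` are hypothetical names, and the rule that a row group without bbox statistics must still be read is an assumption chosen to keep the sketch conservative.

```rust
/// Axis-aligned bounding box; field names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    /// True when the two boxes overlap or touch on both axes.
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

/// Hypothetical pruning step: return the indices of row groups that could
/// contain rows inside `query`. A row group with no bbox (`None`) is kept,
/// because nothing can be ruled out without statistics.
fn row_groups_to_read(groups: &[Option<BBox>], query: &BBox) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, bbox)| bbox.as_ref().map_or(true, |b| b.intersects(query)))
        .map(|(index, _)| index)
        .collect()
}
```

A full scan, which is what this PR does, corresponds to reading every index regardless of the query box.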

Verification

cargo test -p beacon-arrow-geoparquet passes; beacon-functions, beacon-data-lake, and beacon-core build cleanly.

Refactor the previously write-only beacon-arrow-geoparquet crate into a full
read+write GeoParquet format.

Read path:
- infer_schema reads each file's GeoArrow schema concurrently and super-types
  them; geometry columns described in the file's `geo` metadata are decoded to
  their native GeoArrow representation (CoordType::Separated). Files without a
  `geo` key fall back to the plain Arrow schema.
- New GeoParquetSource (FileSource) + GeoParquetOpener (FileOpener) stream files
  via the async Parquet reader and the geoparquet crate's
  GeoParquetRecordBatchStream, applying column projection through a
  BatchAdapterFactory.
- FileFormatFactoryExt::discover_datasets registers `.geoparquet` files, and the
  factory is registered in beacon-data-lake so external tables
  (STORED AS GEOPARQUET) and auto-discovery work.

SQL:
- New read_geoparquet() table function mirroring read_parquet.

Tests:
- 5 unit tests: native-geometry schema inference, full round-trip read,
  column projection, dataset discovery by extension, and plain-Parquet fallback.

Docs:
- GeoParquet sections added to Supported Formats, External Tables, and the
  read_geoparquet table-function reference.

Note: spatial bbox row-group pruning is not yet applied; reads are a full scan
with column projection.
Copilot AI review requested due to automatic review settings June 18, 2026 11:31

Copilot AI (Contributor) left a comment

Pull request overview

This PR extends Beacon’s GeoParquet support from write-only to read + write, integrating GeoParquet scanning into the DataFusion file-format layer, the data lake format registry, and the SQL/table-function surface area.

Changes:

  • Adds a GeoParquet read path (schema inference + file scanning) via a custom FileSource/FileOpener and GeoArrow schema decoding from GeoParquet geo metadata.
  • Registers GeoParquet as a discoverable data-lake format and adds a read_geoparquet() SQL table function mirroring read_parquet().
  • Documents GeoParquet usage in supported formats, external tables, and table-function reference docs.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `docs/docs/1.7.2/sql/table-functions.md` | Documents the new read_geoparquet() table function. |
| `docs/docs/1.7.2/data-lake/external-tables.md` | Adds STORED AS GEOPARQUET external table example and notes GeoArrow decoding. |
| `docs/docs/1.7.2/data-lake/datasets.md` | Adds GeoParquet to Supported Formats, including examples and a pruning warning. |
| `Cargo.lock` | Updates dependency graph to include GeoParquet reader-related dependencies and crate wiring. |
| `beacon-functions/src/file_formats/read_geoparquet.rs` | Introduces read_geoparquet() table function that builds a listing table over glob paths using GeoParquetFormat. |
| `beacon-functions/src/file_formats/mod.rs` | Registers the new ReadGeoParquetFunc in the table-function registry. |
| `beacon-functions/Cargo.toml` | Adds dependency on beacon-arrow-geoparquet. |
| `beacon-file-formats/beacon-arrow-geoparquet/src/lib.rs` | Updates crate-level docs to reflect read + write support. |
| `beacon-file-formats/beacon-arrow-geoparquet/src/datafusion/mod.rs` | Implements GeoParquet FileFormat read integration (infer schema/stats, physical plan via DataSourceExec, dataset discovery). |
| `beacon-file-formats/beacon-arrow-geoparquet/src/datafusion/source.rs` | Adds GeoParquetSource (FileSource) to produce openers and preserve pushed-down projections. |
| `beacon-file-formats/beacon-arrow-geoparquet/src/datafusion/opener.rs` | Adds GeoParquetOpener (FileOpener) streaming record batches through async Parquet + GeoParquet decoding and batch adaptation. |
| `beacon-file-formats/beacon-arrow-geoparquet/src/datafusion/reader.rs` | Adds helpers to open async Parquet builders and derive GeoArrow output schemas (GeoParquet metadata-aware). |
| `beacon-file-formats/beacon-arrow-geoparquet/Cargo.toml` | Adds new dependencies for async reading + schema handling (e.g., geoparquet async feature, geoarrow-schema, beacon-common). |
| `beacon-data-lake/src/file_formats.rs` | Registers GeoParquetFormatFactory so .geoparquet files can be auto-discovered. |
| `beacon-data-lake/Cargo.toml` | Adds dependency on beacon-arrow-geoparquet. |
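The "pushed-down projections" and "batch adaptation" mentioned for source.rs and opener.rs can be illustrated with a std-only sketch of column projection. It makes no claim about how `BatchAdapterFactory` works internally: `Batch` and `project` are hypothetical, and columns are modelled as plain vectors of optional integers rather than Arrow arrays.

```rust
/// Hypothetical in-memory batch: named columns of optional integers.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<Option<i64>>)>,
}

/// Select and reorder columns by their index in the file schema.
/// Returns an error if any projection index is out of range.
fn project(batch: &Batch, projection: &[usize]) -> Result<Batch, String> {
    let columns = projection
        .iter()
        .map(|&index| {
            batch.columns.get(index).cloned().ok_or_else(|| {
                format!(
                    "projection index {index} out of range ({} columns)",
                    batch.columns.len()
                )
            })
        })
        .collect::<Result<Vec<_>, _>>()?;
    Ok(Batch { columns })
}
```

The output keeps the order given by `projection`, so projecting `[2, 0]` yields the third column followed by the first.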

Comment on lines +53 to +57
let parquet_stream = builder
    .with_batch_size(batch_size)
    .build()
    .map_err(|e| DataFusionError::External(Box::new(e)))?;

@robinskil robinskil merged commit 5da5c76 into main Jun 18, 2026
2 checks passed
@robinskil robinskil mentioned this pull request Jun 19, 2026