Summary
When ND variables (netCDF, Zarr, TIFF, Atlas) are flattened to tabular form, broadcast coordinate columns are fully expanded into flat Arrow arrays (see beacon-nd-array/src/arrow/batch.rs:505 and beacon-nd-arrow/src/array/mod.rs:270). A coordinate broadcast across inner dimensions of total size K is materialized as O(rows) even though the run structure is known analytically from the broadcast geometry. This wastes memory and CPU.
Goal: emit run-end-encoded (REE) / dictionary arrays for broadcast columns and let that encoding survive through the DataFusion physical plan, reconciling at scan / union / join boundaries, without changing the logical schema that queries see.
Design / pipeline
Encoding is born from broadcast geometry, not statistical detection:
| Column kind |
Encoding |
Why |
| Outer / slow-varying broadcast dim |
REE (RunArray) |
Long contiguous runs; run-ends are arithmetic, built in O(distinct) |
| Inner / fast-varying cyclic dim |
Dictionary |
Repeats as short cycles in row-major → no long runs, but tiny cardinality |
| Data column |
Flat |
No repetition to exploit |
Layering
beacon-nd-arrow (pure Arrow, no DataFusion dep) — the producer: a geometry classifier (dims + target shape + flatten order → per-column encoding) plus a broadcast→RunArray/DictionaryArray constructor that replaces the current full expansion.
beacon-datafusion-ext (format-agnostic plan layer) — flat↔REE SchemaAdapterFactory, an EnforceColumnEncoding physical optimizer rule, and operator encoding-capability handling.
- format sources /
beacon-data-lake — declare the encoded scan schema + output_ordering, and reconcile mixed-encoding files within a scan via the SchemaAdapter.
Hard constraint
A single ExecutionPlan node has exactly one output schema, so mixed encodings are reconciled only at boundaries:
- intra-scan (some files REE, some flat) →
SchemaAdapter coerces each file to the scan's declared encoding;
- cross-table (union/join of tables with different encodings) → the
EnforceColumnEncoding optimizer rule inserts casts, choosing the target encoding cost-driven (toward the dominant branch / stats cache) and pushing materialization as late as possible.
Insertion is the provider's job
If we graduate to dedicated operators (stretch), TableProvider::scan() returns BroadcastExec(NdScanExec(..)) directly — not an analyzer rule (that would leak ND geometry into the logical layer). The "custom ND array type" flowing between the two nodes is the existing Struct{ values: List<T>, dim_names: List<Dictionary>, dim_sizes: List<UInt32> } representation, optionally tagged with Arrow extension-type metadata.
Note: coordinate-predicate pushdown into the ND domain already exists (is_pushdown_candidate / mask_pushdown in beacon-nd-array/src/arrow/pushdown.rs), so the dedicated-operator stage is justified by ordering/stats propagation and ND-form computation, not pushdown.
Phasing
- Phase 1 — in-place memory win (low risk): sub-issues 1 → 2 → 3
- Phase 2 — propagation layer: sub-issues 4 → 5
- Phase 3 — validation: sub-issue 6
- Stretch — dedicated operators: sub-issue 7
Sub-issues
Summary
When ND variables (netCDF, Zarr, TIFF, Atlas) are flattened to tabular form, broadcast coordinate columns are fully expanded into flat Arrow arrays (see
beacon-nd-array/src/arrow/batch.rs:505andbeacon-nd-arrow/src/array/mod.rs:270). A coordinate broadcast across inner dimensions of total sizeKis materialized asO(rows)even though the run structure is known analytically from the broadcast geometry. This wastes memory and CPU.Goal: emit run-end-encoded (REE) / dictionary arrays for broadcast columns and let that encoding survive through the DataFusion physical plan, reconciling at scan / union / join boundaries, without changing the logical schema that queries see.
Design / pipeline
Encoding is born from broadcast geometry, not statistical detection:
RunArray)O(distinct)Layering
beacon-nd-arrow(pure Arrow, no DataFusion dep) — the producer: a geometry classifier (dims + target shape + flatten order → per-column encoding) plus a broadcast→RunArray/DictionaryArrayconstructor that replaces the current full expansion.beacon-datafusion-ext(format-agnostic plan layer) — flat↔REESchemaAdapterFactory, anEnforceColumnEncodingphysical optimizer rule, and operator encoding-capability handling.beacon-data-lake— declare the encoded scan schema +output_ordering, and reconcile mixed-encoding files within a scan via the SchemaAdapter.Hard constraint
A single
ExecutionPlannode has exactly one output schema, so mixed encodings are reconciled only at boundaries:SchemaAdaptercoerces each file to the scan's declared encoding;EnforceColumnEncodingoptimizer rule inserts casts, choosing the target encoding cost-driven (toward the dominant branch / stats cache) and pushing materialization as late as possible.Insertion is the provider's job
If we graduate to dedicated operators (stretch),
TableProvider::scan()returnsBroadcastExec(NdScanExec(..))directly — not an analyzer rule (that would leak ND geometry into the logical layer). The "custom ND array type" flowing between the two nodes is the existingStruct{ values: List<T>, dim_names: List<Dictionary>, dim_sizes: List<UInt32> }representation, optionally tagged with Arrow extension-type metadata.Phasing
Sub-issues
RunArray/DictionaryArrayconstructorSchemaAdapterFactoryin beacon-datafusion-extEnforceColumnEncodingphysical optimizer ruleNdScanExec+BroadcastExec