Skip to content

Run-end / dictionary encoding for ND broadcast columns, propagated through the physical plan #276

@robinskil

Description

@robinskil

Summary

When ND variables (netCDF, Zarr, TIFF, Atlas) are flattened to tabular form, broadcast coordinate columns are fully expanded into flat Arrow arrays (see beacon-nd-array/src/arrow/batch.rs:505 and beacon-nd-arrow/src/array/mod.rs:270). A coordinate broadcast across inner dimensions of total size K is materialized as O(rows) even though the run structure is known analytically from the broadcast geometry. This wastes memory and CPU.

Goal: emit run-end-encoded (REE) / dictionary arrays for broadcast columns and let that encoding survive through the DataFusion physical plan, reconciling at scan / union / join boundaries, without changing the logical schema that queries see.

Design / pipeline

Encoding is born from broadcast geometry, not statistical detection:

Column kind Encoding Why
Outer / slow-varying broadcast dim REE (RunArray) Long contiguous runs; run-ends are arithmetic, built in O(distinct)
Inner / fast-varying cyclic dim Dictionary Repeats as short cycles in row-major → no long runs, but tiny cardinality
Data column Flat No repetition to exploit

Layering

  • beacon-nd-arrow (pure Arrow, no DataFusion dep) — the producer: a geometry classifier (dims + target shape + flatten order → per-column encoding) plus a broadcast→RunArray/DictionaryArray constructor that replaces the current full expansion.
  • beacon-datafusion-ext (format-agnostic plan layer) — flat↔REE SchemaAdapterFactory, an EnforceColumnEncoding physical optimizer rule, and operator encoding-capability handling.
  • format sources / beacon-data-lake — declare the encoded scan schema + output_ordering, and reconcile mixed-encoding files within a scan via the SchemaAdapter.

Hard constraint

A single ExecutionPlan node has exactly one output schema, so mixed encodings are reconciled only at boundaries:

  • intra-scan (some files REE, some flat) → SchemaAdapter coerces each file to the scan's declared encoding;
  • cross-table (union/join of tables with different encodings) → the EnforceColumnEncoding optimizer rule inserts casts, choosing the target encoding cost-driven (toward the dominant branch / stats cache) and pushing materialization as late as possible.

Insertion is the provider's job

If we graduate to dedicated operators (stretch), TableProvider::scan() returns BroadcastExec(NdScanExec(..)) directly — not an analyzer rule (that would leak ND geometry into the logical layer). The "custom ND array type" flowing between the two nodes is the existing Struct{ values: List<T>, dim_names: List<Dictionary>, dim_sizes: List<UInt32> } representation, optionally tagged with Arrow extension-type metadata.

Note: coordinate-predicate pushdown into the ND domain already exists (is_pushdown_candidate / mask_pushdown in beacon-nd-array/src/arrow/pushdown.rs), so the dedicated-operator stage is justified by ordering/stats propagation and ND-form computation, not pushdown.

Phasing

  • Phase 1 — in-place memory win (low risk): sub-issues 1 → 2 → 3
  • Phase 2 — propagation layer: sub-issues 4 → 5
  • Phase 3 — validation: sub-issue 6
  • Stretch — dedicated operators: sub-issue 7

Sub-issues

  • 1. nd-arrow: broadcast geometry encoding classifier
  • 2. nd-arrow: broadcast→RunArray/DictionaryArray constructor
  • 3. Emit encoded arrays in the nd scan + declare schema/ordering
  • 4. flat↔REE SchemaAdapterFactory in beacon-datafusion-ext
  • 5. EnforceColumnEncoding physical optimizer rule
  • 6. Benchmarks + docs
  • 7. (Stretch) dedicated NdScanExec + BroadcastExec

Metadata

Metadata

Assignees

No one assigned

    Labels

    beacon-kernelenhancementNew feature or requestrustPull requests that update rust code

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions