Labels: enhancement (New feature or request)
Description
Summary
Extend the existing PartitionFilter infrastructure to support split-level data skipping — evaluating column statistics (min/max/count) against query filters entirely in Rust, returning only the split IDs that cannot be pruned.
Priority: P1
Motivation
Currently, data skipping is done on the JVM driver in Scala (`PartitionPruning.scala`, `ExpressionSimplifier.scala`):
- Recursive filter-tree traversal with Spark's `Expression.eval()`
- Per-partition string → typed value conversion (creates `LocalDate`/`Instant` objects per value)
- `zipWithIndex` + `groupBy` + `flatMap` chains on large collections
- Serial evaluation on the driver for all splits × all partitions
For highly-partitioned tables (100+ partitions × 1000+ splits), this is a measurable driver-side bottleneck.
Proposed Approach
- Extend `PartitionFilter` to support column-statistics predicates (min ≤ value, max ≥ value)
- New method (strawman):

```java
// Given split metadata (partition values + column stats) and filters,
// return indices of splits that survive pruning
int[] evaluateDataSkipping(
    SplitMetadata[] splits,          // partition values + column stats per split
    PartitionFilter partitionFilter, // partition pruning
    ColumnStatsFilter statsFilter    // column-level data skipping
);
```
- Or, if transaction log state is already in Rust (depends on P0 transaction log reader), the pruning can happen as part of state materialization — filter during read rather than read-then-filter.
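To make the stats-predicate semantics concrete, here is a minimal Rust sketch of the survive-or-prune test: a split can be skipped only when its `[min, max]` range (plus null count) proves no row can match. The type and function names (`ColumnStats`, `StatsPredicate`, `evaluate_data_skipping`) are illustrative assumptions, not the actual API.

```rust
// Hypothetical sketch: names and types are assumptions, not the real API.
#[derive(Debug)]
struct ColumnStats {
    min: i64,
    max: i64,
    null_count: u64,
}

enum StatsPredicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
    IsNull,
}

impl StatsPredicate {
    /// True if the split *might* contain matching rows, i.e. it cannot be pruned.
    /// Stats can only prove absence, never presence, so this must err toward `true`.
    fn may_match(&self, s: &ColumnStats) -> bool {
        match *self {
            StatsPredicate::Eq(v) => s.min <= v && v <= s.max,
            StatsPredicate::Gt(v) => s.max > v,
            StatsPredicate::Lt(v) => s.min < v,
            StatsPredicate::IsNull => s.null_count > 0,
        }
    }
}

/// Return indices of splits that survive pruning.
fn evaluate_data_skipping(stats: &[ColumnStats], pred: &StatsPredicate) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| pred.may_match(s))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let splits = [
        ColumnStats { min: 0, max: 10, null_count: 0 },
        ColumnStats { min: 20, max: 30, null_count: 0 },
        ColumnStats { min: 5, max: 25, null_count: 2 },
    ];
    // `col > 15` can only match splits whose max exceeds 15.
    let surviving = evaluate_data_skipping(&splits, &StatsPredicate::Gt(15));
    println!("{:?}", surviving); // [1, 2]
}
```

The key design point is that `may_match` must be conservative: statistics can prove a split contains no matches, but never that it does, so any uncertain case keeps the split.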
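The "filter during read" alternative could look like the following sketch: apply the pruning predicate while replaying log actions, so pruned splits are never materialized at all. The `LogAction`/`SplitMeta` shapes are invented for illustration; the real transaction-log types depend on the P0 reader. Note that pruning an `Add` at read time is safe because a later `Remove` of a never-materialized split is simply a no-op.

```rust
use std::collections::HashMap;

// Hypothetical log-action types; the real ones depend on the P0 reader.
struct SplitMeta {
    id: u64,
    min: i64,
    max: i64,
}

enum LogAction {
    Add(SplitMeta),
    Remove(u64),
}

/// Materialize live split state from a stream of log actions, keeping only
/// splits that pass the pruning predicate `keep` (filter-during-read).
fn materialize_pruned<I, F>(actions: I, keep: F) -> Vec<SplitMeta>
where
    I: IntoIterator<Item = LogAction>,
    F: Fn(&SplitMeta) -> bool,
{
    let mut live: HashMap<u64, SplitMeta> = HashMap::new();
    for a in actions {
        match a {
            LogAction::Add(m) if keep(&m) => {
                live.insert(m.id, m);
            }
            LogAction::Add(_) => {} // pruned at read time, never materialized
            LogAction::Remove(id) => {
                live.remove(&id); // no-op if the split was already pruned
            }
        }
    }
    live.into_values().collect()
}

fn main() {
    let actions = vec![
        LogAction::Add(SplitMeta { id: 1, min: 0, max: 10 }),
        LogAction::Add(SplitMeta { id: 2, min: 20, max: 30 }),
        LogAction::Remove(2),
        LogAction::Add(SplitMeta { id: 3, min: 5, max: 25 }),
    ];
    // Pruning predicate for `col > 15`: keep splits whose max exceeds 15.
    let live = materialize_pruned(actions, |m| m.max > 15);
    let ids: Vec<u64> = live.iter().map(|m| m.id).collect();
    println!("{:?}", ids); // [3]
}
```

Compared with read-then-filter, this keeps the working set proportional to the surviving splits rather than the full table state, which is where the win on 100+ partition × 1000+ split tables would come from.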
Expected Impact
- 5-10x faster scan planning on highly-partitioned tables
- Moves filter evaluation from interpreted Spark expressions to compiled Rust code
- Enables combined partition + column stats pruning in a single native pass
Dependencies
- P0: Native transaction log state reader (if pruning during state read)
- Existing `PartitionFilter` infrastructure (equality, range, IN, IS NULL, AND/OR)
Related
- indextables/indextables_spark integration issue (to be linked)
- `PartitionPruning.scala`, `ExpressionSimplifier.scala` (current JVM implementation)
- `SparkPredicateToPartitionFilter.scala` (existing Spark → PartitionFilter conversion)