Native split-level data skipping with PartitionFilter and column statistics #102

@schenksj

Description

Summary

Extend the existing PartitionFilter infrastructure to support split-level data skipping: evaluate column statistics (min/max/count) against query filters entirely in Rust, returning only the IDs of splits that cannot be pruned.

Priority: P1

Motivation

Data skipping currently runs on the JVM driver in Scala (PartitionPruning.scala, ExpressionSimplifier.scala):

  • Recursive filter tree traversal with Spark's Expression.eval()
  • Per-partition string → typed value conversion (creates LocalDate/Instant objects per value)
  • zipWithIndex + groupBy + flatMap chains on large collections
  • Serial evaluation on the driver for all splits × all partitions

For highly partitioned tables (100+ partitions × 1000+ splits), this is a measurable driver-side bottleneck.

Proposed Approach

  1. Extend PartitionFilter to support column statistics predicates (min ≤ value, max ≥ value)
  2. New method (strawman):
    // Given split metadata (partition values + column stats) and filters,
    // return indices of splits that survive pruning
    int[] evaluateDataSkipping(
        SplitMetadata[] splits,          // partition values + column stats per split
        PartitionFilter partitionFilter, // partition pruning
        ColumnStatsFilter statsFilter    // column-level data skipping
    );
  3. Alternatively, if transaction log state is already in Rust (depends on the P0 transaction log reader), pruning can happen as part of state materialization: filter during the read rather than read-then-filter.
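A minimal Rust sketch of step 1's stats-based pruning, to make the semantics concrete. All names here (`SplitStats`, `ColumnStatsFilter`, `evaluate_data_skipping`) are hypothetical, not the actual indextables API; the point is only that a split survives when its min/max bounds could still contain a matching row.

```rust
// Hypothetical sketch of split-level data skipping; not the real API.
// A split is kept when its column statistics cannot rule out the predicate.

#[derive(Debug)]
struct SplitStats {
    min: i64,        // column minimum observed in this split
    max: i64,        // column maximum observed in this split
    null_count: u64, // number of null values in this split
}

/// A few illustrative column-stats predicates.
enum ColumnStatsFilter {
    /// col = v: keep only if min <= v <= max
    Eq(i64),
    /// col > v: keep only if max > v
    Gt(i64),
    /// col IS NULL: keep only if the split contains nulls
    IsNull,
}

/// Return indices of splits that survive pruning (cannot be skipped).
fn evaluate_data_skipping(splits: &[SplitStats], filter: &ColumnStatsFilter) -> Vec<usize> {
    splits
        .iter()
        .enumerate()
        .filter(|(_, s)| match filter {
            ColumnStatsFilter::Eq(v) => s.min <= *v && *v <= s.max,
            ColumnStatsFilter::Gt(v) => s.max > *v,
            ColumnStatsFilter::IsNull => s.null_count > 0,
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let splits = [
        SplitStats { min: 0, max: 9, null_count: 0 },
        SplitStats { min: 10, max: 19, null_count: 2 },
        SplitStats { min: 20, max: 29, null_count: 0 },
    ];
    // col = 15 can only match the middle split.
    println!("{:?}", evaluate_data_skipping(&splits, &ColumnStatsFilter::Eq(15))); // [1]
}
```

Returning surviving indices (rather than filtered metadata) keeps the JNI surface small: the JVM side hands over packed stats and gets back an `int[]`, as in the strawman signature above.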

Expected Impact

  • 5-10x faster scan planning on highly-partitioned tables
  • Moves filter evaluation from interpreted Spark expressions to compiled Rust code
  • Enables combined partition + column stats pruning in a single native pass

Dependencies

  • P0: Native transaction log state reader (if pruning during state read)
  • Existing PartitionFilter infrastructure (equality, range, IN, IS NULL, AND/OR)
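For reference, the predicate shapes the existing PartitionFilter supports (equality, range, IN, IS NULL, AND/OR) compose naturally as a recursive enum on the Rust side. A sketch under assumed names; this is illustrative, not the actual indextables type:

```rust
// Illustrative model of the PartitionFilter predicate vocabulary;
// not the real indextables definition. Partition values are the
// string-encoded forms stored in the transaction log.
use std::collections::HashSet;

enum PartitionFilter {
    Eq(String),
    Range { lo: Option<String>, hi: Option<String> }, // inclusive bounds
    In(HashSet<String>),
    IsNull,
    And(Vec<PartitionFilter>),
    Or(Vec<PartitionFilter>),
}

/// Evaluate a filter against one partition value (None = null partition).
fn matches(f: &PartitionFilter, value: Option<&str>) -> bool {
    match f {
        PartitionFilter::Eq(v) => value == Some(v.as_str()),
        PartitionFilter::Range { lo, hi } => match value {
            Some(v) => lo.as_deref().map_or(true, |l| v >= l)
                && hi.as_deref().map_or(true, |h| v <= h),
            None => false,
        },
        PartitionFilter::In(set) => value.map_or(false, |v| set.contains(v)),
        PartitionFilter::IsNull => value.is_none(),
        PartitionFilter::And(fs) => fs.iter().all(|f| matches(f, value)),
        PartitionFilter::Or(fs) => fs.iter().any(|f| matches(f, value)),
    }
}

fn main() {
    // date >= "2024-01-01" AND (date = "2024-06-01" OR date IS NULL)
    let f = PartitionFilter::And(vec![
        PartitionFilter::Range { lo: Some("2024-01-01".into()), hi: None },
        PartitionFilter::Or(vec![
            PartitionFilter::Eq("2024-06-01".into()),
            PartitionFilter::IsNull,
        ]),
    ]);
    println!("{}", matches(&f, Some("2024-06-01"))); // true
}
```

Extending this shape with column-stats leaves (as proposed above) would let partition pruning and data skipping share one evaluation pass.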

Related

  • indextables/indextables_spark integration issue (to be linked)
  • PartitionPruning.scala, ExpressionSimplifier.scala (current JVM implementation)
  • SparkPredicateToPartitionFilter.scala (existing Spark → PartitionFilter conversion)
