Open
Labels: enhancement (New feature or request)
Description
Summary
Add a new TransactionLogReader component to tantivy4java that reads the IndexTables transaction log state natively in Rust and exports it via Arrow FFI, following the same pattern as DeltaTableReader.readCheckpointPartArrowFfi().
Priority: P0
Motivation
Currently, every read operation starts by reading the transaction log state on the JVM side. For large tables this involves:
- Reading potentially thousands of Avro manifest files, each GZIP-compressed
- Allocating GenericDatumWriter/GenericDatumReader per manifest file
- GZIP decompression with 8KB buffers allocated per file
- JSON schema parsing for deduplication per entry
- Parallel reads via Future.sequence with JVM thread pool overhead
This is the single largest driver-side bottleneck for cold query startup on large tables.
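To make the per-file cost concrete, here is a minimal Java sketch of the decompression pattern described in the list above. It is not the actual indextables_spark code: the class name GzipManifestSketch, the helper names, and the sample payload are all hypothetical, and only standard java.util.zip calls are used. It shows one fresh 8 KB buffer being allocated for every manifest that is read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipManifestSketch {
    // Compress bytes so the example has a GZIP payload to read back.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress one manifest. A new 8 KB buffer is allocated on every call,
    // which is the per-file allocation the issue describes (assumed shape).
    static byte[] gunzip(byte[] compressed) {
        byte[] buffer = new byte[8 * 1024];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the bytes of one Avro manifest file.
        byte[] manifest = "hypothetical manifest bytes".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(gzip(manifest));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

With thousands of manifests, gunzip runs once per file, so the buffer and stream objects are allocated thousands of times on the driver.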
Proposed Approach
- New Rust component: TransactionLogReader that understands the IndexTables Avro manifest format (V4 state format)
- Operations to handle natively:
  - GZIP decompression (Rust's flate2 is significantly faster than Java's GZIPInputStream)
  - Avro manifest deserialization (using the apache-avro crate)
  - Schema deduplication (currently done via JSON parsing per entry)
  - Partition filter evaluation during reading (reuse existing PartitionFilter infrastructure)
- Output: Arrow columnar batches via the existing Arrow C Data Interface FFI (same pattern as docBatchArrowFfi and readCheckpointPartArrowFfi)
- JNI interface (strawman):
    // Read all state from a transaction log directory, optionally filtered
    int readStateArrowFfi(
        String transactionLogPath,
        long[] arrayAddrs,
        long[] schemaAddrs,
        PartitionFilter partitionFilter  // optional, for partition pruning during read
    );
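The two logical steps in the proposal, schema deduplication and partition filtering during the read, can be sketched in a single pass. The sketch below is written in Java only to match the strawman signature; the proposal would implement this in Rust. StateReadSketch, Entry, and Row are hypothetical names, the filter is a plain Predicate standing in for PartitionFilter, and schemas are deduplicated by exact string match, which is a simplification and not necessarily how the existing code compares schemas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class StateReadSketch {
    // Hypothetical manifest entry: split path, partition values, schema JSON.
    static final class Entry {
        final String path;
        final Map<String, String> partition;
        final String schemaJson;
        Entry(String path, Map<String, String> partition, String schemaJson) {
            this.path = path;
            this.partition = partition;
            this.schemaJson = schemaJson;
        }
    }

    // Hypothetical output row: the schema is replaced by a small integer id.
    static final class Row {
        final String path;
        final int schemaId;
        Row(String path, int schemaId) {
            this.path = path;
            this.schemaId = schemaId;
        }
    }

    private final Map<String, Integer> schemaIds = new HashMap<>();
    private final List<String> schemas = new ArrayList<>();

    // Store each distinct schema string once and return its id.
    int internSchema(String schemaJson) {
        Integer existing = schemaIds.get(schemaJson);
        if (existing != null) {
            return existing;
        }
        int id = schemas.size();
        schemas.add(schemaJson);
        schemaIds.put(schemaJson, id);
        return id;
    }

    // Single pass: entries that fail the filter are skipped before any
    // further work (schema interning, row construction) is done for them.
    List<Row> read(List<Entry> entries, Predicate<Map<String, String>> filter) {
        List<Row> rows = new ArrayList<>();
        for (Entry e : entries) {
            if (!filter.test(e.partition)) {
                continue;
            }
            rows.add(new Row(e.path, internSchema(e.schemaJson)));
        }
        return rows;
    }

    int distinctSchemas() {
        return schemas.size();
    }

    public static void main(String[] args) {
        String schema = "{\"fields\":[\"id\"]}";
        List<Entry> entries = List.of(
            new Entry("a.split", Map.of("date", "2024-01-01"), schema),
            new Entry("b.split", Map.of("date", "2024-01-02"), schema),
            new Entry("c.split", Map.of("date", "2024-01-01"), schema));
        StateReadSketch reader = new StateReadSketch();
        List<Row> rows = reader.read(entries, p -> "2024-01-01".equals(p.get("date")));
        System.out.println(rows.size() + " rows, " + reader.distinctSchemas() + " schema(s)");
    }
}
```

Filtering before interning means pruned entries never contribute rows or schema strings to the output, which is the benefit the proposal expects from evaluating the partition filter during state materialization.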
Expected Impact
- 2-5x faster cold query startup on large tables (1000+ splits)
- Eliminates per-manifest JVM object allocation overhead
- Enables native partition pruning during state materialization
Dependencies
- Avro V4 state format specification (see docs/reference/protocol.md in indextables_spark)
- GZIP compression codec
- Existing PartitionFilter infrastructure
- Arrow C Data Interface FFI (already implemented)
Related
- indextables/indextables_spark integration issue (to be linked)
- Existing pattern: DeltaTableReader.readCheckpointPartArrowFfi()