Skip to content

indextables/indextables_spark

Repository files navigation

IndexTables for Spark

IndexTables is an open-table format for Apache Spark that enables fast retrieval and full-text search across large-scale data. It integrates seamlessly with Spark SQL, combining powerful search capabilities with joins, aggregations, and standard SQL operations. Originally built for log observability and cybersecurity investigations, IndexTables works well for any use case requiring fast data retrieval.

IndexTables runs entirely within your existing Spark cluster with no additional infrastructure. It stores data in object storage (AWS S3 and Azure Blob Storage) and has been verified on OSS Spark 3.5.2 and Databricks 15.4 LTS.

Documentation: https://www.indextables.io

Development Status: IndexTables is under active development. APIs and features may change. We recommend testing in non-production environments before deploying to production workloads.

Key Features

  • Embedded Search - Runs directly within Spark executors, no additional infrastructure
  • Multi-Cloud Storage - AWS S3 and Azure Blob Storage fully supported
  • Full-Text Search - Native indexquery operator with complete Tantivy search syntax
  • Predicate Pushdown - WHERE clause filters convert to native search operations
  • Aggregate Pushdown - COUNT, SUM, AVG, MIN, MAX execute directly in the search engine
  • Bucket Aggregations - DateHistogram, Histogram, and Range for time-series analysis
  • JSON Field Support - Native Struct, Array, and Map fields with filter pushdown
  • Smart File Skipping - Delta/Iceberg-style transaction log with min/max statistics
  • Batch Optimization - 90-95% reduction in S3 GET requests (enabled by default)
  • L2 Disk Cache - Persistent NVMe caching with LZ4/ZSTD compression (auto-enabled on Databricks/EMR)
  • Cache Prewarming - SQL command to eliminate cold-start latency

Quick Start

Installation

  1. Add the IndexTables JAR to your Spark classpath
  2. Set spark.sql.extensions=io.indextables.extensions.IndexTablesSparkExtensions
  3. Requires Java 11+

See the Installation Guide for detailed instructions including Databricks setup.

Basic Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IndexTables Example").getOrCreate()

# Write data (declare "message" as full-text searchable)
df.write \
    .format("io.indextables.provider.IndexTablesProvider") \
    .mode("append") \
    .option("spark.indextables.indexing.typemap.message", "text") \
    .save("s3://bucket/path/table")

# Merge index segments for optimal performance
spark.sql("MERGE SPLITS 's3://bucket/path/table' TARGET SIZE 4G")

# Read and query
df = spark.read \
    .format("io.indextables.provider.IndexTablesProvider") \
    .load("s3://bucket/path/table")

df.createOrReplaceTempView("my_table")

# Full-text search with SQL
spark.sql("""
    SELECT * FROM my_table
    WHERE category = 'technology'
      AND message indexquery 'critical AND infrastructure'
    LIMIT 100
""").show()

# Cross-field search
spark.sql("SELECT * FROM my_table WHERE _indexall indexquery 'error'").show()

Documentation

Topic Description
Getting Started Installation, quickstart, first index
Core Concepts Split architecture, transaction log, field types
Configuration Writer, reader, cache, and cloud settings
Query Guide Filter pushdown, aggregations, full-text search
SQL Commands MERGE SPLITS, PURGE, PREWARM CACHE, and more
Cloud Deployment Databricks, AWS EMR deployment guides
Configuration Reference Complete configuration options
Protocol Specification Transaction log format, file structure, protocol versions
Table Protocol ACID guarantees, schema evolution, time travel

Development

export JAVA_HOME=/opt/homebrew/opt/openjdk@11  # Java 11 required
mvn clean compile                              # Build
mvn test                                       # Run tests

# Run single test
mvn test-compile scalatest:test -DwildcardSuites='io.indextables.spark.core.YourTest'

Under the hood, IndexTables uses Tantivy and Quickwit splits instead of Parquet.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

About

IndexTables is an experimental open-table format for Apache Spark that enables fast retrieval and full-text search across large-scale data. It integrates seamlessly with Spark SQL, allowing you to combine powerful search capabilities with joins, aggregations, and standard SQL operations.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages