Support CIDR notation and wildcard patterns for IP address queries #111

@schenksj

Summary

Add support for two common IP address range shorthands in queries:

  1. CIDR notation: 192.168.1.0/24, 10.0.0.0/8, 2001:db8::/32
  2. Wildcard patterns: 192.168.1.*, 10.0.*.*

Both are syntactic sugar for range queries and should be handled transparently in tantivy4java's query AST processing, so that callers (like IndexTables4Spark) can pass CIDR/wildcard values through existing term query and range query APIs without any special handling.

Motivation

IP address fields already support range queries (e.g., [192.168.1.0 TO 192.168.1.255]), but requiring callers to compute range bounds by hand defeats the purpose of having a typed IP field. CIDR and wildcard notations are the usual ways to express IP ranges, and tantivy4java should understand them natively.

Shorthand         Equivalent Range
192.168.1.0/24    [192.168.1.0 TO 192.168.1.255]
10.0.0.0/8        [10.0.0.0 TO 10.255.255.255]
2001:db8::/32     [2001:db8:: TO 2001:db8:ffff:ffff:ffff:ffff:ffff:ffff]
192.168.1.*       [192.168.1.0 TO 192.168.1.255]
10.0.*.*          [10.0.0.0 TO 10.0.255.255]

Proposed Approach: AST Touchup

The expansion logic should be concentrated in tantivy4java's query AST touchup phase. When tantivy4java receives a term query or equality match against an IP address field, it should:

  1. Inspect the value string for CIDR (/) or wildcard (*) patterns
  2. If detected, compute the inclusive range bounds (network address → broadcast address)
  3. Rewrite the AST node from a term query to an equivalent range query

This means callers like IndexTables4Spark do not need any special handling — they simply pass the CIDR/wildcard string as a normal value to SplitTermQuery or an IN list, and tantivy4java's AST processing transparently expands it.
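
As a rough illustration of the rewrite step, here is a minimal Java sketch. The node types (Term, Range, Or), the class name IpQueryTouchup, and the pluggable expand function are all hypothetical and are not tantivy4java's actual AST classes; the sketch only shows the shape of a recursive pass that replaces shorthand terms on IP fields and leaves everything else untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

public class IpQueryTouchup {
    // Hypothetical AST node types, for illustration only.
    public interface Node {}
    public record Term(String field, String value) implements Node {}
    public record Range(String field, String lower, String upper) implements Node {}
    public record Or(List<Node> children) implements Node {}

    /** Cheap syntactic check: does the value contain a CIDR slash or a wildcard? */
    public static boolean looksLikeShorthand(String value) {
        return value.indexOf('/') >= 0 || value.indexOf('*') >= 0;
    }

    /**
     * Recursively rewrites shorthand terms on IP fields into inclusive ranges.
     * `expand` returns {lower, upper} for a shorthand value, or empty if the
     * value cannot be expanded (the term is then left as-is).
     */
    public static Node touchup(Node node,
                               Predicate<String> isIpField,
                               Function<String, Optional<String[]>> expand) {
        if (node instanceof Or or) {
            List<Node> rewritten = new ArrayList<>();
            for (Node child : or.children()) {
                rewritten.add(touchup(child, isIpField, expand));
            }
            return new Or(rewritten);
        }
        if (node instanceof Term term
                && isIpField.test(term.field())
                && looksLikeShorthand(term.value())) {
            Optional<String[]> bounds = expand.apply(term.value());
            if (bounds.isPresent()) {
                return new Range(term.field(), bounds.get()[0], bounds.get()[1]);
            }
        }
        return node; // literals and non-IP fields pass through unchanged
    }
}
```

Passing the expansion logic in as a function keeps the tree walk separate from the address arithmetic, so each can be unit-tested on its own.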

CIDR Expansion

  1. Parse <ip>/<prefix_len> from the term value
  2. Apply the prefix mask to compute the network address (lower bound)
  3. Fill host bits with 1s to compute the broadcast address (upper bound)
  4. Rewrite: TermQuery(field, "10.0.0.0/8") → RangeQuery(field, inclusive("10.0.0.0"), inclusive("10.255.255.255"))
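
The mask arithmetic in steps 2 and 3 can be sketched byte by byte, which handles IPv4 and IPv6 with the same loop. This is an illustrative sketch using only the JDK's java.net.InetAddress; the class name CidrRange and its methods are hypothetical, not part of tantivy4java.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class CidrRange {
    /** Returns {lower, upper} as raw address bytes (4 for IPv4, 16 for IPv6). */
    public static byte[][] bounds(String cidr) {
        int slash = cidr.indexOf('/');
        if (slash < 0) throw new IllegalArgumentException("not CIDR: " + cidr);
        byte[] addr;
        try {
            // No DNS lookup happens for IP literals. A non-literal string would
            // be resolved as a hostname, so real code should validate first.
            addr = InetAddress.getByName(cidr.substring(0, slash)).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad address in " + cidr, e);
        }
        int prefix = Integer.parseInt(cidr.substring(slash + 1));
        if (prefix < 0 || prefix > addr.length * 8) {
            throw new IllegalArgumentException("bad prefix length in " + cidr);
        }
        byte[] lower = addr.clone();
        byte[] upper = addr.clone();
        for (int i = 0; i < addr.length; i++) {
            // Number of host bits that fall inside byte i (0..8).
            int hostBits = Math.min(8, Math.max(0, (i + 1) * 8 - prefix));
            int hostMask = (1 << hostBits) - 1;
            lower[i] = (byte) (addr[i] & ~hostMask); // clear host bits
            upper[i] = (byte) (addr[i] | hostMask);  // set host bits
        }
        return new byte[][] { lower, upper };
    }

    /** Same bounds, rendered as address strings. */
    public static String[] expand(String cidr) {
        byte[][] b = bounds(cidr);
        try {
            return new String[] {
                InetAddress.getByAddress(b[0]).getHostAddress(),
                InetAddress.getByAddress(b[1]).getHostAddress()
            };
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // lengths are always 4 or 16
        }
    }
}
```

Note that Java renders IPv6 addresses without "::" compression, so the lower bound of 2001:db8::/32 comes back as 2001:db8:0:0:0:0:0:0; comparing raw bytes avoids depending on the text form.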

Wildcard Expansion

  1. Detect * in IPv4 octet positions
  2. Replace * with 0 for lower bound, 255 for upper bound
  3. Rewrite: TermQuery(field, "192.168.1.*") → RangeQuery(field, inclusive("192.168.1.0"), inclusive("192.168.1.255"))
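
A minimal sketch of the substitution, with a hypothetical class name WildcardRange. One assumption is made loudly here: the sketch rejects a wildcard followed by a concrete octet (e.g., 10.*.1.1), because that pattern does not describe a single contiguous range and the 0/255 substitution would over-match. Whether such patterns should be rejected or handled some other way is a design decision this issue does not settle.

```java
public class WildcardRange {
    /** Expands an IPv4 wildcard pattern into {lower, upper} address strings. */
    public static String[] expand(String pattern) {
        String[] octets = pattern.split("\\.", -1);
        if (octets.length != 4) {
            throw new IllegalArgumentException("expected 4 octets: " + pattern);
        }
        StringBuilder lower = new StringBuilder();
        StringBuilder upper = new StringBuilder();
        boolean seenWildcard = false;
        for (int i = 0; i < 4; i++) {
            String lo;
            String hi;
            if (octets[i].equals("*")) {
                seenWildcard = true;
                lo = "0";
                hi = "255";
            } else {
                // Assumption of this sketch: only trailing wildcards are accepted.
                if (seenWildcard) {
                    throw new IllegalArgumentException(
                        "non-trailing wildcard is not a contiguous range: " + pattern);
                }
                int v = Integer.parseInt(octets[i]);
                if (v < 0 || v > 255) {
                    throw new IllegalArgumentException("octet out of range: " + pattern);
                }
                lo = hi = Integer.toString(v);
            }
            if (i > 0) {
                lower.append('.');
                upper.append('.');
            }
            lower.append(lo);
            upper.append(hi);
        }
        return new String[] { lower.toString(), upper.toString() };
    }
}
```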

Where in the Code

The AST touchup for IP fields should sit alongside existing query rewriting logic. The key is that this runs after the caller constructs the query but before Tantivy's native query execution, so it's invisible to callers.

API Contract

From the caller's perspective, these should all just work without any code changes:

// Term query with CIDR — tantivy4java expands internally
Query.termQuery("ip", "192.168.1.0/24")

// Term query with wildcard — tantivy4java expands internally
Query.termQuery("ip", "10.0.*.*")

// SplitTermQuery with CIDR
new SplitTermQuery("ip", "192.168.1.0/24")

// IN-style boolean OR with mixed values — each term expanded independently
BooleanQuery.or(
  Query.termQuery("ip", "10.0.0.0/8"),       // → range query
  Query.termQuery("ip", "172.16.0.0/12"),     // → range query
  Query.termQuery("ip", "192.168.1.1")        // → stays as term query
)

// parseQuery with CIDR
parseQuery("ip:192.168.1.0/24")

// parseQuery with wildcard
parseQuery("ip:10.0.*.*")

Testing

tantivy4java should include unit tests covering the core expansion logic and query execution:

CIDR Tests

  • Byte-aligned IPv4 masks: /8, /16, /24, /32 — verify correct lower/upper bounds
  • Non-byte-aligned IPv4 masks: /17, /23, /27 — verify bit-level mask arithmetic is correct
  • Edge cases: /0 (all IPs), /32 (single IP, equivalent to exact match)
  • IPv6 CIDR: 2001:db8::/32, fe80::/10, ::1/128 — verify 128-bit mask math
  • IPv4-mapped IPv6 CIDR: Ensure CIDR on IPv4 addresses works correctly given the internal ::ffff: representation
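
For the IPv4-mapped case, one way to reason about it: if IPv4 addresses are stored as ::ffff:a.b.c.d, the four IPv4 bytes occupy the last 32 of 128 bits, so an IPv4 prefix of length n corresponds to a prefix of length 96 + n in the mapped space. The sketch below illustrates that conversion; the class and method names are hypothetical, and it assumes nothing about how tantivy4java performs the mapping internally.

```java
public class MappedCidr {
    /** Embeds 4 IPv4 bytes into the 16-byte IPv4-mapped form ::ffff:a.b.c.d. */
    public static byte[] toMapped(byte[] v4) {
        if (v4.length != 4) {
            throw new IllegalArgumentException("expected 4 bytes");
        }
        byte[] out = new byte[16];       // bytes 0..9 stay zero
        out[10] = (byte) 0xff;
        out[11] = (byte) 0xff;
        System.arraycopy(v4, 0, out, 12, 4);
        return out;
    }

    /** Translates an IPv4 prefix length into the equivalent mapped-space length. */
    public static int toMappedPrefix(int v4Prefix) {
        if (v4Prefix < 0 || v4Prefix > 32) {
            throw new IllegalArgumentException("IPv4 prefix must be 0..32");
        }
        return 96 + v4Prefix;
    }
}
```

Under this mapping, an IPv4 /0 becomes /96, which covers every IPv4-mapped address but not the whole 128-bit space. That makes the "/0 (all IPs)" edge case worth pinning down explicitly in tests.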

Wildcard Tests

  • Single wildcard: 192.168.1.* — last octet
  • Multiple wildcards: 10.0.*.*, 10.*.*.* — multiple octets
  • All wildcards: *.*.*.* — equivalent to match-all (edge case)
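  • Mixed positions: Verify behavior when a wildcard appears in a non-trailing octet (e.g., 10.*.1.1). Such a pattern is not a single contiguous range, so a plain 0/255 substitution would also match addresses like 10.1.5.5; the tests should pin down whether these patterns are rejected or expanded some other way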

Query Execution Tests

  • Term query with CIDR: Index several IPs, query with CIDR, verify correct docs returned
  • Term query with wildcard: Same pattern
  • Boolean OR with mixed CIDR/literal: Verify both expanded ranges and exact terms match correctly
  • parseQuery with CIDR: Verify ip:192.168.1.0/24 works in query string syntax
  • Boundary IPs: Verify that IPs exactly at the subnet boundaries are included (e.g., 192.168.1.0 and 192.168.1.255 for /24)
  • IPs just outside the range: Verify that 192.168.0.255 and 192.168.2.0 are excluded for 192.168.1.0/24
  • Result count verification: For a known dataset, verify CIDR query returns same count as equivalent explicit range query
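
For the boundary and just-outside checks, a small independent oracle can be handy in tests so that expected results are not computed by the code under test. The sketch below is one possible helper (hypothetical names) that compares raw address bytes as unsigned values using the JDK's Arrays.compareUnsigned (Java 9+).

```java
import java.util.Arrays;

public class IpRangeOracle {
    /** Builds a 4-byte IPv4 address from its octets. */
    public static byte[] ipv4(int a, int b, int c, int d) {
        return new byte[] { (byte) a, (byte) b, (byte) c, (byte) d };
    }

    /** True if lower <= ip <= upper, comparing bytes as unsigned, big-endian. */
    public static boolean contains(byte[] lower, byte[] upper, byte[] ip) {
        if (lower.length != ip.length || upper.length != ip.length) {
            throw new IllegalArgumentException("address lengths differ");
        }
        return Arrays.compareUnsigned(lower, ip) <= 0
            && Arrays.compareUnsigned(ip, upper) <= 0;
    }
}
```

With the bounds for 192.168.1.0/24, this helper accepts 192.168.1.0 and 192.168.1.255 and rejects 192.168.0.255 and 192.168.2.0, mirroring the two boundary bullets above.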

Scope

  • AST touchup: detect CIDR/wildcard in IP field term values, rewrite to range queries
  • CIDR: both IPv4 (/0 to /32) and IPv6 (/0 to /128)
  • Wildcards: IPv4 only (not standard for IPv6)
  • Works transparently with SplitTermQuery, Query.termQuery(), parseQuery(), and boolean combinations

Companion Issue
