Summary
Add support for two common IP address range shorthands in queries:
- CIDR notation: `192.168.1.0/24`, `10.0.0.0/8`, `2001:db8::/32`
- Wildcard patterns: `192.168.1.*`, `10.0.*.*`
Both are syntactic sugar for range queries and should be handled transparently in tantivy4java's query AST processing, so that callers (like IndexTables4Spark) can pass CIDR/wildcard values through existing term query and range query APIs without any special handling.
Motivation
IP address fields already support range queries (e.g., [192.168.1.0 TO 192.168.1.255]), but requiring callers to compute range bounds by hand defeats the purpose of having a typed IP field. CIDR and wildcard notations are the standard ways to express IP ranges, and tantivy4java should understand them natively.
| Shorthand | Equivalent Range |
|---|---|
| `192.168.1.0/24` | `[192.168.1.0 TO 192.168.1.255]` |
| `10.0.0.0/8` | `[10.0.0.0 TO 10.255.255.255]` |
| `2001:db8::/32` | `[2001:db8:: TO 2001:db8:ffff:ffff:ffff:ffff:ffff:ffff]` |
| `192.168.1.*` | `[192.168.1.0 TO 192.168.1.255]` |
| `10.0.*.*` | `[10.0.0.0 TO 10.0.255.255]` |
Proposed Approach: AST Touchup
The expansion logic should be concentrated in tantivy4java's query AST touchup phase. When tantivy4java receives a term query or equality match against an IP address field, it should:
- Inspect the value string for CIDR (`/`) or wildcard (`*`) patterns
- If detected, compute the inclusive range bounds (network address → broadcast address)
- Rewrite the AST node from a term query to an equivalent range query
This means callers like IndexTables4Spark do not need any special handling — they simply pass the CIDR/wildcard string as a normal value to SplitTermQuery or an IN list, and tantivy4java's AST processing transparently expands it.
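The rewrite pass described above could look roughly like the following. This is a minimal sketch only: the `Node`, `Term`, `Range`, and `Or` types and the `expander` callback are hypothetical stand-ins for illustration, not tantivy4java's actual AST classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

final class IpTouchupSketch {
    // Hypothetical AST node types; the real tantivy4java classes will differ.
    interface Node {}

    static final class Term implements Node {
        final String field, value;
        Term(String field, String value) { this.field = field; this.value = value; }
    }

    static final class Range implements Node {
        final String field, lower, upper; // both bounds inclusive
        Range(String field, String lower, String upper) {
            this.field = field; this.lower = lower; this.upper = upper;
        }
    }

    static final class Or implements Node {
        final List<Node> children;
        Or(List<Node> children) { this.children = children; }
    }

    /**
     * Rewrites shorthand terms on IP fields into ranges. The expander
     * (hypothetical) returns {lower, upper}, or empty if the value is
     * not a valid shorthand. All other nodes are returned unchanged.
     */
    static Node touchup(Node node, Set<String> ipFields,
                        Function<String, Optional<String[]>> expander) {
        if (node instanceof Or) {
            List<Node> rewritten = new ArrayList<>();
            for (Node child : ((Or) node).children) {
                rewritten.add(touchup(child, ipFields, expander));
            }
            return new Or(rewritten);
        }
        if (node instanceof Term) {
            Term t = (Term) node;
            boolean shorthand = t.value.indexOf('/') >= 0 || t.value.indexOf('*') >= 0;
            if (shorthand && ipFields.contains(t.field)) {
                Optional<String[]> bounds = expander.apply(t.value);
                if (bounds.isPresent()) {
                    return new Range(t.field, bounds.get()[0], bounds.get()[1]);
                }
            }
        }
        return node;
    }

    public static void main(String[] args) {
        // Stub expander that only knows one CIDR block, for demonstration.
        Function<String, Optional<String[]>> stub = v -> v.equals("10.0.0.0/8")
                ? Optional.of(new String[] {"10.0.0.0", "10.255.255.255"})
                : Optional.empty();
        Node in = new Or(Arrays.asList(new Term("ip", "10.0.0.0/8"), new Term("ip", "192.168.1.1")));
        Or out = (Or) touchup(in, Collections.singleton("ip"), stub);
        for (Node n : out.children) System.out.println(n.getClass().getSimpleName());
    }
}
```

Restricting the check to known IP fields keeps values such as `a/b` on text fields from being misread as CIDR.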
CIDR Expansion
- Parse `<ip>/<prefix_len>` from the term value
- Apply the prefix mask to compute the network address (lower bound)
- Fill host bits with 1s to compute the broadcast address (upper bound)
- Rewrite: `TermQuery(field, "10.0.0.0/8")` → `RangeQuery(field, inclusive("10.0.0.0"), inclusive("10.255.255.255"))`
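The mask arithmetic in these steps can be sketched in plain Java as follows. The class and method names (`CidrBounds`, `bounds`, `range`) are hypothetical, and the sketch assumes the part before the slash is already an IP literal, since `InetAddress.getByName` would otherwise attempt a DNS lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class CidrBounds {
    /** Returns {lower, upper} as raw address bytes (4 for IPv4, 16 for IPv6). */
    static byte[][] bounds(String cidr) {
        int slash = cidr.lastIndexOf('/');
        if (slash < 0) throw new IllegalArgumentException("not CIDR: " + cidr);
        byte[] addr;
        try {
            // No DNS lookup happens for IP literals; a real implementation
            // should reject non-literal input before reaching this call.
            addr = InetAddress.getByName(cidr.substring(0, slash)).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad address: " + cidr, e);
        }
        int prefix = Integer.parseInt(cidr.substring(slash + 1));
        if (prefix < 0 || prefix > addr.length * 8) {
            throw new IllegalArgumentException("bad prefix length: " + cidr);
        }
        byte[] lower = addr.clone();
        byte[] upper = addr.clone();
        for (int i = 0; i < addr.length; i++) {
            // Number of network (prefix) bits that fall inside byte i: 0..8.
            int networkBits = Math.max(0, Math.min(8, prefix - i * 8));
            int mask = (0xFF << (8 - networkBits)) & 0xFF;
            lower[i] = (byte) (addr[i] & mask);           // clear host bits
            upper[i] = (byte) (addr[i] | (~mask & 0xFF)); // set host bits
        }
        return new byte[][] {lower, upper};
    }

    /** Same bounds rendered as text; Java prints IPv6 without "::" compression. */
    static String[] range(String cidr) {
        byte[][] b = bounds(cidr);
        try {
            return new String[] {
                InetAddress.getByAddress(b[0]).getHostAddress(),
                InetAddress.getByAddress(b[1]).getHostAddress()
            };
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // unreachable: lengths are 4 or 16
        }
    }

    public static void main(String[] args) {
        String[] r = range("192.168.1.77/23");
        System.out.println(r[0] + " TO " + r[1]);
    }
}
```

Working byte by byte handles IPv4 and IPv6 with the same loop, and normalizes inputs whose host bits are not zero (for example `192.168.1.77/23`).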
Wildcard Expansion
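- Detect `*` in IPv4 octet positions
- Replace `*` with `0` for lower bound, `255` for upper bound
- Rewrite: `TermQuery(field, "192.168.1.*")` → `RangeQuery(field, inclusive("192.168.1.0"), inclusive("192.168.1.255"))`

Note that this substitution yields an exact range only when the wildcards occupy trailing octets. A pattern such as `10.*.0.1` does not correspond to a single contiguous range, so non-trailing wildcards need separate handling (or rejection).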
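A minimal sketch of the wildcard substitution, under one assumption that is not stated in the proposal: wildcards are accepted only in trailing octets, because replacing `*` with `0`/`255` in the middle of an address (e.g. `10.*.0.1`) would produce a range that also matches addresses outside the pattern. The `WildcardBounds` name is hypothetical.

```java
final class WildcardBounds {
    /** Returns {lower, upper} for a trailing-wildcard IPv4 pattern such as "10.0.*.*". */
    static String[] range(String pattern) {
        String[] octets = pattern.split("\\.", -1);
        if (octets.length != 4) {
            throw new IllegalArgumentException("expected 4 octets: " + pattern);
        }
        StringBuilder lower = new StringBuilder();
        StringBuilder upper = new StringBuilder();
        boolean seenWildcard = false;
        for (int i = 0; i < 4; i++) {
            if (i > 0) {
                lower.append('.');
                upper.append('.');
            }
            if (octets[i].equals("*")) {
                seenWildcard = true;
                lower.append(0);
                upper.append(255);
            } else {
                // Assumption of this sketch: a literal octet after a wildcard is rejected.
                if (seenWildcard) {
                    throw new IllegalArgumentException("wildcards must be trailing: " + pattern);
                }
                int value = Integer.parseInt(octets[i]);
                if (value < 0 || value > 255) {
                    throw new IllegalArgumentException("octet out of range: " + pattern);
                }
                lower.append(value);
                upper.append(value);
            }
        }
        return new String[] {lower.toString(), upper.toString()};
    }

    public static void main(String[] args) {
        String[] r = range("10.0.*.*");
        System.out.println(r[0] + " TO " + r[1]);
    }
}
```

With trailing wildcards only, each pattern is equivalent to a byte-aligned CIDR block (`10.0.*.*` is `10.0.0.0/16`), so the two expansions could share one code path.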
Where in the Code
The AST touchup for IP fields should sit alongside existing query rewriting logic. The key is that this runs after the caller constructs the query but before Tantivy's native query execution, so it's invisible to callers.
API Contract
From the caller's perspective, these should all just work without any code changes:
// Term query with CIDR — tantivy4java expands internally
Query.termQuery("ip", "192.168.1.0/24")
// Term query with wildcard — tantivy4java expands internally
Query.termQuery("ip", "10.0.*.*")
// SplitTermQuery with CIDR
new SplitTermQuery("ip", "192.168.1.0/24")
// IN-style boolean OR with mixed values — each term expanded independently
BooleanQuery.or(
Query.termQuery("ip", "10.0.0.0/8"), // → range query
Query.termQuery("ip", "172.16.0.0/12"), // → range query
Query.termQuery("ip", "192.168.1.1") // → stays as term query
)
// parseQuery with CIDR
parseQuery("ip:192.168.1.0/24")
// parseQuery with wildcard
parseQuery("ip:10.0.*.*")
Testing
tantivy4java should include unit tests covering the core expansion logic and query execution:
CIDR Tests
- Byte-aligned IPv4 masks (`/8`, `/16`, `/24`, `/32`): verify correct lower/upper bounds
- Non-byte-aligned IPv4 masks (`/17`, `/23`, `/27`): verify bit-level mask arithmetic is correct
- Edge cases: `/0` (all IPs), `/32` (single IP, equivalent to exact match)
- IPv6 CIDR (`2001:db8::/32`, `fe80::/10`, `::1/128`): verify 128-bit mask math
- IPv4-mapped IPv6 CIDR: ensure CIDR on IPv4 addresses works correctly given the internal `::ffff:` representation
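The IPv4-mapped case deserves a concrete illustration: when an IPv4 address is stored as `::ffff:a.b.c.d`, an IPv4 prefix length of `n` corresponds to `96 + n` bits in the 128-bit space. The sketch below shows that shift under the assumption that bounds are computed on the 16-byte mapped form; `MappedCidr` and its methods are hypothetical names.

```java
final class MappedCidr {
    /** Embeds 4 IPv4 bytes into the 16-byte IPv4-mapped form ::ffff:a.b.c.d. */
    static byte[] toMapped(byte[] v4) {
        if (v4.length != 4) throw new IllegalArgumentException("expected 4 bytes");
        byte[] out = new byte[16];
        out[10] = (byte) 0xFF;
        out[11] = (byte) 0xFF;
        System.arraycopy(v4, 0, out, 12, 4);
        return out;
    }

    /** Returns {lower, upper} in 128-bit space for an IPv4 address and IPv4 prefix length. */
    static byte[][] mappedBounds(byte[] v4, int v4Prefix) {
        if (v4Prefix < 0 || v4Prefix > 32) {
            throw new IllegalArgumentException("bad IPv4 prefix length: " + v4Prefix);
        }
        byte[] lower = toMapped(v4);
        byte[] upper = lower.clone();
        // The first 96 bits (80 zero bits + 16 one bits) are always part of the prefix.
        for (int bit = 96 + v4Prefix; bit < 128; bit++) {
            int mask = 0x80 >> (bit % 8);
            lower[bit / 8] &= ~mask; // clear host bit
            upper[bit / 8] |= mask;  // set host bit
        }
        return new byte[][] {lower, upper};
    }

    public static void main(String[] args) {
        byte[][] b = mappedBounds(new byte[] {(byte) 192, (byte) 168, 1, 0}, 24);
        System.out.println((b[1][15] & 0xFF)); // last byte of the upper bound
    }
}
```

One consequence worth testing: with this mapping, `0.0.0.0/0` spans only `::ffff:0.0.0.0` to `::ffff:255.255.255.255`, not the whole IPv6 space, which may or may not be what "all IPs" is intended to mean for `/0`.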
Wildcard Tests
- Single wildcard: `192.168.1.*` (last octet)
- Multiple wildcards: `10.0.*.*`, `10.*.*.*` (multiple octets)
- All wildcards: `*.*.*.*`, equivalent to match-all (edge case)
- Mixed positions: verify wildcards work in any octet position
Query Execution Tests
- Term query with CIDR: Index several IPs, query with CIDR, verify correct docs returned
- Term query with wildcard: Same pattern
- Boolean OR with mixed CIDR/literal: Verify both expanded ranges and exact terms match correctly
- parseQuery with CIDR: verify `ip:192.168.1.0/24` works in query string syntax
- Boundary IPs: verify that IPs exactly at the subnet boundaries are included (e.g., `192.168.1.0` and `192.168.1.255` for `/24`)
- IPs just outside the range: verify that `192.168.0.255` and `192.168.2.0` are excluded for `192.168.1.0/24`
- Result count verification: for a known dataset, verify CIDR query returns same count as equivalent explicit range query
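For the execution tests, it may help to have an independent oracle that decides membership by mask comparison rather than by range bounds, so that expected results are not computed with the same logic under test. The following IPv4-only sketch is one way to do that; `IpRangeOracle` is a hypothetical test helper, not part of tantivy4java.

```java
final class IpRangeOracle {
    /** Parses dotted-quad IPv4 into an unsigned 32-bit value held in a long. */
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.", -1);
        if (parts.length != 4) throw new IllegalArgumentException("bad IPv4: " + ipv4);
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) throw new IllegalArgumentException("bad IPv4: " + ipv4);
            value = (value << 8) | octet;
        }
        return value;
    }

    /** True if ip lies inside the IPv4 CIDR block, decided by comparing masked prefixes. */
    static boolean inCidr(String ip, String cidr) {
        int slash = cidr.indexOf('/');
        if (slash < 0) throw new IllegalArgumentException("not CIDR: " + cidr);
        int prefix = Integer.parseInt(cidr.substring(slash + 1));
        if (prefix < 0 || prefix > 32) throw new IllegalArgumentException("bad prefix: " + cidr);
        long mask = prefix == 0 ? 0L : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toLong(ip) & mask) == (toLong(cidr.substring(0, slash)) & mask);
    }

    public static void main(String[] args) {
        String cidr = "192.168.1.0/24";
        for (String ip : new String[] {"192.168.0.255", "192.168.1.0", "192.168.1.255", "192.168.2.0"}) {
            System.out.println(ip + " in " + cidr + ": " + inCidr(ip, cidr));
        }
    }
}
```

A test can then index a known set of addresses, filter it with `inCidr` to obtain the expected documents, and compare that set against what the CIDR query returns.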
Scope
- AST touchup: detect CIDR/wildcard in IP field term values, rewrite to range queries
- CIDR: both IPv4 (`/0` to `/32`) and IPv6 (`/0` to `/128`)
- Wildcards: IPv4 only (not standard for IPv6)
- Works transparently with `SplitTermQuery`, `Query.termQuery()`, `parseQuery()`, and boolean combinations