Skip to content

Fix Arrow FFI hash-to-string resolution for fingerprint terms aggregations#126

Merged
schenksj merged 2 commits intomainfrom
fix/arrow-ffi-hash-resolution
Mar 7, 2026
Merged

Fix Arrow FFI hash-to-string resolution for fingerprint terms aggregations#126
schenksj merged 2 commits intomainfrom
fix/arrow-ffi-hash-resolution

Conversation

@schenksj
Copy link
Collaborator

@schenksj schenksj commented Mar 7, 2026

Summary

  • Fix regression where Arrow FFI terms aggregations on string fingerprint fields return raw U64 hash values instead of original strings
  • Thread hash_resolution_map from Phase 3 touchup through all Arrow FFI conversion paths
  • Applies to single-split, multi-split, and multi-split-multi-agg Arrow FFI methods
  • Bump version to 0.32.4

Root Cause

perform_search_async_impl_leaf_response_with_aggregations correctly rewrites terms aggs to use _phash_<field> and builds a hash→string resolution map (Phase 3). The JNI object path used this map, but the Arrow FFI path discarded it — key_to_string(Key::U64(v)) just did v.to_string(), producing raw hashes like "18264002321125465436" instead of "alice".

Test plan

  • Verify terms aggregation on string fingerprint fields returns original strings via Arrow FFI
  • Verify multi-split terms aggregation merges hash resolution maps correctly
  • Verify nested bucket flattening (Terms→Terms) resolves hash keys
  • Verify non-fingerprint splits are unaffected (resolution map is None)

🤖 Generated with Claude Code

…t terms aggregations

The Arrow FFI aggregation path discarded the hash_resolution_map built by
Phase 3 touchup, causing terms aggregations on string fingerprint fields
to return raw U64 hash values (e.g. "18264002321125465436") instead of
the original string values ("alice", "bob").

Thread hash_resolution_map through aggregation_result_to_record_batch and
all nested bucket flattening functions. Apply resolution in all three
Arrow FFI JNI methods: single-split, multi-split, and multi-split-multi-agg.

Bump version to 0.32.4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 7, 2026

Claude finished @schenksj's task —— View job


Code Review Complete ✅

Summary: This PR correctly fixes the Arrow FFI hash resolution regression. The implementation properly threads the hash_resolution_map through all Arrow FFI conversion paths and correctly resolves U64 hash values back to original strings.

Review Results:

  • Correctness: ✅ No logic errors found
  • Security: ✅ No security issues identified
  • Build Issues: ✅ No merge conflicts or build-breaking changes
  • Error Propagation: ✅ Proper error handling maintained
  • API Consistency: ✅ Consistent with existing patterns

Key Changes Reviewed:

1. Enhanced Arrow FFI functions (aggregation_arrow_ffi.rs):

  • aggregation_result_to_record_batch_with_hash_resolution() correctly accepts hash_resolution_map parameter
  • key_to_string_resolved() properly resolves U64 hash keys using the resolution map
  • ✅ Terms aggregation now uses key_to_string_resolved() instead of key_to_string()
  • ✅ Nested bucket flattening correctly passes hash resolution through recursion

2. JNI Arrow FFI entry points (jni_agg_arrow.rs):

  • ✅ All three methods properly extract hash_resolution_map from SearchWithHashContext
  • ✅ Single-split aggregation: Line 127 correctly passes ctx.hash_resolution_map.as_ref()
  • ✅ Multi-split aggregation: Lines 245-247 correctly merge hash maps from multiple splits
  • ✅ Multi-split multi-agg: Lines 543-545 correctly collect and merge hash maps

3. Hash resolution flow verification:

  • perform_search_async_impl_leaf_response_with_aggregations() builds hash resolution map in Phase 3
  • ✅ Hash map is correctly bundled in SearchWithHashContext
  • ✅ JNI methods extract the hash map and pass it to Arrow FFI conversion
  • ✅ Arrow FFI methods use the map to resolve hash values back to strings

No Issues Found:

  • Logic: Hash resolution logic is correct throughout the call chain
  • Memory Safety: Proper use of Option<&HashMap<u64, String>> references
  • Error Handling: Existing error propagation patterns maintained
  • Thread Safety: No thread safety concerns in the changes
  • Performance: Efficient hash map merging for multi-split scenarios

The fix correctly addresses the root cause where Arrow FFI paths were discarding the hash resolution map that JNI paths were correctly using.

Two new tests in AggregationArrowFfiTest:
- testArrowFfiTermsAggHashResolution: terms agg on "category" field with
  companion hash opt, verifies keys are "cat_X" strings not U64 hashes
- testArrowFfiTermsSubAggHashResolution: terms agg on "name" field with
  avg(score) sub-aggregation, verifies keys are "item_X" strings

Both create a companion split with withStringHashOptimization(true) and
exercise the Arrow FFI path (aggregateArrowFfi + nativeReadAggArrowColumnsAsJson)
to catch any future regression in hash resolution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 7, 2026

Claude finished @schenksj's task —— View job


PR Code Review ✅

Reviewing the PR for:

  • Correctness bugs (logic errors, off-by-one, null safety)
  • Security issues (injection, unsafe operations, credential handling)
  • Merge conflict markers or build-breaking issues
  • Missing error propagation (especially JNI error swallowing)
  • API consistency with existing patterns in the codebase

Review Results: ✅ No Issues Found

Summary: This PR correctly fixes the Arrow FFI hash resolution regression. The implementation properly threads the hash_resolution_map through all Arrow FFI conversion paths and correctly resolves U64 hash values back to original strings.

Key Changes Reviewed:

1. Enhanced Arrow FFI functions (aggregation_arrow_ffi.rs):

  • aggregation_result_to_record_batch_with_hash_resolution() correctly accepts hash_resolution_map parameter
  • key_to_string_resolved() properly resolves U64 hash keys using the resolution map
  • ✅ Terms aggregation now uses key_to_string_resolved() instead of key_to_string()
  • ✅ Nested bucket flattening correctly passes hash resolution through recursion

2. JNI Arrow FFI entry points (jni_agg_arrow.rs):

  • ✅ All three methods properly extract hash_resolution_map from SearchWithHashContext
  • ✅ Single-split aggregation: Line 127 correctly passes ctx.hash_resolution_map.as_ref()
  • ✅ Multi-split aggregation: Lines 245-247 correctly merge hash maps from multiple splits
  • ✅ Multi-split multi-agg: Lines 543-545 correctly collect and merge hash maps

3. Hash resolution flow verification:

  • perform_search_async_impl_leaf_response_with_aggregations() builds hash resolution map in Phase 3
  • ✅ Hash map is correctly bundled in SearchWithHashContext
  • ✅ JNI methods extract the hash map and pass it to Arrow FFI conversion
  • ✅ Arrow FFI methods use the map to resolve hash values back to strings

4. Comprehensive test coverage (AggregationArrowFfiTest.java):

  • ✅ New regression tests specifically validate string key resolution vs. raw U64 hashes
  • ✅ Tests cover both simple terms aggregations and terms+sub-aggregations
  • ✅ Tests use companion splits with string hash optimization enabled to reproduce the exact regression scenario

Specific Findings:

✅ Correctness:

  • No logic errors found
  • Hash resolution logic is correct throughout the call chain
  • Null safety properly handled with Option<&HashMap<u64, String>>

✅ Security:

  • No security issues identified
  • Proper memory handling with Arrow FFI operations
  • Safe use of HashMap references without ownership transfers

✅ Build Issues:

  • No merge conflicts or build-breaking changes
  • Version bump to 0.32.4 properly applied

✅ Error Propagation:

  • Existing error handling patterns maintained throughout
  • No JNI error swallowing introduced
  • Proper anyhow::Error propagation preserved

✅ API Consistency:

  • Changes follow existing patterns in the codebase
  • Consistent parameter naming and function signatures
  • Proper use of Option<&HashMap<u64, String>> references

The fix correctly addresses the root cause where Arrow FFI paths were discarding the hash resolution map that JNI paths were correctly using. Hash-to-string resolution now works consistently across all aggregation export paths.


@schenksj schenksj merged commit 6ec1df1 into main Mar 7, 2026
1 check passed
@schenksj schenksj deleted the fix/arrow-ffi-hash-resolution branch March 7, 2026 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant