Feature Request: Quickwit Fork Change Analysis and Cleanup
Overview
Analyze all modifications made to the quickwit fork (../quickwit) and determine which changes are actually utilized by the final tantivy4java product. Remove unnecessary changes that were made for features that were ultimately redesigned or not implemented.
Problem Statement
During the development of tantivy4java, various modifications were made to our quickwit fork to support different integration approaches. However, some of these changes:
- Were made for features that were later redesigned
- Are no longer used in the current implementation
- May complicate future maintenance and upstream synchronization
- Could cause confusion about actual dependencies
Objectives
1. Comprehensive Change Analysis
- Identify all modifications in the quickwit fork relative to upstream
- Map each change to specific tantivy4java features or use cases
- Document the purpose of each modification
- Verify active usage through code path analysis
2. Usage Verification
For each modification, determine:
- Is it actively used? - Called by current tantivy4java code
- Is it critical? - Required for core functionality
- Is it redundant? - Superseded by alternative implementations
- Is it experimental? - Made for testing but not production code
3. Cleanup Strategy
- Remove unused modifications that serve no current purpose
- Document remaining changes with clear justification
- Simplify maintenance burden by minimizing divergence from upstream
- Prepare for potential upstreaming of valuable changes
Analysis Methodology
Step 1: Identify All Quickwit Fork Changes
```shell
# Compare fork against upstream quickwit
cd ../quickwit
git remote add upstream https://github.com/quickwit-oss/quickwit.git
git fetch upstream
git diff upstream/main..HEAD > /tmp/quickwit-fork-changes.patch

# Analyze commit history
git log upstream/main..HEAD --oneline --no-merges > /tmp/quickwit-fork-commits.txt
```
Step 2: Map Changes to Tantivy4Java Usage
For each modification, check:
- Native Rust code (`tantivy4java/native/src/*.rs`) - JNI bindings that call quickwit
- Dependency declarations (`tantivy4java/native/Cargo.toml`) - which quickwit crates are used
- Merge functionality (`perform_quickwit_merge_standalone`, `QuickwitSplit.mergeSplits()`)
- Split conversion (`convertIndexFromPath`, split file operations)
- Process-based merge binary (`tantivy4java-merge` standalone executable)
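Much of this mapping can be bootstrapped with a grep sweep over the tantivy4java sources. A minimal sketch: the `check_usage` helper, the search paths, and the example symbols are illustrative assumptions, not existing tooling, and a grep hit only proves a reference exists, not that the code path is live.

```shell
#!/usr/bin/env bash
# Sketch: report which quickwit-side symbols the tantivy4java sources
# actually mention. Directory layout and symbol names are assumptions.
set -u

check_usage() {
  # $1 = source tree to search; remaining args = symbols to look for
  local src_dir=$1; shift
  local sym
  for sym in "$@"; do
    if grep -rq --include='*.rs' --include='*.java' "$sym" "$src_dir"; then
      echo "USED    $sym"
    else
      echo "UNUSED  $sym"
    fi
  done
}

# Example invocation (assumed paths):
# check_usage ../tantivy4java/native/src \
#   perform_quickwit_merge_standalone convertIndexFromPath
```

Symbols that come back `UNUSED` are candidates for Category C or D below; a `USED` hit still needs a manual check that the reference sits on a production code path.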
Step 3: Categorize Changes
Category A: Active Production Use
Changes that are:
- Called by production tantivy4java code paths
- Required for core functionality (split merge, split search)
- Part of public API features
Category B: Development/Testing Only
Changes that are:
- Used only in test code or examples
- Made for experimental features not in production
- Debugging aids not required for operation
Category C: Obsolete/Superseded
Changes that are:
- Made for features that were redesigned
- No longer reachable from current code
- Replaced by alternative implementations
Category D: Uncertain/Needs Investigation
Changes that:
- Have unclear purpose or documentation
- May have indirect usage through dependencies
- Require deeper analysis to verify usage
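A first-pass categorization can be scripted by pairing the fork's changed files with grep evidence from the tantivy4java tree. This is a sketch under the repo layout described above; the `categorize` helper and its `A?`/`C/D` labels are hypothetical conventions, and the output is only a starting point for manual triage into the four categories.

```shell
#!/usr/bin/env bash
# Sketch: propose a provisional category for each file the fork modifies,
# based on whether anything in the tantivy4java tree mentions its name.
set -u

categorize() {
  # $1 = quickwit fork path, $2 = upstream ref, $3 = tantivy4java source dir
  git -C "$1" diff --name-only "$2"..HEAD | while read -r file; do
    stem=$(basename "$file" .rs)
    if grep -rq "$stem" "$3"; then
      echo "A?  $file"    # referenced somewhere: candidate Active
    else
      echo "C/D $file"    # no direct reference: Obsolete, or needs digging
    fi
  done
}

# Example invocation (assumed paths):
# categorize ../quickwit upstream/main ../tantivy4java/native/src
```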
Key Areas to Investigate
1. Merge Functionality
Files to check: quickwit/quickwit-indexing/src/merge_policy.rs, merge executor code
- Which merge-related changes are used by `perform_quickwit_merge_standalone()`?
- Are modifications to `MergeExecutor`, `MergePolicy`, or merge configuration actually used?
- Do we use quickwit's merge logic directly or through our own wrappers?
2. Split File Format
Files to check: Split serialization/deserialization, bundle directory code
- Which split format changes are required for our split conversion?
- Are modifications to split metadata, compression, or file structure necessary?
- Do we rely on any custom split file format extensions?
3. Storage Backend (S3)
Files to check: S3 storage implementation, credential handling
- Which S3-related changes support our `s3://` URL handling?
- Are modifications to AWS credential passing actually used?
- Do we need custom endpoint or path-style access changes?
4. Search/Query API
Files to check: Query parser, search executor, aggregation code
- Which query-related changes support SplitSearcher functionality?
- Are modifications to aggregation types (DateHistogram, Histogram, Range) used?
- Do we rely on any custom query parsing or execution logic?
5. Schema/Index Management
Files to check: Schema building, field types, indexing pipeline
- Which schema-related changes are required for our split operations?
- Are modifications to field capabilities or metadata access used?
- Do we need custom schema introspection APIs?
Expected Deliverables
1. Analysis Report
File: QUICKWIT_FORK_ANALYSIS_REPORT.md
Should include:
- Complete list of all quickwit fork modifications
- Categorization (Active/Testing/Obsolete/Uncertain)
- Usage evidence for each active change
- Recommendation for each obsolete change
2. Cleanup Pull Request
- Remove all Category C (Obsolete/Superseded) changes
- Remove Category B changes not needed for testing
- Document Category A changes with usage comments
- Resolve Category D items through investigation
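For the removals themselves, explicit reverts keep the fork's history auditable, which matters when a "removed" change later turns out to have had indirect users. A sketch; the `revert_obsolete` helper name is hypothetical, and interactive rebase (`git rebase -i upstream/main`, dropping the obsolete commits) is an alternative if rewriting the branch is acceptable.

```shell
#!/usr/bin/env bash
# Sketch: back out obsolete (Category C) fork commits as explicit reverts,
# so the removal itself is recorded in history.
set -u

revert_obsolete() {
  # $1 = fork path; remaining args = commit SHAs to revert (newest first)
  local repo=$1; shift
  git -C "$repo" revert --no-edit "$@"
}

# Example invocation (assumed path, placeholder SHA):
# revert_obsolete ../quickwit 0123abc
```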
3. Dependency Documentation
File: QUICKWIT_DEPENDENCIES.md
Should document:
- Which quickwit crates tantivy4java depends on
- Which specific APIs/functions from quickwit are used
- Why each dependency is necessary
- Any custom modifications and their justification
4. Upstreaming Opportunities
File: QUICKWIT_UPSTREAM_CANDIDATES.md
Identify changes that:
- Provide general value (not tantivy4java-specific)
- Fix bugs or add features useful to quickwit community
- Could be contributed back to upstream quickwit
- Would reduce maintenance burden if upstreamed
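Changes selected for upstreaming can be packaged for review with `git format-patch`. A sketch; the `format_upstream_patches` helper name and the output path are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: export each fork commit beyond the upstream base as a numbered
# patch file, ready to attach to an upstream pull request or submission.
set -u

format_upstream_patches() {
  # $1 = fork path, $2 = upstream base ref, $3 = output directory
  git -C "$1" format-patch -o "$3" "$2"..HEAD
}

# Example invocation (assumed refs/paths):
# format_upstream_patches ../quickwit upstream/main /tmp/quickwit-patches
```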
Success Criteria
- ✅ Complete change inventory - All quickwit fork modifications cataloged
- ✅ Clear categorization - Each change marked Active/Testing/Obsolete/Uncertain
- ✅ Usage verification - Active changes mapped to tantivy4java code paths
- ✅ Cleanup execution - Obsolete changes removed from fork
- ✅ Documentation - Remaining changes clearly documented with justification
- ✅ Reduced divergence - Fork complexity minimized to essential changes only
- ✅ Upstream plan - Valuable changes identified for potential contribution
Benefits
Immediate Benefits
- Clearer dependency picture - Understand exactly what we need from quickwit
- Easier maintenance - Fewer custom changes to track and update
- Faster debugging - Less confusion about which code paths are active
- Better documentation - Clear record of why each modification exists
Long-term Benefits
- Simpler upstream sync - Easier to incorporate quickwit updates
- Upstreaming potential - Opportunity to contribute valuable changes back
- Reduced technical debt - Eliminate obsolete experimental code
- Team knowledge - Better understanding of quickwit integration points
Priority
Medium-High - Important for long-term maintainability but not blocking current functionality
Estimated Effort
- Analysis phase: 4-8 hours
- Cleanup implementation: 2-4 hours
- Testing and validation: 2-4 hours
- Documentation: 2-3 hours
- Total: ~10-19 hours
Notes
- Should be done after current development stabilizes
- Requires access to both tantivy4java and quickwit fork repositories
- May reveal opportunities for simplification in tantivy4java as well
- Could identify features we thought we needed but never actually used