Analysis Request: Quickwit Fork Change Analysis and Cleanup #21

@schenksj

Description

Feature Request: Quickwit Fork Change Analysis and Cleanup

Overview

Analyze all modifications made to the quickwit fork (../quickwit) and determine which changes are actually utilized by the final tantivy4java product. Remove unnecessary changes that were made for features that were ultimately redesigned or not implemented.

Problem Statement

During the development of tantivy4java, various modifications were made to our quickwit fork to support different integration approaches. However, some of these changes:

  • Were made for features that were later redesigned
  • Are no longer used in the current implementation
  • May complicate future maintenance and upstream synchronization
  • Could cause confusion about actual dependencies

Objectives

1. Comprehensive Change Analysis

  • Identify all modifications in the quickwit fork relative to upstream
  • Map each change to specific tantivy4java features or use cases
  • Document the purpose of each modification
  • Verify active usage through code path analysis

2. Usage Verification

For each modification, determine:

  • Is it actively used? - Called by current tantivy4java code
  • Is it critical? - Required for core functionality
  • Is it redundant? - Superseded by alternative implementations
  • Is it experimental? - Made for testing but not production code

3. Cleanup Strategy

  • Remove unused modifications that serve no current purpose
  • Document remaining changes with clear justification
  • Simplify maintenance burden by minimizing divergence from upstream
  • Prepare for potential upstreaming of valuable changes

Analysis Methodology

Step 1: Identify All Quickwit Fork Changes

# Compare fork against upstream quickwit
cd ../quickwit
git remote add upstream https://github.com/quickwit-oss/quickwit.git
git fetch upstream

# Three-dot diff shows only the fork's changes since the merge base,
# not upstream commits the fork happens to lack
git diff upstream/main...HEAD > /tmp/quickwit-fork-changes.patch
git diff upstream/main...HEAD --stat > /tmp/quickwit-fork-files.txt

# Analyze commit history
git log upstream/main..HEAD --oneline --no-merges > /tmp/quickwit-fork-commits.txt

Step 2: Map Changes to Tantivy4Java Usage

For each modification, check:

  • Native Rust code (tantivy4java/native/src/*.rs) - JNI bindings that call quickwit
  • Dependency declarations (tantivy4java/native/Cargo.toml) - Which quickwit crates are used
  • Merge functionality (perform_quickwit_merge_standalone, QuickwitSplit.mergeSplits())
  • Split conversion (convertIndexFromPath, split file operations)
  • Process-based merge binary (tantivy4java-merge standalone executable)
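
A grep-based pass over these locations is one way to gather the usage evidence. The sketch below is illustrative only: it builds a throwaway source tree rather than assuming the real tantivy4java layout, and the symbol names are placeholders for whatever identifiers the fork diff actually touches.

```shell
# Illustrative usage check: count call sites for each fork-touched symbol.
# SRC and SYMBOLS are made-up stand-ins; in practice, point SRC at
# tantivy4java/native/src and fill SYMBOLS from the fork diff.
SRC=$(mktemp -d)
cat > "$SRC/merge.rs" <<'EOF'
use quickwit_indexing::merge_policy::MergePolicy;
EOF

SYMBOLS="MergePolicy MergeExecutor"
for sym in $SYMBOLS; do
    hits=$(grep -rl "$sym" "$SRC" | wc -l | tr -d ' ')
    echo "$sym: $hits file(s)"
done > "$SRC/usage.txt"
cat "$SRC/usage.txt"
```

A symbol with zero hits is a candidate for Category C below, though indirect use through quickwit-internal call chains still has to be ruled out before removal.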

Step 3: Categorize Changes

Category A: Active Production Use

Changes that are:

  • Called by production tantivy4java code paths
  • Required for core functionality (split merge, split search)
  • Part of public API features

Category B: Development/Testing Only

Changes that are:

  • Used only in test code or examples
  • Made for experimental features not in production
  • Debugging aids not required for operation

Category C: Obsolete/Superseded

Changes that are:

  • Made for features that were redesigned
  • No longer reachable from current code
  • Replaced by alternative implementations

Category D: Uncertain/Needs Investigation

Changes that:

  • Have unclear purpose or documentation
  • May have indirect usage through dependencies
  • Require deeper analysis to verify usage
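
One lightweight way to track the categorization is a per-file worksheet seeded from the diff and reclassified by hand as evidence accumulates. A sketch, with hypothetical file names standing in for the real `git diff --name-only` output:

```shell
# Seed a categorization worksheet; every entry starts as Category D
# (needs investigation) until usage evidence moves it to A, B, or C.
worksheet=/tmp/fork-categories.csv
echo "file,category,evidence" > "$worksheet"
for f in \
    "quickwit-indexing/src/merge_policy.rs" \
    "quickwit-storage/src/s3_storage.rs"
do
    echo "$f,D,needs investigation" >> "$worksheet"
done
cat "$worksheet"
```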

Key Areas to Investigate

1. Merge Functionality

Files to check: quickwit/quickwit-indexing/src/merge_policy.rs, merge executor code

  • Which merge-related changes are used by perform_quickwit_merge_standalone()?
  • Are modifications to MergeExecutor, MergePolicy, or merge configuration actually used?
  • Do we use quickwit's merge logic directly or through our own wrappers?

2. Split File Format

Files to check: Split serialization/deserialization, bundle directory code

  • Which split format changes are required for our split conversion?
  • Are modifications to split metadata, compression, or file structure necessary?
  • Do we rely on any custom split file format extensions?

3. Storage Backend (S3)

Files to check: S3 storage implementation, credential handling

  • Which S3-related changes support our s3:// URL handling?
  • Are modifications to AWS credential passing actually used?
  • Do we need custom endpoint or path-style access changes?

4. Search/Query API

Files to check: Query parser, search executor, aggregation code

  • Which query-related changes support SplitSearcher functionality?
  • Are modifications to aggregation types (DateHistogram, Histogram, Range) used?
  • Do we rely on any custom query parsing or execution logic?

5. Schema/Index Management

Files to check: Schema building, field types, indexing pipeline

  • Which schema-related changes are required for our split operations?
  • Are modifications to field capabilities or metadata access used?
  • Do we need custom schema introspection APIs?

Expected Deliverables

1. Analysis Report

File: QUICKWIT_FORK_ANALYSIS_REPORT.md

Should include:

  • Complete list of all quickwit fork modifications
  • Categorization (Active/Testing/Obsolete/Uncertain)
  • Usage evidence for each active change
  • Recommendation for each obsolete change
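
A skeleton for the report can be generated up front so findings land in a consistent place. The section layout below is a suggestion, not a required format:

```shell
# Generate an empty report skeleton mirroring the four categories above.
report=/tmp/QUICKWIT_FORK_ANALYSIS_REPORT.md
cat > "$report" <<'EOF'
# Quickwit Fork Analysis Report

## Modification Inventory
| Commit | Files touched | Category | Evidence / Recommendation |
|--------|---------------|----------|---------------------------|

## Category A: Active Production Use
## Category B: Development/Testing Only
## Category C: Obsolete/Superseded
## Category D: Uncertain/Needs Investigation
EOF
grep -c '^## Category' "$report"
```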

2. Cleanup Pull Request

  • Remove all Category C (Obsolete/Superseded) changes
  • Remove Category B changes not needed for testing
  • Document Category A changes with usage comments
  • Resolve Category D items through investigation

3. Dependency Documentation

File: QUICKWIT_DEPENDENCIES.md

Should document:

  • Which quickwit crates tantivy4java depends on
  • Which specific APIs/functions from quickwit are used
  • Why each dependency is necessary
  • Any custom modifications and their justification
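
The crate-level list can be pulled mechanically from the manifest. A self-contained sketch using a toy Cargo.toml — the real file is tantivy4java/native/Cargo.toml, and the entries here are examples, not the actual dependency set:

```shell
# Extract the quickwit crates a manifest declares as dependencies.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
[dependencies]
quickwit-indexing = { path = "../../quickwit/quickwit-indexing" }
quickwit-storage = { path = "../../quickwit/quickwit-storage" }
serde = "1"
EOF
crates=$(grep -o '^quickwit-[a-z-]*' "$manifest")
echo "$crates"
```

Each crate found this way then gets an entry in QUICKWIT_DEPENDENCIES.md explaining which of its APIs tantivy4java actually calls.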

4. Upstreaming Opportunities

File: QUICKWIT_UPSTREAM_CANDIDATES.md

Identify changes that:

  • Provide general value (not tantivy4java-specific)
  • Fix bugs or add features useful to quickwit community
  • Could be contributed back to upstream quickwit
  • Would reduce maintenance burden if upstreamed

Success Criteria

  • Complete change inventory - All quickwit fork modifications cataloged
  • Clear categorization - Each change marked Active/Testing/Obsolete/Uncertain
  • Usage verification - Active changes mapped to tantivy4java code paths
  • Cleanup execution - Obsolete changes removed from fork
  • Documentation - Remaining changes clearly documented with justification
  • Reduced divergence - Fork complexity minimized to essential changes only
  • Upstream plan - Valuable changes identified for potential contribution

Benefits

Immediate Benefits

  • Clearer dependency picture - Understand exactly what we need from quickwit
  • Easier maintenance - Fewer custom changes to track and update
  • Faster debugging - Less confusion about which code paths are active
  • Better documentation - Clear record of why each modification exists

Long-term Benefits

  • Simpler upstream sync - Easier to incorporate quickwit updates
  • Upstreaming potential - Opportunity to contribute valuable changes back
  • Reduced technical debt - Eliminate obsolete experimental code
  • Team knowledge - Better understanding of quickwit integration points

Priority

Medium-High - Important for long-term maintainability but not blocking current functionality

Estimated Effort

  • Analysis phase: 4-8 hours
  • Cleanup implementation: 2-4 hours
  • Testing and validation: 2-4 hours
  • Documentation: 2-3 hours
  • Total: ~10-19 hours

Notes

  • Should be done after current development stabilizes
  • Requires access to both tantivy4java and quickwit fork repositories
  • May reveal opportunities for simplification in tantivy4java as well
  • Could identify features we thought we needed but never actually used

Metadata

Labels: enhancement (New feature or request)