IndexingMerger: lessen work and retry by default #3732

jjezra · 2025-11-06T22:53:23Z

On catching an exception, the IndexingMerger tries to decide whether it should lessen work and retry. Currently the decision uses an include list of known "retryable" errors. If an error is not detected correctly, or was not anticipated, the merger does not retry.
Since merging is a background process, however, it is better and safer to retry after timeouts and after any FDB exception.

Resolves #3731

On catching an exception, the IndexingMerger tries to decide whether it should lessen work and retry. Currently the decision uses an include list of known "retryable" errors. If an error is not detected correctly, or was not anticipated, the merger does not retry. Being a background process, however, it is better and safer to retry by default. Resolves FoundationDB#3731
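
For illustration only, the difference between the include-list decision and the broader one described above might look roughly like the sketch below. This is not the merger's actual code: `FakeFdbException`, the placeholder codes in `LISTED_CODES`, and both method names are assumptions made up for this example.

```java
import java.util.Set;
import java.util.concurrent.TimeoutException;

final class RetryDecisionSketch {
    /** Hypothetical stand-in for an FDB-style exception that carries an error code. */
    static final class FakeFdbException extends RuntimeException {
        final int code;

        FakeFdbException(int code) {
            super("fake fdb error " + code);
            this.code = code;
        }
    }

    // Placeholder codes for the include list; not a claim about the real list.
    private static final Set<Integer> LISTED_CODES = Set.of(1004, 1007);

    /** Include-list style: retry only when a listed code appears in the cause chain. */
    static boolean retryIfListed(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof FakeFdbException && LISTED_CODES.contains(((FakeFdbException) t).code)) {
                return true;
            }
        }
        return false;
    }

    /** Broader style: retry after a timeout or after any FDB-style exception in the cause chain. */
    static boolean retryOnTimeoutOrAnyFdb(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof TimeoutException || t instanceof FakeFdbException) {
                return true;
            }
        }
        return false;
    }
}
```

With the include list, an unlisted code such as `new FakeFdbException(9999)` is not retried; with the broader check it is, which is the behavior change this PR argues for.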

...core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingThrottle.java

...r-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingMerger.java

ScottDugas · 2025-12-01T21:49:55Z

...r-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingMerger.java

+        // Expecting AsyncToSyncTimeoutException or an instance of TimeoutException. However, cannot
+        // refer to AsyncToSyncTimeoutException without creating a lucene dependency
+        // TODO: remove this kludge
+        if (e.getClass().getCanonicalName().contains("Timeout") ) {


From the PR description:

Being a background process, however, it is better and safer to retry by default.

Why not just have this return false for any exception except InterruptedException?
Here you abort for any exception that doesn't have Timeout in its name or an FDBException as its cause.

Either way, it seems like having a test with a mock IndexMaintainer that has various failure scenarios would be valuable.
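
To make the suggestion concrete, here is a rough sketch of an abort-only-on-interrupt check together with a small table of failure scenarios of the kind a mock index maintainer could throw. The class, the `shouldAbort` name, and the scenarios are all hypothetical; the real method and test harness may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeoutException;

final class AbortOnlyOnInterruptSketch {
    /** Returns true (abort) only when an InterruptedException appears in the cause chain. */
    static boolean shouldAbort(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }

    /** Hypothetical failure scenarios mapped to the expected abort decision. */
    static Map<Throwable, Boolean> scenarios() {
        Map<Throwable, Boolean> expected = new LinkedHashMap<>();
        expected.put(new TimeoutException("merge timed out"), false);
        expected.put(new IllegalStateException("unexpected failure"), false);
        expected.put(new RuntimeException("wrapped", new InterruptedException()), true);
        return expected;
    }
}
```

A test along these lines would iterate over `scenarios()` and compare each expected value with `shouldAbort`, so adding a new failure mode is a one-line change.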

...yer-lucene/src/test/java/com/apple/foundationdb/record/lucene/FDBLuceneIndexFailureTest.java

jjezra · 2025-12-04T21:55:50Z

...er-lucene/src/test/java/com/apple/foundationdb/record/lucene/LuceneIndexMaintenanceTest.java

                }
            };
-            for (int i = 0; i < 100; i++) {
+            for (int i = 0; i < 20; i++) {


@ScottDugas , please verify -

This now uses the lower-level index maintainer's merge to avoid retries on timeout.

With 100 repetitions, the flakyMergeQuick test was reaching the 5-minute test limit, so I had to take it down to 20. Is that acceptable?

I think it would be better to disable all retries in the indexer used for merging.
The point of this test is to validate that if Lucene merge fails randomly in the middle it will still be usable, and not corrupted. Having retries in the IndexingMerger means that it could heal, but we want to make sure that requests coming in while the merge is ongoing don't get a corrupted view.

I'd also recommend adding any comments necessary to make that clear.
The javadoc on the test is, apparently, too brief.

Added this clarification to the javadoc.
Since there is no control over the number of retries (the merger will retry until it cannot reduce the amount of work / transaction time quota anymore), this PR calls the lower-level mergeIndex from the index maintainer.
I wonder whether the 100-iteration loop would now avoid the 5-minute test timeout, because the lower level is quick enough to set the file lock before hitting the asyncToSync timeout...
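
As a rough illustration of why the retry count is not directly controllable, a lessen-work loop might look like the sketch below: it keeps retrying with a smaller limit and only gives up once the limit cannot shrink further. The halving rule, `runWithLessening`, and its parameters are assumptions for this example, not the merger's actual logic.

```java
import java.util.function.IntConsumer;

final class LessenWorkSketch {
    /**
     * Runs the attempt with a work limit, halving the limit after each failure.
     * Rethrows the last failure once the limit is already at minLimit.
     * Returns the limit that finally succeeded.
     */
    static int runWithLessening(IntConsumer attempt, int initialLimit, int minLimit) {
        int limit = initialLimit;
        while (true) {
            try {
                attempt.accept(limit);
                return limit;
            } catch (RuntimeException e) {
                if (limit <= minLimit) {
                    throw e; // nothing left to reduce, so give up
                }
                limit = Math.max(minLimit, limit / 2);
            }
        }
    }
}
```

In this sketch the number of attempts depends on the starting limit and the floor rather than on an explicit retry count, which is why a test that wants zero retries has to bypass the loop entirely.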

...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java

...r-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingMerger.java

...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java

ScottDugas · 2025-12-10T16:07:18Z

...ecord-layer-lucene/src/main/java/com/apple/foundationdb/record/lucene/LuceneConcurrency.java

     * An exception that is thrown when the async to sync operation times out.
     */
-    public static class AsyncToSyncTimeoutException extends RecordCoreException {
+    public static class AsyncToSyncTimeoutException extends RecordCoreTimeoutException {


Other than backwards compatibility, is there any reason to keep this class around? Should we mark it as API.Status.DEPRECATED, and then, in the future have lucene throw RecordCoreTimeoutException instead?

Probably. Should that be done in another PR?
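
For context, a minimal sketch of what the new superclass buys: core code can detect the timeout by type instead of by class name. The classes below are stand-ins with made-up names, not the real RecordCoreTimeoutException or AsyncToSyncTimeoutException, and `isTimeout` / `isTimeoutByName` are hypothetical helpers.

```java
final class TimeoutHierarchySketch {
    /** Stand-in for a timeout base class that lives in the core module. */
    static class CoreTimeoutException extends RuntimeException {
        CoreTimeoutException(String message) {
            super(message);
        }
    }

    /** Stand-in for the lucene-side exception, declared as a subclass of the core type. */
    static class LuceneAsyncToSyncTimeoutException extends CoreTimeoutException {
        LuceneAsyncToSyncTimeoutException(String message) {
            super(message);
        }
    }

    /** Type-based check: needs no lucene dependency and sees through wrapping exceptions. */
    static boolean isTimeout(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof CoreTimeoutException) {
                return true;
            }
        }
        return false;
    }

    /** Name-based check for comparison; getCanonicalName() is null for anonymous and local classes. */
    static boolean isTimeoutByName(Throwable error) {
        String name = error.getClass().getCanonicalName();
        return name != null && name.contains("Timeout");
    }
}
```

The name-based variant misses a timeout wrapped in another exception and needs the null guard, whereas the `instanceof` check handles both cases.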

ScottDugas · 2025-12-10T16:09:59Z

...er-lucene/src/test/java/com/apple/foundationdb/record/lucene/LuceneIndexMaintenanceTest.java

                }
            };
-            for (int i = 0; i < 100; i++) {
+            for (int i = 0; i < 20; i++) {


I think it would be better to disable all retries in the indexer used for merging.
The point of this test is to validate that if Lucene merge fails randomly in the middle it will still be usable, and not corrupted. Having retries in the IndexingMerger means that it could heal, but we want to make sure that requests coming in while the merge is ongoing don't get a corrupted view.

I'd also recommend adding any comments necessary to make that clear.
The javadoc on the test is, apparently, too brief.

...er-lucene/src/test/java/com/apple/foundationdb/record/lucene/LuceneIndexMaintenanceTest.java

ScottDugas · 2025-12-10T16:12:28Z

...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java

+        AtomicInteger attemptCount = new AtomicInteger(0);
+
+        TestFactory.register(indexType, state -> {
+            adjustMergeControl(state);


The code does not retry any errors while opening the store. Are you intentionally leaving that as-is?
Are you intending to change that in a followup?

Do you mean the merger's code or the test?

I mean the code. It still only retries if the MergeControl is set, which happens after the store is opened.

ScottDugas · 2025-12-15T21:03:41Z

...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java

+        AtomicInteger attemptCount = new AtomicInteger(0);
+
+        TestFactory.register(indexType, state -> {
+            adjustMergeControl(state);


I mean the code. It still only retries if the MergeControl is set, which happens after the store is opened.

jjezra added the enhancement label Nov 6, 2025

jjezra requested review from ScottDugas and ohadzeliger November 6, 2025 23:04

Backward compatible - abort if the exception is not an FDBException

06ff451

ohadzeliger requested changes Nov 7, 2025

...core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingThrottle.java (outdated)

...r-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingMerger.java (outdated)

jjezra requested a review from ohadzeliger November 7, 2025 21:03

Tmp kludge for testing

168a312

ScottDugas reviewed Dec 1, 2025

jjezra added 4 commits December 1, 2025 18:55

Add tests

7633735

Fix a failing test

7b25720

Fix test (tmp)

0c27c3f

use RecordCoreTimeoutException

a1afad4

jjezra commented Dec 4, 2025

...yer-lucene/src/test/java/com/apple/foundationdb/record/lucene/FDBLuceneIndexFailureTest.java

jjezra commented Dec 4, 2025

ohadzeliger reviewed Dec 5, 2025

...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java

Make lessenWorkCodes a private static var

692f23a

jjezra requested review from ScottDugas and ohadzeliger December 8, 2025 21:49

ohadzeliger approved these changes Dec 9, 2025

ScottDugas requested changes Dec 10, 2025

Apply Scott's requested changes

3fe0338

jjezra force-pushed the merger_retry_by_default branch from 9d09231 to 3fe0338 Compare December 13, 2025 08:02

jjezra requested a review from ScottDugas December 13, 2025 08:02

ScottDugas approved these changes Dec 15, 2025
