-
Notifications
You must be signed in to change notification settings - Fork 119
IndexingMerger: lessen work and retry by default #3732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Catching an exception, the IndexingMerger attempts to detect if it should lessen work and retry. Currently it is an include list of known "retyable" errors. If the error is not detected correctly, or not predicted, it will not retry. Being a background process, however, it is better and safer to retry by default. Resolves FoundationDB#3731
...core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingThrottle.java
Outdated
Show resolved
Hide resolved
...r-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingMerger.java
Outdated
Show resolved
Hide resolved
| // Expecting AsyncToSyncTimeoutException or an instance of TimeoutException. However, cannot | ||
| // refer to AsyncToSyncTimeoutException without creating a lucene dependency | ||
| // TODO: remove this kludge | ||
| if (e.getClass().getCanonicalName().contains("Timeout") ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the PR description:
Being a background process, however, it is better and safer to retry by default.
Why not just have this return false if there is any exception, except InterruptedException.
Here you abort for any exception that doesn't have Timeout in its name, or an FDBException as its cause.
Either way, it seems like having a test with a mock IndexMaintainer that has various failure scenarios would be valuable.
...yer-lucene/src/test/java/com/apple/foundationdb/record/lucene/FDBLuceneIndexFailureTest.java
Show resolved
Hide resolved
| } | ||
| }; | ||
| for (int i = 0; i < 100; i++) { | ||
| for (int i = 0; i < 20; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScottDugas , please verify -
- Using now a lower level index maintainer's merge to avoid retries on timeout.
- With "100" repetitions the
flakyMergeQuicktest was reaching the 5 minutes test limit. Had to take it down to 20. Is that acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to disable all retries in the indexer used for merging.
The point of this test is to validate that if Lucene merge fails randomly in the middle it will still be usable, and not corrupted. Having retries in the IndexingMerger means that it could heal, but we want to make sure that requests coming in while the merge is ongoing don't get a corrupted view.
I'd also recommend on any comments necessary to make that clear.
The javadoc on the test is, apparently, too brief.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this clarification to the javadoc.
Since there is not control on the number of retries (the merger will retry until it cannot reduce the amount of work / transaction time quota anymore), this PR is calling the lower level mergeIndex from the index maintainer.
I wonder if the 100 loop will not get the 5 minutes test timeout because the lower level is quick enough to set the file-lock before getting the asyncToSync timeout...
...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java
Show resolved
Hide resolved
...r-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/IndexingMerger.java
Outdated
Show resolved
Hide resolved
...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java
Outdated
Show resolved
Hide resolved
...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java
Outdated
Show resolved
Hide resolved
...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java
Show resolved
Hide resolved
...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java
Outdated
Show resolved
Hide resolved
...rc/test/java/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexerMergeTest.java
Outdated
Show resolved
Hide resolved
| * An exception that is thrown when the async to sync operation times out. | ||
| */ | ||
| public static class AsyncToSyncTimeoutException extends RecordCoreException { | ||
| public static class AsyncToSyncTimeoutException extends RecordCoreTimeoutException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than backwards compatibility, is there any reason to keep this class around? Should we mark it as API.Status.DEPRECATED, and then, in the future have lucene throw RecordCoreTimeoutException instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably. Should that be done in another PR?
| } | ||
| }; | ||
| for (int i = 0; i < 100; i++) { | ||
| for (int i = 0; i < 20; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to disable all retries in the indexer used for merging.
The point of this test is to validate that if Lucene merge fails randomly in the middle it will still be usable, and not corrupted. Having retries in the IndexingMerger means that it could heal, but we want to make sure that requests coming in while the merge is ongoing don't get a corrupted view.
I'd also recommend on any comments necessary to make that clear.
The javadoc on the test is, apparently, too brief.
...er-lucene/src/test/java/com/apple/foundationdb/record/lucene/LuceneIndexMaintenanceTest.java
Show resolved
Hide resolved
| AtomicInteger attemptCount = new AtomicInteger(0); | ||
|
|
||
| TestFactory.register(indexType, state -> { | ||
| adjustMergeControl(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code does not retry any errors while opening the store. Are you intentionally leaving that as-is?
Are you intending to change that in a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the merger's code or the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the code. It still only retries if the MergeControl is set, which happens after the store is opened.
9d09231 to
3fe0338
Compare
| AtomicInteger attemptCount = new AtomicInteger(0); | ||
|
|
||
| TestFactory.register(indexType, state -> { | ||
| adjustMergeControl(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the code. It still only retries if the MergeControl is set, which happens after the store is opened.
Catching an exception, the IndexingMerger attempts to detect if it should lessen work and retry. Currently it is an include list of known "retyable" errors. If the error is not detected correctly, or not predicted, it will not retry.
Being a background process, however, it is better and safer to retry after timeouts and after any FDB exception.
Resolves #3731