enhancement(search): backoff and abort on repeated extraction failures during indexing #12111
Conversation
Not a fan of it. You usually want either an […]. I don't think we want to deal with two different types of extractors: either include the health check as part of the regular interface (to be implemented by all extractors), or let it be handled internally by each extractor that wants to implement the feature. For the health check, I think it's easier to use the […]. On another note, I'm not sure we want this feature at the moment. There are several challenges:
I think we need to consider the worst case: even after the 30 minutes of retries, the Tika service is still down and the file isn't processed. Worse, the system might need manual intervention (likely a manual restart), which could take even longer. In that case the backoff wouldn't accomplish anything, and it would be even worse if the user didn't get any kind of error. From my point of view, retrying 5 times with a 1-2 second delay (for small hiccups) is fine, but the user should get an error if the content fails to be extracted. Let the user decide what to do next.
While I was thinking about regular usage (an average user uploading files through the web interface), this is a use case to consider. In your particular case, I think the regular "retry after 1 second" with a maximum of 5 retries should work (basically the usual quick retry). If that fails, it's likely a more serious problem and it might be better to abort.
Thanks @jvillafanez, you're right: the 30-minute exponential backoff was overengineered. Reworked completely. What changed:

Why abort after consecutive failures: if 5 files in a row fail extraction even after 5 retries each, the extraction service is down, not a single problematic file. The admin gets an error and can investigate (restart Tika, check resources, etc.) rather than the walk silently accumulating thousands of wasted errors.

For the file-deleted-during-retry edge case: the retry window is at most 5 seconds (5 × 1 s), which is short enough that it won't cause stale index entries in practice.
During IndexSpace bulk reindexing, file extraction is now retried up to 5 times with a 1-second delay between attempts. If 5 consecutive files fail (indicating the extraction service is down), the walk is aborted. This replaces the previous 30-minute exponential backoff approach per review feedback from jvillafanez: keep retries quick and short, and return an error to the admin rather than sleeping for extended periods. No interface changes: retry logic is internal to the service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Paul Faure <paul@faure.ca>



Summary
- During bulk reindexing (`ocis search index`), the indexer now detects repeated extraction failures and backs off instead of continuing at full speed
- New `HealthChecker` interface for extractors that support health checks (implemented by Tika)

Background
During a production reindex of 148K files, Tika crashed 6 times. The indexer continued walking all files at full speed, accumulating 14,363 "connection refused" failures over 19 hours. Each failed file still cost network + index lookup time. The `Extracted: true` guard (PR #12095) prevents stale entries, but the walk time was completely wasted.

Design decisions
- Separate `HealthChecker` interface (like `Optimizer` in enhancement(search): optimize bleve index after bulk reindexing #12104): no changes to the `Extractor` interface contract
- `upsertItem()` returns an error for the IndexSpace path; the public `UpsertItem()` is unchanged for NATS events
- `_backoffThreshold = 5`, `_backoffDuration = 30s`, `_maxBackoffCycles = 5`

Test plan
`make -C services/search test`: all 75 specs pass

🤖 Generated with Claude Code
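The "separate optional interface" design decision above (a `HealthChecker` alongside the unchanged `Extractor` contract) can be sketched with Go's comma-ok type assertion. The types and method names below are illustrative assumptions, not the actual ocis definitions.

```go
package main

import "fmt"

// Extractor is the existing contract (simplified); it is unchanged.
type Extractor interface {
	Extract(path string) (string, error)
}

// HealthChecker is a separate, optional interface. Extractors that
// can report availability (like Tika) implement it; others are
// simply never health-checked.
type HealthChecker interface {
	Healthy() bool
}

// basicExtractor implements only Extractor.
type basicExtractor struct{}

func (basicExtractor) Extract(p string) (string, error) { return "", nil }

// tikaExtractor also opts into health checks.
type tikaExtractor struct{ up bool }

func (tikaExtractor) Extract(p string) (string, error) { return "", nil }
func (t tikaExtractor) Healthy() bool                  { return t.up }

// healthy reports true unless the extractor opts into health
// checks and says it is down. The comma-ok assertion is what lets
// the Extractor interface stay untouched.
func healthy(e Extractor) bool {
	if hc, ok := e.(HealthChecker); ok {
		return hc.Healthy()
	}
	return true
}

func main() {
	fmt.Println(healthy(basicExtractor{}), healthy(tikaExtractor{up: false}))
}
```

This mirrors the `Optimizer` pattern referenced from #12104: capabilities are discovered at runtime, so existing extractors compile and behave exactly as before.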