
enhancement(search): backoff and abort on repeated extraction failures during indexing#12111

Merged
mmattel merged 2 commits into owncloud:master from paul43210:fix/search-indexer-backoff
Apr 17, 2026

Conversation

@paul43210
Contributor

Summary

  • When Tika crashes or restarts during bulk indexing (ocis search index), the indexer now detects repeated extraction failures and backs off instead of continuing at full speed
  • After 5 consecutive failures, pauses for 30 seconds and checks if the extraction service has recovered
  • After 5 backoff cycles (2.5 minutes total) with no recovery, aborts with a clear error message
  • Logs a summary at the end of every walk: extracted, skipped, failed, and backoff cycle counts
  • Adds optional HealthChecker interface for extractors that support health checks (implemented by Tika)

Background

During a production reindex of 148K files, Tika crashed 6 times. The indexer continued walking all files at full speed, accumulating 14,363 "connection refused" failures over 19 hours. Each failed file still cost network + index lookup time. The Extracted:true guard (PR #12095) prevents stale entries, but the walk time was completely wasted.

Design decisions

  • Optional interface pattern for HealthChecker (like the Optimizer interface in #12104, enhancement(search): optimize bleve index after bulk reindexing) — no changes to the Extractor interface contract
  • Private upsertItem() returning error for IndexSpace path; public UpsertItem() unchanged for NATS events
  • Constants are package-level for easy tuning: _backoffThreshold=5, _backoffDuration=30s, _maxBackoffCycles=5

Test plan

  • make -C services/search test — all 75 specs pass
  • Build succeeds
  • Verify backoff triggers when Tika is stopped during indexing
  • Verify walk resumes after Tika restart within backoff window
  • Verify walk aborts after max backoff cycles

🤖 Generated with Claude Code

@mmattel mmattel requested a review from mklos-kw March 13, 2026 09:07
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 3ed7caf to 6e66737 Compare March 14, 2026 02:23

@mmattel mmattel requested a review from kobergj March 24, 2026 11:18
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 3c90475 to 9b0bd9e Compare March 26, 2026 15:45
@mmattel mmattel force-pushed the fix/search-indexer-backoff branch from 9b0bd9e to 1cd0a86 Compare March 26, 2026 16:05
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 1cd0a86 to f76df92 Compare April 9, 2026 01:02
@mmattel mmattel requested a review from jvillafanez April 10, 2026 07:43
@jvillafanez
Member

Optional interface pattern for HealthChecker

Not a fan of it. You usually want either an extractor or a health checker, and most of the time they're different objects, because you want the check at compile time. Type checking at runtime is something to be avoided.

I don't think we want to deal with two different types of extractors. Either include the health check as part of the regular interface (to be implemented by all extractors), or let each extractor that wants the feature handle it internally.
From a service point of view, the second option seems better. We also don't need to touch the interface.

For the health check, I think it's easier to use tikaClient.Version(). No need to keep the Tika URL around to check a different endpoint.


On another note, I'm not sure if we want this feature at the moment. There are several challenges:

  • With the PR's code, the thread will sleep for up to 30 minutes. I'd need to check the call stack, but unless the process is already async, this could freeze the request and potentially cause problems in the browser (timeouts).
  • We need to check what happens with the file while this is in progress:
    • If the file is available to download, I don't think it will be available for searching (not indexed yet). This might be acceptable if the processing is short (it could be a hiccup), but we'll get bug reports if it takes too long. In addition, we also need to check what happens with other operations such as the file being deleted.
      Consider the following case: a file is uploaded, extraction fails and sleeps for 30 minutes; during those 30 minutes Tika comes back, the file is deleted, the file in the index is also deleted (I'm not sure about this, but let's assume so), and then the "delayed" extraction kicks in. This means we'd have a file searchable in the index but missing in the FS.
    • If the file isn't available because it's being processed, we'd need to check the client's behavior (web, mobile and desktop). For the user, it will be weird if an uploaded file isn't available after 5 or 10 minutes.

I think we need to consider the worst case: even after the 30 minutes of retries, the Tika service is still down and the file isn't processed. What's worse, the system might need manual intervention (likely a manual restart), which might take even longer. In this case, I don't think the backoff would accomplish anything, and it might be even worse if the user doesn't get any kind of error.

From my point of view, if you want to retry 5 times with a 1-2 seconds delay (for small hiccups), that's fine, but the user should get an error if the content fails to get extracted. Let the user decide what to do next.


During a production reindex of 148K files, Tika crashed 6 times. The indexer continued walking all files at full speed, accumulating 14,363 "connection refused" failures over 19 hours. Each failed file still cost network + index lookup time. The Extracted:true guard (PR #12095) prevents stale entries, but the walk time was completely wasted.

While I was thinking about regular usage (average user uploading files through the web interface), this is a use case to consider.

In your particular case, I think the regular "retry after 1 second" with a maximum of 5 retries should work (basically the usual quick retry). If that fails, it's likely a more serious problem and it might be better to abort.
Depending on the error you're facing, a rate-limiting approach to the indexing might be an option, although I'd need to check whether the indexing is fully sequential or not.

@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch 2 times, most recently from 482dcd7 to 1493750 Compare April 15, 2026 18:45
@paul43210
Contributor Author

Thanks @jvillafanez — you're right, the 30-minute exponential backoff was overengineered. Reworked completely:

What changed:

  • Simple retry: 5 attempts per file, 1-second delay between retries
  • After 5 consecutive file failures (not retries — 5 different files all failing), the walk aborts with an error
  • No HealthChecker interface, no tikaURL field, no /tika endpoint check — all removed
  • No separate upsertItem duplicate — retry wraps the existing extraction logic
  • No interface changes at all

Why abort after consecutive failures: If 5 files in a row fail extraction even after 5 retries each, the extraction service is down, not a single problematic file. The admin gets an error and can investigate (restart Tika, check resources, etc.) rather than silently accumulating thousands of wasted errors.

For the file-deleted-during-retry edge case: the retry window is at most 5 seconds (5 × 1s), which is short enough that it won't cause stale index entries in practice.

During IndexSpace bulk reindexing, file extraction is now retried up to
5 times with a 1-second delay between attempts. If 5 consecutive files
fail (indicating the extraction service is down), the walk is aborted.

This replaces the previous 30-minute exponential backoff approach per
review feedback from jvillafanez: keep retries quick and short, return
an error to the admin rather than sleeping for extended periods.

No interface changes — retry logic is internal to the service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Paul Faure <paul@faure.ca>
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 1493750 to c46980a Compare April 16, 2026 12:28
@mmattel mmattel enabled auto-merge (squash) April 17, 2026 08:41
@mmattel mmattel merged commit cc9b170 into owncloud:master Apr 17, 2026
54 checks passed
