
enhancement(search): backoff and abort on repeated extraction failures during indexing#12111

Merged
mmattel merged 2 commits into owncloud:master from paul43210:fix/search-indexer-backoff
Apr 17, 2026

Conversation

@paul43210
Contributor

Summary

  • When Tika crashes or restarts during bulk indexing (ocis search index), the indexer now detects repeated extraction failures and backs off instead of continuing at full speed
  • After 5 consecutive failures, pauses for 30 seconds and checks if the extraction service has recovered
  • After 5 backoff cycles (2.5 minutes total) with no recovery, aborts with a clear error message
  • Logs a summary at the end of every walk: extracted, skipped, failed, and backoff cycle counts
  • Adds optional HealthChecker interface for extractors that support health checks (implemented by Tika)

Background

During a production reindex of 148K files, Tika crashed 6 times. The indexer continued walking all files at full speed, accumulating 14,363 "connection refused" failures over 19 hours. Each failed file still cost network + index lookup time. The Extracted:true guard (PR #12095) prevents stale entries, but the walk time was completely wasted.

Design decisions

  • Optional interface pattern for HealthChecker (like the Optimizer interface in #12104, enhancement(search): optimize bleve index after bulk reindexing) — no changes to the Extractor interface contract
  • Private upsertItem() returning error for IndexSpace path; public UpsertItem() unchanged for NATS events
  • Constants are package-level for easy tuning: _backoffThreshold=5, _backoffDuration=30s, _maxBackoffCycles=5

Test plan

  • make -C services/search test — all 75 specs pass
  • Build succeeds
  • Verify backoff triggers when Tika is stopped during indexing
  • Verify walk resumes after Tika restart within backoff window
  • Verify walk aborts after max backoff cycles

🤖 Generated with Claude Code

@mmattel mmattel requested a review from mklos-kw March 13, 2026 09:07
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 3ed7caf to 6e66737 Compare March 14, 2026 02:23

@mmattel mmattel requested a review from kobergj March 24, 2026 11:18
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 3c90475 to 9b0bd9e Compare March 26, 2026 15:45
@mmattel mmattel force-pushed the fix/search-indexer-backoff branch from 9b0bd9e to 1cd0a86 Compare March 26, 2026 16:05
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 1cd0a86 to f76df92 Compare April 9, 2026 01:02
@mmattel mmattel requested a review from jvillafanez April 10, 2026 07:43
@jvillafanez
Member

Optional interface pattern for HealthChecker

Not a fan of it. You usually want either an extractor or a health checker, and most of the time they're different objects, because you want the check at compile time. Type checking at runtime is something to be avoided.

I don't think we want to deal with two different types of extractors. Either include the health check as part of the regular interface (to be implemented by all extractors), or let each extractor that wants the feature handle it internally.
From a service point of view, the second option seems better. We also don't need to touch the interface.

For the health check, I think it's easier to use tikaClient.Version(). No need to keep the Tika URL around to check a different endpoint.


On another note, I'm not sure if we want this feature at the moment. There are several challenges:

  • With the PR's code, the thread will sleep for up to 30 minutes. I'd need to check the call stack, but unless the process is already async, this could freeze the request and potentially cause problems in the browser (timeouts).
  • We need to check what happens with the file while this is in progress:
    • If the file is available to download, I don't think it will be available for searching (not indexed yet). This might be acceptable if the processing is short (it could be a hiccup), but we'll get bug reports if it takes too long. In addition, we also need to check what happens with other operations such as the file being deleted.
      Consider the following case: a file is uploaded, extraction fails and sleeps for 30 minutes; during those 30 minutes Tika comes back, the file is deleted, the file in the index is also deleted (I'm not sure about this, but let's assume so), and then the "delayed" extraction kicks in. This means we'd have a file searchable in the index but missing in the FS.
    • If the file isn't available because it's being processed, we'd need to check the client's behavior (web, mobile and desktop). For the user, it will be weird if an uploaded file isn't available after 5 or 10 minutes.

I think we need to consider the worst case: even after the 30 minutes of retries, the Tika service is still down and the file isn't processed. What's worse, the system might need manual intervention (likely a manual restart), which might take even longer. In this case, I don't think the backoff would accomplish anything, and it might be even worse if the user doesn't get any kind of error.

From my point of view, if you want to retry 5 times with a 1-2 seconds delay (for small hiccups), that's fine, but the user should get an error if the content fails to get extracted. Let the user decide what to do next.


During a production reindex of 148K files, Tika crashed 6 times. The indexer continued walking all files at full speed, accumulating 14,363 "connection refused" failures over 19 hours. Each failed file still cost network + index lookup time. The Extracted:true guard (PR #12095) prevents stale entries, but the walk time was completely wasted.

While I was thinking about regular usage (average user uploading files through the web interface), this is a use case to consider.

In your particular case, I think the regular "retry after 1 second" with a maximum of 5 retries should work (basically the usual quick retry). If that fails, it's likely a more serious problem and it might be better to abort.
Depending on the error you're facing, a rate-limiting approach to the indexing might be an option, although I'd need to check whether the indexing is fully sequential or not.

@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch 2 times, most recently from 482dcd7 to 1493750 Compare April 15, 2026 18:45
@paul43210
Contributor Author

Thanks @jvillafanez — you're right, the 30-minute exponential backoff was overengineered. Reworked completely:

What changed:

  • Simple retry: 5 attempts per file, 1-second delay between retries
  • After 5 consecutive file failures (not retries — 5 different files all failing), the walk aborts with an error
  • No HealthChecker interface, no tikaURL field, no /tika endpoint check — all removed
  • No separate upsertItem duplicate — retry wraps the existing extraction logic
  • No interface changes at all

Why abort after consecutive failures: If 5 files in a row fail extraction even after 5 retries each, the extraction service is down, not a single problematic file. The admin gets an error and can investigate (restart Tika, check resources, etc.) rather than silently accumulating thousands of wasted errors.

For the file-deleted-during-retry edge case: the retry window is at most 5 seconds (5 × 1s), which is short enough that it won't cause stale index entries in practice.

During IndexSpace bulk reindexing, file extraction is now retried up to
5 times with a 1-second delay between attempts. If 5 consecutive files
fail (indicating the extraction service is down), the walk is aborted.

This replaces the previous 30-minute exponential backoff approach per
review feedback from jvillafanez: keep retries quick and short, return
an error to the admin rather than sleeping for extended periods.

No interface changes — retry logic is internal to the service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Paul Faure <paul@faure.ca>
@paul43210 paul43210 force-pushed the fix/search-indexer-backoff branch from 1493750 to c46980a Compare April 16, 2026 12:28
@mmattel mmattel enabled auto-merge (squash) April 17, 2026 08:41
@mmattel mmattel merged commit cc9b170 into owncloud:master Apr 17, 2026
54 checks passed
