Skip to content

Await in-flight cache downloads + reconcile stranded download flags#469

Open
jubishop wants to merge 27 commits into
mainfrom
cache-await-downloads
Open

Await in-flight cache downloads + reconcile stranded download flags#469
jubishop wants to merge 27 commits into
mainfrom
cache-await-downloads

Conversation

@jubishop

@jubishop jubishop commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Fixes #503

Generic CacheManager API split out of #461 so it can be reviewed and merged on its own. No transcription code here.

What

Await a download instead of polling. CacheManager.cachedURL(downloadingIfNeeded:) ensures an episode's audio is cached — starting a download if needed — and suspends until it finishes, returning the cached URL (or nil if the episode vanished or the download failed). It's event-driven via a per-episode AsyncLatch: CacheBackgroundDelegate opens the latch on every terminal callback (didFinishDownloadingTo and didCompleteWithError), keyed off the task's taskDescription so a deleted row still unblocks the awaiter.

Force-quit recovery. A background task cancelled by a force-quit never delivers its completion callback, stranding downloading == true with no live task.

  • reconcileStaleDownloads() runs on launch and clears any downloading row absent from the recreated session's live-task list, via the new Repo.downloadingEpisodeIDs().
  • timeoutIntervalForResource is capped well below the framework's multi-day default so a download that never progresses is abandoned in bounded time.

Why now

cachedURL(downloadingIfNeeded:)'s only consumer is the manual-transcription work in #461 (it needs the audio on disk before transcribing). The capability is generic and self-contained, so it lands here with its own tests; #461 will rebase onto main once this merges.

Tests

  • CacheManagerAwaitTests (new): immediate return when already cached, start-and-await-completion, and resolve-nil-on-failure.
  • CacheManagerReconcileTests (new): a stranded flag is cleared on start() while a live download is left untouched.

All four CacheManager* suites pass on My Mac (36 tests), zero compiler warnings.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved download state recovery when the app is force-quit or backgrounded
    • Enhanced cleanup of failed or incomplete downloads
    • Fixed potential issues with concurrent download requests for the same episode
    • Added timeout protection for stalled downloads to prevent indefinite hangs
    • Better coordination of download completion signals to prevent stranded requests

Add CacheManager.cachedURL(downloadingIfNeeded:) so callers can suspend
until an episode's audio is cached instead of polling, backed by a
per-episode AsyncLatch that CacheBackgroundDelegate opens on every
terminal download callback (finish or failure).

Harden background downloads against force-quit:
- reconcileStaleDownloads() clears downloading flags that have no live
  task on launch, via the new Repo.downloadingEpisodeIDs()
- cap timeoutIntervalForResource so a download that never progresses is
  abandoned in bounded time

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

This comment has been minimized.

Comment thread PodHaven/Cache/CacheManager.swift Outdated
Comment thread PodHaven/Cache/CacheManager.swift Outdated
Comment thread PodHaven/Cache/CacheManager.swift
Comment thread PodHaven/Cache/CacheManager.swift Outdated
jubishop and others added 3 commits June 14, 2026 22:20
- Refcount per-episode download latches so a cancelled or early-returning
  awaiter can't remove the shared latch and strand a concurrent awaiter.
- Await reconcileStaleDownloads before starting the current-episode/queue
  observers, so a stranded downloading flag is cleared before an observer
  skips the re-download as "already caching".
- Pass the already-fetched episode into ensureDownloadIsActive to drop a
  redundant DB read on the await path.
- Signal download completion when the non-async didFinishDownloadingTo
  bails on a temp-move failure, so an awaiter isn't left hanging.

Regression tests cover the latch-stranding and reconcile-ordering fixes,
both proven failing on the prior code. Review ledger added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the shared per-episode AsyncLatch (monotonic + refcounted) with
one latch per download attempt: downloadToCache registers it before
resume(), and the delegate opens and removes it on the terminal
callback. A finished attempt's latch can no longer resolve a later
awaiter early, matching DownloadManager's per-DownloadTask ownership;
drops the refcount bookkeeping with no re-check loop.

Prune finished/failed tasks from FakeDataFetchable so it mirrors the
real URLSession dropping completed tasks, and cover recovery of a fresh
download after an earlier attempt for the same episode failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai[bot]

This comment was marked as resolved.

The non-async didFinishDownloadingTo move-failure catch signaled the
completion latch but returned without clearing the downloading flag or
progress; the subsequent didCompleteWithError(nil) early-returns on
success, so the episode stayed .caching with no live task until the next
launch's reconcile. Clear progress and (in a Task) the downloading flag
before signaling, mirroring the normal completion paths.

Also correct the downloadLatches comment: the latch is episode-keyed, so
it can't guarantee "a finished download's signal can never resolve a
later one" — state the single-in-flight-attempt-per-episode assumption
instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai[bot]

This comment was marked as resolved.

Comment thread PodHaven/Cache/CacheBackgroundDelegate.swift Outdated
jubishop and others added 11 commits June 15, 2026 09:15
cachedURL(downloadingIfNeeded:) now adopts or registers the per-episode
latch and re-checks live-task state before awaiting, so a background
download reattached after a relaunch (live task, in-memory latch gone)
is awaited rather than returning a premature nil.

Extract clearDownloadState(for:) and call it from every give-up path in
CacheBackgroundDelegate (safe-temp move failure, URL mismatch,
didCompleteWithError). The URL-mismatch path previously returned without
clearing the downloading flag, stranding the episode .caching with no
live task.

Regression tests: awaitsReattachedDownloadWithoutLatch; mismatchedURLRefused
now sets downloading and asserts .uncached.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cachedURL(downloadingIfNeeded:) get-or-creates the per-episode latch when
it suspends, so downloadToCache no longer needs to pre-register one. Doing
so only allocated a latch for every observer-driven download that nobody
awaits; cachedURL is now the sole latch registrar and awaiter.

Update the field, await, and signal comments (and the reattach test
comment) to reflect lazy, awaiter-owned latch creation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cachedURL registered its completion latch after ensureDownloadIsActive
and gated the wait on task liveness. A success signal from
didFinishDownloadingTo fires into an empty slot while the task still
lingers in allTasks (until the later didCompleteWithError), so the
liveness re-check saw a live task and awaited a fresh latch that never
reopened — a permanent hang for a future awaiting consumer.

Clear the downloading flag last on success (after the cached filename is
written) so it is a terminal marker written before the signal, and gate
cachedURL on that flag instead of task liveness: still downloading means
the signal is still coming and will open the latch just registered;
anything else is already resolved. Also drops the skip-wait latch leak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gates loadEpisodeAsset so the success delegate parks mid-completion —
after the file move, before the cached filename is written and the flag
cleared — and asserts a concurrent downloadToCache no-ops. Red on the old
ordering (which cleared downloading before the asset load, briefly
exposing .uncached and inviting a duplicate download), green after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Simplify and tighten comments throughout the cache manager and delegate to improve clarity and reduce verbosity. Remove redundant explanations and consolidate key points into more concise language.

- Shorten comments in CacheBackgroundDelegate about error handling, state clearing, and completion signaling
- Streamline CacheManager comments on latch registration, stale download reconciliation, and timeout configuration
- Tighten test comments to remove unnecessary detail while preserving intent
- Clarify the relationship between downloading flag, file caching, and completion callbacks in simplified language

These changes maintain the technical accuracy of the comments while making them easier to scan and understand at a glance.
…ngs)

Independent code-only round at 278a2d9 confirms the latch/reconcile
mechanism is correct and all 11 prior findings remain settled; logs the
intake state and a dropped sub-threshold latch-leak candidate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ead its episode

The async didFinishDownloadingTo/didCompleteWithError callbacks signal any
awaiter on every exit, but the episode-fetch-throws catch returned without
clearing the downloading flag, stranding the row as .caching with no live
task until next launch (didFinishDownloadingTo also orphaned the temp file).
Clear state via the already-parsed episode id in both catches and drop the
unattributable temp file. Regression tests use a new one-shot
FakeRepo.episodeFetchError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… findings)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… additional session notes

Add comprehensive documentation for the F14 finding regarding background URLSession completion timing issues. Include the complete triage session from 2026-06-15T15:07:39 detailing why F14 was escalated to a follow-up issue (#482) rather than fixed in-PR, with evidence that the issue is pre-existing in main and not introduced by #469. Add the intermediate code-only review session (2026-06-15T14:51:01) that identified the pending background URLSession completion concern. Update the findings summary count to reflect 10 Warnings, 12 fixed items, and 1 follow-up issue. Document the F14 finding with full severity, architecture, location, and recommendation details, including the decision rationale and reframed fix approach around missing beginBackgroundTask assertion and deferred completion handling.
F15 (non-atomic same-episode download claim) decided as a follow-up rather than fixed in-PR: downloadToCache's claim is pre-existing/byte-identical to main, and the premature-nil impact rides on the unshipped await API. Filed #483; also folds in the prior session's F15-pending intake.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread PodHaven/Cache/CacheManager.swift Outdated
Comment thread PodHaven/Cache/CacheManager.swift Outdated
Comment thread PodHaven/Cache/CacheManager.swift
Comment thread PodHaven/Cache/CacheManager.swift Outdated
Comment thread PodHaven/Cache/CacheManager.swift Outdated
coderabbitai[bot]

This comment was marked as resolved.

jubishop and others added 5 commits June 16, 2026 10:53
…eRabbit

Fresh code-only pass at HEAD found no new findings; all 16 prior findings
remain settled (13 fixed, 1 skipped, 2 decided to follow-ups #482/#483).
Dismissed a new non-minimized CodeRabbit review claiming the
requestMidCompletionDoesNotRestartDownload test leaks its asset-loader
handler: .container (FactoryKit ContainerTrait) gives each test a fresh
Container + cloned singleton scope, and fakeEpisodeAssetLoader is
.scope(.cached), so injected state is per-test isolated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…findings)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jubishop

Copy link
Copy Markdown
Owner Author

Real-world instance of the strand this PR's reconcileStaleDownloads() fixes.

Sentry user feedback podhaven:7566596723 (TestFlight 1.0+537, anonymous):

Why is "experts warn of superintelligence explosion" in the queue but never caching?

Analysis of the attached NDJSON logs: episode 147064 ("Experts Warn About the Intelligence Explosion") was auto-queued by a background feedRefresh BGTask and started downloading (updateDownloading: 147064 to true) at 06-21 09:56:13 PT. iOS expired the BGTask at 09:56:31 (BackgroundTaskScheduler: expiration triggered, cancelling running task) while the download was still in flight. Across the next four launches (~6 h) no CacheBackgroundDelegate callback ever fired for it, so downloading stayed true → permanent .caching → every later downloadToCache(147064) no-oped on the .caching guard. It was the only one of ~15 downloads in 3 days left stranded.

main has no recovery for this. This PR fixes it: reconcileStaleDownloads() runs in start() before the observers, finds 147064 absent from the recreated session's allCreatedTasks, clears the flag, and the queue observer re-downloads it — exactly the reconcilesStaleDownloadsOnStart regression test. (It also closes the in-delegate strand paths via clearDownloadState.)

Build 537 was cut from main on 2026-06-20; this PR has been green/mergeable since 2026-06-17 but wasn't merged yet — that gap is why the user hit it. Merging this resolves the report.

Triage notes: memory/sentry_feedback/podhaven-7566596723.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recover queued episodes stuck permanently in .caching after an interrupted background download

2 participants