TMDB matching produces wrong cache entries for ~3–10% of watch-history titles #51

## Summary

The TMDB matching pipeline (`recommender/tmdb_client.py` → `recommender/watch_index.py` → `recommender/cache/tmdb/*.json`) returns wrong matches for a small but persistent fraction of ingested titles. The wrong matches then poison everything downstream: enrichments are generated against the wrong work, the taste profile inherits noise, and recommendations for the affected titles are off.

The audit infrastructure (`recommender/setup.py::_audit_cache_mismatches`) surfaces these problems at the end of `--refresh-data`, but it only reports: there is no feedback loop into the matcher and no automated correction, and the audit's title-compatibility check is strict enough that cosmetic differences drown out the real signal.

This issue tracks the design and implementation of a real fix. **Do not patch this piecemeal**; the symptoms below are connected.

## Real-world impact

Counts from a recent run (1,993 indexed titles):

| Category | Count |
|---|---|
| Unmatched titles (no TMDB ID returned) | 10 |
| Title mismatches (cache title differs from source) | 105–299 depending on audit-noise filter |
| Year mismatches (`abs(cached_year − source_year) > 2`) | 7 |
| Runtime mismatches (`>30%` delta) | 47 |
| Weak matches (no poster, zero votes, low popularity) | 4 |

Of these, after manual triage, roughly **50–60 are real wrong matches**; the rest are cosmetic noise (punctuation/diacritic differences, runtime checks that don't apply to platforms without duration data).

## Failure modes (concrete examples)

### A. Generic / short titles map to the wrong popular work

No disambiguation hints reach the matcher, so TMDB search ranking picks the most popular candidate regardless of whether it's the right one.

```
"Don"            -> America's Sweethearts   (should be 2006 Hindi film)
"Tiger"          -> wrong wildlife doc      (should be 2023 Disney+ doc)
"Goodbye"        -> The Exorcism            (should be 2022 Hindi film)
"The After"      -> Taylor Swift: Eras Tour (should be 2023 Netflix short)
"10 Years"       -> From Up on Poppy Hill   (should be 2011 reunion film)
```

### B. Year-specific reboots/sequels resolve to wrong release

A `release_year_hint` would disambiguate these cases when it is available, but:
- `release_year_hint` is **not persisted** by `event_store.py` — round-tripping through SQLite drops it. Year-based disambiguation works during fresh-parse `--refresh-data` but is invisible to anything reading from `events.db`.
- Disney rows have no year hint at all (no duration → no derived year).

```
"Wonder Woman"      -> 2017 (should be 1984/2020)
"Jurassic World"    -> 2015 (should be Fallen Kingdom 2018)
"The Darkest Hour"  -> Ghost Rider 2011 (should be Darkest Hour 2017)
"Disobedience"      -> 1993 film (should be 2017 Rachel Weisz film)
```

### C. Bonus content (clips, trailers, featurettes, deleted scenes) match as movies

These entries should never reach TMDB at all: they are not real watch events. Disney's parser drops trailers; the other parsers do no title-based filtering. Apple TV's `_SKIP_SUBTYPES` covers the `FeaturedPromo, Preview, Promotional, Bonus` subtypes but not Disney-style content names.

```
"Ferdinand Clip"
"The Santa Clause Clip"
"Aladdin's Video Journal: A New Fantastic Point of View"
"Song Breakdowns: 'Under the Sea'"
"Stunts | More from Pandora's Box | Avatar: The Way of Water"
"Deleted Song: 'Desert Moon'"
"Descendants: The Rise of Red Sing- Along"   (matched to an adult film)
"Raya and the Last Dragon רליירט"            (Hebrew 'trailer' — non-Latin)
```

### D. Non-English / non-Latin-script titles don't get language-biased search

TMDB's default search is English-popularity-weighted. Hindi/Bollywood titles like `Don`, `Goodbye`, `Begum Jaan`, `Dhadak`, `Gulabo Sitabo`, `Phas Gaye Re Obama`, `Laapataa Ladies` get out-ranked by English titles with the same word.

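`tmdb_client.py` does not appear to pass a language hint on any path (worth verifying). Note that `with_original_language` is documented as a filter on TMDB's `/discover` endpoints, not on `/search`, so a language bias would likely have to be applied by re-ranking search results on their `original_language` field.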

### E. Runtime mismatch flags are largely false positives

Disney events synthesize a 45-min runtime from `MANUAL_TV_DURATION_MINUTES`; comparing that against a 22-min Bluey episode triggers the >30% rule. Same for manual entries. The runtime check is informative for Netflix/Prime where duration is real but misleading everywhere else.
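
To make the arithmetic concrete: a synthesized 45-minute duration against a 22-minute episode is a delta of |45 − 22| / 22 ≈ 105% when measured against the TMDB runtime, well past the 30% threshold. Below is a minimal sketch of a source-aware check. The function name, the `REAL_DURATION_SOURCES` set, and the choice of denominator are assumptions for illustration, not the audit's actual code.

```python
from typing import Optional

# Assumption: sources whose exported duration is real rather than synthesized.
# The paragraph above names Netflix/Prime; Layer 5 also lists apple_tv, so
# confirm which sources actually carry real durations before adopting this.
REAL_DURATION_SOURCES = {"netflix", "prime"}


def runtime_mismatch(
    source: str,
    source_minutes: Optional[float],
    tmdb_runtime_minutes: Optional[float],
    threshold: float = 0.30,
) -> Optional[bool]:
    """Return True/False for a real comparison, or None when the check does not apply."""
    if source not in REAL_DURATION_SOURCES:
        return None  # synthesized duration (e.g. disney, manual): nothing to compare
    if not source_minutes or not tmdb_runtime_minutes:
        return None  # missing data on either side
    # Assumption: the delta is measured relative to the TMDB runtime.
    delta = abs(source_minutes - tmdb_runtime_minutes) / tmdb_runtime_minutes
    return delta > threshold
```

Returning `None` instead of `False` keeps "not applicable" distinguishable from "checked and fine", which matters if the audit later counts how many checks an entry failed.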

### F. User overrides (`data/overrides.json`) are not validated

Nothing checks that `{"tmdb_id": X}` actually resolves to the intended work. In this session an LLM-assisted batch override produced 30 entries that pinned titles to the *same wrong IDs* the audit was reporting (the LLM parroted the `cached as X` ID), permanently breaking those entries until manually corrected. The system silently accepted them.

```jsonc
{
  "Don":        {"tmdb_id": 11467},   // = America's Sweethearts
  "Foundation": {"tmdb_id": 84958},   // = Loki
  "Apollo 11":  {"tmdb_id": 553016},  // = Le Pont des Broignes
  "The Dry":    {"tmdb_id": 556678}   // = Emma.
  // ... 26 more
}
```

Five additional overrides point to TMDB IDs that don't exist at all (`Kuch Luv Jaisaa → 65593`, `Princess Diana → 430857`, `Sudha Murthy → tv/198603`, `Road Rage → tv/135718`, `Tiger → 1122822`). Those silently fail and the title stays unmatched.

## Root causes

1. **Hint pipeline is incomplete.** `MatchHints` exists and the matcher uses it, but `release_year_hint` doesn't persist in SQLite and language is never inferred or passed.
2. **No post-match validation.** The matcher returns the first/best TMDB candidate without verifying that the cached title is plausibly the source title.
3. **Ingestion is too permissive.** Bonus content (clips, trailers, featurettes, "Sing-Along", "Inside ...", "Behind the Scenes", "Deleted Scene", non-Latin trailer words) reaches TMDB lookup when it shouldn't.
4. **Overrides are blindly trusted.** A `{"tmdb_id": X}` override is accepted at face value.
5. **Audit is reporting-only.** There is no feedback loop into the matcher, and the title-compatibility check is too strict, so cosmetic noise buries the real bugs.

## Proposed architecture

A multi-layered fix. Land each layer as its own PR so the impact of each is isolated and reviewable.

### Layer 1 — Ingestion filters (smallest, highest signal-to-noise)
- Extend each parser (`netflix.py`, `prime.py`, `apple_tv.py`, `disney.py`) with a shared bonus-content regex covering: `\bclip\b`, `\btrailer\b`, `\bfeaturette\b`, `\b(behind the scenes|bts)\b`, `\bsing-?along\b`, `\binside\b` (when paired with `|`), `\bdeleted\s+(scene|song)\b`, `\bvideo journal\b`, `\bsong breakdown(s)?\b`, `\bpromo(tional)?\b`. Also drop entries where the program contains pipe-separated featurette markers (`X | Y | Z`).
- Add a Unicode-aware trailer word match (Hebrew `טריילר`, Hindi `ट्रेलर`, etc.) or simply skip rows whose title is mostly non-Latin script and matches a known short blocklist.
- Test: feed fixtures of known bonus rows; assert they are dropped.

### Layer 2 — Persist all hints (`event_store.py`)
- Add `release_year_hint INTEGER`, `language_hint TEXT` columns to the events table. Migrate existing DB by re-deriving from raw events on next refresh.
- Detect language at parse time: Devanagari/Hebrew/CJK/Arabic script in the source title → set `language_hint`. Otherwise leave null.
- Update `_build_hints_map` to use both fields.
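
The two steps above could look roughly like the sketch below. It assumes the table is literally named `events` and that a script-to-language mapping is an acceptable first approximation; both are assumptions, and the helper names are hypothetical. The mapping is lossy by nature (Devanagari is not only Hindi, Arabic script is not only Arabic, CJK ideographs are shared by Chinese and Japanese).

```python
import sqlite3
import unicodedata
from typing import Optional

# Checked in order: kana before CJK so that Japanese titles mixing kana
# and kanji resolve to "ja" rather than "zh".
_SCRIPT_PRIORITY = [
    ("DEVANAGARI", "hi"),
    ("HEBREW", "he"),
    ("ARABIC", "ar"),
    ("HIRAGANA", "ja"),
    ("KATAKANA", "ja"),
    ("HANGUL", "ko"),
    ("CJK", "zh"),
]


def detect_language_hint(title: str) -> Optional[str]:
    """Guess a language code from the scripts present in the title, else None."""
    scripts = {
        unicodedata.name(ch, "").split(" ")[0]  # e.g. "HEBREW LETTER TET" -> "HEBREW"
        for ch in title
        if ch.isalpha()
    }
    for script, lang in _SCRIPT_PRIORITY:
        if script in scripts:
            return lang
    return None


def ensure_hint_columns(conn: sqlite3.Connection, table: str = "events") -> None:
    """Add the hint columns if they are missing; safe to call on every start-up."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for column, decl in (("release_year_hint", "INTEGER"), ("language_hint", "TEXT")):
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
```

One limitation worth noting: romanized titles such as `Don` or `Begum Jaan` contain only Latin letters, so script detection returns `None` for them. Covering those cases would need a different signal than the title's script.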

### Layer 3 — Strengthen the matcher (`tmdb_client.py`)
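- When `language_hint` is set, bias the search toward that language. `with_original_language` is documented for TMDB's `/discover` endpoints rather than `/search`, so this likely means re-ranking or filtering search results by their `original_language` field (verify against the TMDB API reference).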
- When `release_year_hint` is set, re-rank candidates penalising those further from the year (already partially present — verify it actually fires).
- Add a post-search validator: compute `_titles_are_compatible(source_title, candidate.title)` and `_titles_are_compatible(source_title, candidate.original_title)` for each top-N candidate; if the chosen winner fails BOTH checks AND alternatives pass, pick the alternative.
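
A sketch of how re-ranking and the post-search validator might fit together. It assumes candidates are dicts shaped like TMDB movie search results (`title`, `original_title`, `release_date`, `original_language`); the function names, the language bonus of 5, and the word-boundary substring rule are illustrative choices, not the existing `_titles_are_compatible` implementation.

```python
import unicodedata
from typing import Optional

_ARTICLES = {"the", "a", "an"}


def _norm(title: str) -> str:
    """Lowercase, strip diacritics, map '&' to 'and', drop punctuation and articles."""
    text = unicodedata.normalize("NFKD", title).lower().replace("&", " and ")
    chars = []
    for ch in text:
        if unicodedata.combining(ch):
            continue  # drop accents split off by NFKD
        chars.append(" " if unicodedata.category(ch)[0] in "PS" else ch)
    return " ".join(w for w in "".join(chars).split() if w not in _ARTICLES)


def titles_compatible(a: str, b: str) -> bool:
    """Whole-word substring match in either direction ('don' does not match 'london')."""
    na, nb = _norm(a), _norm(b)
    if not na or not nb:
        return False
    return f" {na} " in f" {nb} " or f" {nb} " in f" {na} "


def pick_candidate(source_title: str, candidates: list,
                   year_hint: Optional[int] = None,
                   language_hint: Optional[str] = None) -> Optional[dict]:
    def score(c: dict) -> float:
        s = 0.0
        year = (c.get("release_date") or "")[:4]
        if year_hint is not None and year.isdigit():
            s -= abs(int(year) - year_hint)  # penalise distance from the hinted year
        if language_hint and c.get("original_language") == language_hint:
            s += 5  # arbitrary bonus; tune against real data
        return s

    def ok(c: dict) -> bool:
        return (titles_compatible(source_title, c.get("title") or "")
                or titles_compatible(source_title, c.get("original_title") or ""))

    # sorted() is stable, so ties keep TMDB's own ranking.
    ranked = sorted(candidates, key=score, reverse=True)
    if not ranked:
        return None
    if ok(ranked[0]):
        return ranked[0]
    # Winner fails both title checks: prefer the first alternative that passes.
    return next((c for c in ranked[1:] if ok(c)), ranked[0])
```

With no hints at all, a search for `Don` whose top result is titled `America's Sweethearts` would fall through to the first candidate actually titled `Don`, which is the behaviour the validator bullet asks for.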

### Layer 4 — Override sanity check (`recommender/overrides.py`)
- At load time, for every `{"tmdb_id": X}` entry: load `cache/tmdb/<ct>/<X>.json`, compute normalized-title similarity to the source key. If the cache title is wildly unrelated (no shared content tokens, no substring overlap, no language match), log a loud warning and **ignore the override** (fall back to fresh search). Permanently fixes the LLM-poisoning failure mode.
- Optionally: emit an `overrides_warnings.txt` next to the audit listing rejected overrides so the user can clean up `data/overrides.json`.

### Layer 5 — Audit improvements
- Suppress runtime-mismatch flags for events whose `total_duration` is synthesized (manual, disney). Only check runtime for apple_tv/netflix/prime where it's real.
- Permissive title-compatibility: strip diacritics, drop articles, normalize "&"→"and", check both directions of substring.
- A "real bugs only" view that shows only entries failing multiple checks (e.g., title AND year both off → high confidence real bug).
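
The "real bugs only" view could be as simple as counting failed checks per entry. The sketch below assumes each audit entry already carries the individual check results; the `AuditEntry` shape and function names are hypothetical, and the year rule reuses the `abs(cached_year − source_year) > 2` threshold from the impact table.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AuditEntry:
    source_title: str
    cached_title: str
    title_mismatch: bool
    source_year: Optional[int] = None
    cached_year: Optional[int] = None
    runtime_mismatch: Optional[bool] = None  # None = check not applicable


def failed_checks(entry: AuditEntry) -> List[str]:
    """Names of the checks this entry fails; unknown or inapplicable checks are skipped."""
    failed = []
    if entry.title_mismatch:
        failed.append("title")
    if (entry.source_year is not None and entry.cached_year is not None
            and abs(entry.cached_year - entry.source_year) > 2):
        failed.append("year")
    if entry.runtime_mismatch:
        failed.append("runtime")
    return failed


def real_bugs_only(entries: List[AuditEntry],
                   min_failed: int = 2) -> List[Tuple[AuditEntry, List[str]]]:
    """Keep only entries failing at least `min_failed` independent checks."""
    result = []
    for entry in entries:
        failed = failed_checks(entry)
        if len(failed) >= min_failed:
            result.append((entry, failed))
    return result
```

Requiring two failed checks trades recall for precision: a wrong match that happens to share a release year with the right one would be hidden from this view, so the full listing should probably stay available alongside it.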

## Acceptance criteria

- [ ] A fresh `setup --refresh-data` run against the real data set produces an audit with **≤ 20 entries flagged as `real bugs`** (currently ~50–60).
- [ ] No clips/trailers/featurettes reach the watch index.
- [ ] Hindi/Bollywood titles in the source export correctly resolve to their Indian-cinema counterparts on TMDB.
- [ ] An override pinning `"Don"` to `tmdb_id: 11467` is detected as bogus at load time, logged loudly, and ignored (fresh search runs instead).
- [ ] Existing tests pass; new tests cover the ingestion filters, hint persistence, language-biased search, and override sanity check.
- [ ] The audit's truncated-to-10 console output now reflects only real bugs, not cosmetic noise.

## Out of scope

- Refactoring the audit dump itself (already in PR-pending #47 branch).
- Adding a new metadata source (Wikipedia, IMDB, etc.) — TMDB is sufficient for almost all titles.
- LLM-assisted override generation (proven unreliable; failure mode is "LLM parrots the bad ID it sees in the audit").

## Pointers to current code

- Matcher entry point: `recommender/tmdb_client.py::TmdbClient.get_metadata`
- Hint construction: `recommender/setup.py::_build_hints_map`
- Audit: `recommender/setup.py::_audit_cache_mismatches`
- Override loader: `recommender/overrides.py::load`
- Event store schema: `recommender/event_store.py`
- Existing tests: `tests/test_tmdb_client.py`, `tests/test_watch_index.py`, `tests/test_main.py` (audit tests)

## What was tried in this session (for context, not a recommendation)

1. **LLM-generated overrides via the dumped audit file.** Failed: the LLM parroted `cached as X` IDs back as overrides, producing 30 broken entries. Documented in this issue and root-caused under failure mode F.
2. **Improving the audit's `_titles_are_compatible` function** (diacritic strip, article drop, bidirectional substring). Cut audit noise by ~65% but doesn't fix any actual matches — purely a reporting improvement. **Reverted in this session per request** so the matching fix can be designed cleanly.

## Estimated scope

5 PRs (one per layer), 3–5 days total focused work. Layer 1 alone (ingestion filters) is a half-day and would resolve a meaningful fraction of the unmatched/weak-match noise. Layer 4 (override sanity check) is also a half-day and permanently closes the LLM-poisoning failure mode.
