TMDB matching produces wrong cache entries for ~3–10% of watch-history titles #51

@abhichandra21

Summary

The TMDB matching pipeline (recommender/tmdb_client.py → recommender/watch_index.py → recommender/cache/tmdb/*.json) returns wrong matches for a small but persistent fraction of ingested titles. The wrong matches then poison everything downstream: enrichments are generated against the wrong work, the taste profile inherits noise, and recommendations for the affected titles are off.

The audit infrastructure (recommender/setup.py::_audit_cache_mismatches) surfaces these problems at the end of --refresh-data, but it only reports — there is no feedback loop into the matcher, no automated correction, and the noise filter is too strict so the signal gets lost.

This issue tracks the design and implementation of a real fix. Do not patch this piecemeal; the symptoms below are connected.

Real-world impact

Counts from a recent run (1,993 indexed titles):

| Category | Count |
| --- | --- |
| Unmatched titles (no TMDB ID returned) | 10 |
| Title mismatches (cache title differs from source) | 105–299, depending on audit-noise filter |
| Year mismatches (abs(cached_year − source_year) > 2) | 7 |
| Runtime mismatches (>30% delta) | 47 |
| Weak matches (no poster, zero votes, low popularity) | 4 |

Of these, after manual triage, roughly 50–60 are real wrong matches; the rest are cosmetic noise (punctuation/diacritic differences, runtime checks that don't apply to platforms without duration data).

Failure modes (concrete examples)

A. Generic / short titles map to the wrong popular work

No disambiguation hints reach the matcher, so TMDB search ranking picks the most popular candidate regardless of whether it's the right one.

"Don"            -> America's Sweethearts   (should be 2006 Hindi film)
"Tiger"          -> wrong wildlife doc      (should be 2023 Disney+ doc)
"Goodbye"        -> The Exorcism            (should be 2022 Hindi film)
"The After"      -> Taylor Swift: Eras Tour (should be 2023 Netflix short)
"10 Years"       -> From Up on Poppy Hill   (should be 2011 reunion film)

B. Year-specific reboots/sequels resolve to wrong release

When release_year_hint is available it would disambiguate. But:

  • release_year_hint is not persisted by event_store.py — round-tripping through SQLite drops it. Year-based disambiguation works during fresh-parse --refresh-data but is invisible to anything reading from events.db.
  • Disney rows have no year hint at all (no duration → no derived year).

"Wonder Woman"      -> 2017 (should be Wonder Woman 1984, 2020)
"Jurassic World"    -> 2015 (should be Fallen Kingdom 2018)
"The Darkest Hour"  -> Ghost Rider 2011 (should be Darkest Hour 2017)
"Disobedience"      -> 1993 film (should be 2017 Rachel Weisz film)

C. Bonus content (clips, trailers, featurettes, deleted scenes) match as movies

These entries should never reach TMDB at all — they're not real watch events. Disney's parser drops trailers; the others don't filter anything. Apple TV's _SKIP_SUBTYPES covers FeaturedPromo, Preview, Promotional, Bonus but not Disney-style content names.

"Ferdinand Clip"
"The Santa Clause Clip"
"Aladdin's Video Journal: A New Fantastic Point of View"
"Song Breakdowns: 'Under the Sea'"
"Stunts | More from Pandora's Box | Avatar: The Way of Water"
"Deleted Song: 'Desert Moon'"
"Descendants: The Rise of Red Sing- Along"   (matched to an adult film)
"Raya and the Last Dragon רליירט"            (Hebrew 'trailer' — non-Latin)

D. Non-English / non-Latin-script titles don't get language-biased search

TMDB's default search is English-popularity-weighted. Hindi/Bollywood titles like Don, Goodbye, Begum Jaan, Dhadak, Gulabo Sitabo, Phas Gaye Re Obama, Laapataa Ladies get out-ranked by English titles with the same word.

tmdb_client.py has no with_original_language hint passed in any path. Worth verifying.

E. Runtime mismatch flags are largely false positives

Disney events synthesize a 45-min runtime from MANUAL_TV_DURATION_MINUTES; comparing that against a 22-min Bluey episode triggers the >30% rule. Same for manual entries. The runtime check is informative for Netflix/Prime where duration is real but misleading everywhere else.

F. User overrides (data/overrides.json) are not validated

Nothing checks that {"tmdb_id": X} actually resolves to the intended work. In this session an LLM-assisted batch override produced 30 entries that pinned titles to the same wrong IDs the audit was reporting (the LLM copied each "cached as X" ID from the audit output straight back as an override), breaking those entries until they were manually corrected. The system silently accepted them.

"Don":          {"tmdb_id": 11467}    // = America's Sweethearts
"Foundation":   {"tmdb_id": 84958}    // = Loki
"Apollo 11":    {"tmdb_id": 553016}   // = Le Pont des Broignes
"The Dry":      {"tmdb_id": 556678}   // = Emma.
... 26 more

Five additional overrides point to TMDB IDs that don't exist at all (Kuch Luv Jaisaa → 65593, Princess Diana → 430857, Sudha Murthy → tv/198603, Road Rage → tv/135718, Tiger → 1122822). Those silently fail and the title stays unmatched.

Root causes

  1. Hint pipeline is incomplete. MatchHints exists and the matcher uses it, but release_year_hint doesn't persist in SQLite and language is never inferred or passed.
  2. No post-match validation. The matcher returns the first/best TMDB candidate without verifying that the cached title is plausibly the source title.
  3. Ingestion is too permissive. Bonus content (clips, trailers, featurettes, "Sing-Along", "Inside ...", "Behind the Scenes", "Deleted Scene", non-Latin trailer words) reaches TMDB lookup when it shouldn't.
  4. Overrides are blindly trusted. A {"tmdb_id": X} override is accepted at face value.
  5. Audit is reporting-only. No feedback loop into the matcher; noise filter is too strict to surface real bugs cleanly.

Proposed architecture

A multi-layered fix. Land each layer as its own PR so the impact of each is isolated and reviewable.

Layer 1 — Ingestion filters (smallest, highest signal-to-noise)

  • Extend each parser (netflix.py, prime.py, apple_tv.py, disney.py) with a shared bonus-content regex covering: \bclip\b, \btrailer\b, \bfeaturette\b, \b(behind the scenes|bts)\b, \bsing-?along\b, \binside\b (when paired with |), \bdeleted\s+(scene|song)\b, \bvideo journal\b, \bsong breakdown(s)?\b, \bpromo(tional)?\b. Also drop entries where the program contains pipe-separated featurette markers (X | Y | Z).
  • Add a Unicode-aware trailer word match (Hebrew טריילר, Hindi ट्रेलर, etc.) or simply skip rows whose title is mostly non-Latin script and matches a known short blocklist.
  • Test: feed fixtures of known bonus rows; assert they are dropped.
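
A minimal sketch of what the shared Layer 1 filter could look like, using the bonus rows from failure mode C as fixtures. The names `is_bonus_content`, `_BONUS_RE` and `_NON_LATIN_TRAILER_WORDS` are hypothetical and do not exist in the repo. The sketch makes three assumptions beyond the bullet list: `sing-?\s*along` tolerates the stray space seen in "Sing- Along", two or more pipes count as a featurette marker, and non-Latin trailer words use a plain substring test rather than `\b`.

```python
import re
import unicodedata

# Hypothetical shared pattern; alternatives mirror the Layer 1 bullet list.
_BONUS_RE = re.compile(
    r"""
      \bclip\b
    | \btrailer\b
    | \bfeaturette\b
    | \b(?:behind\ the\ scenes|bts)\b
    | \bsing-?\s*along\b
    | \bdeleted\s+(?:scene|song)\b
    | \bvideo\ journal\b
    | \bsong\ breakdowns?\b
    | \bpromo(?:tional)?\b
    """,
    re.IGNORECASE | re.VERBOSE,
)
_INSIDE_RE = re.compile(r"\binside\b", re.IGNORECASE)

# Short blocklist of non-Latin "trailer" words (Hebrew, Hindi), NFKC-normalized.
_NON_LATIN_TRAILER_WORDS = tuple(
    unicodedata.normalize("NFKC", word) for word in ("טריילר", "ट्रेलर")
)


def is_bonus_content(title: str) -> bool:
    """Return True when a row looks like a clip, trailer, or featurette."""
    text = unicodedata.normalize("NFKC", title)
    if _BONUS_RE.search(text):
        return True
    if any(word in text for word in _NON_LATIN_TRAILER_WORDS):
        return True
    # Assumption: "X | Y | Z" (two or more pipes) marks a featurette row.
    if text.count("|") >= 2:
        return True
    # "inside" only counts when the title also contains a pipe.
    return "|" in text and _INSIDE_RE.search(text) is not None
```

A filter this blunt would also drop a legitimate work whose title contains one of the keywords (for example a series with "Trailer" in its name), so each parser would probably need an allowlist or a log of dropped rows for review.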

Layer 2 — Persist all hints (event_store.py)

  • Add release_year_hint INTEGER, language_hint TEXT columns to the events table. Migrate existing DB by re-deriving from raw events on next refresh.
  • Detect language at parse time: Devanagari/Hebrew/CJK/Arabic script in the source title → set language_hint. Otherwise leave null.
  • Update _build_hints_map to use both fields.
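
The Layer 2 steps above can be sketched roughly as follows. This assumes the table is literally named `events`; everything else about the schema is unknown, and `ensure_hint_columns`, `detect_language_hint` and the script-to-language mapping are hypothetical. The mapping is a simplification: Devanagari is also used for Marathi and Nepali, and Han characters alone cannot distinguish Chinese from Japanese.

```python
import sqlite3
import unicodedata
from typing import Optional

# Hypothetical mapping from Unicode character-name prefix to a language code.
_SCRIPT_TO_LANGUAGE = {
    "DEVANAGARI": "hi",
    "HEBREW": "he",
    "ARABIC": "ar",
    "HANGUL": "ko",
    "HIRAGANA": "ja",
    "KATAKANA": "ja",
    "CJK": "zh",
}


def ensure_hint_columns(conn: sqlite3.Connection) -> None:
    """Add the two hint columns to `events` if they are missing (idempotent)."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(events)")}
    for column, sql_type in (
        ("release_year_hint", "INTEGER"),
        ("language_hint", "TEXT"),
    ):
        if column not in existing:
            conn.execute(f"ALTER TABLE events ADD COLUMN {column} {sql_type}")
    conn.commit()


def detect_language_hint(title: str) -> Optional[str]:
    """Return a language code for the first non-Latin letter found, else None."""
    for ch in title:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script, language in _SCRIPT_TO_LANGUAGE.items():
            if name.startswith(script):
                return language
    return None
```

Script detection only helps titles exported in their native script. Romanized titles such as "Don" or "Goodbye" return None here, so they would still depend on the year hint and the Layer 3 validator.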

Layer 3 — Strengthen the matcher (tmdb_client.py)

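  • When language_hint is set, bias the TMDB lookup toward that language. Note that with_original_language is documented as a filter on TMDB's /discover endpoints rather than a /search parameter, so this likely means re-ranking search candidates by their original_language field instead of passing the hint to search directly. Verify against the TMDB API reference before implementing.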
  • When release_year_hint is set, re-rank candidates penalising those further from the year (already partially present — verify it actually fires).
  • Add a post-search validator: compute _titles_are_compatible(source_title, candidate.title) and _titles_are_compatible(source_title, candidate.original_title) for each top-N candidate; if the chosen winner fails BOTH checks AND alternatives pass, pick the alternative.

Layer 4 — Override sanity check (recommender/overrides.py)

  • At load time, for every {"tmdb_id": X} entry: load cache/tmdb/<ct>/<X>.json, compute normalized-title similarity to the source key. If the cache title is wildly unrelated (no shared content tokens, no substring overlap, no language match), log a loud warning and ignore the override (fall back to fresh search). Permanently fixes the LLM-poisoning failure mode.
  • Optionally: emit an overrides_warnings.txt next to the audit listing rejected overrides so the user can clean up data/overrides.json.
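
A rough sketch of the load-time check. It assumes each cache file holds the raw TMDB payload (`title`/`original_title` for movies, `name`/`original_name` for TV) and that IDs are either bare integers (treated as movies) or strings like `tv/198603`. Both are guesses based on the examples in this issue, and `validate_overrides` is a hypothetical name. The language-match test is left out, and overrides with no cache file are kept but reported as unverified.

```python
from __future__ import annotations

import json
import re
from pathlib import Path

_STOPWORDS = {"the", "a", "an", "of", "and"}


def _content_tokens(title: str) -> set[str]:
    return {t for t in re.findall(r"\w+", title.casefold()) if t not in _STOPWORDS}


def _related(source: str, cached: str) -> bool:
    """True when the titles share a content token or one contains the other."""
    a, b = " ".join(sorted(_content_tokens(source))), _content_tokens(cached)
    if _content_tokens(source) & b:
        return True
    src, dst = source.casefold().strip(), cached.casefold().strip()
    return bool(a) and bool(src) and bool(dst) and (src in dst or dst in src)


def validate_overrides(
    overrides: dict[str, dict], cache_root: Path
) -> tuple[dict[str, dict], list[str]]:
    """Split overrides into accepted entries and human-readable warnings."""
    accepted: dict[str, dict] = {}
    warnings: list[str] = []
    for source_title, entry in overrides.items():
        raw_id = str(entry["tmdb_id"])
        # "tv/198603" -> ("tv", "198603"); "11467" -> ("movie", "11467")
        content_type, _, tmdb_id = raw_id.rpartition("/")
        cache_file = cache_root / (content_type or "movie") / f"{tmdb_id}.json"
        if not cache_file.exists():
            warnings.append(f"{source_title!r}: no cache entry for {raw_id}, unverified")
            accepted[source_title] = entry
            continue
        cached = json.loads(cache_file.read_text(encoding="utf-8"))
        names = [
            cached.get("title") or cached.get("name") or "",
            cached.get("original_title") or cached.get("original_name") or "",
        ]
        if any(_related(source_title, name) for name in names if name):
            accepted[source_title] = entry
        else:
            warnings.append(
                f"{source_title!r}: override {raw_id} resolves to {names[0]!r}, ignored"
            )
    return accepted, warnings
```

The returned warnings list could be written out as the overrides_warnings.txt file mentioned above. Whether an unverifiable ID should be kept or dropped is a design choice this sketch does not settle.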

Layer 5 — Audit improvements

  • Suppress runtime-mismatch flags for events whose total_duration is synthesized (manual, disney). Only check runtime for apple_tv/netflix/prime where it's real.
  • Permissive title-compatibility: strip diacritics, drop articles, normalize "&"→"and", check both directions of substring.
  • A "real bugs only" view that shows only entries failing multiple checks (e.g., title AND year both off → high confidence real bug).
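
The three audit changes could look something like the sketch below. All function names are hypothetical. The platform labels are taken from the first bullet, the year threshold of 2 matches the audit's existing year-mismatch rule, and the substring test is intentionally permissive because it is meant for reporting, not matching.

```python
import re
import unicodedata
from typing import Optional

_ARTICLES = {"the", "a", "an"}
# Platforms whose exports carry a real duration, per the first bullet above.
_REAL_DURATION_SOURCES = {"apple_tv", "netflix", "prime"}


def normalize_title(title: str) -> str:
    """Casefold, map '&' to 'and', strip diacritics, and drop articles."""
    text = title.casefold().replace("&", " and ")
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    tokens = [t for t in re.findall(r"\w+", no_marks) if t not in _ARTICLES]
    return " ".join(tokens)


def titles_are_compatible(source: str, cached: str) -> bool:
    """Permissive check: equal after normalization, or a substring either way."""
    a, b = normalize_title(source), normalize_title(cached)
    if not a or not b:
        return False
    return a in b or b in a


def should_check_runtime(platform: str) -> bool:
    """Skip runtime checks where total_duration is synthesized (manual, disney)."""
    return platform in _REAL_DURATION_SOURCES


def is_high_confidence_bug(
    source_title: str,
    cached_title: str,
    source_year: Optional[int],
    cached_year: Optional[int],
) -> bool:
    """'Real bugs only' view: the title AND the year must both be off."""
    title_off = not titles_are_compatible(source_title, cached_title)
    year_off = (
        source_year is not None
        and cached_year is not None
        and abs(cached_year - source_year) > 2
    )
    return title_off and year_off
```

Requiring two failed checks trades recall for precision: a wrong match released in the same year as the right one would not appear in this view, so the full listing should probably remain available behind a flag.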

Acceptance criteria

  • A fresh setup --refresh-data run against the real data set produces an audit with ≤ 20 entries flagged as real bugs (currently ~50–60).
  • No clips/trailers/featurettes reach the watch index.
  • Hindi/Bollywood titles in the source export correctly resolve to their Indian-cinema counterparts on TMDB.
  • An override pinning "Don" to tmdb_id: 11467 is detected as bogus at load time, logged loudly, and ignored (fresh search runs instead).
  • Existing tests pass; new tests cover the ingestion filters, hint persistence, language-biased search, and override sanity check.
  • The audit's truncated-to-10 console output now reflects only real bugs, not cosmetic noise.

Out of scope

Pointers to current code

  • Matcher entry point: recommender/tmdb_client.py::TmdbClient.get_metadata
  • Hint construction: recommender/setup.py::_build_hints_map
  • Audit: recommender/setup.py::_audit_cache_mismatches
  • Override loader: recommender/overrides.py::load
  • Event store schema: recommender/event_store.py
  • Existing tests: tests/test_tmdb_client.py, tests/test_watch_index.py, tests/test_main.py (audit tests)

What was tried in this session (for context, not a recommendation)

  1. LLM-generated overrides via the dumped audit file. Failed: the LLM parroted the audit's "cached as X" IDs back as overrides, producing 30 broken entries. Documented in this issue and root-caused under failure mode F.
  2. Improving the audit's _titles_are_compatible function (diacritic strip, article drop, bidirectional substring). Cut audit noise by ~65% but doesn't fix any actual matches — purely a reporting improvement. Reverted in this session per request so the matching fix can be designed cleanly.

Estimated scope

5 PRs (one per layer), 3–5 days total focused work. Layer 1 alone (ingestion filters) is a half-day and would resolve a meaningful fraction of the unmatched/weak-match noise. Layer 4 (override sanity check) is also a half-day and permanently closes the LLM-poisoning failure mode.
