Merged
136 changes: 136 additions & 0 deletions docs/superpowers/plans/2026-05-08-filmweb-original-titles.md
@@ -0,0 +1,136 @@
# Filmweb Original Title Extraction Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Improve TMDB matching accuracy by extracting original titles and production years from Filmweb showtimes pages.

**Architecture:** Enrich the `FilmwebScraper` to return `original_title` and `year` along with the Polish `title`. Update the `main.py` loop to prioritize the `original_title` and use the `year` in TMDB searches.

**Tech Stack:** Python, BeautifulSoup4, httpx, pytest.

---

### Task 1: Research and Mock Setup

**Files:**
- Create: `scraper/tests/mock_showtimes.html` (for testing reference)

- [ ] **Step 1: Save a mock HTML snippet of a showtimes page**
We already have the snippet from our research. Save it as `scraper/tests/mock_showtimes.html` so the unit tests can use it.

```python
# No code here, just a task to ensure we have the data for the next step.
```

- [ ] **Step 2: Commit**
```bash
git commit -m "test: prepare mock data for filmweb extraction" --allow-empty
```

### Task 2: Update FilmwebScraper Extraction Logic

**Files:**
- Modify: `scraper/filmweb_scraper.py`
- Test: `scraper/tests/test_filmweb_unit.py` (New file for unit testing)

- [ ] **Step 1: Write a unit test with mocked HTML**
Create `scraper/tests/test_filmweb_unit.py` and add a test that checks for `original_title` and `year`.

```python
import pytest
from bs4 import BeautifulSoup
from scraper.filmweb_scraper import FilmwebScraper

def test_extract_movie_metadata():
    html = """
    <div class="preview__header">
      <h2 class="preview__title"><a class="preview__link" href="/film/Projekt+Hail+Mary-2026-10047841">Projekt Hail Mary</a></h2>
      <div class="preview__headerDetails">
        <div class="preview__alternateTitle">Project Hail Mary</div><wbr>
        <div class="preview__year">2026</div>
      </div>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")
    # We will need to expose or test the internal parsing logic
    # For now, let's assume we update get_warsaw_movies
    # Actually, it's better to refactor a small helper for parsing if possible.
```

- [ ] **Step 2: Run test to verify it fails**
Run: `pytest scraper/tests/test_filmweb_unit.py`
Expected: FAIL (fields not present in output)

- [ ] **Step 3: Update `scraper/filmweb_scraper.py`**
Modify the loop in `get_warsaw_movies` to extract these fields.

```python
# Inside the try block where showtimes_soup is parsed:
alt_title_element = showtimes_soup.select_one(".preview__alternateTitle")
original_title = alt_title_element.text.strip() if alt_title_element else None

year_element = showtimes_soup.select_one(".preview__year")
year = int(year_element.text.strip()) if year_element and year_element.text.strip().isdigit() else None

movies.append({
    "title": title,
    "original_title": original_title,
    "year": year,
    "cinemas": cinemas,
})
```

- [ ] **Step 4: Run test to verify it passes**
Run: `pytest scraper/tests/test_filmweb_unit.py`
Expected: PASS

- [ ] **Step 5: Commit**
```bash
git add scraper/filmweb_scraper.py scraper/tests/test_filmweb_unit.py
git commit -m "feat: extract original title and year from filmweb showtimes"
```
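
The test in Step 1 notes that a small parsing helper would be easier to test than the full `get_warsaw_movies` loop. A minimal sketch of such a helper follows, assuming the selectors from Step 3. The name `extract_metadata` is hypothetical and not part of the existing scraper. It relies only on `select_one` and `get_text`, so it accepts a `BeautifulSoup` object without importing the library itself.

```python
from typing import Any, Dict, Optional


def extract_metadata(soup: Any) -> Dict[str, Optional[Any]]:
    """Pull the original title and year from a parsed showtimes page.

    `soup` is expected to be a BeautifulSoup object (hypothetical helper;
    anything with a compatible `select_one` works). Missing or malformed
    elements yield None instead of raising.
    """
    alt = soup.select_one(".preview__alternateTitle")
    # Compare against None explicitly rather than relying on truthiness.
    original_title = alt.get_text(strip=True) if alt is not None else None

    year_el = soup.select_one(".preview__year")
    year_text = year_el.get_text(strip=True) if year_el is not None else ""
    # isdecimal() guards int() against text such as "TBA" or an empty string.
    year = int(year_text) if year_text.isdecimal() else None

    # Treat an empty alternate title the same as a missing one.
    return {"original_title": original_title or None, "year": year}
```

Called as `extract_metadata(BeautifulSoup(html, "html.parser"))` on the snippet from Step 1, this should return `"Project Hail Mary"` and `2026`. Keeping the parsing in a pure function means the unit test needs no network access.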

### Task 3: Update Main Matching Logic

**Files:**
- Modify: `scraper/main.py`

- [ ] **Step 1: Update the TMDB search call in `main.py`**
Use `original_title` if available and pass `year`.

```python
# Replace:
# tmdb_movie = tmdb.search_movie(title)
# With:
fw_title = fw_movie.get("title")
fw_original_title = fw_movie.get("original_title")
fw_year = fw_movie.get("year")

# Try original title first
tmdb_movie = tmdb.search_movie(fw_original_title or fw_title, year=fw_year)

# Fallback to Polish title if original title search failed and they are different
if not tmdb_movie and fw_original_title and fw_original_title != fw_title:
    print(f"info: TMDB search for original title '{fw_original_title}' failed. Retrying with Polish title '{fw_title}'...")
    tmdb_movie = tmdb.search_movie(fw_title, year=fw_year)
```

- [ ] **Step 2: Commit**
```bash
git add scraper/main.py
git commit -m "feat: use original title and year for TMDB matching"
```
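
The fallback in Step 1 can be checked without calling TMDB if it is wrapped in a function that takes the client as an argument. This is a sketch only: `match_on_tmdb` is a hypothetical name, and the client is assumed to expose `search_movie(title, year=...)` as used in the snippet above.

```python
from typing import Any, Optional


def match_on_tmdb(tmdb: Any, fw_movie: dict) -> Optional[Any]:
    """Search by original title first, then fall back to the Polish title.

    `tmdb` is any object with a `search_movie(title, year=...)` method
    (assumed signature, taken from the plan's snippet).
    """
    title = fw_movie.get("title")
    original_title = fw_movie.get("original_title")
    year = fw_movie.get("year")

    result = tmdb.search_movie(original_title or title, year=year)

    # Retry only when the first attempt used a different, original title.
    if not result and original_title and original_title != title:
        result = tmdb.search_movie(title, year=year)
    return result
```

Passing a stub client that records its calls makes it possible to assert both the result and the order of the searches.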

### Task 4: End-to-End Verification

- [ ] **Step 1: Run a dry run for a known movie**
Manually check whether "Projekt Hail Mary" is correctly matched using the new logic. Since it's a future release, it might not be in the current listings (repertuar), in which case check other movies instead.

- [ ] **Step 2: Run all tests**
Run: `pytest scraper/tests/`
Expected: PASS

- [ ] **Step 3: Commit final changes**
```bash
git commit -m "test: verify filmweb original title matching logic" --allow-empty
```
@@ -0,0 +1,45 @@
# Spec: Filmweb Original Title Extraction for Improved TMDB Matching

**Date:** 2026-05-08
**Status:** Draft
**Topic:** Improving TMDB matching logic by using original titles and years extracted from Filmweb.

## 1. Purpose
The current scraper uses Polish titles from Filmweb to search for movies on TMDB. This often leads to mismatches or "no results" for international films whose Polish titles differ significantly from their English/original titles. Filmweb showtimes pages contain both the original title and the production year, which can be used to perform highly accurate TMDB lookups.

## 2. Success Criteria
- [ ] `FilmwebScraper` correctly extracts `original_title` and `year` from individual movie showtimes pages.
- [ ] `main.py` prioritizes `original_title` for TMDB searches.
- [ ] `main.py` passes the production `year` to TMDB search to filter results.
- [ ] Polish titles remain the primary fallback when an original title is not available.
- [ ] No additional network requests are introduced (metadata is pulled from existing page loads).

## 3. Architecture & Data Flow
### 3.1. Metadata Extraction (`filmweb_scraper.py`)
The `FilmwebScraper.get_warsaw_movies` method already visits each movie's showtimes page (e.g., `https://www.filmweb.pl/film/Projekt+Hail+Mary-2026-10047841/showtimes/Warszawa`).
We will update the parser to extract:
- `original_title`: From `.preview__alternateTitle` text.
- `year`: From `.preview__year` text (parsed as an integer).
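
For illustration, the enriched record returned per movie could be described with a `TypedDict`. This is a sketch under assumptions: the class name `FilmwebMovie` is hypothetical, and the element type of `cinemas` is not specified in this document, so it is left as a plain `list`.

```python
from typing import Optional, TypedDict


class FilmwebMovie(TypedDict):
    """Shape of one scraped movie record (hypothetical type name)."""

    title: str                      # Polish title shown on Filmweb
    original_title: Optional[str]   # from .preview__alternateTitle, None if absent
    year: Optional[int]             # from .preview__year, None if absent or non-numeric
    cinemas: list                   # assumption: element type is not defined here


example: FilmwebMovie = {
    "title": "Projekt Hail Mary",
    "original_title": "Project Hail Mary",
    "year": 2026,
    "cinemas": [],
}
```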

### 3.2. Matching Logic (`main.py`)
In the main processing loop:
1. Receive `title`, `original_title`, and `year` from the scraper.
2. Call `tmdb.search_movie(title=original_title or title, year=year)`.
3. If no match is found with `original_title`, retry with `title` (if different).

## 4. Components
### 4.1. `FilmwebScraper` (Python)
- **Method**: `get_warsaw_movies`
- **Logic**: Use BeautifulSoup selectors to find the alternate title and year. Ensure robust handling of missing fields.

### 4.2. `TMDBScraper` (Python)
- **Method**: `search_movie`
- **Logic**: Ensure the `year` parameter is correctly passed to the TMDB API `/search/movie` endpoint (this is already implemented, but we will ensure it's utilized).
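
As a rough illustration of what "passing the year" means at the HTTP level, the TMDB `/search/movie` endpoint accepts `query` and `year` as query parameters. The sketch below only builds the parameters. `build_search_params` is a hypothetical helper, authentication is omitted, and the actual `TMDBScraper.search_movie` implementation may differ.

```python
from typing import Dict, Optional

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_params(title: str, year: Optional[int] = None) -> Dict[str, str]:
    """Build query parameters for a TMDB movie search (hypothetical helper).

    The year is included only when known, so a missing year results in an
    unfiltered search rather than an invalid parameter.
    """
    params = {"query": title}
    if year is not None:
        params["year"] = str(year)
    return params
```

The resulting dictionary can be passed as the `params` argument of an `httpx` GET request to `TMDB_SEARCH_URL`, together with whatever credentials the project already uses.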

## 5. Error Handling
- **Parsing Failures**: If the `.preview__alternateTitle` or `.preview__year` element is missing, the corresponding field should default to `None` without stopping the scraper.
- **TMDB Zero Matches**: If searching with `original_title` and `year` returns nothing, the system should log the failure and move to the next film, or optionally try a broader search without the year.
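
The optional broader search could be expressed as an ordered list of attempts, from most to least specific. This is a sketch of one possible ordering, not a decided design, and `search_attempts` is a hypothetical name.

```python
from typing import Iterator, Optional, Tuple


def search_attempts(
    title: str,
    original_title: Optional[str],
    year: Optional[int],
) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (query, year) pairs to try in order, without duplicates.

    Attempts with the year come first; the same titles are then retried
    without the year as a broader search.
    """
    titles = [t for t in (original_title, title) if t]
    years = [year, None] if year is not None else [None]

    seen = set()
    for y in years:
        for t in titles:
            if (t, y) not in seen:
                seen.add((t, y))
                yield (t, y)
```

The caller would iterate over the attempts, stop at the first TMDB match, and log a failure only after all of them return nothing.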

## 6. Testing Strategy
- **Mocked Responses**: Create a test case in `scraper/tests/test_filmweb.py` that uses a saved HTML snippet from a Filmweb showtimes page to verify extraction.
- **Matching Verification**: Use a manual test run (or integration test) to verify that "Projekt Hail Mary" is matched via its original title "Project Hail Mary" and year "2026".