Symptom
In the iOS reader, French articles still display "Passer la publicité" as a standalone paragraph between body paragraphs. Screenshot from a recent reader session confirms it appears verbatim, mid-article, between two otherwise clean French sentences.
This is the French ad-skip prompt (literally "Skip the ad") — leftover from a video/ad placeholder on the source site that the cleaner should be stripping.
What we already know
The exact string is already in the blocklist at api/zeeguu/core/content_cleaning/content_cleaner.py:27:
```python
JUNK_PATTERNS_TO_REMOVE = [
    ...
    # French cookie/ad notices
    "Passer la publicité",
    "La suite après cette publicité",
    ...
]
```
So the question is why the existing entry isn't catching this instance.
Hypotheses worth checking
- Article crawled before the entry was added. Cleaning is applied at crawl time only — old articles in the DB retain the artifact. Confirmable by checking when this entry landed (git blame on line 27) vs the article's published_time / crawl date.
- Wiring gap: JUNK_PATTERNS_TO_REMOVE may not flow into sent_filter_set. filter_noise_patterns (line 96) matches against the sent_filter_set passed in by the caller, not JUNK_PATTERNS_TO_REMOVE directly. Worth verifying the caller actually unions both lists into the set.
- Sentence tokenization splits the phrase. Matching uses sent_tokenize + exact match on normalize_sent(sent) (.lower().strip()). If the phrase appears with attached punctuation, or wrapped inside a longer line in this particular site's HTML, the tokenized sentence won't equal the blocklist entry, even after normalization.
- Second cleaning path bypasses the list. The JS readability-server does its own cleanup via SpecificCleanup/. If that path runs at render time instead of crawl time, it has no awareness of this Python list.
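The tokenization hypothesis is cheap to sanity-check in isolation. A minimal sketch of the exact-match filtering as described above, with normalize_sent reconstructed from the note as just .lower().strip() (the real function may do more):

```python
# Reconstruction of the exact-match filtering described above.
# The blocklist entry is real; normalize_sent is assumed to be
# only .lower().strip(), per the note.

JUNK_PATTERNS_TO_REMOVE = ["Passer la publicité"]

def normalize_sent(sent: str) -> str:
    return sent.lower().strip()

sent_filter_set = {normalize_sent(p) for p in JUNK_PATTERNS_TO_REMOVE}

candidates = [
    "Passer la publicité",          # clean sentence: dropped
    "Passer la publicité.",         # trailing period: survives
    "Vidéo : Passer la publicité",  # embedded in a longer line: survives
]

for sent in candidates:
    matched = normalize_sent(sent) in sent_filter_set
    print(f"{sent!r} -> {'drop' if matched else 'keep'}")
```

Only the first candidate matches; any attached punctuation or surrounding text defeats the exact comparison, which is why the debug print suggested below should reveal exactly which variant the site produces.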
Suggested diagnosis order
- Check git log -- zeeguu/core/content_cleaning/content_cleaner.py and compare to article publish/crawl dates for the affected URL.
- Trace the caller of filter_noise_patterns to confirm JUNK_PATTERNS_TO_REMOVE is actually included in sent_filter_set.
- Add a debug print of the sent values being compared on a re-clean of an affected article — see whether the candidate sentence is exactly "passer la publicité" or has trailing characters.
- If wiring + tokenization are fine, the article is likely pre-list — bulk re-clean older articles.
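The debug-print step doesn't need to touch production code. A throwaway harness along these lines would do, with the tokenizer and normalizer injected since the real sent_tokenize / normalize_sent live in the crawler (the naive regex tokenizer here is only a stand-in):

```python
import re

# Hypothetical harness: run the blocklist comparison over one
# article's text and log every candidate sentence as it is compared.

def debug_filter(text, sent_filter_set, tokenize, normalize):
    kept = []
    for sent in tokenize(text):
        norm = normalize(sent)
        hit = norm in sent_filter_set
        print(f"{'DROP' if hit else 'keep'}: {norm!r}")
        if not hit:
            kept.append(sent)
    return " ".join(kept)

text = "Un paragraphe propre. Passer la publicité. Un autre paragraphe."
cleaned = debug_filter(
    text,
    {"passer la publicité"},
    lambda t: re.split(r"(?<=[.!?])\s+", t),
    lambda s: s.lower().strip(),
)
```

With this toy input the prompt survives because the tokenizer leaves the trailing period attached — exactly the failure mode of the tokenization hypothesis. A "keep" line like that on a real affected article would confirm it.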
Out of scope (separate issues)
- Whether other French ad-prompt variants are still missing from the list (e.g. "Passer cette publicité", different casing/punctuation)
- Whether to make matching less brittle (substring instead of exact-match) — separate design discussion