Symptom
In the iOS reader, French articles still display "Passer la publicité" as a standalone paragraph between body paragraphs. Screenshot from a recent reader session confirms it appears verbatim, mid-article, between two otherwise clean French sentences.
This is the French ad-skip prompt (literally "Skip the ad") — leftover from a video/ad placeholder on the source site that the cleaner should be stripping.
What we already know
The exact string is already in the blocklist at api/zeeguu/core/content_cleaning/content_cleaner.py:27:
```python
JUNK_PATTERNS_TO_REMOVE = [
    ...
    # French cookie/ad notices
    "Passer la publicité",
    "La suite après cette publicité",
    ...
]
```
So the question is why the existing entry isn't catching this instance.
Hypotheses worth checking
- Article crawled before the entry was added. Cleaning is applied at crawl time only — old articles in the DB retain the artifact. Confirmable by checking when this entry landed (git blame on line 27) vs the article's published_time / crawl date.
- Wiring gap: JUNK_PATTERNS_TO_REMOVE may not flow into sent_filter_set. filter_noise_patterns (line 96) matches against the sent_filter_set passed in by the caller, not JUNK_PATTERNS_TO_REMOVE directly. Worth verifying the caller actually unions both lists into the set.
- Sentence tokenization splits the phrase. Matching uses sent_tokenize + exact match on normalize_sent(sent) (.lower().strip()). If the phrase appears with attached punctuation, or wrapped inside a longer line in this particular site's HTML, the tokenized sentence won't equal the blocklist entry, even after normalization.
- Second cleaning path bypasses the list. The JS readability-server does its own cleanup via SpecificCleanup/. If that path runs at render time instead of crawl time, it has no awareness of this Python list.
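The tokenization hypothesis is cheap to sanity-check in isolation. A minimal sketch of the exact-match filtering as described above, with normalize_sent reconstructed from the note as just .lower().strip() (the real function may do more):

```python
# Reconstruction of the exact-match filtering described above.
# The blocklist entry is real; normalize_sent is assumed to be
# only .lower().strip(), per the note.

JUNK_PATTERNS_TO_REMOVE = ["Passer la publicité"]

def normalize_sent(sent: str) -> str:
    return sent.lower().strip()

sent_filter_set = {normalize_sent(p) for p in JUNK_PATTERNS_TO_REMOVE}

candidates = [
    "Passer la publicité",          # clean sentence: dropped
    "Passer la publicité.",         # trailing period: survives
    "Vidéo : Passer la publicité",  # embedded in a longer line: survives
]

for sent in candidates:
    matched = normalize_sent(sent) in sent_filter_set
    print(f"{sent!r} -> {'drop' if matched else 'keep'}")
```

Only the first candidate matches; any attached punctuation or surrounding text defeats the exact comparison, which is why the debug print suggested below should reveal exactly which variant the site produces.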
Suggested diagnosis order
- Check git log -- zeeguu/core/content_cleaning/content_cleaner.py and compare to article publish/crawl dates for the affected URL.
- Trace the caller of filter_noise_patterns to confirm JUNK_PATTERNS_TO_REMOVE is actually included in sent_filter_set.
- Add a debug print of the sent values being compared on a re-clean of an affected article — see whether the candidate sentence is exactly "passer la publicité" or has trailing characters.
- If wiring + tokenization are fine, the article is likely pre-list — bulk re-clean older articles.
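The debug-print step doesn't need to touch production code. A throwaway harness along these lines would do, with the tokenizer and normalizer injected since the real sent_tokenize / normalize_sent live in the crawler (the naive regex tokenizer here is only a stand-in):

```python
import re

# Hypothetical harness: run the blocklist comparison over one
# article's text and log every candidate sentence as it is compared.

def debug_filter(text, sent_filter_set, tokenize, normalize):
    kept = []
    for sent in tokenize(text):
        norm = normalize(sent)
        hit = norm in sent_filter_set
        print(f"{'DROP' if hit else 'keep'}: {norm!r}")
        if not hit:
            kept.append(sent)
    return " ".join(kept)

text = "Un paragraphe propre. Passer la publicité. Un autre paragraphe."
cleaned = debug_filter(
    text,
    {"passer la publicité"},
    lambda t: re.split(r"(?<=[.!?])\s+", t),
    lambda s: s.lower().strip(),
)
```

With this toy input the prompt survives because the tokenizer leaves the trailing period attached — exactly the failure mode of the tokenization hypothesis. A "keep" line like that on a real affected article would confirm it.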
Out of scope (separate issues)
- Whether other French ad-prompt variants are still missing from the list (e.g. "Passer cette publicité", different casing/punctuation)
- Whether to make matching less brittle (substring instead of exact-match) — separate design discussion