
Reader: 'Passer la publicité' still appearing despite being in JUNK_PATTERNS_TO_REMOVE #586

@mircealungu

Description


Symptom

In the iOS reader, French articles still display "Passer la publicité" as a standalone paragraph between body paragraphs. A screenshot from a recent reader session confirms that it appears verbatim, mid-article, between two otherwise clean French sentences.

This is the French ad-skip prompt (literally "Skip the ad") — leftover from a video/ad placeholder on the source site that the cleaner should be stripping.

What we already know

The exact string is already in the blocklist at api/zeeguu/core/content_cleaning/content_cleaner.py:27:

JUNK_PATTERNS_TO_REMOVE = [
    ...
    # French cookie/ad notices
    "Passer la publicité",
    "La suite après cette publicité",
    ...
]

So the question is why the existing entry isn't catching this instance.

Hypotheses worth checking

  1. Article crawled before the entry was added. Cleaning is applied at crawl time only — old articles in the DB retain the artifact. Confirmable by checking when this entry landed (git blame on line 27) vs the article's published_time / crawl date.

  2. Wiring gap: JUNK_PATTERNS_TO_REMOVE may not flow into sent_filter_set. filter_noise_patterns (line 96) matches against sent_filter_set passed in by the caller, not JUNK_PATTERNS_TO_REMOVE directly. Worth verifying the caller actually unions both lists into the set.

  3. Sentence tokenization splits the phrase. Matching uses sent_tokenize plus an exact match on normalize_sent(sent) (.lower().strip()). If the phrase appears with attached punctuation, or embedded inside a longer line in this particular site's HTML, the tokenized sentence won't equal the blocklist entry even after normalization.

  4. Second cleaning path bypasses the list. The JS readability-server does its own cleanup via SpecificCleanup/. If that path runs at render time instead of crawl time, it has no awareness of this Python list.
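Hypothesis 3 is easy to demonstrate in isolation. A minimal sketch, assuming normalize_sent really is just .lower().strip() as described above (the function and set names here are illustrative, not copied from content_cleaner.py):

```python
def normalize_sent(sent: str) -> str:
    # Assumed normalization: lowercase + trim, per the description above.
    return sent.lower().strip()


# Normalized blocklist entry, as it would sit in sent_filter_set.
BLOCKLIST = {normalize_sent("Passer la publicité")}

# Clean case: tokenized sentence equals the entry after normalization.
assert normalize_sent("Passer la publicité") in BLOCKLIST

# Attached punctuation defeats exact matching...
assert normalize_sent("Passer la publicité.") not in BLOCKLIST

# ...as does the phrase being embedded in a longer line.
assert normalize_sent("Vidéo : Passer la publicité") not in BLOCKLIST
```

If step 3 of the diagnosis below shows a candidate like the last two cases, tokenization brittleness (not wiring) is the culprit.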

Suggested diagnosis order

  1. Check git log -- zeeguu/core/content_cleaning/content_cleaner.py and compare against the publish/crawl dates of the affected URL.
  2. Trace the caller of filter_noise_patterns to confirm JUNK_PATTERNS_TO_REMOVE is actually included in sent_filter_set.
  3. Add a debug print of sent values being compared on a re-clean of an affected article — see whether the candidate sentence is exactly "passer la publicité" or has trailing characters.
  4. If wiring + tokenization are fine, the article is likely pre-list — bulk re-clean older articles.
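For step 3, something like the following standalone helper would make trailing characters visible; this is a hypothetical sketch of the comparison filter_noise_patterns is described as doing, not its actual signature:

```python
def debug_filter(sentences, sent_filter_set):
    """Mimic exact-match filtering, printing each comparison's repr()."""
    kept = []
    for sent in sentences:
        normalized = sent.lower().strip()
        is_junk = normalized in sent_filter_set
        # repr() exposes trailing punctuation, NBSPs, etc.
        print(f"{'DROP' if is_junk else 'KEEP'}: {normalized!r}")
        if not is_junk:
            kept.append(sent)
    return kept


kept = debug_filter(
    ["Une phrase normale.", "Passer la publicité", "Passer la publicité."],
    {"passer la publicité"},
)
# The punctuated variant survives the filter, which would explain the symptom.
```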

Out of scope (separate issues)

  • Whether other French ad-prompt variants are still missing from the list (e.g. "Passer cette publicité", different casing/punctuation)
  • Whether to make matching less brittle (substring instead of exact-match) — separate design discussion
