
fix: Deduplicate Source.articles by normalized URL and content fingerprint #718

Draft

Copilot wants to merge 3 commits into dev-0.9.6 from copilot/fix-duplicate-articles-issue

Conversation


Copilot AI commented Mar 21, 2026

Same article discovered under variant URLs (http://theatlantic.com/… vs https://www.theatlantic.com/…) bypassed the existing exact-URL dedup and appeared multiple times in Source.articles.

Related Issues

Proposed Changes:

Root cause: _generate_articles() keyed dedup on article.url verbatim, so scheme (http/https) and www. subdomain differences produced distinct keys for identical articles.
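The failure mode described above can be reproduced in a few lines. This is an illustrative sketch, not the actual library code: a plain dict stands in for the dedup map that `_generate_articles()` keyed on the verbatim URL.

```python
# Verbatim-URL keying, as the old dedup effectively did
# (illustrative sketch, not the real newspaper source).
articles = {}
for url in (
    "https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
    "http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
):
    articles.setdefault(url, url)

print(len(articles))  # 2: the same article survives dedup twice
```

Because scheme and `www.` differences are part of the key, both variants pass the exact-URL check.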

newspaper/source.py

  • _normalize_url_for_dedup(url) — new static method; strips the scheme and www. prefix and normalizes the trailing slash. The result is used only as a dict key, never for fetching.
  • _generate_articles() — adds a second dedup pass using the normalized key after the existing exact-URL dedup. Skipped URLs are logged at DEBUG.
  • _get_article_fingerprint(article) — new static method; computes sha256(title + text) as a hex digest. Before hashing, the content is normalized: tabs and non-breaking spaces are replaced with regular spaces, runs of spaces are collapsed, punctuation is removed, and the result is lowercased, so minor formatting differences do not prevent duplicate detection.
  • parse_articles() — adds a fingerprint-based dedup pass after parsing, catching duplicates that survive URL normalization (e.g. redirect chains landing on the same content). Skipped URLs are logged at DEBUG.
# Both URLs now produce a single article in Source.articles
# https://www.theatlantic.com/politics/…/561317/  ← kept
# http://theatlantic.com/politics/…/561317/       ← deduplicated out

tests/unit/test_source.py

  • Updated test_source_parse_articles: mocked parse now sets unique title/text per article (empty content was previously allowed to collide on the fingerprint).
  • Added test_normalize_url_for_dedup, test_generate_articles_deduplicates_www_vs_no_www, test_generate_articles_deduplicates_http_vs_https, test_get_article_fingerprint, test_get_article_fingerprint_normalizes_whitespace_and_case, test_parse_articles_deduplicates_by_content.

How did you test it?

All 15 unit tests in tests/unit/test_source.py pass, including 6 new targeted tests covering URL-normalization dedup, content-fingerprint dedup, and fingerprint normalization (tabs, non-breaking spaces, multiple spaces, punctuation, and case differences).

Notes for the reviewer

  • Fingerprint dedup only runs in parse_articles() (i.e. after download + parse), so it does not affect users who only call generate_articles().
  • The _normalize_url_for_dedup key is intentionally scheme-free; it is never passed to any HTTP client.
  • Before hashing, the fingerprint normalizes whitespace (tabs, non-breaking spaces, runs of spaces), strips punctuation, and lowercases the content, so minor formatting differences between two copies of the same article do not prevent deduplication.
  • Articles whose title and text are both empty all hash to the same fingerprint and will be deduplicated; in practice is_valid_body() should already filter those out before the fingerprint pass.

Checklist

  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue
Original prompt

This section details the original issue you should resolve

<issue_title>Iterating articles on news source produces duplicates, if subdomain omitted.</issue_title>
<issue_description> Issue by awiebe
Sun Jun 17 02:00:15 2018
Originally opened as codelucas/newspaper#580


I was testing news sources, and found that this article was emitted twice, despite the fact that newspaper should be memoizing.
The problem seems to be that memoization uses the straight url and doesn't consider that the second source is missing the www subdomain.

https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
import newspaper

def dump_article(a):
    try:
        a.download()
        a.parse()
        print(a.title)
        print(a.authors)
        # print(a.text)
        return True
    except Exception:
        return False


MAX_PULL = 10

for source in newspaper.popular_urls():
    print(source)
    pull = 0
    s = newspaper.build(source, lang='en')
    for a in s.articles:
        print(a.url)
        if dump_article(a):
            pull += 1
        if pull >= MAX_PULL:
            break
</issue_description>

<agent_instructions>implement a fingerprinting mechanism using hashing and the title + extracted text. log deduplicated urls. Implement this on the Source class level, such that duplicates are reduced in Source.articles</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@AndyTheFactory **Comment by [minhdanh](https://github.com/minhdanh)** _Sat Mar 6 14:49:49 2021_

Having same problem here in 2021. @awiebe Have you by any chance had a solution?



Copilot AI changed the title [WIP] Fix duplicate articles on news source due to subdomain omission fix: Deduplicate Source.articles by normalized URL and content fingerprint Mar 21, 2026
Copilot AI requested a review from AndyTheFactory March 21, 2026 11:45
Comment thread on newspaper/source.py:
str: A hex-encoded SHA-256 digest of the concatenated title and
text.
"""
content = (article.title or "") + (article.text or "")
Owner


@copilot replace tabs and non-breakable space with space, remove multiple spaces and any punctuation. convert all to lowercase. before computing hash

Contributor Author


Done in commit 224dcc5. Before computing the hash, the combined title+text is now normalized: tabs and non-breaking spaces (\xa0) are replaced with regular spaces, multiple consecutive spaces are collapsed, punctuation is stripped, and the result is lower-cased.

@AndyTheFactory AndyTheFactory changed the base branch from master to dev-0.9.6 April 11, 2026 17:11
Contributor Author

Copilot AI commented Apr 11, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.


I tried to connect to the following addresses, but was blocked by firewall rules:

  • publicsuffix.org
    • Triggering command: /usr/bin/python3 python3 -m pytest tests/unit/test_source.py -v --no-header (dns block)

If you need me to access, download, or install something from one of these locations, you can either:


Development

Successfully merging this pull request may close these issues.

Iterating articles on news source produces duplicates, if subdomain omitted.
