
fix: Deduplicate Source.articles by normalized URL and content fingerprint #718

Draft

Copilot wants to merge 3 commits into dev-0.9.6 from copilot/fix-duplicate-articles-issue

Conversation


Copilot AI commented Mar 21, 2026

Same article discovered under variant URLs (http://theatlantic.com/… vs https://www.theatlantic.com/…) bypassed the existing exact-URL dedup and appeared multiple times in Source.articles.

Related Issues

Proposed Changes:

Root cause: _generate_articles() keyed dedup on article.url verbatim, so scheme (http/https) and www. subdomain differences produced distinct keys for identical articles.
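The failure mode described above can be reproduced in a few lines. This is an illustrative sketch, not the actual library code: a plain dict stands in for the dedup map that `_generate_articles()` keyed on the verbatim URL.

```python
# Verbatim-URL keying, as the old dedup effectively did
# (illustrative sketch, not the real newspaper source).
articles = {}
for url in (
    "https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
    "http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
):
    articles.setdefault(url, url)

print(len(articles))  # 2: the same article survives dedup twice
```

Because scheme and `www.` differences are part of the key, both variants pass the exact-URL check.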

newspaper/source.py

  • _normalize_url_for_dedup(url) — new static method; strips the scheme and www. prefix and normalizes the trailing slash. The result is used only as a dict key, never for fetching.
  • _generate_articles() — adds a second dedup pass using the normalized key after the existing exact-URL dedup. Skipped URLs are logged at DEBUG.
  • _get_article_fingerprint(article) — new static method; computes sha256(title + text) as a hex digest. Before hashing, the content is normalized: tabs and non-breaking spaces are replaced with regular spaces, runs of spaces are collapsed, punctuation is removed, and the result is lowercased, so minor formatting differences do not prevent duplicate detection.
  • parse_articles() — adds a fingerprint-based dedup pass after parsing, catching duplicates that survive URL normalization (e.g. redirect chains landing on the same content). Skipped URLs are logged at DEBUG.
# Both URLs now produce a single article in Source.articles
# https://www.theatlantic.com/politics/…/561317/  ← kept
# http://theatlantic.com/politics/…/561317/       ← deduplicated out

tests/unit/test_source.py

  • Updated test_source_parse_articles: mocked parse now sets unique title/text per article (empty content was previously allowed to collide on the fingerprint).
  • Added test_normalize_url_for_dedup, test_generate_articles_deduplicates_www_vs_no_www, test_generate_articles_deduplicates_http_vs_https, test_get_article_fingerprint, test_get_article_fingerprint_normalizes_whitespace_and_case, test_parse_articles_deduplicates_by_content.

How did you test it?

All 15 unit tests in tests/unit/test_source.py pass, including 6 new targeted tests covering URL-normalization dedup, content-fingerprint dedup, and fingerprint normalization (tabs, non-breaking spaces, multiple spaces, punctuation, and case differences).

Notes for the reviewer

  • Fingerprint dedup only runs in parse_articles() (i.e. after download + parse), so it does not affect users who only call generate_articles().
  • The _normalize_url_for_dedup key is intentionally scheme-free; it is never passed to any HTTP client.
  • Before hashing, the fingerprint normalizes whitespace (tabs, non-breaking spaces, runs of spaces), strips punctuation, and lowercases the content, so minor formatting differences between two copies of the same article do not prevent deduplication.
  • Articles whose title and text are both empty all hash to the same fingerprint and will be deduplicated; in practice is_valid_body() should already filter those out before the fingerprint pass.

Checklist

  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue
Original prompt

This section details the original issue you should resolve

<issue_title>Iterating articles on news source produces duplicates, if subdomain omitted.</issue_title>
<issue_description> Issue by awiebe
Sun Jun 17 02:00:15 2018
Originally opened as codelucas/newspaper#580


I was testing news sources, and found that this article was emitted twice, despite the fact that newspaper should be memoizing.
The problem seems to be that memoization uses the straight url and doesn't consider that the second source is missing the www subdomain.

https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
import newspaper

def dump_article(a):
    try:
        a.download()
        a.parse()
        print(a.title)
        print(a.authors)
        # print(a.text)
        return True
    except Exception:
        return False


MAX_PULL = 10

for source in newspaper.popular_urls():
    print(source)
    pull = 0
    s = newspaper.build(source, lang='en')
    for a in s.articles:
        print(a.url)
        if dump_article(a):
            pull += 1
        if pull >= MAX_PULL:
            break
</issue_description>

<agent_instructions>implement a fingerprinting mechanism using hashing and the title + extracted text. log deduplicated urls. Implement this on the Source class level, such that duplicates are reduced in Source.articles</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@AndyTheFactory **Comment by [minhdanh](https://github.com/minhdanh)** _Sat Mar 6 14:49:49 2021_

Having same problem here in 2021. @awiebe Have you by any chance had a solution?



Copilot AI changed the title [WIP] Fix duplicate articles on news source due to subdomain omission fix: Deduplicate Source.articles by normalized URL and content fingerprint Mar 21, 2026
Copilot AI requested a review from AndyTheFactory March 21, 2026 11:45
Comment thread on newspaper/source.py:
str: A hex-encoded SHA-256 digest of the concatenated title and
text.
"""
content = (article.title or "") + (article.text or "")
Owner


@copilot replace tabs and non-breakable space with space, remove multiple spaces and any punctuation. convert all to lowercase. before computing hash

Contributor Author


Done in commit 224dcc5. Before computing the hash, the combined title+text is now normalized: tabs and non-breaking spaces (\xa0) are replaced with regular spaces, multiple consecutive spaces are collapsed, punctuation is stripped, and the result is lower-cased.

@AndyTheFactory AndyTheFactory changed the base branch from master to dev-0.9.6 April 11, 2026 17:11
Contributor Author

Copilot AI commented Apr 11, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.


I tried to connect to the following addresses, but was blocked by firewall rules:

  • publicsuffix.org
    • Triggering command: /usr/bin/python3 python3 -m pytest tests/unit/test_source.py -v --no-header (dns block)

If you need me to access, download, or install something from one of these locations, you can either:


Development

Successfully merging this pull request may close these issues.

Iterating articles on news source produces duplicates, if subdomain omitted.
