Skip to content

test: Add integration test for research.googleblog.com parsing (issue #457)#720

Draft
Copilot wants to merge 2 commits intomasterfrom
copilot/fix-googleblog-parsing-issue
Draft

test: Add integration test for research.googleblog.com parsing (issue #457)#720
Copilot wants to merge 2 commits intomasterfrom
copilot/fix-googleblog-parsing-issue

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 1, 2026

Articles from research.googleblog.com were parsed incorrectly — returning only the site tagline "The latest news from Research at Google" instead of actual article content.

Changes

  • Added integration test test_issue_457_googleblog in tests/integration/test_article.py covering two representative article URLs from the Google Research Blog (domain now at blog.research.google.com)
  • Test asserts extracted text exceeds 200 chars and does not consist solely of the site tagline

Related Issues

Proposed Changes:

Adds a regression/integration test to document and validate correct parsing of Google Research Blog articles. No production code changes — this PR establishes the test baseline to expose the parsing failure.

How did you test it?

Integration test can be run locally with:

pytest tests/integration/test_article.py::TestArticle::test_issue_457_googleblog -v

Skipped automatically in GitHub Actions CI (per existing conftest.py pytest_runtest_setup hook).

Notes for the reviewer

The original domain research.googleblog.com has migrated to blog.research.google.com; test URLs reflect the current canonical domain. If the parsing bug is confirmed locally, a follow-up PR with a fix to the extraction logic will be needed.

Checklist

  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • example.com
    • Triggering command: /home/REDACTED/.local/bin/pytest pytest tests/unit/ -q --tb=short (dns block)
  • media.cnn.com
    • Triggering command: /home/REDACTED/.local/bin/pytest pytest tests/unit/ -q --tb=short (dns block)
  • publicsuffix.org
    • Triggering command: /home/REDACTED/.local/bin/pytest pytest tests/unit/ -q --tb=short (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title [WIP] Fix parsing issues for research.googleblog.com links test: Add integration test for research.googleblog.com parsing (issue #457) Apr 1, 2026
Copilot AI requested a review from AndyTheFactory April 1, 2026 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Build newspaper to get recent articles research.googleblog.com are not parsed correctly

2 participants