
Releases: AndyTheFactory/newspaper4k

Version 0.9.5 - Google News improvements and honoring robots.txt

28 Feb 14:33
9fdb4ce

  • lang: Add ISO 639-3 language code support for Kurdish (ckb, kmr)
  • tests: add robots tests
  • feat: added robots.txt check with hook in do_request
  • feat: add hooks to get_html
  • parse: prioritize datePublished over dateCreated in JSON-LD extraction
  • docs: Readme improvements
  • feat: add nltk as an optional dependency for leaner deployments
  • docs: added additional documentation for GoogleNews and Cloudscraper integration
  • rework: type annotations removed deprecated types (python 3.10+)

What's Changed

  • lang: Add ISO 639-3 language code support for Kurdish (ckb, kmr) by Muzaffer Cikay in #691
  • tests: add robots tests by Andrei in 030e50d
  • feat: added robots.txt check with hook in do_request by Andrei in 62dece9
  • feat: add hooks to get_html by Andrei in 708cc10
  • parse: prioritize datePublished over dateCreated in JSON-LD extraction by Pontus Svensson in cdadb9e
  • docs: Readme improvements by Andrei in 18ca21c
  • feat: add nltk as an optional dependency for leaner deployments by Andrei in e073459
  • docs: added additional documentation for GoogleNews and Cloudscraper integration by Andrei in aceb853
  • rework: type annotations removed deprecated types (python 3.10+) by Andrei in bd82a41
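The robots.txt check mentioned above can be pictured with Python's standard `urllib.robotparser`. This is only a minimal sketch of the general technique, not newspaper4k's actual hook: the helper name `build_robots_checker`, the sample rules, and the example URLs are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that tells whether `user_agent` may fetch a URL.

    Hypothetical helper: it parses robots.txt content that was already
    downloaded, so no network access happens here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def is_allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return is_allowed


# Sample rules, made up for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

is_allowed = build_robots_checker(SAMPLE_ROBOTS)
print(is_allowed("https://example.com/news/story.html"))
print(is_allowed("https://example.com/private/draft.html"))
```

A download hook could call such a checker before issuing the request and skip or reject URLs that are disallowed.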

Bugs Fixed

  • skip null entries in JSON-LD arrays during extraction (fix #692) by ghxm in 77d6ccc
  • ArticleException f-string not interpolating status_code (fix #684) by Andrei in 7caa2a5
  • accept relative paths for categories (PR #667) by BRNMan in #667
  • use w3lib to detect webpage encoding by Andrei in 3bd4f00
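Two of the JSON-LD changes in this release (preferring `datePublished` over `dateCreated`, and skipping null entries in JSON-LD arrays) can be illustrated together. The function below is a hypothetical sketch using only the standard library; it is not the parser's real code, and `extract_publish_date` is a name invented for this example.

```python
from __future__ import annotations

import json


def extract_publish_date(jsonld_text: str) -> str | None:
    """Pick a publication date string from a JSON-LD document (sketch)."""
    data = json.loads(jsonld_text)
    items = data if isinstance(data, list) else [data]
    # Skip null entries (and anything else that is not a JSON object).
    items = [item for item in items if isinstance(item, dict)]
    # Check datePublished across all items first, then fall back to dateCreated.
    for key in ("datePublished", "dateCreated"):
        for item in items:
            value = item.get(key)
            if value:
                return value
    return None


sample = """
[
  null,
  {"@type": "NewsArticle",
   "dateCreated": "2024-01-01",
   "datePublished": "2024-01-02"}
]
"""
print(extract_publish_date(sample))  # prints 2024-01-02
```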

New Contributors

  • Muzaffer Cikay made their first contribution in #691

Full Changelog: 0.9.4...0.9.5

Version 0.9.4.1 - Python 3.14 support

18 Nov 06:08

  • feat: add support for python 3.14
  • rework: minor typing changes
  • tests: increase test coverage

Full Changelog: 0.9.4...0.9.4.1

Version 0.9.4 - Dropping python 3.8, 3.9, Switching to `uv`, improving tests

15 Nov 21:52
701f5da

New Features

Bumped the minimum Python version to 3.10. Versions 3.8 and 3.9 are no longer supported, but might still work.

  • misc: switch to uv from poetry
  • parse: add brotli compression
  • install: dependency versions pin
  • tests: split tests into unit, integration and e2e. Only unit tests are run on each PR. Integration and e2e tests are run locally when developing.
  • tests: added coverage report generation. Coverage uploaded to coveralls.io

Version 0.9.3.1 Minor bug fix

18 Mar 21:56

Some fixes for dependencies on Python >= 3.11. The pinned Numpy version was incompatible with Colab; this is now fixed.

Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.

Version 0.9.3 Article Parsing improvements and huge jump in multi language support (support for over 40 languages added)

18 Mar 00:10

Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. Much easier to add new languages now. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it is used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
We now use two evaluation datasets: the one from scrapinghub and one we created from the 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of the changes.
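The "use it if installed" behaviour described for cloudscraper follows a common optional-dependency pattern. The sketch below shows that pattern with the standard library only; `optional_import` and the `backend` variable are hypothetical names, and the library's real detection logic may differ.

```python
import importlib


def optional_import(module_name: str):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Pick the HTTP layer depending on what is available in the environment.
cloudscraper = optional_import("cloudscraper")
backend = "cloudscraper" if cloudscraper is not None else "requests"
print(f"HTTP layer: {backend}")
```

The same pattern fits any extra that users may leave out to keep a deployment small.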

We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset

  • You can now install languages that need special packages as optional dependencies
  • Google News is fully integrated in the scraping process.
  • You can now pickle sources and articles - easier to save and recover scraping
  • Bumped minimum python version support to Python 3.8

Version 0.9.2 some major changes in document parsing

14 Jan 11:36
97fdcb0

  • You can now use the module as a command line interface (CLI). Usage: python -m newspaper --url https://www.test.com. More information in the documentation.
  • I have added an evaluation script against a dataset from scrapinghub. This will help keep track of future improvements.
  • Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous news_pool was replaced with a fetch_news() function.
  • Caching is now much more flexible. You can disable it completely or for one request.
  • You can now use newspaper.article() function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the Article class.
  • Sites protected by Cloudflare are now detected more reliably and raise an exception. The reason is given in the exception message.
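The move to `ThreadPoolExecutor` for multithreaded requests can be sketched as follows. This is an illustration of the `concurrent.futures` pattern under stated assumptions, not the code behind `fetch_news()`: `fetch_all` and `fake_download` are hypothetical, and the fake downloader avoids any network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, download, max_workers=4):
    """Run `download(url)` for every URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that the download raised, so one failure does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the error with its URL
                results[url] = exc
    return results


def fake_download(url):
    """Stand-in for a real HTTP download (assumption for this example)."""
    if "broken" in url:
        raise ValueError(f"cannot download {url}")
    return f"<html>{url}</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/broken"], fake_download
)
for url, outcome in pages.items():
    print(url, "->", type(outcome).__name__)
```

Leaving the `with` block waits for all submitted work to finish, which is one reason this approach tends to be easier to reason about than hand-managed threads.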

Version 0.9.1 code refactoring and bugfixes

08 Nov 13:40

New feature:

  • version bump(f7107be)
  • tests: Add test case for(592f6f6)
  • parse: added possibility to follow "read more" links in articles(0720de1)
  • Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462)(5ff5d27)
  • parse: extended data parsing of json-ld metadata (issue #518)(fc413af)
  • tests: added script to create test cases(9df8c16)
  • parse: added tag for date detection issue #835(41152eb)
  • parse: added og:regDate to known date tags(dc35e29)
  • tests: convert unittest to pytest(45c4e8d)

Bugs fixed:

  • typing annotation for set python 3.8(895343f)
  • parse: improve meta tag content for articles and pubdate(37bb0b7)
  • parse: 📝 improved author detection. improved video links detection(23c547f)
  • parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM.(6874d05)
  • small changes, replace os.path with pathlib(5598d95)
  • parse: use one file of stopwords for english, the one in the standard folder #503(6bdf813)
  • parse: better author parsing based on issue #493(f93a9c2)
  • parse: make the url date parsing stricter. Issue #514(0cc1e83)
  • parse: replace \n with space in sentence split (Issue #506)(3ccb87c)
  • parsing: catch url errors resulting from parsed image links(9140a04)
  • correct python versions in pipeline(7e671df)
  • gitignore update(8855f00)
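The note above about making URL date parsing stricter does not say which rule was adopted. As one possible reading, the sketch below accepts a date only when the URL contains a full year/month/day path segment that forms a valid calendar date. The pattern and the name `date_from_url` are assumptions for illustration, not the library's implementation.

```python
import re
from datetime import date

# Year, month and day must be separated by "/" or "-" and the day must
# be followed by "/" or the end of the URL.
STRICT_URL_DATE = re.compile(r"/(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?=/|$)")


def date_from_url(url: str):
    """Return a date found in the URL path, or None if nothing valid matches."""
    match = STRICT_URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 45
        return None


print(date_from_url("https://example.com/2023/11/08/story"))
print(date_from_url("https://example.com/item/12345678"))
```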

First release after the fork

29 Oct 23:27

First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped versions such that it is clear that this is a fork and not the original project.

New feature:

  • tests: started moving tests to pytest(f294a01) (by Andrei)
  • parser: add yoast schema parse for date extraction(39a5cff) (by Andrei)

Bugs fixed:

  • docs: update README.md(d5f9209) (by Andrei)
  • feed_url parsing, issue #915(ec2d474) (by Andrei)
  • better content detection. added and
    tag as candidate for content parent_node(447a429) (by Andrei)
  • close pickle files - PR #938(d7608da) (by Andrei)
  • parsing: improved publication date extraction(4d137eb) (by Andrei)
  • some linter errors, whitespaces and spelling(79553f6) (by Andrei)