Releases · AndyTheFactory/newspaper4k

28 Feb 14:33

AndyTheFactory

0.9.5

9fdb4ce

Version 0.9.5 - improvements Google News and honor robots.txt

lang: Add ISO 639-3 language code support for Kurdish (ckb, kmr)
tests: add robots tests
feat: added robots.txt check with hook in do_request
feat: add hooks to get_html
parse: prioritize datePublished over dateCreated in JSON-LD extraction
docs: Readme improvements
feat: add nltk as an optional dependency for leaner deployments
docs: added additional documentation for GoogleNews and Cloudscraper integration
rework: type annotations removed deprecated types (python 3.10+)
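
To illustrate what a robots.txt check involves, here is a minimal sketch built only on Python's standard `urllib.robotparser`. It is not newspaper4k's implementation, and the helper name `is_allowed` and the sample rules are assumptions made for this example. In a real crawler the robots.txt text would be downloaded from the target site before each host is scraped.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; not part of newspaper4k.
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, invented for this example
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed("https://example.com/news/story.html", ROBOTS))     # True
print(is_allowed("https://example.com/private/draft.html", ROBOTS))  # False
```

A check like this could be called from a pre-request hook so that disallowed URLs are skipped before any download is attempted.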

What's Changed

lang: Add ISO 639-3 language code support for Kurdish (ckb, kmr) by Muzaffer Cikay in #691
tests: add robots tests by Andrei in 030e50d
feat: added robots.txt check with hook in do_request by Andrei in 62dece9
feat: add hooks to get_html by Andrei in 708cc10
parse: prioritize datePublished over dateCreated in JSON-LD extraction by Pontus Svensson in cdadb9e
docs: Readme improvements by Andrei in 18ca21c
feat: add nltk as an optional dependency for leaner deployments by Andrei in e073459
docs: added additional documentation for GoogleNews and Cloudscraper integration by Andrei in aceb853
rework: type annotations removed deprecated types (python 3.10+) by Andrei in bd82a41
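
Making a package such as nltk optional usually means importing it lazily and failing with a clear message only when the feature that needs it is used. The sketch below shows that general pattern with the standard library. The helper `require_optional` and its `hint` parameter are hypothetical and do not describe how newspaper4k itself loads nltk.

```python
import importlib


def require_optional(module_name: str, hint: str = ""):
    """Import an optional dependency on demand.

    Hypothetical helper for illustration. Raises ImportError with a
    readable message if the module is not installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"Optional dependency '{module_name}' is not installed."
        if hint:
            message += f" {hint}"
        raise ImportError(message) from exc


# An installed module is returned as usual
json_module = require_optional("json")
print(json_module.dumps({"ok": True}))

# A missing module produces a clear error only when it is requested
try:
    require_optional("module_that_is_not_installed_xyz", hint="Install the matching extra.")
except ImportError as err:
    print(err)
```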

Bugs Fixed

skip null entries in JSON-LD arrays during extraction (fix #692) by ghxm in 77d6ccc
ArticleException f-string not interpolating status_code (fix #684) by Andrei in 7caa2a5
accept relative paths for categories (PR #667) by BRNMan in #667
use w3lib to detect webpage encoding by Andrei in 3bd4f00
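
Two of the JSON-LD changes in this release (preferring datePublished over dateCreated, and skipping null entries in arrays) can be pictured with the following sketch. It is a simplified illustration under stated assumptions, not the library's extraction code: the function name `pick_publish_date` and the sample payloads are invented here, and real pages often nest the data further (for example under `@graph`).

```python
import json
from typing import Optional

# Keys are tried in this order, so datePublished wins over dateCreated
DATE_KEYS = ("datePublished", "dateCreated")


def pick_publish_date(json_ld_text: str) -> Optional[str]:
    """Return the preferred date string from a JSON-LD payload, or None.

    Hypothetical helper for illustration; not part of newspaper4k.
    """
    data = json.loads(json_ld_text)
    entries = data if isinstance(data, list) else [data]
    # Skip null (None) and any other non-object entries in the array
    entries = [entry for entry in entries if isinstance(entry, dict)]
    for key in DATE_KEYS:
        for entry in entries:
            value = entry.get(key)
            if value:
                return value
    return None


sample = '[null, {"dateCreated": "2024-01-01"}, {"datePublished": "2024-02-02"}]'
print(pick_publish_date(sample))  # 2024-02-02
```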

New Contributors

Muzaffer Cikay made their first contribution in #691

Full Changelog: 0.9.4...0.9.5

18 Nov 06:08

AndyTheFactory

0.9.4.1

da22927

Version 0.9.4.1 - Python 3.14 support

feat: add support for python 3.14
rework: minor typing changes
tests: increase test coverage

What's Changed

lang: Add Kurdish Kurmanji stop words by @cikay in #677
docs: Update supported languages by @cikay in #676
docs: bump sphinx version by @AndyTheFactory in #680
Docs 0.9.4 by @AndyTheFactory in #681

New Contributors

@cikay made their first contribution in #677

Full Changelog: 0.9.4...0.9.4.1

Contributors

AndyTheFactory and cikay

15 Nov 21:52

AndyTheFactory

0.9.4

701f5da

Version 0.9.4 - Dropping python 3.8, 3.9, Switching to `uv`, improving tests

New Features

Bumped the minimum Python version to 3.10. Versions 3.8 and 3.9 are no longer supported, but might still work.

misc: switch to uv from poetry
parse: add brotli compression
install: dependency versions pin
tests: split tests into unit, integration and e2e. Only unit tests are run on each PR. Integration and e2e tests are run locally when developing.
tests: added coverage report generation. Coverage uploaded to coveralls.io

18 Mar 21:56

AndyTheFactory

0.9.3.1

9989040

Version 0.9.3.1 Minor bug fix

Some fixes to dependencies for Python >= 3.11. The NumPy version was incompatible with Colab; this is now fixed.

Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.

18 Mar 00:10

AndyTheFactory

0.9.3

741fcb3

Version 0.9.3 Article Parsing improvements and huge jump in multi language support (support for over 40 languages added)

Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. It is now much easier to add new languages. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it will be used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
We now use two evaluation datasets: the one from scrapinghub and one created by us from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of changes.

We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset

You can now install languages that need special packages as optional dependencies
Google News is fully integrated in the scraping process.
You can now pickle sources and articles, which makes it easier to save and restore scraping state
Bumped minimum python version support to Python 3.8

14 Jan 11:36

AndyTheFactory

0.9.2

97fdcb0

Version 0.9.2 some major changes in document parsing

You can now use the module as a command line interface (CLI). Usage: python -m newspaper --url https://www.test.com. More information in the documentation.
I have added an evaluation script against a dataset from scrapinghub. This will help keep track of future improvements.
Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous news_pool was replaced with a fetch_news() function.
Caching is now much more flexible. You can disable it completely or for one request.
You can now use the newspaper.article() function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the Article class.
Sites protected by Cloudflare are better detected and raise an exception. The reason will be in the exception message.
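
The move to ThreadPoolExecutor can be pictured with the general pattern below. This is a sketch of the standard-library technique only, not the code behind fetch_news(): the names `fetch_all` and `fake_download` are hypothetical, and the fake downloader stands in for a real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, download, max_workers=4):
    """Run download(url) for every URL in a thread pool.

    Hypothetical helper for illustration. Returns a dict mapping each
    URL to its result, or to the exception raised while fetching it.
    """
    results = {}
    # The with-block waits for all workers and shuts the pool down cleanly
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download does not stop the others
                results[url] = exc
    return results


def fake_download(url):
    """Stand-in for a network request (assumption for this example)."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"


pages = fetch_all(["https://example.com/a", "https://example.com/bad"], fake_download)
for url, outcome in sorted(pages.items()):
    print(url, "->", outcome)
```

Collecting exceptions per URL, instead of letting the first one propagate, keeps a single failing site from aborting the whole batch.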

08 Nov 13:40

AndyTheFactory

0.9.1

c261786

Version 0.9.1 code refactoring and bugfixes

New feature:

version bump(f7107be)
tests: Add test case for(592f6f6)
parse: added possibility to follow "read more" links in articles(0720de1)
Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462)(5ff5d27)
parse: extended data parsing of json-ld metadata (issue #518)(fc413af)
tests: added script to create test cases(9df8c16)
parse: added tag for date detection issue #835(41152eb)
parse: added og:regDate to known date tags(dc35e29)
tests: convert unittest to pytest(45c4e8d)

Bugs fixed:

typing annotation for set python 3.8(895343f)
parse: improve meta tag content for articles and pubdate(37bb0b7)
parse: 📝 improved author detection. improved video links detection(23c547f)
parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM.(6874d05)
small changes, replace os.path with pathlib(5598d95)
parse: use one file of stopwords for english, the one in the standard folder #503(6bdf813)
parse: better author parsing based on issue #493(f93a9c2)
parse: make the url date parsing stricter. Issue #514(0cc1e83)
parse: replace \n with space in sentence split (Issue #506)(3ccb87c)
parsing: catch url errors resulting from parsed image links(9140a04)
correct python versions in pipeline(7e671df)
gitignore update(8855f00)

29 Oct 23:27

AndyTheFactory

0.9.0

c11f950

First release after the fork

First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped versions so that it is clear that this is a fork and not the original project.

New feature:

tests: starting moving tests to pytest(f294a01) (by Andrei)
parser: add yoast schema parse for date extraction(39a5cff) (by Andrei)

Bugs fixed:

docs: update README.md(d5f9209) (by Andrei)
feed_url parsing, issue #915(ec2d474) (by Andrei)
better content detection. added and
tag as candidate for content parent_node(447a429) (by Andrei)
close pickle files - PR #938(d7608da) (by Andrei)
parsing: improved publication date extraction(4d137eb) (by Andrei)
some linter errors, whitespaces and spelling(79553f6) (by Andrei)
