Releases: AndyTheFactory/newspaper4k
Version 0.9.5 - Google News improvements and honoring robots.txt
- lang: Add ISO 639-3 language code support for Kurdish (ckb, kmr)
- tests: add robots tests
- feat: added robots.txt check with hook in do_request
- feat: add hooks to get_html
- parse: prioritize `datePublished` over `dateCreated` in JSON-LD extraction
- docs: Readme improvements
- feat: add `nltk` as an optional dependency for leaner deployments
- docs: added additional documentation for GoogleNews and Cloudscraper integration
- rework: type annotations removed deprecated types (python 3.10+)
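As an illustration of what honoring robots.txt involves, here is a minimal sketch that uses only the standard library's `urllib.robotparser`. It is not the library's actual hook in `do_request`; the helper name `is_allowed`, the sample rules and the example.com URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (made up for this sketch).
ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(ROBOTS, "https://example.com/news/story.html")
blocked = is_allowed(ROBOTS, "https://example.com/private/draft.html")
print(allowed, blocked)
```

A request hook could call a check like this before downloading and skip or raise when the URL is disallowed.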
What's Changed
- lang: Add ISO 639-3 language code support for Kurdish (ckb, kmr) by Muzaffer Cikay in #691
- tests: add robots tests by Andrei in 030e50d
- feat: added robots.txt check with hook in `do_request` by Andrei in 62dece9
- feat: add hooks to `get_html` by Andrei in 708cc10
- parse: prioritize `datePublished` over `dateCreated` in JSON-LD extraction by Pontus Svensson in cdadb9e
- docs: Readme improvements by Andrei in 18ca21c
- feat: add `nltk` as an optional dependency for leaner deployments by Andrei in e073459
- docs: added additional documentation for GoogleNews and Cloudscraper integration by Andrei in aceb853
- rework: type annotations removed deprecated types (python 3.10+) by Andrei in bd82a41
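Making a package such as `nltk` optional usually means importing it lazily and failing with a clear message only when a feature actually needs it. The sketch below shows that general pattern. The helpers `optional_import` and `require` are hypothetical names for this example, not functions from the library.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require(name: str) -> ModuleType:
    """Return the module, or raise a helpful error if it is missing."""
    module = optional_import(name)
    if module is None:
        raise ImportError(
            f"The optional dependency '{name}' is not installed; "
            "install it to enable this feature."
        )
    return module


# 'json' is always available; the second name is deliberately bogus.
print(optional_import("json") is not None)
print(optional_import("surely_not_an_installed_package_xyz") is None)
```

With this approach, a deployment that never uses the dependent features does not need the package installed.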
Bugs Fixed
- skip null entries in JSON-LD arrays during extraction (fix #692) by ghxm in 77d6ccc
- ArticleException f-string not interpolating `status_code` (fix #684) by Andrei in 7caa2a5
- accept relative paths for categories (PR #667) by BRNMan in #667
- use w3lib to detect webpage encoding by Andrei in 3bd4f00
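The two JSON-LD changes in this release (preferring `datePublished` over `dateCreated`, and skipping null entries in arrays) can be illustrated with a small standalone sketch. This is an assumption about the general approach, not the library's extraction code. The function `extract_date` and the sample payload are invented for the example.

```python
import json

# Keys in priority order: datePublished wins over dateCreated.
DATE_KEYS = ("datePublished", "dateCreated")


def extract_date(json_ld_text: str):
    """Return the best publication date string found in a JSON-LD payload."""
    data = json.loads(json_ld_text)
    entries = data if isinstance(data, list) else [data]
    # Skip null (None) and any other non-object entries in the array.
    entries = [entry for entry in entries if isinstance(entry, dict)]
    for key in DATE_KEYS:
        for entry in entries:
            value = entry.get(key)
            if value:
                return value
    return None


sample = json.dumps(
    [
        None,
        {
            "@type": "NewsArticle",
            "dateCreated": "2024-01-01",
            "datePublished": "2024-01-05",
        },
    ]
)
print(extract_date(sample))  # prints 2024-01-05
```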
New Contributors
- Muzaffer Cikay made their first contribution in #691
Full Changelog: 0.9.4...0.9.5
Version 0.9.4.1 - Python 3.14 support
- feat: add support for python 3.14
- rework: minor typing changes
- tests: increase test coverage
What's Changed
- lang: Add Kurdish Kurmanji stop words by @cikay in #677
- docs: Update supported languages by @cikay in #676
- docs: bump sphinx version by @AndyTheFactory in #680
- Docs 0.9.4 by @AndyTheFactory in #681
Full Changelog: 0.9.4...0.9.4.1
Version 0.9.4 - Dropping python 3.8, 3.9, Switching to `uv`, improving tests
New Features
Bumped the minimum Python version to 3.10. Versions 3.8 and 3.9 are no longer supported, but might still work.
- misc: switch to uv from poetry
- parse: add brotli compression
- install: dependency versions pin
- tests: split tests into unit, integration and e2e. Only unit tests are run on each PR. Integration and e2e tests are run locally when developing.
- tests: added coverage report generation. Coverage uploaded to coveralls.io
Version 0.9.3.1 Minor bug fix
Some fixes with regard to dependencies on Python >= 3.11. The NumPy version was incompatible with Colab. Now it is fixed.
Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.
Version 0.9.3 Article Parsing improvements and huge jump in multi language support (support for over 40 languages added)
Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. Much easier to add new languages now. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it will use cloudscraper as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
We now use two evaluation datasets - the one from scrapinghub and one created by us from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of the changes.
We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset
- You can now install languages that need special packages as optional dependencies
- Google News fully integrated in the scraping process.
- You can now pickle sources and articles - easier to save and recover scraping
- Bumped minimum python version support to Python 3.8
Version 0.9.2 some major changes in document parsing
- You can now use the module as a command line interface (CLI). Usage: `python -m newspaper --url https://www.test.com`. More information in the documentation.
- I have added an evaluation script against a dataset from scrapinghub. This will help keep track of future improvements.
- Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous `news_pool` was replaced with a `fetch_news()` function.
- Caching is now much more flexible. You can disable it completely or for one request.
- You can now use the `newspaper.article()` function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the `Article` class.
- Sites protected by Cloudflare are better detected and raise an exception. The reason will be in the exception message.
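To illustrate the move to `ThreadPoolExecutor` mentioned above, here is a minimal sketch of downloading several URLs concurrently with `concurrent.futures`. It is not the implementation of `fetch_news()`. The names `fetch_all` and `fake_download` are hypothetical, and the download step is a stand-in that performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(
    urls: List[str], download: Callable[[str], str], max_workers: int = 4
) -> List[str]:
    """Run download(url) for every URL in a thread pool, preserving input order."""
    # The context manager waits for all worker threads to finish on exit,
    # so no manual join or queue handling is needed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download, urls))


def fake_download(url: str) -> str:
    """Stand-in for a real HTTP request (assumption for this sketch)."""
    return f"<html>{url}</html>"


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
pages = fetch_all(urls, fake_download)
print(len(pages))  # prints 3
```

`pool.map` returns results in the same order as the input URLs, which keeps each page easy to match to its source.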
Version 0.9.1 code refactoring and bugfixes
New feature:
- version bump (f7107be)
- tests: Add test case for (592f6f6)
- parse: added possibility to follow "read more" links in articles (0720de1)
- Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462) (5ff5d27)
- parse: extended data parsing of json-ld metadata (issue #518) (fc413af)
- tests: added script to create test cases (9df8c16)
- parse: added tag for date detection issue #835 (41152eb)
- parse: added og:regDate to known date tags (dc35e29)
- tests: convert unittest to pytest (45c4e8d)
Bugs fixed:
- typing annotation for set python 3.8 (895343f)
- parse: improve meta tag content for articles and pubdate (37bb0b7)
- parse: 📝 improved author detection. improved video links detection (23c547f)
- parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM. (6874d05)
- small changes, replace os.path with pathlib (5598d95)
- parse: use one file of stopwords for english, the one in the standard folder #503 (6bdf813)
- parse: better author parsing based on issue #493 (f93a9c2)
- parse: make the url date parsing stricter. Issue #514 (0cc1e83)
- parse: replace \n with space in sentence split (Issue #506) (3ccb87c)
- parsing: catch url errors resulting from parsed image links (9140a04)
- correct python versions in pipeline (7e671df)
- gitignore update (8855f00)
First release after the fork
First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped versions such that it is clear that this is a fork and not the original project.
New feature:
- tests: started moving tests to pytest (f294a01) (by Andrei)
- parser: add yoast schema parse for date extraction (39a5cff) (by Andrei)
Bugs fixed:
- docs: update README.md (d5f9209) (by Andrei)
- feed_url parsing, issue #915 (ec2d474) (by Andrei)
- better content detection. added and tag as candidate for content parent_node (447a429) (by Andrei)
- close pickle files - PR #938 (d7608da) (by Andrei)
- parsing: improved publication date extraction (4d137eb) (by Andrei)
- some linter errors, whitespaces and spelling (79553f6) (by Andrei)