URL parsing compatibility #626

@Jacey0

Hi,

I'm having trouble setting up the environment for this. I'm using a conda environment on Windows and get the same problem with Python 3.9, 3.10, and 3.11. I also made sure to pip install from the requirements.txt here before running pip install newspaper4k.

The first error I hit is on import:

  File "c:\Users...\scrape_from_urls.py", line 1, in <module>
    import newspaper
  File "C:\Users...\site-packages\newspaper\__init__.py", line 17, in <module>
    from .api import (
  File "C:\Users...\site-packages\newspaper\api.py", line 11, in <module>
    from newspaper.article import Article
  File "C:\Users...\site-packages\newspaper\article.py", line 28, in <module>
    from .extractors import ContentExtractor
  File "C:\Users...\site-packages\newspaper\extractors\__init__.py", line 8, in <module>
    from newspaper.extractors.content_extractor import ContentExtractor
  File "C:\Users...\site-packages\newspaper\extractors\content_extractor.py", line 8, in <module>
    from newspaper.extractors.articlebody_extractor import ArticleBodyExtractor
  File "C:\Users...\site-packages\newspaper\extractors\articlebody_extractor.py", line 8, in <module>
    import newspaper.extractors.defines as defines
  File "C:\Users...\site-packages\newspaper\extractors\defines.py", line 2, in <module>
    from typing_extensions import TypedDict, NotRequired
ModuleNotFoundError: No module named 'typing_extensions'

No biggie: running pip install typing-extensions fixes the import. But then I hit a second error when I call newspaper.article with any URL:

  File "c:\Users...\scrape_from_urls.py", line 7, in <module>
    article = newspaper.article(url)
  File "C:\Users...\site-packages\newspaper\__init__.py", line 61, in article
    a = Article(url, language=language, **kwargs)
  File "C:\Users...\site-packages\newspaper\article.py", line 195, in __init__
    scheme = urls.get_scheme(url)
  File "C:\Users...\site-packages\newspaper\urls.py", line 370, in get_scheme
    return urlparse(abs_url, **kwargs).scheme
  File "c:\Users...\lib\urllib\parse.py", line 399, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "c:\Users...\lib\urllib\parse.py", line 136, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "c:\Users...\lib\urllib\parse.py", line 120, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "c:\Users...\lib\urllib\parse.py", line 120, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

I also tried newspaper3k and got a similar AttributeError, so I'm wondering whether I should be using a different urllib version (urllib3==1.26.18).

It would be great if these could be added to requirements.txt. Thank you.
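
For reference, the last frames of the second traceback can be reproduced with the standard library alone. That suggests, as an assumption that nothing in this thread confirms, that the value passed as url was a built-in function or method rather than a string (for example input instead of input()). Note that urllib.parse ships with Python itself and is separate from the third-party urllib3 package. The sketch below shows the reproduction plus an early type check; get_scheme_checked is a hypothetical helper for illustration, not part of newspaper4k.

```python
from urllib.parse import urlparse


def get_scheme_checked(url):
    """Hypothetical helper (not part of newspaper4k): return the URL scheme,
    rejecting anything that is not str or bytes before urlparse sees it."""
    if not isinstance(url, (str, bytes)):
        raise TypeError(f"url must be str or bytes, got {type(url).__name__}")
    return urlparse(url).scheme


# Reproduce the error: pass a built-in function object instead of a string.
# urlparse treats any non-str argument as bytes and tries to call .decode() on it.
try:
    urlparse(input)  # note: `input`, not `input()`
except AttributeError as exc:
    message = str(exc)
    print(message)

# With a real string, the scheme is extracted as expected.
print(get_scheme_checked("https://example.com/article"))
```

If the check raises TypeError in the calling script, the thing to inspect is how url is assigned there, rather than which package versions are installed.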

Labels: help wanted