[BUG] keep_article_html doesn't exist anymore and images are not preserved #721

@blackdwarf

Describe the bug
Using https://swtch.com/~rsc/regexp/regexp1.html as a source, we can see there are several images in the text. The markup is pretty vanilla, and most of the images are children of <p> tags. When I parse the document using the latest release, the images are not kept inline in the markup. They are, however, successfully extracted into the images property on the created object.

In addition, and I'm not sure this is a bug: sometime in Nov 2023, keep_article_html was removed from the Configuration class and replaced with clean_article_html, which I don't think has the same semantics.

To Reproduce

from newspaper import Article
doc = Article("https://swtch.com/~rsc/regexp/regexp1.html", keep_article_html=True)
doc.download()
doc.parse()

print(doc.article_html)

Expected behavior
The markup should contain the <img> tags.
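
To make the expected behavior checkable, here is a minimal sketch that lists the <img> sources found in an HTML string. It uses only the standard library's html.parser; the names ImgCollector and img_sources are hypothetical helpers made up for this illustration and are not part of newspaper.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser.

    Hypothetical helper for this issue, not part of the library.
    """

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<img ... />) are also routed here by default.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def img_sources(html):
    """Return the src values of all <img> tags in the given HTML string."""
    parser = ImgCollector()
    parser.feed(html or "")  # treat None as an empty document
    parser.close()
    return parser.sources
```

With the reproduction above, calling img_sources(doc.article_html) after doc.parse() should return a non-empty list if the images are preserved in the markup. An empty list matches the behavior reported here.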

System information

  • OS: Windows 11, macOS
  • Python version: 3.10, 3.11, 3.12
  • Library version: 0.9.4 (latest release)
