Describe the bug
Using https://swtch.com/~rsc/regexp/regexp1.html as a source, we can see there are several images in the text. The markup is pretty vanilla and most of the images are children of <p> tags. When I parse the document using the latest release, the images are not inline in the markup returned by article_html. They are successfully extracted into the images property on the created object, however.
In addition, I'm not sure whether this is a bug, but sometime in November 2023, keep_article_html was removed from the Configuration class and replaced with clean_article_html, which I don't think has the same semantics.
To Reproduce
from newspaper import Article
doc = Article("https://swtch.com/~rsc/regexp/regexp1.html", keep_article_html=True)
doc.download()
doc.parse()
print(doc.article_html)
Expected behavior
The markup should contain the <img> tags.
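A quick way to check this without eyeballing the output is to count the <img> tags in the returned markup. This is only a minimal sketch using the standard library; count_img_tags and ImgCounter are hypothetical helper names, not part of newspaper.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags in an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing <img/> also lands here.
        if tag == "img":
            self.count += 1


def count_img_tags(html):
    """Return the number of <img> tags in html; None or empty input counts as 0."""
    parser = ImgCounter()
    parser.feed(html or "")
    parser.close()
    return parser.count
```

With the reproduction above, count_img_tags(doc.article_html) should be greater than zero if the images were kept inline.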
System information
- OS: Windows 11, macOS
- Python version: 3.10, 3.11, 3.12
- Library version: 0.9.4 (latest release)