feat: introduce custom tags extractor#662
feat: introduce custom tags extractor#662queukat wants to merge 1 commit intoAndyTheFactory:masterfrom
Conversation
|
Hi @queukat ! What I wonder is - why create a new class and not extend the functionality here: newspaper4k/newspaper/extractors/metadata_extractor.py Lines 164 to 174 in c5e4170 |
|
hey @AndyTheFactory Short AnswerWe introduced a dedicated Detailed ComparisonPurpose and ScopeMetadataExtractor:
TagsExtractor:
Reduced Complexity and RiskWithout a Separate Class:
With TagsExtractor:
Single Responsibility Principle
Flexibility and Future Proofing
No Impact on Existing Usage
ConclusionCreating a standalone
|
|
Hi @queukat Thanks for your compelling arguments. You are right, it would make sense to have the tags in their own extractor. Would it not make sense to move the whole functionality of def _get_tags into the TagsExtractor ? |
|
You're right — conceptually, it would make sense to move _get_tags entirely into TagsExtractor. That said, I’d like to clarify the intent behind the current setup. The TagsExtractor wasn’t designed as a general-purpose or centralized solution for tag extraction. It was created specifically to handle a few niche cases where tag elements weren't being captured — in particular, two smaller sites I came across where the existing metadata logic didn’t work. So rather than bloating MetadataExtractor, this was a way to isolate site-specific logic without introducing risk or complexity to the core system. If we start fully migrating everything tag-related into TagsExtractor, we risk repeating the same issue — just in a different place — and eventually end up with a monolithic TagsExtractor.py, defeating the purpose of separation and modularity. So yes, moving _get_tags is structurally reasonable, but it assumes TagsExtractor is meant to be a general solution — which it currently isn’t. If the goal is to evolve it into something more universal, that could be considered, but probably needs a broader discussion on scope and ownership. |
Issue Link: n/a
Changes Overview:
TagsExtractorclass that finds tags fromTagsExtractorintoContentExtractorto unify custom tag extractionArticle.parse()to merge extracted tags intoarticle.tagsLimitations:
<a class="lnk" or rel="tag">. Might need expansions for other patterns.Breaking Changes:
Testing Approach: