add who guideline scraper #6
Open
conscioustahoe wants to merge 1 commit into
What this does
Adds a WHO scraper to the datasets scraping pipeline. Same shape as the NICE scraper: discovery and extraction are separate and everything outputs a normalized `ScrapedDocument`.

For Milestone 1 we scrape the HTML Overview section from each publication landing page. Full PDF text is deferred on purpose. The metadata records `content_scope: overview` and points at the PDF via `download_url` so we can add a PDF path later without guessing.

How discovery works
WHO runs on Sitefinity, not Next.js. Listing goes through their OData publications API, filtered to the Guidelines publishing office UUID (`c09761c0-ab8e-4cfa-9744-99509c4d306b`). That gives us ~356 publications, with `$count=true` for the progress bar.

One API quirk: `$top` has to stay at 25 or below or WHO drops `DownloadUrl` from the response. Page size is set accordingly. If the API fails we fall back to parsing the SSR cards on `/publications/who-guidelines`.

How extraction works
Each publication page at `/publications/i/item/{id}` gets fetched and the Overview block under `section.dynamic-content__section` is converted to markdown via the shared `html_to_markdown` helper. Nav and footer cruft stay out because we target the content container, not the whole page.

Empty or missing markup raises `WhoFetchError`. Same behavior as NICE.

CLI
`--source all` now includes WHO alongside NICE.

Licensing
WHO publications since Nov 2016 are CC BY-NC-SA 3.0 IGO. Each doc gets `metadata.license` and `metadata.attribution`. Full terms are in `datasets/amfv_datasets/scraping/LICENSE_NOTES.md`.

Files
- `datasets/amfv_datasets/scraping/who.py`: scraper module
- `datasets/amfv_datasets/scraping/cli.py`: `--source who` dispatch
- `datasets/amfv_datasets/scraping/LICENSE_NOTES.md`: WHO license + attribution notes
- `datasets/test/test_scraping_who.py`: offline tests with MockTransport + fixtures

Test plan
- `uv run pytest datasets/test/test_scraping_who.py`: listing parse, single url, full doc, error path
- `uv run pytest datasets/test/test_scraping_*.py`: no regressions (30 passed)
- `uv run amfv-scrape --source who --documents 1 --no-progress`: live smoke test with non-empty Overview markdown
- `uv run amfv-scrape --source who --documents 3 -f markdown -o /tmp/who-out/`
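
For reviewers who want the paging rule from the discovery section in concrete form, here is a minimal sketch. It is an illustration, not the code in `who.py`: the endpoint URL, the `publishingoffices` filter field, and the function name `listing_page_urls` are assumptions made up for this example. Only the office UUID, the 25-row cap on `$top`, and `$count=true` come from the description above.

```python
from urllib.parse import urlencode

# From the PR description: Guidelines publishing office UUID and the page-size cap.
GUIDELINES_OFFICE_ID = "c09761c0-ab8e-4cfa-9744-99509c4d306b"
MAX_PAGE_SIZE = 25  # above this, WHO drops DownloadUrl from the response

# Hypothetical placeholder endpoint; the real path lives in who.py.
LISTING_ENDPOINT = "https://example.invalid/api/publications"


def listing_page_urls(total, page_size=MAX_PAGE_SIZE):
    """Return one listing URL per page needed to cover `total` publications."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    urls = []
    for skip in range(0, total, page_size):
        params = {
            # Hypothetical filter field name; only the UUID is from the PR.
            "$filter": f"publishingoffices eq {GUIDELINES_OFFICE_ID}",
            "$top": page_size,
            "$skip": skip,
            "$count": "true",
        }
        # Keep "$" literal so the OData option names stay readable.
        urls.append(f"{LISTING_ENDPOINT}?{urlencode(params, safe='$')}")
    return urls
```

For 356 publications this yields 15 page URLs, the last one with `$skip=350`. Rejecting a `page_size` above 25 up front keeps a future caller from silently losing `DownloadUrl`.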