add who guideline scraper by conscioustahoe · Pull Request #6 · MedARC-AI/amfv

conscioustahoe · 2026-07-01T19:25:58Z

What this does

Adds a WHO scraper to the datasets scraping pipeline. It has the same shape as the NICE scraper: discovery and extraction are separate steps, and everything outputs a normalized ScrapedDocument.

For Milestone 1 we scrape the HTML Overview section from each publication landing page. Full PDF text is deferred on purpose. The metadata records content_scope: overview and points at the PDF via download_url so we can add a PDF path later without guessing.
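To make the metadata contract concrete, here is a minimal sketch of how an overview-only document could be assembled. The ScrapedDocument fields and the build_overview_document helper are assumptions for illustration; only the content_scope and download_url keys come from this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScrapedDocument:
    """Hypothetical stand-in for the pipeline's normalized document type."""
    source: str
    url: str
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


def build_overview_document(
    url: str, title: str, overview_md: str, download_url: Optional[str]
) -> ScrapedDocument:
    """Wrap Overview markdown and record that only the overview was scraped."""
    metadata = {
        # Marks the text as landing-page Overview only, not full PDF text.
        "content_scope": "overview",
        # Pointer to the PDF so a later PDF path can find the full document.
        "download_url": download_url,
    }
    return ScrapedDocument(
        source="who", url=url, title=title, content=overview_md, metadata=metadata
    )
```

A later PDF extractor could then key off content_scope to decide which documents still need full text.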

How discovery works

WHO runs on Sitefinity, not Next.js. Listing goes through their OData publications API, filtered to the Guidelines publishing office UUID (c09761c0-ab8e-4cfa-9744-99509c4d306b). That returns ~356 publications, and we request $count=true so the progress bar knows the total up front.

One API quirk: $top has to stay at 25 or below or WHO drops DownloadUrl from the response. Page size is set accordingly. If the API fails we fall back to parsing the SSR cards on /publications/who-guidelines.
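The paging rule and the fallback can be sketched as below. This is an illustration under assumptions, not the code in who.py: page_params, page_offsets, and discover are hypothetical names, and the filter expression is passed in by the caller because the exact field name is not stated here. Only the standard OData options ($top, $skip, $count, $filter) and the 25-item limit come from the description above.

```python
from typing import Callable, Dict, List, Optional

MAX_PAGE_SIZE = 25  # above this, WHO drops DownloadUrl from the response


def page_params(
    skip: int,
    page_size: int = MAX_PAGE_SIZE,
    filter_expr: Optional[str] = None,
    include_count: bool = False,
) -> Dict[str, str]:
    """Build OData query parameters for one listing page."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    if skip < 0:
        raise ValueError("skip must be non-negative")
    params = {"$top": str(page_size), "$skip": str(skip)}
    if filter_expr:
        params["$filter"] = filter_expr
    if include_count:
        # Only needed once, to size the progress bar.
        params["$count"] = "true"
    return params


def page_offsets(total: int, page_size: int = MAX_PAGE_SIZE) -> List[int]:
    """Return the $skip value for every page needed to cover `total` items."""
    return list(range(0, total, page_size))


def discover(list_via_api: Callable[[], list], list_via_cards: Callable[[], list]) -> list:
    """Try the API listing first; fall back to the SSR cards if it fails."""
    try:
        return list_via_api()
    except Exception:  # the real scraper may catch a narrower error type
        return list_via_cards()
```

With roughly 356 publications and a page size of 25, page_offsets yields 15 requests.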

How extraction works

Each publication page at /publications/i/item/{id} is fetched, and the Overview block under section.dynamic-content__section is converted to markdown via the shared html_to_markdown helper. Nav and footer cruft stay out because we target the content container, not the whole page.

Empty or missing markup raises WhoFetchError. Same behavior as NICE.
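The container targeting and the error path can be illustrated with the standard library alone. This sketch collects plain text rather than markdown, since the shared html_to_markdown helper is not reproduced here. The parser class and extract_overview_text are hypothetical names, and WhoFetchError is redefined locally as a stand-in.

```python
from html.parser import HTMLParser


class WhoFetchError(Exception):
    """Local stand-in for the scraper's error type."""


class _OverviewText(HTMLParser):
    """Collect text that sits inside section.dynamic-content__section."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # <section> nesting depth inside the target container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "section":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1  # nested section inside the target
        elif "dynamic-content__section" in classes:
            self._depth = 1  # entering the target container

    def handle_endtag(self, tag):
        if tag == "section" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_overview_text(page_html: str) -> str:
    """Return the Overview text, or raise if the container is empty or missing."""
    parser = _OverviewText()
    parser.feed(page_html)
    parser.close()
    text = "\n\n".join(parser.chunks)
    if not text:
        raise WhoFetchError("no content under section.dynamic-content__section")
    return text
```

Because only text inside the target section is kept, anything in nav or footer elements never reaches the output.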

CLI

uv run amfv-scrape --source who --documents 1
uv run amfv-scrape --source who --url https://www.who.int/publications/i/item/9789240121805
uv run amfv-scrape --source who --documents 3 -f markdown -o ./who-out/

--source all now includes WHO alongside NICE.
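One way the --source dispatch could expand all into the individual scrapers is sketched here. The SOURCES tuple, resolve_sources, and build_parser are assumptions for illustration and are not taken from cli.py; only the flag names shown in the commands above are from this PR, and the sketch covers just a subset of them.

```python
import argparse

# Hypothetical registry of scraper names; "all" expands to every entry.
SOURCES = ("nice", "who")


def resolve_sources(choice: str) -> list:
    """Map the --source value to the list of scrapers to run."""
    if choice == "all":
        return list(SOURCES)
    if choice not in SOURCES:
        raise ValueError(f"unknown source: {choice}")
    return [choice]


def build_parser() -> argparse.ArgumentParser:
    """Parser covering only the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="amfv-scrape")
    parser.add_argument("--source", choices=[*SOURCES, "all"], required=True)
    parser.add_argument("--documents", type=int, default=None)
    parser.add_argument("--url", default=None)
    return parser
```

Keeping the expansion in one function means a new scraper only has to be added to the registry to be picked up by all.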

Licensing

WHO publications since Nov 2016 are CC BY-NC-SA 3.0 IGO. Each doc gets metadata.license and metadata.attribution. Full terms are in datasets/amfv_datasets/scraping/LICENSE_NOTES.md.
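As a rough sketch of how the two license fields might be attached, assuming a plain dict for metadata: the with_license helper is hypothetical, and the attribution wording is left to the caller because the exact text lives in LICENSE_NOTES.md rather than in this description.

```python
WHO_LICENSE = "CC BY-NC-SA 3.0 IGO"  # license named in this PR


def with_license(metadata: dict, attribution: str) -> dict:
    """Return a copy of `metadata` with license and attribution set."""
    enriched = dict(metadata)  # copy so the caller's dict is not mutated
    enriched["license"] = WHO_LICENSE
    enriched["attribution"] = attribution
    return enriched
```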

Files

datasets/amfv_datasets/scraping/who.py — scraper module
datasets/amfv_datasets/scraping/cli.py — --source who dispatch
datasets/amfv_datasets/scraping/LICENSE_NOTES.md — WHO license + attribution notes
datasets/test/test_scraping_who.py — offline tests with MockTransport + fixtures

Test plan

uv run pytest datasets/test/test_scraping_who.py — listing parse, single url, full doc, error path
uv run pytest datasets/test/test_scraping_*.py — no regressions (30 passed)
uv run amfv-scrape --source who --documents 1 --no-progress — live smoke test with non-empty Overview markdown
reviewer inspects the markdown output: uv run amfv-scrape --source who --documents 3 -f markdown -o /tmp/who-out/

add who guideline scraper

5a1429a

conscioustahoe wants to merge 1 commit into MedARC-AI:main from conscioustahoe:feat/who-scraper
