add who guideline scraper #6

Open
conscioustahoe wants to merge 1 commit into MedARC-AI:main from conscioustahoe:feat/who-scraper

@conscioustahoe

What this does

Adds a WHO scraper to the datasets scraping pipeline. Same shape as the NICE scraper — discovery and extraction are separate and everything outputs a normalized ScrapedDocument.

For Milestone 1 we scrape the HTML Overview section from each publication landing page. Full PDF text is deferred on purpose. The metadata records content_scope: overview and points at the PDF via download_url so we can add a PDF path later without guessing.

How discovery works

WHO runs on Sitefinity, not Next.js. Listing goes through their OData publications API, filtered to the Guidelines publishing office UUID (c09761c0-ab8e-4cfa-9744-99509c4d306b). That returns ~356 publications, and $count=true supplies the total for the progress bar.

One API quirk: $top has to stay at 25 or below or WHO drops DownloadUrl from the response. Page size is set accordingly. If the API fails we fall back to parsing the SSR cards on /publications/who-guidelines.
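A rough sketch of the paging loop described above, not the code in who.py. The $filter expression and the function names are assumptions for illustration; "value" and "@odata.count" are the standard OData v4 JSON response keys. The fetcher is injected so the loop can be exercised offline.

```python
from typing import Callable, Iterator

# UUID of the Guidelines publishing office, taken from the PR description.
GUIDELINES_OFFICE_ID = "c09761c0-ab8e-4cfa-9744-99509c4d306b"
# Above this page size WHO reportedly drops DownloadUrl from the response.
MAX_PAGE_SIZE = 25


def page_params(skip: int, top: int = MAX_PAGE_SIZE) -> dict[str, str]:
    """Build OData query parameters for one listing page."""
    if top > MAX_PAGE_SIZE:
        raise ValueError(f"$top must be <= {MAX_PAGE_SIZE} to keep DownloadUrl")
    return {
        "$top": str(top),
        "$skip": str(skip),
        "$count": "true",  # total count, used for the progress bar
        # Hypothetical filter field name; the real expression may differ.
        "$filter": f"publishingoffices/any(x: x eq {GUIDELINES_OFFICE_ID})",
    }


def iter_publications(fetch_page: Callable[[dict[str, str]], dict]) -> Iterator[dict]:
    """Yield publication records page by page until the listing is exhausted."""
    skip = 0
    while True:
        payload = fetch_page(page_params(skip))
        items = payload.get("value", [])
        if not items:
            return
        yield from items
        skip += len(items)
        total = payload.get("@odata.count")
        if total is not None and skip >= total:
            return
```

Passing the fetcher in as a callable keeps the HTTP client out of the loop, so a fake that slices a fixture list is enough to test it.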

How extraction works

Each publication page at /publications/i/item/{id} is fetched, and the Overview block under section.dynamic-content__section is converted to markdown via the shared html_to_markdown helper. Nav and footer cruft stay out because we target the content container, not the whole page.

Empty or missing markup raises WhoFetchError. Same behavior as NICE.
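A stdlib-only sketch of the container targeting and the error path described above. It collects plain text rather than markdown because the signature of the shared html_to_markdown helper is not shown here. The WhoFetchError class below is a stand-in for the real one, and extract_overview_text is a hypothetical name.

```python
from html.parser import HTMLParser


class WhoFetchError(Exception):
    """Stand-in for the error type named in the PR description."""


class _OverviewParser(HTMLParser):
    """Collect text that sits inside section.dynamic-content__section."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # > 0 while inside the target section
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "section":
                self.depth += 1  # nested section inside the target
        elif tag == "section":
            classes = (dict(attrs).get("class") or "").split()
            if "dynamic-content__section" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "section":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_overview_text(html: str) -> str:
    """Return the Overview text, or raise if the block is missing or empty."""
    parser = _OverviewParser()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    if not text:
        raise WhoFetchError("Overview section missing or empty")
    return text
```

Anything outside the target section, such as nav or footer markup, never reaches the output because text is only collected while the depth counter is above zero.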

CLI

uv run amfv-scrape --source who --documents 1
uv run amfv-scrape --source who --url https://www.who.int/publications/i/item/9789240121805
uv run amfv-scrape --source who --documents 3 -f markdown -o ./who-out/

--source all now includes WHO alongside NICE.

Licensing

WHO publications since Nov 2016 are CC BY-NC-SA 3.0 IGO. Each doc gets metadata.license and metadata.attribution. Full terms are in datasets/amfv_datasets/scraping/LICENSE_NOTES.md.
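For illustration, the metadata fields mentioned in this description could be assembled roughly like this. The key names (content_scope, download_url, license, attribution) come from the text above; the function name and the default attribution string are assumptions, and the actual ScrapedDocument model is not shown.

```python
from typing import Optional

# License string as stated in the PR description.
WHO_LICENSE = "CC BY-NC-SA 3.0 IGO"


def build_who_metadata(
    download_url: Optional[str],
    attribution: str = "World Health Organization",  # placeholder wording
) -> dict[str, str]:
    """Assemble the per-document metadata for an Overview-only scrape."""
    metadata = {
        "content_scope": "overview",  # full PDF text is deferred
        "license": WHO_LICENSE,
        "attribution": attribution,
    }
    if download_url:
        # Points at the PDF so a later PDF path has a known source.
        metadata["download_url"] = download_url
    return metadata
```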

Files

  • datasets/amfv_datasets/scraping/who.py: scraper module
  • datasets/amfv_datasets/scraping/cli.py: --source who dispatch
  • datasets/amfv_datasets/scraping/LICENSE_NOTES.md: WHO license + attribution notes
  • datasets/test/test_scraping_who.py: offline tests with MockTransport + fixtures

Test plan

  • uv run pytest datasets/test/test_scraping_who.py — listing parse, single url, full doc, error path
  • uv run pytest datasets/test/test_scraping_*.py — no regressions (30 passed)
  • uv run amfv-scrape --source who --documents 1 --no-progress — live smoke test with non-empty Overview markdown
  • reviewer runs markdown output: uv run amfv-scrape --source who --documents 3 -f markdown -o /tmp/who-out/
