Add six new datasets (EHRShot, HIRID, INSPIRE, NWICU, SICdb, eICU) + complete AUMCdb #299
Closed
mmcdermott wants to merge 1 commit into
Replicates the dataset additions from #258 on top of current dev (the original branch is too far behind to merge directly; this PR keeps only the dataset/task content, not the stale infra reverts).

Datasets added (`src/MEDS_DEV/datasets/<name>/`):
- EHRShot: Stanford EHR cohort with pre-built MEDS extraction.
- HIRID: Bern ICU dataset via MEDS_extract-HIRID.
- INSPIRE: perioperative dataset via MEDS_extract-INSPIRE.
- NWICU: Northwestern ICU dataset via NWICU_MEDS.
- SICdb: Salzburg ICU dataset via MEDS_extract-SICdb.
- eICU: multi-center US ICU dataset via MEDS_extract-eICU (with demo).

AUMCdb is also completed (was previously just predicates.yaml + README): adds dataset.yaml, requirements.txt, refs.bib, and the full ICU predicate set from the upstream PR.

Tasks: mortality/in_icu/first_24h now lists AUMCdb and NWICU under supported_datasets in addition to MIMIC-IV.

MIMIC-IV/README.md: pulled the longer description + access-requirements write-up from #258 (replaces the TODO placeholders).

Each dataset.yaml has a `build_demo` command. For datasets without a real demo recipe, this is a stub echo so registry validation passes (matching the pattern HIRID already used in the source PR).

Co-Authored-By: Robin P. van de Water <rvandewater@users.noreply.github.com>
Co-Authored-By: Patrick Rockenschaub <prockenschaub@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
73023a4 to e3c7a4a
❌ 1 Tests Failed:
Splitting this into one PR per dataset for easier review (per #299 discussion). The cross-cutting |
This was referenced May 13, 2026
Summary
Replicates the dataset additions from #258 on top of current dev. The upstream branch (May 2025) is too far behind dev to merge directly — it reverts the web-tooling, CI tiers, CLAUDE.md, uv.lock, and other infra that have landed in the meantime. This PR keeps only the dataset/task content and rebases it onto dev as a single commit.
Also makes #243 (NWICU) redundant; that PR can be closed once this lands.
What's added
Six new datasets (`src/MEDS_DEV/datasets/<name>/`):
- EHRShot: Stanford EHR cohort with pre-built MEDS extraction.
- HIRID: Bern ICU dataset via `MEDS_extract-HIRID`.
- INSPIRE: perioperative dataset via `MEDS_extract-INSPIRE`.
- NWICU: Northwestern ICU dataset via `NWICU_MEDS`.
- SICdb: Salzburg ICU dataset via `MEDS_extract-SICdb`.
- eICU: multi-center US ICU dataset via `MEDS_extract-eICU` (with demo recipe).

AUMCdb completed: was previously just `predicates.yaml` + `README.md`; this adds `dataset.yaml`, `requirements.txt`, `refs.bib`, and the full ICU predicate set.

Tasks: `mortality/in_icu/first_24h.supported_datasets` now includes AUMCdb and NWICU (alongside MIMIC-IV). EHRShot, HIRID, INSPIRE, SICdb, and eICU register as datasets, but no task references them yet, so they remain inert registry entries until tasks are added.

MIMIC-IV README: replaces the previous `TODO: Summarize MIMIC-IV` stub with the longer description + access-requirements write-up from the upstream PR.

Fixes applied during local review
Beyond replicating the contributor's content, I made these corrections:
- `AUMCdb/predicates.yaml`: 4× `�mol/l` → `µmol/l` (the predicates would never have matched as-is).
- `HIRID`, `INSPIRE`, `SICdb` predicates: patterns like `^HOSPITAL_ADMISSION*` (which matches `HOSPITAL_ADMISSIO` + zero-or-more `N`s) corrected to `^HOSPITAL_ADMISSION//.*` to match the working convention used by NWICU/eICU/AUMCdb.
- Removed the `Supported Tasks` list (it referenced three task files that don't exist in this repo) and the empty `## MEDS-transformation` heading.
- Added `build_demo: echo "Demo not available for this dataset"` to dataset.yamls that were missing it, so registry validation passes (matches the pattern HIRID's upstream already used).

Open items needing maintainer / contributor input
Draft because these aren't fixed yet:
- `MEDS_cohort_dir` vs `MEDS_output_dir`: all new dataset.yamls use `MEDS_cohort_dir="{output_dir}"`; dev's MIMIC-IV uses `MEDS_output_dir="{output_dir}"`. One of these is stale; needs the extractor maintainer's call.
- HIRID: `icu_admission`/`icu_discharge` both use `^HOSPITAL_ADMISSION.*`/`^HOSPITAL_DISCHARGE.*`, the same regex as the hospital predicates, so they don't actually distinguish ICU from hospital events. Likely a content bug in the upstream PR, but fixing it needs HIRID-MEDS knowledge.
- SICdb: `icu_admission`/`icu_discharge` collapse to the same `^ADMISSION//.*`/`^DISCHARGE//.*` as the hospital predicates. The mortality/in_icu task wouldn't work correctly on SICdb if it were added to `supported_datasets` (SICdb is not currently listed, so this is latent).
- EHRShot: stub `build_full` (`echo "MEDS extraction is pre-done"`) and no `requirements.txt`. Worth deciding: (a) document the "bring your own MEDS" expectation prominently, or (b) hold the registry entry until an extractor exists.
- EHRShot: `icu_discharge` is the same as `hospital_discharge` (`Visit/DP`). May be intentional (EHRShot's OMOP source may not have a separate ICU discharge event) but worth confirming.
- `value_min:` thresholds will behave incorrectly outside the source's native units.
- `access_policy` unset everywhere (defaults to `PRIVATE_SINGLE_USE`, which is wrong for public PhysioNet datasets). Repo-wide gap, not new in this PR; fine to fix separately.

Test plan
- Registry validation (`test_registry_validation.py` for all 8 datasets).
- Runs of the task (`mortality/in_icu/first_24h`): these require real source data.

Supersedes / refs
- #258 (the `more-datasets` branch by @rvandewater is too stale to merge directly).

🤖 Generated with Claude Code
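
For reviewers who want to confirm the `^HOSPITAL_ADMISSION*` correction listed under "Fixes applied during local review", here is a minimal Python sketch. The two patterns are quoted from this PR; the event codes are made up for illustration. Whether the predicate engine applies prefix or full-string matching is not stated here, so the sketch shows both as an assumption-free comparison rather than a statement about how MEDS-DEV evaluates predicates.

```python
import re

# Patterns quoted from the PR description.
BROKEN = r"^HOSPITAL_ADMISSION*"    # trailing "N*" means zero or more Ns
FIXED = r"^HOSPITAL_ADMISSION//.*"  # literal "//" separator, then anything

# Hypothetical event codes, invented for this illustration only.
CODES = [
    "HOSPITAL_ADMISSION//EMERGENCY",
    "HOSPITAL_ADMISSIO",
    "HOSPITAL_ADMISSIONS_TOTAL",
    "ICU_ADMISSION//MEDICAL",
]


def hits(pattern, candidates, full=False):
    """Return the candidates matched by `pattern`.

    full=False uses re.match (anchored at the start, may stop early);
    full=True uses re.fullmatch (the whole string must match).
    """
    compiled = re.compile(pattern)
    matcher = compiled.fullmatch if full else compiled.match
    return [code for code in candidates if matcher(code)]


broken_prefix = hits(BROKEN, CODES)
fixed_prefix = hits(FIXED, CODES)
broken_full = hits(BROKEN, CODES, full=True)
fixed_full = hits(FIXED, CODES, full=True)

for label, result in [
    ("broken, prefix match", broken_prefix),
    ("fixed, prefix match", fixed_prefix),
    ("broken, full match", broken_full),
    ("fixed, full match", fixed_full),
]:
    print(f"{label}: {result}")
```

Under prefix matching the broken pattern is too loose (it also accepts the truncated `HOSPITAL_ADMISSIO` and unrelated codes that merely start with the same text); under full-string matching it rejects the intended `HOSPITAL_ADMISSION//...` codes entirely. The corrected pattern selects only the `//`-separated code in both modes.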