🩺 syntha

A Synthea-inspired hybrid synthetic patient record generator. Learns the joint distribution of real anonymized Turkish-cohort EHR episodes with a Gaussian copula, then layers Synthea-style clinical pathways on top to emit fully-coded FHIR R4 bundles in Turkish.

License: Apache 2.0 · Python 3.10+ · FHIR R4 · Locale: tr-TR


What it is

syntha is a Python library, command-line tool, and signed cross-platform desktop app for generating realistic synthetic patient records — flat CSVs and FHIR R4 transaction Bundles — that match the statistical structure of an anonymized Turkish-cohort EHR while staying physiologically valid and clinically coded.

The pipeline is hybrid:

  1. Gaussian copula fitted on real anonymized episodes — preserves marginal distributions (age, labs, vitals, comorbidity prevalence) and their joint correlation structure.
  2. Physiologic filter — rejects samples that violate pulse-pressure, Friedewald lipid coherence, or eGFR ↔ creatinine constraints.
  3. Synthea-style clinical modules — nine condition-specific state activations that emit Encounters, MedicationRequests (RxNorm-coded), Procedures, and CarePlans matching each patient's comorbidity profile.
  4. FHIR R4 export — Patient + Observation + Condition + Encounter + MedicationRequest + Procedure + CarePlan + DiagnosticReport + RiskAssessment + FamilyMemberHistory, dual-coded LOINC / SNOMED CT / ICD-10 / RxNorm, Turkish locale (names, addresses, language code, display text).
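To make step 1 concrete, here is a minimal two-column Gaussian copula sketch using only the Python standard library. It is not syntha's implementation: the real pipeline handles mixed-type columns and a nearest-PSD projection, and every function name and the toy data below are illustrative assumptions.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, sd 1


def normal_scores(values):
    """Rank-transform a column to standard-normal scores (ties broken by position)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: map a probability u in (0, 1) back to an observed value."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_sample(x, y, n_samples, seed=0):
    """Fit a bivariate Gaussian copula on (x, y) and draw n_samples synthetic pairs."""
    rho = pearson(normal_scores(x), normal_scores(y))  # latent correlation
    xs, ys = sorted(x), sorted(y)
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)  # correlated normal
        samples.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                        empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return rho, samples


# Toy "source" data (made up): age and a systolic BP that rises with age.
toy_rng = random.Random(42)
ages = [toy_rng.uniform(20, 80) for _ in range(200)]
sbp = [100 + 0.5 * a + toy_rng.gauss(0, 5) for a in ages]
rho, synthetic = fit_and_sample(ages, sbp, n_samples=5)
```

Because each synthetic value is read back through the empirical quantile function, the marginals are preserved by construction, while `rho` carries the dependence between columns.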

Desktop app

Downloads: macOS Apple Silicon (.dmg) · Windows installer (.exe) · Linux AppImage

A Tauri 2 app that bundles the trained Gaussian copula. You pick the cohort, n, seed, and constraints; the app samples synthetic patients fully client-side (no Python at runtime) and lets you download a CSV. The macOS DMG is Developer-ID signed, notarized, and stapled. The Windows installer is code-signed. All three OSes ship a minisign-signed auto-updater, so existing installs get an in-app upgrade banner on next launch.

Install URLs auto-resolve to the latest release via releases/latest/download/… — no per-version link maintenance.

Use from Claude (MCP connector)

syntha ships as an MCP connector. Claude (Desktop or Claude.com custom-connector slot) can drive end-to-end synthetic-cohort workflows directly — Turkish-locale patients, FHIR R4 bundles, and longitudinal multi-encounter histories all without leaving the chat. The connector now exposes 34 tools covering generation, validation, privacy audit, physiologic constraints, and reference-data lookup, and bundles the two trained copulas (tolerant n=135,569; strict n=55,141) so no source CSV is needed at runtime.

pip install "syntha-ehr[mcp]"

Add to Claude Desktop's config (Settings → Developer → Edit Config):

{ "mcpServers": { "syntha": { "command": "syntha-mcp" } } }

| Category | Representative tools | What it does |
| --- | --- | --- |
| Generate | generate_cohort_csv, generate_cohort_fhir, sample_conditional, generate_longitudinal_cohort, generate_longitudinal_fhir, generate_cohort_with_lab_history, generate_clinical_assessments | Cross-sectional + longitudinal cohorts, conditional sampling, FHIR R4 bundles, lab-history series, validated clinical assessment instruments |
| Inspect | syntha_version, list_bundled_cohorts, get_cohort_summary, get_model_card, list_modules_detail, list_pipeline_config_options, list_clinical_modules, get_correlation_pairs | Browse bundled cohorts, model cards, clinical modules, pipeline knobs, correlation structure |
| Validate & audit | validate_synthetic_csv, validate_against_bundled_cohort, privacy_audit, privacy_audit_bundled, fraction_within_reference, check_row_within_reference, apply_physiologic_constraints, ckd_stage_for_egfr, validate_condition_expression | KS / prevalence / correlation fidelity, MIA + AIA privacy audit, reference-range checks, physiologic-constraint enforcement |
| Locale & terminology | list_locale_data, list_condition_codes, list_lab_loinc_codes, list_lab_panels, list_rxnorm_medications, list_clinical_assessment_instruments | Turkish names/addresses + ICD-10 / SNOMED CT / LOINC / RxNorm catalogs |
| Reference data | list_physiologic_constraints, list_reference_ranges, list_schema_columns, list_lab_drift_profiles | Constraint definitions, reference intervals, schema columns, longitudinal drift profiles |

Then ask Claude things like "Using the syntha connector, give me 50 longitudinal Turkish patients with hypertension and diabetes aged 60+, 4 encounters each, as a FHIR bundle." A marketplace listing is in preparation. The full per-tool reference, the Claude.com Streamable-HTTP setup, and instructions for submitting to the Connector directory live in docs/MCP.md.

Install

# PyPI
pip install syntha-ehr

# With the MCP connector (Claude Desktop / custom connectors)
pip install "syntha-ehr[mcp]"

# Or from source
git clone https://github.com/ArioMoniri/syntha
cd syntha
pip install -e ".[dev]"

# Or Docker
docker pull ghcr.io/ariomoniri/syntha:latest

Quick start

# Generate 1 000 synthetic episodes + FHIR bundles + model card + validation report
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant \
  --n 1000 --cohort tolerant

# Longitudinal — multiple encounters per patient with shared HASTA_ID
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant_long \
  --n 2000 --cohort tolerant \
  --longitudinal --encounters-per-patient 4 --years-of-history 3

# Validate a synthetic CSV against the source it was trained on
syntha validate \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/validation.json

# Run a privacy audit (MIA + AIA)
syntha audit \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/privacy.json

By default the CSV writer drops 29 source-pipeline curation flags (pristine_*, berturk_*, drug-safety filters, rf_*) — those are training metadata, not clinical observations, and most are degenerate (constant 0 or 1) in the pristine cohort. Pass --curation-flags to keep them for QA work.

What it produces

For every synthetic patient, syntha emits a FHIR R4 transaction Bundle:

| Resource | Coding | What |
| --- | --- | --- |
| 👤 Patient | | Turkish HumanName + Address (ISO 3166-2:TR province), communication.language = tr, derived birthDate |
| 🧪 Observation ×~12 | LOINC | Labs (glucose, full lipid panel, CBC, LFTs, eGFR/creatinine, ferritin, B12) + vitals (BP) |
| 🩺 Condition ×N | SNOMED CT + ICD-10 | Every active comorbidity, dual-coded, with English + clinical-Turkish display |
| 🏥 Encounter ×M | SNOMED CT | One per active condition, fired by the relevant module |
| 💊 MedicationRequest ×P | RxNorm | First-line therapy per condition, with dosage |
| 🔬 Procedure ×Q | SNOMED CT | HbA1c, lipid panel, ECG, spirometry, etc. |
| 📋 CarePlan ×R | SNOMED CT | Disease-specific lifestyle + monitoring plans |
| 📊 DiagnosticReport | LOINC | Lipid, CBC, CMP, iron, BP panels grouping their constituent Observations |
| 🎯 RiskAssessment | SNOMED CT | Charlson Comorbidity Index |
| 👪 FamilyMemberHistory | SNOMED CT | When rf_kanser / rf_kronik_hastalik are set |

…plus a flat CSV matching the input schema (minus the 29 dropped curation flags) for drop-in use as training data, a JSON model card with the source_sha256 and marginals, and a validation report.
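For readers new to FHIR, the shape of a transaction Bundle can be sketched in a few lines of Python. This is a hand-built illustration, not syntha's exporter: the function name, the patient name, and the glucose value are made up, and a real Bundle from syntha carries many more resources and fields.

```python
import json
import uuid


def make_minimal_bundle(glucose_mg_dl):
    """Build a two-resource FHIR R4 transaction Bundle (illustrative only)."""
    patient_url = f"urn:uuid:{uuid.uuid4()}"  # temporary id, resolved by the server
    patient = {
        "resourceType": "Patient",
        "name": [{"family": "Yılmaz", "given": ["Ayşe"]}],  # made-up example name
        "communication": [
            {"language": {"coding": [{"system": "urn:ietf:bcp:47", "code": "tr"}]}}
        ],
    }
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": "2345-7",
                "display": "Glucose [Mass/volume] in Serum or Plasma",
            }]
        },
        "subject": {"reference": patient_url},  # links the lab to the patient
        "valueQuantity": {
            "value": glucose_mg_dl,
            "unit": "mg/dL",
            "system": "http://unitsofmeasure.org",
            "code": "mg/dL",
        },
    }
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {"fullUrl": patient_url, "resource": patient,
             "request": {"method": "POST", "url": "Patient"}},
            {"fullUrl": f"urn:uuid:{uuid.uuid4()}", "resource": observation,
             "request": {"method": "POST", "url": "Observation"}},
        ],
    }


bundle = make_minimal_bundle(92.0)
bundle_json = json.dumps(bundle, ensure_ascii=False, indent=2)
```

Each entry pairs a resource with the request the server should perform, and the `urn:uuid:` references let resources point at each other before the server has assigned real ids.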

Distribution fidelity

A 100-episode synthetic sample from the tolerant cohort, compared against the full 135 569-row source:

| Metric | Value |
| --- | --- |
| n source / synthetic | 135 569 / 100 |
| Max Kolmogorov–Smirnov across continuous columns | 0.14 |
| Mean KS | 0.07 |
| Max binary-prevalence error | 0.025 (has_rx_data) |
| Disease-prevalence error (HTN / DM / hyperlipidemia) | 0.015 / 0.004 / 0.010 |
| Spearman correlation Frobenius diff | 2.94 |
| Fraction of synthetic patients with all labs in reference range | reported per cohort in validation_report.json |
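As a reminder of what the KS rows measure, the two-sample Kolmogorov–Smirnov statistic is the largest vertical gap between two empirical CDFs. The sketch below computes it with the standard library. It illustrates the metric only and is not the code behind syntha validate; the function name is an assumption.

```python
from bisect import bisect_right


def ks_statistic(source, synthetic):
    """Two-sample KS statistic: max |ECDF_source(v) - ECDF_synthetic(v)| over all v."""
    a, b = sorted(source), sorted(synthetic)
    largest_gap = 0.0
    for v in a + b:  # both ECDFs are step functions, so the gap peaks at an observed value
        ecdf_a = bisect_right(a, v) / len(a)
        ecdf_b = bisect_right(b, v) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # half of each sample overlaps
disjoint = ks_statistic([1, 2], [10, 20])            # no overlap at all
```

A value of 0 means the two empirical distributions coincide and 1 means they do not overlap. With only 100 synthetic rows, part of any observed gap is likely sampling noise rather than model error.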

Marginals

[Figure: Marginal distributions, source vs synthetic]

Spearman correlation structure

[Figure: Spearman correlations, source vs synthetic vs diff]

Disease prevalence

[Figure: Comorbidity prevalence, source vs synthetic]

FHIR endpoints

# Spin up a local read-only FHIR R4 server
syntha serve --bundles examples/sample_output/sample_bundles.ndjson --port 8080

# Then:
curl http://127.0.0.1:8080/metadata           # CapabilityStatement
curl http://127.0.0.1:8080/Patient            # search-set Bundle
curl http://127.0.0.1:8080/Patient/{id}
curl http://127.0.0.1:8080/\$export           # Bulk Data NDJSON

scripts/post_to_fhir.sh posts every transaction Bundle in an NDJSON file to any FHIR R4 endpoint (default: the public HAPI test server).

Turkish cohort + Turkish output

The trained models bundled with the desktop app and the example output come from pristine_strict_episodes.csv and pristine_tolerant_episodes.csv — anonymized retrospective EHR episodes from a Turkish patient cohort selected to represent clinically pristine adults. The source CSVs themselves are gitignored and never redistributed.

The output is Turkish-localized:

  • Patient names sampled from Turkish given-name and family-name distributions (src/syntha/locale/turkish.py).
  • Addresses use Turkish cities weighted by approximate population with ISO 3166-2:TR province codes.
  • Every Condition carries both an English SNOMED display and a clinical-Turkish translation in Condition.code.text.
  • Patient.communication.language is tr.

All clinical terminology used (LOINC, SNOMED CT, ICD-10, RxNorm) comes from open international standards. No licensed terminology content is embedded.

Synthea-style clinical modules

Nine modules ship out of the box (src/syntha/modules/); each fires on its corresponding comorbidity flag.

| Module | Source flag(s) | Emits |
| --- | --- | --- |
| 🫀 Hypertension | Hipertansiyon | Encounter, 1–2 antihypertensives (stage 2 → dual), CarePlan |
| 🍬 Diabetes | DM_Tum, DM_Komplikasyonlu | Encounter, HbA1c, metformin (+ insulin if severe), CarePlan |
| 🧀 Hyperlipidemia | Hiperlipidemi | Encounter, lipid panel, statin (high-intensity if LDL ≥ 190) |
| 🦋 Thyroid | Tiroid | Encounter, TSH, levothyroxine |
| 😔 Depression | Depresyon | Psych encounter, sertraline, CBT CarePlan |
| 😰 Anxiety | Anksiyete | Psych encounter, escitalopram (or buspirone if already on an SSRI) |
| ❤️ Ischemic heart disease | Iskemik_Kalp | Cardiology encounter, ECG, aspirin + β-blocker + statin |
| 🌬️ Asthma | Astim | Resp encounter, spirometry, SABA + ICS |
| 🚭 COPD | COPD | Resp encounter, spirometry, LABA + SABA |

Module authoring guide: docs/MODULES.md.
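The "fires on its flag, then emits resources" pattern can be sketched as a plain function. This is not the module API (that is documented in docs/MODULES.md): the function name, the `ldl` column name, the simplified resource dicts, and the "moderate" fallback intensity are all assumptions made for illustration. Only the `Hiperlipidemi` flag and the LDL ≥ 190 rule come from the table above.

```python
def hyperlipidemia_module(row):
    """Return simplified resource stubs for one patient row (not valid FHIR)."""
    if not row.get("Hiperlipidemi"):
        return []  # module does not fire without its comorbidity flag
    # Table rule: high-intensity statin if LDL >= 190; "moderate" is an assumed default.
    intensity = "high" if row.get("ldl", 0) >= 190 else "moderate"
    return [
        {"resourceType": "Encounter", "reason": "hyperlipidemia"},
        {"resourceType": "Procedure", "name": "lipid panel"},
        {"resourceType": "MedicationRequest", "name": f"{intensity}-intensity statin"},
    ]


no_flag = hyperlipidemia_module({"Hiperlipidemi": 0, "ldl": 210})
severe = hyperlipidemia_module({"Hiperlipidemi": 1, "ldl": 210})
mild = hyperlipidemia_module({"Hiperlipidemi": 1, "ldl": 130})
```

Keeping each module a pure function of the patient row makes it easy to test in isolation and to compose several modules for a patient with multiple comorbidities.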

Architecture

┌──────────────┐    ┌──────────────────┐    ┌──────────────────────┐
│  Source CSV  │──▶│  Gaussian copula  │──▶│ Physiologic filter   │
│ (Turkish     │    │ (mixed-type ρ;   │    │ (BP, Friedewald,     │
│  pristine)   │    │ nearest-PSD)     │    │  eGFR ↔ creatinine)  │
└──────────────┘    └──────────────────┘    └─────────┬────────────┘
                                                       │
                                  ┌────────────────────┴────────────────────┐
                                  │                                         │
                                  ▼                                         ▼
                       ┌──────────────────┐                  ┌──────────────────────────┐
                       │ Longitudinal     │   (optional)     │  Single-encounter CSV +  │
                       │ expansion        │ ───────────────▶│  FHIR R4 export with      │
                       │ (drift, Poisson) │                  │  module activation        │
                       └─────────┬────────┘                  └──────────────────────────┘
                                 │
                                 ▼
                          (same FHIR export)
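The physiologic filter box in the diagram can be sketched as a row-level predicate. The Friedewald formula and the CKD-EPI 2021 creatinine equation below are the standard published forms, but whether syntha uses exactly these, and with which tolerances, is an assumption. The column names (`sbp`, `dbp`, `tc`, `hdl`, `tg`, `ldl`, `creatinine`, `egfr`) and the thresholds are hypothetical.

```python
def friedewald_ldl(total_chol, hdl, triglycerides):
    """Friedewald LDL-C estimate in mg/dL (not valid for triglycerides >= 400)."""
    return total_chol - hdl - triglycerides / 5.0


def ckd_epi_2021(creatinine_mg_dl, age, female):
    """CKD-EPI 2021 race-free creatinine equation, in mL/min/1.73 m^2."""
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    ratio = creatinine_mg_dl / kappa
    egfr = 142 * min(ratio, 1) ** alpha * max(ratio, 1) ** -1.200 * 0.9938 ** age
    return egfr * 1.012 if female else egfr


def is_physiologic(row, ldl_tol=10.0, egfr_tol=15.0):
    """Accept a sampled row only if it passes all three coherence checks.

    The pulse-pressure window and both tolerances are illustrative guesses.
    """
    pulse_pressure = row["sbp"] - row["dbp"]
    if not 20 <= pulse_pressure <= 100:
        return False
    if row["tg"] < 400:  # Friedewald is only meaningful below this triglyceride level
        if abs(friedewald_ldl(row["tc"], row["hdl"], row["tg"]) - row["ldl"]) > ldl_tol:
            return False
    expected_egfr = ckd_epi_2021(row["creatinine"], row["age"], row["female"])
    return abs(expected_egfr - row["egfr"]) <= egfr_tol


ok_row = {"sbp": 120, "dbp": 80, "tc": 200, "hdl": 50, "tg": 150, "ldl": 120,
          "creatinine": 0.9, "age": 50, "female": False, "egfr": 104}
bad_row = dict(ok_row, dbp=115)  # pulse pressure of 5 mmHg is not plausible
```

Rejecting whole rows, rather than repairing individual values, keeps the accepted samples on the joint distribution the copula learned, at the cost of drawing more candidates.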

Full math (mixed-type correlation, nearest-PSD projection, conditional missingness, AR(1) lab drift): docs/ARCHITECTURE.md.
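As a rough intuition for the AR(1) lab drift mentioned above: each visit's value stays near the patient's baseline, with part of the previous deviation carried forward plus fresh noise. The sketch below assumes a mean-reverting form x_t = baseline + phi * (x_{t-1} - baseline) + noise. The parameter names and defaults are illustrative and not taken from docs/ARCHITECTURE.md.

```python
import random


def ar1_series(baseline, n_visits, phi=0.8, sigma=5.0, seed=0):
    """Generate one lab value per visit with AR(1) drift around a baseline."""
    rng = random.Random(seed)
    values = [baseline]  # first encounter uses the cross-sectional sample as-is
    for _ in range(n_visits - 1):
        previous_deviation = values[-1] - baseline
        values.append(baseline + phi * previous_deviation + rng.gauss(0, sigma))
    return values


# Four encounters, matching --encounters-per-patient 4 from the quick start.
glucose_history = ar1_series(baseline=95.0, n_visits=4)
flat_history = ar1_series(baseline=95.0, n_visits=4, sigma=0.0)
```

With `phi` close to 1 consecutive visits are strongly correlated, and with `phi` at 0 each visit is independent noise around the baseline.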

CLI reference

| Command | What |
| --- | --- |
| syntha generate | End-to-end: train copula → sample → modules → CSV + FHIR + model card + validation |
| syntha fit | Fit and persist a copula in a registry without sampling |
| syntha sample | Raw sampling from a registered model |
| syntha sample-conditional | AST-validated rejection sampling against a pandas filter expression |
| syntha fhir | Convert an existing synthetic CSV to FHIR R4 bundles |
| syntha validate | KS / Wasserstein / correlation diff + reference-range coverage |
| syntha audit | Privacy audit (membership-inference + attribute-inference) |
| syntha serve | Read-only FHIR R4 demo server |
| syntha export-model | Export a registered copula to v2 JSON for the desktop app |
| syntha list-models, show-card | Inspect the registry |

Run syntha <cmd> --help for full option lists.
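The idea behind AST-validated rejection sampling, as used by sample-conditional, can be sketched with the standard ast module: parse the filter, allow only comparisons and boolean logic over known columns, then keep drawing rows until enough pass. This is an illustrative sketch, not syntha's validator. The allow-list, the function names, and the `age` column are assumptions, and the real command evaluates pandas expressions rather than per-row dicts.

```python
import ast
import random

# Node types permitted in a filter: comparisons, and/or/not, names, constants.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)


def validate_filter(expr, columns):
    """Raise ValueError unless expr uses only allowed syntax and known columns."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in columns:
            raise ValueError(f"unknown column: {node.id}")
    return expr


def rejection_sample(draw, expr, columns, n, max_tries=10_000):
    """Call draw() repeatedly, keeping rows for which the validated filter is true."""
    code = compile(validate_filter(expr, columns), "<filter>", "eval")
    accepted = []
    for _ in range(max_tries):
        row = draw()
        if eval(code, {"__builtins__": {}}, row):  # safe only because expr was validated
            accepted.append(row)
            if len(accepted) == n:
                break
    return accepted


rng = random.Random(0)
columns = {"age", "Hipertansiyon"}


def draw_row():
    """Stand-in for one copula sample (toy uniform draws)."""
    return {"age": rng.randint(18, 90), "Hipertansiyon": rng.randint(0, 1)}


rows = rejection_sample(draw_row, "age >= 60 and Hipertansiyon == 1", columns, n=5)
```

Validating the tree before evaluation is what stops an expression such as a function call or attribute access from ever reaching eval.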

Example output

A pretty-printed sample Bundle, a 100-episode synthetic CSV, the model card, and the validation report all live under examples/sample_output/ and are tracked in git.

| File | What |
| --- | --- |
| sample_bundle_pretty.json | One pretty-printed transaction Bundle |
| sample_bundles.ndjson | 100 Bundles, one per line (Bulk-FHIR style) |
| sample_episodes.csv | 100 synthetic episodes matching the input schema |
| sample_model_card.json | source_sha256, n_train, marginals, top correlations |
| sample_validation_report.json | KS / Wasserstein / correlation-Frobenius per column |

For FHIR-aware rendering: drop the Bundle onto simplifier.net or the HL7 Clinical FHIR Renderer.
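For a quick programmatic look at the NDJSON file, a few lines of standard-library Python are enough to tally resource types across Bundles. The function name is illustrative, and the two inline Bundles are toy stand-ins so that the snippet runs without the repository checked out.

```python
import json
from collections import Counter


def count_resource_types(ndjson_lines):
    """Tally resourceType across every entry of every Bundle (one Bundle per line)."""
    counts = Counter()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between Bundles
        bundle = json.loads(line)
        for entry in bundle.get("entry", []):
            counts[entry["resource"]["resourceType"]] += 1
    return counts


# Toy stand-in for two lines of an NDJSON file.
demo_lines = [
    json.dumps({"resourceType": "Bundle", "entry": [
        {"resource": {"resourceType": "Patient"}},
        {"resource": {"resourceType": "Observation"}},
    ]}),
    json.dumps({"resourceType": "Bundle", "entry": [
        {"resource": {"resourceType": "Patient"}},
    ]}),
]
counts = count_resource_types(demo_lines)
```

To run it on the tracked example, pass an open file handle instead of the list, for example `count_resource_types(open("examples/sample_output/sample_bundles.ndjson", encoding="utf-8"))`.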

What it is not

  • Not privacy-proof. Gaussian copulas are not differentially private. Run syntha audit before sharing any synthetic dataset trained on a small or sensitive cohort.
  • Not a substitute for real PHI when validity hinges on rare events — the copula reproduces the bulk of the joint distribution, not the long tails.
  • Not a population-representative Turkish cohort by default — the source is selected for clinically-pristine adults, so synthetic disease prevalence is lower than TÜİK national figures. Calibration to TÜİK is a curation task — see ROADMAP.md and COLLABORATE.md for how to help.

Contributing + collaboration

Open-source, Apache 2.0, contributions welcome from clinicians, data scientists, and software engineers alike. Three places to start:

  • 🧑‍⚕️ Clinicians — see COLLABORATE.md for the live list of tasks needing clinical-Turkish guidance (drug calibration, ICD specificity, new modules), plus the in-app Collaborate panel that surfaces the same list with one-click "claim" via your GitHub handle.
  • 💻 Developers: see CONTRIBUTING.md for dev setup, commit conventions, and the test matrix.
  • 🗺️ Project direction: see ROADMAP.md for the staged plan, what's shipped, and what's queued.

License + citation

Apache 2.0 © 2026 Ariorad Moniri — see LICENSE. If you use syntha in academic work, please cite:

Moniri, A. (2026). syntha: hybrid synthetic patient record generator
trained on Turkish pristine-healthy EHR cohorts.
https://github.com/ArioMoniri/syntha

Acknowledgements

| Project | What it gives us |
| --- | --- |
| 🩺 Synthea | Inspiration for the clinical-module layer and FHIR output format |
| 🧪 LOINC | Lab and observation codes |
| 🧬 SNOMED CT | Condition, procedure, encounter, and care-plan terminology |
| 📑 ICD-10 | Diagnosis coding alongside SNOMED |
| 💊 RxNorm | Medication coding |
| 📊 Turkish-cohort EHR data steward | De-identified retrospective episodes (anonymized upstream; never redistributed by this repo) |

Community

💬 Discussions
Open questions, "is this the right tool for X?", show-and-tell
🐛 Issues
Bug reports + feature requests + clinical curation
🤝 Collaborate
Live list of clinician + dev + data tasks · also surfaced in the desktop app
📖 Contributing
Dev setup, commit conventions, test matrix
🗺️ Roadmap
Shipped + queued + what needs a clinician
📋 Changelog
Semver, Keep-a-Changelog, generated by release-please
