feat: add igv-reports skill for offline HTML genomic-region reports by sahuno · Pull Request #44 · anthropics/life-sciences

sahuno · 2026-05-18T16:50:31Z

Summary

Adds the igv-reports skill, a cohort-aware driver + post-render verifiers on top of the upstream igv-reports Python package (create_report). Builds self-contained, offline HTML viewers for genomic regions — one HTML per sample for a whole cohort, with embedded BAM/VCF slices and configurable annotation tracks.

Why not just raw `create_report`?

This skill differs from pip install igv-reports plus a plain create_report invocation in five concrete ways:

Cohort mode: TSV samplesheet → per-sample HTMLs + index.html in one call.
Post-render structural verifier (verify_report.py, verify_cohort.py): asserts each rendered HTML contains exactly the regions and tracks it was asked to contain. Catches sample-swap and silent-truncation bugs that the create_report exit code doesn't.
Opt-in read-count anchor verifier (verify_anchors.py): re-counts reads in each embedded BAM slice against a frozen regression fixture; catches silent empty slices and same-basename mix-ups.
ONT 5mC/5hmC methylation presets via generate_tracks_json.py: bakes in colorBy: basemod2, fixed min:0 max:100 y-axis lock for cross-sample bedGraph comparison, --flanking 0, --type mutation. Without these presets users routinely produce reports that look right but auto-scale the y-axis per track, masking tumor-vs-normal methylation differences.
prep_track.sh utility for the recurring plain-gzip → bgzip+tabix conversion that trips create_report on misprepared annotation tracks.
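
The plain-gzip versus bgzip distinction behind prep_track.sh can be checked from the first bytes of a file. The following is a minimal sketch, assuming the BGZF block layout from the SAM/BAM specification (a gzip member whose extra field carries a `BC` subfield). `is_bgzf` is a hypothetical helper name for illustration, not part of the skill's scripts.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if `header` (first 18+ bytes of a file) looks like a BGZF block.

    BGZF is a gzip variant: magic 1f 8b, method 8, FEXTRA flag set, and an
    extra subfield with id 'B','C' and payload length 2. Plain gzip output
    normally lacks that subfield, so tabix cannot index it.
    """
    if len(header) < 18 or header[:3] != b"\x1f\x8b\x08":
        return False
    if not header[3] & 0x04:  # FEXTRA flag not set: plain gzip
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: 2-byte id, 2-byte little-endian length, payload.
    while pos + 4 <= len(extra):
        sid = extra[pos:pos + 2]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if sid == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False


# The 28-byte BGZF end-of-file marker, written out field by field.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)

plain = gzip.compress(b"chr1\t0\t10\n")
print("plain gzip is BGZF:", is_bgzf(plain))
print("BGZF EOF block is BGZF:", is_bgzf(BGZF_EOF))
```

In practice the header would come from `open(path, "rb").read(18)`; a file that fails this check is a candidate for the bgzip+tabix conversion described above.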

Files added under `./igv-reports/`

SKILL.md — skill manifest + workflow guide, plus references/ pointers
scripts/ — 5 Python + 1 shell, all argv-only, no global state
references/ — best_practices.md, databases_config_paths.md (schema for the optional $IGV_REPORTS_DB_CONFIG YAML), methylation_ont.md (cheat-sheet for the ONT path)
examples/portable/ — single-sample + cohort reference invocations using user-supplied paths
tests/ — 63 hermetic unit tests + 1 smoke test + 4 integration scenarios; integration tests SKIP cleanly via exit 77 without sample BAMs
LICENSE.txt — Apache 2.0, matching the convention added in #34 (Add Apache License 2.0 to all life science skill directories)

Updates to marketplace files

.claude-plugin/marketplace.json: new plugin entry (alphabetically near scientific-problem-selection); category: life-sciences, tags [bioinformatics, genomics, visualization, variant-validation, structural-variants, ont, nanopore, methylation, igv, html-report].
README.md: install command added to both quickstart and Detailed Installation blocks; Skills section entry with use-cases and upstream requirement.

Test plan

Unit tests pass: cd igv-reports && bash tests/run_all.sh --unit-only → 63 passed in ~1s. Runs anywhere with pytest + Python ≥ 3.10.
JSON validity: python -m json.tool .claude-plugin/marketplace.json succeeds.
No external-path leaks: grep -rn 'data1\|miniforge3' igv-reports/ is empty.
Smoke tests need samtools on PATH; run with --no-integration.
Integration tests require user-supplied BAMs via IGV_REPORTS_TEST_BAM_{1,2,3} env vars (no built-in defaults — SKIP otherwise).
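
The SKIP behavior for integration tests can be pictured as a small preflight. This is a sketch only: the exit status 77 and the `IGV_REPORTS_TEST_BAM_{1,2,3}` names come from the test plan above, while `missing_test_bams` and `preflight_status` are hypothetical helpers; the actual logic lives in the skill's shell test runner.

```python
import os

SKIP_EXIT_CODE = 77  # "skipped" status named in the test plan
BAM_ENV_VARS = [f"IGV_REPORTS_TEST_BAM_{i}" for i in (1, 2, 3)]


def missing_test_bams(env=os.environ):
    """Return the env var names that are unset or point at a non-existent file."""
    missing = []
    for name in BAM_ENV_VARS:
        path = env.get(name, "")
        if not path or not os.path.isfile(path):
            missing.append(name)
    return missing


def preflight_status(env=os.environ):
    """Return 77 when any test BAM is unavailable, else 0.

    A runner would pass this value to sys.exit() before starting the
    integration scenarios.
    """
    missing = missing_test_bams(env)
    if missing:
        print(f"SKIP: integration tests need {', '.join(missing)}")
        return SKIP_EXIT_CODE
    return 0


# With an empty environment every variable is missing, so the status is 77.
print(preflight_status({}))
```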

Upstream

Upstream development lives at https://github.com/sahuno/igv-reports-skill. File issues there for skill-level bugs; file issues at https://github.com/igvteam/igv-reports for create_report rendering bugs.

🤖 Generated with Claude Code

Adds the `igv-reports` skill, a cohort-aware driver + post-render verifiers on top of the upstream igv-reports Python package (`create_report`). Builds self-contained, offline HTML viewers for genomic regions: one HTML per sample for a whole cohort, with embedded BAM/VCF slices and configurable annotation tracks.

Differentiating features vs raw `create_report`:

- Cohort mode: TSV samplesheet → per-sample HTMLs + index.html in one call.
- Post-render structural verifier (`verify_report.py`, `verify_cohort.py`): asserts each rendered HTML contains exactly the regions and tracks it was asked to. Catches sample-swap and silent-truncation bugs that `create_report` exit code doesn't.
- Opt-in read-count anchor verifier (`verify_anchors.py`): re-counts reads in each embedded BAM slice against a frozen regression fixture; catches silent empty slices and same-basename mix-ups.
- ONT 5mC/5hmC methylation presets via `generate_tracks_json.py`: bakes in `colorBy: basemod2`, fixed `min:0 max:100` y-axis lock for cross-sample bedGraph comparison, `--flanking 0`, `--type mutation`. Without these presets users routinely produce reports that look right but auto-scale the y-axis per track, masking tumor-vs-normal methylation differences.
- `prep_track.sh` utility for the recurring plain-gzip → bgzip+tabix conversion that trips `create_report` on misprepared annotation tracks.

Files added under `./igv-reports/`:

- SKILL.md (skill manifest + workflow guide, plus references/ pointers)
- scripts/ (5 Python + 1 shell, all argv-only, no global state)
- references/ (best_practices, databases_config_paths schema, methylation_ont cheat-sheet)
- examples/portable/ (single-sample + cohort reference invocations using user-supplied paths)
- tests/ (63 hermetic unit tests + 1 smoke test + 4 integration scenarios; integration tests SKIP cleanly via exit 77 without sample BAMs)
- LICENSE.txt (Apache 2.0, matching the convention added in PR anthropics#34)

Updates:

- `.claude-plugin/marketplace.json`: new plugin entry, alphabetically near scientific-problem-selection.
- `README.md`: install command + Skills section entry with use-cases and upstream requirement (`pip install -U 'igv-reports>=1.16.0'`).

Test plan:

- Unit tests pass: `cd igv-reports && bash tests/run_all.sh --unit-only` → 63 passed in ~1s. Runs anywhere with pytest + Python ≥ 3.10.
- Smoke tests need `samtools`; run with `--no-integration`.
- Integration tests require user-supplied BAMs via `IGV_REPORTS_TEST_BAM_{1,2,3}` env vars (no built-in defaults).

Upstream development at https://github.com/sahuno/igv-reports-skill.

Sequential cohort builds dominated wall time for any run with N samples > ~3: a 6-sample run took ~18 min where the BAM-slice I/O could easily run 6-way concurrent.

Add `--jobs/-j N` to fan create_report invocations out across a ThreadPoolExecutor; threads (not processes) because build_one() spends nearly all its time inside subprocess.run(create_report), which releases the GIL, and the captured logger is non-picklable.

Failure semantics:

- N==1 keeps the old behavior bit-identical (sequential loop). Default 1 for backwards compat.
- N>1 uses as_completed so a failed sample surfaces immediately but doesn't block already-submitted siblings. All errors collected and surfaced together at the end. Cohort exits nonzero if ANY sample failed, independent of --fail-on-fail (which gates verifier soft-fails).
- Per-sample log lines may interleave under N>1; logger is thread-safe so each line is atomic, just out-of-order across samples. The `=== <sample> ===` marker stays the per-sample anchor.

Empirical: 6-sample cohort, hg38, --flanking 300, real BAMs:

    --jobs 1   wall ≈ 18 min  (baseline)
    --jobs 6   wall ≈ 4 min   (~4.5x; not 6x because the slowest sample bounds the total)

No new deps; ThreadPoolExecutor is stdlib. 63 unit tests still pass.
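
The fan-out and failure semantics described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the code in build_igvreports.py: `build_one` here takes an explicit command list, `build_cohort` is a hypothetical wrapper, the sequential path collects errors rather than reproducing the original loop exactly, and a short Python one-liner stands in for the real create_report call.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_one(sample: str, cmd: list[str]) -> str:
    """Run one per-sample command; raise if it exits nonzero."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"{sample}: exit {proc.returncode}: {proc.stderr.strip()}")
    return sample


def build_cohort(commands: dict[str, list[str]], jobs: int = 1) -> dict[str, str]:
    """Return {sample: error message} for every failed sample (empty dict means success)."""
    errors: dict[str, str] = {}
    if jobs <= 1:
        # Sequential path kept separate so the default behavior stays simple.
        for sample, cmd in commands.items():
            try:
                build_one(sample, cmd)
            except Exception as exc:
                errors[sample] = str(exc)
        return errors
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(build_one, s, c): s for s, c in commands.items()}
        # as_completed yields each future as soon as it finishes, so a failure
        # is seen immediately while sibling builds keep running.
        for fut in as_completed(futures):
            try:
                fut.result()
            except Exception as exc:
                errors[futures[fut]] = str(exc)
    return errors


# Stand-in commands: one succeeds, one exits with status 3.
demo = {
    "sampleA": [sys.executable, "-c", "print('ok')"],
    "sampleB": [sys.executable, "-c", "import sys; sys.exit(3)"],
}
failed = build_cohort(demo, jobs=2)
print(sorted(failed))  # names of failed samples
```

A caller would exit nonzero whenever the returned dict is non-empty, which matches the "cohort exits nonzero if ANY sample failed" rule above.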

Eval-driven improvements after cross-checking against four functional eval cases that simulate "Claude with skill loaded":

SKILL.md
- "Sites BED format" now documents `--info-columns <colname>` for surfacing 4+ column BED data in the report's clickable table.
- New "--type for BED-style sites" sub-section surfaces `--type mutation` outside the methylation pathway.
- Cohort recipe now mentions the `--jobs N` parallel-builds flag.
- prep-track section documents the new sibling-file mode (--out) and shows the `file <name>` diagnostic strings for distinguishing plain-gzip vs bgzip.

scripts/prep_track.sh
- New `--out PATH` flag: non-destructive sibling-file mode for when other pipelines point at the original .gz and can't tolerate an in-place replace.
- Fixed a pre-existing latent bug in in-place mode: `bgzip` (htslib 1.20) refused to overwrite the existing plain-gzip `.gz` file (exit 2, no informative message). Added `rm -f "$TARGET"` before the bgzip step.

scripts/build_igvreports.py
- New `validate_bams()` preflight: every BAM passed via --bam must have a sibling index (.bai or .csi). Skipped on the --track-config code path. Catches a common silent failure mode that previously produced an obscure pysam stack trace several layers in.

scripts/generate_tracks_json.py
- New `--force` flag. By default refuses to overwrite an existing tracks.json (exit 2 with actionable message) so hand-edits aren't lost.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
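
A sibling-index preflight of the kind described for `validate_bams()` might look like the sketch below. The function name matches the commit message, but the body is an assumption: the candidate index filenames (`sample.bam.bai`, `sample.bai`, `sample.bam.csi`) follow common samtools conventions, and `find_bam_index` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_bam_index(bam: Path):
    """Return the first sibling index found for `bam`, or None."""
    candidates = [
        Path(str(bam) + ".bai"),  # sample.bam.bai
        bam.with_suffix(".bai"),  # sample.bai
        Path(str(bam) + ".csi"),  # sample.bam.csi
    ]
    for cand in candidates:
        if cand.is_file():
            return cand
    return None


def validate_bams(bams):
    """Raise SystemExit listing every BAM that is missing or has no index."""
    problems = []
    for bam in map(Path, bams):
        if not bam.is_file():
            problems.append(f"{bam}: file not found")
        elif find_bam_index(bam) is None:
            problems.append(f"{bam}: no .bai/.csi index; run `samtools index {bam}`")
    if problems:
        raise SystemExit("BAM preflight failed:\n  " + "\n  ".join(problems))


# Demo with empty placeholder files: the first call fails, the second passes.
with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "sample.bam"
    bam.write_bytes(b"")
    try:
        validate_bams([bam])
    except SystemExit as exc:
        print(exc)
    Path(str(bam) + ".bai").write_bytes(b"")
    validate_bams([bam])  # no exception once an index exists
```

Failing before any report is built keeps the error message next to its cause instead of deep inside a stack trace.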

Methylation users (--track-config path) no longer have to hand-paste absolute paths for CpG islands / gencode / RepeatMasker / EPDnew into the tracks YAML. The same databases YAML resolution the SV viewer driver already uses (--genome hg38 -> auto-resolved tracks) is now available via a `default:` shortcut:

```yaml
genome: hg38
annotation:
  - default: gencode
  - default: cgi
  - default: repmasker
  - default: epdnew_coding      # hg38 only
  - default: epdnew_noncoding   # hg38 only
```

- 5 valid keys: cgi, gencode, repmasker, epdnew_coding, epdnew_noncoding.
- Each comes with a colorblind-safe Okabe-Ito color and sensible displayMode you can override per entry (name/color/displayMode).
- Mixes freely with the existing explicit `- name: ... url: ...` form.
- Needs a databases YAML at $IGV_REPORTS_DB_CONFIG or --db-config PATH; schema is `reference_genomes.local.<genome>.{CpGIslands, gtf, repMaskerBed, EPDnewCoding, EPDnewNonCoding}`. See references/databases_config_paths.md.

Backwards-compatible: explicit `- name: ... url: ...` entries work unchanged and don't need --db-config / $IGV_REPORTS_DB_CONFIG to be set. The db-config is only loaded if the spec actually uses a `default:` shortcut, so existing self-contained specs stay self-contained.

Error paths: unknown key, missing genome in YAML, missing yaml key for the chosen genome, missing path on disk, and shortcut-used-without-top-level-`genome:` all raise SystemExit with actionable messages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
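
The shortcut resolution and its error paths can be sketched on already-parsed dictionaries. Everything in this example is an assumption made for illustration: the pairing of each shortcut key with a databases-YAML field is inferred from the two lists above, `resolve_annotation` is a hypothetical name, the default colors and displayMode values are omitted, and the file paths are placeholders.

```python
from pathlib import Path

# Assumed mapping from shortcut key to databases-YAML field name.
SHORTCUT_TO_DB_KEY = {
    "cgi": "CpGIslands",
    "gencode": "gtf",
    "repmasker": "repMaskerBed",
    "epdnew_coding": "EPDnewCoding",
    "epdnew_noncoding": "EPDnewNonCoding",
}


def resolve_annotation(spec: dict, db_config: dict, check_exists: bool = True) -> list[dict]:
    """Expand `default:` entries into explicit name/url tracks."""
    resolved = []
    for entry in spec.get("annotation", []):
        if "default" not in entry:
            resolved.append(dict(entry))  # explicit name/url entries pass through
            continue
        key = entry["default"]
        if key not in SHORTCUT_TO_DB_KEY:
            raise SystemExit(f"unknown default '{key}'; valid: {sorted(SHORTCUT_TO_DB_KEY)}")
        genome = spec.get("genome")
        if not genome:
            raise SystemExit("`default:` shortcuts need a top-level `genome:`")
        genomes = db_config.get("reference_genomes", {}).get("local", {})
        if genome not in genomes:
            raise SystemExit(f"genome '{genome}' not found in databases YAML")
        field = SHORTCUT_TO_DB_KEY[key]
        path = genomes[genome].get(field)
        if not path:
            raise SystemExit(f"'{field}' not set for genome '{genome}'")
        if check_exists and not Path(path).exists():
            raise SystemExit(f"{path}: resolved path does not exist on disk")
        track = {"name": key, "url": path}
        # Per-entry overrides (name/color/displayMode) win over the defaults.
        track.update({k: v for k, v in entry.items() if k != "default"})
        resolved.append(track)
    return resolved


spec = {
    "genome": "hg38",
    "annotation": [
        {"default": "cgi", "name": "CpG islands"},
        {"name": "custom", "url": "/data/custom.bed.gz"},  # placeholder path
    ],
}
db = {"reference_genomes": {"local": {"hg38": {"CpGIslands": "/refs/hg38/cpg.bed.gz"}}}}
tracks = resolve_annotation(spec, db, check_exists=False)
print(tracks)
```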

The HTML report is the deep-dive view; sometimes you also need static PNGs you can email, drop in a Slack channel, or paste into slides. New --also-png flag invokes the sister `igver` tool against the SAME sites BED + SAME resolved track list as create_report, writes a manifest TSV bridging PNG ↔ HTML rows, and verify_cohort.py runs three additional checks to confirm the two artifacts stay in sync.

```bash
python scripts/build_igvreports.py \
    --samplesheet samplesheet.tsv --genome hg38 \
    --output-dir results/run/reports/ \
    --jobs 6 --also-png --png-dpi 600
```

Output layout per sample:

- <sample>.hg38.html                  interactive
- png_<sample>.hg38/png/<region>.png  per-region PNGs
- png_<sample>.hg38/manifest.tsv      bridge: BED row ↔ PNG ↔ HTML

Five consistency levers:

1. Sites BED with --flanking baked in (igver and igv.js see identical coordinates, so there is no renderer divergence on the flanked window).
2. Single resolved track list (positional path reuses the exact list passed to create_report; --track-config path extracts local `url:` entries from the JSON, http(s) skipped because igver can't slice them).
3. Matched display mode (--png-display-mode collapse aligns with the HTML's BAM_DEFAULTS displayMode COLLAPSED).
4. UID-based filenames: BED `name` column drives both the HTML table label (via --info-columns name) and the PNG filename suffix (`<chr-start-end>.<uid>.png`).
5. manifest.tsv audit trail (11 cols including html_table_row), read by verify_cohort.py to power three cross-artifact checks.

verify_cohort.py picks up the manifest automatically and runs three extra checks per sample:

- png_count_matches_bed: catches partial igver runs, stale manifests.
- pngs_exist_and_nonempty: catches empty IGV snapshots (< 10 KB).
- png_html_row_alignment: catches HTML/PNG drift across rebuilds.

igver resolution order: --igver-cmd / $IGVER_CMD / `igver` on PATH. If none resolve, exit before the HTML build so you don't pay that cost before learning PNGs are unavailable.

Methylation caveat documented: HTML uses bedGraph; igver per-read view uses BAM, igver cross-sample view uses bigwig. Content can be made identical only if both formats trace back to the same modkit pileup output.

Tests: 109/109 pass (was 84 → +25 new across test_build_pngs.py and test_verify_cohort_png.py). Also mirrored the previously-missed test_generate_tracks_json.py from the methylation-shortcut round.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Empirical 2026-05-18: igver exits 0 even when it fails to render. My build_pngs_with_igver gated on subprocess.returncode != 0 which missed this; a "successful" build would silently ship an empty png/ dir.

Fix: after igver returns, walk each expected PNG path (`<chr>-<start>-<end>.<uid>.<ext>`, igver's documented filename convention) and raise SystemExit on any missing or zero-byte file. Error message points at the most common root cause (`pip install igver` egg-link without the IGV Java binary) and the apptainer SIF fix.

The cohort verifier (`verify_cohort.py:check_pngs_exist_and_nonempty`) runs the same check at end of build, but the inline check fires for single-sample `--sites` invocations too, so every code path gets the safety net.

3 new tests cover the failure modes: no PNGs (the motivating bug), partial PNGs (mid-batch fail), zero-byte PNGs (truncated write). The previous "happy path" test now patches subprocess.run with a fake igver that writes the expected files instead of using a /usr/bin/true stub.

Tests: 112/112.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
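
The post-run check described in this commit amounts to not trusting the exit status and inspecting the output files instead. A minimal sketch, assuming each BED row is available as a `(chrom, start, end, uid)` tuple and that the coordinates appear in the filename exactly as given; `expected_png_paths` and `check_pngs` are hypothetical names, not the functions in build_igvreports.py.

```python
import tempfile
from pathlib import Path


def expected_png_paths(bed_rows, png_dir, ext="png"):
    """Map (chrom, start, end, uid) rows to <chr>-<start>-<end>.<uid>.<ext> paths."""
    return [Path(png_dir) / f"{c}-{s}-{e}.{uid}.{ext}" for c, s, e, uid in bed_rows]


def check_pngs(bed_rows, png_dir):
    """Raise SystemExit if any expected PNG is missing or zero bytes."""
    bad = []
    for path in expected_png_paths(bed_rows, png_dir):
        if not path.is_file():
            bad.append(f"missing: {path}")
        elif path.stat().st_size == 0:
            bad.append(f"zero-byte: {path}")
    if bad:
        raise SystemExit(
            "renderer exited 0 but PNG output is incomplete:\n  " + "\n  ".join(bad)
        )


# Demo: one good file and one truncated (empty) file trigger the failure.
rows = [("chr2", 100, 200, "site1"), ("chr7", 300, 400, "site2")]
with tempfile.TemporaryDirectory() as tmp:
    paths = expected_png_paths(rows, tmp)
    paths[0].write_bytes(b"\x89PNG")
    paths[1].write_bytes(b"")
    try:
        check_pngs(rows, tmp)
    except SystemExit as exc:
        print(exc)
```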

… check)

Methylation users care about CpG counts in the bedGraph slice, not BAM read counts: the silent-failure mode for methylation reports is "region rendered, bedGraph slice has 0 CpGs" which the BAM-read-count anchors don't detect. Closes that gap.

scripts/verify_anchors.py
- New track_type column in the anchor TSV schema (default `bam` when absent, for backwards compat). Two values: `bam` (samtools-view read count, existing behavior) and `bedgraph` (NEW: data row count in the wig/bedGraph slice).
- bedgraph_count_source(): count rows overlapping a region. tabix fast path for indexed .gz, gzip stream-decompress for plain .gz, linear scan for plain text. Overlap rule matches IGV ([start, end) half-open).
- bedgraph_count_slice(): decode the wig/bedGraph slice from the HTML data: URL (gzip(text) then base64 per igv_reports/datauri.py), count non-header data lines. Falls back to uncompressed for small payloads.
- sample_bedgraph_paths(): pull bedGraph/wig entries from samplesheet extra_tracks. Recognizes .bedgraph / .bg / .wig, plain or .gz.
- cmd_generate iterates both BAM and bedGraph tracks per row; verify_one_html dispatches on track_type to choose the right count strategy (no samtools needed for bedgraph).
- _is_wig_data_line() central helper rejects track/browser/fixedStep/variableStep/#/empty lines so source and slice counts use identical row-filtering semantics.

tests/unit/test_verify_anchors.py (+17 tests, total 129)
- Backwards compat: legacy 10-col anchor file → track_type="bam".
- bedGraph counting: in-region, half-open boundary, header skipping, plain-gzip input, missing file.
- Slice decoding: gzipped, uncompressed fallback, empty (silent-empty-methylation-slice catch).
- sample_bedgraph_paths: extras parsing, stem suffix stripping.

build_igvreports.py: no changes. run_anchors_generate already shells to verify_anchors.py, the new bedGraph iteration lives inside cmd_generate, samplesheet extra_tracks flows through automatically.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
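
The counting and decoding rules in this commit can be illustrated with a small self-contained sketch. The function names below are simplified stand-ins for the ones listed above, the `data:application/gzip;base64,` prefix is an assumed example of a data URI, and only the in-memory linear scan is shown (no tabix fast path).

```python
import base64
import gzip

HEADER_PREFIXES = ("track", "browser", "fixedStep", "variableStep", "#")


def is_wig_data_line(line: str) -> bool:
    """True for data rows; False for empty lines and wig/bedGraph header lines."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith(HEADER_PREFIXES)


def count_overlapping(lines, chrom, start, end):
    """Count bedGraph rows overlapping [start, end); both intervals are half-open."""
    n = 0
    for line in lines:
        if not is_wig_data_line(line):
            continue
        fields = line.split()
        if len(fields) < 3 or fields[0] != chrom:
            continue
        row_start, row_end = int(fields[1]), int(fields[2])
        if row_start < end and row_end > start:
            n += 1
    return n


def decode_slice(data_uri: str) -> str:
    """Decode base64, then gunzip; fall back to plain text if not gzipped."""
    payload = base64.b64decode(data_uri.split(",", 1)[1])
    try:
        return gzip.decompress(payload).decode()
    except OSError:  # gzip.BadGzipFile is an OSError subclass
        return payload.decode()


source = [
    "track type=bedGraph",
    "chr2\t100\t101\t80",
    "chr2\t199\t200\t50",
    "chr2\t200\t201\t10",  # starts exactly at the region end, so it is excluded
    "chr7\t150\t151\t5",
]
uri = "data:application/gzip;base64," + base64.b64encode(
    gzip.compress("\n".join(source).encode())
).decode()

source_count = count_overlapping(source, "chr2", 100, 200)
slice_count = count_overlapping(decode_slice(uri).splitlines(), "chr2", 100, 200)
print(source_count, slice_count)
```

Using the same filter for the source file and the embedded slice is what makes the two counts comparable; a slice count of 0 against a nonzero source count is the silent-empty case the commit targets.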

Adds tests/integration/end_to_end/, a ~14 s smoke test using the committed tiny_colo829.hg38.bam fixture (457 KB). Closes the gap between mocked unit tests and real-world failures: the silent-exit-0 bug fixed yesterday wouldn't have been caught by the mocked suite, but is caught instantly by this layer.

scenarios.sh exercises eight assertions:

- HTML produced + plausible size (≥ 50 KB)
- verify_report.py structural PASS
- verify_anchors.py generate matches the frozen contract from tests/fixtures/README.md (chr2=5, chr7=9)
- verify_anchors.py verify round-trips the embedded slice counts
- --also-png produces manifest + non-empty PNGs (SKIPped when igver isn't installed or is the documented broken shim)

Synthesizes a minimal N-only FASTA so CI doesn't need an hg38 download.

run_all.sh runs end_to_end as the first integration scenario (it's fast + portable, no shared-storage dependency).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
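
Synthesizing an N-only reference is straightforward; a minimal sketch is shown here. The helper name `write_n_fasta` and the contig lengths are placeholders chosen for illustration (a real fixture would need contigs long enough to cover the test regions), and this is not the code used by scenarios.sh.

```python
import tempfile
from pathlib import Path


def write_n_fasta(path, contigs, width=60):
    """Write a FASTA whose contigs consist only of 'N', wrapped at `width` columns."""
    with open(path, "w") as fh:
        for name, length in contigs.items():
            fh.write(f">{name}\n")
            for offset in range(0, length, width):
                fh.write("N" * min(width, length - offset) + "\n")


# Placeholder lengths; the contig names mirror the fixture's chr2/chr7 regions.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "tiny.fa"
    write_n_fasta(fasta, {"chr2": 130, "chr7": 60})
    fasta_lines = fasta.read_text().splitlines()
    print(len(fasta_lines), "lines written")
```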

verify_anchors.py recently added `track_type` as the 3rd anchor column, shifting downstream columns by +1. The integration scenarios edit the anchor TSV by column index to simulate specific corruption modes, so they were silently writing to wrong fields after the schema change.

Fix bumps three awk corruption indexes:

    scenario A: $6 -> $7  (corrupt `expected`)
    scenario B: $8 -> $9  (corrupt `min`)
    scenario D: $3 -> $4  (filter on `chrom`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

sahuno and others added 9 commits May 18, 2026 12:44

sahuno wants to merge 9 commits into anthropics:main from sahuno:feat/add-igv-reports-skill
