feat: add igv-reports skill for offline HTML genomic-region reports #44

Open
sahuno wants to merge 9 commits into anthropics:main from sahuno:feat/add-igv-reports-skill


@sahuno sahuno commented May 18, 2026

Summary

Adds the igv-reports skill, a cohort-aware driver + post-render verifiers on top of the upstream igv-reports Python package (create_report). Builds self-contained, offline HTML viewers for genomic regions — one HTML per sample for a whole cohort, with embedded BAM/VCF slices and configurable annotation tracks.

Why not just raw create_report?

This skill differentiates from pip install igv-reports + plain create_report invocation in five concrete ways:

  • Cohort mode: TSV samplesheet → per-sample HTMLs + index.html in one call.
  • Post-render structural verifier (verify_report.py, verify_cohort.py): asserts each rendered HTML contains exactly the regions and tracks it was asked to include. Catches sample-swap and silent-truncation bugs that the create_report exit code doesn't.
  • Opt-in read-count anchor verifier (verify_anchors.py): re-counts reads in each embedded BAM slice against a frozen regression fixture; catches silent empty slices and same-basename mix-ups.
  • ONT 5mC/5hmC methylation presets via generate_tracks_json.py: bakes in colorBy: basemod2, fixed min:0 max:100 y-axis lock for cross-sample bedGraph comparison, --flanking 0, --type mutation. Without these presets users routinely produce reports that look right but auto-scale the y-axis per track, masking tumor-vs-normal methylation differences.
  • prep_track.sh utility for the recurring plain-gzip → bgzip+tabix conversion that trips create_report on misprepared annotation tracks.
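
The y-axis lock in the methylation preset can be illustrated with a minimal sketch. It assumes the track list is plain igv.js-style JSON; `methylation_tracks` and the file names are hypothetical and do not reflect the actual output of generate_tracks_json.py.

```python
import json


def methylation_tracks(bam: str, bedgraph: str, sample: str) -> list[dict]:
    """Build an igv.js-style track list with the bedGraph y-axis fixed to 0-100."""
    return [
        {
            "name": f"{sample} reads",
            "type": "alignment",
            "format": "bam",
            "url": bam,
            "indexURL": bam + ".bai",  # assumes a sibling .bai index
            "colorBy": "basemod2",     # colour reads by base modification
        },
        {
            "name": f"{sample} methylation %",
            "type": "wig",
            "format": "bedgraph",
            "url": bedgraph,
            "autoscale": False,        # a per-track autoscale would hide differences
            "min": 0,
            "max": 100,
        },
    ]


# Hypothetical inputs, for illustration only.
tracks = methylation_tracks("tumor.bam", "tumor.5mC.bedgraph.gz", "tumor")
print(json.dumps(tracks, indent=2))
```

Because every sample gets the same fixed range, a 20% track and an 80% track stay visually distinct when compared side by side.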

Files added under ./igv-reports/

  • SKILL.md — skill manifest + workflow guide, plus references/ pointers
  • scripts/ — 5 Python + 1 shell, all argv-only, no global state
  • references/best_practices.md, databases_config_paths.md (schema for the optional $IGV_REPORTS_DB_CONFIG YAML), methylation_ont.md (cheat-sheet for the ONT path)
  • examples/portable/ — single-sample + cohort reference invocations using user-supplied paths
  • tests/ — 63 hermetic unit tests + 1 smoke test + 4 integration scenarios; integration tests SKIP cleanly via exit 77 without sample BAMs
  • LICENSE.txt — Apache 2.0, matching the convention added in "Add Apache License 2.0 to all life science skill directories" (#34)

Updates to marketplace files

  • .claude-plugin/marketplace.json: new plugin entry (alphabetically near scientific-problem-selection); category: life-sciences, tags [bioinformatics, genomics, visualization, variant-validation, structural-variants, ont, nanopore, methylation, igv, html-report].
  • README.md: install command added to both quickstart and Detailed Installation blocks; Skills section entry with use-cases and upstream requirement.

Test plan

  • Unit tests pass: cd igv-reports && bash tests/run_all.sh --unit-only → 63 passed in ~1s. Runs anywhere with pytest + Python ≥ 3.10.
  • JSON validity: python -m json.tool .claude-plugin/marketplace.json succeeds.
  • No external-path leaks: grep -rn 'data1\|miniforge3' igv-reports/ is empty.
  • Smoke tests need samtools on PATH; run with --no-integration.
  • Integration tests require user-supplied BAMs via IGV_REPORTS_TEST_BAM_{1,2,3} env vars (no built-in defaults — SKIP otherwise).

Upstream

Upstream development lives at https://github.com/sahuno/igv-reports-skill. File issues there for skill-level bugs; file issues at https://github.com/igvteam/igv-reports for create_report rendering bugs.

🤖 Generated with Claude Code

sahuno and others added 9 commits May 18, 2026 12:44
Adds the `igv-reports` skill, a cohort-aware driver + post-render verifiers
on top of the upstream igv-reports Python package (`create_report`). Builds
self-contained, offline HTML viewers for genomic regions — one HTML per
sample for a whole cohort, with embedded BAM/VCF slices and configurable
annotation tracks.

Differentiating features vs raw `create_report`:
- Cohort mode: TSV samplesheet → per-sample HTMLs + index.html in one call.
- Post-render structural verifier (`verify_report.py`, `verify_cohort.py`):
  asserts each rendered HTML contains exactly the regions and tracks it was
  asked to include. Catches sample-swap and silent-truncation bugs that the
  `create_report` exit code doesn't.
- Opt-in read-count anchor verifier (`verify_anchors.py`): re-counts reads
  in each embedded BAM slice against a frozen regression fixture; catches
  silent empty slices and same-basename mix-ups.
- ONT 5mC/5hmC methylation presets via `generate_tracks_json.py`: bakes in
  `colorBy: basemod2`, fixed `min:0 max:100` y-axis lock for cross-sample
  bedGraph comparison, `--flanking 0`, `--type mutation`. Without these
  presets users routinely produce reports that look right but auto-scale
  the y-axis per track, masking tumor-vs-normal methylation differences.
- `prep_track.sh` utility for the recurring plain-gzip → bgzip+tabix
  conversion that trips `create_report` on misprepared annotation tracks.

Files added under `./igv-reports/`:
- SKILL.md (skill manifest + workflow guide, plus references/ pointers)
- scripts/ (5 Python + 1 shell, all argv-only, no global state)
- references/ (best_practices, databases_config_paths schema, methylation_ont
  cheat-sheet)
- examples/portable/ (single-sample + cohort reference invocations using
  user-supplied paths)
- tests/ (63 hermetic unit tests + 1 smoke test + 4 integration scenarios;
  integration tests SKIP cleanly via exit 77 without sample BAMs)
- LICENSE.txt (Apache 2.0, matching the convention added in PR anthropics#34)

Updates:
- `.claude-plugin/marketplace.json`: new plugin entry, alphabetically near
  scientific-problem-selection.
- `README.md`: install command + Skills section entry with use-cases and
  upstream requirement (`pip install -U 'igv-reports>=1.16.0'`).

Test plan:
- Unit tests pass: `cd igv-reports && bash tests/run_all.sh --unit-only`
  → 63 passed in ~1s. Runs anywhere with pytest + Python ≥ 3.10.
- Smoke tests need `samtools`; run with `--no-integration`.
- Integration tests require user-supplied BAMs via `IGV_REPORTS_TEST_BAM_{1,2,3}`
  env vars (no built-in defaults).

Upstream development at https://github.com/sahuno/igv-reports-skill.
Sequential cohort builds dominated wall time for any run with N samples >
~3: a 6-sample run took ~18 min where the BAM-slice I/O could easily run
6-way concurrent. Add `--jobs/-j N` to fan create_report invocations out
across a ThreadPoolExecutor; threads (not processes) because build_one()
spends nearly all its time inside subprocess.run(create_report), which
releases the GIL, and the captured logger is non-picklable.

Failure semantics:
- N==1 keeps the old behavior bit-identical (sequential loop). Default 1
  for backwards compat.
- N>1 uses as_completed so a failed sample surfaces immediately but
  doesn't block already-submitted siblings. All errors collected and
  surfaced together at the end. Cohort exits nonzero if ANY sample
  failed — independent of --fail-on-fail (which gates verifier
  soft-fails).
- Per-sample log lines may interleave under N>1; logger is thread-safe
  so each line is atomic, just out-of-order across samples. The
  `=== <sample> ===` marker stays the per-sample anchor.
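
A minimal sketch of the fan-out and failure semantics described above, assuming each sample's build reduces to a single command line. `build_cohort` and the `commands` mapping are hypothetical names; the sequential branch here also collects errors, which may differ from the real N==1 path.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_one(sample: str, cmd: list[str]) -> str:
    """Run one build command; raise if it exits nonzero."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"{sample}: exit {proc.returncode}: {proc.stderr.strip()}")
    return sample


def build_cohort(commands: dict[str, list[str]], jobs: int = 1) -> dict[str, str]:
    """Return {sample: error message} for every sample that failed."""
    errors: dict[str, str] = {}
    if jobs <= 1:
        # Sequential path: plain loop, no executor involved.
        for sample, cmd in commands.items():
            try:
                build_one(sample, cmd)
            except Exception as exc:
                errors[sample] = str(exc)
        return errors
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(build_one, s, c): s for s, c in commands.items()}
        # as_completed yields each future as it finishes, so a failure is
        # reported right away while sibling builds keep running.
        for fut in as_completed(futures):
            sample = futures[fut]
            try:
                fut.result()
            except Exception as exc:
                errors[sample] = str(exc)
                print(f"=== {sample} === FAILED", file=sys.stderr)
    return errors


# Stand-in commands: one succeeds, one exits with status 3.
commands = {
    "ok": [sys.executable, "-c", "pass"],
    "bad": [sys.executable, "-c", "import sys; sys.exit(3)"],
}
errors = build_cohort(commands, jobs=2)
exit_code = 1 if errors else 0  # nonzero if ANY sample failed
```

Threads are enough here because the workers block in `subprocess.run`, so the interpreter lock is not the bottleneck.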

Empirical: 6-sample cohort, hg38, --flanking 300, real BAMs:
  --jobs 1  wall ≈ 18 min  (baseline)
  --jobs 6  wall ≈ 4 min   (~4.5x; not 6x because the slowest sample
                            bounds the total)

No new deps; ThreadPoolExecutor is stdlib. 63 unit tests still pass.
Eval-driven improvements after cross-checking against four functional eval
cases that simulate "Claude with skill loaded":

SKILL.md
- "Sites BED format" now documents `--info-columns <colname>` for surfacing
  4+ column BED data in the report's clickable table.
- New "--type for BED-style sites" sub-section surfaces `--type mutation`
  outside the methylation pathway.
- Cohort recipe now mentions the `--jobs N` parallel-builds flag.
- prep-track section documents the new sibling-file mode (--out) and shows
  the `file <name>` diagnostic strings for distinguishing plain-gzip vs bgzip.

scripts/prep_track.sh
- New `--out PATH` flag: non-destructive sibling-file mode for when other
  pipelines point at the original .gz and can't tolerate an in-place replace.
- Fixed a pre-existing latent bug in in-place mode: `bgzip` (htslib 1.20)
  refused to overwrite the existing plain-gzip `.gz` file (exit 2, no
  informative message). Added `rm -f "$TARGET"` before the bgzip step.
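
The plain-gzip versus bgzip distinction can also be checked from the file header. The sketch below is an illustration in Python, not part of prep_track.sh; `classify_gz` is a hypothetical helper, and it assumes the BGZF `BC` subfield is the first gzip extra field, which is how htslib writes it.

```python
import gzip
import os
import struct
import tempfile

# The standard 28-byte BGZF end-of-file block, used here as a tiny BGZF sample.
BGZF_EOF = bytes.fromhex("1f8b08040000000000ff0600424302001b0003000000000000000000")


def classify_gz(path: str) -> str:
    """Return 'bgzf', 'gzip', or 'not-gzip' from the first header bytes."""
    with open(path, "rb") as fh:
        head = fh.read(18)
    if len(head) < 4 or head[:2] != b"\x1f\x8b":
        return "not-gzip"
    has_extra = bool(head[3] & 0x04)  # FLG.FEXTRA bit of the gzip header
    if has_extra and len(head) >= 16:
        si1, si2, slen = struct.unpack("<BBH", head[12:16])
        if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C', 2-byte payload
            return "bgzf"
    return "gzip"


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.bed.gz")
    with gzip.open(plain, "wb") as fh:
        fh.write(b"chr1\t0\t10\n")
    blocked = os.path.join(tmp, "blocked.bed.gz")
    with open(blocked, "wb") as fh:
        fh.write(BGZF_EOF)
    text = os.path.join(tmp, "plain.bed")
    with open(text, "wb") as fh:
        fh.write(b"chr1\t0\t10\n")
    results = (classify_gz(plain), classify_gz(blocked), classify_gz(text))
```

A file classified as `gzip` is the case that needs recompression with bgzip before tabix can index it.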

scripts/build_igvreports.py
- New `validate_bams()` preflight: every BAM passed via --bam must have a
  sibling index (.bai or .csi). Skipped on the --track-config code path.
  Catches a common silent failure mode that previously produced an obscure
  pysam stack trace several layers in.
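
A preflight of this kind might look like the following sketch. The function bodies are illustrative assumptions rather than the code in build_igvreports.py, and the accepted index spellings (`x.bam.bai`, `x.bam.csi`, `x.bai`) are a guess at what "sibling index" covers.

```python
import tempfile
from pathlib import Path


def missing_bam_indexes(bams: list[str]) -> list[str]:
    """Return the BAM paths that have no sibling .bai or .csi index."""
    missing = []
    for bam in bams:
        path = Path(bam)
        candidates = [
            Path(str(path) + ".bai"),   # sample.bam.bai
            Path(str(path) + ".csi"),   # sample.bam.csi
            path.with_suffix(".bai"),   # sample.bai
        ]
        if not any(c.is_file() for c in candidates):
            missing.append(bam)
    return missing


def validate_bams(bams: list[str]) -> None:
    """Exit early with a readable message instead of a deep stack trace."""
    missing = missing_bam_indexes(bams)
    if missing:
        raise SystemExit(
            "BAM index (.bai or .csi) not found for: " + ", ".join(missing)
            + "\nRun `samtools index <bam>` and retry."
        )


with tempfile.TemporaryDirectory() as tmp:
    indexed = Path(tmp) / "a.bam"
    unindexed = Path(tmp) / "b.bam"
    for p in (indexed, unindexed, Path(str(indexed) + ".bai")):
        p.write_bytes(b"")
    missing = missing_bam_indexes([str(indexed), str(unindexed)])
    missing_names = [Path(m).name for m in missing]
```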

scripts/generate_tracks_json.py
- New `--force` flag. By default refuses to overwrite an existing
  tracks.json (exit 2 with actionable message) so hand-edits aren't lost.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Methylation users (--track-config path) no longer have to hand-paste
absolute paths for CpG islands / gencode / RepeatMasker / EPDnew into
the tracks YAML. The same databases YAML resolution the SV viewer
driver already uses (--genome hg38 -> auto-resolved tracks) is now
available via a `default:` shortcut:

```yaml
genome: hg38
annotation:
  - default: gencode
  - default: cgi
  - default: repmasker
  - default: epdnew_coding         # hg38 only
  - default: epdnew_noncoding      # hg38 only
```

- 5 valid keys: cgi, gencode, repmasker, epdnew_coding, epdnew_noncoding.
- Each comes with a colorblind-safe Okabe-Ito color and sensible
  displayMode you can override per entry (name/color/displayMode).
- Mixes freely with the existing explicit `- name: ... url: ...` form.
- Needs a databases YAML at $IGV_REPORTS_DB_CONFIG or --db-config PATH;
  schema is `reference_genomes.local.<genome>.{CpGIslands, gtf,
  repMaskerBed, EPDnewCoding, EPDnewNonCoding}`. See
  references/databases_config_paths.md.

Backwards-compatible: explicit `- name: ... url: ...` entries work
unchanged and don't need --db-config / $IGV_REPORTS_DB_CONFIG to be set.
The db-config is only loaded if the spec actually uses a `default:`
shortcut, so existing self-contained specs stay self-contained.

Error paths: unknown key, missing genome in YAML, missing yaml key for
the chosen genome, missing path on disk, and shortcut-used-without-
top-level-`genome:` all raise SystemExit with actionable messages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The HTML report is the deep-dive view; sometimes you also need static
PNGs you can email, drop in a Slack channel, or paste into slides. New
--also-png flag invokes the sister `igver` tool against the SAME sites
BED + SAME resolved track list as create_report, writes a manifest TSV
bridging PNG ↔ HTML rows, and verify_cohort.py runs three additional
checks to confirm the two artifacts stay in sync.

```bash
python scripts/build_igvreports.py \
    --samplesheet samplesheet.tsv \
    --genome hg38 --output-dir results/run/reports/ \
    --jobs 6 --also-png --png-dpi 600
```

Output layout per sample:
- <sample>.hg38.html                       interactive
- png_<sample>.hg38/png/<region>.png       per-region PNGs
- png_<sample>.hg38/manifest.tsv           bridge: BED row ↔ PNG ↔ HTML

Five consistency levers:
1. Sites BED with --flanking baked in (igver and igv.js see identical
   coordinates — no renderer divergence on the flanked window).
2. Single resolved track list (positional path reuses the exact list
   passed to create_report; --track-config path extracts local `url:`
   entries from the JSON, http(s) skipped — igver can't slice them).
3. Matched display mode (--png-display-mode collapse aligns with the
   HTML's BAM_DEFAULTS displayMode COLLAPSED).
4. UID-based filenames — BED `name` column drives both the HTML table
   label (via --info-columns name) and the PNG filename suffix
   (`<chr-start-end>.<uid>.png`).
5. manifest.tsv audit trail (11 cols including html_table_row), read
   by verify_cohort.py to power three cross-artifact checks.
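
Lever 1 amounts to widening each BED interval once, up front, so both renderers receive the same window. A minimal sketch, assuming rows are already split into fields; `flank_bed_rows` is a hypothetical helper and it does not clamp against chromosome lengths.

```python
def flank_bed_rows(rows: list[tuple], flanking: int) -> list[tuple]:
    """Widen each (chrom, start, end, ...) row by `flanking` bp on both sides."""
    flanked = []
    for chrom, start, end, *rest in rows:
        new_start = max(0, int(start) - flanking)  # BED starts cannot go below 0
        new_end = int(end) + flanking
        flanked.append((chrom, new_start, new_end, *rest))
    return flanked


# Hypothetical sites with a uid in the name column.
sites = [("chr2", 100, 200, "sv1"), ("chr7", 5000, 5100, "sv2")]
flanked = flank_bed_rows(sites, 300)
```

Writing the widened rows back to the sites BED means neither tool applies its own padding afterwards.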

verify_cohort.py picks up the manifest automatically and runs three
extra checks per sample:
- png_count_matches_bed: catches partial igver runs, stale manifests.
- pngs_exist_and_nonempty: catches empty IGV snapshots (< 10 KB).
- png_html_row_alignment: catches HTML/PNG drift across rebuilds.

igver resolution order: --igver-cmd / $IGVER_CMD / `igver` on PATH.
If none resolve, exit before the HTML build so you don't pay that cost
before learning PNGs are unavailable.

Methylation caveat documented: HTML uses bedGraph; igver per-read view
uses BAM, igver cross-sample view uses bigwig. Content can be made
identical only if both formats trace back to the same modkit pileup
output.

Tests: 109/109 pass (was 84 → +25 new across test_build_pngs.py and
test_verify_cohort_png.py). Also mirrored the previously-missed
test_generate_tracks_json.py from the methylation-shortcut round.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Empirical 2026-05-18: igver exits 0 even when it fails to render.
build_pngs_with_igver only gated on subprocess returncode != 0, which
missed this, so a "successful" build would silently ship an empty png/ dir.

Fix: after igver returns, walk each expected PNG path (`<chr>-<start>-
<end>.<uid>.<ext>` — igver's documented filename convention) and raise
SystemExit on any missing or zero-byte file. Error message points at
the most common root cause (`pip install igver` egg-link without the
IGV Java binary) and the apptainer SIF fix.
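
The post-run walk can be sketched as follows, assuming BED rows of (chrom, start, end, uid) and the filename convention quoted above. `expected_png_names` and `check_pngs` are hypothetical names, and the error text is illustrative.

```python
import tempfile
from pathlib import Path


def expected_png_names(bed_rows: list[tuple], ext: str = "png") -> list[str]:
    """Derive <chr>-<start>-<end>.<uid>.<ext> for every BED row."""
    return [f"{chrom}-{start}-{end}.{uid}.{ext}" for chrom, start, end, uid in bed_rows]


def check_pngs(png_dir: str, bed_rows: list[tuple], ext: str = "png") -> None:
    """Raise SystemExit if any expected PNG is missing or zero-byte."""
    problems = []
    for name in expected_png_names(bed_rows, ext):
        path = Path(png_dir) / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"zero-byte: {name}")
    if problems:
        raise SystemExit(
            "igver exited 0 but PNG output is incomplete:\n  " + "\n  ".join(problems)
        )


rows = [("chr2", 100, 200, "sv1"), ("chr7", 300, 400, "sv2"), ("chr9", 1, 2, "sv3")]
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "chr2-100-200.sv1.png").write_bytes(b"\x89PNG fake payload")
    (Path(tmp) / "chr7-300-400.sv2.png").write_bytes(b"")  # truncated write
    # chr9 PNG is deliberately never written.
    try:
        check_pngs(tmp, rows)
        message = ""
    except SystemExit as exc:
        message = str(exc)
```

Checking the files themselves keeps the guard independent of whatever exit status the renderer reports.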

The cohort verifier (`verify_cohort.py:check_pngs_exist_and_nonempty`)
runs the same check at end of build, but the inline check fires for
single-sample `--sites` invocations too, so every code path gets the
safety net.

3 new tests cover the failure modes: no PNGs (the motivating bug),
partial PNGs (mid-batch fail), zero-byte PNGs (truncated write). The
previous "happy path" test now patches subprocess.run with a fake
igver that writes the expected files instead of using a /usr/bin/true
stub. Tests: 112/112.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
… check)

Methylation users care about CpG counts in the bedGraph slice, not BAM
read counts — the silent-failure mode for methylation reports is
"region rendered, bedGraph slice has 0 CpGs" which the BAM-read-count
anchors don't detect. Closes that gap.

scripts/verify_anchors.py
- New track_type column in the anchor TSV schema (default `bam` when
  absent, for backwards compat). Two values: `bam` (samtools-view
  read count, existing behavior) and `bedgraph` (NEW — data row count
  in the wig/bedGraph slice).
- bedgraph_count_source(): count rows overlapping a region. tabix fast
  path for indexed .gz, gzip stream-decompress for plain .gz, linear
  scan for plain text. Overlap rule matches IGV ([start, end) half-open).
- bedgraph_count_slice(): decode the wig/bedGraph slice from the HTML
  data: URL (gzip(text) then base64 per igv_reports/datauri.py),
  count non-header data lines. Falls back to uncompressed for small
  payloads.
- sample_bedgraph_paths(): pull bedGraph/wig entries from samplesheet
  extra_tracks. Recognizes .bedgraph / .bg / .wig, plain or .gz.
- cmd_generate iterates both BAM and bedGraph tracks per row;
  verify_one_html dispatches on track_type to choose the right
  count strategy (no samtools needed for bedgraph).
- _is_wig_data_line() central helper rejects track/browser/
  fixedStep/variableStep/#/empty lines so source and slice counts
  use identical row-filtering semantics.
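
The shared row filter, the half-open overlap rule, and the slice decoding can be sketched together. This is an illustration under stated assumptions, not the code in verify_anchors.py: the function names are simplified, the data URI is assumed to carry base64 after the first comma, and only the linear-scan path is shown.

```python
import base64
import gzip

HEADER_PREFIXES = ("track", "browser", "fixedStep", "variableStep", "#")


def is_wig_data_line(line: str) -> bool:
    """True for data rows; False for headers, comments, and empty lines."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith(HEADER_PREFIXES)


def count_overlapping(lines, chrom: str, start: int, end: int) -> int:
    """Count bedGraph rows overlapping the half-open region [start, end)."""
    count = 0
    for line in lines:
        if not is_wig_data_line(line):
            continue
        fields = line.split()
        if len(fields) < 3 or fields[0] != chrom:
            continue
        row_start, row_end = int(fields[1]), int(fields[2])
        if row_start < end and row_end > start:  # half-open interval overlap
            count += 1
    return count


def count_slice(data_uri: str) -> int:
    """Count data rows in an embedded slice: base64, then gzip if present."""
    payload = base64.b64decode(data_uri.split(",", 1)[1])
    try:
        text = gzip.decompress(payload).decode()
    except OSError:  # not gzip: treat as an uncompressed small payload
        text = payload.decode()
    return sum(is_wig_data_line(line) for line in text.splitlines())


# Synthetic bedGraph text, for illustration only.
source = [
    "track type=bedGraph",
    "chr2\t100\t101\t80",
    "chr2\t200\t201\t50",
    "chr7\t100\t101\t10",
]
body = "\n".join(source).encode()
gz_uri = "data:application/gzip;base64," + base64.b64encode(gzip.compress(body)).decode()
raw_uri = "data:text/plain;base64," + base64.b64encode(body).decode()
```

Routing both the source count and the slice count through `is_wig_data_line` is what keeps the two numbers comparable.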

tests/unit/test_verify_anchors.py (+17 tests, total 129)
- Backwards compat: legacy 10-col anchor file → track_type="bam".
- bedGraph counting: in-region, half-open boundary, header skipping,
  plain-gzip input, missing file.
- Slice decoding: gzipped, uncompressed fallback, empty (silent-empty-
  methylation-slice catch).
- sample_bedgraph_paths: extras parsing, stem suffix stripping.

build_igvreports.py: no changes — run_anchors_generate already shells
to verify_anchors.py, the new bedGraph iteration lives inside
cmd_generate, samplesheet extra_tracks flows through automatically.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds tests/integration/end_to_end/, a ~14 s smoke test using the
committed tiny_colo829.hg38.bam fixture (457 KB). Closes the gap
between mocked unit tests and real-world failures: the silent-exit-0
bug fixed yesterday wouldn't have been caught by the mocked suite,
but is caught instantly by this layer.

scenarios.sh exercises eight assertions:
- HTML produced + plausible size (≥ 50 KB)
- verify_report.py structural PASS
- verify_anchors.py generate matches the frozen contract from
  tests/fixtures/README.md (chr2=5, chr7=9)
- verify_anchors.py verify round-trips the embedded slice counts
- --also-png produces manifest + non-empty PNGs (SKIPped when igver
  isn't installed or is the documented broken shim)

Synthesizes a minimal N-only FASTA so CI doesn't need an hg38 download.

run_all.sh runs end_to_end as the first integration scenario (it's
fast + portable, no shared-storage dependency).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
verify_anchors.py recently added `track_type` as the 3rd anchor column,
shifting downstream columns by +1. The integration scenarios edit the
anchor TSV by column index to simulate specific corruption modes, so
they were silently writing to wrong fields after the schema change.

Fix bumps three awk corruption indexes:
  scenario A: $6 -> $7  (corrupt `expected`)
  scenario B: $8 -> $9  (corrupt `min`)
  scenario D: $3 -> $4  (filter on `chrom`)
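
One way to make such edits robust to future schema changes is to address columns by name instead of by position. The sketch below assumes the anchor TSV has a header row, which this commit message does not confirm; `corrupt_column` and the four-column sample are hypothetical.

```python
import csv
import io


def corrupt_column(tsv_text: str, column: str, new_value: str) -> str:
    """Overwrite one named column in every row of a headered TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = reader.fieldnames or []
    if column not in fieldnames:
        raise KeyError(f"column {column!r} not in header {fieldnames}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = new_value  # lookup by name survives inserted columns
        writer.writerow(row)
    return out.getvalue()


# Hypothetical, shortened anchor table.
anchors = "sample\ttrack_type\tchrom\texpected\nS1\tbam\tchr2\t5\nS1\tbam\tchr7\t9\n"
corrupted = corrupt_column(anchors, "expected", "999")
```

With name-based access, adding a column such as `track_type` no longer shifts the field a scenario is trying to corrupt.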

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>