
Normalization and Duplication test script#34

Open
Johnnyassaf wants to merge 6 commits into main from seqr-sanitize

Conversation

@Johnnyassaf
Contributor

I tried to make a test script that takes a matrix table, detects any unnormalised variants, normalises them, compares them to the original table to check whether any duplicates occurred, prints out both entries, and arbitrarily chooses the previously unnormalised (now normalised) entry. Claude insists that I should pass in the reference genome, although I think it's unnecessary. Otherwise, I tried to split the work up to be as Hail-efficient as possible.

Purpose

  • Find unnormalised variants
  • Check whether their normalised versions are duplicated in the table
  • Record both instances if they are, and use the previously unnormalised version as the source of truth
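For context, the "unnormalised" check above hinges on minimal representation. A plain-Python sketch of what Hail's `hl.min_rep` computes (illustrative only; the real function takes locus/alleles expressions and returns a struct, and this helper ignores multi-allelic sites):

```python
def min_rep(pos, ref, alt):
    """Trim a biallelic variant to its minimal representation.

    Illustrative stand-in for hl.min_rep: trim shared trailing bases,
    then shared leading bases (advancing the position), keeping at
    least one base in each allele.
    """
    # Trim shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, advancing the position as we go.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Two variants with different raw representations can collapse to the same minimal key, which is exactly how normalisation introduces the duplicates this script looks for.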

I'm unsure where it belongs, but I wanted to commit it here in case it can be useful.

Checklist

  • Version Bump!
  • Related GitHub Issue created
  • Tests covering new change
  • Linting checks pass


Copilot AI left a comment

Pull request overview

Adds a standalone Hail script intended to normalize variant row keys to minimal representation, detect collisions/duplicates introduced by normalization, and write a cleaned MatrixTable output.

Changes:

  • Introduces a new Hail script that identifies rows needing hl.min_rep normalization and rebuilds the MT with normalized keys.
  • Adds collision detection logic to handle cases where normalized keys already exist in the input MT.
  • Uses interval-based subsetting + join patterns to avoid scanning/shuffling the full table multiple times.


Comment on lines +37 to +49
collected = mt.aggregate_rows(
    hl.agg.filter(
        needs_norm,
        hl.agg.collect(
            hl.struct(
                old_locus=mt.locus,
                old_alleles=mt.alleles,
                new_locus=mr.locus,
                new_alleles=mr.alleles,
            )
        ),
    )
)
Contributor Author

I understand this idea, but this normalization should never affect more than 1000 or so rows at a time; if it ever OOMs here, we would have a much bigger problem.

Collaborator

@MattWellie MattWellie Apr 10, 2026

This is a good point. Unless absolutely necessary, you should avoid collect operations on Hail data. That applies doubly if the collect is trying to capture the majority of the data.

An alternative would be to annotate the dataset with a new pair of fields, normal_locus and normal_alleles (as you've done with the mr object, but written back into the dataset, since you want a table keyed on the new versions for cross-annotation later), then either annotate a Boolean flag or just filter rows to the variants whose representation changed.

You can write that normalised table out to GCP, as it should be the vast minority of data. You can then use that table to anti-join on the original data to check for duplicates.
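The duplicate check being described can be sketched outside Hail with plain sets (a hypothetical helper for illustration only; in the script itself this would be the keyed join/anti-join between the small normalised table and the original MT that's suggested above):

```python
def find_collisions(original_keys, renormalized):
    """Return (old_key, new_key) pairs whose normalised key already exists
    in the original table, i.e. rows where normalisation created a duplicate.

    original_keys: iterable of variant-key tuples for every row in the table.
    renormalized:  (old_key, new_key) pairs for the rows whose representation
                   changed under minimal-representation normalisation.
    """
    existing = set(original_keys)
    return [(old, new) for old, new in renormalized if new in existing]
```

The point of the pattern is that `renormalized` is tiny (the "vast minority of data"), so only the small side is materialised while the full table stays inside Hail.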

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Collaborator

@MattWellie MattWellie left a comment

Lots of things here are logical steps, but in Hail you might need to be more cautious. Collect operations almost always cause memory blowouts, so favour keeping the data/annotations inside Hail objects where possible.

print(f'Found {n_changed} rows needing normalization')

if n_changed == 0:
    mt.write(args.output, overwrite=args.overwrite)
Collaborator

In the happy-path scenario (no duplication to fix), this write duplicates the original matrix table.

Contributor Author

yes, its so nice, we had to save it twice

Contributor Author

but also, great catch, my bad

@EddieLF
Collaborator

EddieLF commented Apr 10, 2026

My completely unimpactful suggestion that makes me feel like I'm contributing to this PR is to use loguru.logger instead of print

import loguru

loguru.logger.info('Logging instead of printing')

I admittedly have not set the best example with doing this in my scripts...

