
Normalization and Duplication test script#34

Open
Johnnyassaf wants to merge 6 commits into main from seqr-sanitize

Conversation

@Johnnyassaf
Contributor

I tried to make a test script that takes a matrix table, detects any unnormalised variants, normalises them, compares them to the original table to check whether any duplicates occurred, prints out both entries, and arbitrarily chooses the previously unnormalised (now normalised) entry. Claude insists that I should pass in the reference genome, although I think it's unnecessary. Otherwise, I tried to split the work up to be as Hail-efficient as possible.

Purpose

  • Find unnormalised variants
  • Check whether their normalised versions are duplicated in the table
  • Record both instances if they are, and use the previously unnormalised version as the source of truth
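For context, the "unnormalised" check above hinges on minimal representation. A plain-Python sketch of what Hail's `hl.min_rep` computes (illustrative only; the real function takes locus/alleles expressions and returns a struct, and this helper ignores multi-allelic sites):

```python
def min_rep(pos, ref, alt):
    """Trim a biallelic variant to its minimal representation.

    Illustrative stand-in for hl.min_rep: trim shared trailing bases,
    then shared leading bases (advancing the position), keeping at
    least one base in each allele.
    """
    # Trim shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, advancing the position as we go.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Two variants with different raw representations can collapse to the same minimal key, which is exactly how normalisation introduces the duplicates this script looks for.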

I'm unsure where it belongs, but I wanted to commit it here in case it can be useful.

Checklist

  • Version Bump!
  • Related GitHub Issue created
  • Tests covering new change
  • Linting checks pass


Copilot AI left a comment

Pull request overview

Adds a standalone Hail script intended to normalize variant row keys to minimal representation, detect collisions/duplicates introduced by normalization, and write a cleaned MatrixTable output.

Changes:

  • Introduces a new Hail script that identifies rows needing hl.min_rep normalization and rebuilds the MT with normalized keys.
  • Adds collision detection logic to handle cases where normalized keys already exist in the input MT.
  • Uses interval-based subsetting + join patterns to avoid scanning/shuffling the full table multiple times.


Comment on lines +37 to +49
collected = mt.aggregate_rows(
    hl.agg.filter(
        needs_norm,
        hl.agg.collect(
            hl.struct(
                old_locus=mt.locus,
                old_alleles=mt.alleles,
                new_locus=mr.locus,
                new_alleles=mr.alleles,
            )
        ),
    )
)
Contributor Author

I understand this idea, but this normalization should never affect more than 1000 or so rows at a time; if it ever OOMs here, we would have a much bigger problem.

Collaborator

@MattWellie MattWellie Apr 10, 2026

This is a good point. Unless absolutely necessary, you should avoid collect operations on Hail data. That applies doubly if the collect is trying to capture the majority of the data.

An alternative would be to annotate the dataset with a new pair of fields, normal_locus and normal_alleles (as you've done with the mr object, but written back into the dataset, since you want a table keyed on the new versions for cross-annotation later), then either annotate a Boolean flag or just filter rows to the variants whose representation changed.

You can write that normalised table out to GCP, as it should be the vast minority of data. You can then use that table to anti-join on the original data to check for duplicates.
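The duplicate check being described can be sketched outside Hail with plain sets (a hypothetical helper for illustration only; in the script itself this would be the keyed join/anti-join between the small normalised table and the original MT that's suggested above):

```python
def find_collisions(original_keys, renormalized):
    """Return (old_key, new_key) pairs whose normalised key already exists
    in the original table, i.e. rows where normalisation created a duplicate.

    original_keys: iterable of variant-key tuples for every row in the table.
    renormalized:  (old_key, new_key) pairs for the rows whose representation
                   changed under minimal-representation normalisation.
    """
    existing = set(original_keys)
    return [(old, new) for old, new in renormalized if new in existing]
```

The point of the pattern is that `renormalized` is tiny (the "vast minority of data"), so only the small side is materialised while the full table stays inside Hail.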

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Collaborator

@MattWellie MattWellie left a comment

Lots of things here are logical steps, but in Hail you might need to be more cautious. Collect operations almost always cause memory blowouts, so favour keeping the data/annotations inside Hail objects where possible.

print(f'Found {n_changed} rows needing normalization')

if n_changed == 0:
    mt.write(args.output, overwrite=args.overwrite)
Collaborator

In the happy-path scenario (no duplication to fix), this write duplicates the original matrix table.

Contributor Author

yes, its so nice, we had to save it twice

Contributor Author

but also, great catch, my bad

@EddieLF
Collaborator

EddieLF commented Apr 10, 2026

My completely unimpactful suggestion that makes me feel like I'm contributing to this PR is to use loguru.logger instead of print

import loguru

loguru.logger.info('Logging instead of printing')

I admittedly have not set the best example with doing this in my scripts...

