Conversation
…by unnormalized variants
Pull request overview
Adds a standalone Hail script intended to normalize variant row keys to minimal representation, detect collisions/duplicates introduced by normalization, and write a cleaned MatrixTable output.
Changes:
- Introduces a new Hail script that identifies rows needing hl.min_rep normalization and rebuilds the MT with normalized keys.
- Adds collision detection logic to handle cases where normalized keys already exist in the input MT.
- Uses interval-based subsetting + join patterns to avoid scanning/shuffling the full table multiple times.
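For context, the "minimal representation" the script targets trims bases shared between the ref and alt alleles: trailing bases first, then leading bases while shifting the position right. A rough pure-Python sketch of what Hail's hl.min_rep computes for a single alt allele (the helper name minimal_representation is mine, not from the script):

```python
def minimal_representation(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Sketch of variant normalisation to minimal representation.

    Trims identical trailing bases, then identical leading bases (advancing
    the position), leaving at least one base in each allele.
    """
    # Trim shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, shifting the variant position right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, a deletion written as pos 1000, TAA→TA reduces to pos 1000, TA→T, and a padded SNP like pos 100, ACG→ATG reduces to pos 101, C→T.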
collected = mt.aggregate_rows(
    hl.agg.filter(
        needs_norm,
        hl.agg.collect(
            hl.struct(
                old_locus=mt.locus,
                old_alleles=mt.alleles,
                new_locus=mr.locus,
                new_alleles=mr.alleles,
            )
        ),
    )
)
I understand this idea, but this normalization should never affect more than ~1,000 rows at a time; if it ever OOMs here, we would have a much bigger problem
This is a good point. Unless absolutely necessary, you should avoid collect operations on Hail data; that doubly applies when the collect is trying to capture the majority of the data.
An alternative would be to annotate the dataset with a new pair of fields, normal_locus and normal_alleles (as you've done with the mr object, but written back into the dataset, since you want a table keyed on the new versions for cross-annotation later), then either annotate a Boolean flag or simply filter rows to the variants whose representation changed.
You can write that normalised table out to GCP, as it should be the vast minority of the data. You can then use that table to join against the original data to check for duplicates.
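The pattern described above could look roughly like this. This is a sketch under assumptions, not code from the PR: the function name, the field names normal_locus/normal_alleles, and the out_path parameter are illustrative, and it assumes Hail is installed and initialised. Note it uses a semi-join to surface rows whose existing keys collide with a normalised key (an anti-join would instead return the non-colliding rows):

```python
def split_normalised_rows(mt, out_path: str):
    """Sketch of the suggested annotate-then-join pattern.

    Annotates each row with its minimal representation, filters to the
    (hopefully small) set of rows whose keys change, persists that table,
    and joins it back against the original keys to surface collisions.
    """
    import hail as hl  # deferred import so the sketch can be defined without Hail

    # Write the normalised keys back into the dataset rather than collecting.
    mr = hl.min_rep(mt.locus, mt.alleles)
    mt = mt.annotate_rows(normal_locus=mr.locus, normal_alleles=mr.alleles)

    # Keep only rows whose representation actually changed.
    changed = mt.filter_rows(
        (mt.locus != mt.normal_locus) | (mt.alleles != mt.normal_alleles)
    ).rows()

    # Re-key on the normalised fields and persist the (small) table.
    changed = changed.key_by(locus=changed.normal_locus, alleles=changed.normal_alleles)
    changed = changed.checkpoint(out_path, overwrite=True)

    # Original rows whose keys already match a normalised key are the
    # duplicates that normalisation would introduce.
    collisions = mt.rows().semi_join(changed)
    return changed, collisions
```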
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
MattWellie
left a comment
Lots of things here are logical steps, but in Hail you might need to be more cautious. Collect operations almost always cause memory blowouts, so favour keeping the data/annotations inside Hail objects where possible
print(f'Found {n_changed} rows needing normalization')

if n_changed == 0:
    mt.write(args.output, overwrite=args.overwrite)
In the happy-path scenario (no duplication to fix), this write duplicates the original MatrixTable
yes, it's so nice, we had to save it twice
but also, great catch, my bad
My completely unimpactful suggestion that makes me feel like I'm contributing to this PR is to use loguru:

import loguru
loguru.logger.info('Logging instead of printing')

I admittedly have not set the best example with doing this in my scripts...
I tried to make a test script that takes a MatrixTable, detects whether there are any unnormalised variants, normalises them, and compares them to the original table to detect whether any duplicates were introduced; if so, it prints both entries and arbitrarily chooses the previously unnormalised (now normalised) entry. Claude insists that I should pass in the reference genome, although I think it's unnecessary. Otherwise I tried to split it up to be as Hail-efficient as possible.
Purpose
- Find unnormalised variants
- Check whether their normalised versions are duplicated in the table
- If they are, record both instances and use the previously unnormalised version as the source of truth

I'm unsure where this belongs, but I did want to commit it here in case it can be useful
Checklist