Develop by joshfactorial · Pull Request #123 · ncsa/rusty-neat

joshfactorial · 2026-05-13T19:08:46Z

Added support for a custom mutation rate BED file. rneat takes a BED file with a column such as "mut_rate=0.0001" and produces variants at the specified mutation rate within each BED region, and at the default rate everywhere else. The file can be intersected with a target BED as well.
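For illustration, here is a minimal sketch of how a `mut_rate=` column might be pulled out of a BED line. Everything in it is an assumption made for the example: the names `MutRegion`, `ParseError`, and `parse_mut_region` are hypothetical and are not rneat's actual types, and the sketch assumes tab-separated columns with the tag sitting in any column after the third.

```rust
/// Hypothetical record type for this sketch; rneat's real bed record differs.
#[derive(Debug, PartialEq)]
struct MutRegion {
    contig: String,
    start: u64, // 0-based, inclusive (BED convention)
    end: u64,   // exclusive
    mut_rate: f64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    TooFewColumns,
    BadCoordinate,
    MissingMutRate,
    BadMutRate,
}

/// Parse one tab-separated BED line that carries a `mut_rate=<float>` column.
fn parse_mut_region(line: &str) -> Result<MutRegion, ParseError> {
    let cols: Vec<&str> = line.trim_end().split('\t').collect();
    if cols.len() < 3 {
        return Err(ParseError::TooFewColumns);
    }
    let start = cols[1].parse::<u64>().map_err(|_| ParseError::BadCoordinate)?;
    let end = cols[2].parse::<u64>().map_err(|_| ParseError::BadCoordinate)?;
    if start >= end {
        return Err(ParseError::BadCoordinate);
    }
    // Take the first extra column of the form `mut_rate=<value>`.
    let raw = cols[3..]
        .iter()
        .find_map(|c| c.strip_prefix("mut_rate="))
        .ok_or(ParseError::MissingMutRate)?;
    let mut_rate = raw.parse::<f64>().map_err(|_| ParseError::BadMutRate)?;
    // A rate is a per-base probability, so reject anything outside [0, 1] (also rejects NaN).
    if !(0.0..=1.0).contains(&mut_rate) {
        return Err(ParseError::BadMutRate);
    }
    Ok(MutRegion { contig: cols[0].to_string(), start, end, mut_rate })
}
```

In this sketch, a line such as `chr1<TAB>100<TAB>200<TAB>mut_rate=0.0001` yields a region covering positions 100 to 199 with rate 0.0001, while a line without the tag is reported as an error rather than silently falling back to the default rate.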

Improvements to memory usage by streaming files.

Modeled real data (GIAB - NA12878) to create baseline human models for rneat to run on.

Incremental flushing of the BAM file, enabled by the next item.

Lost shuffling. Shuffling was too memory-expensive, and even when it was turned off, the architecture in place to support it was eating up memory. So we nixed it, with a recommendation to use seqkit to shuffle the FASTQ instead.

Overall, although rneat lost the ability to shuffle FASTQ files, which was only practical on small genomes in the first place, it made huge memory gains: it can now handle full-genome files on an HPC cluster, and up to medium-sized genomes on a standard desktop computer.

…to the default outside of the mutation regions bed. I'm not entirely sure what the original design was going to be, but this seems like a logical place to start. I manually added a readme item explaining how to use the mutation_regions bed and the rules for use. I also tweaked some of the logic for parsing the mutation regions file, deciding it was worth having a 1-bit flag so we can skip searching for a mutation rate in beds where we don't care about it.

…eal-world WGS data

gen-reads memory:
- Replace per-position Vec<f64> bias_map with compact (start, end, rate) segment list; rice peak RSS 3.06 GB -> 1.05 GB (~3x reduction), major page faults 109K -> 0
- SE cover_dataset fragment pool reduced to 1 entry with cyclic index, eliminating O(coverage * contig_length) VecDeque allocation and redundant RNG shuffles

gen-mut-model VCF fixes:
- Handle missing alleles (./. and ./1) in GT field without panicking
- Skip multi-allelic ALT sites (comma-separated) with a debug log
- Warn-and-skip on unparseable FORMAT/SAMPLE rather than aborting model build
- Canonicalize soft-masked FASTA bases before REF comparison; genuine mismatches warn-and-skip instead of aborting (fixes hg38 soft-masked reference)
- Fix optional bed_file key using .get() instead of [] index (panic on missing key)
- Demote per-variant "Found genotype" log from INFO to DEBUG

NA12878 WGS human model now builds successfully: mutation_rate 0.001457, homozygous_frequency 0.390, 86% SNP / 6.6% INS / 7.4% DEL, 2:15 wall clock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
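To make the bias_map change concrete, here is a rough sketch of the general idea of swapping a per-position `Vec<f64>` for a sorted list of (start, end, rate) segments. The names `Segment`, `compress`, and `lookup_rate` are invented for this example and are not taken from the rneat source; the real layout and lookup code may differ.

```rust
/// One run of consecutive positions that share a rate: half-open [start, end).
/// Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Segment {
    start: usize,
    end: usize,
    rate: f64,
}

/// Run-length encode a per-position rate vector into segments.
/// Memory drops from O(contig length) to O(number of distinct runs).
fn compress(per_position: &[f64]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    for (pos, &rate) in per_position.iter().enumerate() {
        if let Some(last) = segments.last_mut() {
            // Exact float equality is intended here: we only merge identical values.
            if last.rate == rate {
                last.end = pos + 1;
                continue;
            }
        }
        segments.push(Segment { start: pos, end: pos + 1, rate });
    }
    segments
}

/// Find the rate covering `pos` by binary search over sorted, non-overlapping segments.
/// Returns None when `pos` falls outside every segment.
fn lookup_rate(segments: &[Segment], pos: usize) -> Option<f64> {
    // Index of the first segment whose end lies beyond `pos`.
    let idx = segments.partition_point(|s| s.end <= pos);
    segments.get(idx).filter(|s| s.start <= pos).map(|s| s.rate)
}
```

The trade-off in this sketch is that a lookup costs O(log n) in the number of segments instead of O(1), which is usually cheap when rates are constant over long stretches of a contig.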

…on to 1.5.0

- New "Running on HPC" section with per-subcommand SLURM headers, memory/scratch budgets, and a sizing table for bacterial/human gen-reads runs
- gen-gc-bias-model: document plain-text vs gzip memory trade-off and per-chromosome gzip strategy for full-genome runs
- Correct stale note that gzip coverage files are unsupported (added in v1.5.0)
- Bump version number in Prerequisites from 1.4.1 to 1.5.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces the plain-text-only CoverageIndex/CoverageData::load API with a unified CoverageReader that transparently handles both plain-text and gzip-compressed coverage files. Plain-text path is unchanged (byte-offset seek, one contig in RAM at a time). Gzip path streams the entire file in a single forward pass and evicts contigs from the in-memory map as they are consumed, so peak memory declines as processing advances. Uses existing common::file_tools::file_io helpers (is_gzipped_file, read_gzip_lines) rather than reimplementing gzip detection or decoding. Adds 5 new tests covering the gzip path, eviction behaviour, and format-parity between plain and gzip readers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
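The evict-on-consume behaviour described for the gzip path can be sketched roughly as follows. This is not the actual CoverageReader: `StreamedCoverage` and its methods are hypothetical names, the three-column `contig<TAB>position<TAB>depth` line format is an assumption, and decompression is left out entirely (the sketch accepts any `BufRead`, so a decoding reader could be passed in).

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Hypothetical stand-in for a streamed coverage reader.
struct StreamedCoverage {
    by_contig: HashMap<String, Vec<u32>>,
}

impl StreamedCoverage {
    /// Single forward pass: read every line once and group depths by contig.
    /// Assumes tab-separated `contig, position, depth` columns.
    fn load<R: BufRead>(reader: R) -> io::Result<Self> {
        let mut by_contig: HashMap<String, Vec<u32>> = HashMap::new();
        for line in reader.lines() {
            let line = line?;
            let cols: Vec<&str> = line.split('\t').collect();
            if cols.len() < 3 {
                continue; // skip blank or short lines in this sketch
            }
            let depth = cols[2]
                .trim()
                .parse::<u32>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            by_contig.entry(cols[0].to_string()).or_default().push(depth);
        }
        Ok(Self { by_contig })
    }

    /// Hand one contig's depths to the caller and evict them from the map,
    /// so resident memory shrinks as contigs are consumed.
    fn take(&mut self, contig: &str) -> Option<Vec<u32>> {
        self.by_contig.remove(contig)
    }

    /// Number of contigs still held in memory.
    fn resident_contigs(&self) -> usize {
        self.by_contig.len()
    }
}
```

Because `take` moves the vector out of the map, asking for the same contig twice returns `None` the second time; a caller that needs a contig more than once would have to keep its own copy.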

1. Import style: replace crate::common:: with common:: in runner.rs test module to match the convention used everywhere else in the file.
2. parse_other_for_mut return type: change Result<Option<f64>, BedErrors> to Result<f64, BedErrors>. The function never returned Ok(None); both missing-tag and malformed-value paths returned Err. Update new_mut_region_record to wrap the result in Some(), and update 5 test assertions to compare bare f64 values.
3. Unit tests for rate helpers: add test_apply_rate_override, test_rate_at, and test_exclude_positions to runner.rs covering boundary splits, gap handling, and multi-segment exclusions that were previously only exercised indirectly via integration tests.
4. Trailing newlines: add missing final newline to bed_record.rs, bed_reader.rs, and runner.rs.
5. filter-reads boundary test: add test_filter_fastq_boundary to filter_lib.rs confirming that reads touching (but not overlapping) a BED region boundary are correctly excluded end-to-end, validating the overlaps() semantic fix from this branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
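The "touching but not overlapping" boundary case in item 5 is easiest to see with half-open intervals. The sketch below assumes BED-style [start, end) coordinates; the `Interval` type and this `overlaps` method are written for illustration and are not copied from the branch, whose actual implementation is not shown here.

```rust
/// Half-open interval [start, end), as in BED coordinates.
/// Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy)]
struct Interval {
    start: u64,
    end: u64,
}

impl Interval {
    /// True only if the two intervals share at least one position.
    /// Strict `<` on both sides means intervals that merely touch
    /// (one ends exactly where the other starts) do not overlap.
    fn overlaps(&self, other: &Interval) -> bool {
        self.start < other.end && other.start < self.end
    }
}
```

Under these assumptions, a read spanning [100, 150) does not overlap a region spanning [150, 200), whereas a read spanning [100, 151) does, because position 150 is then shared.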

joshfactorial and others added 17 commits May 11, 2026 22:28

adding ability of bed_record to parse out a mutation rate from the ex…

bb8dbbb

…tra information. Will make it clear in the docs what is expected. Also, fixed a couple small things that were bugging me.

Code cleanup, aisle config.rs. Also, added mutation_regions capabilit…

c2ba4e7

…ies into bed file, and added an input mutation regions parameter to the config. Still need to implement the logic. NEAT 2.1 read the file, but never implemented the actual mutation regions

Minor code cleanup. Adding mutation_regions to ContigContext.

36b2951

Seeing what Junie does with a relatively easy task

56cfc1c

Added a new generator for bedrecord to simplify things for the caller…

679688e

…, since the field is only really needed when we know it's a mutation regions bed. Had Junie add some tests to cover this code

Merge pull request #122 from ncsa/feature/custom_mutation_rates

e504e75

Feature/custom mutation rates

committing, but this is not the final form

3b95c24

Resolved merge conflicts in bed_reader.rs and updated tests

a10b97a

Expanding tests and double checking code

4a1da15

doing a reverse pass during cover_dataset to see if that helps things…

7e2a568

… at all

Testing revealed that the mutation model numbers were off, resulting …

f74694b

…in a reverse in expected count of SNPs v INDELs. Should be fixed now.

Trying to get memory use more efficient with Claude's help. It's now …

42fcd16

…I think writing more files to save some memory.

joshfactorial merged commit 7e44f2c into main May 18, 2026
1 check passed

joshfactorial merged 17 commits into main from develop
