Develop#123

Merged
joshfactorial merged 17 commits into main from develop
May 18, 2026

joshfactorial (Collaborator) commented May 13, 2026

Added support for a custom mutation-rate BED file. It takes a BED file with a column such as "mut_rate=0.0001" and produces variants at the specified mutation rate within each BED region, and at the default rate everywhere else. It can also be intersected with a target BED.
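As an illustration of the expected input, here is a minimal sketch of parsing one line of such a BED file. The `MutRegion` type, the `parse_mut_region` function, the `String` error type, and the choice to allow semicolon-separated tags are assumptions made for this example, not rneat's actual code.

```rust
/// Hypothetical record for one line of a mutation-regions BED file.
#[derive(Debug, PartialEq)]
pub struct MutRegion {
    pub contig: String,
    pub start: u64,
    pub end: u64,
    pub mut_rate: f64,
}

/// Parse a line such as "chr1\t100\t200\tmut_rate=0.0001".
/// A missing tag and a malformed value are both reported as errors.
pub fn parse_mut_region(line: &str) -> Result<MutRegion, String> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() < 4 {
        return Err(format!("expected at least 4 fields, got {}", fields.len()));
    }
    let start: u64 = fields[1].parse().map_err(|e| format!("bad start: {e}"))?;
    let end: u64 = fields[2].parse().map_err(|e| format!("bad end: {e}"))?;
    if end <= start {
        return Err(format!("empty or inverted interval {start}-{end}"));
    }
    // Assumption: the tag may sit in any column after the third and may
    // share that column with other semicolon-separated annotations.
    let rate_str = fields[3..]
        .iter()
        .copied()
        .flat_map(|f| f.split(';'))
        .find_map(|tag| tag.trim().strip_prefix("mut_rate="))
        .ok_or_else(|| "missing mut_rate= tag".to_string())?;
    let mut_rate: f64 = rate_str
        .parse()
        .map_err(|e| format!("bad mut_rate value: {e}"))?;
    // Rejects NaN as well, since NaN is not contained in any range.
    if !(0.0..=1.0).contains(&mut_rate) {
        return Err(format!("mut_rate {mut_rate} is outside [0, 1]"));
    }
    Ok(MutRegion {
        contig: fields[0].to_string(),
        start,
        end,
        mut_rate,
    })
}
```

Positions outside every parsed region would simply fall back to the model's default rate.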

Improvements to memory usage by streaming files.

Modeled real data (GIAB - NA12878) to create baseline human models for rneat to run on.

Incremental flushing of the BAM file, enabled by the next item (the removal of shuffling).

Removed shuffling. Shuffling was too memory-expensive, and even when it was turned off, the architecture in place to support it was eating up memory. So we nixed it, and recommend using seqkit to shuffle the FASTQ instead.

Overall, although rneat lost the ability to shuffle FASTQ files, which was only practical on small genomes in the first place, it made large memory gains: it can now handle full-genome files on an HPC cluster, and up to medium-sized genomes on a standard desktop computer.

joshfactorial and others added 17 commits May 11, 2026 22:28
…tra information. Will make it clear in the docs what is expected. Also, fixed a couple small things that were bugging me.
…ies into bed file, and added an input mutation regions parameter to the config. Still need to implement the logic. NEAT 2.1 read the file, but never implemented the actual mutation regions
…to the default outside of the mutation regions bed. I'm not entirely sure what the original design was going to be, but this seems like a logical place to start, to me. I manually added in a readme item explaining how to use the mutation_regions bed and the rules for use. I also tweaked some of the logic in the parsing mutation regions file, deciding it was worth having a 1-bit flag so we can skip searching for a mutation rate in beds where we don't care.
…, since the field is only really needed when we know it's a mutation regions bed. Had Junie add some tests to cover this code
…in a reverse in expected count of SNPs v INDELs. Should be fixed now.
…I think writing more files to save some memory.
…eal-world WGS data

gen-reads memory:
- Replace per-position Vec<f64> bias_map with compact (start, end, rate) segment list;
  rice peak RSS 3.06 GB -> 1.05 GB (~3x reduction), major page faults 109K -> 0
- SE cover_dataset fragment pool reduced to 1 entry with cyclic index, eliminating
  O(coverage * contig_length) VecDeque allocation and redundant RNG shuffles
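
The segment-list idea can be sketched as follows. This is an illustration only: the `Segment` type and the `compress` and `rate_at` functions are hypothetical names chosen for the example, and the real bias_map code may be organized differently.

```rust
/// One run of positions sharing the same rate; half-open [start, end).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Segment {
    pub start: usize,
    pub end: usize,
    pub rate: f64,
}

/// Collapse a dense per-position vector into runs of equal values.
pub fn compress(dense: &[f64]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    for (pos, &rate) in dense.iter().enumerate() {
        match segments.last_mut() {
            // Extend the current run when the value repeats.
            Some(last) if last.rate == rate && last.end == pos => last.end = pos + 1,
            // Otherwise start a new run at this position.
            _ => segments.push(Segment { start: pos, end: pos + 1, rate }),
        }
    }
    segments
}

/// Look up the rate at `pos` by binary search over sorted,
/// non-overlapping segments. Returns None if no segment covers `pos`.
pub fn rate_at(segments: &[Segment], pos: usize) -> Option<f64> {
    // Index of the first segment that ends after `pos`.
    let idx = segments.partition_point(|s| s.end <= pos);
    segments.get(idx).filter(|s| s.start <= pos).map(|s| s.rate)
}
```

Memory then scales with the number of distinct runs rather than with contig length, at the cost of an O(log n) lookup per position.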

gen-mut-model VCF fixes:
- Handle missing alleles (./. and ./1) in GT field without panicking
- Skip multi-allelic ALT sites (comma-separated) with a debug log
- Warn-and-skip on unparseable FORMAT/SAMPLE rather than aborting model build
- Canonicalize soft-masked FASTA bases before REF comparison; genuine mismatches
  warn-and-skip instead of aborting (fixes hg38 soft-masked reference)
- Fix optional bed_file key using .get() instead of [] index (panic on missing key)
- Demote per-variant "Found genotype" log from INFO to DEBUG
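
A rough sketch of the tolerant VCF handling listed above, under stated assumptions: the `Genotype` enum, the function names, and the decision to count `./1` as heterozygous are all illustrative and are not taken from the gen-mut-model source.

```rust
#[derive(Debug, PartialEq)]
pub enum Genotype {
    /// Every allele in the GT field is ALT, e.g. 1/1.
    Homozygous,
    /// At least one ALT allele next to a REF or missing allele, e.g. 0/1 or ./1.
    Heterozygous,
    /// No ALT allele was called, e.g. 0/0 or ./.
    NoAlt,
}

/// Classify a GT value without panicking on missing alleles.
/// Returns None for text that is not a valid GT, so the caller can
/// warn and skip the record instead of aborting.
pub fn classify_gt(gt: &str) -> Option<Genotype> {
    let mut alt = 0usize;
    let mut total = 0usize;
    for allele in gt.split(|c: char| c == '/' || c == '|') {
        total += 1;
        if allele == "." {
            continue; // missing allele: counted, but neither REF nor ALT
        }
        match allele.parse::<u32>() {
            Ok(0) => {}
            Ok(_) => alt += 1,
            Err(_) => return None,
        }
    }
    Some(if alt == 0 {
        Genotype::NoAlt
    } else if alt == total {
        Genotype::Homozygous
    } else {
        Genotype::Heterozygous
    })
}

/// Multi-allelic sites list several comma-separated ALT alleles.
pub fn is_multi_allelic(alt_field: &str) -> bool {
    alt_field.contains(',')
}

/// Compare reference bases case-insensitively so that soft-masked
/// (lowercase) FASTA sequence still matches an uppercase VCF REF.
pub fn ref_matches(fasta_bases: &str, vcf_ref: &str) -> bool {
    fasta_bases.eq_ignore_ascii_case(vcf_ref)
}
```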

NA12878 WGS human model now builds successfully: mutation_rate 0.001457,
homozygous_frequency 0.390, 86% SNP / 6.6% INS / 7.4% DEL, 2:15 wall clock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on to 1.5.0

- New "Running on HPC" section with per-subcommand SLURM headers, memory/scratch
  budgets, and a sizing table for bacterial/human gen-reads runs
- gen-gc-bias-model: document plain-text vs gzip memory trade-off and
  per-chromosome gzip strategy for full-genome runs
- Correct stale note that gzip coverage files are unsupported (added in v1.5.0)
- Bump version number in Prerequisites from 1.4.1 to 1.5.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the plain-text-only CoverageIndex/CoverageData::load API with a
unified CoverageReader that transparently handles both plain-text and
gzip-compressed coverage files. Plain-text path is unchanged (byte-offset
seek, one contig in RAM at a time). Gzip path streams the entire file in a
single forward pass and evicts contigs from the in-memory map as they are
consumed, so peak memory declines as processing advances.

Uses existing common::file_tools::file_io helpers (is_gzipped_file,
read_gzip_lines) rather than reimplementing gzip detection or decoding.
Adds 5 new tests covering the gzip path, eviction behaviour, and
format-parity between plain and gzip readers.
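
To illustrate the two ideas in this commit (format detection and eviction on consumption), here is a standard-library-only sketch. It does not reproduce CoverageReader or the file_io helpers: `looks_gzipped` and `ContigCache` are hypothetical names, the three-column "contig, position, depth" input is an assumed format, and actual gzip decoding is left out.

```rust
use std::collections::HashMap;
use std::io::{self, Read};

/// Gzip streams begin with the magic bytes 0x1f 0x8b (RFC 1952).
pub fn looks_gzipped<R: Read>(mut reader: R) -> io::Result<bool> {
    let mut magic = [0u8; 2];
    match reader.read_exact(&mut magic) {
        Ok(()) => Ok(magic == [0x1f, 0x8b]),
        // Fewer than two bytes available: cannot be gzip.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

/// Per-contig depths collected in one forward pass over the lines.
pub struct ContigCache {
    contigs: HashMap<String, Vec<u32>>,
}

impl ContigCache {
    /// Assumed input: tab-separated "contig, position, depth" lines.
    /// Lines that do not parse are skipped.
    pub fn from_lines<I, S>(lines: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: AsRef<str>,
    {
        let mut contigs: HashMap<String, Vec<u32>> = HashMap::new();
        for line in lines {
            let mut cols = line.as_ref().split('\t');
            if let (Some(name), Some(_pos), Some(depth)) =
                (cols.next(), cols.next(), cols.next())
            {
                if let Ok(depth) = depth.trim().parse::<u32>() {
                    contigs.entry(name.to_string()).or_default().push(depth);
                }
            }
        }
        Self { contigs }
    }

    /// Hand a contig's data to the caller and evict it from the map,
    /// so resident memory shrinks as contigs are consumed.
    pub fn take(&mut self, contig: &str) -> Option<Vec<u32>> {
        self.contigs.remove(contig)
    }

    /// Number of contigs still held in memory.
    pub fn resident(&self) -> usize {
        self.contigs.len()
    }
}
```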

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Import style: replace crate::common:: with common:: in runner.rs test
   module to match the convention used everywhere else in the file.

2. parse_other_for_mut return type: change Result<Option<f64>, BedErrors>
   to Result<f64, BedErrors>. The function never returned Ok(None) —
   both missing-tag and malformed-value paths returned Err. Update
   new_mut_region_record to wrap the result in Some(), and update 5 test
   assertions to compare bare f64 values.

3. Unit tests for rate helpers: add test_apply_rate_override, test_rate_at,
   and test_exclude_positions to runner.rs covering boundary splits, gap
   handling, and multi-segment exclusions that were previously only
   exercised indirectly via integration tests.

4. Trailing newlines: add missing final newline to bed_record.rs,
   bed_reader.rs, and runner.rs.

5. filter-reads boundary test: add test_filter_fastq_boundary to
   filter_lib.rs confirming that reads touching (but not overlapping)
   a BED region boundary are correctly excluded end-to-end, validating
   the overlaps() semantic fix from this branch.
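
For readers unfamiliar with the boundary cases in items 3 and 5, the sketch below shows a half-open overlap test and a rate override that splits segments at the region boundaries. It assumes BED-style 0-based, half-open coordinates; `RateSegment` and both function bodies are illustrative and are not copied from runner.rs or filter_lib.rs.

```rust
/// A run of positions with one mutation rate; half-open [start, end).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RateSegment {
    pub start: u64,
    pub end: u64,
    pub rate: f64,
}

/// Half-open intervals overlap only if each starts before the other ends.
/// Intervals that merely touch (a_end == b_start) do not overlap.
pub fn overlaps(a_start: u64, a_end: u64, b_start: u64, b_end: u64) -> bool {
    a_start < b_end && b_start < a_end
}

/// Overlay `rate` on [start, end), splitting any segment that straddles
/// a boundary. Input segments are assumed sorted and non-overlapping;
/// gaps between segments are left as gaps.
pub fn apply_rate_override(
    segments: &[RateSegment],
    start: u64,
    end: u64,
    rate: f64,
) -> Vec<RateSegment> {
    let mut out = Vec::with_capacity(segments.len() + 2);
    for seg in segments {
        if !overlaps(seg.start, seg.end, start, end) {
            out.push(*seg);
            continue;
        }
        if seg.start < start {
            // Left remainder keeps the original rate.
            out.push(RateSegment { start: seg.start, end: start, rate: seg.rate });
        }
        // Overlapping part takes the override rate.
        out.push(RateSegment {
            start: seg.start.max(start),
            end: seg.end.min(end),
            rate,
        });
        if seg.end > end {
            // Right remainder keeps the original rate.
            out.push(RateSegment { start: end, end: seg.end, rate: seg.rate });
        }
    }
    out
}
```

With a single segment covering [0, 100), overriding [40, 60) yields three segments, while overriding [100, 120) leaves the input unchanged because the intervals only touch.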

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshfactorial joshfactorial merged commit 7e44f2c into main May 18, 2026
1 check passed