Develop #123
Merged
…tra information. Will make it clear in the docs what is expected. Also, fixed a couple small things that were bugging me.
…ies into bed file, and added an input mutation regions parameter to the config. Still need to implement the logic. NEAT 2.1 read the file, but never implemented the actual mutation regions.
…to the default outside of the mutation regions bed. I'm not entirely sure what the original design was going to be, but this seems to me like a logical place to start. I manually added a readme item explaining how to use the mutation_regions bed and the rules for use. I also tweaked some of the logic for parsing the mutation regions file, deciding it was worth having a 1-bit flag so we can skip searching for a mutation rate in beds where we don't care.
…, since the field is only really needed when we know it's a mutation regions bed. Had Junie add some tests to cover this code.
Feature/custom mutation rates
…in a reversal of the expected counts of SNPs vs. INDELs. Should be fixed now.
…I think writing more files to save some memory.
…eal-world WGS data

gen-reads memory:
- Replace per-position Vec<f64> bias_map with compact (start, end, rate) segment list; rice peak RSS 3.06 GB -> 1.05 GB (~3x reduction), major page faults 109K -> 0
- SE cover_dataset fragment pool reduced to 1 entry with cyclic index, eliminating O(coverage * contig_length) VecDeque allocation and redundant RNG shuffles

gen-mut-model VCF fixes:
- Handle missing alleles (./. and ./1) in GT field without panicking
- Skip multi-allelic ALT sites (comma-separated) with a debug log
- Warn-and-skip on unparseable FORMAT/SAMPLE rather than aborting model build
- Canonicalize soft-masked FASTA bases before REF comparison; genuine mismatches warn-and-skip instead of aborting (fixes hg38 soft-masked reference)
- Fix optional bed_file key using .get() instead of [] index (panic on missing key)
- Demote per-variant "Found genotype" log from INFO to DEBUG

NA12878 WGS human model now builds successfully: mutation_rate 0.001457, homozygous_frequency 0.390, 86% SNP / 6.6% INS / 7.4% DEL, 2:15 wall clock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
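As a rough illustration of the segment-list idea in the gen-reads memory item, the sketch below collapses a per-position rate vector into (start, end, rate) runs and looks a position up by binary search. The names `Segment`, `compress`, and this `rate_at` signature are assumptions made for the example, as is the half-open [start, end) convention; the code in this branch may be organized differently.

```rust
/// One run of consecutive positions that share a rate.
/// Assumption for this sketch: intervals are half-open, [start, end).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Segment {
    pub start: usize,
    pub end: usize,
    pub rate: f64,
}

/// Collapse a per-position vector into run-length segments.
/// Memory scales with the number of rate changes, not contig length.
pub fn compress(per_position: &[f64]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    for (pos, &rate) in per_position.iter().enumerate() {
        if let Some(last) = segments.last_mut() {
            if last.rate == rate {
                // Same rate as the previous position: extend the current run.
                last.end = pos + 1;
                continue;
            }
        }
        segments.push(Segment { start: pos, end: pos + 1, rate });
    }
    segments
}

/// Look up the rate covering `pos` in a sorted, non-overlapping segment list.
/// Returns None when no segment covers the position.
pub fn rate_at(segments: &[Segment], pos: usize) -> Option<f64> {
    // Index of the first segment whose end lies beyond `pos`.
    let idx = segments.partition_point(|s| s.end <= pos);
    segments
        .get(idx)
        .filter(|s| s.start <= pos)
        .map(|s| s.rate)
}
```

A lookup costs O(log n) in the number of segments rather than O(1) for the dense vector, which is the trade made for the smaller footprint.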
…on to 1.5.0

- New "Running on HPC" section with per-subcommand SLURM headers, memory/scratch budgets, and a sizing table for bacterial/human gen-reads runs
- gen-gc-bias-model: document plain-text vs gzip memory trade-off and per-chromosome gzip strategy for full-genome runs
- Correct stale note that gzip coverage files are unsupported (added in v1.5.0)
- Bump version number in Prerequisites from 1.4.1 to 1.5.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the plain-text-only CoverageIndex/CoverageData::load API with a unified CoverageReader that transparently handles both plain-text and gzip-compressed coverage files.

Plain-text path is unchanged (byte-offset seek, one contig in RAM at a time). Gzip path streams the entire file in a single forward pass and evicts contigs from the in-memory map as they are consumed, so peak memory declines as processing advances.

Uses existing common::file_tools::file_io helpers (is_gzipped_file, read_gzip_lines) rather than reimplementing gzip detection or decoding.

Adds 5 new tests covering the gzip path, eviction behaviour, and format-parity between plain and gzip readers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
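The evict-as-consumed behaviour described for the gzip path can be sketched roughly as follows. This is not the CoverageReader from this branch: `ContigCache`, its methods, and the three-column tab-separated input (contig, position, depth) are all assumptions for illustration, and gzip decoding is left out so the example depends only on the standard library.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Hypothetical cache of per-contig depth values, filled in one forward pass.
pub struct ContigCache {
    contigs: HashMap<String, Vec<u32>>,
}

impl ContigCache {
    /// Read every line once. Assumed format: "contig<TAB>position<TAB>depth".
    /// Lines that do not fit the format are skipped in this sketch.
    pub fn from_reader<R: BufRead>(reader: R) -> io::Result<Self> {
        let mut contigs: HashMap<String, Vec<u32>> = HashMap::new();
        for line in reader.lines() {
            let line = line?;
            let mut fields = line.split('\t');
            // First field is the contig name; nth(1) skips the position column.
            let (name, depth) = match (fields.next(), fields.nth(1)) {
                (Some(n), Some(d)) => (n, d),
                _ => continue,
            };
            let depth: u32 = match depth.trim().parse() {
                Ok(d) => d,
                Err(_) => continue,
            };
            contigs.entry(name.to_string()).or_default().push(depth);
        }
        Ok(Self { contigs })
    }

    /// Hand a contig's data to the caller and drop it from the map,
    /// so resident memory shrinks as contigs are processed.
    pub fn take(&mut self, name: &str) -> Option<Vec<u32>> {
        self.contigs.remove(name)
    }

    /// Number of contigs still held in memory.
    pub fn resident(&self) -> usize {
        self.contigs.len()
    }
}
```

Because `take` moves the vector out of the map, a contig can only be consumed once; a second request for the same name returns None.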
1. Import style: replace crate::common:: with common:: in runner.rs test module to match the convention used everywhere else in the file.
2. parse_other_for_mut return type: change Result<Option<f64>, BedErrors> to Result<f64, BedErrors>. The function never returned Ok(None); both missing-tag and malformed-value paths returned Err. Update new_mut_region_record to wrap the result in Some(), and update 5 test assertions to compare bare f64 values.
3. Unit tests for rate helpers: add test_apply_rate_override, test_rate_at, and test_exclude_positions to runner.rs covering boundary splits, gap handling, and multi-segment exclusions that were previously only exercised indirectly via integration tests.
4. Trailing newlines: add missing final newline to bed_record.rs, bed_reader.rs, and runner.rs.
5. filter-reads boundary test: add test_filter_fastq_boundary to filter_lib.rs confirming that reads touching (but not overlapping) a BED region boundary are correctly excluded end-to-end, validating the overlaps() semantic fix from this branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
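The boundary case in item 5 comes down to how interval overlap is defined. A minimal sketch, assuming half-open [start, end) coordinates as in BED; the free-function signature here is illustrative and is not the overlaps() from this branch:

```rust
/// True when the half-open intervals [a_start, a_end) and [b_start, b_end)
/// share at least one position. Intervals that merely touch (one ends exactly
/// where the other starts) do not overlap under this definition.
pub fn overlaps(a_start: u64, a_end: u64, b_start: u64, b_end: u64) -> bool {
    a_start < b_end && b_start < a_end
}

/// Convenience wrapper for the touching-only case, for use in tests.
pub fn touches_only(a_start: u64, a_end: u64, b_start: u64, b_end: u64) -> bool {
    a_end == b_start || b_end == a_start
}
```

Using `<=` in place of `<` would count touching intervals as overlapping, which is the kind of off-by-one a boundary test is meant to catch.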
Added support for a custom mutation rate bed file. It takes a bed file with a column such as "mut_rate=0.0001" and produces variants at the specified mutation rate within each bed region, and at the default rate everywhere else. It can also be intersected with a target bed.
Improvements to memory usage by streaming files.
Modeled real data (GIAB - NA12878) to create baseline human models for rneat to run on.
Incremental flush of bam file, enabled by the next item.
Removed shuffling. Shuffling was too memory-expensive, and even when turned off, the architecture in place to support it was eating up memory. So we nixed it, with a recommendation to use seqkit to shuffle the fastq.
Overall, although rneat lost the ability to shuffle fastq files, which was only practical on small genomes in the first place, it made huge memory gains and can now handle full genome files on an HPC cluster, and up to medium-sized genomes on a standard desktop computer.
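For the "mut_rate=0.0001" column mentioned in the first item, a parser returning a bare f64 or an error could look roughly like this. Everything here is a sketch under assumptions: `MutRateError`, `parse_mut_rate`, and `effective_rate` are hypothetical names, the tag is assumed to be separated from other tags by semicolons or whitespace, and the 0 to 1 range check is an added sanity check rather than a documented rule.

```rust
/// Hypothetical error type for this sketch (not the BedErrors from the repo).
#[derive(Debug, PartialEq)]
pub enum MutRateError {
    MissingTag,
    Malformed(String),
    OutOfRange(f64),
}

/// Extract the value of a `mut_rate=` tag from a BED record's extra column.
/// Both a missing tag and an unparseable value are errors, so the success
/// type is a plain f64 rather than an Option.
pub fn parse_mut_rate(other: &str) -> Result<f64, MutRateError> {
    let raw = other
        .split(|c: char| c == ';' || c.is_whitespace())
        .find_map(|field| field.strip_prefix("mut_rate="))
        .ok_or(MutRateError::MissingTag)?;
    let rate: f64 = raw
        .parse()
        .map_err(|_| MutRateError::Malformed(raw.to_string()))?;
    // Assumed sanity check: a per-base rate should be a probability.
    if !(0.0..=1.0).contains(&rate) {
        return Err(MutRateError::OutOfRange(rate));
    }
    Ok(rate)
}

/// Rate to use at a position: the region's rate if one applies,
/// otherwise the default rate.
pub fn effective_rate(region_rate: Option<f64>, default_rate: f64) -> f64 {
    region_rate.unwrap_or(default_rate)
}
```

With these definitions, a record carrying "mut_rate=0.0001" yields 0.0001 inside its region, and positions outside any region fall back to whatever default is passed to `effective_rate`.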