Develop#131
Merged
Merged
Conversation
Modern sequencers (NovaSeq 6000, NextSeq 1000/2000) quantize per-base quality into a small set of discrete bins instead of emitting the continuous Q0-Q40 range. This adds a `binned_quality_bins` option to gen-seq-error-model so users can train a model that snaps observed quality scores to a configurable bin set; gen-reads then emits only those bin values when sampling from the model. - New optional `binned_quality_bins: [..]` YAML field on the gen-seq-error-model config. Validated for non-empty, < 94, and excludes 31 (encodes to '@' under Phred+33 and would corrupt FASTQ output). Sorted and deduped at parse time. - accumulate_qual now snaps each decoded Q-score to the nearest bin during count accumulation (ties round down), so seed, transition, and global counts are all in bin space; error_rate naturally reflects what reads will look like. - QualityScoreModel::from_counts gains an is_binned parameter and now sets the previously-hardcoded binned_scores field accordingly. Adds a new InvalidConfiguration error variant. - Public accessor SequencingErrorModel::quality_score_model() for tests and consumers that need model metadata. - Zero-count bins are kept in quality_score_options and fall back to the existing uniform-row behavior, with a warn! listing them. - README and template_config updated with the new field, validation rules, and suggested bin sets per platform. No gen-reads changes needed: generate_quality_scores already restricts output to quality_score_options, which is now constrained to the bin set when the model is binned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add binned quality scores support for modern Illumina platforms. Did some cleanup on the readme.
Pure additive coverage based on the rneat-test-coverage-plan. No new dev-deps and no new test directories — every change is inside an existing `#[cfg(test)] mod tests` block. Binned quality follow-ups (3 tests, sections 1.1): - QualityScoreModel: JSON-GZ round trip preserves binned_scores and quality_score_options bit-for-bit. - SequencingErrorModel wrapper round-trips a binned inner model and emits only bin values when sampled via the wrapper API. - gen-seq-error-model runner: binned vs unbinned error_rate must differ (locks in that bin-snapped counts feed the rate calculation). VCF parser robustness (6 tests, section 1.2): - Multi-allelic ALT records are skipped while sibling SNPs pass. - `<DEL>`-style SV ALTs parse without panicking (regression bait). - INFO strings with embedded semicolons survive intact (tab is the only field delimiter). - Gzipped VCF input parses identically to plain. - `./.` genotype is treated as homozygous today; lock that in until a deliberate fix. - Mixed phased (`0|1`) and unphased (`0/1`) GTs both parse as Heterozygous. gen-gc-bias-model gzip equivalence (2 tests, section 1.3): - Plain text vs gzipped coverage yields identical depths for bedtools-genomecov-d and bedtools-genomecov-dz. SamtoolsDepth case already existed. gen-mut-model expansion (3 tests, section 1.4): - Indels: a VCF with one insertion and one deletion runs to a written model without erroring. - Reference-mismatch records are logged-and-skipped, not fatal. - BED entries for nonexistent contigs do not panic the runner. filter_reads edge cases (3 tests, section 1.5): - Empty FASTQ in → empty gzipped FASTQ out. - BED with no overlap → every read filtered, no error. - VCF with no passing records keeps header lines but emits no data. gen-frag-length-model edge cases (3 tests, section 1.6): - All-zero-TLEN BAM → EmptyData (codifies the tlen>0 reader filter). - Single-ended BAM → EmptyData (codifies the SEGMENTED flag filter). - Low-MAPQ-only BAM → EmptyData (codifies the MAPQ>10 filter). Adds two new test-only helpers for these flag/MAPQ permutations. gen-seq-error-model negative paths (2 tests, section 1.7): - Truncated FASTQ (3 lines, no qual) → MalformedFastq. - First record with empty qual line → MalformedFastq. cargo test --workspace: 377 passing (up from 355). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new rneat/tests/ harness that exercises the real binary boundary (CLI → config → model → output) via assert_cmd. Four integration suites with 12 tests total, plus one prod fix that the suite caught on its first run. New harness: - tests/common/mod.rs — shared helpers (binary command, fixtures, config builders, decompression). Also provides a GenReadsConfig builder with paired-ended, model, thread, and seed knobs. - New dev-deps: assert_cmd, predicates. cli_smoke.rs (4 tests): - `rneat --help` lists all 6 subcommands. - Each subcommand's `--help` exits 0 and mentions --configuration-yaml. - Missing config file → non-zero exit + stderr error message. - No arguments → non-zero exit + help text on stderr. pipeline_e2e.rs (2 tests): - gen-reads with default model produces a structurally well-formed FASTQ (multiple of 4 lines, '@'/+' markers, seq.len == qual.len). - gen-seq-error-model with binned_quality_bins → gen-reads → only bin-valued qualities appear in the output FASTQ. Prod fix caught by the second pipeline test: gen-reads previously loaded `quality_score_model` independently from `sequence_error_model`, so the QualityScoreModel embedded in a trained SequencingErrorModel was silently ignored. When a user set `sequence_error_model:` without a separate `quality_score_model:`, gen-reads quietly fell back to the built-in default — meaning binned-quality training had no effect on output. Fixed in gen_reads/utils/runner.rs by falling through to the SeqErrorModel's embedded QSM when no explicit override is configured. This matches the user-facing docstring in gen_reads_template.yml. determinism.rs (3 tests): - Same seed, single-threaded → same record multiset. - Same seed, multi-threaded → same record multiset. - Different seeds → different output (seed argument is load-bearing). Comparisons are on decompressed contents (gzip headers carry mtime). Multiset rather than byte-identical because rneat iterates HashMaps during contig assembly, so the line order in the output is non- deterministic even with num_threads=1; the record *set* is stable. fastq_validation.rs (3 tests): - Single-ended FASTQ passes strict structural validation (ACGTN-only seq, printable-ASCII qual, seq.len == qual.len) and every read's length matches the configured read_len. - Paired-end run produces both _r1 and _r2 with equal record counts; R1 names end in /1, R2 names end in /2, and name stems match pairwise. - Every quality byte decodes to a valid Phred+33 score in [0, 93]. cargo test --workspace: 12 new integration tests + 355 existing, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DiscreteDistribution::sample is hot-path code shared by every model (quality scores, fragment lengths, indel lengths, nucleotide selection). Blanketing its CDF binary search with randomized inputs is cheap insurance against off-by-one bugs that the 7 hand-rolled fixtures here can't surface. Adds `proptest = "1"` as a common dev-dep and three properties on DiscreteDistribution: - `proptest_sample_returns_value_from_input_set`: for any positive weight vector of length 2..=10 and any rand ∈ [0, 1], the sample output is in the input value set. Catches any future bisect bug that returns an out-of-range index. proptest's shrink space naturally exercises the rand=0.0 and rand=1.0 boundaries. - `proptest_empirical_frequency_approaches_weights`: 20,000 samples drawn through a deterministic NeatRng must produce a histogram within 3% of N per bin (well above the worst-case binomial standard deviation, comfortably below the threshold a systematic CDF bug would induce). Catches any off-by-one in cumulative_sum or the bisect predicate that would show up as a measurable skew. - `proptest_all_zero_weights_always_returns_first_value`: documents the actual (non-uniform!) behavior when all weights are zero. The `new()` constructor sets the CDF to `[1.0; n]`, which makes bisect-left return index 0 on every sample. Note that this is *not* a uniform fallback — the uniform fallback for quality models lives one layer up in QualityScoreModel::from_counts. Renaming this contract would now be a deliberate change, not a silent regression. cargo test --workspace: 200 common tests (was 197), 158 rneat tests, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add support for binned quality scores in Illumina sequencers and test phase 1 coverage
Test/phase 2 integration
Phase 3 test coverage: 3 proptest properties for DiscreteDistribution
Updated the changelog with version 1.5.1 details, including new features and tests.
Update CHANGELOG for rneat v1.5.1
…omes IUPAC codes (R/Y/M/K/S/W/H/B/V/D) in reference FASTAs were previously silently mapped to N, producing excess-N reads indistinguishable from assembly gaps. gen-reads now resolves each code to one of its constituent bases at reference-load time using the per-contig seeded RNG, making results reproducible while avoiding systematic reference bias. FastaStream now yields raw sequence strings; resolve_iupac_bases() in common::file_tools::fasta_stream performs the stochastic conversion. gen-mut-model and gen-gc-bias-model continue using simple Nucleotide::from conversion (IUPAC → N), appropriate for model-building from real VCF data. Bumps version to 1.5.2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: stochastic resolution of IUPAC ambiguity codes in reference gen…
- snap_to_bin doc: drop unsupported "matches Illumina behavior" claim - gen_mut_model, gen_gc_bias_model: comment why IUPAC→N is intentional - Add trailing newlines to sequencing_error_model.rs and gen_seq_error_model/utils/runner.rs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: address code review feedback
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Binned quality scores option and adding lots of tests. Now deals with nonstandard IUPAC codes in a graceful way (convert to known base and mutate as normal).