Develop by joshfactorial · Pull Request #131 · ncsa/rusty-neat

joshfactorial · 2026-05-20T07:06:55Z

Binned quality scores option and adding lots of tests. Now deals with nonstandard IUPAC codes in a graceful way (convert to known base and mutate as normal).

Modern sequencers (NovaSeq 6000, NextSeq 1000/2000) quantize per-base quality into a small set of discrete bins instead of emitting the continuous Q0-Q40 range. This adds a `binned_quality_bins` option to gen-seq-error-model so users can train a model that snaps observed quality scores to a configurable bin set; gen-reads then emits only those bin values when sampling from the model. - New optional `binned_quality_bins: [..]` YAML field on the gen-seq-error-model config. Validated for non-empty, < 94, and excludes 31 (encodes to '@' under Phred+33 and would corrupt FASTQ output). Sorted and deduped at parse time. - accumulate_qual now snaps each decoded Q-score to the nearest bin during count accumulation (ties round down), so seed, transition, and global counts are all in bin space; error_rate naturally reflects what reads will look like. - QualityScoreModel::from_counts gains an is_binned parameter and now sets the previously-hardcoded binned_scores field accordingly. Adds a new InvalidConfiguration error variant. - Public accessor SequencingErrorModel::quality_score_model() for tests and consumers that need model metadata. - Zero-count bins are kept in quality_score_options and fall back to the existing uniform-row behavior, with a warn! listing them. - README and template_config updated with the new field, validation rules, and suggested bin sets per platform. No gen-reads changes needed: generate_quality_scores already restricts output to quality_score_options, which is now constrained to the bin set when the model is binned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add binned quality scores support for modern Illumina platforms. Did some cleanup on the readme.

Pure additive coverage based on the rneat-test-coverage-plan. No new dev-deps and no new test directories — every change is inside an existing `#[cfg(test)] mod tests` block. Binned quality follow-ups (3 tests, sections 1.1): - QualityScoreModel: JSON-GZ round trip preserves binned_scores and quality_score_options bit-for-bit. - SequencingErrorModel wrapper round-trips a binned inner model and emits only bin values when sampled via the wrapper API. - gen-seq-error-model runner: binned vs unbinned error_rate must differ (locks in that bin-snapped counts feed the rate calculation). VCF parser robustness (6 tests, section 1.2): - Multi-allelic ALT records are skipped while sibling SNPs pass. - `<DEL>`-style SV ALTs parse without panicking (regression bait). - INFO strings with embedded semicolons survive intact (tab is the only field delimiter). - Gzipped VCF input parses identically to plain. - `./.` genotype is treated as homozygous today; lock that in until a deliberate fix. - Mixed phased (`0|1`) and unphased (`0/1`) GTs both parse as Heterozygous. gen-gc-bias-model gzip equivalence (2 tests, section 1.3): - Plain text vs gzipped coverage yields identical depths for bedtools-genomecov-d and bedtools-genomecov-dz. SamtoolsDepth case already existed. gen-mut-model expansion (3 tests, section 1.4): - Indels: a VCF with one insertion and one deletion runs to a written model without erroring. - Reference-mismatch records are logged-and-skipped, not fatal. - BED entries for nonexistent contigs do not panic the runner. filter_reads edge cases (3 tests, section 1.5): - Empty FASTQ in → empty gzipped FASTQ out. - BED with no overlap → every read filtered, no error. - VCF with no passing records keeps header lines but emits no data. gen-frag-length-model edge cases (3 tests, section 1.6): - All-zero-TLEN BAM → EmptyData (codifies the tlen>0 reader filter). - Single-ended BAM → EmptyData (codifies the SEGMENTED flag filter). - Low-MAPQ-only BAM → EmptyData (codifies the MAPQ>10 filter). Adds two new test-only helpers for these flag/MAPQ permutations. gen-seq-error-model negative paths (2 tests, section 1.7): - Truncated FASTQ (3 lines, no qual) → MalformedFastq. - First record with empty qual line → MalformedFastq. cargo test --workspace: 377 passing (up from 355). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a new rneat/tests/ harness that exercises the real binary boundary (CLI → config → model → output) via assert_cmd. Four integration suites with 12 tests total, plus one prod fix that the suite caught on its first run. New harness: - tests/common/mod.rs — shared helpers (binary command, fixtures, config builders, decompression). Also provides a GenReadsConfig builder with paired-ended, model, thread, and seed knobs. - New dev-deps: assert_cmd, predicates. cli_smoke.rs (4 tests): - `rneat --help` lists all 6 subcommands. - Each subcommand's `--help` exits 0 and mentions --configuration-yaml. - Missing config file → non-zero exit + stderr error message. - No arguments → non-zero exit + help text on stderr. pipeline_e2e.rs (2 tests): - gen-reads with default model produces a structurally well-formed FASTQ (multiple of 4 lines, '@'/+' markers, seq.len == qual.len). - gen-seq-error-model with binned_quality_bins → gen-reads → only bin-valued qualities appear in the output FASTQ. Prod fix caught by the second pipeline test: gen-reads previously loaded `quality_score_model` independently from `sequence_error_model`, so the QualityScoreModel embedded in a trained SequencingErrorModel was silently ignored. When a user set `sequence_error_model:` without a separate `quality_score_model:`, gen-reads quietly fell back to the built-in default — meaning binned-quality training had no effect on output. Fixed in gen_reads/utils/runner.rs by falling through to the SeqErrorModel's embedded QSM when no explicit override is configured. This matches the user-facing docstring in gen_reads_template.yml. determinism.rs (3 tests): - Same seed, single-threaded → same record multiset. - Same seed, multi-threaded → same record multiset. - Different seeds → different output (seed argument is load-bearing). Comparisons are on decompressed contents (gzip headers carry mtime). Multiset rather than byte-identical because rneat iterates HashMaps during contig assembly, so the line order in the output is non- deterministic even with num_threads=1; the record *set* is stable. fastq_validation.rs (3 tests): - Single-ended FASTQ passes strict structural validation (ACGTN-only seq, printable-ASCII qual, seq.len == qual.len) and every read's length matches the configured read_len. - Paired-end run produces both _r1 and _r2 with equal record counts; R1 names end in /1, R2 names end in /2, and name stems match pairwise. - Every quality byte decodes to a valid Phred+33 score in [0, 93]. cargo test --workspace: 12 new integration tests + 355 existing, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DiscreteDistribution::sample is hot-path code shared by every model (quality scores, fragment lengths, indel lengths, nucleotide selection). Blanketing its CDF binary search with randomized inputs is cheap insurance against off-by-one bugs that the 7 hand-rolled fixtures here can't surface. Adds `proptest = "1"` as a common dev-dep and three properties on DiscreteDistribution: - `proptest_sample_returns_value_from_input_set`: for any positive weight vector of length 2..=10 and any rand ∈ [0, 1], the sample output is in the input value set. Catches any future bisect bug that returns an out-of-range index. proptest's shrink space naturally exercises the rand=0.0 and rand=1.0 boundaries. - `proptest_empirical_frequency_approaches_weights`: 20,000 samples drawn through a deterministic NeatRng must produce a histogram within 3% of N per bin (well above the worst-case binomial standard deviation, comfortably below the threshold a systematic CDF bug would induce). Catches any off-by-one in cumulative_sum or the bisect predicate that would show up as a measurable skew. - `proptest_all_zero_weights_always_returns_first_value`: documents the actual (non-uniform!) behavior when all weights are zero. The `new()` constructor sets the CDF to `[1.0; n]`, which makes bisect-left return index 0 on every sample. Note that this is *not* a uniform fallback — the uniform fallback for quality models lives one layer up in QualityScoreModel::from_counts. Renaming this contract would now be a deliberate change, not a silent regression. cargo test --workspace: 200 common tests (was 197), 158 rneat tests, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add support for binned quality scores in Illumina sequencers and test phase 1 coverage

Test/phase 2 integration

Phase 3 test coverage: 3 proptest properties for DiscreteDistribution

Updated the changelog with version 1.5.1 details, including new features and tests.

Update CHANGELOG for rneat v1.5.1

…omes IUPAC codes (R/Y/M/K/S/W/H/B/V/D) in reference FASTAs were previously silently mapped to N, producing excess-N reads indistinguishable from assembly gaps. gen-reads now resolves each code to one of its constituent bases at reference-load time using the per-contig seeded RNG, making results reproducible while avoiding systematic reference bias. FastaStream now yields raw sequence strings; resolve_iupac_bases() in common::file_tools::fasta_stream performs the stochastic conversion. gen-mut-model and gen-gc-bias-model continue using simple Nucleotide::from conversion (IUPAC → N), appropriate for model-building from real VCF data. Bumps version to 1.5.2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: stochastic resolution of IUPAC ambiguity codes in reference gen…

- snap_to_bin doc: drop unsupported "matches Illumina behavior" claim - gen_mut_model, gen_gc_bias_model: comment why IUPAC→N is intentional - Add trailing newlines to sequencing_error_model.rs and gen_seq_error_model/utils/runner.rs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: address code review feedback

joshfactorial and others added 20 commits May 20, 2026 02:00

Merge pull request #130 from ncsa/feature/binned-quality-scores

4e63afe

Add binned quality scores support for modern Illumina platforms. Did some cleanup on the readme.

Merge pull request #133 from ncsa/test/phase-1-coverage

0e1ecd8

Add support for binned quality scores in Illumina sequencers and test phase 1 coverage

Merge pull request #134 from ncsa/test/phase-2-integration

6af1373

Test/phase 2 integration

Merge branch 'develop' into test/phase-3-proptest

9e4b92c

Fixing cargo lock file

56acec2

Fixing lock file

4bf2fd9

Merge pull request #132 from ncsa/test/phase-3-proptest

d767ae8

Phase 3 test coverage: 3 proptest properties for DiscreteDistribution

Update CHANGELOG for rneat v1.5.1

60ed491

Updated the changelog with version 1.5.1 details, including new features and tests.

updated minor version

212817d

Merge pull request #135 from ncsa/joshfactorial-patch-1

949f386

Update CHANGELOG for rneat v1.5.1

fix: revert version to 1.5.1

f1936af

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove redundant header

b44b1b5

Merge pull request #136 from ncsa/feat/iupac-resolution

fbb3ca1

feat: stochastic resolution of IUPAC ambiguity codes in reference gen…

Merge pull request #138 from ncsa/fix/review-feedback

61b4cee

fix: address code review feedback

joshfactorial merged commit 87c110b into main May 20, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop#131

Develop#131
joshfactorial merged 20 commits into
mainfrom
develop

joshfactorial commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

joshfactorial commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

joshfactorial commented May 20, 2026 •

edited

Loading