Skip to content

Develop#131

Merged
joshfactorial merged 20 commits into
mainfrom
develop
May 20, 2026
Merged

Develop#131
joshfactorial merged 20 commits into
mainfrom
develop

Conversation

@joshfactorial
Copy link
Copy Markdown
Collaborator

@joshfactorial joshfactorial commented May 20, 2026

Binned quality scores option and adding lots of tests. Now deals with nonstandard IUPAC codes in a graceful way (convert to known base and mutate as normal).

joshfactorial and others added 20 commits May 20, 2026 02:00
Modern sequencers (NovaSeq 6000, NextSeq 1000/2000) quantize per-base
quality into a small set of discrete bins instead of emitting the
continuous Q0-Q40 range. This adds a `binned_quality_bins` option to
gen-seq-error-model so users can train a model that snaps observed
quality scores to a configurable bin set; gen-reads then emits only
those bin values when sampling from the model.

- New optional `binned_quality_bins: [..]` YAML field on the
  gen-seq-error-model config. Validated for non-empty, < 94, and
  excludes 31 (encodes to '@' under Phred+33 and would corrupt FASTQ
  output). Sorted and deduped at parse time.
- accumulate_qual now snaps each decoded Q-score to the nearest bin
  during count accumulation (ties round down), so seed, transition,
  and global counts are all in bin space; error_rate naturally
  reflects what reads will look like.
- QualityScoreModel::from_counts gains an is_binned parameter and now
  sets the previously-hardcoded binned_scores field accordingly. Adds
  a new InvalidConfiguration error variant.
- Public accessor SequencingErrorModel::quality_score_model() for
  tests and consumers that need model metadata.
- Zero-count bins are kept in quality_score_options and fall back to
  the existing uniform-row behavior, with a warn! listing them.
- README and template_config updated with the new field, validation
  rules, and suggested bin sets per platform.

No gen-reads changes needed: generate_quality_scores already restricts
output to quality_score_options, which is now constrained to the bin
set when the model is binned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add binned quality scores support for modern Illumina platforms. Did some cleanup on the readme.
Pure additive coverage based on the rneat-test-coverage-plan. No new
dev-deps and no new test directories — every change is inside an
existing `#[cfg(test)] mod tests` block.

Binned quality follow-ups (3 tests, sections 1.1):
- QualityScoreModel: JSON-GZ round trip preserves binned_scores and
  quality_score_options bit-for-bit.
- SequencingErrorModel wrapper round-trips a binned inner model and
  emits only bin values when sampled via the wrapper API.
- gen-seq-error-model runner: binned vs unbinned error_rate must
  differ (locks in that bin-snapped counts feed the rate calculation).

VCF parser robustness (6 tests, section 1.2):
- Multi-allelic ALT records are skipped while sibling SNPs pass.
- `<DEL>`-style SV ALTs parse without panicking (regression bait).
- INFO strings with embedded semicolons survive intact (tab is the
  only field delimiter).
- Gzipped VCF input parses identically to plain.
- `./.` genotype is treated as homozygous today; lock that in until a
  deliberate fix.
- Mixed phased (`0|1`) and unphased (`0/1`) GTs both parse as
  Heterozygous.

gen-gc-bias-model gzip equivalence (2 tests, section 1.3):
- Plain text vs gzipped coverage yields identical depths for
  bedtools-genomecov-d and bedtools-genomecov-dz. SamtoolsDepth case
  already existed.

gen-mut-model expansion (3 tests, section 1.4):
- Indels: a VCF with one insertion and one deletion runs to a written
  model without erroring.
- Reference-mismatch records are logged-and-skipped, not fatal.
- BED entries for nonexistent contigs do not panic the runner.

filter_reads edge cases (3 tests, section 1.5):
- Empty FASTQ in → empty gzipped FASTQ out.
- BED with no overlap → every read filtered, no error.
- VCF with no passing records keeps header lines but emits no data.

gen-frag-length-model edge cases (3 tests, section 1.6):
- All-zero-TLEN BAM → EmptyData (codifies the tlen>0 reader filter).
- Single-ended BAM → EmptyData (codifies the SEGMENTED flag filter).
- Low-MAPQ-only BAM → EmptyData (codifies the MAPQ>10 filter).
  Adds two new test-only helpers for these flag/MAPQ permutations.

gen-seq-error-model negative paths (2 tests, section 1.7):
- Truncated FASTQ (3 lines, no qual) → MalformedFastq.
- First record with empty qual line → MalformedFastq.

cargo test --workspace: 377 passing (up from 355).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new rneat/tests/ harness that exercises the real binary boundary
(CLI → config → model → output) via assert_cmd. Four integration
suites with 12 tests total, plus one prod fix that the suite caught
on its first run.

New harness:
- tests/common/mod.rs — shared helpers (binary command, fixtures,
  config builders, decompression). Also provides a GenReadsConfig
  builder with paired-ended, model, thread, and seed knobs.
- New dev-deps: assert_cmd, predicates.

cli_smoke.rs (4 tests):
- `rneat --help` lists all 6 subcommands.
- Each subcommand's `--help` exits 0 and mentions
  --configuration-yaml.
- Missing config file → non-zero exit + stderr error message.
- No arguments → non-zero exit + help text on stderr.

pipeline_e2e.rs (2 tests):
- gen-reads with default model produces a structurally well-formed
  FASTQ (multiple of 4 lines, '@'/+' markers, seq.len == qual.len).
- gen-seq-error-model with binned_quality_bins → gen-reads → only
  bin-valued qualities appear in the output FASTQ.

Prod fix caught by the second pipeline test: gen-reads previously
loaded `quality_score_model` independently from `sequence_error_model`,
so the QualityScoreModel embedded in a trained SequencingErrorModel
was silently ignored. When a user set `sequence_error_model:` without
a separate `quality_score_model:`, gen-reads quietly fell back to the
built-in default — meaning binned-quality training had no effect on
output. Fixed in gen_reads/utils/runner.rs by falling through to the
SeqErrorModel's embedded QSM when no explicit override is configured.
This matches the user-facing docstring in gen_reads_template.yml.

determinism.rs (3 tests):
- Same seed, single-threaded → same record multiset.
- Same seed, multi-threaded → same record multiset.
- Different seeds → different output (seed argument is load-bearing).
Comparisons are on decompressed contents (gzip headers carry mtime).
Multiset rather than byte-identical because rneat iterates HashMaps
during contig assembly, so the line order in the output is non-
deterministic even with num_threads=1; the record *set* is stable.

fastq_validation.rs (3 tests):
- Single-ended FASTQ passes strict structural validation
  (ACGTN-only seq, printable-ASCII qual, seq.len == qual.len) and
  every read's length matches the configured read_len.
- Paired-end run produces both _r1 and _r2 with equal record counts;
  R1 names end in /1, R2 names end in /2, and name stems match
  pairwise.
- Every quality byte decodes to a valid Phred+33 score in [0, 93].

cargo test --workspace: 12 new integration tests + 355 existing,
all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DiscreteDistribution::sample is hot-path code shared by every model
(quality scores, fragment lengths, indel lengths, nucleotide
selection). Blanketing its CDF binary search with randomized inputs
is cheap insurance against off-by-one bugs that the 7 hand-rolled
fixtures here can't surface.

Adds `proptest = "1"` as a common dev-dep and three properties on
DiscreteDistribution:

- `proptest_sample_returns_value_from_input_set`: for any positive
  weight vector of length 2..=10 and any rand ∈ [0, 1], the sample
  output is in the input value set. Catches any future bisect bug
  that returns an out-of-range index. proptest's shrink space
  naturally exercises the rand=0.0 and rand=1.0 boundaries.

- `proptest_empirical_frequency_approaches_weights`: 20,000 samples
  drawn through a deterministic NeatRng must produce a histogram
  within 3% of N per bin (well above the worst-case binomial standard
  deviation, comfortably below the threshold a systematic CDF bug
  would induce). Catches any off-by-one in cumulative_sum or the
  bisect predicate that would show up as a measurable skew.

- `proptest_all_zero_weights_always_returns_first_value`: documents
  the actual (non-uniform!) behavior when all weights are zero. The
  `new()` constructor sets the CDF to `[1.0; n]`, which makes
  bisect-left return index 0 on every sample. Note that this is *not*
  a uniform fallback — the uniform fallback for quality models lives
  one layer up in QualityScoreModel::from_counts. Renaming this
  contract would now be a deliberate change, not a silent regression.

cargo test --workspace: 200 common tests (was 197), 158 rneat tests,
all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add support for binned quality scores in Illumina sequencers and test phase 1 coverage
Phase 3 test coverage: 3 proptest properties for DiscreteDistribution
Updated the changelog with version 1.5.1 details, including new features and tests.
Update CHANGELOG for rneat v1.5.1
…omes

IUPAC codes (R/Y/M/K/S/W/H/B/V/D) in reference FASTAs were previously
silently mapped to N, producing excess-N reads indistinguishable from
assembly gaps. gen-reads now resolves each code to one of its constituent
bases at reference-load time using the per-contig seeded RNG, making
results reproducible while avoiding systematic reference bias.

FastaStream now yields raw sequence strings; resolve_iupac_bases() in
common::file_tools::fasta_stream performs the stochastic conversion.
gen-mut-model and gen-gc-bias-model continue using simple Nucleotide::from
conversion (IUPAC → N), appropriate for model-building from real VCF data.

Bumps version to 1.5.2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: stochastic resolution of IUPAC ambiguity codes in reference gen…
- snap_to_bin doc: drop unsupported "matches Illumina behavior" claim
- gen_mut_model, gen_gc_bias_model: comment why IUPAC→N is intentional
- Add trailing newlines to sequencing_error_model.rs and
  gen_seq_error_model/utils/runner.rs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: address code review feedback
@joshfactorial joshfactorial merged commit 87c110b into main May 20, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant