Add binned quality scores support for modern Illumina platforms #130

Merged
joshfactorial merged 1 commit into develop from feature/binned-quality-scores
May 20, 2026

@joshfactorial (Collaborator)

Modern sequencers (NovaSeq 6000, NextSeq 1000/2000) quantize per-base quality into a small set of discrete bins instead of emitting the continuous Q0-Q40 range. This adds a binned_quality_bins option to gen-seq-error-model so users can train a model that snaps observed quality scores to a configurable bin set; gen-reads then emits only those bin values when sampling from the model.
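
The snapping step described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `snap_to_bin` and the bin set in the usage note are hypothetical, and the tie rule (the lower bin wins) follows the "ties round down" behavior stated in this description.

```rust
/// Snap a decoded Phred score `q` to the nearest value in `bins`.
/// Returns `None` when `bins` is empty.
///
/// Hypothetical helper for illustration only; not the project's API.
fn snap_to_bin(q: u8, bins: &[u8]) -> Option<u8> {
    // Order candidates by (distance, bin value): the smallest distance wins,
    // and on an exact tie the smaller bin value wins (ties round down).
    bins.iter().copied().min_by_key(|&b| (q.abs_diff(b), b))
}
```

With an illustrative bin set of `[2, 12, 23, 37]`, an observed Q30 is equidistant from 23 and 37 and therefore snaps to 23, while Q40 snaps to 37.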

  • New optional binned_quality_bins: [..] YAML field on the gen-seq-error-model config. Validation requires a non-empty list in which every value is < 94 and none is 31 (Q31 encodes to '@' under Phred+33 and would corrupt FASTQ output). The list is sorted and deduplicated at parse time.
  • accumulate_qual now snaps each decoded Q-score to the nearest bin during count accumulation (ties round down), so seed, transition, and global counts are all in bin space and error_rate is computed from the same binned scores that generated reads will carry.
  • QualityScoreModel::from_counts gains an is_binned parameter and now sets the previously-hardcoded binned_scores field accordingly. Adds a new InvalidConfiguration error variant.
  • Public accessor SequencingErrorModel::quality_score_model() for tests and consumers that need model metadata.
  • Zero-count bins are kept in quality_score_options and fall back to the existing uniform-row behavior, with a warn! listing them.
  • README and template_config updated with the new field, validation rules, and suggested bin sets per platform.
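
For readers who want to see the validation rules in one place, here is a minimal sketch of what parse-time checking and normalization of the bin list could look like. The names `BinError` and `normalize_bins` are hypothetical and are not taken from this PR, which reports failures through its own InvalidConfiguration variant; only the rules themselves (non-empty, values < 94, no 31, sorted, deduplicated) come from the description above.

```rust
/// Hypothetical error type for this sketch; the PR itself adds an
/// `InvalidConfiguration` variant to the project's existing error enum.
#[derive(Debug, PartialEq)]
enum BinError {
    Empty,
    OutOfRange(u8),
    Reserved(u8),
}

/// Validate a user-supplied bin list, then return it sorted and deduplicated.
fn normalize_bins(raw: &[u8]) -> Result<Vec<u8>, BinError> {
    if raw.is_empty() {
        return Err(BinError::Empty);
    }
    for &q in raw {
        // Phred+33 maps Q93 to '~' (ASCII 126), the last printable character.
        if q >= 94 {
            return Err(BinError::OutOfRange(q));
        }
        // 31 + 33 = 64, which is '@', the FASTQ record-header marker.
        if q == 31 {
            return Err(BinError::Reserved(q));
        }
    }
    let mut bins = raw.to_vec();
    bins.sort_unstable();
    bins.dedup(); // removes adjacent duplicates, so it must follow the sort
    Ok(bins)
}
```

Under these assumptions, `normalize_bins(&[37, 2, 23, 12, 2])` yields `Ok(vec![2, 12, 23, 37])`, while an empty list, a value of 94 or more, or a 31 is rejected.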

No gen-reads changes needed: generate_quality_scores already restricts output to quality_score_options, which is now constrained to the bin set when the model is binned.
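
To illustrate why restricting sampling to quality_score_options is sufficient, the sketch below picks a score from one row of counts and falls back to a uniform choice when the row is all zeros. It is an assumption-laden illustration, not the body of generate_quality_scores: the function name `pick_quality`, the count-row layout, and passing a pre-drawn uniform number `u` in [0, 1) instead of an RNG are all hypothetical choices made to keep the example deterministic.

```rust
/// Pick one score from `options`, weighted by the matching entry in `counts`.
/// `u` is assumed to be a uniform draw in [0, 1).
/// Returns `None` if the slices are empty or differ in length.
///
/// Hypothetical sketch; the real sampler lives in gen-reads.
fn pick_quality(options: &[u8], counts: &[u64], u: f64) -> Option<u8> {
    if options.is_empty() || options.len() != counts.len() {
        return None;
    }
    let total: u64 = counts.iter().sum();
    if total == 0 {
        // Uniform-row fallback: every option is equally likely.
        let idx = ((u * options.len() as f64) as usize).min(options.len() - 1);
        return Some(options[idx]);
    }
    // Walk the cumulative counts until they exceed the scaled draw.
    let target = u * total as f64;
    let mut cumulative = 0u64;
    for (&q, &c) in options.iter().zip(counts) {
        cumulative += c;
        if target < cumulative as f64 {
            return Some(q);
        }
    }
    // Only reached if `u` is outside [0, 1): use the last observed option.
    options
        .iter()
        .zip(counts)
        .rev()
        .find(|pair| *pair.1 > 0)
        .map(|pair| *pair.0)
}
```

Because every returned value comes from `options`, a model whose options are the bin set can only ever emit bin values, whichever row is used.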

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshfactorial merged commit 4e63afe into develop May 20, 2026
1 check passed
@joshfactorial deleted the feature/binned-quality-scores branch May 20, 2026 07:07