Add binned quality scores support for modern Illumina platforms #130
Merged
Modern sequencers (NovaSeq 6000, NextSeq 1000/2000) quantize per-base quality into a small set of discrete bins instead of emitting the continuous Q0-Q40 range. This adds a `binned_quality_bins` option to gen-seq-error-model so users can train a model that snaps observed quality scores to a configurable bin set; gen-reads then emits only those bin values when sampling from the model.

- New optional `binned_quality_bins: [..]` YAML field on the gen-seq-error-model config. Validated as non-empty, with every value < 94 and not equal to 31 (which encodes to '@' under Phred+33 and would corrupt FASTQ output). Sorted and deduped at parse time.
- accumulate_qual now snaps each decoded Q-score to the nearest bin during count accumulation (ties round down), so seed, transition, and global counts are all in bin space; error_rate naturally reflects what reads will look like.
- QualityScoreModel::from_counts gains an is_binned parameter and now sets the previously-hardcoded binned_scores field accordingly. Adds a new InvalidConfiguration error variant.
- Public accessor SequencingErrorModel::quality_score_model() for tests and consumers that need model metadata.
- Zero-count bins are kept in quality_score_options and fall back to the existing uniform-row behavior, with a warn! listing them.
- README and template_config updated with the new field, validation rules, and suggested bin sets per platform.

No gen-reads changes needed: generate_quality_scores already restricts output to quality_score_options, which is now constrained to the bin set when the model is binned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
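For reviewers who want a concrete picture of the validation and snapping rules described above, here is a minimal Rust sketch. It is not the code from this PR: `normalize_bins`, `snap_to_bin`, and the `BinError` enum are hypothetical names invented for illustration (the PR itself reports failures through an `InvalidConfiguration` variant whose shape is not shown here), and the bin set in the usage note is only an example.

```rust
/// Hypothetical error type for this sketch only; the PR's real error is an
/// `InvalidConfiguration` variant whose payload is not shown in the description.
#[derive(Debug, PartialEq)]
pub enum BinError {
    /// The bin list was empty.
    Empty,
    /// A bin was >= 94 and cannot be encoded as a printable Phred+33 character.
    OutOfRange(u8),
    /// Bin 31 is rejected because 31 + 33 = 64, which is '@' in ASCII.
    Reserved(u8),
}

/// Validate a raw bin list, then sort and dedupe it, mirroring the rules
/// listed in the description (non-empty, every value < 94, no 31).
pub fn normalize_bins(raw: &[u8]) -> Result<Vec<u8>, BinError> {
    if raw.is_empty() {
        return Err(BinError::Empty);
    }
    for &q in raw {
        if q >= 94 {
            return Err(BinError::OutOfRange(q));
        }
        if q == 31 {
            return Err(BinError::Reserved(q));
        }
    }
    let mut bins = raw.to_vec();
    bins.sort_unstable();
    bins.dedup();
    Ok(bins)
}

/// Snap a decoded Q-score to the nearest bin, with ties going to the lower bin.
/// Assumes `bins` is non-empty and sorted ascending (as `normalize_bins` returns).
pub fn snap_to_bin(q: u8, bins: &[u8]) -> u8 {
    let distance = |b: u8| (i16::from(b) - i16::from(q)).abs();
    let mut best = bins[0];
    for &b in &bins[1..] {
        // Strict `<` keeps the earlier (lower) bin when two bins are equally close.
        if distance(b) < distance(best) {
            best = b;
        }
    }
    best
}
```

With an illustrative bin set of `[2, 12, 23, 37]`, a score of 30 is equally far from 23 and 37 and therefore snaps down to 23, while a score of 40 snaps to 37. Sorting at parse time is what lets the tie rule be expressed as a simple strict comparison during the scan.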