feat: mean and median sequence length in Basic Statistics #5

Draft
ewels wants to merge 1 commit into main from mean-median-read-length

ewels commented May 29, 2026

Implements s-andrews/FastQC#203: report mean and median read length in the Basic Statistics module, so downstream tools (e.g. MultiQC) no longer have to estimate them from the binned length distribution.

What changed

Two new rows in the Basic Statistics table, right after Sequence length:

Sequence length          16
Mean sequence length     16.00
Median sequence length   16
  • Mean = total bases / total sequences, to 2 decimal places.
  • Median = derived from a per-length histogram over non-filtered sequences. For an even number of reads, the two central values are averaged and any resulting .5 is rounded up (e.g. central lengths 15 and 16 give a median of 16).
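
As a rough illustration of the two definitions above, here is a minimal sketch of computing the mean and median from a per-length histogram. It is not the code in this PR: `LengthStats`, `add`, `mean`, and `median` are hypothetical names, and the choice of a `BTreeMap` is an assumption made so that lengths iterate in sorted order.

```rust
use std::collections::BTreeMap;

/// Hypothetical accumulator (illustrative only, not the PR's actual type).
#[derive(Default)]
struct LengthStats {
    total_bases: u64,
    total_sequences: u64,
    /// Sequence length -> number of non-filtered reads with that length.
    histogram: BTreeMap<usize, u64>,
}

impl LengthStats {
    /// Record one read; filtered reads are skipped entirely.
    fn add(&mut self, length: usize, filtered: bool) {
        if filtered {
            return;
        }
        self.total_bases += length as u64;
        self.total_sequences += 1;
        *self.histogram.entry(length).or_insert(0) += 1;
    }

    /// Mean = total bases / total sequences (None when no reads were seen).
    fn mean(&self) -> Option<f64> {
        if self.total_sequences == 0 {
            return None;
        }
        Some(self.total_bases as f64 / self.total_sequences as f64)
    }

    /// Median from the histogram; for an even count the two central
    /// values are averaged and a trailing .5 is rounded up.
    fn median(&self) -> Option<u64> {
        let n = self.total_sequences;
        if n == 0 {
            return None;
        }
        // 0-based ranks of the two central reads (equal when n is odd).
        let lo_rank = (n - 1) / 2;
        let hi_rank = n / 2;
        let mut seen = 0u64;
        let mut lo: Option<u64> = None;
        for (&len, &count) in &self.histogram {
            seen += count;
            if lo.is_none() && seen > lo_rank {
                lo = Some(len as u64);
            }
            if seen > hi_rank {
                let lo = lo?; // set on this or an earlier iteration
                let hi = len as u64;
                return Some((lo + hi + 1) / 2); // integer average, .5 rounds up
            }
        }
        None
    }
}
```

Formatting the mean with `format!("{:.2}", value)` would give the two-decimal form shown in the table. Walking a sorted histogram keeps memory proportional to the number of distinct lengths rather than the number of reads.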

Both values flow through write_text_report, so they appear in fastqc_data.txt and the HTML report (default and --template modern), which render the table from the same source.
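
To make the "same source" point concrete, the sketch below builds the three length rows once and renders them as text. It is only an illustration under stated assumptions: `length_rows` and `render_text` are hypothetical helpers (not the real `write_text_report` signature), and the tab-separated layout is assumed from the existing fastqc_data.txt convention.

```rust
/// Hypothetical helper: the three length rows as (measure, value) pairs.
fn length_rows(range: &str, mean: f64, median: u64) -> Vec<(String, String)> {
    vec![
        ("Sequence length".to_string(), range.to_string()),
        // Mean is reported to 2 decimal places.
        ("Mean sequence length".to_string(), format!("{:.2}", mean)),
        ("Median sequence length".to_string(), median.to_string()),
    ]
}

/// Renders rows as tab-separated lines (assumed text-report layout).
/// An HTML renderer could consume the same rows to build its table.
fn render_text(rows: &[(String, String)]) -> String {
    rows.iter().map(|(k, v)| format!("{k}\t{v}\n")).collect()
}
```

Because every output format would read the same row list, a new row only needs to be added in one place.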

Notes / decisions

  • Computed only over non-filtered sequences, consistent with the existing min/max length range.
  • Histogram is kept locally in BasicStats rather than shared with SequenceLengthDistribution, which bins lengths and includes filtered reads — different semantics. Memory cost mirrors the histogram that module already maintains.
  • This is a deliberate divergence from the Java text output, which only reports the range.

CI / tests

  • Unit tests and the cargo test approved-output tests are updated and passing.
  • ⚠️ The equivalence job (Rust vs. stored Java reference) is expected to fail for now, since the Java reference doesn't yet emit these rows. Per plan, this PR stays open until the feature lands in upstream Java FastQC; once the reference data is regenerated from a build that includes it, the equivalence job should go green. No equivalence patches have been added, on purpose.

🤖 Generated with Claude Code

Adds "Mean sequence length" and "Median sequence length" rows to the
Basic Statistics module, implementing s-andrews#203.

The mean is total bases / total sequences (2 d.p.); the median is
derived from a per-length histogram over non-filtered sequences (for an
even count, the two central values averaged and rounded up). Both flow
through write_text_report, so they appear in fastqc_data.txt and in the
default and modern HTML reports.

This is a deliberate divergence from the Java text output, which only
reports the length range. The equivalence suite will pass once the
feature lands upstream and reference data is regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ewels ewels changed the base branch from master to main May 29, 2026 08:07
@ewels ewels marked this pull request as draft May 29, 2026 09:28

ewels commented May 29, 2026

On hold until upstream merge is complete.
