Skip to content

fix: emit unrounded percentage in Overrepresented sequences (Java compat)#2

Merged
ewels merged 3 commits into
ewels:mainfrom
FelixKrueger:fix/overrepresented-percentage-precision
Apr 26, 2026
Merged

fix: emit unrounded percentage in Overrepresented sequences (Java compat)#2
ewels merged 3 commits into
ewels:mainfrom
FelixKrueger:fix/overrepresented-percentage-precision

Conversation

@FelixKrueger
Copy link
Copy Markdown

@FelixKrueger FelixKrueger commented Apr 26, 2026

Summary

The current >>Overrepresented sequences text output rounds the percentage column to 2 decimal places (7.16), citing a JAVA COMPAT comment that claims Java FastQC does the same. Empirically, Java FastQC 0.12.1 emits full Double.toString() precision in this column (7.160449112640348). This PR drops the rounding so the section is byte-equivalent to Java FastQC again.

Reproducer

Trim smallRNA_100K.fastq.gz (typical small-RNA library, 100k reads, ~99 overrepresented sequences detected by both engines) and run both engines on the same byte-identical trimmed file.

# Java FastQC 0.12.1 (homebrew)
$ fastqc --quiet --outdir java raw_trimmed.fq.gz

# fastqc-rust v1.0.0
$ fastqc-rust --quiet --outdir rust raw_trimmed.fq.gz

>>Overrepresented sequences section in fastqc_data.txt:

First entry
Java FastQC 0.12.1 AAAGTGCTACTACTTTTGAGTCT 5931 7.160449112640348 No Hit
fastqc-rust v1.0.0 AAAGTGCTACTACTTTTGAGTCT 5931 7.16 No Hit

The same 99 sequences are detected with the same counts and the same Possible Source classification — only the percentage column's formatting differs. Every other section in fastqc_data.txt (Per base sequence quality, Per sequence GC content, Sequence Duplication Levels, etc.) is byte-identical to Java FastQC.

Why the existing comment is wrong

The current code at src/modules/overrepresented_seqs.rs:419-423 says:

// The Java ResultsTable.getValueAt() for percentage does
// JAVA COMPAT: Math.round(percentage * 100.0) / 100.0, rounding to 2 decimal places.
// The text report then calls String.valueOf() on the Double, producing
// Java's Double.toString() format.
let rounded_pct = (s.percentage * 100.0).round() / 100.0;

Walking the actual Java path in s-andrews/FastQC v0.12.1:

  • OverRepresentedSeqs.java:123double percentage = ((double)sequences.get(seq)/count)*100; — full precision computed.
  • OverRepresentedSeqs.java:253this.percentage = percentage; — stored unmodified in the OverrepresentedSeq struct.
  • OverRepresentedSeqs.java:214case 2: return seqs[rowIndex].percentage(); — returned unrounded by getValueAt(...).
  • AbstractQCModule.java:159w.writeCharacters(String.valueOf(table.getValueAt(r, c))); — serialized via String.valueOf() on the unrounded Double.

There is no Math.round anywhere in this chain in Java FastQC 0.12.1. The JAVA COMPAT comment is well-intentioned but doesn't match what Java actually does.

Fix

Drop the rounding line; pass s.percentage directly to java_format_double. The helper already produces Java's Double.toString() format faithfully for full-precision doubles (it's exercised by the existing test suite and by every other section that emits a percentage).

Regression test (commit b12da4e)

The existing equivalence suite missed this divergence because every reference test case has exactly one overrepresented entry at exactly 100.0%, where the buggy rounding is a no-op (Math.round(100.0 * 100.0) / 100.0 == 100.0). To close that coverage gap, this PR also adds a new equivalence case realistic_default sized to expose non-round percentages:

  • 1009 reads (prime denominator), 50 bp uniform Phred 40
  • 5 deliberately-overrepresented sequences at counts (73, 37, 11, 5, 2) → percentages 7.234886025768088, 3.6669970267591676, 1.0901883052527255, 0.4955401387512388, 0.19821605550049554
  • 881 unique pseudo-random background reads at 1/1009 = 0.099% (just below the 0.1% overrepresented:warn threshold so they don't pollute the section)
  • Deterministic generator at tests/data/gen_realistic.py (fixed seed)
  • 8 SVG patches generated against _normalize_svg'd Java reference, mirroring the pattern used by complex_*/minimal_* cases

Bidirectional validation:

  • With the fix applied (this PR's tip): Testing: realistic_default ... PASS
  • Reverting just src/modules/overrepresented_seqs.rs (HEAD~1 state): Testing: realistic_default ... FAIL with fastqc_data.txt: 5 lines differ — matching exactly the 5 designed overrepresented entries.

So the new test case both verifies the fix is correct and serves as a regression guard for future changes in the same code path.

Note on adapter list: complex_default/minimal_default were generated against fastqc-rust's bundled 6-adapter assets/adapter_list.txt (the Java FastQC tarball contains the same older list). To match, realistic_default's reference was produced with fastqc -a assets/adapter_list.txt tests/data/realistic.fastq. Same convention as the existing reference data.

Validation

After the fix, the >>Overrepresented sequences section produced by fastqc-rust matches Java FastQC 0.12.1's output byte-for-byte for the percentage column. All 99 upstream tests still pass (cargo test --release). The full equivalence suite passes (uv run tests/equivalence/compare.py --binary ./target/release/fastqc --test realistic_default).

One unrelated divergence remains: when two sequences have the same count, fastqc-rust uses sequence-ascending tiebreaker (intentional, per the comment at src/modules/overrepresented_seqs.rs:310-314) while Java preserves HashMap iteration order. That's a separate design call and out of scope for this PR.

Why this matters

The README and equivalence report position fastqc-rust as producing fastqc_data.txt byte-identical to Java FastQC. This bug breaks that contract for any sample with overrepresented sequences. Downstream MultiQC aggregations of fastqc_trimmed_overrepresented_sequences_plot.txt then yield different MD5s on different MultiQC versions (some MultiQCs round upstream of aggregation, some don't), causing fragile snapshot tests in pipelines that pin the bundled engine.

Context

I hit this while integrating fastqc-rust into Trim Galore v2.1.0-beta.4 as a bundled dep — the regression surfaced via nf-core/rnaseq#1789's nf-test snapshot suite (NF 25.04.3 with its pinned MultiQC version). Reproducer is portable.

…pat)

Java FastQC stores the raw percentage double without rounding and
serializes it through `String.valueOf()` in `AbstractQCModule.writeTable`,
which produces full `Double.toString()` precision (e.g. `7.160449112640348`).

The current code pre-rounds via `Math.round(x*100.0)/100.0` before
formatting, citing a "JAVA COMPAT" comment that does not match Java
FastQC 0.12.1's actual behaviour. Empirically Java FastQC writes the
unrounded value in `fastqc_data.txt`. Drop the rounding and pass the
raw percentage through `java_format_double` directly.

Reference paths in upstream Java FastQC 0.12.1 source:
- `OverRepresentedSeqs.java:123` — `(double)count/total * 100` stored unrounded
- `OverRepresentedSeqs.java:253` — `OverrepresentedSeq.percentage` field unmodified
- `OverRepresentedSeqs.java:214` — `getValueAt(case 2)` returns the raw double
- `AbstractQCModule.java:159` — `String.valueOf(table.getValueAt(...))`
@FelixKrueger
Copy link
Copy Markdown
Author

@ewels — to your question about the equivalence script:

It's a coverage gap in the test inputs, not a flaw in the script's mechanism (the script does compare fastqc_data.txt byte-for-byte and would absolutely catch this divergence with realistic data).

The cause is that every single one of the 18 reference test cases produces exactly one overrepresented entry at exactly 100.0%:

$ for d in tests/equivalence/reference/*/; do
    extract "$d/fastqc_data.txt" "Overrepresented sequences" |
      awk '/^[ACGTN]/ {print $3}'
  done | sort -u
100.0

The buggy rounding (x * 100.0).round() / 100.0 is a no-op for x = 100.0Math.round(100.0 * 100.0) / 100.0 == 100.0, which java_format_double then renders as "100.0" either way. So the rounded and unrounded paths produce byte-identical output for every existing test case.

The bug only surfaces when there's variety in the percentage column. complex.fastq (5 reads, all ACGTACGTACGTACGT) trivially gives 5/5 = 100% for its single overrepresented sequence. minimal.fastq is even smaller (1 read).

The way I tripped over it was running fastqc-rust on a real small-RNA library (smallRNA_100K — 100k reads, 99 detected overrepresented sequences with percentages like 7.160449112640348, 5.6694434383677415, etc.). Once the percentage column has any non-round entries, the divergence is unambiguous.

Suggested fix on the test side: add one realistic-shaped fastq under tests/data/ — even a few hundred reads with moderate sequence variety (one or two genuinely overrepresented sequences at non-round rates) is enough. The mechanical recipe:

  1. Drop, e.g., realistic.fastq under tests/data/
  2. Add realistic_default (etc.) entries to test_cases.yaml
  3. Run generate_reference.sh to capture the Java reference
  4. Commit the resulting tests/equivalence/reference/realistic_default/ tree

If you'd like me to put together a candidate test fastq + reference directory in this same PR (or a follow-up), happy to. Otherwise just flagging the coverage gap.

…wels#2)

The existing equivalence suite (minimal/complex variants) has every
overrepresented entry land at exactly 100.0%, where the buggy rounding
fixed in this PR is a no-op (Math.round(100.0 * 100.0) / 100.0 ==
100.0). This means the suite missed the percentage-precision divergence
even though the comparator mechanism is correct.

This adds a `realistic_default` test case sized to expose non-round
overrepresented percentages:

- 1009 reads (prime denominator), 50 bp, uniform Phred 40
- 5 deliberately-overrepresented sequences at counts (73, 37, 11, 5, 2)
  → percentages 7.234886025768088, 3.6669970267591676, 1.0901883052527255,
  0.4955401387512388, 0.19821605550049554 (all non-round)
- 881 unique pseudo-random background reads at 1/1009 = 0.0991% (just
  below the 0.1% overrepresented:warn threshold so they don't pollute
  the section)
- Deterministic generation (fixed seed); see tests/data/gen_realistic.py

Files added:
- tests/data/realistic.fastq + tests/data/gen_realistic.py (input + generator)
- tests/equivalence/test_cases.yaml (new realistic_default entry)
- tests/equivalence/reference/realistic_default/ (full Java FastQC reference)
- tests/equivalence/patches/realistic_default_*_svg.patch (8 SVG renderer
  patches mirroring the pattern used by complex_*/minimal_* cases)

Validation: with this commit's percentage-formatting fix, `realistic_default`
passes; reverting just src/modules/overrepresented_seqs.rs (HEAD~1 state)
makes it fail with "fastqc_data.txt: 5 lines differ" — matching exactly
the 5 designed overrepresented entries. The test thus serves as a
regression guard for the bug fixed earlier in this PR.

The existing `complex_default` and `minimal_default` reference data was
generated against an older Java FastQC config with 6 adapters in the
Adapter Content section; this case follows the same convention by
running Java FastQC with `-a assets/adapter_list.txt` to use
fastqc-rust's bundled 6-adapter list.
The plain FASTQ added in the previous commit compressed well (mostly
uniform Phred-40 quality strings + 50bp ACGT sequences, ~83.5%
reduction). Switch to a gzipped input to keep the repo lean.

- tests/data/realistic.fastq → realistic.fastq.gz
- gen_realistic.py now writes via gzip.GzipFile with mtime=0 so
  re-running produces a byte-identical .gz on any machine
- tests/equivalence/test_cases.yaml: file: realistic.fastq.gz
- Reference data regenerated: the "Filename" header in fastqc_data.txt,
  summary.txt rows, and the fastqc_report.html banner now reference
  realistic.fastq.gz (Java FastQC writes the literal input filename
  into several places, so the reference must match)

Equivalence test: realistic_default ... PASS (full suite green).
@ewels ewels merged commit bb65350 into ewels:main Apr 26, 2026
8 checks passed
FelixKrueger added a commit to FelixKrueger/TrimGalore that referenced this pull request Apr 26, 2026
Bumps `fastqc-rust` from `=1.0.0` to `=1.0.1` to pick up the upstream
percentage-precision fix in `>>Overrepresented sequences`
(ewels/FastQC-Rust#2, merged 2026-04-26).

Why this matters: fastqc-rust v1.0.0 rounded the overrepresented
percentage column to 2 decimals (`7.16`), while Java FastQC 0.12.1
emits full Double.toString() precision (`7.160449112640348`). The
nf-core/rnaseq#1789 integration matrix on Nextflow 25.04.3 surfaced
the divergence as a snapshot MD5 mismatch on
`fastqc_trimmed_overrepresented_sequences_plot.txt` (the older pinned
MultiQC preserves the literal value when aggregating, so the rounding
cascaded into a downstream byte difference). NF latest-everything was
already green because its newer MultiQC rounds upstream of aggregation.

After this bump: `fastqc_data.txt` byte-identity to Java FastQC 0.12.1
is restored for the overrepresented section, matching the existing
guarantee for every other section. No behavioural change to TrimGalore
itself; bug was purely cosmetic formatting in the bundled FastQC
engine.

Re-pin the rnaseq integration matrix to the new beta.5 digest to
verify CI now goes green on NF 25.04.3.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants