fix: emit unrounded percentage in Overrepresented sequences (Java compat) by FelixKrueger · Pull Request #2 · ewels/FastQC-Rust

FelixKrueger · 2026-04-26T21:36:48Z

Summary

The current >>Overrepresented sequences text output rounds the percentage column to 2 decimal places (7.16), citing a JAVA COMPAT comment that claims Java FastQC does the same. Empirically, Java FastQC 0.12.1 emits full Double.toString() precision in this column (7.160449112640348). This PR drops the rounding so the section is byte-equivalent to Java FastQC again.

Reproducer

Trim smallRNA_100K.fastq.gz (typical small-RNA library, 100k reads, ~99 overrepresented sequences detected by both engines) and run both engines on the same byte-identical trimmed file.

# Java FastQC 0.12.1 (homebrew)
$ fastqc --quiet --outdir java raw_trimmed.fq.gz

# fastqc-rust v1.0.0
$ fastqc-rust --quiet --outdir rust raw_trimmed.fq.gz

>>Overrepresented sequences section in fastqc_data.txt:

	First entry
Java FastQC 0.12.1	`AAAGTGCTACTACTTTTGAGTCT 5931 7.160449112640348 No Hit`
fastqc-rust v1.0.0	`AAAGTGCTACTACTTTTGAGTCT 5931 7.16 No Hit`

The same 99 sequences are detected with the same counts and the same Possible Source classification — only the percentage column's formatting differs. Every other section in fastqc_data.txt (Per base sequence quality, Per sequence GC content, Sequence Duplication Levels, etc.) is byte-identical to Java FastQC.

Why the existing comment is wrong

The current code at src/modules/overrepresented_seqs.rs:419-423 says:

// The Java ResultsTable.getValueAt() for percentage does
// JAVA COMPAT: Math.round(percentage * 100.0) / 100.0, rounding to 2 decimal places.
// The text report then calls String.valueOf() on the Double, producing
// Java's Double.toString() format.
let rounded_pct = (s.percentage * 100.0).round() / 100.0;

Walking the actual Java path in s-andrews/FastQC v0.12.1:

OverRepresentedSeqs.java:123 — double percentage = ((double)sequences.get(seq)/count)*100; — full precision computed.
OverRepresentedSeqs.java:253 — this.percentage = percentage; — stored unmodified in the OverrepresentedSeq struct.
OverRepresentedSeqs.java:214 — case 2: return seqs[rowIndex].percentage(); — returned unrounded by getValueAt(...).
AbstractQCModule.java:159 — w.writeCharacters(String.valueOf(table.getValueAt(r, c))); — serialized via String.valueOf() on the unrounded Double.

There is no Math.round anywhere in this chain in Java FastQC 0.12.1. The JAVA COMPAT comment is well-intentioned but doesn't match what Java actually does.

Fix

Drop the rounding line; pass s.percentage directly to java_format_double. The helper already produces Java's Double.toString() format faithfully for full-precision doubles (it's exercised by the existing test suite and by every other section that emits a percentage).

Regression test (commit `b12da4e`)

The existing equivalence suite missed this divergence because every reference test case has exactly one overrepresented entry at exactly 100.0%, where the buggy rounding is a no-op (Math.round(100.0 * 100.0) / 100.0 == 100.0). To close that coverage gap, this PR also adds a new equivalence case realistic_default sized to expose non-round percentages:

1009 reads (prime denominator), 50 bp uniform Phred 40
5 deliberately-overrepresented sequences at counts (73, 37, 11, 5, 2) → percentages 7.234886025768088, 3.6669970267591676, 1.0901883052527255, 0.4955401387512388, 0.19821605550049554
881 unique pseudo-random background reads at 1/1009 = 0.099% (just below the 0.1% overrepresented:warn threshold so they don't pollute the section)
Deterministic generator at tests/data/gen_realistic.py (fixed seed)
8 SVG patches generated against _normalize_svg'd Java reference, mirroring the pattern used by complex_*/minimal_* cases

Bidirectional validation:

With the fix applied (this PR's tip): Testing: realistic_default ... PASS
Reverting just src/modules/overrepresented_seqs.rs (HEAD~1 state): Testing: realistic_default ... FAIL with fastqc_data.txt: 5 lines differ — matching exactly the 5 designed overrepresented entries.

So the new test case both verifies the fix is correct and serves as a regression guard for future changes in the same code path.

Note on adapter list: complex_default/minimal_default were generated against fastqc-rust's bundled 6-adapter assets/adapter_list.txt (the Java FastQC tarball contains the same older list). To match, realistic_default's reference was produced with fastqc -a assets/adapter_list.txt tests/data/realistic.fastq. Same convention as the existing reference data.

Validation

After the fix, the >>Overrepresented sequences section produced by fastqc-rust matches Java FastQC 0.12.1's output byte-for-byte for the percentage column. All 99 upstream tests still pass (cargo test --release). The full equivalence suite passes (uv run tests/equivalence/compare.py --binary ./target/release/fastqc --test realistic_default).

One unrelated divergence remains: when two sequences have the same count, fastqc-rust uses sequence-ascending tiebreaker (intentional, per the comment at src/modules/overrepresented_seqs.rs:310-314) while Java preserves HashMap iteration order. That's a separate design call and out of scope for this PR.

Why this matters

The README and equivalence report position fastqc-rust as producing fastqc_data.txt byte-identical to Java FastQC. This bug breaks that contract for any sample with overrepresented sequences. Downstream MultiQC aggregations of fastqc_trimmed_overrepresented_sequences_plot.txt then yield different MD5s on different MultiQC versions (some MultiQCs round upstream of aggregation, some don't), causing fragile snapshot tests in pipelines that pin the bundled engine.

Context

I hit this while integrating fastqc-rust into Trim Galore v2.1.0-beta.4 as a bundled dep — the regression surfaced via nf-core/rnaseq#1789's nf-test snapshot suite (NF 25.04.3 with its pinned MultiQC version). Reproducer is portable.

…pat) Java FastQC stores the raw percentage double without rounding and serializes it through `String.valueOf()` in `AbstractQCModule.writeTable`, which produces full `Double.toString()` precision (e.g. `7.160449112640348`). The current code pre-rounds via `Math.round(x*100.0)/100.0` before formatting, citing a "JAVA COMPAT" comment that does not match Java FastQC 0.12.1's actual behaviour. Empirically Java FastQC writes the unrounded value in `fastqc_data.txt`. Drop the rounding and pass the raw percentage through `java_format_double` directly. Reference paths in upstream Java FastQC 0.12.1 source: - `OverRepresentedSeqs.java:123` — `(double)count/total * 100` stored unrounded - `OverRepresentedSeqs.java:253` — `OverrepresentedSeq.percentage` field unmodified - `OverRepresentedSeqs.java:214` — `getValueAt(case 2)` returns the raw double - `AbstractQCModule.java:159` — `String.valueOf(table.getValueAt(...))`

FelixKrueger · 2026-04-26T21:43:09Z

@ewels — to your question about the equivalence script:

It's a coverage gap in the test inputs, not a flaw in the script's mechanism (the script does compare fastqc_data.txt byte-for-byte and would absolutely catch this divergence with realistic data).

The cause is that every single one of the 18 reference test cases produces exactly one overrepresented entry at exactly 100.0%:

$ for d in tests/equivalence/reference/*/; do
    extract "$d/fastqc_data.txt" "Overrepresented sequences" |
      awk '/^[ACGTN]/ {print $3}'
  done | sort -u
100.0

The buggy rounding (x * 100.0).round() / 100.0 is a no-op for x = 100.0 — Math.round(100.0 * 100.0) / 100.0 == 100.0, which java_format_double then renders as "100.0" either way. So the rounded and unrounded paths produce byte-identical output for every existing test case.

The bug only surfaces when there's variety in the percentage column. complex.fastq (5 reads, all ACGTACGTACGTACGT) trivially gives 5/5 = 100% for its single overrepresented sequence. minimal.fastq is even smaller (1 read).

The way I tripped over it was running fastqc-rust on a real small-RNA library (smallRNA_100K — 100k reads, 99 detected overrepresented sequences with percentages like 7.160449112640348, 5.6694434383677415, etc.). Once the percentage column has any non-round entries, the divergence is unambiguous.

Suggested fix on the test side: add one realistic-shaped fastq under tests/data/ — even a few hundred reads with moderate sequence variety (one or two genuinely overrepresented sequences at non-round rates) is enough. The mechanical recipe:

Drop, e.g., realistic.fastq under tests/data/
Add realistic_default (etc.) entries to test_cases.yaml
Run generate_reference.sh to capture the Java reference
Commit the resulting tests/equivalence/reference/realistic_default/ tree

If you'd like me to put together a candidate test fastq + reference directory in this same PR (or a follow-up), happy to. Otherwise just flagging the coverage gap.

…wels#2) The existing equivalence suite (minimal/complex variants) has every overrepresented entry land at exactly 100.0%, where the buggy rounding fixed in this PR is a no-op (Math.round(100.0 * 100.0) / 100.0 == 100.0). This means the suite missed the percentage-precision divergence even though the comparator mechanism is correct. This adds a `realistic_default` test case sized to expose non-round overrepresented percentages: - 1009 reads (prime denominator), 50 bp, uniform Phred 40 - 5 deliberately-overrepresented sequences at counts (73, 37, 11, 5, 2) → percentages 7.234886025768088, 3.6669970267591676, 1.0901883052527255, 0.4955401387512388, 0.19821605550049554 (all non-round) - 881 unique pseudo-random background reads at 1/1009 = 0.0991% (just below the 0.1% overrepresented:warn threshold so they don't pollute the section) - Deterministic generation (fixed seed); see tests/data/gen_realistic.py Files added: - tests/data/realistic.fastq + tests/data/gen_realistic.py (input + generator) - tests/equivalence/test_cases.yaml (new realistic_default entry) - tests/equivalence/reference/realistic_default/ (full Java FastQC reference) - tests/equivalence/patches/realistic_default_*_svg.patch (8 SVG renderer patches mirroring the pattern used by complex_*/minimal_* cases) Validation: with this commit's percentage-formatting fix, `realistic_default` passes; reverting just src/modules/overrepresented_seqs.rs (HEAD~1 state) makes it fail with "fastqc_data.txt: 5 lines differ" — matching exactly the 5 designed overrepresented entries. The test thus serves as a regression guard for the bug fixed earlier in this PR. The existing `complex_default` and `minimal_default` reference data was generated against an older Java FastQC config with 6 adapters in the Adapter Content section; this case follows the same convention by running Java FastQC with `-a assets/adapter_list.txt` to use fastqc-rust's bundled 6-adapter list.

The plain FASTQ added in the previous commit compressed well (mostly uniform Phred-40 quality strings + 50bp ACGT sequences, ~83.5% reduction). Switch to a gzipped input to keep the repo lean. - tests/data/realistic.fastq → realistic.fastq.gz - gen_realistic.py now writes via gzip.GzipFile with mtime=0 so re-running produces a byte-identical .gz on any machine - tests/equivalence/test_cases.yaml: file: realistic.fastq.gz - Reference data regenerated: the "Filename" header in fastqc_data.txt, summary.txt rows, and the fastqc_report.html banner now reference realistic.fastq.gz (Java FastQC writes the literal input filename into several places, so the reference must match) Equivalence test: realistic_default ... PASS (full suite green).

Bumps `fastqc-rust` from `=1.0.0` to `=1.0.1` to pick up the upstream percentage-precision fix in `>>Overrepresented sequences` (ewels/FastQC-Rust#2, merged 2026-04-26). Why this matters: fastqc-rust v1.0.0 rounded the overrepresented percentage column to 2 decimals (`7.16`), while Java FastQC 0.12.1 emits full Double.toString() precision (`7.160449112640348`). The nf-core/rnaseq#1789 integration matrix on Nextflow 25.04.3 surfaced the divergence as a snapshot MD5 mismatch on `fastqc_trimmed_overrepresented_sequences_plot.txt` (the older pinned MultiQC preserves the literal value when aggregating, so the rounding cascaded into a downstream byte difference). NF latest-everything was already green because its newer MultiQC rounds upstream of aggregation. After this bump: `fastqc_data.txt` byte-identity to Java FastQC 0.12.1 is restored for the overrepresented section, matching the existing guarantee for every other section. No behavioural change to TrimGalore itself; bug was purely cosmetic formatting in the bundled FastQC engine. Re-pin the rnaseq integration matrix to the new beta.5 digest to verify CI now goes green on NF 25.04.3.

FelixKrueger added 2 commits April 26, 2026 22:57

ewels merged commit bb65350 into ewels:main Apr 26, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: emit unrounded percentage in Overrepresented sequences (Java compat)#2

fix: emit unrounded percentage in Overrepresented sequences (Java compat)#2
ewels merged 3 commits into
ewels:mainfrom
FelixKrueger:fix/overrepresented-percentage-precision

FelixKrueger commented Apr 26, 2026 •

edited

Loading

Uh oh!

FelixKrueger commented Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FelixKrueger commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproducer

Why the existing comment is wrong

Fix

Regression test (commit b12da4e)

Validation

Why this matters

Context

Uh oh!

FelixKrueger commented Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FelixKrueger commented Apr 26, 2026 •

edited

Loading

Regression test (commit `b12da4e`)