fix: emit unrounded percentage in Overrepresented sequences (Java compat)#2
Conversation
…pat) Java FastQC stores the raw percentage double without rounding and serializes it through `String.valueOf()` in `AbstractQCModule.writeTable`, which produces full `Double.toString()` precision (e.g. `7.160449112640348`). The current code pre-rounds via `Math.round(x*100.0)/100.0` before formatting, citing a "JAVA COMPAT" comment that does not match Java FastQC 0.12.1's actual behaviour. Empirically Java FastQC writes the unrounded value in `fastqc_data.txt`. Drop the rounding and pass the raw percentage through `java_format_double` directly. Reference paths in upstream Java FastQC 0.12.1 source: - `OverRepresentedSeqs.java:123` — `(double)count/total * 100` stored unrounded - `OverRepresentedSeqs.java:253` — `OverrepresentedSeq.percentage` field unmodified - `OverRepresentedSeqs.java:214` — `getValueAt(case 2)` returns the raw double - `AbstractQCModule.java:159` — `String.valueOf(table.getValueAt(...))`
It's a coverage gap in the test inputs, not a flaw in the script's mechanism (the script does compare The cause is that every single one of the 18 reference test cases produces exactly one overrepresented entry at exactly The buggy rounding The bug only surfaces when there's variety in the percentage column. The way I tripped over it was running fastqc-rust on a real small-RNA library (smallRNA_100K — 100k reads, 99 detected overrepresented sequences with percentages like Suggested fix on the test side: add one realistic-shaped fastq under
If you'd like me to put together a candidate test fastq + reference directory in this same PR (or a follow-up), happy to. Otherwise just flagging the coverage gap. |
…wels#2) The existing equivalence suite (minimal/complex variants) has every overrepresented entry land at exactly 100.0%, where the buggy rounding fixed in this PR is a no-op (Math.round(100.0 * 100.0) / 100.0 == 100.0). This means the suite missed the percentage-precision divergence even though the comparator mechanism is correct. This adds a `realistic_default` test case sized to expose non-round overrepresented percentages: - 1009 reads (prime denominator), 50 bp, uniform Phred 40 - 5 deliberately-overrepresented sequences at counts (73, 37, 11, 5, 2) → percentages 7.234886025768088, 3.6669970267591676, 1.0901883052527255, 0.4955401387512388, 0.19821605550049554 (all non-round) - 881 unique pseudo-random background reads at 1/1009 = 0.0991% (just below the 0.1% overrepresented:warn threshold so they don't pollute the section) - Deterministic generation (fixed seed); see tests/data/gen_realistic.py Files added: - tests/data/realistic.fastq + tests/data/gen_realistic.py (input + generator) - tests/equivalence/test_cases.yaml (new realistic_default entry) - tests/equivalence/reference/realistic_default/ (full Java FastQC reference) - tests/equivalence/patches/realistic_default_*_svg.patch (8 SVG renderer patches mirroring the pattern used by complex_*/minimal_* cases) Validation: with this commit's percentage-formatting fix, `realistic_default` passes; reverting just src/modules/overrepresented_seqs.rs (HEAD~1 state) makes it fail with "fastqc_data.txt: 5 lines differ" — matching exactly the 5 designed overrepresented entries. The test thus serves as a regression guard for the bug fixed earlier in this PR. The existing `complex_default` and `minimal_default` reference data was generated against an older Java FastQC config with 6 adapters in the Adapter Content section; this case follows the same convention by running Java FastQC with `-a assets/adapter_list.txt` to use fastqc-rust's bundled 6-adapter list.
The plain FASTQ added in the previous commit compressed well (mostly uniform Phred-40 quality strings + 50bp ACGT sequences, ~83.5% reduction). Switch to a gzipped input to keep the repo lean. - tests/data/realistic.fastq → realistic.fastq.gz - gen_realistic.py now writes via gzip.GzipFile with mtime=0 so re-running produces a byte-identical .gz on any machine - tests/equivalence/test_cases.yaml: file: realistic.fastq.gz - Reference data regenerated: the "Filename" header in fastqc_data.txt, summary.txt rows, and the fastqc_report.html banner now reference realistic.fastq.gz (Java FastQC writes the literal input filename into several places, so the reference must match) Equivalence test: realistic_default ... PASS (full suite green).
Bumps `fastqc-rust` from `=1.0.0` to `=1.0.1` to pick up the upstream percentage-precision fix in `>>Overrepresented sequences` (ewels/FastQC-Rust#2, merged 2026-04-26). Why this matters: fastqc-rust v1.0.0 rounded the overrepresented percentage column to 2 decimals (`7.16`), while Java FastQC 0.12.1 emits full Double.toString() precision (`7.160449112640348`). The nf-core/rnaseq#1789 integration matrix on Nextflow 25.04.3 surfaced the divergence as a snapshot MD5 mismatch on `fastqc_trimmed_overrepresented_sequences_plot.txt` (the older pinned MultiQC preserves the literal value when aggregating, so the rounding cascaded into a downstream byte difference). NF latest-everything was already green because its newer MultiQC rounds upstream of aggregation. After this bump: `fastqc_data.txt` byte-identity to Java FastQC 0.12.1 is restored for the overrepresented section, matching the existing guarantee for every other section. No behavioural change to TrimGalore itself; bug was purely cosmetic formatting in the bundled FastQC engine. Re-pin the rnaseq integration matrix to the new beta.5 digest to verify CI now goes green on NF 25.04.3.
Summary
The current
>>Overrepresented sequencestext output rounds the percentage column to 2 decimal places (7.16), citing aJAVA COMPATcomment that claims Java FastQC does the same. Empirically, Java FastQC 0.12.1 emits fullDouble.toString()precision in this column (7.160449112640348). This PR drops the rounding so the section is byte-equivalent to Java FastQC again.Reproducer
Trim
smallRNA_100K.fastq.gz(typical small-RNA library, 100k reads, ~99 overrepresented sequences detected by both engines) and run both engines on the same byte-identical trimmed file.>>Overrepresented sequencessection infastqc_data.txt:AAAGTGCTACTACTTTTGAGTCT 5931 7.160449112640348 No HitAAAGTGCTACTACTTTTGAGTCT 5931 7.16 No HitThe same 99 sequences are detected with the same counts and the same
Possible Sourceclassification — only the percentage column's formatting differs. Every other section infastqc_data.txt(Per base sequence quality, Per sequence GC content, Sequence Duplication Levels, etc.) is byte-identical to Java FastQC.Why the existing comment is wrong
The current code at
src/modules/overrepresented_seqs.rs:419-423says:Walking the actual Java path in
s-andrews/FastQCv0.12.1:OverRepresentedSeqs.java:123—double percentage = ((double)sequences.get(seq)/count)*100;— full precision computed.OverRepresentedSeqs.java:253—this.percentage = percentage;— stored unmodified in theOverrepresentedSeqstruct.OverRepresentedSeqs.java:214—case 2: return seqs[rowIndex].percentage();— returned unrounded bygetValueAt(...).AbstractQCModule.java:159—w.writeCharacters(String.valueOf(table.getValueAt(r, c)));— serialized viaString.valueOf()on the unroundedDouble.There is no
Math.roundanywhere in this chain in Java FastQC 0.12.1. TheJAVA COMPATcomment is well-intentioned but doesn't match what Java actually does.Fix
Drop the rounding line; pass
s.percentagedirectly tojava_format_double. The helper already produces Java'sDouble.toString()format faithfully for full-precision doubles (it's exercised by the existing test suite and by every other section that emits a percentage).Regression test (commit
b12da4e)The existing equivalence suite missed this divergence because every reference test case has exactly one overrepresented entry at exactly
100.0%, where the buggy rounding is a no-op (Math.round(100.0 * 100.0) / 100.0 == 100.0). To close that coverage gap, this PR also adds a new equivalence caserealistic_defaultsized to expose non-round percentages:7.234886025768088,3.6669970267591676,1.0901883052527255,0.4955401387512388,0.19821605550049554overrepresented:warnthreshold so they don't pollute the section)tests/data/gen_realistic.py(fixed seed)_normalize_svg'd Java reference, mirroring the pattern used bycomplex_*/minimal_*casesBidirectional validation:
Testing: realistic_default ... PASSsrc/modules/overrepresented_seqs.rs(HEAD~1 state):Testing: realistic_default ... FAILwithfastqc_data.txt: 5 lines differ— matching exactly the 5 designed overrepresented entries.So the new test case both verifies the fix is correct and serves as a regression guard for future changes in the same code path.
Note on adapter list:
complex_default/minimal_defaultwere generated against fastqc-rust's bundled 6-adapterassets/adapter_list.txt(the Java FastQC tarball contains the same older list). To match,realistic_default's reference was produced withfastqc -a assets/adapter_list.txt tests/data/realistic.fastq. Same convention as the existing reference data.Validation
After the fix, the
>>Overrepresented sequencessection produced by fastqc-rust matches Java FastQC 0.12.1's output byte-for-byte for the percentage column. All 99 upstream tests still pass (cargo test --release). The full equivalence suite passes (uv run tests/equivalence/compare.py --binary ./target/release/fastqc --test realistic_default).One unrelated divergence remains: when two sequences have the same count, fastqc-rust uses sequence-ascending tiebreaker (intentional, per the comment at
src/modules/overrepresented_seqs.rs:310-314) while Java preserves HashMap iteration order. That's a separate design call and out of scope for this PR.Why this matters
The README and equivalence report position fastqc-rust as producing
fastqc_data.txtbyte-identical to Java FastQC. This bug breaks that contract for any sample with overrepresented sequences. Downstream MultiQC aggregations offastqc_trimmed_overrepresented_sequences_plot.txtthen yield different MD5s on different MultiQC versions (some MultiQCs round upstream of aggregation, some don't), causing fragile snapshot tests in pipelines that pin the bundled engine.Context
I hit this while integrating fastqc-rust into Trim Galore v2.1.0-beta.4 as a bundled dep — the regression surfaced via
nf-core/rnaseq#1789's nf-test snapshot suite (NF 25.04.3 with its pinned MultiQC version). Reproducer is portable.