Performance: Eliminate per-sequence `char[]` allocations in seven QC modules by ewels · Pull Request #199 · s-andrews/FastQC

ewels · 2026-05-21T20:49:28Z

`String.toCharArray()` allocates a fresh `char[]` the same length as the string, about 300 bytes per call. Seven modules each did this once per sequence, so a single 50 M-read file allocated a cumulative ~100 GB of char arrays. The GC kept up, but at the cost of heap headroom and GC time.
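As a rough check of that figure, using only the numbers quoted above and assuming each of the seven modules makes exactly one call per read:

$$
5 \times 10^{7}\ \text{reads} \times 7\ \text{modules} \times 300\ \text{B} = 1.05 \times 10^{11}\ \text{B} \approx 105\ \text{GB}
$$

This is total allocation over the run, not memory held at any one time.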

This branch replaces every such loop with a `String` reference, its `length()`, and `charAt()`, which on JDK 17 intrinsifies to the same machine code as `char[]` indexing, minus the allocation and copy. Where the inner loop reads the same base more than once, the new code caches it in a local `char b` so the bounds check fires once per iteration. `PerSequenceGCContent.truncateSequence` switches from returning a `char[]` to returning a `String` for the same reason.
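A minimal sketch of the before/after pattern described above. The class `CharAtSketch` and both method names are hypothetical and written for illustration only; this is not code from the FastQC modules, just the same loop shape applied to a simple G/C count.

```java
public final class CharAtSketch {

    // Before: toCharArray() copies the whole read into a new char[] on every call.
    public static int countGcWithCopy(String sequence) {
        char[] bases = sequence.toCharArray(); // per-read allocation and copy
        int gc = 0;
        for (int i = 0; i < bases.length; i++) {
            if (bases[i] == 'G' || bases[i] == 'C') {
                gc++;
            }
        }
        return gc;
    }

    // After: index the String directly, so no intermediate array is created.
    public static int countGcWithCharAt(String sequence) {
        int length = sequence.length();
        int gc = 0;
        for (int i = 0; i < length; i++) {
            char b = sequence.charAt(i); // read the base once, compare it twice
            if (b == 'G' || b == 'C') {
                gc++;
            }
        }
        return gc;
    }

    public static void main(String[] args) {
        String read = "GATTACAGGCC"; // made-up example read
        System.out.println(countGcWithCopy(read));
        System.out.println(countGcWithCharAt(read));
    }
}
```

Both methods return the same count for any input; only the second avoids the per-call array.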

Affects: BasicStats, NContent, PerBaseQualityScores, PerBaseSequenceContent, PerSequenceGCContent, PerSequenceQualityScores, PerTileQualityScores.

The benchmark report shows a small speed increase and a drop in peak RSS for single files. When running with multiple files, however, memory usage is about 30% lower. All fastqc_data.txt, summary.txt, and fastqc_report.html files remain byte-identical to master. Full report: report.html

Screenshot

[Image: screenshot of the tochararray-full benchmark report]

Each per-sequence module called String.toCharArray() once per read, allocating a fresh char[] each time. Switching to a String reference plus charAt() removes that allocation without changing the algorithm. PerSequenceGCContent#truncateSequence now returns the truncated String directly so the per-read char[] in the no-truncation path also goes away.

Affects: BasicStats, NContent, PerBaseQualityScores, PerBaseSequenceContent, PerSequenceGCContent, PerSequenceQualityScores, PerTileQualityScores.

Co-Authored-By: Paolo Di Tommaso <paolo.ditommaso@gmail.com>

ewels · 2026-05-21T20:53:36Z

I expect this to be the last performance-related PR for a bit. @pditommaso did push a whole load of other changes as well, but none seem to make a significant impact on run time (even if they seem sensible changes), so I'm not sure that they're worth the code changes.

I thought this one was worth pushing forward mostly because of the memory savings when running with 2 FastQ files, which is a pretty common setup.

ewels wants to merge 1 commit into s-andrews:master from ewels:perf/tochararray-churn
