~3x run time speedup using parallel processing #197

Merged
s-andrews merged 3 commits into s-andrews:performance from ewels:perf/parallel-pipeline
May 21, 2026

Conversation

@ewels
Contributor

@ewels ewels commented May 21, 2026

I spent a while trying to get the performance improvements from my Rust rewrite back upstream into the Java distribution. After a lot of different attempts, none of which really made any difference, I gave up. Thankfully I posted about this on our Seqera Slack and @pditommaso, defender of the Java faith, stood up for Java and said that anything Rust could do, Java could do better (or as fast, anyway).

@pditommaso proceeded to try a lot of things and sure enough, pulled a 3x speed improvement out of the bag.

I've now gone through all of his changes with a fine-tooth comb and tried to separate out the different aspects and benchmark which things had what effect. The result of that was pinpointing the smallest change that made the biggest difference: parallelisation of the stream reader. I pulled that code out into a new clean branch and hammered on it to cut it down as much as possible, as well as polish and benchmark. This PR is the result of that.

What it does

As of this PR, each file now runs through a small in-process pipeline instead of a single loop: one reader thread (gzip + FASTQ parse) feeds up to three processor threads that each own a disjoint slice of the QC modules. The modules themselves are unchanged and no module needs to become thread-safe. Despite us thinking that gzip was the bottleneck and that was that, this does indeed give a major performance boost.
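
To make the shape of that pipeline concrete, here is a minimal sketch rather than the actual AnalysisRunner code. It follows the description in the first commit message (the reader builds batches and hands each batch reference to one ArrayBlockingQueue per processor), but everything named here is hypothetical: PipelineSketch, CountingModule, the END sentinel, the queue capacity of 8, and the strided way modules are assigned to processors are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PipelineSketch {

    /** Hypothetical stand-in for a QC module: it only counts the records it sees. */
    static class CountingModule {
        long seen = 0;
        void process(String seq) { seen++; }
    }

    // Sentinel batch (compared by reference) that tells a processor the reader is done.
    private static final List<String> END = new ArrayList<>();

    static long[] run(List<String> records, int moduleCount, int processors, int batchSize)
            throws InterruptedException {
        CountingModule[] modules = new CountingModule[moduleCount];
        for (int i = 0; i < moduleCount; i++) modules[i] = new CountingModule();

        List<BlockingQueue<List<String>>> queues = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int p = 0; p < processors; p++) {
            // Capacity 8 is an arbitrary choice for this sketch.
            BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(8);
            queues.add(queue);
            final int first = p;
            Thread worker = new Thread(() -> {
                try {
                    for (List<String> work = queue.take(); work != END; work = queue.take()) {
                        for (String seq : work) {
                            // Disjoint slice (assumed strided): first, first + processors, ...
                            for (int m = first; m < modules.length; m += processors) {
                                modules[m].process(seq);
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }

        // Reader role: build batches and push the same batch reference to every queue.
        List<String> batch = new ArrayList<>(batchSize);
        for (String seq : records) {
            batch.add(seq);
            if (batch.size() == batchSize) {
                for (BlockingQueue<List<String>> q : queues) q.put(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            for (BlockingQueue<List<String>> q : queues) q.put(batch);
        }
        for (BlockingQueue<List<String>> q : queues) q.put(END);

        // join() makes each worker's module updates visible to this thread.
        for (Thread w : workers) w.join();
        long[] counts = new long[moduleCount];
        for (int i = 0; i < moduleCount; i++) counts[i] = modules[i].seen;
        return counts;
    }
}
```

Because each module is only ever touched by one processor thread, the counters need no locking. Batches are never modified after they are published, so sharing one batch reference across all queues is safe in this sketch.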

Behaviour change

The core code change is in the first commit and is really rather small. The second commit handles the pre-existing -t flag, which sets the number of threads. Previously, -t set the number of files processed in parallel, with each file handled by a single thread. After the first commit, each extra file allowed by -t would actually use up to 4 more CPUs, which is not what the end user would expect.

To handle this, I tweaked how it worked and made it truly reflect the number of CPUs to use. -t / -Dfastqc.threads is now a total CPU budget, split between files in parallel (outer concurrency) and the per-file pipeline (inner concurrency). It defaults to min(4 × num_files, available_cpus).

| Invocation | Before | After |
| --- | --- | --- |
| -t 1 | 1 core / file | 1 core / file (sequential path, unchanged) |
| -t N (N > 1) | up to N files in parallel, 1 core each | total budget of N cores across files × pipeline |
| no -t | 1 file at a time, 1 core | up to 4 cores / file |
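
The split can be sketched directly from the two formulas in the second commit message. This is an illustrative sketch, not the AnalysisQueue source: the class name ThreadBudgetSketch is made up, and the constant values (3 processors and 4 threads per file) are inferred from the "up to three processor threads" and "4 × num_files" wording in this description rather than read from the code.

```java
class ThreadBudgetSketch {

    // Assumed values, inferred from the PR text rather than taken from the source.
    static final int MAX_PROCESSORS_PER_FILE = 3;
    static final int THREADS_PER_FILE = 4; // 1 reader + 3 processors

    /** Processor threads per file for a total budget (assumes totalThreads >= 1). */
    static int processorsPerFile(int totalThreads) {
        return Math.min(MAX_PROCESSORS_PER_FILE, totalThreads - 1);
    }

    /** Number of files processed at the same time (integer division, at least 1). */
    static int outerSlots(int totalThreads) {
        return Math.max(1, totalThreads / (1 + processorsPerFile(totalThreads)));
    }

    /** Default budget when -t is not set. */
    static int defaultThreads(int expectedFiles, int availableProcessors) {
        return Math.min(THREADS_PER_FILE * Math.max(1, expectedFiles), availableProcessors);
    }

    public static void main(String[] args) {
        for (int t : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("-t %d: %d file(s) at once, %d processor thread(s) per file%n",
                    t, outerSlots(t), processorsPerFile(t));
        }
    }
}
```

Under these assumptions, -t 8 works out to two files at once, each with a reader and three processors, while -t 2 gives one file with a reader and a single processor. A budget of 1 yields zero processor threads, which corresponds to the single-threaded path.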

Benchmark

To validate the results I ran a set of benchmarks. Full report with results, and also discussing the changes in the PR and why they work, is here: report.html

Note that all outputs are byte-identical to master across every run.

To avoid downloading + opening the HTML to view, here's a full-page screenshot which you can squint at in the GitHub UI, if you prefer:

[Screenshot: full-page view of the benchmark report]

ewels and others added 3 commits May 21, 2026 07:53
AnalysisRunner now runs as one reader thread plus three processor
threads. The reader batches Sequences (1024 per batch) and pushes each
batch reference onto N ArrayBlockingQueues. Each processor drains its
own queue and runs an evenly split subset of the QCModule array, so
modules stay single-threaded per processor and no in-module locking is
needed.

Progress callbacks (analysisUpdated) are fired from the reader thread
at the same cadence as the previous single-threaded version (every
batch boundary, gated on a 5% file-position advance).

Co-Authored-By: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
AnalysisQueue treats -t as a total-thread budget and splits it between
outer concurrency (files in parallel) and inner concurrency (per-file
reader + processor pipeline):

    processorsPerFile = min(MAX_PROCESSORS_PER_FILE, totalThreads - 1)
    outerSlots        = max(1, totalThreads / (1 + processorsPerFile))

When -t is unset, OfflineRunner now tells AnalysisQueue how many files
the run has via configure(); the default becomes
min(THREADS_PER_FILE * max(1, expectedFiles), availableProcessors), so
a single file gets the full per-file pipeline and many files scale up
to the host's CPU count without the user needing to set -t.

A budget of one CPU makes AnalysisRunner take its single-threaded path
so -t 1 produces byte-identical behaviour to the unbatched runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@s-andrews s-andrews changed the base branch from master to performance May 21, 2026 12:50
@s-andrews s-andrews merged commit 31c54fd into s-andrews:performance May 21, 2026
1 check failed
@ewels ewels deleted the perf/parallel-pipeline branch May 21, 2026 14:30
@s-andrews
Owner

Have pulled this to a local branch and tested. Can confirm that it does seem to offer a meaningful increase in speed.

Timings on a test file which should already be in the in-memory file cache:

old fastqc (-t 1 effectively) took 8m6.225s
new fastqc -t 4 took 2m15.368s
new fastqc -t 1 took 8m10.267s

Will go ahead and pull into master.

@ewels
Contributor Author

ewels commented May 21, 2026

Awesome, happy to hear it! Thanks for reviewing 🙏🏻

@ewels
Contributor Author

ewels commented May 21, 2026

Minor side note: you probably don't need to change the target branch and do a second PR. I should have the Allow edits and access to secrets by maintainers box checked on all my PRs, meaning that you can push commits to my fork:

[Screenshot of the checkbox]

Then the workflow with GitHub CLI is as follows to work basically as if it were one of your own branches:

gh pr checkout 197
# test, make edits if needed
git commit -am "My review changes"
git push
gh pr merge  # or on the web interface

@pditommaso
Contributor

Happy to see this merged! 🎉
