~3x run time speedup using parallel processing #197
Merged
AnalysisRunner now runs as one reader thread plus three processor threads. The reader batches Sequences (1024 per batch) and pushes each batch reference onto N ArrayBlockingQueues. Each processor drains its own queue and runs an evenly split subset of the QCModule array, so modules stay single-threaded per processor and no in-module locking is needed. Progress callbacks (analysisUpdated) are fired from the reader thread at the same cadence as the previous single-threaded version (every batch boundary, gated on a 5% file-position advance).
Co-Authored-By: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
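The fan-out described in this commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the PR: records are plain `String`s instead of FastQC `Sequence` objects, modules are stood in for by `Consumer<String>`, and the names `FanOutPipelineSketch`, `END`, `QUEUE_CAPACITY`, `drain` and `broadcast` are hypothetical. Only the batch size of 1024 and the use of one `ArrayBlockingQueue` per processor come from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Sketch only: one reader fans each batch out to N processors, each owning its own modules. */
public class FanOutPipelineSketch {
    static final int BATCH_SIZE = 1024;   // batch size quoted in the commit message
    static final int QUEUE_CAPACITY = 4;  // assumption: the real capacity is not stated
    // Hypothetical end-of-stream marker, recognised by identity (never by equals).
    static final List<String> END = new ArrayList<>();

    /** moduleSlices.get(i) is only ever touched by processor thread i, so modules need no locks. */
    static void run(Iterable<String> records, List<List<Consumer<String>>> moduleSlices) {
        List<BlockingQueue<List<String>>> queues = new ArrayList<>();
        List<Thread> processors = new ArrayList<>();
        for (List<Consumer<String>> slice : moduleSlices) {
            BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
            Thread t = new Thread(() -> drain(queue, slice));
            queues.add(queue);
            processors.add(t);
            t.start();
        }
        try {
            // The calling thread plays the reader: fill a batch, then push the
            // same batch reference onto every processor's queue.
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            for (String record : records) {
                batch.add(record);
                if (batch.size() == BATCH_SIZE) {
                    broadcast(queues, batch);
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                broadcast(queues, batch);
            }
            broadcast(queues, END);
            for (Thread t : processors) {
                t.join(); // join makes the processors' writes visible to the caller
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("reader interrupted", e);
        }
    }

    /** Processor loop: take batches from this processor's own queue until END arrives. */
    private static void drain(BlockingQueue<List<String>> queue, List<Consumer<String>> slice) {
        try {
            for (List<String> batch = queue.take(); batch != END; batch = queue.take()) {
                for (String record : batch) {
                    for (Consumer<String> module : slice) {
                        module.accept(record);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void broadcast(List<BlockingQueue<List<String>>> queues, List<String> batch)
            throws InterruptedException {
        for (BlockingQueue<List<String>> q : queues) {
            q.put(batch); // blocks when a processor falls behind, which bounds memory use
        }
    }
}
```

Batches are shared by reference and only read by the processors, so no copying is needed. The sketch leaves out error handling for a module that throws, which a real implementation would need so that the reader does not block on a full queue.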
AnalysisQueue treats -t as a total-thread budget and splits it between
outer concurrency (files in parallel) and inner concurrency (per-file
reader + processor pipeline):
processorsPerFile = min(MAX_PROCESSORS_PER_FILE, totalThreads - 1)
outerSlots = max(1, totalThreads / (1 + processorsPerFile))
When -t is unset, OfflineRunner now tells AnalysisQueue how many files
the run has via configure(); the default becomes
min(THREADS_PER_FILE * max(1, expectedFiles), availableProcessors), so
a single file gets the full per-file pipeline and many files scale up
to the host's CPU count without the user needing to set -t.
A budget of one CPU makes AnalysisRunner take its single-threaded path
so -t 1 produces byte-identical behaviour to the unbatched runner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
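The budget arithmetic in this commit message is easier to follow with concrete numbers. The sketch below transcribes the two formulas and the default directly. The class and method names are hypothetical, and the constant values are inferred rather than quoted: `MAX_PROCESSORS_PER_FILE = 3` from "three processor threads" and `THREADS_PER_FILE = 4` from the `min(4 × num_files, available_cpus)` default in the PR description.

```java
/** Sketch only: the -t budget split, transcribed from the commit message formulas. */
public class ThreadBudgetSketch {
    static final int MAX_PROCESSORS_PER_FILE = 3; // inferred: "three processor threads"
    static final int THREADS_PER_FILE = 4;        // inferred: one reader + three processors

    /** Inner concurrency: processor threads per file (0 means the single-threaded path). */
    static int processorsPerFile(int totalThreads) {
        return Math.min(MAX_PROCESSORS_PER_FILE, totalThreads - 1);
    }

    /** Outer concurrency: how many files are analysed at the same time. */
    static int outerSlots(int totalThreads) {
        return Math.max(1, totalThreads / (1 + processorsPerFile(totalThreads)));
    }

    /** Budget used when -t is unset. */
    static int defaultThreads(int expectedFiles, int availableProcessors) {
        return Math.min(THREADS_PER_FILE * Math.max(1, expectedFiles), availableProcessors);
    }

    public static void main(String[] args) {
        for (int t : new int[] {1, 2, 4, 8, 16}) {
            System.out.println("-t " + t + ": processorsPerFile=" + processorsPerFile(t)
                    + ", outerSlots=" + outerSlots(t));
        }
    }
}
```

Under these assumptions, `-t 1` gives zero processors and one slot (the single-threaded path), `-t 4` gives one file with three processors, and `-t 8` gives two files with three processors each. Integer division means a budget such as `-t 6` still yields a single slot, leaving two threads of the budget unused.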
Owner
Have pulled this to a local branch and tested. Can confirm that it does seem to offer a meaningful increase in speed. On a test file which should be in the in-memory file cache, old fastqc (-t 1 effectively) took 8m6.225s. Will go ahead and pull into master.
Contributor
Author
Awesome, happy to hear it! Thanks for reviewing 🙏🏻
Contributor
Author
Minor side note: you probably don't need to change the target branch and open a second PR. I should have the "Allow edits and access to secrets by maintainers" box checked on all my PRs, meaning that you can push commits to my fork.
The workflow with the GitHub CLI is then as follows, and works basically as if it were one of your own branches:
gh pr checkout 197
# test, make edits if needed
git commit -am "My review changes"
git push
gh pr merge # or on the web interface
Contributor
Happy to see this merged! 🎉

I spent a while trying to get the performance improvements from my Rust rewrite back upstream into the Java distribution. After a lot of different attempts, none of which really made any difference, I gave up. Thankfully I posted about this on our Seqera Slack and @pditommaso, defender of the Java faith, stood up for Java and said that anything Rust could do, Java could do better (or as fast, anyway).
@pditommaso proceeded to try a lot of things and sure enough, pulled a 3x speed improvement out of the bag.
I've now gone through all of his changes with a fine-tooth comb and tried to separate out the different aspects and benchmark which things had what effect. The result of that was pinpointing the smallest change that made the biggest difference: parallelisation of the stream reader. I pulled that code out into a new clean branch and hammered on it to cut it down as much as possible, as well as polish and benchmark. This PR is the result of that.
What it does
As of this PR, each file now runs through a small in-process pipeline instead of a single loop: one reader thread (gzip + FASTQ parse) feeds up to three processor threads that each own a disjoint slice of the QC modules. The modules themselves are unchanged and no module needs to become thread-safe. Despite us thinking that gzip was the bottleneck and that was that, this does indeed give a major performance boost.
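One way to give each processor a disjoint slice of the modules is a contiguous, near-even split, sketched below. This is an assumption for illustration: the PR text only says the split is even and disjoint, not how it is computed, and `ModuleSlicerSketch` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: split a module list into disjoint, near-equal contiguous slices. */
public class ModuleSlicerSketch {
    /** Returns `parts` slices whose sizes differ by at most one and which together cover `items`. */
    static <T> List<List<T>> split(List<T> items, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int base = items.size() / parts;   // minimum slice size
        int extra = items.size() % parts;  // the first `extra` slices get one more item
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            slices.add(new ArrayList<>(items.subList(start, end)));
            start = end;
        }
        return slices;
    }
}
```

Because every module lands in exactly one slice, each module instance is only ever called from one thread, which is why none of them has to become thread-safe. A split by item count does not balance per-module cost, so a real implementation might weight the slices differently.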
Behaviour change
The core code change is in the first commit and is really rather small. The second commit handles the pre-existing -t flag, which sets threads. Previously threads = number of files processed in parallel, as FastQC was single-threaded. After the above change, each extra -t value (one more file in parallel) actually meant 4 more CPUs, which is not what the end user would expect. To handle this, I tweaked how it works so that it truly reflects the number of CPUs to use.
-t / -Dfastqc.threads is now a total CPU budget, split between files in parallel (outer concurrency) and the per-file pipeline (inner concurrency). It defaults to min(4 × num_files, available_cpus).
(Table: behaviour for -t 1, -t N (N > 1), and -t unset.)
Benchmark
To validate the results I ran a set of benchmarks. Full report with results, and also discussing the changes in the PR and why they work, is here: report.html
Note that all outputs are byte-identical to master across every run.
To avoid downloading + opening the HTML to view, here's a full-page screenshot which you can squint at in the GitHub UI, if you prefer:
Report (screenshot)