feat: write bcl2fastq output directly to /flowcells/, eliminating 3TB cross-filesystem rsync#82
Open
jemma-nelson wants to merge 6 commits intomainfrom
Open
feat: write bcl2fastq output directly to /flowcells/, eliminating 3TB cross-filesystem rsync#82jemma-nelson wants to merge 6 commits intomainfrom
jemma-nelson wants to merge 6 commits intomainfrom
Conversation
- Rename scripts/flowcells/link_nextseq.py -> rename_fastq_files.py to reflect that it now moves files rather than creating symlinks. Update all references in setup.sh and fix internal verbiage (create_links -> rename_files, 'symlinks' -> 'renames' in help/docstrings). - Add a bcl_output/ subdirectory under analysis_dir as the landing zone for all bcl-convert raw output. Create it with chmod 700 *before* submitting the bcl-convert job, so permissions are set before any files exist. Remove the post-hoc chmod in __COPY__. - Update all fastq_dir assignments and bcl-convert --output-dir flags to write into analysis_dir/bcl_output/ instead of analysis_dir/ directly. - Update novaseq_link_command glob from 'fastq-withmask-*' to 'bcl_output/fastq-withmask-*' and non-NovaSeq link_command -i arg from 'fastq' to 'bcl_output/fastq'. - Remove the rm -rf of intermediate fastq dirs from __COPY__: those directories retain valuable stats, reports, and logs from bcl-convert.
Contributor
Author
|
I let Claude Code write most of this - I think the core idea is solid, but I am tagging myself to review the details at a later date. |
There was a problem hiding this comment.
Pull request overview
This PR removes the large cross-filesystem copy step by having bcl2fastq write outputs directly under the flowcell analysis directory on /flowcells/, and then reorganizes/moves outputs in-place (with mv for internal samples) during the copy stage.
Changes:
- Update
bcl2fastq --output-dirpaths to write into$analysis_dir/bcl_output/and create that directory with restricted permissions ahead of job submission. - Replace the old “link” phase with
rename_fastq_files.py, which now moves FASTQs into canonical names/structure. - Optimize the copy SLURM job: internal samples use filesystem renames (
mv), while samples destined forproject_share_directorystill usersync.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/flowcells/setup.sh | Redirects bcl2fastq output to /flowcells under bcl_output/, adjusts linking/rename invocation, and changes the copy job to mv internal outputs. |
| scripts/flowcells/rename_fastq_files.py | Renames/refactors the previous linking behavior to move files via os.rename() and updates messaging/function naming accordingly. |
Comments suppressed due to low confidence (1)
scripts/flowcells/rename_fastq_files.py:147
os.rename()is only executed whenoutput_filedoes not exist, but if it does exist the code silently leaves the input FASTQ in place (after logging "Moving …"), which can produce partially-processed runs and make reruns non-deterministic. Consider explicitly logging a warning/error whenoutput_filealready exists (and either aborting, or adding an overwrite/force option) so failures don’t go unnoticed and inputs aren’t left behind unexpectedly.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Contributor
Author
|
Okay - I'm going to merge this in, and will deploy early next week. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Eliminates the slow copy phase in
setup.shby writing bcl2fastq output directly to the destination filesystem (/flowcells/) instead of/sequencers/, then rsyncing 3TB across.Key changes
scripts/flowcells/setup.sh--output-dirargs now write to$analysis_dir/bcl_output/(a restricted subdirectory under the flowcell analysis dir on/flowcells/)bcl_output/is created withchmod 700before the bcl2fastq job is submitted, so permissions are set before any files existcopy_from_diris unified to$analysis_dir/Demultiplexed/for all non-legacy pathsmv(O(1) rename, no data movement) instead ofrsync -aL; only samples with aproject_share_directorystill usersyncscripts/flowcells/rename_fastq_files.py(waslink_nextseq.py)os.symlink()replaced withos.rename()— files are moved to canonical sample names; since everything is on the same filesystem this is an O(1) metadata operationcreate_links, 'symlinks') updated to matchCommits
d42d1a2feat: write bcl2fastq output directly to /flowcells/2428eearefactor: rename link_nextseq.py, nest bcl output in bcl_output/ subdirfb3c5a7fix: terminology (bcl-convert → bcl2fastq in comments)5651e75fix: GUIDEseq index-swap paths updated to bcl_output/fastq/