Skip to content

feat: write bcl2fastq output directly to /flowcells/, eliminating 3TB cross-filesystem rsync#82

Open
jemma-nelson wants to merge 6 commits intomainfrom
feat/direct-flowcell-output
Open

feat: write bcl2fastq output directly to /flowcells/, eliminating 3TB cross-filesystem rsync#82
jemma-nelson wants to merge 6 commits intomainfrom
feat/direct-flowcell-output

Conversation

@jemma-nelson
Copy link
Copy Markdown
Contributor

Summary

Eliminates the slow copy phase in setup.sh by writing bcl2fastq output directly to the destination filesystem (/flowcells/) instead of /sequencers/, then rsyncing 3TB across.

Key changes

scripts/flowcells/setup.sh

  • All bcl2fastq --output-dir args now write to $analysis_dir/bcl_output/ (a restricted subdirectory under the flowcell analysis dir on /flowcells/)
  • bcl_output/ is created with chmod 700 before the bcl2fastq job is submitted, so permissions are set before any files exist
  • copy_from_dir is unified to $analysis_dir/Demultiplexed/ for all non-legacy paths
  • In the copy SLURM job: internal samples now use mv (O(1) rename, no data movement) instead of rsync -aL; only samples with a project_share_directory still use rsync
  • bcl_output/ intermediate directories are not deleted — they retain stats, logs, and reports from bcl2fastq

scripts/flowcells/rename_fastq_files.py (was link_nextseq.py)

  • Renamed to reflect that it now moves files rather than creating symlinks
  • os.symlink() replaced with os.rename() — files are moved to canonical sample names; since everything is on the same filesystem this is an O(1) metadata operation
  • All internal verbiage (create_links, 'symlinks') updated to match

Commits

  • d42d1a2 feat: write bcl2fastq output directly to /flowcells/
  • 2428eea refactor: rename link_nextseq.py, nest bcl output in bcl_output/ subdir
  • fb3c5a7 fix: terminology (bcl-convert → bcl2fastq in comments)
  • 5651e75 fix: GUIDEseq index-swap paths updated to bcl_output/fastq/

- Rename scripts/flowcells/link_nextseq.py -> rename_fastq_files.py to
  reflect that it now moves files rather than creating symlinks. Update
  all references in setup.sh and fix internal verbiage (create_links ->
  rename_files, 'symlinks' -> 'renames' in help/docstrings).

- Add a bcl_output/ subdirectory under analysis_dir as the landing zone
  for all bcl-convert raw output. Create it with chmod 700 *before*
  submitting the bcl-convert job, so permissions are set before any files
  exist. Remove the post-hoc chmod in __COPY__.

- Update all fastq_dir assignments and bcl-convert --output-dir flags to
  write into analysis_dir/bcl_output/ instead of analysis_dir/ directly.

- Update novaseq_link_command glob from 'fastq-withmask-*' to
  'bcl_output/fastq-withmask-*' and non-NovaSeq link_command -i arg from
  'fastq' to 'bcl_output/fastq'.

- Remove the rm -rf of intermediate fastq dirs from __COPY__: those
  directories retain valuable stats, reports, and logs from bcl-convert.
@jemma-nelson
Copy link
Copy Markdown
Contributor Author

I let Claude Code write most of this - I think the core idea is solid, but I am tagging myself to review the details at a later date.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the large cross-filesystem copy step by having bcl2fastq write outputs directly under the flowcell analysis directory on /flowcells/, and then reorganizes/moves outputs in-place (with mv for internal samples) during the copy stage.

Changes:

  • Update bcl2fastq --output-dir paths to write into $analysis_dir/bcl_output/ and create that directory with restricted permissions ahead of job submission.
  • Replace the old “link” phase with rename_fastq_files.py, which now moves FASTQs into canonical names/structure.
  • Optimize the copy SLURM job: internal samples use filesystem renames (mv), while samples destined for project_share_directory still use rsync.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
scripts/flowcells/setup.sh Redirects bcl2fastq output to /flowcells under bcl_output/, adjusts linking/rename invocation, and changes the copy job to mv internal outputs.
scripts/flowcells/rename_fastq_files.py Renames/refactors the previous linking behavior to move files via os.rename() and updates messaging/function naming accordingly.
Comments suppressed due to low confidence (1)

scripts/flowcells/rename_fastq_files.py:147

  • os.rename() is only executed when output_file does not exist, but if it does exist the code silently leaves the input FASTQ in place (after logging "Moving …"), which can produce partially-processed runs and make reruns non-deterministic. Consider explicitly logging a warning/error when output_file already exists (and either aborting, or adding an overwrite/force option) so failures don’t go unnoticed and inputs aren’t left behind unexpectedly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jemma-nelson
Copy link
Copy Markdown
Contributor Author

Okay - I'm going to merge this in, and will deploy early next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants