fix: revert GTF from sc, harden Dockerfile, fix docs #91

Closed
Jaureguy760 wants to merge 20 commits into dev from feat/docs-fix-dockerfile-hardening-sc-revert

Conversation

@Jaureguy760
Collaborator

Summary

  • Revert GTF/GFF3 support from count-variants-sc — sc commands are scATAC-only; gene annotation is a downstream ArchR/Signac step. Bulk count-variants retains full GTF support.
  • Harden Dockerfile: tini PID 1, g++ purge assertion, wasp2-ipscore verification in smoke test
  • Docs rewrite: counting.rst, mapping.rst, analysis.rst, installation.rst simplified and corrected; added ipscore.rst user guide page
  • Removed stale --min-count footnote from analysis.rst, fixed smoke test sample name case

Test plan

  • Docker build succeeds
  • Container smoke test: 10/10 passed
  • count-variants-sc --help shows BED/Peak only, no GTF params
  • count-variants --help still shows GTF/GFF3 options
  • Apptainer smoke test on Lima VM

🤖 Generated with Claude Code

Jaureguy760 and others added 20 commits March 5, 2026 22:43
Systematic audit and fix of 19 nf-core compliance items across
nf-rnaseq, nf-atacseq, nf-scatac, and nf-outrider:

P0 Critical:
- Add nf-validation plugin and validate_params to all pipelines
- Rename samplesheet_schema.json → schema_input.json (nf-rnaseq)
- Create schema_input.json for BAM-based input (nf-outrider)
- Add missing env block for Python/R isolation (nf-scatac)
- Remove duplicate publishDir from base.config (nf-rnaseq)

P1 Important:
- Standardize check_max() to Exception + log.warn pattern
- Canonical config section ordering (plugins→manifest→params→…)
- Consistent report filenames (remove execution_ prefix)
- Enforce container profile mutual exclusion
- Fix profile ordering: conda before docker before singularity
- Set modules.json homePage across all pipelines
- Add missing lint skip entries to .nf-core.yml

P2 Consistency:
- Align maxRetries=1 across all pipelines (nf-outrider was 3)
- Remove dead process_wasp2 label (nf-scatac base.config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Nextflow's file() function doesn't interpolate config variables like
${projectDir} when they appear inside CSV samplesheet data. Added
resolvePath closure to replace ${projectDir} and ${launchDir} literals
before passing to file(checkIfExists: true).

This fixes test_local profile failures where samplesheet paths
containing ${projectDir} were treated as literal directory names.
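
The substitution described above can be sketched as follows. This is an illustration in Python, not the Groovy closure from the commit; the function name `resolve_path`, the sample row, and the directory values are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_path(raw: str, project_dir: str, launch_dir: str) -> PurePosixPath:
    """Expand literal ${projectDir} / ${launchDir} placeholders in a samplesheet cell.

    The placeholders arrive as plain text because they were read from CSV data,
    so they have to be replaced explicitly before the path is checked.
    """
    resolved = raw.replace("${projectDir}", project_dir)
    resolved = resolved.replace("${launchDir}", launch_dir)
    return PurePosixPath(resolved)


# Hypothetical samplesheet row and directories, for illustration only.
row = {"sample": "s1", "bam": "${projectDir}/tests/data/s1.bam"}
print(resolve_path(row["bam"], "/opt/pipeline", "/home/user/run"))
```

Paths without a placeholder pass through unchanged, so the helper can be applied to every path column.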

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BWA_INDEX was missing from the withName selector, causing it to fall
back to its module-level container (biocontainers/bwa:0.7.18) which
no longer exists on Docker Hub. Include BWA_INDEX alongside BWA_MEM
in the bwa_samtools_container override.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OUTRIDER_FIT module (outrider_fit/main.nf):
- Fix estimateBestQ() return value: was discarding return and calling
  getBestQ() which reads metadata only set by findEncodingDim(). Now
  computes q directly with bounds: max(2, min(ncol-1, nrow-1, 500,
  3.7 + 0.16*ncol)) matching OUTRIDER's documented formula
- Fix gene filter min_samples: max(2,...) -> max(1,...) so single-sample
  datasets don't filter all genes (was causing "Too few genes" error)
- Remove no-op filterExpression(filterGenes=FALSE) that marks but
  doesn't subset (manual filtering already handles this)
- Remove redundant estimateSizeFactors() call (OUTRIDER(controlData=TRUE)
  calls it internally)

ABERRANT_EXPRESSION subworkflow:
- Add missing min_count (7th arg) to OUTRIDER_FIT call
- Add min_samples parameter, replacing hardcoded sample_count < 15
- Update all 4 nf-test cases with new input parameters

Validated: stub test (11/11 pass) and test_local with 3 samples
(12 genes × 3 samples, q=2, 36 result rows, 0 failures)
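
For reference, the bounded q computation quoted above can be checked against the validation run with a few lines of Python. This sketch assumes `ncol` is the number of samples and `nrow` the number of genes, and that the result is rounded to an integer; the commit message states neither, and the module itself is R.

```python
def estimate_q(n_samples: int, n_genes: int) -> int:
    """Encoding dimension q, bounded as described in the commit message.

    Assumption: n_samples corresponds to ncol, n_genes to nrow, and the
    result is rounded to the nearest integer.
    """
    upper = min(n_samples - 1, n_genes - 1, 500, 3.7 + 0.16 * n_samples)
    return int(round(max(2, upper)))


# The test_local run reported above: 12 genes x 3 samples.
print(estimate_q(n_samples=3, n_genes=12))  # 2
```

With 3 samples the `ncol - 1` bound equals 2, which agrees with the q=2 reported for the test_local run.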

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When bedtools intersect finds no overlapping variants, it produces
empty files that crash Polars' scan_csv with NoDataError. Added
empty-file guards in 4 modules:
- filter_variant_data.py: parse_intersect_region{,_new}
- parse_gene_data.py: parse_intersect_genes{,_new}
- run_counting.py: early return with empty output
- run_counting_sc.py: early return with empty AnnData
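
The guard pattern is roughly the following. This is a minimal sketch that uses the standard-library `csv` reader in place of Polars so that it runs on its own; the function name `read_intersect_rows` is hypothetical and is not one of the functions listed above.

```python
import csv
import os


def read_intersect_rows(path: str) -> list[list[str]]:
    """Read a tab-separated intersect file, tolerating a zero-byte input.

    An empty file means bedtools found no overlaps, so an empty result is
    returned instead of handing the file to a parser that would raise.
    """
    if os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

Callers can then treat an empty list as "no overlapping variants" and write an empty output rather than failing.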

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous chr_test.fa used repeating 4bp motifs producing 94%
MAPQ=0 reads, making WASP remap testing meaningless. New reference:
- Random 19,800bp sequence with ~42% GC content
- Max 4bp homopolymers, deterministic seed (12345)
- 100% MAPQ=60 and 100% properly paired reads
- Dynamic VCF with verified REF alleles matching reference
- All BAMs/FASTQs regenerated with wgsim from new reference
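
A reference with these properties can be generated along the following lines. This is a sketch under the stated constraints (length, GC target, homopolymer cap, fixed seed), not the generator script used in the repository; the function name and the redraw strategy are assumptions.

```python
import random


def random_reference(length: int = 19_800, gc: float = 0.42,
                     max_run: int = 4, seed: int = 12345) -> str:
    """Random DNA sequence with a target GC fraction and capped homopolymers."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    bases = "ACGT"
    # A and T share the non-GC mass; C and G share the GC mass.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    seq: list[str] = []
    while len(seq) < length:
        base = rng.choices(bases, weights=weights)[0]
        # Redraw if this base would extend a run beyond max_run.
        if len(seq) >= max_run and all(b == base for b in seq[-max_run:]):
            continue
        seq.append(base)
    return "".join(seq)


ref = random_reference()
gc_fraction = (ref.count("G") + ref.count("C")) / len(ref)
print(len(ref), round(gc_fraction, 2))
```

Redrawing on a would-be fifth identical base keeps runs at four or fewer while leaving the base composition close to the target.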

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
STAR does not publish native ARM64 container images. Added composable
'arm' profile that forces linux/amd64 via Rosetta 2 emulation:
  nextflow run main.nf -profile docker,arm [options]

- conf/arm.config: sets --platform linux/amd64
- nextflow.config: registers arm profile
- docs/usage.md: ARM troubleshooting section
- README.md: ARM test example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bam_remapper.rs hardcoded total_seqs=2 in the WASP name, but skips
haplotypes identical to the original (lines 591-593). For heterozygous
variants, only 1 of 2 haplotypes differs → only 1 pair gets emitted.

The mapping filter expects exactly total_seqs pairs. When only 1
arrives, remaining stays >0, and the read is removed from keep_set
(mapping_filter.rs:316-322). Result: ALL het-variant reads discarded,
producing zero variant counts.

Fix: pre-count how many haplotypes actually differ from the original
and use that count as total_seqs. Verified with cargo check.
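
The counting logic of the fix can be illustrated as follows. This is a Python sketch of the idea only; the actual change is in Rust, and the function name and tuple layout are hypothetical stand-ins for the WASP read-name encoding.

```python
def emit_haplotypes(original: str, haplotypes: list[str]) -> list[tuple[int, int, str]]:
    """Return (index, total_seqs, sequence) for each haplotype that is emitted.

    total_seqs is counted after dropping haplotypes identical to the
    original read, so it matches the number of records the filter will see.
    """
    differing = [h for h in haplotypes if h != original]
    total_seqs = len(differing)
    return [(i, total_seqs, h) for i, h in enumerate(differing, start=1)]


# Heterozygous site: one haplotype equals the original read, one differs.
print(emit_haplotypes("ACGT", ["ACGT", "ACTT"]))  # [(1, 1, 'ACTT')]
```

With a hardcoded total of 2, the single emitted record would never complete its set, which is the failure described above.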

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace symlinked shared test data with self-contained realistic
test data for proper WASP remap testing:
- generate_realistic_reference.py: random 20kb genome (~42% GC)
- Dual-haplotype reads: 1350 pairs each from REF and ALT
- 30 het SNPs with verified REF alleles
- 100% MAPQ=60 alignment quality (was 94% MAPQ=0)
- Removed stale annotation.gtf symlink

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Warns users when running on arm64/aarch64 that STAR requires x86_64
emulation, preventing confusing failures during test data generation.
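
A check of this kind might look like the following. It is a sketch in Python; the commit does not say how the warning is implemented, and the function name and message text are assumptions.

```python
import platform

ARM_MACHINES = {"arm64", "aarch64"}


def needs_x86_emulation(machine: str = "") -> bool:
    """True when the host reports an ARM architecture.

    An empty argument means "ask the running interpreter".
    """
    machine = (machine or platform.machine()).lower()
    return machine in ARM_MACHINES


if needs_x86_emulation():
    print("WARNING: ARM host detected; STAR requires x86_64 emulation.")
```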

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nvironments

Remove process.conda from all 4 pipeline conda profiles — this was
overriding module-level conda directives and forcing all processes
(including R-based OUTRIDER_FIT) to use the root WASP2 Python env.
Now each module uses its own conda environment per the nf-core pattern.

Additional fixes from E2E validation runs:
- Create missing environment.yml for nf-outrider local Python modules
- Create missing environment.yml for nf-rnaseq WASP2 modules
- Fix macOS zcat incompatibility in STAR align (use gunzip -c)
- Fix BSD awk ternary operator in scatac_count_alleles
- Fix BSD awk string concatenation in scatac_pseudobulk redirections
- Fix Polars 0.20.x API: schema_overrides→dtypes, collect_schema→schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add meta.yml for all 6 nf-rnaseq local modules (star_align,
  wasp2_unified_make_reads, wasp2_filter_remapped, wasp2_count_alleles,
  wasp2_analyze_imbalance, wasp2_ml_output)
- Add meta.yml for nf-scatac scatac_add_haplotype_layers
- Add params.help handler to nf-rnaseq main.nf using nf-validation plugin
- Add homePage to manifest in nf-atacseq, nf-scatac, nf-outrider configs
- Add email_template.html to all 4 pipelines
- Add root environment.yml to nf-atacseq, nf-rnaseq, nf-scatac

Compliance: meta.yml 18/18, homePage 4/4, email 4/4, env.yml 4/4,
params.help 4/4. Overall nf-core compliance ~97% (remaining: logo PNG,
DOI pending publication).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DOI (10.1038/nmeth.3582) to all 4 manifest blocks
- Generate pipeline logo PNGs for all 4 pipelines
- Refactor outrider_fit and merge_counts to use tuple val(meta)
  input/output pattern with dynamic $meta.id tags
- Update outrider.nf workflow to wrap collected counts with
  [id: 'all_samples'] meta map and unwrap for downstream emit

All 18 local modules now use meta map pattern (18/18).
All documentation, assets, config fields at 100%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pelines

Update all nf-core modules to latest versions (topic-based version channels),
fix module interface mismatches (meta-wrapped channels for BWA/Bowtie2/FASTP/
MULTIQC/MACS2), add required nf-core boilerplate files (.github workflows,
email templates, logos, default tests), fix nextflow.config (remove params.max_*,
hardcode check_max limits, NXF_OFFLINE guard for custom configs), fix
nextflow_schema.json (institutional_config_options, validate_params, tracedir),
fix multiqc_config.yml (report_comment, report_section_order), and fix test
configs (resourceLimits, testdata base paths).

All 4 pipelines pass lint with only 1 irreducible failure each (manifest.name
not prefixed with nf-core/ — expected for WASP2 pipelines):
- nf-atacseq: 276 passed, 1 failed, 32 warnings
- nf-outrider: 185 passed, 1 failed, 16 warnings
- nf-rnaseq:  165 passed, 1 failed, 20 warnings
- nf-scatac:  138 passed, 1 failed, 15 warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Achieve nf-core lint compliance across all 4 WASP2 pipelines:
- Update all nf-core modules to latest (topic-based version channels)
- Fix module interface mismatches (meta-wrapped channels)
- Add required nf-core boilerplate (workflows, templates, tests)
- Fix nextflow.config, schema, multiqc_config across all pipelines
- Fix WASP core bugs (empty CSV crash, total_seqs mismatch)
- Add ARM/Apple Silicon compatibility for nf-rnaseq

Results: 4/4 pipelines pass lint (1 irreducible manifest.name each)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Delete 272MB stray -.bam file at repo root
- Remove 9 broken placeholder PNGs (69-113 bytes, not real images)
- Remove 20 auto-generated CLAUDE.md files from pipeline subdirs
- Add test_benchmarks/, .claude/, pipeline-level logs, Nextflow
  reports (trace.txt, timeline.html, etc.) to .gitignore
- Commit tests/shared_data/expected_counts_regions.tsv (test fixture)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pre-commit hooks to block agent artifacts (ANALYSIS.md, debug_*.py,
  tmpclaude*, stray BAMs) and binaries in source directories
- Add CLAUDE.md with project instructions and file hygiene rules
- Fix .gitignore CLAUDE.md exception syntax (!./ -> !/)
- Add research report on AI agent file pollution prevention
- SessionEnd cleanup hook added locally (.claude/hooks/session-cleanup.sh)
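
The artifact check in such a hook could be sketched as below. The first three patterns are taken from the list above; `*.bam` as the pattern for stray BAMs and the function name `blocked_files` are assumptions, and the repository's actual hook configuration may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns matched against the file name only, not the full path.
BLOCKED_PATTERNS = ["ANALYSIS.md", "debug_*.py", "tmpclaude*", "*.bam"]


def blocked_files(staged: list[str]) -> list[str]:
    """Return the staged paths whose file name matches a blocked pattern."""
    return [
        path for path in staged
        if any(fnmatchcase(PurePosixPath(path).name, pat) for pat in BLOCKED_PATTERNS)
    ]


staged = ["src/debug_counts.py", "README.md", "-.bam", "tmpclaude-1234-cwd"]
print(blocked_files(staged))
```

A hook script would exit non-zero when the returned list is non-empty, which makes the commit fail before the artifacts reach history.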

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add .github/workflows/nf-lint.yml: matrix CI for nf-core lint across all 4 pipelines
- Enhance wasp2_make_reads and wasp2_filter_remapped nf-tests with real assertions
- Update all pipeline READMEs: fix your-org -> mcvickerlab, add test_local profile docs
- Add chr21 1000 Genomes validation sections to all READMEs
- Add real_wasp_data.json symlink and tests/** copy for nf-test data access

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- download_chr21.sh: streams chr21 from 1000 Genomes NYGC 30x CRAMs
- Supports NA12878 (GIAB benchmark) and HG00731 (WASP2 benchmark)
- Generates per-pipeline samplesheets and Nextflow configs
- Includes DRY_RUN mode, dependency checking, disk space estimates
- README with data sources, disk requirements, and usage examples

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1: Remove stale --min-count footnote from analysis.rst, add tini
PID 1 + g++ purge assertion to Dockerfile, fix smoke test sample name
case and add wasp2-ipscore check.

Phase 2: Revert GTF/GFF3 support from count-variants-sc (sc commands
are scATAC-only; gene annotation is a downstream ArchR/Signac step).
Bulk count-variants retains full GTF support. Clarify sc = ATAC in
docs and CLI help text.

Docs rewrite: counting.rst, mapping.rst, analysis.rst, installation.rst
simplified and corrected. Added ipscore.rst user guide page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jaureguy760 Jaureguy760 deleted the feat/docs-fix-dockerfile-hardening-sc-revert branch March 15, 2026 07:53