fix: revert GTF from sc, harden Dockerfile, fix docs #91
Closed
Jaureguy760 wants to merge 20 commits into dev from
Systematic audit and fix of 19 nf-core compliance items across nf-rnaseq, nf-atacseq, nf-scatac, and nf-outrider:

P0 Critical:
- Add nf-validation plugin and validate_params to all pipelines
- Rename samplesheet_schema.json → schema_input.json (nf-rnaseq)
- Create schema_input.json for BAM-based input (nf-outrider)
- Add missing env block for Python/R isolation (nf-scatac)
- Remove duplicate publishDir from base.config (nf-rnaseq)

P1 Important:
- Standardize check_max() to Exception + log.warn pattern
- Canonical config section ordering (plugins→manifest→params→…)
- Consistent report filenames (remove execution_ prefix)
- Enforce container profile mutual exclusion
- Fix profile ordering: conda before docker before singularity
- Set modules.json homePage across all pipelines
- Add missing lint skip entries to .nf-core.yml

P2 Consistency:
- Align maxRetries=1 across all pipelines (nf-outrider was 3)
- Remove dead process_wasp2 label (nf-scatac base.config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Nextflow's file() function doesn't interpolate config variables like
${projectDir} when they appear inside CSV samplesheet data. Added
resolvePath closure to replace ${projectDir} and ${launchDir} literals
before passing to file(checkIfExists: true).
This fixes test_local profile failures where samplesheet paths
containing ${projectDir} were treated as literal directory names.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
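The substitution described in this commit amounts to a plain string replacement performed before the existence check. The pipeline's closure is written in Groovy and is not shown on this page, so the Python sketch below only illustrates the idea; `resolve_path` and its arguments are hypothetical names, not the pipeline's API.

```python
def resolve_path(raw: str, project_dir: str, launch_dir: str) -> str:
    """Replace literal ${projectDir} / ${launchDir} placeholders in a samplesheet path.

    Hypothetical helper: samplesheet cells are plain strings, so the
    placeholders arrive unexpanded and must be substituted explicitly.
    """
    return (
        raw.replace("${projectDir}", project_dir)
           .replace("${launchDir}", launch_dir)
    )


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(resolve_path("${projectDir}/tests/data/s1.bam", "/opt/pipeline", "/work"))
```

Paths without a placeholder pass through unchanged, so the helper is safe to apply to every samplesheet row.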
BWA_INDEX was missing from the withName selector, causing it to fall back to its module-level container (biocontainers/bwa:0.7.18) which no longer exists on Docker Hub. Include BWA_INDEX alongside BWA_MEM in the bwa_samtools_container override.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OUTRIDER_FIT module (outrider_fit/main.nf):
- Fix estimateBestQ() return value: was discarding return and calling getBestQ() which reads metadata only set by findEncodingDim(). Now computes q directly with bounds: max(2, min(ncol-1, nrow-1, 500, 3.7 + 0.16*ncol)) matching OUTRIDER's documented formula
- Fix gene filter min_samples: max(2,...) -> max(1,...) so single-sample datasets don't filter all genes (was causing "Too few genes" error)
- Remove no-op filterExpression(filterGenes=FALSE) that marks but doesn't subset (manual filtering already handles this)
- Remove redundant estimateSizeFactors() call (OUTRIDER(controlData=TRUE) calls it internally)

ABERRANT_EXPRESSION subworkflow:
- Add missing min_count (7th arg) to OUTRIDER_FIT call
- Add min_samples parameter, replacing hardcoded sample_count < 15
- Update all 4 nf-test cases with new input parameters

Validated: stub test (11/11 pass) and test_local with 3 samples (12 genes × 3 samples, q=2, 36 result rows, 0 failures)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
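The bounded q computation quoted in this commit can be checked by hand. The module itself is R; the Python sketch below just evaluates the same expression, taking nrow as genes and ncol as samples (consistent with the 12 genes × 3 samples run). Rounding to the nearest integer is an assumption here, since the commit message does not say how the fractional term is handled, and `best_q` is a hypothetical name.

```python
def best_q(nrow: int, ncol: int) -> int:
    """Encoding dimension q = max(2, min(ncol-1, nrow-1, 500, 3.7 + 0.16*ncol)).

    nrow: number of genes, ncol: number of samples.
    Rounding to an integer is an assumption of this sketch.
    """
    upper = min(ncol - 1, nrow - 1, 500, 3.7 + 0.16 * ncol)
    return int(round(max(2, upper)))


if __name__ == "__main__":
    print(best_q(nrow=12, ncol=3))
```

For the validated 12 genes × 3 samples run, the inner minimum is ncol-1 = 2, so the result is 2, which agrees with the q=2 reported in the commit.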
When bedtools intersect finds no overlapping variants, it produces
empty files that crash Polars' scan_csv with NoDataError. Added
empty-file guards in 4 modules:
- filter_variant_data.py: parse_intersect_region{,_new}
- parse_gene_data.py: parse_intersect_genes{,_new}
- run_counting.py: early return with empty output
- run_counting_sc.py: early return with empty AnnData
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
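The guard pattern is small: check the file size before handing the path to a parser that rejects empty input. The sketch below uses the standard-library `csv` module as a stand-in for the Polars reader so that it runs without extra dependencies; `read_intersect_rows` is a hypothetical name and not one of the functions listed above.

```python
import csv
import os


def read_intersect_rows(path: str) -> list[list[str]]:
    """Return the tab-separated rows of a bedtools intersect output, or [] if empty."""
    # bedtools intersect writes a zero-byte file when nothing overlaps,
    # so return early instead of passing the file to the parser.
    if os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

Returning an empty but well-typed result lets downstream steps run unchanged, which matches the "early return with empty output" approach listed for the counting modules.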
The previous chr_test.fa used repeating 4bp motifs producing 94% MAPQ=0 reads, making WASP remap testing meaningless.

New reference:
- Random 19,800bp sequence with ~42% GC content
- Max 4bp homopolymers, deterministic seed (12345)
- 100% MAPQ=60 and 100% properly paired reads
- Dynamic VCF with verified REF alleles matching reference
- All BAMs/FASTQs regenerated with wgsim from new reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
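A reference with these properties can be generated in a few lines. The generator script is not shown on this page, so the following is a minimal sketch under stated assumptions: bases are drawn with weights that target the GC fraction, and a draw is rejected when it would extend a homopolymer past the limit. `make_reference` and its parameter names are hypothetical.

```python
import random


def make_reference(length: int = 19_800, gc: float = 0.42,
                   max_run: int = 4, seed: int = 12345) -> str:
    """Generate a reproducible random sequence with bounded homopolymer runs."""
    rng = random.Random(seed)  # private RNG so the result depends only on the seed
    bases = "ACGT"
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    seq: list[str] = []
    while len(seq) < length:
        base = rng.choices(bases, weights=weights)[0]
        # Reject a draw that would create a run longer than max_run.
        if len(seq) >= max_run and all(b == base for b in seq[-max_run:]):
            continue
        seq.append(base)
    return "".join(seq)


if __name__ == "__main__":
    ref = make_reference()
    gc_fraction = (ref.count("G") + ref.count("C")) / len(ref)
    print(len(ref), round(gc_fraction, 3))
```

A non-repetitive sequence is what lets the aligner place reads uniquely, which is why the repeating-motif reference produced mostly MAPQ=0 alignments.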
STAR does not publish native ARM64 container images. Added composable 'arm' profile that forces linux/amd64 via Rosetta 2 emulation:

  nextflow run main.nf -profile docker,arm [options]

- conf/arm.config: sets --platform linux/amd64
- nextflow.config: registers arm profile
- docs/usage.md: ARM troubleshooting section
- README.md: ARM test example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bam_remapper.rs hardcoded total_seqs=2 in the WASP name, but skips haplotypes identical to the original (line 591-593). For heterozygous variants, only 1 of 2 haplotypes differs → only 1 pair gets emitted. The mapping filter expects exactly total_seqs pairs. When only 1 arrives, remaining stays >0, and the read is removed from keep_set (mapping_filter.rs:316-322). Result: ALL het-variant reads discarded, producing zero variant counts.

Fix: pre-count how many haplotypes actually differ from the original and use that count as total_seqs. Verified with cargo check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
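The mismatch is easy to reproduce in miniature. The real code is Rust and is not shown here; the Python sketch below models only the counting logic described in the commit message, with hypothetical function names, and ignores everything else the mapping filter checks.

```python
def count_differing_haplotypes(original: str, haplotypes: list[str]) -> int:
    """Number of haplotype sequences that differ from the original read.

    Only these are emitted for remapping, so this is the value the
    fix uses for total_seqs.
    """
    return sum(1 for hap in haplotypes if hap != original)


def read_is_kept(total_seqs: int, pairs_received: int) -> bool:
    """Simplified filter rule: keep the read only when every expected pair arrived."""
    remaining = total_seqs - pairs_received
    return remaining == 0


if __name__ == "__main__":
    original = "ACGTACGT"
    haplotypes = [original, "ACGTTCGT"]  # het site: one haplotype equals the read
    emitted = count_differing_haplotypes(original, haplotypes)
    print(read_is_kept(total_seqs=2, pairs_received=emitted))        # old hardcoded value
    print(read_is_kept(total_seqs=emitted, pairs_received=emitted))  # pre-counted value
```

With the hardcoded value the filter waits for a second pair that is never emitted; with the pre-counted value the expected and received counts agree.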
Replace symlinked shared test data with self-contained realistic test data for proper WASP remap testing:
- generate_realistic_reference.py: random 20kb genome (~42% GC)
- Dual-haplotype reads: 1350 pairs each from REF and ALT
- 30 het SNPs with verified REF alleles
- 100% MAPQ=60 alignment quality (was 94% MAPQ=0)
- Removed stale annotation.gtf symlink

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
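Building the ALT haplotype and verifying REF alleles can be done in one pass. How generate_realistic_reference.py does this is not shown on this page, so the following is a sketch under the assumption that SNPs are given as 1-based (position, REF, ALT) tuples, as in a VCF; `apply_snps` is a hypothetical name.

```python
def apply_snps(reference: str, snps: list[tuple[int, str, str]]) -> str:
    """Return the ALT haplotype, checking each SNP's REF allele against the reference.

    snps holds 1-based (position, ref_allele, alt_allele) tuples.
    """
    alt = list(reference)
    for pos, ref_allele, alt_allele in snps:
        observed = reference[pos - 1]
        if observed != ref_allele:
            # A mismatch means the VCF and the FASTA are out of sync.
            raise ValueError(
                f"REF mismatch at {pos}: VCF says {ref_allele}, reference has {observed}"
            )
        alt[pos - 1] = alt_allele
    return "".join(alt)


if __name__ == "__main__":
    print(apply_snps("ACGTACGT", [(2, "C", "T"), (5, "A", "G")]))
```

Reads simulated from both the original sequence and the returned haplotype would then cover each het site with both alleles.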
Warns users when running on arm64/aarch64 that STAR requires x86_64 emulation, preventing confusing failures during test data generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
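An architecture check of this kind needs only the machine string. The script that carries the warning is not shown here and may well be a shell script; the Python sketch below illustrates the same check with the standard `platform` module, and `needs_x86_emulation` is a hypothetical name.

```python
import platform
import sys

ARM_MACHINES = {"arm64", "aarch64"}


def needs_x86_emulation(machine: str = "") -> bool:
    """True when the host reports an ARM architecture (arm64 or aarch64)."""
    # Fall back to the running host when no machine string is supplied.
    machine = (machine or platform.machine()).lower()
    return machine in ARM_MACHINES


if __name__ == "__main__":
    if needs_x86_emulation():
        print("WARNING: ARM host detected; STAR requires x86_64 emulation.",
              file=sys.stderr)
```

Accepting the machine string as an argument keeps the check testable on any host.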
…nvironments

Remove process.conda from all 4 pipeline conda profiles — this was overriding module-level conda directives and forcing all processes (including R-based OUTRIDER_FIT) to use the root WASP2 Python env. Now each module uses its own conda environment per the nf-core pattern.

Additional fixes from E2E validation runs:
- Create missing environment.yml for nf-outrider local Python modules
- Create missing environment.yml for nf-rnaseq WASP2 modules
- Fix macOS zcat incompatibility in STAR align (use gunzip -c)
- Fix BSD awk ternary operator in scatac_count_alleles
- Fix BSD awk string concatenation in scatac_pseudobulk redirections
- Fix Polars 0.20.x API: schema_overrides→dtypes, collect_schema→schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add meta.yml for all 6 nf-rnaseq local modules (star_align, wasp2_unified_make_reads, wasp2_filter_remapped, wasp2_count_alleles, wasp2_analyze_imbalance, wasp2_ml_output)
- Add meta.yml for nf-scatac scatac_add_haplotype_layers
- Add params.help handler to nf-rnaseq main.nf using nf-validation plugin
- Add homePage to manifest in nf-atacseq, nf-scatac, nf-outrider configs
- Add email_template.html to all 4 pipelines
- Add root environment.yml to nf-atacseq, nf-rnaseq, nf-scatac

Compliance: meta.yml 18/18, homePage 4/4, email 4/4, env.yml 4/4, params.help 4/4. Overall nf-core compliance ~97% (remaining: logo PNG, DOI pending publication).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DOI (10.1038/nmeth.3582) to all 4 manifest blocks
- Generate pipeline logo PNGs for all 4 pipelines
- Refactor outrider_fit and merge_counts to use tuple val(meta) input/output pattern with dynamic $meta.id tags
- Update outrider.nf workflow to wrap collected counts with [id: 'all_samples'] meta map and unwrap for downstream emit

All 18 local modules now use meta map pattern (18/18). All documentation, assets, config fields at 100%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pelines

Update all nf-core modules to latest versions (topic-based version channels), fix module interface mismatches (meta-wrapped channels for BWA/Bowtie2/FASTP/MULTIQC/MACS2), add required nf-core boilerplate files (.github workflows, email templates, logos, default tests), fix nextflow.config (remove params.max_*, hardcode check_max limits, NXF_OFFLINE guard for custom configs), fix nextflow_schema.json (institutional_config_options, validate_params, tracedir), fix multiqc_config.yml (report_comment, report_section_order), and fix test configs (resourceLimits, testdata base paths).

All 4 pipelines pass lint with only 1 irreducible failure each (manifest.name not prefixed with nf-core/; expected for WASP2 pipelines):
- nf-atacseq: 276 passed, 1 failed, 32 warnings
- nf-outrider: 185 passed, 1 failed, 16 warnings
- nf-rnaseq: 165 passed, 1 failed, 20 warnings
- nf-scatac: 138 passed, 1 failed, 15 warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Achieve nf-core lint compliance across all 4 WASP2 pipelines:
- Update all nf-core modules to latest (topic-based version channels)
- Fix module interface mismatches (meta-wrapped channels)
- Add required nf-core boilerplate (workflows, templates, tests)
- Fix nextflow.config, schema, multiqc_config across all pipelines
- Fix WASP core bugs (empty CSV crash, total_seqs mismatch)
- Add ARM/Apple Silicon compatibility for nf-rnaseq

Results: 4/4 pipelines pass lint (1 irreducible manifest.name each)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Delete 272MB stray -.bam file at repo root
- Remove 9 broken placeholder PNGs (69-113 bytes, not real images)
- Remove 20 auto-generated CLAUDE.md files from pipeline subdirs
- Add test_benchmarks/, .claude/, pipeline-level logs, Nextflow reports (trace.txt, timeline.html, etc.) to .gitignore
- Commit tests/shared_data/expected_counts_regions.tsv (test fixture)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pre-commit hooks to block agent artifacts (ANALYSIS.md, debug_*.py, tmpclaude*, stray BAMs) and binaries in source directories
- Add CLAUDE.md with project instructions and file hygiene rules
- Fix .gitignore CLAUDE.md exception syntax (!./ -> !/)
- Add research report on AI agent file pollution prevention
- SessionEnd cleanup hook added locally (.claude/hooks/session-cleanup.sh)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
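A filename-based block list like the one described in this commit can be expressed with glob patterns. The actual hook configuration is not shown on this page, so the sketch below is only an illustration: the first three patterns are taken from the commit message, `*.bam` is an assumed stand-in for "stray BAMs", and `blocked_files` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# "*.bam" is an assumption; a real hook might exempt test-data directories.
BLOCKED_PATTERNS = ("ANALYSIS.md", "debug_*.py", "tmpclaude*", "*.bam")


def blocked_files(staged_paths: list[str]) -> list[str]:
    """Return the staged paths whose file name matches a blocked pattern."""
    blocked = []
    for path in staged_paths:
        name = PurePosixPath(path).name  # match on the file name, not the directory
        if any(fnmatchcase(name, pattern) for pattern in BLOCKED_PATTERNS):
            blocked.append(path)
    return blocked


if __name__ == "__main__":
    staged = ["src/counting/run_counting.py", "scripts/debug_remap.py", "-.bam"]
    print(blocked_files(staged))
```

A hook built this way would exit non-zero when the returned list is non-empty, which stops the commit before the artifact lands in history.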
- Add .github/workflows/nf-lint.yml: matrix CI for nf-core lint across all 4 pipelines
- Enhance wasp2_make_reads and wasp2_filter_remapped nf-tests with real assertions
- Update all pipeline READMEs: fix your-org -> mcvickerlab, add test_local profile docs
- Add chr21 1000 Genomes validation sections to all READMEs
- Add real_wasp_data.json symlink and tests/** copy for nf-test data access

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- download_chr21.sh: streams chr21 from 1000 Genomes NYGC 30x CRAMs
- Supports NA12878 (GIAB benchmark) and HG00731 (WASP2 benchmark)
- Generates per-pipeline samplesheets and Nextflow configs
- Includes DRY_RUN mode, dependency checking, disk space estimates
- README with data sources, disk requirements, and usage examples

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1: Remove stale --min-count footnote from analysis.rst, add tini PID 1 + g++ purge assertion to Dockerfile, fix smoke test sample name case and add wasp2-ipscore check.

Phase 2: Revert GTF/GFF3 support from count-variants-sc (sc commands are scATAC-only; gene annotation is a downstream ArchR/Signac step). Bulk count-variants retains full GTF support. Clarify sc = ATAC in docs and CLI help text.

Docs rewrite: counting.rst, mapping.rst, analysis.rst, installation.rst simplified and corrected. Added ipscore.rst user guide page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Reverted GTF/GFF3 support from count-variants-sc: sc commands are scATAC-only; gene annotation is a downstream ArchR/Signac step. Bulk count-variants retains full GTF support.
- Removed stale --min-count footnote from analysis.rst, fixed smoke test sample name case

Test plan
- count-variants-sc --help shows BED/Peak only, no GTF params
- count-variants --help still shows GTF/GFF3 options

🤖 Generated with Claude Code