A production-grade, config-driven Snakemake framework for end-to-end chromatin accessibility analysis. Built for resilience, it supports both bulk and single-cell modalities, automatically scales from 4GB laptops to HPC clusters, and implements strict Quality Control gating to halt poor samples before downstream processing.
Unlike single-modality pipelines, this framework handles both bulk and single-cell ATAC-seq from a single config file, with built-in QC gating, agentic LLM integration, and execution profiles spanning laptops to Kubernetes clusters.
The pipeline handles all tool dependencies internally via Conda and Singularity.
conda create -n atacseq snakemake>=8.0 -c conda-forge -c bioconda
conda activate atacseq
# Standard Bulk run (8 cores)
snakemake --use-conda --cores 8
# Single-cell ATAC run
ATAC_MODE=scatac snakemake --use-conda --cores 8For platforms where Conda/Singularity installation is difficult (e.g., macOS or Windows), you can run the pipeline directly inside a Docker container.
1. Build the Host Runner Image:
docker build -t bdb-atacseq .Note
The Dockerfile creates a host environment (using micromamba) containing Snakemake and Python. Individual rule dependencies (like Bowtie2, MACS2) will still be downloaded dynamically by Snakemake at runtime.
2. Execute the Pipeline via Docker: Run the pipeline by mounting your workspace directory into the container:
docker run -it --rm \
-v $(pwd):/app \
-v /var/run/docker.sock:/var/run/docker.sock \
bdb-atacseq --use-conda --cores 8Tip
If running in a Docker-in-Docker environment, mounting the docker socket allows Snakemake to spin up tool containers from inside the host runner.
The pipeline is completely configuration-driven. You never need to modify the underlying Snakemake rules.
Every file path, genome size, and thread count is defined here. For example, MACS2 dynamically computes the genome size at parse time:
# The pipeline automatically parses the genome size from your reference files:
gsize=sum(int(line.strip().split()[1]) for line in open(config['global']['references']['genome_sizes']))You can override specific parameters on the fly without touching the main config.yaml file. The pipeline safely merges your overrides into memory.
# custom_override.yaml
qc_gate:
params:
min_frip: 0.0 # Turn off FRiP requirement for a quick test# Snakemake natively merges files from left to right
snakemake --configfile config.yaml custom_override.yaml --cores 8The framework supports 8 pre-configured profiles under the profile/ directory:
| Profile | Purpose | Environment |
|---|---|---|
local |
Up to 8 concurrent local jobs | Workstation |
slurm |
Cluster workload manager submission | HPC |
low_resource |
Caps memory to 4GB per rule | Laptop |
test |
Relaxed QC for synthetic CI datasets | CI/CD |
aws |
AWS Batch + S3 + Tibanna executor | Cloud |
gcp |
Google Life Sciences + GCS | Cloud |
azure |
Azure Batch + Blob storage | Cloud |
kubernetes |
Container-native K8s scaling | Cloud |
For massive sample datasets on limited-resource systems:
python3 rules/scripts/run_batched.py --batch-size 2 --cores 8 --profile profile/localSwitching the ATAC_MODE environment variable dictates which modality analysis track is executed:
| Stage | Bulk Mode (bulk) |
Single-Cell Mode (scatac) |
|---|---|---|
| Alignment | Bowtie2 (--very-sensitive) |
Chromap (--preset atac) |
| Filtering | MAPQ > 30, Fixmate, ENCODE Blacklist removal, Tn5 Shift | ArchR Arrow creation & doublet removal |
| Peak Calling | MACS2, IDR replicate concordance | ArchR marker peak identification |
| Co-accessibility | N/A | Cicero (500bp window, 250kb distance) |
| Differential | DESeq2 (FDR 0.05, log2FC 1.0) | ArchR cluster markers |
| Footprinting | HINT-ATAC & TOBIAS BINDetect | chromVAR motif accessibility |
The pipeline implements a hard gate (rules/scripts/parse_qc_metrics.py) before peak calling. Samples must pass four strict thresholds (configurable in config.yaml):
-
FRiP Score:
$\ge 0.2$ -
TSS Enrichment:
$\ge 7.0$ -
Mapping Rate:
$\ge 80.0%$ -
Duplicate Rate:
$\le 20.0%$
Samples that fail are documented in {sample}_qc_pass.txt and automatically bypassed for downstream footprinting/differential analysis to save compute time.
If a sample passes the gate but yields zero peaks after blacklist filtering, downstream rules (ChIPseeker, heatmaps, TOBIAS) will automatically detect the empty file. Rather than crashing the DAG, they utilize single-line Python routines to instantly generate valid dummy outputs (e.g., empty DataFrames, blank PDFs) and proceed.
On every run (success or failure), the pipeline generates a machine-readable execution summary at:
results/reporting/pipeline_execution_summary.json
This JSON includes per-rule CPU time, peak memory, and — on failure — the last 5 error lines extracted from logs/.
Synthetic data (for CI — no downloads needed):
python3 rules/scripts/generate_test_data.pyGenerates FASTQ, FASTA, GTF, and Bowtie2 indices with TSS-targeted reads sufficient to pass all QC gates.
Real data (subsampled ENCODE hg38, ~200 MB):
bash rules/scripts/download_real_data.shDownloads ENCSR356KRQ chr19+chrM, GENCODE v44 annotations, JASPAR motifs, and builds all indices.
All outputs are written to the results/ directory, cleanly organized by stage:
results/alignment/: Post-filtered, sorted, and Tn5-shifted BAMs.results/metrics_qc/: MultiQC HTML report aggregating FastQC, Preseq, Picard, and the QC Gate JSONs.results/peak_calling/: MACS2 narrowPeaks, consensus BEDs, and IDR sets.results/differential/: DESeq2 tables, Volcano/MA/PCA plots.results/footprinting/: TOBIAS bias-corrected BigWigs and BINDetect motif plots.benchmarks/: Memory and CPU time consumption for every single job executed.
GEOAgent bridge — convert GEO metadata directly into a pipeline run:
python3 rules/scripts/geo_agent_bridge.py path/to/ATAC_meta.csv --download--download fetches SRA FASTQs automatically. Outputs a ready-to-use config_geo.yaml and sample sheet.
LangChain tool wrapper — register the pipeline as an autonomous agent tool via rules/scripts/atacseq_tool.py. See the script docstring for the full API.
BDB-Genomics/atacseq-pipeline/
├── .github/workflows/ # CI/CD configuration (lint.yml)
├── assets/ # Banners, logos, visual assets
├── profile/
│ ├── local/ # Local execution profile
│ ├── slurm/ # SLURM cluster profile
│ ├── low_resource/ # ≤4GB RAM laptop profile
│ ├── test/ # CI test profile (relaxed QC)
│ ├── aws/ # AWS Batch + Tibanna profile
│ ├── gcp/ # Google Life Sciences profile
│ ├── azure/ # Azure Batch profile
│ └── kubernetes/ # Kubernetes cluster profile
├── rules/
│ ├── scripts/
│ │ ├── aggregate_logs.py # Structured JSON telemetry
│ │ ├── atacseq_tool.py # LangChain agent wrapper
│ │ ├── geo_agent_bridge.py # GEOAgent metadata importer
│ │ ├── validate_config.py # Pre-flight validation
│ │ ├── test_validate_config.py # Pytest suite (100+ assertions)
│ │ ├── generate_test_data.py # Synthetic CI data generator
│ │ ├── download_real_data.sh # ENCODE real-data sandbox
│ │ ├── run_batched.py # Low-memory batch executor
│ │ ├── run_tobias_atacorrect.py # TOBIAS ATACorrect wrapper
│ │ ├── run_tobias_score.py # TOBIAS ScoreBigwig wrapper
│ │ ├── parse_qc_metrics.py # QC gate metric parser
│ │ ├── zenodo_deposit.py # Zenodo publishing helper
│ │ └── [*.R] # R analysis scripts (ArchR, DESeq2, etc.)
│ ├── envs/ # Conda environment definitions (per-tool)
│ ├── config/ # Configuration schemas and templates
│ ├── [40+ .smk files] # Snakemake rules (alignment, QC, peaks, etc.)
│ └── template_tool.smk # Boilerplate for adding new tools
├── config.yaml # Main configuration file (single source of truth)
├── Snakefile # Main workflow entry point
├── Dockerfile # Container with micromamba + ENTRYPOINT
├── CITATION.cff # Citation metadata
├── CHANGELOG.md # Version history
├── CONTRIBUTING.md # Contribution guidelines
├── LICENSE # MIT License
├── test_envs.sh # Environment validation script
└── README.md # This file
Citation: Bhandary, H. (2026). BDB-Genomics ATAC-seq Framework (Version 3.0.0). https://github.com/BDB-Genomics/atacseq-pipeline
