Dataset prep code

This folder contains code to download RNA and ATAC datasets and create uniform input files for later benchmarking (the CELLxGENE download is not included here). It doesn't produce any figures itself, but these jobs collect the main data dependencies for the other benchmarks.

Initial download steps require ~175GB of storage for RNA and ~375GB of storage for ATAC. Processed data, stored as copies in multiple file formats, requires a total of ~1.4TB of storage for ATAC and ~900GB of storage for RNA.

datasets/atac-download:
- Download ATAC datasets, create BPCells unfiltered fragment files, and call cell barcodes passing quality filters
- Produces:
  - {DATA_ROOT}/atac/[dataset]/
    - fragments/[sample].fragments.tsv.gz: 10x-format fragment files
    - bpcells/[sample]/: BPCells-format fragment files
    - cell-barcodes.txt, cell-barcodes/[sample].txt: Cell barcodes passing quality filtering, or a directory of per-sample barcodes for multi-sample datasets (which will be consolidated into a single list during atac-process).
    - peaks.bed: Bed-format peak set
    - genome.txt: Genome used for the dataset
    - cell-barcodes-prefix.txt: "true" or "false", indicating whether cell barcodes need to be prefixed by sample names
  - {DATA_ROOT}/reference_metadata: Genome reference annotation files
- Task summary: 209/209 completed. Total time: 1 day, 23:03:07
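
To illustrate how the barcode outputs fit together, here is a minimal Python sketch that reads either the single `cell-barcodes.txt` list or the per-sample `cell-barcodes/` directory. It is not the code used by atac-process: the `load_barcodes` helper, the one-barcode-per-line layout, and especially the `sample#barcode` prefix format are assumptions made for illustration.

```python
from pathlib import Path


def load_barcodes(dataset_dir, sep="#"):
    """Return passing cell barcodes for one ATAC dataset directory.

    Hypothetical helper. `sep` is an assumed separator between the sample
    name and the barcode; the README does not specify the prefix format.
    """
    dataset_dir = Path(dataset_dir)

    # Single-sample datasets: one flat list of barcodes.
    single = dataset_dir / "cell-barcodes.txt"
    if single.exists():
        return single.read_text().split()

    # Multi-sample datasets: one file per sample, optionally prefixed.
    flag = dataset_dir / "cell-barcodes-prefix.txt"
    use_prefix = flag.exists() and flag.read_text().strip() == "true"

    barcodes = []
    for path in sorted((dataset_dir / "cell-barcodes").glob("*.txt")):
        sample = path.stem  # "[sample].txt" -> "[sample]"
        for barcode in path.read_text().split():
            barcodes.append(f"{sample}{sep}{barcode}" if use_prefix else barcode)
    return barcodes
```

Calling `load_barcodes` on a `{DATA_ROOT}/atac/[dataset]/` directory returns a single flat list in both the single-sample and multi-sample cases.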
datasets/atac-process:
- Process downloaded ATAC files into uniform inputs
- Requires: atac-download
- Produces:
  - {DATA_ROOT}/atac/[dataset]/
    - bpcells_filtered/[sample]: Filtered fragments with only cells passing QC
    - bpcells(_filtered)/merged: Fragment files containing a merge of all samples with or without QC filtering
    - archr/[sample].arrow, archr/[sample].filtered.arrow: ArchR files with and without filtering
    - snapatac2/[sample].h5ad, snapatac2/filtered.h5ad: SnapATAC2 files with and without filtering
    - matrices/: Peak and tile matrices in feature-major or cell-major (_transpose) order
      - peaks/[sample], peaks_transpose/[sample], tiles/[sample], tiles_transpose/[sample]: BPCells-format compressed peak and tile matrices per-sample
      - peaks/merged, peaks_transpose/merged, tiles/merged, tiles_transpose/merged: Whole dataset (merged) peak and tile matrices
- Task summary: 657/657 completed. Total time: 3 days, 12:31:39
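
The `matrices/` layout above is a cross product of matrix type, storage order, and sample (plus `merged`). The following sketch enumerates the expected directories so they can be checked after a run. The `matrix_dirs` helper and the example dataset and sample names are hypothetical; only the directory naming comes from the list above.

```python
from itertools import product
from pathlib import PurePosixPath


def matrix_dirs(data_root, dataset, samples):
    """List the BPCells matrix directories described for atac-process."""
    base = PurePosixPath(data_root) / "atac" / dataset / "matrices"
    kinds = ["peaks", "tiles"]
    orders = ["", "_transpose"]  # feature-major, then cell-major
    units = list(samples) + ["merged"]  # per-sample plus whole-dataset
    return [base / f"{kind}{order}" / unit
            for kind, order, unit in product(kinds, orders, units)]


# Example with made-up names: 2 kinds x 2 orders x (2 samples + merged) = 12
for path in matrix_dirs("/data", "example_dataset", ["sampleA", "sampleB"]):
    print(path)
```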
datasets/rna-download:
- Download RNA datasets
- Produces:
  - {DATA_ROOT}/rna/downloads/[dataset]/: Raw file downloads for each RNA benchmark dataset (excluding CELLxGENE)
- Task summary: 9/9 completed. Total time: 3:05:25
datasets/rna-process:
- Process downloaded RNA files into uniform inputs
- Requires: rna-download
- Produces:
  - {DATA_ROOT}/rna/[dataset]/:
    - 10x.h5: 10x-format HDF5 matrix file
    - bpcells: Cell-major BPCells counts matrix
    - bpcells_transpose: Feature-major BPCells counts matrix (note the meaning of _transpose is flipped relative to peak and tile matrices)
    - clusts.txt: Cluster assignments for each cell, one cluster ID per line in same order as the cells in the data matrices. Used for marker feature benchmarks
    - variable_genes.txt: Pre-calculated variable gene sets to keep PCA benchmark calculations exactly matching
    - svd.rds: RDS file containing the dimensionality reduction results used for cluster assignments
- Task summary: 9/9 completed. Total time: 8:14:54
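
Because `_transpose` means the opposite thing for RNA matrices and for ATAC peak and tile matrices, it can help to encode the convention once rather than infer it from the directory name. This is a small sketch of that lookup; the `storage_order` function is a hypothetical helper, not part of the repository.

```python
def storage_order(modality, dirname):
    """Return "cell-major" or "feature-major" for a matrix directory name.

    RNA:  bpcells -> cell-major,     bpcells_transpose -> feature-major
    ATAC: peaks/tiles -> feature-major, *_transpose    -> cell-major
    """
    transposed = dirname.endswith("_transpose")
    if modality == "rna":
        return "feature-major" if transposed else "cell-major"
    if modality == "atac":
        return "cell-major" if transposed else "feature-major"
    raise ValueError(f"unknown modality: {modality!r}")


print(storage_order("rna", "bpcells"))           # cell-major
print(storage_order("atac", "peaks_transpose"))  # cell-major
```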
datasets/dataset-stats:
- Collect basic size statistics on each input dataset
- Requires: atac-process, rna-process
- Produces:
  - {RESULTS_ROOT}/raw/datasets/dataset-stats/: Raw outputs
  - {RESULTS_ROOT}/data_tables/datasets/
- Task summary: 15/15 completed. Total time: 0:10:41
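
The "Requires" entries above define a small dependency graph between the five job folders. As a sketch of how a valid run order could be derived (this README does not say how the jobs are actually launched, so the `run_order` helper is an assumption), Python's standard-library `graphlib` can sort them:

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it requires, as listed in this README.
REQUIRES = {
    "atac-download": set(),
    "atac-process": {"atac-download"},
    "rna-download": set(),
    "rna-process": {"rna-download"},
    "dataset-stats": {"atac-process", "rna-process"},
}


def run_order(requires=REQUIRES):
    """Return one ordering in which every job follows its requirements."""
    return list(TopologicalSorter(requires).static_order())


print(run_order())
```

The ATAC and RNA branches are independent of each other, so several orderings are valid; only `dataset-stats` must come last.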
