Ultra-high throughput processing for 10x Genomics Flex single-cell sequencing.
cyto is a fast, memory-efficient processor for 10x Genomics Flex single-cell RNA sequencing data, designed specifically for production-scale analysis. It handles:
- Gene expression profiling from FFPE samples and fresh tissue
- Highly multiplexed experiments (16-plex Flex-V1 / 384-plex Flex-V2)
- CRISPR perturbation screens (Perturb-seq) with efficient guide assignment
- Probe-based multiplexing for clinical and archived samples
cyto achieves dramatic performance improvements through algorithmic innovations optimized for Flex's fixed sequence geometry, making previously prohibitive experiments computationally feasible.
- Ultra-fast processing: Processes 320k-cell datasets in minutes rather than hours
- Memory efficient: Runs on smaller cloud instances with reduced resource requirements
- Highly accurate: 99.85% concordance with standard CellRanger outputs, identical cell clustering
- Modular architecture: Independent, composable tools for flexible workflows
- Production-ready: Built for atlas-scale projects and genome-wide screens
- BINSEQ support: Efficient binary format for highly parallel sequence parsing
- Compact IBU format: Binary Index-Barcode-UMI storage for efficient read processing
Note: This crate uses SIMD instructions for improved performance. To take advantage of the SIMD instructions available on your machine, set the RUSTFLAGS environment variable before compiling, as in the commands below.
Install via cargo:
export RUSTFLAGS="-C target-cpu=native"
cargo install cyto
Or from source:
git clone https://github.com/arcinstitute/cyto
cd cyto
# install with cargo
export RUSTFLAGS="-C target-cpu=native"
cargo install --path crates/cyto
# or with just
just install
Process Flex gene expression data with probe demultiplexing:
cyto workflow gex \
-c gene_probes.tsv \
-w cell_barcode_whitelist.txt \
-p probe_barcodes.txt \
-o output_dir \
sample.cbq
Or without probe demultiplexing (single-plex or custom geometry):
cyto workflow gex \
-c gene_probes.tsv \
-w cell_barcode_whitelist.txt \
--geometry "[barcode][umi:12] | [gex]" \
-o output_dir \
sample.cbq
Process Perturb-seq data with guide assignment:
cyto workflow crispr \
-c guide_library.tsv \
-w cell_barcode_whitelist.txt \
-p probe_barcodes.txt \
-o output_dir \
sample.cbq
Or without probe demultiplexing:
cyto workflow crispr \
-c guide_library.tsv \
-w cell_barcode_whitelist.txt \
--geometry "[barcode][umi:12] | [:26][anchor][protospacer]" \
-o output_dir \
sample.cbq
Both workflows automatically handle:
- Read mapping to features (with integrated barcode correction)
- UMI deduplication
- Molecule counting
- Guide assignment (CRISPR mode)
Workflows generate organized outputs. With probe demultiplexing, one IBU file and one count file are generated per probe:
output_dir/
├── metadata/
│ └── features.tsv # Feature index
├── stats/
│ └── mapping.json # Mapping statistics
├── ibu/
│ ├── probe1.sort.ibu # Processed IBU files
│ └── probe2.sort.ibu # (one per probe)
└── counts/
├── probe1.counts.tsv # Count matrices
└── probe2.counts.tsv # (one per probe)
Without probes, a single IBU file and a single count file are generated:
output_dir/
├── metadata/
│ └── features.tsv # Feature index
├── stats/
│ └── mapping.json # Mapping statistics
├── ibu/
│ └── output.sort.ibu # Single IBU file
└── counts/
└── output.counts.tsv # Single count matrix
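Because both layouts place count files under counts/ with a .counts.tsv suffix, downstream scripts can discover them the same way in either mode. The following is a minimal sketch, assuming only the directory layout shown above; the function name collect_counts is hypothetical and not part of cyto or pycyto.

```python
from pathlib import Path

COUNT_SUFFIX = ".counts.tsv"


def collect_counts(output_dir):
    """Map each probe (or 'output') name to its count file in output_dir/counts."""
    counts_dir = Path(output_dir) / "counts"
    return {
        path.name[: -len(COUNT_SUFFIX)]: path
        for path in sorted(counts_dir.glob(f"*{COUNT_SUFFIX}"))
    }
```

With probe demultiplexing the keys would be the probe names (probe1, probe2, ...); without it, the single key would be output.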
Gene Expression (-c flag) - 3-column TSV, no header:
ENSG00000000003 TSPAN6 ACGTACGTACGTACGT
ENSG00000000005 TNMD TGCATGCATGCATGCA
Columns: Gene ID | Gene Name | Probe Sequence
CRISPR Guides (-c flag) - 3-column TSV, no header:
gene1_guide1 GGGGCCCC ACGTACGTACGTACGTACGT
gene1_guide2 GGGGCCCC TGCATGCATGCATGCATGCA
Columns: Guide Name | Anchor Sequence | Protospacer Sequence
For multiplexed experiments (-p flag) - 3-column TSV, no header:
ACGTACGT BC001 ProbeSet1
TGCATGCA BC002 ProbeSet2
Columns: True Sequence | Alias | Probe Name
Note: Probe sequences should match those provided by 10x Genomics for your specific chemistry.
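Before launching a long run, it can help to check that the probe barcode file really is a headerless 3-column TSV. The sketch below assumes only the column order described above (True Sequence, Alias, Probe Name); the function load_probe_barcodes is hypothetical, and rejecting duplicate sequences is a choice made for this example, not documented cyto behavior.

```python
import csv
import io


def load_probe_barcodes(handle):
    """Read a headerless 3-column TSV into {sequence: (alias, probe_name)}."""
    probes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        sequence, alias, name = row
        if sequence in probes:
            # Assumption of this sketch: each true sequence is listed once.
            raise ValueError(f"duplicate probe sequence: {sequence}")
        probes[sequence] = (alias, name)
    return probes


example = io.StringIO("ACGTACGT\tBC001\tProbeSet1\nTGCATGCA\tBC002\tProbeSet2\n")
probes = load_probe_barcodes(example)
print(probes["ACGTACGT"])  # ('BC001', 'ProbeSet1')
```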
Standard 10x barcode whitelist (-w flag):
# Example: 737K barcode list for GEM-X
-w 737K-fixed-rna-profiling.txt.gz
cyto accepts both FASTQ and BINSEQ formats:
# BINSEQ (recommended - faster parsing)
cyto workflow gex -c probes.tsv -w whitelist.txt sample.cbq
# FASTQ paired-end
cyto workflow gex -c probes.tsv -w whitelist.txt sample_R1.fastq.gz sample_R2.fastq.gz
If you have a large collection of sequence files that can be processed as a single input, you can provide them all on the CLI:
# BINSEQ
cyto workflow gex -c probes.tsv -w whitelist.txt *.cbq
# FASTQ paired-end
cyto workflow gex -c probes.tsv -w whitelist.txt *.fastq.gz
cyto uses a flexible and simple Domain-Specific Language (DSL) for specifying read geometries.
This allows you to describe custom experimental designs that differ from the standard 10x sequence structures.
For standard 10x chemistries, use the --preset flag:
# Standard GEX (Flex V1)
cyto workflow gex --preset gex-v1 ...
# GEX with probe on R1 (Flex V2)
cyto workflow gex --preset gex-v2 ...
# CRISPR (standard)
cyto workflow crispr --preset crispr-v1 ...
Available presets:
| Preset | Geometry |
|---|---|
| gex-v1 | `[barcode][umi:12] \| [gex][:18][probe]` |
| gex-v2 | `[barcode][umi:12][:10][probe] \| [gex]` |
| crispr-v1 | `[barcode][umi:12] \| [probe][anchor][protospacer]` |
| crispr-v2 | `[barcode][umi:12][:10][probe] \| [:14][anchor][protospacer]` |
Note: Whitespace is allowed between components and separators.
For non-standard designs, use the --geometry flag with a DSL string:
cyto workflow gex --geometry "[barcode][umi:12]|[gex][:18][probe]" ...
A geometry string describes the structure of paired-end reads (R1 and R2), separated by |:
[R1 regions]|[R2 regions]
Components are functional elements enclosed in brackets:
| Component | Alias | Description | Length |
|---|---|---|---|
| `[barcode]` | `[bc]` | Cell barcode | Inferred from whitelist |
| `[umi:N]` | none | Unique Molecular Identifier | Required (e.g., `[umi:12]`) |
| `[probe]` | none | Flex probe barcode | Inferred from probe file |
| `[gex]` | none | Gene expression sequence | Inferred from library |
| `[anchor]` | none | CRISPR anchor sequence | Inferred from guide library |
| `[protospacer]` | none | CRISPR protospacer | Inferred from guide library |
Skip regions are anonymous spacers with explicit lengths:
[:N] # Skip N bases (e.g., [:18] skips 18 bases)
Standard GEX with Flex probe:
[barcode][umi:12]|[gex][:18][probe]
- R1: 16bp barcode, 12bp UMI
- R2: Gene expression sequence, 18bp spacer, 8bp probe
GEX V2 (probe on R1):
[barcode][umi:12][:10][probe]|[gex]
- R1: 16bp barcode, 12bp UMI, 10bp spacer, 8bp probe
- R2: Gene expression sequence
CRISPR with custom spacing:
[barcode][umi:12]|[:20][probe][:6][anchor][protospacer]
- R1: 16bp barcode, 12bp UMI
- R2: 20bp offset, probe, 6bp spacer, anchor, protospacer
- Each component can only appear once across both reads
- [umi:N] requires an explicit length
- Other components infer their lengths from reference files
- Skip regions must have length > 0
- Whitespace between brackets is ignored
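The rules above can be checked mechanically before a run. The following is a small Python sketch of a validator for geometry strings, written only from the rules and component table in this document; it is not cyto's parser (cyto itself is written in Rust). The names parse_read and parse_geometry are hypothetical, and rejecting an explicit length on components other than umi is an assumption of this sketch.

```python
import re

NAMED = {"barcode", "umi", "probe", "gex", "anchor", "protospacer"}
ALIASES = {"bc": "barcode"}
# One bracketed region: optional lowercase name, optional ":N" length.
REGION = re.compile(r"\[\s*([a-z]*)\s*(?::\s*(\d+)\s*)?\]")


def parse_read(text):
    """Parse one read into (name, length) pairs; name None marks a skip region."""
    regions, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace between brackets is ignored
            pos += 1
            continue
        match = REGION.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse region at offset {pos}: {text[pos:]!r}")
        name = ALIASES.get(match.group(1), match.group(1))
        length = int(match.group(2)) if match.group(2) else None
        if name == "":
            if not length:  # missing or zero length
                raise ValueError("skip regions must have length > 0")
        elif name not in NAMED:
            raise ValueError(f"unknown component: {name}")
        elif name == "umi" and length is None:
            raise ValueError("[umi:N] requires an explicit length")
        elif name != "umi" and length is not None:
            # Assumption: only umi and skip regions carry an explicit length.
            raise ValueError(f"[{name}] infers its length from reference files")
        regions.append((name or None, length))
        pos = match.end()
    return regions


def parse_geometry(geometry):
    """Split on '|' into R1 and R2 and enforce one use per named component."""
    parts = geometry.split("|")
    if len(parts) != 2:
        raise ValueError("geometry must have exactly one '|' separating R1 and R2")
    r1, r2 = parse_read(parts[0]), parse_read(parts[1])
    seen = set()
    for name, _ in r1 + r2:
        if name is not None and name in seen:
            raise ValueError(f"component used more than once: {name}")
        seen.add(name)
    return r1, r2


r1, r2 = parse_geometry("[barcode][umi:12] | [gex][:18][probe]")
print(r1)  # [('barcode', None), ('umi', 12)]
print(r2)  # [('gex', None), (None, 18), ('probe', None)]
```

A length of None in the result means the length would be inferred from the relevant reference file (whitelist, probe file, or library).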
The --remap-window flag controls position tolerance for element matching:
# Allow +/- 2 position adjustment on failed matches
cyto workflow gex --geometry "..." --remap-window 2 ...
# Exact positions only
cyto workflow gex --geometry "..." --remap-window 0 ...
Note: Default is 1. For V2 presets, this is automatically set to 5.
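To make the idea of a position window concrete, here is a conceptual Python sketch: an element is first looked up at its expected position, and only if that fails are shifted positions up to the window size tried. This illustrates the concept only. The function match_with_window, the order in which offsets are tried, and the use of exact set membership are all assumptions of the sketch, not a description of cyto's implementation.

```python
def match_with_window(read, start, length, valid, window):
    """Return (sequence, offset) for the first hit, or None if nothing matches."""
    offsets = [0]  # exact position first
    for step in range(1, window + 1):
        offsets += [-step, step]  # then progressively larger shifts
    for offset in offsets:
        begin = start + offset
        if begin < 0 or begin + length > len(read):
            continue  # shifted window would fall off the read
        candidate = read[begin : begin + length]
        if candidate in valid:
            return candidate, offset
    return None


read = "TTACGTACGTGG"
valid = {"ACGTACGT"}
print(match_with_window(read, 3, 8, valid, window=0))  # None
print(match_with_window(read, 3, 8, valid, window=1))  # ('ACGTACGT', -1)
```

With a window of 0 the one-base shift in this read causes a miss; with a window of 1 the element is recovered at offset -1.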
Pro-Tip: If you're unsure about spacer lengths for your library, use bqtools grep to visualize your sequences:
bqtools grep <input.cbq> <anchor_sequence> <probe_sequence>
This highlights the sequences in your reads so you can count the bases between them.
For advanced users, cyto exposes individual processing steps:
# 1. Map reads to features (includes barcode correction)
# With probe demultiplexing:
cyto map gex --preset gex-v1 -c probes.tsv -p probe_barcodes.txt -w whitelist.txt -o map_out sample.cbq
# Or without probes:
cyto map gex --geometry "[barcode][umi:12] | [gex]" -c probes.tsv -w whitelist.txt -o map_out sample.cbq
# 2. Sort IBU files
cyto ibu sort -i map_out/ibu/probe1.ibu -o probe1.sorted.ibu
# 3. Correct UMIs
cyto ibu umi -i probe1.sorted.ibu -o probe1.umi.ibu
# 4. Count molecules
cyto ibu count -i probe1.umi.ibu -f map_out/metadata/features.tsv -o counts.tsv
This modular design allows:
- Custom processing pipelines
- Integration with orchestration tools (Snakemake, Nextflow)
- Independent scaling of pipeline components
- Checkpointing and resumption
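As one way to script the per-probe steps after mapping, the sketch below builds the sort, umi, and count commands as argument lists and runs them in order. It uses only the subcommands and flags shown in the steps above; the helper names modular_steps and run_steps and the per-probe output file names are hypothetical choices for this example.

```python
import subprocess


def modular_steps(probe, map_dir="map_out"):
    """Build the sort -> umi -> count commands for one probe's IBU file."""
    sorted_ibu = f"{probe}.sorted.ibu"
    umi_ibu = f"{probe}.umi.ibu"
    return [
        ["cyto", "ibu", "sort", "-i", f"{map_dir}/ibu/{probe}.ibu", "-o", sorted_ibu],
        ["cyto", "ibu", "umi", "-i", sorted_ibu, "-o", umi_ibu],
        ["cyto", "ibu", "count", "-i", umi_ibu,
         "-f", f"{map_dir}/metadata/features.tsv", "-o", f"{probe}.counts.tsv"],
    ]


def run_steps(steps):
    """Run each command in order, stopping at the first failure."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

Calling run_steps(modular_steps("probe1")) would execute the three commands for probe1, assuming cyto is on the PATH and the mapping step has already written map_out/. Because each step reads and writes named files, a workflow manager can treat each command as a separate rule.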
Control parallelization with -T:
# Use all available cores
cyto workflow gex --preset gex-v1 -c probes.tsv -w whitelist.txt -T0 sample.cbq
# Use specific number of threads
cyto workflow gex --preset gex-v1 -c probes.tsv -w whitelist.txt -T32 sample.cbq
# Single-threaded (minimal resources)
cyto workflow gex --preset gex-v1 -c probes.tsv -w whitelist.txt -T1 sample.cbq
Default: All available threads
Tab-separated sparse matrix:
barcode feature count
ACGTACGT ENSG00000000003 5
ACGTACGT ENSG00000000005 12
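A long-format table like this can be read into a nested mapping with the standard library alone. The sketch below assumes the file is tab-separated and starts with the header row shown in the example; pass has_header=False if your files have none. The function load_counts is hypothetical and not part of pycyto.

```python
import csv
import io
from collections import defaultdict


def load_counts(handle, has_header=True):
    """Read barcode/feature/count rows into {barcode: {feature: count}}."""
    counts = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the 'barcode feature count' line
    for row in reader:
        if not row:
            continue
        barcode, feature, count = row
        counts[barcode][feature] = int(count)
    return dict(counts)


example = io.StringIO(
    "barcode\tfeature\tcount\n"
    "ACGTACGT\tENSG00000000003\t5\n"
    "ACGTACGT\tENSG00000000005\t12\n"
)
counts = load_counts(example)
print(counts["ACGTACGT"]["ENSG00000000005"])  # 12
```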
For downstream analysis with scanpy/Seurat:
cyto ibu count -i sample.ibu -f features.tsv -o counts_mtx --format mtx
Generates:
- matrix.mtx - Sparse count matrix
- barcodes.tsv - Cell barcodes
- features.tsv - Feature names
Use pycyto utilities for format conversion and aggregation:
# Convert MTX to h5ad
pycyto mtx-to-h5ad counts_mtx/ output.h5ad
# Aggregate cyto output into a single h5ad per sample
pycyto aggregate <config>.json <cyto_output_dir> <aggr_dir>
The CRISPR workflow includes automatic guide assignment using the geomux algorithm, which:
- Scales linearly with data sparsity (not total dimensions)
- Handles multi-guide perturbations
- Works on unfiltered cells (no pre-filtering needed)
- Performs hypergeometric testing with FDR correction
Guide assignments are included in the count matrix output.
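For readers unfamiliar with the two statistical ingredients named above, the following Python sketch shows a generic upper-tail hypergeometric test and a Benjamini-Hochberg FDR adjustment. It is a textbook illustration, not geomux's code: how geomux maps guide counts onto the test's parameters is not described here, and the function names are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p = hypergeom_sf(2, population=10, successes=3, draws=4)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

In this toy case p is 70/210, or one third, and the smallest raw p-value of 0.01 is adjusted to 0.04.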
cyto is optimized for:
- Fixed-geometry protocols: Flex libraries with predetermined sequence structures
- Multiplexed datasets: Efficient probe demultiplexing at scale
- Large-scale screens: Million-cell perturbation experiments
cyto is not designed for:
- Splice-aware alignment (use STAR, kallisto|bustools, Alevin-fry)
- Transcript discovery or quantification
- Variable read architectures
- Full-length transcript sequencing
All components are available under the MIT license:
- cyto: https://github.com/arcinstitute/cyto
- pycyto utilities: https://github.com/arcinstitute/pycyto
- geomux: https://github.com/noamteyssier/geomux
- cell-filter: https://github.com/arcinstitute/cell-filter
- IBU format: https://github.com/noamteyssier/ibu
Rust packages on crates.io | Python packages on PyPI
If you use cyto in your research, please cite our bioRxiv preprint:
Teyssier, N. and Dobin, A. (2025). cyto: ultra high-throughput processing
of 10x-flex single cell sequencing. bioRxiv.
- Issues: https://github.com/arcinstitute/cyto/issues
- Documentation: See --help for any command
- Examples: See justfile for complete workflows
Developed at Arc Institute with support for computational resources.