cyto

Ultra-high throughput processing for 10x Genomics Flex single-cell sequencing.

Overview

cyto is a fast, memory-efficient processor for 10x Genomics Flex single-cell RNA sequencing data, designed specifically for production-scale analysis. It handles:

Gene expression profiling from FFPE samples and fresh tissue
Highly multiplexed experiments (16-plex Flex-V1 / 384-plex Flex-V2)
CRISPR perturbation screens (Perturb-seq) with efficient guide assignment
Probe-based multiplexing for clinical and archived samples

cyto achieves dramatic performance improvements through algorithmic innovations optimized for Flex's fixed sequence geometry, making previously prohibitive experiments computationally feasible.

Key Features

Ultra-fast processing: Processes 320k-cell datasets in minutes rather than hours
Memory efficient: Runs on smaller cloud instances with reduced resource requirements
Highly accurate: 99.85% concordance with standard CellRanger outputs, identical cell clustering
Modular architecture: Independent, composable tools for flexible workflows
Production-ready: Built for atlas-scale projects and genome-wide screens
BINSEQ support: Efficient binary format for highly parallel sequence parsing
Compact IBU format: Binary Index-Barcode-UMI storage for efficient read processing

Installation

Note: This crate makes use of SIMD instructions for improved performance. To make sure you take advantage of SIMD instructions on your machine set the following environment variable before compiling:

Install via cargo:

export RUSTFLAGS="-C target-cpu=native";
cargo install cyto

Or from source:

git clone https://github.com/arcinstitute/cyto
cd cyto

# install with cargo
export RUSTFLAGS="-C target-cpu=native"
cargo install --path crates/cyto

# or with just
just install

Quick Start

Gene Expression Workflow

Process Flex gene expression data with probe demultiplexing:

cyto workflow gex \
    -c gene_probes.tsv \
    -w cell_barcode_whitelist.txt \
    -p probe_barcodes.txt \
    -o output_dir \
    sample.cbq

Or without probe demultiplexing (single-plex or custom geometry):

cyto workflow gex \
    -c gene_probes.tsv \
    -w cell_barcode_whitelist.txt \
    --geometry "[barcode][umi:12] | [gex]" \
    -o output_dir \
    sample.cbq

CRISPR Screen Workflow

Process Perturb-seq data with guide assignment:

cyto workflow crispr \
    -c guide_library.tsv \
    -w cell_barcode_whitelist.txt \
    -p probe_barcodes.txt \
    -o output_dir \
    sample.cbq

Or without probe demultiplexing:

cyto workflow crispr \
    -c guide_library.tsv \
    -w cell_barcode_whitelist.txt \
    --geometry "[barcode][umi:12] | [:26][anchor][protospacer]" \
    -o output_dir \
    sample.cbq

Both workflows automatically handle:

Read mapping to features (with integrated barcode correction)
UMI deduplication
Molecule counting
Guide assignment (CRISPR mode)

Output Structure

Workflows generate organized outputs. With probe demultiplexing, one IBU and count file is generated per probe:

output_dir/
├── metadata/
│   └── features.tsv         # Feature index
├── stats/
│   └── mapping.json         # Mapping statistics
├── ibu/
│   ├── probe1.sort.ibu      # Processed IBU files
│   └── probe2.sort.ibu      # (one per probe)
└── counts/
    ├── probe1.counts.tsv    # Count matrices
    └── probe2.counts.tsv    # (one per probe)

Without probes, a single IBU and count file is generated:

output_dir/
├── metadata/
│   └── features.tsv         # Feature index
├── stats/
│   └── mapping.json         # Mapping statistics
├── ibu/
│   └── output.sort.ibu      # Single IBU file
└── counts/
    └── output.counts.tsv    # Single count matrix

Input Formats

Feature Libraries

Gene Expression (-c flag) - 3-column TSV, no header:

ENSG00000000003    TSPAN6       ACGTACGTACGTACGT
ENSG00000000005    TNMD         TGCATGCATGCATGCA

Columns: Gene ID | Gene Name | Probe Sequence

CRISPR Guides (-c flag) - 3-column TSV, no header:

gene1_guide1    GGGGCCCC    ACGTACGTACGTACGTACGT
gene1_guide2    GGGGCCCC    TGCATGCATGCATGCATGCA

Columns: Guide Name | Anchor Sequence | Protospacer Sequence

Probe Barcodes (Optional)

For multiplexed experiments (-p flag) - 3-column TSV, no header:

ACGTACGT    BC001    ProbeSet1
TGCATGCA    BC002    ProbeSet2

Columns: True Sequence | Alias | Probe Name

Note: Probe sequences should match those provided by 10x Genomics for your specific chemistry.

Cell Barcode Whitelist

Standard 10x barcode whitelist (-w flag):

# Example: 737K barcode list for GEM-X
-w 737K-fixed-rna-profiling.txt.gz

Sequence Files

cyto accepts both FASTQ and BINSEQ formats:

# BINSEQ (recommended - faster parsing)
cyto workflow gex -c probes.tsv -w whitelist.txt sample.cbq

# FASTQ paired-end
cyto workflow gex -c probes.tsv -w whitelist.txt sample_R1.fastq.gz sample_R2.fastq.gz

If you have a large collection of sequence files that can be processed as a single input you can provide them all on the CLI:

# BINSEQ
cyto workflow gex -c probes.tsv -w whitelist.txt *.cbq

# FASTQ paired-end
cyto workflow gex -c probes.tsv -w whitelist.txt *.fastq.gz

Advanced Usage

Geometry DSL

cyto uses a flexible and simple Domain-Specific Language (DSL) for specifying read geometries. This allows you to describe custom experimental designs that differ from the standard 10x sequence structures.

Quick Start: Using Presets

For standard 10x chemistries, use the --preset flag:

# Standard GEX (Flex V1)
cyto workflow gex --preset gex-v1 ...

# GEX with probe on R1 (Flex V2)
cyto workflow gex --preset gex-v2 ...

# CRISPR (standard)
cyto workflow crispr --preset crispr-v1 ...

Available presets:

Preset	Geometry
`gex-v1`	`[barcode][umi:12] \| [gex][:18][probe]`
`gex-v2`	`[barcode][umi:12][:10][probe] \| [gex]`
`crispr-v1`	`[barcode][umi:12] \| [probe][anchor][protospacer]`
`crispr-v2`	`[barcode][umi:12][:10][probe] \| [:14][anchor][protospacer]`

Note: White space is allowed between components and separators.

Custom Geometries

For non-standard designs, use the --geometry flag with a DSL string:

cyto workflow gex --geometry "[barcode][umi:12]|[gex][:18][probe]" ...

DSL Syntax

A geometry string describes the structure of paired-end reads (R1 and R2), separated by |:

[R1 regions]|[R2 regions]

Components are functional elements enclosed in brackets:

Component	Alias	Description	Length
`[barcode]`	`[bc]`	Cell barcode	Inferred from whitelist
`[umi:N]`	—	Unique Molecular Identifier	Required (e.g., `[umi:12]`)
`[probe]`	—	Flex probe barcode	Inferred from probe file
`[gex]`	—	Gene expression sequence	Inferred from library
`[anchor]`	—	CRISPR anchor sequence	Inferred from guide library
`[protospacer]`	—	CRISPR protospacer	Inferred from guide library

Skip regions are anonymous spacers with explicit lengths:

[:N]    # Skip N bases (e.g., [:18] skips 18 bases)

Examples

Standard GEX with Flex probe:

[barcode][umi:12]|[gex][:18][probe]

R1: 16bp barcode, 12bp UMI
R2: Gene expression sequence, 18bp spacer, 8bp probe

GEX V2 (probe on R1):

[barcode][umi:12][:10][probe]|[gex]

R1: 16bp barcode, 12bp UMI, 10bp spacer, 8bp probe
R2: Gene expression sequence

CRISPR with custom spacing:

[barcode][umi:12]|[:20][probe][:6][anchor][protospacer]

R1: 16bp barcode, 12bp UMI
R2: 20bp offset, probe, 6bp spacer, anchor, protospacer

Validation Rules

Each component can only appear once across both reads
[umi:N] requires an explicit length
Other components infer their lengths from reference files
Skip regions must have length > 0
Whitespace between brackets is ignored

Remap Window

The --remap-window flag controls position tolerance for element matching:

# Allow +/- 2 position adjustment on failed matches
cyto workflow gex --geometry "..." --remap-window 2 ...

# Exact positions only
cyto workflow gex --geometry "..." --remap-window 0 ...

Note: Default is 1. For V2 presets, this is automatically set to 5.

Pro-Tip: If you're unsure about spacer lengths for your library, use bqtools grep to visualize your sequences:
bqtools grep <input.cbq> <anchor_sequence> <probe_sequence>
This highlights the sequences in your reads so you can count the bases between them.

Modular Pipeline

For advanced users, cyto exposes individual processing steps:

# 1. Map reads to features (includes barcode correction)
# With probe demultiplexing:
cyto map gex --preset gex-v1 -c probes.tsv -p probe_barcodes.txt -w whitelist.txt -o map_out sample.cbq
# Or without probes:
cyto map gex --geometry "[barcode][umi:12] | [gex]" -c probes.tsv -w whitelist.txt -o map_out sample.cbq

# 2. Sort IBU files
cyto ibu sort -i map_out/ibu/probe1.ibu -o probe1.sorted.ibu

# 3. Correct UMIs
cyto ibu umi -i probe1.sorted.ibu -o probe1.umi.ibu

# 4. Count molecules
cyto ibu count -i probe1.umi.ibu -f map_out/metadata/features.tsv -o counts.tsv

This modular design allows:

Custom processing pipelines
Integration with orchestration tools (Snakemake, Nextflow)
Independent scaling of pipeline components
Checkpointing and resumption

Multi-threading

Control parallelization with -T:

# Use all available cores
cyto workflow gex --preset gex-v1 -c probes.tsv -w whitelist.txt -T0 sample.cbq

# Use specific number of threads
cyto workflow gex --preset gex-v1 -c probes.tsv -w whitelist.txt -T32 sample.cbq

# Single-threaded (minimal resources)
cyto workflow gex --preset gex-v1 -c probes.tsv -w whitelist.txt -T1 sample.cbq

Default: All available threads

Output Formats

TSV Format (default)

Tab-separated sparse matrix:

barcode    feature    count
ACGTACGT   ENSG00000000003   5
ACGTACGT   ENSG00000000005   12

Matrix Market Format

For downstream analysis with scanpy/Seurat:

cyto ibu count -i sample.ibu -f features.tsv -o counts_mtx --format mtx

Generates:

matrix.mtx - Sparse count matrix
barcodes.tsv - Cell barcodes
features.tsv - Feature names

Convert to h5ad

Use pycyto utilities for format conversion and aggregation:

# Convert MTX to h5ad
pycyto mtx-to-h5ad counts_mtx/ output.h5ad

# Aggregate cyto output into a single h5ad per sample
pycyto aggregate <config>.json <cyto_output_dir> <aggr_dir>

Guide Assignment (Perturb-seq)

The CRISPR workflow includes automatic guide assignment using the geomux algorithm, which:

Scales linearly with data sparsity (not total dimensions)
Handles multi-guide perturbations
Works on unfiltered cells (no pre-filtering needed)
Performs hypergeometric testing with FDR correction

Guide assignments are included in the count matrix output.

Performance Considerations

cyto is optimized for:

Fixed-geometry protocols: Flex libraries with predetermined sequence structures
Multiplexed datasets: Efficient probe demultiplexing at scale
Large-scale screens: Million-cell perturbation experiments

cyto is not designed for:

Splice-aware alignment (use STAR, kallisto|bustools, Alevin-fry)
Transcript discovery or quantification
Variable read architectures
Full-length transcript sequencing

Software Availability

All components are available under the MIT license:

cyto: https://github.com/arcinstitute/cyto
pycyto utilities: https://github.com/arcinstitute/pycyto
geomux: https://github.com/noamteyssier/geomux
cell-filter: https://github.com/arcinstitute/cell-filter
IBU format: https://github.com/noamteyssier/ibu

Rust packages on crates.io | Python packages on PyPI

Citation

If you use cyto in your research, please cite our BioRxiv preprint:

Teyssier, N. and Dobin, A. (2025). cyto: ultra high-throughput processing
of 10x-flex single cell sequencing. bioRxiv.

Support

Issues: https://github.com/arcinstitute/cyto/issues
Documentation: See --help for any command
Examples: See justfile for complete workflows

Acknowledgements

Developed at Arc Institute with support for computational resources.

Name		Name	Last commit message	Last commit date
Latest commit History 753 Commits
.github/workflows		.github/workflows
crates		crates
data		data
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Cargo.toml		Cargo.toml
LICENSE.md		LICENSE.md
README.md		README.md
justfile		justfile

Folders and files

Latest commit

History

Repository files navigation

cyto

Overview

Key Features

Installation

Quick Start

Gene Expression Workflow

CRISPR Screen Workflow

Output Structure

Input Formats

Feature Libraries

Probe Barcodes (Optional)

Cell Barcode Whitelist

Sequence Files

Advanced Usage

Geometry DSL

Quick Start: Using Presets

Custom Geometries

DSL Syntax

Examples

Validation Rules

Remap Window

Modular Pipeline

Multi-threading

Output Formats

TSV Format (default)

Matrix Market Format

Convert to h5ad

Guide Assignment (Perturb-seq)

Performance Considerations

Software Availability

Citation

Support

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages