# scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis
scBench is a benchmark of 195 verifiable problems derived from practical single-cell RNA-seq workflows. Each problem pairs a data snapshot (AnnData `.h5ad`) with a natural-language task prompt and a deterministic grader that maps the agent's structured output to pass/fail.
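As a rough sketch of that contract (the field names, values, and `grade` helper below are illustrative assumptions, not the actual scBench schema):

```python
# Illustrative only: field names and the grading rule are assumptions,
# not the actual scBench eval schema.
eval_spec = {
    "data": "snapshot.h5ad",              # AnnData snapshot the agent analyzes
    "prompt": "How many cells pass QC?",  # natural-language task
    "grader": {
        "type": "NumericTolerance",
        "key": "n_cells_retained",
        "expected": 14346,
        "rtol": 0.01,
    },
}


def grade(answer: dict, grader: dict) -> bool:
    # Deterministic: the same structured answer always yields the same verdict.
    got = answer[grader["key"]]
    return abs(got - grader["expected"]) <= grader["rtol"] * abs(grader["expected"])
```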
| Model | Harness | Accuracy (%) | Cost ($) |
|---|---|---|---|
| gpt-5.5 | mini-swe-agent | 57.95 | 1.1136 |
| gpt-5.5 | openai-codex | 57.78 | 2.4685 |
| gpt-5.4 | mini-swe-agent | 57.44 | 0.8240 |
| claude-opus-4-7 | mini-swe-agent | 55.21 | 1.5378 |
| claude-opus-4-7 | claude-code | 54.02 | 1.1465 |
| gemini-3.1-pro-preview | mini-swe-agent | 53.85 | 0.8948 |
| claude-opus-4-6 | mini-swe-agent | 52.65 | 1.1917 |
| gpt-5.2 | mini-swe-agent | 52.31 | 0.8874 |
| claude-sonnet-4-6 | mini-swe-agent | 50.26 | 0.9872 |
| claude-opus-4-5 | mini-swe-agent | 47.18 | 0.6553 |
| grok-4.20-beta-0309-reasoning | mini-swe-agent | 44.44 | 0.2957 |
| grok-4.3 | mini-swe-agent | 44.27 | 0.2147 |
| gpt-5.1 | mini-swe-agent | 38.80 | 0.2177 |
| claude-sonnet-4-5 | mini-swe-agent | 33.16 | 0.2682 |
| grok-4-1-fast-reasoning | mini-swe-agent | 30.26 | 0.0282 |
| gemini-2.5-pro | mini-swe-agent | 23.59 | 0.1368 |
Full results with per-task and per-platform breakdowns are in `results/`.
195 evaluations across:
- 6 platforms: BD Rhapsody, Chromium, CSGenetics, Illumina, MissionBio, ParseBio
- 6 task categories: QC, Normalization, Dimensionality Reduction, Clustering, Cell Typing, Differential Expression
Tasks require empirical interaction with the data: agents that rely on prior knowledge without performing the requisite analysis fail.
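For example, a QC-style question can only be answered by loading the snapshot and computing the metric. A minimal sketch with scanpy (the file name and the 200-gene cutoff are hypothetical, not taken from any scBench eval):

```python
import json
from pathlib import Path

import scanpy as sc

# Load the task's data snapshot and derive the answer empirically.
adata = sc.read_h5ad("snapshot.h5ad")  # hypothetical snapshot name
sc.pp.calculate_qc_metrics(adata, inplace=True)

# Hypothetical QC rule: keep cells with at least 200 detected genes.
kept = adata[adata.obs["n_genes_by_counts"] >= 200]
answer = {
    "n_cells_retained": int(kept.n_obs),
    "median_genes_per_cell": float(kept.obs["n_genes_by_counts"].median()),
}
Path("eval_answer.json").write_text(json.dumps(answer))
```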
Six canonical examples are in `evals/`. The sample covers all current platforms and task categories. The full 195-evaluation benchmark is withheld to prevent training contamination.
| Task | Platform | Eval |
|---|---|---|
| QC | BD Rhapsody | bd_rhapsody_tnbc_panel_aware_qc |
| Dimensionality Reduction | Chromium | dr_05_pca_preprocessing_sentinels |
| Normalization | CS Genetics | NRM01_sparse_normalization |
| Cell Typing | Illumina snRNA | T04a_endothelin_niche_sources |
| Clustering | Mission Bio Tapestri | tapestri_ccus_clustering_12_largest_mutant_clone |
| Differential Expression | Parse Bio | DE01_pseudobulk_de |
```bash
pip install -e .

# Validate an evaluation
scbench validate evals/qc/bd_rhapsody_tnbc_panel_aware_qc.json

# Run with mini-swe-agent
export ANTHROPIC_API_KEY=your_key
scbench run evals/qc/bd_rhapsody_tnbc_panel_aware_qc.json --agent minisweagent --model anthropic/claude-opus-4-5
```

To grade a custom agent, pass a callable to `EvalRunner`. The agent receives the task prompt and a working directory, and must write its structured answer to `eval_answer.json`:

```python
import json

from scbench import EvalRunner


def my_agent(task_prompt, work_dir):
    # A real agent would compute these values from the data in work_dir;
    # they are hardcoded here only to show the expected answer format.
    answer = {
        "n_pbmcs_retained": 14346,
        "median_genes_per_pbmc": 68,
        "n_monocytes_pbmc": 2592,
    }
    (work_dir / "eval_answer.json").write_text(json.dumps(answer))
    return answer


runner = EvalRunner("evals/qc/bd_rhapsody_tnbc_panel_aware_qc.json")
result = runner.run(agent_function=my_agent)
print(f"Passed: {result['passed']}")
```

Five grader families handle different answer types:
| Grader | Use Case |
|---|---|
| NumericTolerance | QC metrics, counts, expression values |
| MultipleChoice | Discrete interpretation questions |
| MarkerGenePrecisionRecall | Gene lists (P@K, R@K) |
| LabelSetJaccard | Cell type sets |
| DistributionComparison | Cell type proportions |
See `latch-eval-tools` for implementations.
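As a rough illustration of what three of these families compute (sketches under assumed signatures; the real graders in `latch-eval-tools` may differ and handle more edge cases):

```python
def numeric_tolerance(got: float, expected: float, rtol: float = 0.01) -> bool:
    # NumericTolerance: pass if within a relative tolerance of the reference.
    return abs(got - expected) <= rtol * abs(expected)


def label_set_jaccard(got: set[str], expected: set[str]) -> float:
    # LabelSetJaccard: overlap between predicted and reference cell type sets.
    return len(got & expected) / len(got | expected)


def precision_recall_at_k(ranked: list[str], reference: set[str], k: int):
    # MarkerGenePrecisionRecall: P@K and R@K over the top-K predicted markers.
    hits = len(set(ranked[:k]) & reference)
    return hits / k, hits / len(reference)
```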
```bibtex
@article{scbench2026,
  title={scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis},
  author={Workman, Kenny and Yang, Zhen and Muralidharan, Harihara and Abdulali, Aidan and Le, Hannah},
  year={2026},
  note={LatchBio}
}
```

Licensed under Apache 2.0.