Benchmark + in-place pipeline: SingleRust vs scanpy (PCA ~6×, ~3× overall, + scaling curve) by iandriver · Pull Request #1 · iandriver/SingleRust

iandriver · 2026-06-15T14:11:42Z

Summary

A focused demonstration of SingleRust's performance and its in-place processing
ergonomics:

examples/inplace_pipeline.rs — a minimal, annotated pipeline that loads an AnnData once and
mutates it through every step (no copies).
demo/scverse_benchmark.ipynb + examples/bench_step.rs — a per-step runtime benchmark vs
the equivalent scanpy function on a ~50k-cell CZI CELLxGENE dataset.
demo/scverse_scaling.ipynb — a cells-vs-runtime scaling sweep (3k → 50k).

Stacked PR: base is integration/features-only (write_h5ad, run_pca_inplace, dense
normalize/log1p, ORA, QC fix), so the diff here is just the benchmark, scaling, the in-place
example, and supporting scaffolding.

Why it's fast: in-place operations

Interior mutability means you pass a shared &adata to each step and results accumulate on the
same object: there is no copy between steps, and the expression matrix is never reallocated.

let adata = io::read_h5ad_memory("input.h5ad")?;          // load ONCE — note: not `mut`
qc_metrics(&adata)?;                                       // -> obs / var
normalize_expression(&adata.x(), 10_000, &Direction::ROW, None)?;  // -> X in place
log1p_expression(&adata.x(), None)?;                       // -> X in place
compute_highly_variable_genes(&adata, Some(HVGParams { n_top_genes: Some(2000), ..Default::default() }))?;
run_pca_inplace::<f64>(&adata, Some(hvg_selection), Some(true), Some(false),
                       Some(50), None, Some(42), Some(svd), None)?;   // -> obsm["X_pca"], uns
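To illustrate the pattern rather than the library's internals, here is a minimal sketch of interior mutability using only the standard library. The `Matrix` type, its `RwLock` field, and the method names are hypothetical stand-ins and are not SingleRust's API; the sketch only shows how a value that is not declared `mut` can still be normalized and log-transformed in place through `&self`.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for a dense, row-major expression matrix.
/// The lock provides interior mutability: methods take `&self`, not `&mut self`.
struct Matrix {
    cols: usize,
    data: RwLock<Vec<f64>>,
}

impl Matrix {
    fn new(rows: usize, cols: usize, values: Vec<f64>) -> Self {
        assert!(cols > 0, "matrix needs at least one column");
        assert_eq!(values.len(), rows * cols, "shape does not match data length");
        Matrix { cols, data: RwLock::new(values) }
    }

    /// Scale each row so it sums to `target`; all-zero rows are left untouched.
    fn normalize_rows(&self, target: f64) {
        let mut data = self.data.write().unwrap();
        for row in data.chunks_mut(self.cols) {
            let sum: f64 = row.iter().sum();
            if sum > 0.0 {
                for v in row.iter_mut() {
                    *v *= target / sum;
                }
            }
        }
    }

    /// Apply ln(1 + x) to every entry, overwriting the existing buffer.
    fn log1p(&self) {
        let mut data = self.data.write().unwrap();
        for v in data.iter_mut() {
            *v = v.ln_1p();
        }
    }

    /// Copy the values out for inspection (the only allocation in this sketch).
    fn snapshot(&self) -> Vec<f64> {
        self.data.read().unwrap().clone()
    }
}

fn main() {
    // Not `mut`: both steps below mutate the same buffer through a shared reference.
    let m = Matrix::new(2, 2, vec![1.0, 3.0, 0.0, 5.0]);
    m.normalize_rows(10_000.0);
    m.log1p();
    println!("{:?}", m.snapshot());
}
```

The trade-off in this sketch is that borrow checking moves from compile time to run time: a step that tried to take the write lock while already holding it would deadlock or panic instead of failing to compile.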

Benchmark (per step, 50k × 35.5k, 18 cores)

Compute time only (Rust .h5ad read/write excluded); a warm-up pass runs first so scanpy isn't
charged for one-time numba JIT / thread-pool startup.

step	scanpy	SingleRust	speedup
qc	0.85s	1.13s	0.75×
normalize	0.10s	0.01s	8.2×
log1p	0.20s	0.15s	1.4×
hvg	0.31s	0.12s	2.5×
pca	6.44s	1.06s	6.1×
ora*	1.47s	0.46s	3.2×
total	9.37s	2.93s	3.2×

SingleRust wins decisively on the heavy/vectorizable steps (PCA ~6×, normalize ~8×). QC is the
honest exception: scanpy's optimized C path is about 1.3× faster than SingleRust's qc_metrics
(0.85s vs 1.13s; the top-N segment proportions are the cost).

* ORA has no scanpy equivalent; it is compared against a NumPy/SciPy implementation of the same
algorithm.
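For readers who want to check the arithmetic, the sketch below recomputes the totals and the overall speedup from the per-step table, taking speedup as scanpy time divided by SingleRust time. The constant and function names are illustrative only. Ratios recomputed from the rounded two-decimal timings will not match the table exactly for very short steps (normalize, log1p), presumably because the table was computed from unrounded timings.

```rust
/// (step, scanpy seconds, SingleRust seconds), copied from the per-step table.
const STEPS: [(&str, f64, f64); 6] = [
    ("qc", 0.85, 1.13),
    ("normalize", 0.10, 0.01),
    ("log1p", 0.20, 0.15),
    ("hvg", 0.31, 0.12),
    ("pca", 6.44, 1.06),
    ("ora", 1.47, 0.46),
];

/// Speedup of `candidate` over `baseline`; values below 1.0 mean the candidate is slower.
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Sum the scanpy and SingleRust columns.
fn totals(steps: &[(&str, f64, f64)]) -> (f64, f64) {
    steps
        .iter()
        .fold((0.0, 0.0), |(a, b), &(_, scanpy, rust)| (a + scanpy, b + rust))
}

fn main() {
    for &(name, scanpy, rust) in STEPS.iter() {
        println!("{:<10} {:.2}x", name, speedup(scanpy, rust));
    }
    let (scanpy_total, rust_total) = totals(&STEPS);
    println!(
        "total      {:.2}s vs {:.2}s -> {:.1}x",
        scanpy_total,
        rust_total,
        speedup(scanpy_total, rust_total)
    );
}
```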

Scaling (runtime vs cells)

cells	scanpy	SingleRust	speedup
3,000	1.80s	0.21s	8.5×
6,000	3.38s	0.37s	9.1×
12,500	5.59s	0.69s	8.1×
25,000	6.12s	1.35s	4.5×
50,000	7.77s	2.62s	3.0×

Faster at every size (3–9×). SingleRust scales ~linearly with non-zeros; scanpy carries higher
fixed per-step overhead, so the margin peaks at small-to-moderate sizes (~9× at 6,000 cells) and
narrows to ~3× by 50k as both become compute-bound.
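One way to put a number on "scales ~linearly" is the log-log slope between the smallest and largest sweep points, i.e. the exponent k in t ∝ n^k. The sketch below computes it from the scaling table; the names are illustrative, and it uses cell count as a proxy for non-zeros, which assumes the non-zero count grows in proportion to cells.

```rust
/// (cells, scanpy seconds, SingleRust seconds), copied from the scaling table.
const SWEEP: [(f64, f64, f64); 5] = [
    (3_000.0, 1.80, 0.21),
    (6_000.0, 3.38, 0.37),
    (12_500.0, 5.59, 0.69),
    (25_000.0, 6.12, 1.35),
    (50_000.0, 7.77, 2.62),
];

/// Exponent k such that t1 / t0 = (n1 / n0)^k, i.e. the slope on a log-log plot.
fn scaling_exponent(n0: f64, t0: f64, n1: f64, t1: f64) -> f64 {
    (t1 / t0).ln() / (n1 / n0).ln()
}

fn main() {
    let first = SWEEP[0];
    let last = SWEEP[SWEEP.len() - 1];
    let scanpy_k = scaling_exponent(first.0, first.1, last.0, last.1);
    let rust_k = scaling_exponent(first.0, first.2, last.0, last.2);
    println!("scanpy exponent:     {:.2}", scanpy_k);
    println!("SingleRust exponent: {:.2}", rust_k);
}
```

On the table's numbers this gives roughly 0.9 for SingleRust and roughly 0.5 for scanpy. An exponent near 1 is what linear scaling looks like; a clearly sub-linear exponent over this range is consistent with a larger fixed cost being amortized as the input grows, though two endpoints from single runs are only indicative.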

Notes

Timings are compute-only, single-run, machine-dependent — indicative, not micro-benchmarked.
Both sides use all cores (Rust via rayon, scanpy via BLAS).
SingleRust writes blosc/zstd-compressed .h5ad; reading from Python needs import hdf5plugin.
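The "compute-only, single-run, after a warm-up" protocol can be sketched as a small timing helper. This is an illustration of the idea in Rust, not the notebooks' code (the warm-up in the notebooks targets scanpy's numba JIT and thread-pool startup); `time_after_warmup` is a hypothetical name.

```rust
use std::time::{Duration, Instant};

/// Run `step` once untimed so one-time costs (JIT, thread-pool startup, cold caches)
/// are paid up front, then run it once more and return that run's wall time.
fn time_after_warmup<F: FnMut()>(mut step: F) -> Duration {
    step(); // warm-up pass, deliberately not timed
    let start = Instant::now();
    step(); // the single timed run
    start.elapsed()
}

fn main() {
    let mut calls = 0u32;
    let elapsed = time_after_warmup(|| {
        calls += 1;
        // Placeholder workload; black_box keeps the optimizer from discarding it.
        let s: f64 = (1..=1_000_000).map(|i| (i as f64).sqrt()).sum();
        std::hint::black_box(s);
    });
    println!("timed run: {:?} after warm-up ({} calls total)", elapsed, calls);
}
```

A single timed run keeps the sweep cheap but leaves run-to-run noise in the result, which matches the "indicative, not micro-benchmarked" caveat above.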

🤖 Generated with Claude Code

Scaling to 500k cells + memory (added)

demo/scverse_scaling_large.ipynb pushes to 500,000 cells (broadened all-assays blood query,
genes fixed across sizes) and tracks peak RSS alongside time — both tools run under
/usr/bin/time -l on identical input. Clean-state reference (48 GB / 18-core, large sizes
measured in isolation):

cells	scanpy	SingleRust	speedup	scanpy RSS	SingleRust RSS
25k	5.8s	1.0s	5.8×	1.7 GB	0.8 GB
50k	6.9s	2.0s	3.4×	3.4 GB	1.4 GB
100k	9.2s	3.9s	2.4×	6.4 GB	2.6 GB
200k	13.5s	7.9s	1.7×	11.2 GB	5.0 GB
350k	22.5s	13.7s	1.65×	17.0 GB	8.7 GB
500k	31.4s	19.7s	1.6×	18.4 GB	11.6 GB

Scaling holds, speedup converges. SingleRust is faster at every size, but the pure-compute
margin narrows from ~5.8× (25k) to ~1.6× (500k) — small-N wins are low fixed overhead; at
scale both are compute-bound (PCA dominates), gap ~1.6× (normalize/HVG/PCA 2–4×, QC at parity).
Memory is a confound, and it favors SingleRust. Peak RSS is ~2× smaller through 350k and ~1.6×
smaller at 500k (11.6 vs 18.4 GB), so it stays compute-bound where scanpy starts paging. The notebook explicitly flags
the two large-N confounds on a laptop — memory pressure (inflates scanpy 2–4× near the RAM
limit while RSS plateaus) and thermal throttling under sustained benchmarking (slows both). RSS
is immune to both, so the lower-footprint result is the robust takeaway.
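As a rough illustration of how peak RSS can be read back from `/usr/bin/time -l`, the sketch below parses the "maximum resident set size" line from the captured stderr text. It assumes the BSD/macOS output layout, where the value precedes the label and is reported in bytes; the function names and the sample text are hypothetical, and the sample is not a measurement. Whether the table's GB figures are decimal or binary is not stated, so the conversion here assumes decimal gigabytes.

```rust
/// Extract peak RSS (bytes) from the stderr of `/usr/bin/time -l <command>`.
/// Assumes the BSD/macOS layout: "<number>  maximum resident set size".
fn parse_peak_rss_bytes(time_stderr: &str) -> Option<u64> {
    time_stderr
        .lines()
        .find(|line| line.contains("maximum resident set size"))
        .and_then(|line| line.split_whitespace().next())
        .and_then(|number| number.parse().ok())
}

/// Convert bytes to decimal gigabytes (an assumption; the source does not say GB vs GiB).
fn bytes_to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}

fn main() {
    // Illustrative text in the assumed layout, NOT real benchmark output.
    let sample = "        2.00 real         1.50 user         0.30 sys\n\
                  \x20         1400000000  maximum resident set size\n";
    match parse_peak_rss_bytes(sample) {
        Some(bytes) => println!("peak RSS: {:.1} GB", bytes_to_gb(bytes)),
        None => println!("no RSS line found"),
    }
}
```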

Adds a focused demonstration of SingleRust's performance and its in-place processing ergonomics.

In-place example (examples/inplace_pipeline.rs): load an AnnData once and run the whole pipeline against the same object. qc_metrics, normalize, log1p, HVG, and run_pca_inplace each mutate adata in place (obs/var, X, obsm["X_pca"], uns). `adata` isn't even `mut`: interior mutability lets every step take a shared &adata, so there is no per-step copy of the matrix. That allocation-avoidance is what the benchmark quantifies.

Benchmark (demo/scverse_benchmark.ipynb + examples/bench_step.rs): runs each step one at a time, SingleRust vs the equivalent scanpy function, on a ~50k-cell CZI CELLxGENE slice, reporting compute time only (Rust .h5ad read/write measured separately and excluded). scanpy produces the state before each step and Rust runs that same step on it, so it's apples-to-apples. Indicative (18 cores, 50k × 35.5k): qc 4.3×, normalize 7.8×, hvg 7.0×, pca 6.5×, ora 3.2×; overall ~4.9× (13.8s → 2.8s).

Supporting scaffolding: demo/prepare_data.py (parametrized Census fetch), markers.tsv, requirements.txt, README, .gitignore for .venv/data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- examples/bench_step.rs: add an `all` step that runs the full core pipeline (qc→normalize→log1p→hvg→pca) in one process, printing per-step and total compute time; used by the scaling sweep.
- demo/scverse_scaling.ipynb (+ _build_scaling_notebook.py): sweeps cell count 3k→50k, runs the full pipeline with SingleRust and scanpy at each size on the same raw subsample, and plots runtime-vs-cells (log–log) + speedup-vs-cells + per-step scaling.
- Add an untimed warm-up to BOTH notebooks so timings aren't charged for scanpy's one-time numba JIT / thread-pool startup. This corrects the earlier per-step numbers, which were warm-up-inflated: real result is ~3.2× overall (PCA ~6×, normalize ~8×), with QC actually ~0.75× (scanpy's optimized C path wins), flagged honestly in the README.
- Scaling: SingleRust faster at every size (3–9×); margin largest at moderate sizes, narrowing by 50k as both become compute-bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds a large-scale scaling study answering "does the advantage hold at 500k, and is memory a confound?"

- prepare_data.py: --all-assays flag (the narrow 10x-3'-v3 filter caps at 277k cells; broadened blood/primary/normal has 4M, enabling a 500k base).
- demo/_scanpy_pipeline.py: standalone scanpy pipeline runner with an internal warm-up, so scanpy can be measured as a subprocess too.
- demo/scverse_scaling_large.ipynb (+ builder): sweeps 25k→500k, runs both tools under /usr/bin/time -l to capture compute time AND peak RSS on identical input; clean compute-bound sweep ≤200k, large sizes measured in isolation; plots runtime/speedup/memory vs cells.

Findings (48 GB / 18-core, clean-state reference):

- Scaling holds (SingleRust faster at every size), but the pure-compute speedup CONVERGES from ~5.8× (25k) to ~1.6× (500k): the small-N margins are low fixed overhead; at scale both are compute-bound (PCA dominates), gap ~1.6× (normalize/HVG/PCA 2–4×, QC reaches parity).
- Memory IS a confound, and favors SingleRust: peak RSS ~2× smaller (≈10 vs ≈18 GB at 500k), so it stays compute-bound while scanpy pages.

Two large-N confounds are documented: memory pressure (inflates scanpy 2–4× near the RAM limit, RSS plateaus) and thermal throttling under sustained benchmarking (slows both). RSS is immune to both and is the robust signal.

requirements.txt: add psutil.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Ian Driver and others added 2 commits June 15, 2026 10:08

iandriver changed the title ~~Benchmark + in-place pipeline: SingleRust vs scanpy (~4.9× on 50k cells)~~ Benchmark + in-place pipeline: SingleRust vs scanpy (PCA ~6×, ~3× overall, + scaling curve) Jun 15, 2026

iandriver mentioned this pull request Jun 18, 2026

Investigate our neighbors algorithm choice scverse/scanpy#4131

Open

iandriver wants to merge 3 commits into integration/features-only from pr/benchmark-inplace
