
Benchmark + in-place pipeline: SingleRust vs scanpy (PCA ~6×, ~3× overall, + scaling curve) #1

Open
iandriver wants to merge 3 commits into integration/features-only from pr/benchmark-inplace

@iandriver iandriver commented Jun 15, 2026

Owner

Summary

A focused demonstration of SingleRust's performance and its in-place processing ergonomics:

  • examples/inplace_pipeline.rs — a minimal, annotated pipeline that loads an AnnData once and
    mutates it through every step (no copies).
  • demo/scverse_benchmark.ipynb + examples/bench_step.rs — a per-step runtime benchmark vs
    the equivalent scanpy function on a ~50k-cell CZI CELLxGENE dataset.
  • demo/scverse_scaling.ipynb — a cells-vs-runtime scaling sweep (3k → 50k).

Stacked PR: base is integration/features-only (write_h5ad, run_pca_inplace, dense
normalize/log1p, ORA, QC fix), so the diff here is just the benchmark, scaling, the in-place
example, and supporting scaffolding.

Why it's fast: in-place operations

Interior mutability means you pass a shared &adata to each step and results accumulate on the
same object: there is no copy between steps, and the expression matrix is never reallocated.

let adata = io::read_h5ad_memory("input.h5ad")?;          // load ONCE — note: not `mut`
qc_metrics(&adata)?;                                       // -> obs / var
normalize_expression(&adata.x(), 10_000, &Direction::ROW, None)?;  // -> X in place
log1p_expression(&adata.x(), None)?;                       // -> X in place
compute_highly_variable_genes(&adata, Some(HVGParams { n_top_genes: Some(2000), ..Default::default() }))?;
run_pca_inplace::<f64>(&adata, Some(hvg_selection), Some(true), Some(false),
                       Some(50), None, Some(42), Some(svd), None)?;   // -> obsm["X_pca"], uns
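
To make the interior-mutability pattern concrete, here is a minimal std-only sketch. `Dataset`, `normalize_rows`, and `log1p_all` are hypothetical stand-ins, not SingleRust's types or functions, and `RefCell` is only an assumption for illustration (a multi-threaded library would more plausibly guard the matrix with a lock). It shows two steps mutating one buffer through a shared reference while the binding stays non-`mut`.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an AnnData-like container (NOT SingleRust's type).
/// The matrix sits behind a RefCell, so it can be mutated through `&Dataset`.
struct Dataset {
    x: RefCell<Vec<f64>>, // row-major values, n_rows * n_cols
    n_cols: usize,
}

/// Scale every row so it sums to `target`; note the shared reference.
fn normalize_rows(data: &Dataset, target: f64) {
    let mut x = data.x.borrow_mut(); // mutable access without `&mut Dataset`
    for row in x.chunks_mut(data.n_cols) {
        let sum: f64 = row.iter().sum();
        if sum > 0.0 {
            for v in row.iter_mut() {
                *v *= target / sum;
            }
        }
    }
}

/// Apply ln(1 + x) element-wise, again through a shared reference.
fn log1p_all(data: &Dataset) {
    for v in data.x.borrow_mut().iter_mut() {
        *v = v.ln_1p();
    }
}

fn main() {
    // The binding is not `mut`, mirroring the pipeline above.
    let data = Dataset { x: RefCell::new(vec![1.0, 3.0, 2.0, 2.0]), n_cols: 2 };
    let buffer_before = data.x.borrow().as_ptr();

    normalize_rows(&data, 10.0);
    log1p_all(&data);

    // Both steps wrote into the same heap buffer: nothing was copied or reallocated.
    assert_eq!(buffer_before, data.x.borrow().as_ptr());
    println!("{:?}", data.x.borrow());
}
```

The pointer comparison is the point of the sketch: each step overwrites values in place, so the allocation made at load time is the only one.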

Benchmark (per step, 50k × 35.5k, 18 cores)

Compute time only (Rust .h5ad read/write excluded); a warm-up pass runs first so scanpy isn't
charged for one-time numba JIT / thread-pool startup.
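
The warm-up-then-measure idea can be sketched in a few lines. This is not the code in examples/bench_step.rs or the notebooks; `time_after_warmup` is a hypothetical helper, and the closure body is a placeholder workload.

```rust
use std::time::{Duration, Instant};

/// Run `step` once untimed (warm-up), then time a single measured run.
/// Hypothetical helper for illustration only.
fn time_after_warmup<F: FnMut()>(mut step: F) -> Duration {
    step(); // warm-up: absorbs one-time costs such as JIT or thread-pool startup
    let start = Instant::now();
    step(); // measured run
    start.elapsed()
}

fn main() {
    let mut calls = 0u32;
    let elapsed = time_after_warmup(|| {
        calls += 1;
        // Placeholder workload; black_box keeps the sum from being optimized away.
        let s: u64 = (0..1_000_000u64).sum();
        std::hint::black_box(s);
    });
    println!("{calls} calls, measured run took {elapsed:?}");
}
```

Only the second call is timed, so one-time initialization is paid during the discarded first call.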

step        scanpy   SingleRust   speedup
qc          0.85s    1.13s        0.75×
normalize   0.10s    0.01s        8.2×
log1p       0.20s    0.15s        1.4×
hvg         0.31s    0.12s        2.5×
pca         6.44s    1.06s        6.1×
ora*        1.47s    0.46s        3.2×
total       9.37s    2.93s        3.2×

SingleRust wins decisively on the heavy, vectorizable steps (PCA ~6×, normalize ~8×). QC is the
honest exception: scanpy's optimized C path slightly beats SingleRust's qc_metrics (the top-N
segment proportions are the cost).

* ORA has no scanpy equivalent; it is compared against a NumPy/SciPy implementation of the same
algorithm.
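
For readers who want to check the arithmetic, the sketch below recomputes the totals and the overall speedup from the per-step times in the table above, taking speedup as scanpy seconds divided by SingleRust seconds. The function names are illustrative and not part of the benchmark code.

```rust
/// (step, scanpy seconds, SingleRust seconds), copied from the per-step table.
const STEPS: [(&str, f64, f64); 6] = [
    ("qc", 0.85, 1.13),
    ("normalize", 0.10, 0.01),
    ("log1p", 0.20, 0.15),
    ("hvg", 0.31, 0.12),
    ("pca", 6.44, 1.06),
    ("ora", 1.47, 0.46),
];

/// Speedup of the candidate over the baseline (values > 1 mean the candidate is faster).
fn speedup(baseline_s: f64, candidate_s: f64) -> f64 {
    baseline_s / candidate_s
}

/// Sum both timing columns.
fn totals(steps: &[(&str, f64, f64)]) -> (f64, f64) {
    steps
        .iter()
        .fold((0.0, 0.0), |(a, b), &(_, scanpy, rust)| (a + scanpy, b + rust))
}

fn main() {
    for (name, scanpy, rust) in STEPS {
        println!("{name:<10} {:.2}x", speedup(scanpy, rust));
    }
    let (scanpy_total, rust_total) = totals(&STEPS);
    println!(
        "total      {scanpy_total:.2}s vs {rust_total:.2}s = {:.1}x",
        speedup(scanpy_total, rust_total)
    );
}
```

Per-step ratios recomputed this way can differ from the table's speedup column (normalize is the clearest case), presumably because the table's ratios came from unrounded timings while the displayed times are rounded to two decimals.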

Scaling (runtime vs cells)

cells    scanpy   SingleRust   speedup
3,000    1.80s    0.21s        8.5×
6,000    3.38s    0.37s        9.1×
12,500   5.59s    0.69s        8.1×
25,000   6.12s    1.35s        4.5×
50,000   7.77s    2.62s        3.0×

Faster at every size (3–9×). SingleRust scales ~linearly with non-zeros; scanpy carries higher
fixed per-step overhead, so the margin is largest at the smaller sizes (8–9× up to 12.5k cells)
and narrows to ~3× by 50k as both become compute-bound.
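
One rough way to quantify "scales ~linearly" is the slope between the first and last rows of the table on log–log axes: a slope near 1 means runtime grows in proportion to size, and a slope well below 1 suggests a fixed cost that is amortized as size grows. The sketch below uses cell count as a proxy for non-zeros, which is an assumption; it is a two-point estimate, not a fit, and `loglog_slope` is an illustrative helper.

```rust
/// Slope of the straight line through two (size, seconds) points in log-log space.
fn loglog_slope(n0: f64, t0: f64, n1: f64, t1: f64) -> f64 {
    (t1 / t0).ln() / (n1 / n0).ln()
}

fn main() {
    // Endpoints of the scaling table: 3,000 and 50,000 cells.
    let scanpy = loglog_slope(3_000.0, 1.80, 50_000.0, 7.77);
    let singlerust = loglog_slope(3_000.0, 0.21, 50_000.0, 2.62);
    println!("scanpy exponent ~ {scanpy:.2}, SingleRust exponent ~ {singlerust:.2}");
}
```

On these two endpoints the SingleRust exponent comes out close to 1 and the scanpy exponent close to one half, which is consistent with the fixed-overhead explanation but does not prove it.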

Notes

  • Timings are compute-only, single-run, machine-dependent — indicative, not micro-benchmarked.
  • Both sides use all cores (Rust via rayon, scanpy via BLAS).
  • SingleRust writes blosc/zstd-compressed .h5ad; reading from Python needs import hdf5plugin.

🤖 Generated with Claude Code

Scaling to 500k cells + memory (added)

demo/scverse_scaling_large.ipynb pushes to 500,000 cells (broadened all-assays blood query,
genes fixed across sizes) and tracks peak RSS alongside time — both tools run under
/usr/bin/time -l on identical input. Clean-state reference (48 GB / 18-core, large sizes
measured in isolation):

cells   scanpy   SingleRust   speedup   scanpy RSS   SingleRust RSS
25k     5.8s     1.0s         5.8×      1.7 GB       0.8 GB
50k     6.9s     2.0s         3.4×      3.4 GB       1.4 GB
100k    9.2s     3.9s         2.4×      6.4 GB       2.6 GB
200k    13.5s    7.9s         1.7×      11.2 GB      5.0 GB
350k    22.5s    13.7s        1.65×     17.0 GB      8.7 GB
500k    31.4s    19.7s        1.6×      18.4 GB      11.6 GB

  • Scaling holds, speedup converges. SingleRust is faster at every size, but the pure-compute
    margin narrows from ~5.8× (25k) to ~1.6× (500k): small-N wins come from low fixed overhead; at
    scale both are compute-bound (PCA dominates), leaving a gap of ~1.6× (normalize/HVG/PCA 2–4×,
    QC at parity).
  • Memory is a confound, and it favors SingleRust. Peak RSS is ~1.6–2.5× smaller across the sweep
    (11.6 vs 18.4 GB at 500k in the table above), so it stays compute-bound where scanpy starts
    paging. The notebook explicitly flags the two large-N confounds on a laptop: memory pressure
    (inflates scanpy 2–4× near the RAM limit while RSS plateaus) and thermal throttling under
    sustained benchmarking (slows both). RSS is immune to both, so the lower-footprint result is
    the robust takeaway.
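
As a sketch of how peak RSS can be pulled out of `/usr/bin/time -l` output: the parser below assumes the BSD/macOS format, where one line carries a number followed by `maximum resident set size`, and assumes that number is in bytes. Both are assumptions about the platform, not something the notebook is confirmed to rely on; `parse_peak_rss_bytes` is a hypothetical helper and the sample text is invented.

```rust
/// Extract the peak RSS from `/usr/bin/time -l` output.
/// Assumes the BSD/macOS layout: "<number>  maximum resident set size".
fn parse_peak_rss_bytes(time_output: &str) -> Option<u64> {
    time_output
        .lines()
        .find(|line| line.trim_end().ends_with("maximum resident set size"))?
        .split_whitespace()
        .next()? // first token on that line is the numeric value
        .parse()
        .ok()
}

fn main() {
    // Invented sample in the assumed format; not a measurement from this PR.
    let sample = "        1.23 real         0.98 user         0.12 sys\n\
                  \x20         2147483648  maximum resident set size\n";
    match parse_peak_rss_bytes(sample) {
        Some(bytes) => println!("peak RSS: {:.2} GB", bytes as f64 / 1e9),
        None => println!("no RSS line found"),
    }
}
```

Returning `Option` keeps a missing or reformatted line from being silently read as zero, which matters when the same parser is run across many sweep sizes.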

Ian Driver and others added 2 commits June 15, 2026 10:08
Adds a focused demonstration of SingleRust's performance and its in-place
processing ergonomics.

In-place example (examples/inplace_pipeline.rs): load an AnnData once and
run the whole pipeline against the same object — qc_metrics, normalize,
log1p, HVG, and run_pca_inplace each mutate adata in place (obs/var, X,
obsm["X_pca"], uns). `adata` isn't even `mut`: interior mutability lets
every step take a shared &adata, so there is no per-step copy of the
matrix. That allocation-avoidance is what the benchmark quantifies.

Benchmark (demo/scverse_benchmark.ipynb + examples/bench_step.rs): runs
each step one at a time, SingleRust vs the equivalent scanpy function, on
a ~50k-cell CZI CELLxGENE slice, reporting compute time only (Rust .h5ad
read/write measured separately and excluded). scanpy produces the state
before each step and Rust runs that same step on it, so it's
apples-to-apples. Indicative (18 cores, 50k × 35.5k): qc 4.3×, normalize
7.8×, hvg 7.0×, pca 6.5×, ora 3.2×; overall ~4.9× (13.8s → 2.8s).

Supporting scaffolding: demo/prepare_data.py (parametrized Census fetch),
markers.tsv, requirements.txt, README, .gitignore for .venv/data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- examples/bench_step.rs: add an `all` step that runs the full core
  pipeline (qc→normalize→log1p→hvg→pca) in one process, printing per-step
  and total compute time — used by the scaling sweep.
- demo/scverse_scaling.ipynb (+ _build_scaling_notebook.py): sweeps cell
  count 3k→50k, runs the full pipeline with SingleRust and scanpy at each
  size on the same raw subsample, and plots runtime-vs-cells (log–log) +
  speedup-vs-cells + per-step scaling.
- Add an untimed warm-up to BOTH notebooks so timings aren't charged for
  scanpy's one-time numba JIT / thread-pool startup. This corrects the
  earlier per-step numbers, which were warm-up-inflated: real result is
  ~3.2× overall (PCA ~6×, normalize ~8×), with QC actually ~0.75×
  (scanpy's optimized C path wins) — flagged honestly in the README.
- Scaling: SingleRust faster at every size (3–9×); margin largest at
  moderate sizes, narrowing by 50k as both become compute-bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@iandriver iandriver changed the title from "Benchmark + in-place pipeline: SingleRust vs scanpy (~4.9× on 50k cells)" to "Benchmark + in-place pipeline: SingleRust vs scanpy (PCA ~6×, ~3× overall, + scaling curve)" on Jun 15, 2026

Adds a large-scale scaling study answering "does the advantage hold at
500k, and is memory a confound?"

- prepare_data.py: --all-assays flag (the narrow 10x-3'-v3 filter caps at
  277k cells; broadened blood/primary/normal has 4M, enabling a 500k base).
- demo/_scanpy_pipeline.py: standalone scanpy pipeline runner with an
  internal warm-up, so scanpy can be measured as a subprocess too.
- demo/scverse_scaling_large.ipynb (+ builder): sweeps 25k→500k, runs both
  tools under /usr/bin/time -l to capture compute time AND peak RSS on
  identical input; clean compute-bound sweep ≤200k, large sizes measured
  in isolation; plots runtime/speedup/memory vs cells.

Findings (48 GB / 18-core, clean-state reference):
- Scaling holds — SingleRust faster at every size — but the pure-compute
  speedup CONVERGES from ~5.8× (25k) to ~1.6× (500k): the small-N margins
  are low fixed overhead; at scale both are compute-bound (PCA dominates),
  gap ~1.6× (normalize/HVG/PCA 2–4×, QC reaches parity).
- Memory IS a confound, and favors SingleRust: peak RSS ~2× smaller
  (≈10 vs ≈18 GB at 500k), so it stays compute-bound while scanpy pages.
  Two large-N confounds are documented: memory pressure (inflates scanpy
  2–4× near the RAM limit, RSS plateaus) and thermal throttling under
  sustained benchmarking (slows both). RSS is immune to both and is the
  robust signal.

requirements.txt: add psutil.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>