Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism by iandriver · Pull Request #2 · iandriver/SingleRust

iandriver · 2026-06-17T02:35:15Z

Summary

Implements native out-of-core (disk-backed) preprocessing for SingleRust — the flagship item
from the performance roadmap. The whole Tier-1 pipeline (QC → normalize_total → log1p → HVG → PCA)
now runs streamed against an .h5ad in bounded memory, the larger-than-RAM niche scanpy fills
with Dask (where, per scanpy's own issues, the sparse path is weak). All compute-bound passes are
parallelized with a determinism guarantee: results are bit-identical regardless of thread count.

Stacked on demo/scverse-integration, so this PR's diff is just the OOC engine + benchmarks.

What's new

OOC engine (src/backed/processing/) — streams X in row-chunks (x().iter) and writes back
via set_x_from_iter (extendable HDF5 dataset); peak memory ≈ one chunk + small accumulators:

transformation: log1p_backed, normalize_total_backed
qc: qc_metrics_backed (one pass; obs/var written in place)
pipeline: preprocess_backed — fused QC+normalize+log1p (the per-cell total is reused as the
normalization row-sum, so it's computed once)
hvg: highly_variable_genes_backed (Seurat) — shares seurat_select with the in-memory path
pca: pca_backed — exact covariance/eigendecomposition PCA over the HVGs (pass-1 Gram matrix,
symmetric eigendecomp, pass-2 projection → obsm["X_pca"])

Deterministic parallelism (backed::processing::det) — naive parallel FP reduction is not
reproducible (non-associative addition + work-stealing). All reductions use a fixed-block,
ordered-merge scheme so the summation order depends only on data size + a constant block size,
never on threads/scheduling. Applied to PCA Gram, HVG sum/sumsq, and QC accumulation.

scanpy	SingleRust (out-of-core)
`sc.pp.calculate_qc_metrics(adata)`	`sr.pp.calculate_qc_metrics(adata)`
`sc.pp.normalize_total(adata, target_sum=1e4)`	`sr.pp.normalize_total(adata, target_sum=1e4)`
`sc.pp.log1p(adata)`	`sr.pp.log1p(adata)`
`sc.pp.highly_variable_genes(adata, …)`	`sr.pp.highly_variable_genes(adata, …)`
`sc.pp.pca(adata, …)`	`sr.pp.pca(adata, …)`

adata is a path or a backed AnnData; the shim releases the HDF5 handle before invoking Rust.

Benchmark (demo/bench_ooc.py, 3 lanes): scanpy in-memory vs scanpy+Dask vs SingleRust OOC.

Results

Memory, 500k × 48,788 genes (48 GB / 18-core):

lane	time	peak RSS
scanpy in-memory	13.5 s	13.6 GB
scanpy + Dask OOC	35.4 s	14.5 GB
SingleRust OOC (fused)	45.0 s	2.5 GB

SingleRust OOC uses ~5.4× less memory than scanpy in-memory and ~5.8× less than scanpy+Dask,
with a chunk-bounded footprint that stays flat as cells grow (scanpy OOMs ~46 GB at 2M). Dask's
sparse OOC path gave no memory benefit (RSS ≥ in-memory) at 2.6× the time — the immaturity
scanpy's issues describe.

Correctness & determinism

Each step is unit-tested against its same-algorithm counterpart: log1p/normalize match scanpy's X
exactly; QC matches; HVG selects the same genes as in-memory HVG; PCA matches a dense
covariance-PCA reference (variance ratios exact, embedding norms match up to sign).

Determinism verified two ways:

unit tests run QC/HVG/preprocess/PCA under 1 vs 8 rayon threads → bit-identical;
at scale, the full pipeline on 50k real cells with RAYON_NUM_THREADS=1 vs 18 → bit-identical
X, obs/var, obsm["X_pca"], and variance ratios.

Notes for reviewers

Adds rayon as a direct dependency. 21 lib tests pass; clippy clean (one pre-existing dead-code
warning unrelated to this PR).
End-to-end PCA axes differ from scanpy only because SingleRust's Seurat HVG selects a different
gene set than scanpy's — a pre-existing in-memory difference, not introduced here.
Surfaced (and flagged separately) a pre-existing convert_to_array_f64 CSR densify bug; the OOC
tests densify manually to avoid it.
The Dask benchmark lane needs an isolated .venv-dask (anndata ≥ 0.11 / scanpy ≥ 1.11); the main
3.9 demo env is untouched.

🤖 Generated with Claude Code

…e shim Implements native streaming (out-of-core) preprocessing on disk-backed .h5ad — the highest-leverage item from the performance roadmap, targeting the larger-than-RAM niche scanpy fills with Dask (where Dask+sparse is weak). - backed::processing::transformation: log1p_backed + normalize_total_backed. Read X in row-chunks (x().iter), transform, stream back out via set_x_from_iter (extendable HDF5 dataset). Peak memory ~ one chunk, not the dataset. normalize is two passes (cheap per-cell sums, then scale). - backed::processing::qc: qc_metrics_backed — one streaming pass computing the same cell/gene metrics as the in-memory version (totals, nnz, top-N segments, mito), written back into obs/var in place (X untouched). - examples/sr_ooc.rs: CLI (qc | normalize_total | log1p) with in-place rewrite (temp + atomic rename) so a file can be processed without loading. - demo/singlerust.py: near-drop-in `sr.pp.*` mirroring `sc.pp.*` (calculate_qc_metrics / normalize_total / log1p), operating on a backed AnnData's file or a path; releases the HDF5 handle before invoking Rust. Validated against scanpy on a 200×600 dataset: X exact match after normalize_total+log1p, and total_counts / n_genes / pct_counts_mito / pct_top_{50,500} / var totals / n_cells / mean all match. Also surfaced a pre-existing bug (flagged separately): convert_to_array_f64 corrupts CSR values when densifying — the OOC tests densify manually. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bench_ooc.py runs the same work (QC + normalize_total + log1p) in both lanes as subprocesses under /usr/bin/time -l, capturing peak RSS + time. Result at 500k cells × 48,788 genes (48 GB / 18-core): scanpy in-memory : 12.3s compute | 12.4 GB peak RSS SingleRust OOC : 44.9s wall | 2.3 GB peak RSS (~5.3× less memory) SingleRust OOC's RSS is chunk-bounded (≈flat as cells grow); scanpy in-memory scales linearly and OOMs (~50 GB at 2M > 48 GB RAM). Time is not apples-to-apples (scanpy = compute only; SingleRust includes a 6 GB working copy + two full disk passes as separate processes) — documented in OOC_BENCHMARK.md; the memory result is the durable signal. The scanpy+Dask out-of-core lane is left as a hook: it needs anndata>=0.11 (read_elem_as_dask; this env has 0.10.9) and the Dask sparse path is immature per scanpy's own issues. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds a third lane (_scanpy_dask_pp.py) using anndata's experimental read_elem_lazy to read X as a dask array, run dask-enabled normalize_total/log1p (+qc), and stream the result out. bench_ooc.py now compares all three lanes (peak RSS + time), driving the scanpy lanes with an isolated .venv-dask (anndata 0.12 / scanpy 1.12 / dask), since the lazy dask read needs anndata>=0.11 (Python>=3.10). Result at 500k × 48,788 genes (48 GB / 18-core): scanpy in-memory : 13.5s | 11.7 GB scanpy + Dask OOC : 92.2s | 7.5 GB SingleRust OOC : 128s | 2.1 GB Key finding: Dask's sparse OOC path only modestly cuts memory (7.5 vs 11.7 GB) while costing ~7× the runtime — the dask-sparse immaturity scanpy's own issues describe. SingleRust native streaming is the only lane with truly bounded memory (5.6× less than in-memory, 3.6× less than Dask). 2M projection documented but not run (would OOM the box); peak RSS is the fair metric (time isn't apples-to-apples — see OOC_BENCHMARK.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e too preprocess_backed runs QC + normalize_total + log1p as one streamed job: pass 1 accumulates QC metrics (the per-cell total IS the normalization row-sum, so it's computed once); pass 2 streams normalize+log1p into X. Two passes total instead of the ~4 read/write passes the separate ops incurred, and no working copy. - src/backed/processing/pipeline.rs: preprocess_backed (+ test). - qc.rs: extracted stream_qc / write_metrics (pub(crate)) so the pipeline reuses the QC accumulation and its totals. - transformation.rs: normalize_chunk made pub(crate) for reuse. - examples/sr_ooc.rs: `preprocess` subcommand. - demo/singlerust.py: sr.pp.preprocess. - demo/bench_ooc.py: SingleRust lane now a single fused, copy-free call. Benchmark (500k × 48,788 genes, 48 GB / 18-core): scanpy in-memory : 13.5s | 13.6 GB scanpy + Dask OOC : 35.4s | 14.5 GB SingleRust OOC : 45.0s | 2.5 GB Fusion cut SingleRust wall time 128s -> 45s (~2.8×): now in the same ballpark as Dask on time, at ~1/6th the memory. Dask's sparse OOC again showed no memory benefit (RSS >= in-memory). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

One streaming pass accumulates per-gene sum + sum-of-squares; mean and Bessel-corrected sample variance are formed identically to the in-memory var_col, then handed to the shared seurat_select (refactored out of the in-memory compute_seurat_hvg) for dispersion binning/normalization and top-N selection. Results written into var in place; memory bounded by one chunk + two n_vars accumulators. - memory/processing/hvg: extract pub(crate) seurat_select; module made pub(crate) so the backed path can reuse it. - backed/processing/hvg.rs: highly_variable_genes_backed (+ test asserting the OOC mask equals the in-memory HVG mask on the same data). - sr_ooc CLI `hvg` + sr.pp.highly_variable_genes shim. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Exact PCA over the HVG-selected genes without holding the matrix in memory: pass 1: stream X, accumulate the gene×gene Gram matrix + per-gene sums over selected genes (memory O(n_hvg²), e.g. 32 MB for 2000 HVGs) -> centered covariance C = (G - n·μμᵀ)/(n-1), symmetric eigendecomp (n_hvg × n_hvg) for loadings + variances pass 2: stream X again, project each centered cell onto the top axes -> obsm["X_pca"], uns["pca_variance_ratio"]; written in place. Centering is folded into the projection (per-component offset), so pass 2 only touches each cell's selected nonzeros. Exact (not randomized) PCA; matches a dense covariance-PCA reference (variance ratios exact, embedding column norms match) up to per-component sign — see test. - backed/processing/pca.rs: pca_backed (+ dense-reference test). - sr_ooc CLI `pca` + sr.pp.pca shim. - hvg: add SeuratResult type alias (clippy type-complexity). Out-of-core Tier-1 is now complete: qc, normalize/log1p, fused preprocess, HVG, and PCA all run disk-backed in bounded memory. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

OOC_BENCHMARK.md: summarize that QC, normalize_total, log1p, fused preprocess, HVG (Seurat) and PCA all run disk-backed via sr_ooc / sr.pp.*, with each step validated against its in-memory/dense counterpart. Note the end-to-end PCA-vs-scanpy axis difference traces to SingleRust's HVG gene selection differing from scanpy's (pre-existing), not the OOC path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Parallelize the compute-bound OOC reductions with rayon, but guarantee bit-identical results regardless of thread count via a fixed-block, ordered-merge reduction (backed::processing::det). Floating-point sums are non-associative and rayon work-stealing varies the order, so partials are computed over fixed-index row blocks and merged in block order -> the summation order depends only on data size + a constant block size, never on threads or scheduling. Parallelized: - PCA pass-1 Gram matrix + per-gene sums (the compute hotspot) via det_block_reduce; pass-2 projection is per-cell independent (parallel, order-free). - HVG per-gene sum/sum-of-squares via det_block_reduce. - QC per-cell metrics + per-gene reduction via det_block_reduce (per-cell records kept in row order, per-gene summed in block order). normalize/log1p stay sequential (trivial elementwise, no reduction) and are deterministic by construction. Determinism verified: - unit tests: qc/hvg/preprocess/pca under 1 vs 8 rayon threads are bit-identical (metrics, masks, embeddings, variance ratios), plus a repeat-run check. - at scale: full pipeline on 50k real cells with RAYON_NUM_THREADS=1 vs 18 yields bit-identical X, obs/var, obsm["X_pca"], and uns variance ratios. Adds rayon as a direct dependency. 21 lib tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Ian Driver and others added 8 commits June 16, 2026 11:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism#2

Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism#2
iandriver wants to merge 8 commits into
demo/scverse-integrationfrom
feature/ooc-processing

iandriver commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

iandriver commented Jun 17, 2026

Summary

What's new

Results

Correctness & determinism

Notes for reviewers

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant