Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism#2
Open
iandriver wants to merge 8 commits into
Open
Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism#2iandriver wants to merge 8 commits into
iandriver wants to merge 8 commits into
Conversation
…e shim
Implements native streaming (out-of-core) preprocessing on disk-backed
.h5ad — the highest-leverage item from the performance roadmap, targeting
the larger-than-RAM niche scanpy fills with Dask (where Dask+sparse is weak).
- backed::processing::transformation: log1p_backed + normalize_total_backed.
Read X in row-chunks (x().iter), transform, stream back out via
set_x_from_iter (extendable HDF5 dataset). Peak memory ~ one chunk, not
the dataset. normalize is two passes (cheap per-cell sums, then scale).
- backed::processing::qc: qc_metrics_backed — one streaming pass computing
the same cell/gene metrics as the in-memory version (totals, nnz, top-N
segments, mito), written back into obs/var in place (X untouched).
- examples/sr_ooc.rs: CLI (qc | normalize_total | log1p) with in-place
rewrite (temp + atomic rename) so a file can be processed without loading.
- demo/singlerust.py: near-drop-in `sr.pp.*` mirroring `sc.pp.*`
(calculate_qc_metrics / normalize_total / log1p), operating on a backed
AnnData's file or a path; releases the HDF5 handle before invoking Rust.
Validated against scanpy on a 200×600 dataset: X exact match after
normalize_total+log1p, and total_counts / n_genes / pct_counts_mito /
pct_top_{50,500} / var totals / n_cells / mean all match.
Also surfaced a pre-existing bug (flagged separately): convert_to_array_f64
corrupts CSR values when densifying — the OOC tests densify manually.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bench_ooc.py runs the same work (QC + normalize_total + log1p) in both lanes as subprocesses under /usr/bin/time -l, capturing peak RSS + time. Result at 500k cells × 48,788 genes (48 GB / 18-core): scanpy in-memory : 12.3s compute | 12.4 GB peak RSS SingleRust OOC : 44.9s wall | 2.3 GB peak RSS (~5.3× less memory) SingleRust OOC's RSS is chunk-bounded (≈flat as cells grow); scanpy in-memory scales linearly and OOMs (~50 GB at 2M > 48 GB RAM). Time is not apples-to-apples (scanpy = compute only; SingleRust includes a 6 GB working copy + two full disk passes as separate processes) — documented in OOC_BENCHMARK.md; the memory result is the durable signal. The scanpy+Dask out-of-core lane is left as a hook: it needs anndata>=0.11 (read_elem_as_dask; this env has 0.10.9) and the Dask sparse path is immature per scanpy's own issues. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a third lane (_scanpy_dask_pp.py) using anndata's experimental read_elem_lazy to read X as a dask array, run dask-enabled normalize_total/log1p (+qc), and stream the result out. bench_ooc.py now compares all three lanes (peak RSS + time), driving the scanpy lanes with an isolated .venv-dask (anndata 0.12 / scanpy 1.12 / dask), since the lazy dask read needs anndata>=0.11 (Python>=3.10). Result at 500k × 48,788 genes (48 GB / 18-core): scanpy in-memory : 13.5s | 11.7 GB scanpy + Dask OOC : 92.2s | 7.5 GB SingleRust OOC : 128s | 2.1 GB Key finding: Dask's sparse OOC path only modestly cuts memory (7.5 vs 11.7 GB) while costing ~7× the runtime — the dask-sparse immaturity scanpy's own issues describe. SingleRust native streaming is the only lane with truly bounded memory (5.6× less than in-memory, 3.6× less than Dask). 2M projection documented but not run (would OOM the box); peak RSS is the fair metric (time isn't apples-to-apples — see OOC_BENCHMARK.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e too preprocess_backed runs QC + normalize_total + log1p as one streamed job: pass 1 accumulates QC metrics (the per-cell total IS the normalization row-sum, so it's computed once); pass 2 streams normalize+log1p into X. Two passes total instead of the ~4 read/write passes the separate ops incurred, and no working copy. - src/backed/processing/pipeline.rs: preprocess_backed (+ test). - qc.rs: extracted stream_qc / write_metrics (pub(crate)) so the pipeline reuses the QC accumulation and its totals. - transformation.rs: normalize_chunk made pub(crate) for reuse. - examples/sr_ooc.rs: `preprocess` subcommand. - demo/singlerust.py: sr.pp.preprocess. - demo/bench_ooc.py: SingleRust lane now a single fused, copy-free call. Benchmark (500k × 48,788 genes, 48 GB / 18-core): scanpy in-memory : 13.5s | 13.6 GB scanpy + Dask OOC : 35.4s | 14.5 GB SingleRust OOC : 45.0s | 2.5 GB Fusion cut SingleRust wall time 128s -> 45s (~2.8×): now in the same ballpark as Dask on time, at ~1/6th the memory. Dask's sparse OOC again showed no memory benefit (RSS >= in-memory). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One streaming pass accumulates per-gene sum + sum-of-squares; mean and Bessel-corrected sample variance are formed identically to the in-memory var_col, then handed to the shared seurat_select (refactored out of the in-memory compute_seurat_hvg) for dispersion binning/normalization and top-N selection. Results written into var in place; memory bounded by one chunk + two n_vars accumulators. - memory/processing/hvg: extract pub(crate) seurat_select; module made pub(crate) so the backed path can reuse it. - backed/processing/hvg.rs: highly_variable_genes_backed (+ test asserting the OOC mask equals the in-memory HVG mask on the same data). - sr_ooc CLI `hvg` + sr.pp.highly_variable_genes shim. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Exact PCA over the HVG-selected genes without holding the matrix in
memory:
pass 1: stream X, accumulate the gene×gene Gram matrix + per-gene sums
over selected genes (memory O(n_hvg²), e.g. 32 MB for 2000 HVGs)
-> centered covariance C = (G - n·μμᵀ)/(n-1), symmetric eigendecomp
(n_hvg × n_hvg) for loadings + variances
pass 2: stream X again, project each centered cell onto the top axes
-> obsm["X_pca"], uns["pca_variance_ratio"]; written in place.
Centering is folded into the projection (per-component offset), so pass 2
only touches each cell's selected nonzeros. Exact (not randomized) PCA;
matches a dense covariance-PCA reference (variance ratios exact, embedding
column norms match) up to per-component sign — see test.
- backed/processing/pca.rs: pca_backed (+ dense-reference test).
- sr_ooc CLI `pca` + sr.pp.pca shim.
- hvg: add SeuratResult type alias (clippy type-complexity).
Out-of-core Tier-1 is now complete: qc, normalize/log1p, fused preprocess,
HVG, and PCA all run disk-backed in bounded memory.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
OOC_BENCHMARK.md: summarize that QC, normalize_total, log1p, fused preprocess, HVG (Seurat) and PCA all run disk-backed via sr_ooc / sr.pp.*, with each step validated against its in-memory/dense counterpart. Note the end-to-end PCA-vs-scanpy axis difference traces to SingleRust's HVG gene selection differing from scanpy's (pre-existing), not the OOC path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Parallelize the compute-bound OOC reductions with rayon, but guarantee bit-identical results regardless of thread count via a fixed-block, ordered-merge reduction (backed::processing::det). Floating-point sums are non-associative and rayon work-stealing varies the order, so partials are computed over fixed-index row blocks and merged in block order -> the summation order depends only on data size + a constant block size, never on threads or scheduling. Parallelized: - PCA pass-1 Gram matrix + per-gene sums (the compute hotspot) via det_block_reduce; pass-2 projection is per-cell independent (parallel, order-free). - HVG per-gene sum/sum-of-squares via det_block_reduce. - QC per-cell metrics + per-gene reduction via det_block_reduce (per-cell records kept in row order, per-gene summed in block order). normalize/log1p stay sequential (trivial elementwise, no reduction) and are deterministic by construction. Determinism verified: - unit tests: qc/hvg/preprocess/pca under 1 vs 8 rayon threads are bit-identical (metrics, masks, embeddings, variance ratios), plus a repeat-run check. - at scale: full pipeline on 50k real cells with RAYON_NUM_THREADS=1 vs 18 yields bit-identical X, obs/var, obsm["X_pca"], and uns variance ratios. Adds rayon as a direct dependency. 21 lib tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements native out-of-core (disk-backed) preprocessing for SingleRust — the flagship item
from the performance roadmap. The whole Tier-1 pipeline (QC → normalize_total → log1p → HVG → PCA)
now runs streamed against an
.h5adin bounded memory, the larger-than-RAM niche scanpy fillswith Dask (where, per scanpy's own issues, the sparse path is weak). All compute-bound passes are
parallelized with a determinism guarantee: results are bit-identical regardless of thread count.
Stacked on
demo/scverse-integration, so this PR's diff is just the OOC engine + benchmarks.What's new
OOC engine (
src/backed/processing/) — streamsXin row-chunks (x().iter) and writes backvia
set_x_from_iter(extendable HDF5 dataset); peak memory ≈ one chunk + small accumulators:transformation:log1p_backed,normalize_total_backedqc:qc_metrics_backed(one pass; obs/var written in place)pipeline:preprocess_backed— fused QC+normalize+log1p (the per-cell total is reused as thenormalization row-sum, so it's computed once)
hvg:highly_variable_genes_backed(Seurat) — sharesseurat_selectwith the in-memory pathpca:pca_backed— exact covariance/eigendecomposition PCA over the HVGs (pass-1 Gram matrix,symmetric eigendecomp, pass-2 projection →
obsm["X_pca"])Deterministic parallelism (
backed::processing::det) — naive parallel FP reduction is notreproducible (non-associative addition + work-stealing). All reductions use a fixed-block,
ordered-merge scheme so the summation order depends only on data size + a constant block size,
never on threads/scheduling. Applied to PCA Gram, HVG sum/sumsq, and QC accumulation.
Drop-in scanpy interface —
examples/sr_ooc.rsCLI (qc|normalize_total|log1p|preprocess|hvg|pca,in-place via temp+rename) and
demo/singlerust.pymirroringsc.pp.*:sc.pp.calculate_qc_metrics(adata)sr.pp.calculate_qc_metrics(adata)sc.pp.normalize_total(adata, target_sum=1e4)sr.pp.normalize_total(adata, target_sum=1e4)sc.pp.log1p(adata)sr.pp.log1p(adata)sc.pp.highly_variable_genes(adata, …)sr.pp.highly_variable_genes(adata, …)sc.pp.pca(adata, …)sr.pp.pca(adata, …)adatais a path or a backedAnnData; the shim releases the HDF5 handle before invoking Rust.Benchmark (
demo/bench_ooc.py, 3 lanes): scanpy in-memory vs scanpy+Dask vs SingleRust OOC.Results
Memory, 500k × 48,788 genes (48 GB / 18-core):
SingleRust OOC uses ~5.4× less memory than scanpy in-memory and ~5.8× less than scanpy+Dask,
with a chunk-bounded footprint that stays flat as cells grow (scanpy OOMs ~46 GB at 2M). Dask's
sparse OOC path gave no memory benefit (RSS ≥ in-memory) at 2.6× the time — the immaturity
scanpy's issues describe.
Correctness & determinism
Each step is unit-tested against its same-algorithm counterpart: log1p/normalize match scanpy's X
exactly; QC matches; HVG selects the same genes as in-memory HVG; PCA matches a dense
covariance-PCA reference (variance ratios exact, embedding norms match up to sign).
Determinism verified two ways:
RAYON_NUM_THREADS=1vs18→ bit-identicalX, obs/var,
obsm["X_pca"], and variance ratios.Notes for reviewers
rayonas a direct dependency. 21 lib tests pass; clippy clean (one pre-existing dead-codewarning unrelated to this PR).
gene set than scanpy's — a pre-existing in-memory difference, not introduced here.
convert_to_array_f64CSR densify bug; the OOCtests densify manually to avoid it.
.venv-dask(anndata ≥ 0.11 / scanpy ≥ 1.11); the main3.9 demo env is untouched.
🤖 Generated with Claude Code