Skip to content

Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism#2

Open
iandriver wants to merge 8 commits into
demo/scverse-integrationfrom
feature/ooc-processing
Open

Out-of-core preprocessing (qc/normalize/log1p/hvg/pca) with deterministic parallelism#2
iandriver wants to merge 8 commits into
demo/scverse-integrationfrom
feature/ooc-processing

Conversation

@iandriver

Copy link
Copy Markdown
Owner

Summary

Implements native out-of-core (disk-backed) preprocessing for SingleRust — the flagship item
from the performance roadmap. The whole Tier-1 pipeline (QC → normalize_total → log1p → HVG → PCA)
now runs streamed against an .h5ad in bounded memory, the larger-than-RAM niche scanpy fills
with Dask (where, per scanpy's own issues, the sparse path is weak). All compute-bound passes are
parallelized with a determinism guarantee: results are bit-identical regardless of thread count.

Stacked on demo/scverse-integration, so this PR's diff is just the OOC engine + benchmarks.

What's new

OOC engine (src/backed/processing/) — streams X in row-chunks (x().iter) and writes back
via set_x_from_iter (extendable HDF5 dataset); peak memory ≈ one chunk + small accumulators:

  • transformation: log1p_backed, normalize_total_backed
  • qc: qc_metrics_backed (one pass; obs/var written in place)
  • pipeline: preprocess_backed — fused QC+normalize+log1p (the per-cell total is reused as the
    normalization row-sum, so it's computed once)
  • hvg: highly_variable_genes_backed (Seurat) — shares seurat_select with the in-memory path
  • pca: pca_backed — exact covariance/eigendecomposition PCA over the HVGs (pass-1 Gram matrix,
    symmetric eigendecomp, pass-2 projection → obsm["X_pca"])

Deterministic parallelism (backed::processing::det) — naive parallel FP reduction is not
reproducible (non-associative addition + work-stealing). All reductions use a fixed-block,
ordered-merge
scheme so the summation order depends only on data size + a constant block size,
never on threads/scheduling. Applied to PCA Gram, HVG sum/sumsq, and QC accumulation.

Drop-in scanpy interfaceexamples/sr_ooc.rs CLI (qc|normalize_total|log1p|preprocess|hvg|pca,
in-place via temp+rename) and demo/singlerust.py mirroring sc.pp.*:

scanpy SingleRust (out-of-core)
sc.pp.calculate_qc_metrics(adata) sr.pp.calculate_qc_metrics(adata)
sc.pp.normalize_total(adata, target_sum=1e4) sr.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata) sr.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, …) sr.pp.highly_variable_genes(adata, …)
sc.pp.pca(adata, …) sr.pp.pca(adata, …)

adata is a path or a backed AnnData; the shim releases the HDF5 handle before invoking Rust.

Benchmark (demo/bench_ooc.py, 3 lanes): scanpy in-memory vs scanpy+Dask vs SingleRust OOC.

Results

Memory, 500k × 48,788 genes (48 GB / 18-core):

lane time peak RSS
scanpy in-memory 13.5 s 13.6 GB
scanpy + Dask OOC 35.4 s 14.5 GB
SingleRust OOC (fused) 45.0 s 2.5 GB

SingleRust OOC uses ~5.4× less memory than scanpy in-memory and ~5.8× less than scanpy+Dask,
with a chunk-bounded footprint that stays flat as cells grow (scanpy OOMs ~46 GB at 2M). Dask's
sparse OOC path gave no memory benefit (RSS ≥ in-memory) at 2.6× the time — the immaturity
scanpy's issues describe.

Correctness & determinism

Each step is unit-tested against its same-algorithm counterpart: log1p/normalize match scanpy's X
exactly; QC matches; HVG selects the same genes as in-memory HVG; PCA matches a dense
covariance-PCA reference (variance ratios exact, embedding norms match up to sign).

Determinism verified two ways:

  • unit tests run QC/HVG/preprocess/PCA under 1 vs 8 rayon threads → bit-identical;
  • at scale, the full pipeline on 50k real cells with RAYON_NUM_THREADS=1 vs 18 → bit-identical
    X, obs/var, obsm["X_pca"], and variance ratios.

Notes for reviewers

  • Adds rayon as a direct dependency. 21 lib tests pass; clippy clean (one pre-existing dead-code
    warning unrelated to this PR).
  • End-to-end PCA axes differ from scanpy only because SingleRust's Seurat HVG selects a different
    gene set than scanpy's — a pre-existing in-memory difference, not introduced here.
  • Surfaced (and flagged separately) a pre-existing convert_to_array_f64 CSR densify bug; the OOC
    tests densify manually to avoid it.
  • The Dask benchmark lane needs an isolated .venv-dask (anndata ≥ 0.11 / scanpy ≥ 1.11); the main
    3.9 demo env is untouched.

🤖 Generated with Claude Code

Ian Driver and others added 8 commits June 16, 2026 11:53
…e shim

Implements native streaming (out-of-core) preprocessing on disk-backed
.h5ad — the highest-leverage item from the performance roadmap, targeting
the larger-than-RAM niche scanpy fills with Dask (where Dask+sparse is weak).

- backed::processing::transformation: log1p_backed + normalize_total_backed.
  Read X in row-chunks (x().iter), transform, stream back out via
  set_x_from_iter (extendable HDF5 dataset). Peak memory ~ one chunk, not
  the dataset. normalize is two passes (cheap per-cell sums, then scale).
- backed::processing::qc: qc_metrics_backed — one streaming pass computing
  the same cell/gene metrics as the in-memory version (totals, nnz, top-N
  segments, mito), written back into obs/var in place (X untouched).
- examples/sr_ooc.rs: CLI (qc | normalize_total | log1p) with in-place
  rewrite (temp + atomic rename) so a file can be processed without loading.
- demo/singlerust.py: near-drop-in `sr.pp.*` mirroring `sc.pp.*`
  (calculate_qc_metrics / normalize_total / log1p), operating on a backed
  AnnData's file or a path; releases the HDF5 handle before invoking Rust.

Validated against scanpy on a 200×600 dataset: X exact match after
normalize_total+log1p, and total_counts / n_genes / pct_counts_mito /
pct_top_{50,500} / var totals / n_cells / mean all match.

Also surfaced a pre-existing bug (flagged separately): convert_to_array_f64
corrupts CSR values when densifying — the OOC tests densify manually.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bench_ooc.py runs the same work (QC + normalize_total + log1p) in both
lanes as subprocesses under /usr/bin/time -l, capturing peak RSS + time.

Result at 500k cells × 48,788 genes (48 GB / 18-core):
  scanpy in-memory : 12.3s compute | 12.4 GB peak RSS
  SingleRust OOC   : 44.9s wall     |  2.3 GB peak RSS  (~5.3× less memory)

SingleRust OOC's RSS is chunk-bounded (≈flat as cells grow); scanpy
in-memory scales linearly and OOMs (~50 GB at 2M > 48 GB RAM). Time is
not apples-to-apples (scanpy = compute only; SingleRust includes a 6 GB
working copy + two full disk passes as separate processes) — documented
in OOC_BENCHMARK.md; the memory result is the durable signal.

The scanpy+Dask out-of-core lane is left as a hook: it needs anndata>=0.11
(read_elem_as_dask; this env has 0.10.9) and the Dask sparse path is
immature per scanpy's own issues.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a third lane (_scanpy_dask_pp.py) using anndata's experimental
read_elem_lazy to read X as a dask array, run dask-enabled
normalize_total/log1p (+qc), and stream the result out. bench_ooc.py now
compares all three lanes (peak RSS + time), driving the scanpy lanes with
an isolated .venv-dask (anndata 0.12 / scanpy 1.12 / dask), since the lazy
dask read needs anndata>=0.11 (Python>=3.10).

Result at 500k × 48,788 genes (48 GB / 18-core):
  scanpy in-memory  : 13.5s | 11.7 GB
  scanpy + Dask OOC : 92.2s |  7.5 GB
  SingleRust OOC    : 128s  |  2.1 GB

Key finding: Dask's sparse OOC path only modestly cuts memory (7.5 vs
11.7 GB) while costing ~7× the runtime — the dask-sparse immaturity
scanpy's own issues describe. SingleRust native streaming is the only lane
with truly bounded memory (5.6× less than in-memory, 3.6× less than Dask).
2M projection documented but not run (would OOM the box); peak RSS is the
fair metric (time isn't apples-to-apples — see OOC_BENCHMARK.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e too

preprocess_backed runs QC + normalize_total + log1p as one streamed job:
pass 1 accumulates QC metrics (the per-cell total IS the normalization
row-sum, so it's computed once); pass 2 streams normalize+log1p into X.
Two passes total instead of the ~4 read/write passes the separate ops
incurred, and no working copy.

- src/backed/processing/pipeline.rs: preprocess_backed (+ test).
- qc.rs: extracted stream_qc / write_metrics (pub(crate)) so the pipeline
  reuses the QC accumulation and its totals.
- transformation.rs: normalize_chunk made pub(crate) for reuse.
- examples/sr_ooc.rs: `preprocess` subcommand.
- demo/singlerust.py: sr.pp.preprocess.
- demo/bench_ooc.py: SingleRust lane now a single fused, copy-free call.

Benchmark (500k × 48,788 genes, 48 GB / 18-core):
  scanpy in-memory  : 13.5s | 13.6 GB
  scanpy + Dask OOC : 35.4s | 14.5 GB
  SingleRust OOC    : 45.0s |  2.5 GB

Fusion cut SingleRust wall time 128s -> 45s (~2.8×): now in the same
ballpark as Dask on time, at ~1/6th the memory. Dask's sparse OOC again
showed no memory benefit (RSS >= in-memory).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One streaming pass accumulates per-gene sum + sum-of-squares; mean and
Bessel-corrected sample variance are formed identically to the in-memory
var_col, then handed to the shared seurat_select (refactored out of the
in-memory compute_seurat_hvg) for dispersion binning/normalization and
top-N selection. Results written into var in place; memory bounded by one
chunk + two n_vars accumulators.

- memory/processing/hvg: extract pub(crate) seurat_select; module made
  pub(crate) so the backed path can reuse it.
- backed/processing/hvg.rs: highly_variable_genes_backed (+ test asserting
  the OOC mask equals the in-memory HVG mask on the same data).
- sr_ooc CLI `hvg` + sr.pp.highly_variable_genes shim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Exact PCA over the HVG-selected genes without holding the matrix in
memory:
  pass 1: stream X, accumulate the gene×gene Gram matrix + per-gene sums
          over selected genes (memory O(n_hvg²), e.g. 32 MB for 2000 HVGs)
  -> centered covariance C = (G - n·μμᵀ)/(n-1), symmetric eigendecomp
     (n_hvg × n_hvg) for loadings + variances
  pass 2: stream X again, project each centered cell onto the top axes
          -> obsm["X_pca"], uns["pca_variance_ratio"]; written in place.

Centering is folded into the projection (per-component offset), so pass 2
only touches each cell's selected nonzeros. Exact (not randomized) PCA;
matches a dense covariance-PCA reference (variance ratios exact, embedding
column norms match) up to per-component sign — see test.

- backed/processing/pca.rs: pca_backed (+ dense-reference test).
- sr_ooc CLI `pca` + sr.pp.pca shim.
- hvg: add SeuratResult type alias (clippy type-complexity).

Out-of-core Tier-1 is now complete: qc, normalize/log1p, fused preprocess,
HVG, and PCA all run disk-backed in bounded memory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
OOC_BENCHMARK.md: summarize that QC, normalize_total, log1p, fused
preprocess, HVG (Seurat) and PCA all run disk-backed via sr_ooc / sr.pp.*,
with each step validated against its in-memory/dense counterpart. Note the
end-to-end PCA-vs-scanpy axis difference traces to SingleRust's HVG gene
selection differing from scanpy's (pre-existing), not the OOC path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Parallelize the compute-bound OOC reductions with rayon, but guarantee
bit-identical results regardless of thread count via a fixed-block,
ordered-merge reduction (backed::processing::det). Floating-point sums are
non-associative and rayon work-stealing varies the order, so partials are
computed over fixed-index row blocks and merged in block order -> the
summation order depends only on data size + a constant block size, never
on threads or scheduling.

Parallelized:
- PCA pass-1 Gram matrix + per-gene sums (the compute hotspot) via
  det_block_reduce; pass-2 projection is per-cell independent (parallel,
  order-free).
- HVG per-gene sum/sum-of-squares via det_block_reduce.
- QC per-cell metrics + per-gene reduction via det_block_reduce (per-cell
  records kept in row order, per-gene summed in block order).

normalize/log1p stay sequential (trivial elementwise, no reduction) and
are deterministic by construction.

Determinism verified:
- unit tests: qc/hvg/preprocess/pca under 1 vs 8 rayon threads are
  bit-identical (metrics, masks, embeddings, variance ratios), plus a
  repeat-run check.
- at scale: full pipeline on 50k real cells with RAYON_NUM_THREADS=1 vs 18
  yields bit-identical X, obs/var, obsm["X_pca"], and uns variance ratios.

Adds rayon as a direct dependency. 21 lib tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant