Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching) by iandriver · Pull Request #90 · scverse/rustar-aligner

iandriver · 2026-06-13T19:44:21Z

What & why

Adds STARsolo single-cell support to rustar-aligner (Phase 14): the 10x
Chromium count-matrix pipeline (Gene + GeneFull), the CellRanger 4.x/5.x
flag set ported faithfully from STAR source and verified byte-identical against
real STARsolo, an in-Rust EmptyDrops_CR cell caller, and a 3-way benchmark
(CellRanger vs STARsolo vs rustar) that drove a set of accuracy, performance, and
memory improvements.

All counting changes are verified behavior-preserving (count matrices
byte-identical before/after) and the branch is green on every CI platform
(linux x86_64 / aarch64 / v3, macos-aarch64, windows-x86_64, clippy, fmt, MSRV,
security audit).

1. STARsolo MVP + CellRanger matching (Phases 14.1–14.4)

New src/solo/:

mod.rs — --soloType params + validation, barcode-read input
(SoloReadReader), SoloContext, per-read processing, CellRanger4 adapter clip.
whitelist.rs — 2-bit barcode packing, sorted-whitelist load, read-stage CB
matching (exact / 1MM / 1MM_multi) + UMI checks (STAR SoloReadBarcode_getCBandUMI.cpp).
gene.rs — per-read gene assignment (--soloStrand, Gene/GeneFull).
count.rs — UMI dedup (Exact/NoDedup/1MM_All/1MM_Directional/1MM_CR),
MultiGeneUMI(_CR) filtering, 1MM_multi_Nbase_pseudocounts CB posterior,
raw matrix.mtx/barcodes.tsv/features.tsv writer.

align_reads_solo in lib.rs reads cDNA + barcode in lockstep, aligns the cDNA,
and quantifies per cell. The CellRanger-matching flags (--clipAdapterType CellRanger4, --outFilterScoreMin, --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts, --soloUMIfiltering MultiGeneUMI_CR, --soloUMIdedup 1MM_CR) produce a matrix byte-identical to real STARsolo (3/3 deterministic).

2. GeneFull — intron-inclusive counting (`--soloFeatures Gene GeneFull`)

CellRanger's default include-introns mode. GeneAnnotation gains a per-gene
gene-body interval index (overlapping_genes_full), SoloContext holds a list of
features each with its own recorder, and process_read does the shared CB-match +
UMI check once then assigns a gene per feature — so one pass writes both
Solo.out/Gene/raw/ and Solo.out/GeneFull/raw/. A purely intronic read now counts
under GeneFull but not Gene.

3. EmptyDrops_CR cell caller — in Rust (`src/bin/emptydrops.rs`)

Standalone binary porting STAR's SoloFeature_emptyDrops_CR.cpp / CellRanger's
EmptyDrops: CR2.2 knee for guaranteed cells → ambient profile (Good-Turing P0) →
per-candidate multinomial log-likelihood → Monte-Carlo p-values → Benjamini-Hochberg.
Reads any raw MatrixMarket matrix (plain or .gz), so it can call cells on
rustar / STARsolo / CellRanger output with one algorithm for an apples-to-apples
filtered comparison. Defaults mirror --soloCellFilter EmptyDrops_CR 3000 0.99 10 45000 90000 500 0.01 20000.

Accuracy results (mouse 5k-PBMC, 5′ GEM-X, 10M-read subsample)

GeneFull raw counts — rustar ≈ STARsolo, and the intron gap to CellRanger closes:

GeneFull raw	rustar	STARsolo	CellRanger
total UMI	4,981,160	4,965,170	4,843,682
genes detected	17,539	17,448	17,258

rustar ↔ STARsolo per-cell UMI r = 0.999997, 93.0% exact-total match.
rustar ↔ CellRanger r = 0.999792; STARsolo ↔ CellRanger r = 0.999805.
GeneFull adds +18% UMI over Gene → within 2.8% of CellRanger (Gene-only was ~14% short).

EmptyDrops cell calling — my Rust port vs CellRanger's native filtered set:

rustar-ED ↔ STARsolo-ED: Jaccard 0.9993.
Each captures 3,857 / 3,858 of CellRanger's native filtered cells; per-cell UMI r = 0.9995 on shared cells.

Caveat: on this subsample STAR's default umiMin=500 leaves no sub-knee
candidates, so --umi-min 100 was used to exercise the Monte-Carlo rescue
(over-calls ~600 low-count droplets beyond CellRanger's tuned cutoff — a
threshold choice, not a counting discrepancy).

Performance & memory

Index mmap (no OOM). Loading used std::fs::read, pinning the ~21 GB SA +
~1.8 GB SAindex + ~5.6 GB genome as un-reclaimable anonymous Vecs (froze the
host on a real run). PackedArray and Genome.sequence are now Owned(Vec) | Mapped(Arc<Mmap>); load memory-maps the SA, SAindex, and forward genome strand
(reverse-complement computed on access). File-backed pages are reclaimable → a
multi-GB index no longer pins RAM. The run completes under a 24 GB cap
(OOMKilled=false) where the old ~37 GB anonymous footprint was OOM-killed;
output byte-identical.

Matrix output. Buffered the raw-matrix writers (was one write(2) per line,
~3.7M syscalls): raw-matrix write 1306 s → 3 s. Streaming per-cell
build_matrix bounds output memory to a single cell. Both byte-identical.

Two new speedups (commit perf:, matrices verified byte-identical):

--outSAMtype None (previously errored "not yet implemented") now skips
building SAM records entirely for count-only runs via params.emits_alignments().
~12% faster natively (clean 3M: 29.5 s vs 33.4 s) and avoids the multi-GB
Aligned.out.sam — a much larger win in a container where that write crosses virtiofs.
Thread-local scratch buffers in gene assignment (ran 2×/read for Gene GeneFull, allocating fresh Vecs each call); GeneAnnotation gains
overlapping_genes{,_full}_into(tr, &mut buf).

On the "rustar looked ~5× slower" question. That ratio came from running
everything in an amd64 Docker container (CellRanger is x86-only) under Rosetta,
with the index mmap'd over virtiofs. Native arm64 rustar does the same 10M reads
in ~85–97 s (~100k+ reads/s) — ~10–15× faster than its own emulated run, so the
true native gap vs STAR is far smaller than 5×. (STAR can't run native on Apple
Silicon, so a native-vs-native number isn't available.)

Testing — what's verified and how

Unit/integration: 480 lib + 11 integration tests pass, 0 clippy, cargo fmt
clean, MSRV check + cargo audit green. Integration tests build a synthetic 20 kb
genome with a planted GT-AG intron and exercise BAM/PE/spliced/BySJout/GeneCounts/
unmapped/two-pass; genefull_counts_intronic_read covers GeneFull semantics.
Differential vs real STARsolo: Linux-container harness
(test/solo_cellranger_diff.py + Dockerfile.solodiff) confirms the CellRanger-flag
matrix is byte-identical (3/3 deterministic).
3-way benchmark + h5 comparison: test/solo_bench.py (CellRanger/STARsolo/rustar
runtime + counts), test/solo_genefull_compare.py / solo_genefull_h5_compare.py
(GeneFull parity + EmptyDrops cell-set overlap), test/solo_compare_h5ad.py
(knee/EmptyDrops-filtered h5ad + correlations).
Behavior preservation: the GeneFull Gene/GeneFull matrices and the
--outSAMtype None / scratch-buffer changes were re-run on the full 10M set and
diff-checked byte-identical against the prior output.
Cross-platform: fixed a Unix-only madvise(MADV_RANDOM) build break on
windows-x86_64 (gated behind cfg(unix) with a no-op stub) — all CI platforms green.

Notable files

src/solo/{mod,whitelist,gene,count}.rs, src/bin/emptydrops.rs,
src/quant/mod.rs (gene-body index), src/genome/mod.rs + src/index/{io,packed_array}.rs
(mmap), src/lib.rs (align_reads_solo, --outSAMtype None), test/solo_*.py,
docs-old/phase14_starsolo.md + phase14_benchmark.md.

🤖 Generated with Claude Code

Implements STARsolo Phases 14.1–14.4 (the 10x Chromium Gene-count MVP) plus the CellRanger 4.x/5.x-matching flag set, all ported faithfully from STAR's source and verified byte-identical against real STARsolo. New `src/solo/`: - mod.rs — SoloType/params plumbing, barcode-read input (SoloReadReader), SoloContext, per-read processing, CellRanger4 adapter clip - whitelist.rs — 2-bit barcode packing + sorted-whitelist load + read-stage CB matching (exact/1MM/1MM_multi) + UMI checks (STAR SoloReadBarcode_getCBandUMI.cpp) - gene.rs — per-read gene assignment (--soloStrand), reuses quant::overlapping_genes - count.rs — UMI dedup (Exact/NoDedup/1MM_All/1MM_Directional/1MM_CR), MultiGeneUMI(_CR) filtering, 1MM_multi_Nbase_pseudocounts CB posterior, raw matrix.mtx/barcodes.tsv/features.tsv writer Driver: new align_reads_solo loop in lib.rs (reads cDNA + barcode in lockstep, aligns cDNA, quantifies per cell); solo params + validation in params/mod.rs. CellRanger-matching flags (--clipAdapterType CellRanger4, --outFilterScoreMin, --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts, --soloUMIfiltering MultiGeneUMI_CR, --soloUMIdedup 1MM_CR) produce a matrix byte-identical to real STARsolo (3/3 deterministic) — verified via a Linux-container differential harness (test/solo_cellranger_diff.py + Dockerfile.solodiff + solo_diff_docker.sh), since STAR 2.7.11b reads 0 reads on Apple-Silicon macOS. Also adds a CellRanger-vs-STARsolo-vs-rustar runtime/stats benchmark scaffold (test/solo_bench.py + Dockerfile.bench). Tests: 479 lib + 11 integration (incl. test_starsolo_cellranger_style_matrix), 0 clippy warnings. Docs in docs-old/phase14_starsolo.md + ROADMAP.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Performance + scalability fixes for STARsolo matrix output, validated by a CellRanger-vs-STARsolo-vs-rustar benchmark on a real 5' mouse 10x dataset. - Buffered I/O: write_barcodes / write_matrix_mtx / write_features wrote to a raw std::fs::File (one syscall per line). For the full 3.69M-barcode barcodes.tsv that was ~3.7M syscalls, dominating runtime (esp. over virtiofs). Wrap in BufWriter + add CbWhitelist::unpack_barcode_into (no per-line String alloc). On the mouse bench, raw-matrix write dropped 1306s -> 3s, output byte-identical. - build_matrix Step 1 (per-cell): sort the flat record list by cell barcode and process one cell's contiguous slice at a time, so peak memory is a single cell's umi->gene maps rather than a global cell->umi->gene nested map over all records (the previous version's tens-of-GB allocator blowup). Mirrors STAR SoloFeature_collapseUMIall.cpp. Output identical. - Benchmark harness test/solo_bench.py (+ Dockerfile.bench): runs all three tools under /usr/bin/time -v and reports runtime/peak-RSS/matrix stats; results + method in docs-old/phase14_benchmark.md. 479 lib + 11 integration tests, 0 clippy warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

build_matrix previously materialized a global cell -> (gene -> count) map for the whole raw matrix before writing. Replace it with stream_matrix: process one cell at a time (records already sorted by cb) and write each cell's entries straight to a temporary MatrixMarket body, counting nnz on the fly, then emit matrix.mtx as the header (rows cols nnz) followed by the body (the BySJout temp-file pattern). The global output map is never built, so matrix-output memory is bounded by a single cell regardless of how many cells the raw whitelist matrix spans. Verified byte-identical to the prior output on the real 10M-read mouse benchmark (matrix.mtx / barcodes.tsv / features.tsv all identical, 2,650,591 entries). Note: this bounds matrix-output memory for large raw matrices but does not cut the current ~37 GB peak, which is dominated by the loaded index (~27 GB SA) plus alignment buffers — mmap'ing the index is the lever for that. 479 lib + 11 integration tests, 0 clippy warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Loading the suffix array and SAindex with std::fs::read materialized them as anonymous Vecs (~21 GB + ~1.8 GB for mouse). Anonymous memory is not reclaimable, so under pressure the OS swaps it — which froze the host (Jetsam) on a real solo run. Back PackedArray with PackedBytes { Owned(Vec) | Mapped(Arc<Mmap>) }: build still uses Owned (write() requires it); load now memory-maps the SA and the SAindex packed-data region (read normally past the small header) with MADV_RANDOM. The read path is unchanged (operates on &[u8] either way). Arc<Mmap> keeps the two-pass GenomeIndex clone cheap. mmap pages are file-backed: demand-loaded and reclaimable (dropped, never swapped) under pressure, so a multi-GB SA no longer pins anonymous RAM and the run degrades to paging instead of OOM-killing. Validated on the real 10M-read mouse benchmark (GRCm39, 21 GB SA): matrix.mtx / barcodes.tsv / features.tsv byte-identical to the in-RAM result, and the run COMPLETES under a 24 GB container cap (peak RSS 23.4 GB, OOMKilled=false) where the previous ~37 GB anonymous footprint would have been OOM-killed. (Genome's forward+RC buffer, ~5.6 GB, is still an owned Vec — Phase 2.) 479 lib + 11 integration tests, 0 clippy warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…se 2) load_genome built a 2*n_genome Vec (~5.6 GB for mouse) holding the forward strand plus a precomputed reverse complement. Replace Genome.sequence's type with GenomeSeq { Owned(Vec) | Mapped { fwd: Arc<Mmap>, n_genome } }: - Build stays Owned (the full forward+RC buffer is needed for SA construction and the on-disk write, and only Owned supports slicing/mutation). - Load memory-maps the on-disk Genome file, which holds ONLY the forward strand (n_genome bytes). The reverse-complement half is computed on access in GenomeSeq::base(i) = complement(fwd[2n-1-i]) for i >= n. So the ~2.8 GB RC buffer is never materialized, and the forward bytes are reclaimable file-backed pages instead of an anonymous Vec. Hot alignment access was already single-byte (seed extension, SAindex lookup), so it routes through base()/get(); build-time slicing routes through as_slice() (always Owned). PartialEq/Debug are hand-implemented (memmap2::Mmap has neither). Combined with the SA + SAindex mmap (commit 4480806), all three large index components are now reclaimable file-backed memory; the anonymous floor of a solo run is just the alignment working buffers. Validated on the real 10M-read mouse benchmark (GRCm39, 2.79 Gb genome): matrix.mtx / barcodes.tsv / features.tsv byte-identical to the precomputed-RC result — i.e. RC-on-access is correct for every reverse-strand alignment — with OOMKilled=false. 479 lib + 11 integration tests, 0 clippy warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

STARsolo's `--soloFeatures GeneFull` (CellRanger's include-introns default): a read counts toward a gene if it overlaps the gene's full body (exons + introns), so purely intronic reads are counted — unlike exonic `Gene`. - GeneAnnotation gains chr_gene_body (one [min exon start, max exon end) span per gene) + overlapping_genes_full(); overlapping_genes() refactored to a shared interval-overlap helper. - gene.rs: SoloFeature{Gene,GeneFull}; assign_gene_se() takes the feature and picks exon vs gene-body overlap. - SoloContext now holds a list of features, each with its own SoloRecorder; process_read does the shared CB-match + UMI check once and assigns a gene per feature. write_gene_matrix writes one Solo.out/<feature>/raw/ per feature, so `--soloFeatures Gene GeneFull` produces both matrices in one pass. - params: validate soloFeatures (only Gene/GeneFull supported). 480 lib + 11 integration tests (incl. genefull_counts_intronic_read), 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Faithful port of STAR's SoloFeature_emptyDrops_CR.cpp (CellRanger's EmptyDrops variant) as a standalone `emptydrops` binary that post-processes any raw count matrix (MatrixMarket genes×cells + barcodes/features tsv, plain or .gz): 1. guaranteed cells from the CellRanger-2.2 knee (nExpectedCells/maxPct/ratio); 2. ambient RNA profile from barcode ranks [indMin,indMax) with a Good-Turing P0 unseen-mass correction (approximating STAR's SGT smoothing); 3. candidate barcodes (rank >= nSimple, total >= max(umiMin, frac*median)); 4. per-candidate multinomial log-likelihood under the ambient profile; 5. Monte-Carlo p-values — simN ambient-drawn barcodes, running log-prob per count reused across candidates; p = (1+#sim<obs)/(1+simN); 6. Benjamini-Hochberg FDR; cells with padj <= FDR added to the guaranteed set. Defaults mirror `--soloCellFilter EmptyDrops_CR 3000 0.99 10 45000 90000 500 0.01 20000` with seed 19760110 and simN 10000. Writes barcodes.tsv + cells.txt + emptydrops.json. Lets us call cells on rustar/STARsolo/CellRanger raw matrices with one algorithm for an apples-to-apples filtered comparison. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- solo_genefull_compare.py: GeneFull raw-count parity (total UMI / genes / per-cell correlation) across rustar/STARsolo/CellRanger + EmptyDrops cell-set overlap (Jaccard) vs CellRanger's filtered barcodes. - solo_compare_h5ad.py: optional --{rustar,starsolo,cellranger}-cells to filter each raw matrix by an EmptyDrops cells.txt instead of the CR2.2 knee, so the written h5ad reflect EmptyDrops cell calls for the CellRanger comparison. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

solo_genefull_h5_compare.py loads matrices one at a time (memory-careful) and reports the intron effect (Gene vs GeneFull total UMI), raw-count parity vs CellRanger, EmptyDrops cell-set agreement (same-algorithm + vs CR native filtered), per-cell UMI correlation on shared cells, and writes the EmptyDrops-filtered h5ad for both tools. Result on mouse 5k-PBMC (10M subsample): GeneFull +18% UMI over Gene, within 2.8% of CellRanger; rustar-ED captures 3857/3858 CR native cells; per-cell UMI r=0.9995; same-algorithm Jaccard 0.9749. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CellRanger appends '-1' to barcodes; STARsolo/rustar do not. Normalize on load so the rustar/STARsolo vs CellRanger per-cell correlations match cells instead of reporting shared=0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…fers Two behavior-preserving speedups (count matrices verified byte-identical on the mouse 5k-PBMC 10M-read set): 1. `--outSAMtype None` (previously errored "not yet implemented") now routes to NullWriter, and the SE/PE/solo loops skip building SAM RecordBufs entirely via params.emits_alignments(). Quant/solo runs that only need the count matrix no longer pay SAM text formatting + the multi-GB Aligned.out.sam write. ~12-15% faster natively; far more in a container where the SAM write crosses virtiofs. 2. Gene assignment (assign_gene_se) ran once per feature per read — twice/read for `Gene GeneFull` — allocating fresh Vecs each call. It now reuses two thread-local scratch buffers, and GeneAnnotation gains overlapping_genes{,_full}_into(transcript, &mut buf) so the overlap query writes into a caller-provided buffer instead of allocating. 480 lib + 11 integration tests pass, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`memmap2::Advice` / `Mmap::advise` (madvise) are Unix-only, so the SA + SAindex mmap load failed to build on windows-x86_64. Wrap the best-effort `advise(Advice::Random)` in a cfg(unix) `advise_random` helper with a no-op cfg(not(unix)) stub. No behavior change on Unix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Each `Solo.out/<feature>/Summary.csv` now reports the sequencing/mapping funnel and per-cell statistics, matching STARsolo's format: - Number of Reads, Reads With Valid Barcodes, Sequencing Saturation - Reads Mapped to Genome (Unique+Multiple / Unique), Reads Mapped to <feature> - genome -> Exonic / Intronic / Intergenic confident-mapping split (computed from the Gene vs GeneFull read counts: exonic=Gene, intronic=GeneFull-Gene, intergenic=genome_unique-GeneFull) - Estimated Number of Cells (CR2.2 knee), reads/UMIs in cells, fraction in cells, mean/median reads/UMI/genes per cell, total genes detected Mechanism: SoloContext gains per-feature `feature_reads` atomics (incremented in process_read on a unique gene assignment); stream_matrix returns per-cell (reads, UMIs, genes) + genes-detected; write_gene_matrix takes AlignmentStats and writes the summary. Matrix output is unchanged (byte-identical verified). test/solo_summary_compare.py cross-tabulates rustar / STARsolo Summary.csv and CellRanger metrics_summary.csv. On mouse 5k-PBMC (10M): saturation 14.5/14.5/14.6%, genome 92.0/92.0/91.9%, median genes/cell 601/601/599, median UMI/cell 915/916/908; rustar exonic/intronic/intergenic 49.5/8.8/16.9% vs CellRanger 53.5/11.4/15.9%. 482 lib tests (+median/knee helpers), 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…sense) The Summary.csv mapping funnel previously derived exonic/intronic from the unique-gene-assignment counts (Gene, GeneFull), which dropped multi-gene (ambiguous) reads into the intergenic residual and folded antisense reads in too. Replace it with CellRanger's method: classify each uniquely-mapped read by position (exonic if it overlaps any exon, else intronic if in a gene body, else intergenic) independent of strand, and report antisense (maps to a gene body on the opposite strand) as a separate orientation metric. `classify_read` does this in a single pass, sharing the two overlap queries with the per-feature gene assignment, so process_read does no more gene-model work than before (the two assign_gene_se calls are replaced by one classify_read). Region/antisense are counted over uniquely-mapped reads regardless of barcode (CellRanger's "confidently mapped" = MAPQ 255); matrix output is unchanged (byte-identical verified). Result vs CellRanger (mouse 5k-PBMC, 10M): intronic 10.8% vs 11.4%, intergenic 16.1% vs 15.9%, antisense 5.9% vs 6.0% — all within ~0.5pp (were 8.8/16.9/—). The residual exonic gap (48.4 vs 53.5%) is the genome-mapping denominator (STAR unique 75.2% vs CellRanger confident 80.8%) + CellRanger's >=50%-exon / splice-compatibility rule, not the binning method. Runtime unchanged (~78s/10M), 483 lib tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Phase 14.6 — write Solo.out/<feature>/filtered/ (the called cells), not just raw/. `--soloCellFilter`: - CellRanger2.2 <nExp> <maxPct> <ratio> (default) — CR2.2 knee - TopCells <N> — top-N barcodes by UMI - None — skip filtered/ - EmptyDrops_CR — knee-guaranteed cells (Monte-Carlo rescue stays in the standalone `emptydrops` binary; logged) Mechanism: stream_matrix is split into build_matrix_body (per-cell dedup → shared plain temp body + per-cell CellStat{cb,reads,umis,genes}) and finalize_matrix (header + verbatim body for raw, or cb-remapped/renumbered body for filtered). Raw and filtered share the one streaming pass; the filtered matrix re-reads the temp body filtering by called cb. Raw output byte-identical (verified). `--soloOutGzip yes` gzips matrix.mtx/barcodes.tsv/features.tsv (raw + filtered) and appends .gz (CellRanger-style); default `no` keeps STARsolo's plain files so the byte-for-byte STARsolo comparison still holds. write_file() finishes the gzip stream explicitly. Verified on mouse 5k-PBMC (10M): GeneFull filtered 3821 cells vs STARsolo 3820 (Jaccard 0.9997, 0 STARsolo-only); filtered UMI sum == Summary "UMIs in Cells"; all 6 .gz files pass gunzip -t; runtime unchanged (~85s/10M). 484 lib + 12 integration tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`--soloFeatures SJ` writes Solo.out/SJ/raw/{features.tsv,barcodes.tsv,matrix.mtx} where rows are the SJ.out.tab junctions (STARsolo symlinks features → SJ.out.tab; rustar writes the identical 9-column lines in the same chr/start/end-sorted order). Per uniquely-mapped read the crossed junctions (absolute intron coords from extract_junction_keys) are recorded with the resolved CB+UMI, then collapsed per (cell, junction) by --soloUMIdedup; junctions filtered out of SJ.out.tab are dropped. Plumbing: SoloContext gains sj_enabled + sj_records; process_read takes the read's junctions and emits SjCountRecord; the solo loop extracts junctions for unique spliced reads. SpliceJunctionStats gains sj_feature_order() (row order) and a shared write_sj_lines() (used by both SJ.out.tab and the SJ features.tsv). The post-run solo output moved into run_single_pass (write_solo_output) where sj_stats is live. Respects --soloOutGzip. Verified on mouse 5k-PBMC (10M): SJ features.tsv byte-identical to SJ.out.tab (100,970 junctions), matrix 100,970 × 3,686,400, 1.41M entries / 1.87M molecules, all rows/cols in range; runtime unchanged (~82s). Integration test test_starsolo_sj_feature plants a spliced read across the GT-AG intron and asserts the one-junction matrix. 501 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reads mapping to multiple genes (gene-ambiguous), previously dropped, are now distributed across their gene set and written as real-valued UniqueAndMult-<method>.mtx alongside the unique matrix.mtx in raw/. - classify_read gains want_multi → returns the sense gene set for ambiguous reads (Gene + GeneFull); process_read records a MultiGeneRecord (resolved CB) per ambiguous read into recorder.multi_gene. - build_multi_matrices re-reads the raw matrix body (per-cell unique counts u_g, cb-ascending) and merges each cell with its ambiguous molecules (deduped by UMI, gene set = union), then per method: Uniform — 1/N to each gene in the set PropUnique — proportional to unique counts (uniform fallback if none) Rescue — proportional to unique + a uniform multi prior EM — iterate theta_g = u_g + (multi ∝ theta) to convergence Each ambiguous molecule contributes total mass 1.0, so every method conserves the grand total (only the spread differs). Cells with only multi reads are skipped (no unique row). Verified on mouse 5k-PBMC (10M, Gene): all four matrices sum to 4,527,229 (= 4,219,608 unique + 307,621 multi molecules); Uniform/Rescue spread wider (2.80M nnz) than PropUnique/EM (2.77M); values real (e.g. 0.5, 1.33333). Runtime unchanged (~79s, multi is a second pass over the raw body). Unit test distribute_multi_methods + integration test test_starsolo_multimappers (overlapping genes G1/G3). 503 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

solo_sj_multi_compare.py compares the SJ matrix (junctions coord-matched, since each tool has its own SJ.out.tab; STARsolo symlinks features→SJ.out.tab) and the UniqueAndMult-*.mtx matrices (genes/barcodes align by GTF/whitelist order). Result on mouse 5k-PBMC (10M, CellRanger params): multiMappers correlate r=0.98-0.998 vs STARsolo; SJ junction sets overlap 99.9% with per-junction r=0.93. Totals run ~4% (Gene) / ~10% (SJ) higher — the known "rustar maps slightly more reads" gap, amplified for splice-spanning reads — not a distribution-logic difference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Plate-well full-length protocols (Smart-seq2): no cell barcodes or UMIs in the reads. A --readFilesManifest TSV (read1<TAB>read2<TAB>cellID) lists one library per cell; reads are aligned, gene-assigned (Gene/exon), and counted per cell with no UMI dedup (each uniquely-gene-assigned read = a count). - src/solo/smartseq.rs: manifest parser (single-end MVP; read2 must be '-'), SmartSeqCounts (per-cell gene→count), and the genes×cells matrix writer (barcodes.tsv = manifest cell IDs; respects --soloOutGzip). - run_smartseq in lib.rs builds the gene model from --sjdbGTFfile, iterates each cell's reads (rayon per batch), and writes Solo.out/Gene/raw/. Dispatched ahead of the droplet solo_ctx (which needs a whitelist). - params: --readFilesManifest; SmartSeq exempt from the --readFilesIn / whitelist-correction / two-read-file rules; requires the manifest. Verified on mouse cDNA reads (2 synthetic cells → per-cell read counts) and an integration test test_starsolo_smartseq (Exon1 reads → G1 counts 5/3). 504 tests, 0 clippy. PE SmartSeq is a follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Chemistries whose cell barcode is split into several fixed-position segments (sci-RNA-seq, SPLiT-seq, …), each with its own whitelist. - SoloBarcodeLayout is now an enum: Simple (CB_UMI_Simple, unchanged) and Complex, which parses --soloCBposition/--soloUMIposition (startAnchor_startDist_endAnchor_ endDist; read-start anchoring) and assembles the CB by concatenating the segment slices from the barcode read. - CbWhitelist::load_complex builds the combined whitelist as the cartesian product of the per-segment whitelists (concatenated, packed). Matching the assembled CB against this is equivalent to STARsolo's per-segment matching for both Exact and 1MM (a 1MM in the concatenation is a 1MM in exactly one segment), so the rest of the pipeline (correction, UMI dedup, matrix) is reused unchanged. - params: --soloCBposition / --soloUMIposition; CB_UMI_Complex requires one position + one whitelist per segment. Adapter-anchored / variable positions are a follow-up (read-start only for now). Unit tests (parse_position, complex_layout_assembles_segments) + integration test test_starsolo_cb_umi_complex (2 segments × 2 whitelists → 4-cell product, CB AAGG matched, 1 molecule). 507 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

--soloCellFilter EmptyDrops_CR now writes filtered/ with the full rescue (knee guaranteed cells + cells whose profile is significantly non-ambient), instead of just the knee. emptydrops_called re-reads the raw matrix body for the ambient (rank [indMin,indMax)) and candidate (rank >= nSimple, total >= minUMI) cell profiles, then runs the same multinomial Monte-Carlo + Benjamini-Hochberg as the standalone `emptydrops` binary (seed 19760110, Good-Turing P0 ambient). Params: EmptyDrops_CR nExpected maxPct maxMinRatio indMin indMax umiMin umiMinFracMedian candMaxN FDR [simN]. Verified on mouse 5k-PBMC (10M, GeneFull, umiMin=100): 3821 knee + 629 rescued = 4450 filtered cells — identical to the standalone binary's result on the same matrix. With STAR's default umiMin=500 there are no sub-knee candidates so it reduces to the knee. 507 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

--readFilesManifest read2 may now be a mate-2 file (not just '-'). For PE cells, run_smartseq reads mate pairs in lockstep (PairedFastqReader), aligns each with align_paired_read, and counts the fragment once toward the gene from the union of both mates' overlaps (both-mapped → both transcripts; half-mapped → the mapped mate). SE manifests are unchanged (read counts). Verified end-to-end on real Smart-seq2 mouse data (GEO GSE228456 / SRP429940, 3 sorted single monocytes, 1M reads each): PE fragment counts run slightly below SE read counts (stricter proper-pairing, one count/fragment) and detect more genes (read2 covers extra regions) — as expected. Integration test test_starsolo_smartseq_paired (mate1 Exon1 + mate2 rc(Exon2) → proper FR pair on G1, 4 fragments). 508 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Ian Driver and others added 13 commits June 12, 2026 22:22

Merge branch 'main' into phase14-starsolo

7ba3a9d

iandriver changed the title ~~Phase 14: STARsolo single-cell support (MVP + CellRanger matching)~~ Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching) Jun 16, 2026

Ian Driver and others added 10 commits June 16, 2026 21:06

Psy-Fer self-assigned this Jun 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching)#90

Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching)#90
iandriver wants to merge 23 commits into
scverse:mainfrom
iandriver:phase14-starsolo

iandriver commented Jun 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iandriver commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

1. STARsolo MVP + CellRanger matching (Phases 14.1–14.4)

2. GeneFull — intron-inclusive counting (--soloFeatures Gene GeneFull)

3. EmptyDrops_CR cell caller — in Rust (src/bin/emptydrops.rs)

Accuracy results (mouse 5k-PBMC, 5′ GEM-X, 10M-read subsample)

Performance & memory

Testing — what's verified and how

Notable files

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iandriver commented Jun 13, 2026 •

edited

Loading

2. GeneFull — intron-inclusive counting (`--soloFeatures Gene GeneFull`)

3. EmptyDrops_CR cell caller — in Rust (`src/bin/emptydrops.rs`)