Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching)#90
Open
iandriver wants to merge 23 commits into
Open
Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching)#90iandriver wants to merge 23 commits into
iandriver wants to merge 23 commits into
Conversation
Implements STARsolo Phases 14.1–14.4 (the 10x Chromium Gene-count MVP) plus
the CellRanger 4.x/5.x-matching flag set, all ported faithfully from STAR's
source and verified byte-identical against real STARsolo.
New `src/solo/`:
- mod.rs — SoloType/params plumbing, barcode-read input (SoloReadReader),
SoloContext, per-read processing, CellRanger4 adapter clip
- whitelist.rs — 2-bit barcode packing + sorted-whitelist load + read-stage
CB matching (exact/1MM/1MM_multi) + UMI checks (STAR
SoloReadBarcode_getCBandUMI.cpp)
- gene.rs — per-read gene assignment (--soloStrand), reuses
quant::overlapping_genes
- count.rs — UMI dedup (Exact/NoDedup/1MM_All/1MM_Directional/1MM_CR),
MultiGeneUMI(_CR) filtering, 1MM_multi_Nbase_pseudocounts CB
posterior, raw matrix.mtx/barcodes.tsv/features.tsv writer
Driver: new align_reads_solo loop in lib.rs (reads cDNA + barcode in lockstep,
aligns cDNA, quantifies per cell); solo params + validation in params/mod.rs.
CellRanger-matching flags (--clipAdapterType CellRanger4, --outFilterScoreMin,
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts, --soloUMIfiltering
MultiGeneUMI_CR, --soloUMIdedup 1MM_CR) produce a matrix byte-identical to real
STARsolo (3/3 deterministic) — verified via a Linux-container differential
harness (test/solo_cellranger_diff.py + Dockerfile.solodiff + solo_diff_docker.sh),
since STAR 2.7.11b reads 0 reads on Apple-Silicon macOS.
Also adds a CellRanger-vs-STARsolo-vs-rustar runtime/stats benchmark scaffold
(test/solo_bench.py + Dockerfile.bench).
Tests: 479 lib + 11 integration (incl. test_starsolo_cellranger_style_matrix),
0 clippy warnings. Docs in docs-old/phase14_starsolo.md + ROADMAP.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Performance + scalability fixes for STARsolo matrix output, validated by a CellRanger-vs-STARsolo-vs-rustar benchmark on a real 5' mouse 10x dataset. - Buffered I/O: write_barcodes / write_matrix_mtx / write_features wrote to a raw std::fs::File (one syscall per line). For the full 3.69M-barcode barcodes.tsv that was ~3.7M syscalls, dominating runtime (esp. over virtiofs). Wrap in BufWriter + add CbWhitelist::unpack_barcode_into (no per-line String alloc). On the mouse bench, raw-matrix write dropped 1306s -> 3s, output byte-identical. - build_matrix Step 1 (per-cell): sort the flat record list by cell barcode and process one cell's contiguous slice at a time, so peak memory is a single cell's umi->gene maps rather than a global cell->umi->gene nested map over all records (the previous version's tens-of-GB allocator blowup). Mirrors STAR SoloFeature_collapseUMIall.cpp. Output identical. - Benchmark harness test/solo_bench.py (+ Dockerfile.bench): runs all three tools under /usr/bin/time -v and reports runtime/peak-RSS/matrix stats; results + method in docs-old/phase14_benchmark.md. 479 lib + 11 integration tests, 0 clippy warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_matrix previously materialized a global cell -> (gene -> count) map for the whole raw matrix before writing. Replace it with stream_matrix: process one cell at a time (records already sorted by cb) and write each cell's entries straight to a temporary MatrixMarket body, counting nnz on the fly, then emit matrix.mtx as the header (rows cols nnz) followed by the body (the BySJout temp-file pattern). The global output map is never built, so matrix-output memory is bounded by a single cell regardless of how many cells the raw whitelist matrix spans. Verified byte-identical to the prior output on the real 10M-read mouse benchmark (matrix.mtx / barcodes.tsv / features.tsv all identical, 2,650,591 entries). Note: this bounds matrix-output memory for large raw matrices but does not cut the current ~37 GB peak, which is dominated by the loaded index (~27 GB SA) plus alignment buffers — mmap'ing the index is the lever for that. 479 lib + 11 integration tests, 0 clippy warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Loading the suffix array and SAindex with std::fs::read materialized them as
anonymous Vecs (~21 GB + ~1.8 GB for mouse). Anonymous memory is not
reclaimable, so under pressure the OS swaps it — which froze the host (Jetsam)
on a real solo run.
Back PackedArray with PackedBytes { Owned(Vec) | Mapped(Arc<Mmap>) }: build still
uses Owned (write() requires it); load now memory-maps the SA and the SAindex
packed-data region (read normally past the small header) with MADV_RANDOM. The
read path is unchanged (operates on &[u8] either way). Arc<Mmap> keeps the
two-pass GenomeIndex clone cheap.
mmap pages are file-backed: demand-loaded and reclaimable (dropped, never
swapped) under pressure, so a multi-GB SA no longer pins anonymous RAM and the
run degrades to paging instead of OOM-killing.
Validated on the real 10M-read mouse benchmark (GRCm39, 21 GB SA): matrix.mtx /
barcodes.tsv / features.tsv byte-identical to the in-RAM result, and the run
COMPLETES under a 24 GB container cap (peak RSS 23.4 GB, OOMKilled=false) where
the previous ~37 GB anonymous footprint would have been OOM-killed. (Genome's
forward+RC buffer, ~5.6 GB, is still an owned Vec — Phase 2.)
479 lib + 11 integration tests, 0 clippy warnings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…se 2)
load_genome built a 2*n_genome Vec (~5.6 GB for mouse) holding the forward
strand plus a precomputed reverse complement. Replace Genome.sequence's type
with GenomeSeq { Owned(Vec) | Mapped { fwd: Arc<Mmap>, n_genome } }:
- Build stays Owned (the full forward+RC buffer is needed for SA construction
and the on-disk write, and only Owned supports slicing/mutation).
- Load memory-maps the on-disk Genome file, which holds ONLY the forward strand
(n_genome bytes). The reverse-complement half is computed on access in
GenomeSeq::base(i) = complement(fwd[2n-1-i]) for i >= n. So the ~2.8 GB RC
buffer is never materialized, and the forward bytes are reclaimable
file-backed pages instead of an anonymous Vec.
Hot alignment access was already single-byte (seed extension, SAindex lookup),
so it routes through base()/get(); build-time slicing routes through as_slice()
(always Owned). PartialEq/Debug are hand-implemented (memmap2::Mmap has neither).
Combined with the SA + SAindex mmap (commit 4480806), all three large index
components are now reclaimable file-backed memory; the anonymous floor of a solo
run is just the alignment working buffers.
Validated on the real 10M-read mouse benchmark (GRCm39, 2.79 Gb genome):
matrix.mtx / barcodes.tsv / features.tsv byte-identical to the precomputed-RC
result — i.e. RC-on-access is correct for every reverse-strand alignment —
with OOMKilled=false.
479 lib + 11 integration tests, 0 clippy warnings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
STARsolo's `--soloFeatures GeneFull` (CellRanger's include-introns default):
a read counts toward a gene if it overlaps the gene's full body (exons +
introns), so purely intronic reads are counted — unlike exonic `Gene`.
- GeneAnnotation gains chr_gene_body (one [min exon start, max exon end) span
per gene) + overlapping_genes_full(); overlapping_genes() refactored to a
shared interval-overlap helper.
- gene.rs: SoloFeature{Gene,GeneFull}; assign_gene_se() takes the feature and
picks exon vs gene-body overlap.
- SoloContext now holds a list of features, each with its own SoloRecorder;
process_read does the shared CB-match + UMI check once and assigns a gene per
feature. write_gene_matrix writes one Solo.out/<feature>/raw/ per feature, so
`--soloFeatures Gene GeneFull` produces both matrices in one pass.
- params: validate soloFeatures (only Gene/GeneFull supported).
480 lib + 11 integration tests (incl. genefull_counts_intronic_read), 0 clippy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Faithful port of STAR's SoloFeature_emptyDrops_CR.cpp (CellRanger's EmptyDrops
variant) as a standalone `emptydrops` binary that post-processes any raw count
matrix (MatrixMarket genes×cells + barcodes/features tsv, plain or .gz):
1. guaranteed cells from the CellRanger-2.2 knee (nExpectedCells/maxPct/ratio);
2. ambient RNA profile from barcode ranks [indMin,indMax) with a Good-Turing
P0 unseen-mass correction (approximating STAR's SGT smoothing);
3. candidate barcodes (rank >= nSimple, total >= max(umiMin, frac*median));
4. per-candidate multinomial log-likelihood under the ambient profile;
5. Monte-Carlo p-values — simN ambient-drawn barcodes, running log-prob per
count reused across candidates; p = (1+#sim<obs)/(1+simN);
6. Benjamini-Hochberg FDR; cells with padj <= FDR added to the guaranteed set.
Defaults mirror `--soloCellFilter EmptyDrops_CR 3000 0.99 10 45000 90000 500
0.01 20000` with seed 19760110 and simN 10000. Writes barcodes.tsv + cells.txt
+ emptydrops.json. Lets us call cells on rustar/STARsolo/CellRanger raw matrices
with one algorithm for an apples-to-apples filtered comparison.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- solo_genefull_compare.py: GeneFull raw-count parity (total UMI / genes /
per-cell correlation) across rustar/STARsolo/CellRanger + EmptyDrops cell-set
overlap (Jaccard) vs CellRanger's filtered barcodes.
- solo_compare_h5ad.py: optional --{rustar,starsolo,cellranger}-cells to filter
each raw matrix by an EmptyDrops cells.txt instead of the CR2.2 knee, so the
written h5ad reflect EmptyDrops cell calls for the CellRanger comparison.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
solo_genefull_h5_compare.py loads matrices one at a time (memory-careful) and reports the intron effect (Gene vs GeneFull total UMI), raw-count parity vs CellRanger, EmptyDrops cell-set agreement (same-algorithm + vs CR native filtered), per-cell UMI correlation on shared cells, and writes the EmptyDrops-filtered h5ad for both tools. Result on mouse 5k-PBMC (10M subsample): GeneFull +18% UMI over Gene, within 2.8% of CellRanger; rustar-ED captures 3857/3858 CR native cells; per-cell UMI r=0.9995; same-algorithm Jaccard 0.9749. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CellRanger appends '-1' to barcodes; STARsolo/rustar do not. Normalize on load so the rustar/STARsolo vs CellRanger per-cell correlations match cells instead of reporting shared=0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fers
Two behavior-preserving speedups (count matrices verified byte-identical on the
mouse 5k-PBMC 10M-read set):
1. `--outSAMtype None` (previously errored "not yet implemented") now routes to
NullWriter, and the SE/PE/solo loops skip building SAM RecordBufs entirely
via params.emits_alignments(). Quant/solo runs that only need the count
matrix no longer pay SAM text formatting + the multi-GB Aligned.out.sam
write. ~12-15% faster natively; far more in a container where the SAM write
crosses virtiofs.
2. Gene assignment (assign_gene_se) ran once per feature per read — twice/read
for `Gene GeneFull` — allocating fresh Vecs each call. It now reuses two
thread-local scratch buffers, and GeneAnnotation gains
overlapping_genes{,_full}_into(transcript, &mut buf) so the overlap query
writes into a caller-provided buffer instead of allocating.
480 lib + 11 integration tests pass, 0 clippy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`memmap2::Advice` / `Mmap::advise` (madvise) are Unix-only, so the SA + SAindex mmap load failed to build on windows-x86_64. Wrap the best-effort `advise(Advice::Random)` in a cfg(unix) `advise_random` helper with a no-op cfg(not(unix)) stub. No behavior change on Unix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each `Solo.out/<feature>/Summary.csv` now reports the sequencing/mapping funnel and per-cell statistics, matching STARsolo's format: - Number of Reads, Reads With Valid Barcodes, Sequencing Saturation - Reads Mapped to Genome (Unique+Multiple / Unique), Reads Mapped to <feature> - genome -> Exonic / Intronic / Intergenic confident-mapping split (computed from the Gene vs GeneFull read counts: exonic=Gene, intronic=GeneFull-Gene, intergenic=genome_unique-GeneFull) - Estimated Number of Cells (CR2.2 knee), reads/UMIs in cells, fraction in cells, mean/median reads/UMI/genes per cell, total genes detected Mechanism: SoloContext gains per-feature `feature_reads` atomics (incremented in process_read on a unique gene assignment); stream_matrix returns per-cell (reads, UMIs, genes) + genes-detected; write_gene_matrix takes AlignmentStats and writes the summary. Matrix output is unchanged (byte-identical verified). test/solo_summary_compare.py cross-tabulates rustar / STARsolo Summary.csv and CellRanger metrics_summary.csv. On mouse 5k-PBMC (10M): saturation 14.5/14.5/14.6%, genome 92.0/92.0/91.9%, median genes/cell 601/601/599, median UMI/cell 915/916/908; rustar exonic/intronic/intergenic 49.5/8.8/16.9% vs CellRanger 53.5/11.4/15.9%. 482 lib tests (+median/knee helpers), 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sense) The Summary.csv mapping funnel previously derived exonic/intronic from the unique-gene-assignment counts (Gene, GeneFull), which dropped multi-gene (ambiguous) reads into the intergenic residual and folded antisense reads in too. Replace it with CellRanger's method: classify each uniquely-mapped read by position (exonic if it overlaps any exon, else intronic if in a gene body, else intergenic) independent of strand, and report antisense (maps to a gene body on the opposite strand) as a separate orientation metric. `classify_read` does this in a single pass, sharing the two overlap queries with the per-feature gene assignment, so process_read does no more gene-model work than before (the two assign_gene_se calls are replaced by one classify_read). Region/antisense are counted over uniquely-mapped reads regardless of barcode (CellRanger's "confidently mapped" = MAPQ 255); matrix output is unchanged (byte-identical verified). Result vs CellRanger (mouse 5k-PBMC, 10M): intronic 10.8% vs 11.4%, intergenic 16.1% vs 15.9%, antisense 5.9% vs 6.0% — all within ~0.5pp (were 8.8/16.9/—). The residual exonic gap (48.4 vs 53.5%) is the genome-mapping denominator (STAR unique 75.2% vs CellRanger confident 80.8%) + CellRanger's >=50%-exon / splice-compatibility rule, not the binning method. Runtime unchanged (~78s/10M), 483 lib tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 14.6 — write Solo.out/<feature>/filtered/ (the called cells), not just
raw/. `--soloCellFilter`:
- CellRanger2.2 <nExp> <maxPct> <ratio> (default) — CR2.2 knee
- TopCells <N> — top-N barcodes by UMI
- None — skip filtered/
- EmptyDrops_CR — knee-guaranteed cells (Monte-Carlo rescue stays in the
standalone `emptydrops` binary; logged)
Mechanism: stream_matrix is split into build_matrix_body (per-cell dedup → shared
plain temp body + per-cell CellStat{cb,reads,umis,genes}) and finalize_matrix
(header + verbatim body for raw, or cb-remapped/renumbered body for filtered).
Raw and filtered share the one streaming pass; the filtered matrix re-reads the
temp body filtering by called cb. Raw output byte-identical (verified).
`--soloOutGzip yes` gzips matrix.mtx/barcodes.tsv/features.tsv (raw + filtered)
and appends .gz (CellRanger-style); default `no` keeps STARsolo's plain files so
the byte-for-byte STARsolo comparison still holds. write_file() finishes the gzip
stream explicitly.
Verified on mouse 5k-PBMC (10M): GeneFull filtered 3821 cells vs STARsolo 3820
(Jaccard 0.9997, 0 STARsolo-only); filtered UMI sum == Summary "UMIs in Cells";
all 6 .gz files pass gunzip -t; runtime unchanged (~85s/10M). 484 lib + 12
integration tests, 0 clippy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`--soloFeatures SJ` writes Solo.out/SJ/raw/{features.tsv,barcodes.tsv,matrix.mtx}
where rows are the SJ.out.tab junctions (STARsolo symlinks features → SJ.out.tab;
rustar writes the identical 9-column lines in the same chr/start/end-sorted
order). Per uniquely-mapped read the crossed junctions (absolute intron coords
from extract_junction_keys) are recorded with the resolved CB+UMI, then collapsed
per (cell, junction) by --soloUMIdedup; junctions filtered out of SJ.out.tab are
dropped.
Plumbing: SoloContext gains sj_enabled + sj_records; process_read takes the read's
junctions and emits SjCountRecord; the solo loop extracts junctions for unique
spliced reads. SpliceJunctionStats gains sj_feature_order() (row order) and a
shared write_sj_lines() (used by both SJ.out.tab and the SJ features.tsv). The
post-run solo output moved into run_single_pass (write_solo_output) where sj_stats
is live. Respects --soloOutGzip.
Verified on mouse 5k-PBMC (10M): SJ features.tsv byte-identical to SJ.out.tab
(100,970 junctions), matrix 100,970 × 3,686,400, 1.41M entries / 1.87M molecules,
all rows/cols in range; runtime unchanged (~82s). Integration test
test_starsolo_sj_feature plants a spliced read across the GT-AG intron and asserts
the one-junction matrix. 501 tests, 0 clippy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reads mapping to multiple genes (gene-ambiguous), previously dropped, are now
distributed across their gene set and written as real-valued
UniqueAndMult-<method>.mtx alongside the unique matrix.mtx in raw/.
- classify_read gains want_multi → returns the sense gene set for ambiguous reads
(Gene + GeneFull); process_read records a MultiGeneRecord (resolved CB) per
ambiguous read into recorder.multi_gene.
- build_multi_matrices re-reads the raw matrix body (per-cell unique counts u_g,
cb-ascending) and merges each cell with its ambiguous molecules (deduped by UMI,
gene set = union), then per method:
Uniform — 1/N to each gene in the set
PropUnique — proportional to unique counts (uniform fallback if none)
Rescue — proportional to unique + a uniform multi prior
EM — iterate theta_g = u_g + (multi ∝ theta) to convergence
Each ambiguous molecule contributes total mass 1.0, so every method conserves
the grand total (only the spread differs). Cells with only multi reads are
skipped (no unique row).
Verified on mouse 5k-PBMC (10M, Gene): all four matrices sum to 4,527,229
(= 4,219,608 unique + 307,621 multi molecules); Uniform/Rescue spread wider (2.80M
nnz) than PropUnique/EM (2.77M); values real (e.g. 0.5, 1.33333). Runtime
unchanged (~79s, multi is a second pass over the raw body). Unit test
distribute_multi_methods + integration test test_starsolo_multimappers (overlapping
genes G1/G3). 503 tests, 0 clippy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
solo_sj_multi_compare.py compares the SJ matrix (junctions coord-matched, since each tool has its own SJ.out.tab; STARsolo symlinks features→SJ.out.tab) and the UniqueAndMult-*.mtx matrices (genes/barcodes align by GTF/whitelist order). Result on mouse 5k-PBMC (10M, CellRanger params): multiMappers correlate r=0.98-0.998 vs STARsolo; SJ junction sets overlap 99.9% with per-junction r=0.93. Totals run ~4% (Gene) / ~10% (SJ) higher — the known "rustar maps slightly more reads" gap, amplified for splice-spanning reads — not a distribution-logic difference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Plate-well full-length protocols (Smart-seq2): no cell barcodes or UMIs in the reads. A --readFilesManifest TSV (read1<TAB>read2<TAB>cellID) lists one library per cell; reads are aligned, gene-assigned (Gene/exon), and counted per cell with no UMI dedup (each uniquely-gene-assigned read = a count). - src/solo/smartseq.rs: manifest parser (single-end MVP; read2 must be '-'), SmartSeqCounts (per-cell gene→count), and the genes×cells matrix writer (barcodes.tsv = manifest cell IDs; respects --soloOutGzip). - run_smartseq in lib.rs builds the gene model from --sjdbGTFfile, iterates each cell's reads (rayon per batch), and writes Solo.out/Gene/raw/. Dispatched ahead of the droplet solo_ctx (which needs a whitelist). - params: --readFilesManifest; SmartSeq exempt from the --readFilesIn / whitelist-correction / two-read-file rules; requires the manifest. Verified on mouse cDNA reads (2 synthetic cells → per-cell read counts) and an integration test test_starsolo_smartseq (Exon1 reads → G1 counts 5/3). 504 tests, 0 clippy. PE SmartSeq is a follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Chemistries whose cell barcode is split into several fixed-position segments (sci-RNA-seq, SPLiT-seq, …), each with its own whitelist. - SoloBarcodeLayout is now an enum: Simple (CB_UMI_Simple, unchanged) and Complex, which parses --soloCBposition/--soloUMIposition (startAnchor_startDist_endAnchor_ endDist; read-start anchoring) and assembles the CB by concatenating the segment slices from the barcode read. - CbWhitelist::load_complex builds the combined whitelist as the cartesian product of the per-segment whitelists (concatenated, packed). Matching the assembled CB against this is equivalent to STARsolo's per-segment matching for both Exact and 1MM (a 1MM in the concatenation is a 1MM in exactly one segment), so the rest of the pipeline (correction, UMI dedup, matrix) is reused unchanged. - params: --soloCBposition / --soloUMIposition; CB_UMI_Complex requires one position + one whitelist per segment. Adapter-anchored / variable positions are a follow-up (read-start only for now). Unit tests (parse_position, complex_layout_assembles_segments) + integration test test_starsolo_cb_umi_complex (2 segments × 2 whitelists → 4-cell product, CB AAGG matched, 1 molecule). 507 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--soloCellFilter EmptyDrops_CR now writes filtered/ with the full rescue (knee guaranteed cells + cells whose profile is significantly non-ambient), instead of just the knee. emptydrops_called re-reads the raw matrix body for the ambient (rank [indMin,indMax)) and candidate (rank >= nSimple, total >= minUMI) cell profiles, then runs the same multinomial Monte-Carlo + Benjamini-Hochberg as the standalone `emptydrops` binary (seed 19760110, Good-Turing P0 ambient). Params: EmptyDrops_CR nExpected maxPct maxMinRatio indMin indMax umiMin umiMinFracMedian candMaxN FDR [simN]. Verified on mouse 5k-PBMC (10M, GeneFull, umiMin=100): 3821 knee + 629 rescued = 4450 filtered cells — identical to the standalone binary's result on the same matrix. With STAR's default umiMin=500 there are no sub-knee candidates so it reduces to the knee. 507 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--readFilesManifest read2 may now be a mate-2 file (not just '-'). For PE cells, run_smartseq reads mate pairs in lockstep (PairedFastqReader), aligns each with align_paired_read, and counts the fragment once toward the gene from the union of both mates' overlaps (both-mapped → both transcripts; half-mapped → the mapped mate). SE manifests are unchanged (read counts). Verified end-to-end on real Smart-seq2 mouse data (GEO GSE228456 / SRP429940, 3 sorted single monocytes, 1M reads each): PE fragment counts run slightly below SE read counts (stricter proper-pairing, one count/fragment) and detect more genes (read2 covers extra regions) — as expected. Integration test test_starsolo_smartseq_paired (mate1 Exon1 + mate2 rc(Exon2) → proper FR pair on G1, 4 fragments). 508 tests, 0 clippy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Adds STARsolo single-cell support to rustar-aligner (Phase 14): the 10x
Chromium count-matrix pipeline (
Gene+GeneFull), the CellRanger 4.x/5.xflag set ported faithfully from STAR source and verified byte-identical against
real STARsolo, an in-Rust EmptyDrops_CR cell caller, and a 3-way benchmark
(CellRanger vs STARsolo vs rustar) that drove a set of accuracy, performance, and
memory improvements.
All counting changes are verified behavior-preserving (count matrices
byte-identical before/after) and the branch is green on every CI platform
(linux x86_64 / aarch64 / v3, macos-aarch64, windows-x86_64, clippy, fmt, MSRV,
security audit).
1. STARsolo MVP + CellRanger matching (Phases 14.1–14.4)
New
src/solo/:mod.rs—--soloTypeparams + validation, barcode-read input(
SoloReadReader),SoloContext, per-read processing,CellRanger4adapter clip.whitelist.rs— 2-bit barcode packing, sorted-whitelist load, read-stage CBmatching (exact / 1MM / 1MM_multi) + UMI checks (STAR
SoloReadBarcode_getCBandUMI.cpp).gene.rs— per-read gene assignment (--soloStrand,Gene/GeneFull).count.rs— UMI dedup (Exact/NoDedup/1MM_All/1MM_Directional/1MM_CR),MultiGeneUMI(_CR)filtering,1MM_multi_Nbase_pseudocountsCB posterior,raw
matrix.mtx/barcodes.tsv/features.tsvwriter.align_reads_soloinlib.rsreads cDNA + barcode in lockstep, aligns the cDNA,and quantifies per cell. The CellRanger-matching flags (
--clipAdapterType CellRanger4,--outFilterScoreMin,--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts,--soloUMIfiltering MultiGeneUMI_CR,--soloUMIdedup 1MM_CR) produce a matrix byte-identical to real STARsolo (3/3 deterministic).2. GeneFull — intron-inclusive counting (
--soloFeatures Gene GeneFull)CellRanger's default include-introns mode.
GeneAnnotationgains a per-genegene-body interval index (
overlapping_genes_full),SoloContextholds a list offeatures each with its own recorder, and
process_readdoes the shared CB-match +UMI check once then assigns a gene per feature — so one pass writes both
Solo.out/Gene/raw/andSolo.out/GeneFull/raw/. A purely intronic read now countsunder
GeneFullbut notGene.3. EmptyDrops_CR cell caller — in Rust (
src/bin/emptydrops.rs)Standalone binary porting STAR's
SoloFeature_emptyDrops_CR.cpp/ CellRanger'sEmptyDrops: CR2.2 knee for guaranteed cells → ambient profile (Good-Turing P0) →
per-candidate multinomial log-likelihood → Monte-Carlo p-values → Benjamini-Hochberg.
Reads any raw MatrixMarket matrix (plain or
.gz), so it can call cells onrustar / STARsolo / CellRanger output with one algorithm for an apples-to-apples
filtered comparison. Defaults mirror
--soloCellFilter EmptyDrops_CR 3000 0.99 10 45000 90000 500 0.01 20000.Accuracy results (mouse 5k-PBMC, 5′ GEM-X, 10M-read subsample)
GeneFull raw counts — rustar ≈ STARsolo, and the intron gap to CellRanger closes:
Gene→ within 2.8% of CellRanger (Gene-only was ~14% short).EmptyDrops cell calling — my Rust port vs CellRanger's native filtered set:
Performance & memory
Index mmap (no OOM). Loading used
std::fs::read, pinning the ~21 GB SA +~1.8 GB SAindex + ~5.6 GB genome as un-reclaimable anonymous
Vecs (froze thehost on a real run).
PackedArrayandGenome.sequenceare nowOwned(Vec) | Mapped(Arc<Mmap>); load memory-maps the SA, SAindex, and forward genome strand(reverse-complement computed on access). File-backed pages are reclaimable → a
multi-GB index no longer pins RAM. The run completes under a 24 GB cap
(
OOMKilled=false) where the old ~37 GB anonymous footprint was OOM-killed;output byte-identical.
Matrix output. Buffered the raw-matrix writers (was one
write(2)per line,~3.7M syscalls): raw-matrix write 1306 s → 3 s. Streaming per-cell
build_matrixbounds output memory to a single cell. Both byte-identical.Two new speedups (commit
perf:, matrices verified byte-identical):--outSAMtype None(previously errored "not yet implemented") now skipsbuilding SAM records entirely for count-only runs via
params.emits_alignments().~12% faster natively (clean 3M: 29.5 s vs 33.4 s) and avoids the multi-GB
Aligned.out.sam— a much larger win in a container where that write crosses virtiofs.Gene GeneFull, allocating fresh Vecs each call);GeneAnnotationgainsoverlapping_genes{,_full}_into(tr, &mut buf).On the "rustar looked ~5× slower" question. That ratio came from running
everything in an amd64 Docker container (CellRanger is x86-only) under Rosetta,
with the index mmap'd over virtiofs. Native arm64 rustar does the same 10M reads
in ~85–97 s (~100k+ reads/s) — ~10–15× faster than its own emulated run, so the
true native gap vs STAR is far smaller than 5×. (STAR can't run native on Apple
Silicon, so a native-vs-native number isn't available.)
Testing — what's verified and how
cargo fmtclean, MSRV check +
cargo auditgreen. Integration tests build a synthetic 20 kbgenome with a planted GT-AG intron and exercise BAM/PE/spliced/BySJout/GeneCounts/
unmapped/two-pass;
genefull_counts_intronic_readcovers GeneFull semantics.(
test/solo_cellranger_diff.py+Dockerfile.solodiff) confirms the CellRanger-flagmatrix is byte-identical (3/3 deterministic).
test/solo_bench.py(CellRanger/STARsolo/rustarruntime + counts),
test/solo_genefull_compare.py/solo_genefull_h5_compare.py(GeneFull parity + EmptyDrops cell-set overlap),
test/solo_compare_h5ad.py(knee/EmptyDrops-filtered h5ad + correlations).
Gene/GeneFullmatrices and the--outSAMtype None/ scratch-buffer changes were re-run on the full 10M set anddiff-checked byte-identical against the prior output.
madvise(MADV_RANDOM)build break onwindows-x86_64(gated behindcfg(unix)with a no-op stub) — all CI platforms green.Notable files
src/solo/{mod,whitelist,gene,count}.rs,src/bin/emptydrops.rs,src/quant/mod.rs(gene-body index),src/genome/mod.rs+src/index/{io,packed_array}.rs(mmap),
src/lib.rs(align_reads_solo,--outSAMtype None),test/solo_*.py,docs-old/phase14_starsolo.md+phase14_benchmark.md.🤖 Generated with Claude Code