Skip to content

Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching)#90

Open
iandriver wants to merge 23 commits into
scverse:mainfrom
iandriver:phase14-starsolo
Open

Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching)#90
iandriver wants to merge 23 commits into
scverse:mainfrom
iandriver:phase14-starsolo

Conversation

@iandriver

@iandriver iandriver commented Jun 13, 2026

Copy link
Copy Markdown

What & why

Adds STARsolo single-cell support to rustar-aligner (Phase 14): the 10x
Chromium count-matrix pipeline (Gene + GeneFull), the CellRanger 4.x/5.x
flag set ported faithfully from STAR source and verified byte-identical against
real STARsolo
, an in-Rust EmptyDrops_CR cell caller, and a 3-way benchmark
(CellRanger vs STARsolo vs rustar) that drove a set of accuracy, performance, and
memory improvements.

All counting changes are verified behavior-preserving (count matrices
byte-identical before/after) and the branch is green on every CI platform
(linux x86_64 / aarch64 / v3, macos-aarch64, windows-x86_64, clippy, fmt, MSRV,
security audit).


1. STARsolo MVP + CellRanger matching (Phases 14.1–14.4)

New src/solo/:

  • mod.rs--soloType params + validation, barcode-read input
    (SoloReadReader), SoloContext, per-read processing, CellRanger4 adapter clip.
  • whitelist.rs — 2-bit barcode packing, sorted-whitelist load, read-stage CB
    matching (exact / 1MM / 1MM_multi) + UMI checks (STAR SoloReadBarcode_getCBandUMI.cpp).
  • gene.rs — per-read gene assignment (--soloStrand, Gene/GeneFull).
  • count.rs — UMI dedup (Exact/NoDedup/1MM_All/1MM_Directional/1MM_CR),
    MultiGeneUMI(_CR) filtering, 1MM_multi_Nbase_pseudocounts CB posterior,
    raw matrix.mtx/barcodes.tsv/features.tsv writer.

align_reads_solo in lib.rs reads cDNA + barcode in lockstep, aligns the cDNA,
and quantifies per cell. The CellRanger-matching flags (--clipAdapterType CellRanger4, --outFilterScoreMin, --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts, --soloUMIfiltering MultiGeneUMI_CR, --soloUMIdedup 1MM_CR) produce a matrix byte-identical to real STARsolo (3/3 deterministic).

2. GeneFull — intron-inclusive counting (--soloFeatures Gene GeneFull)

CellRanger's default include-introns mode. GeneAnnotation gains a per-gene
gene-body interval index (overlapping_genes_full), SoloContext holds a list of
features each with its own recorder, and process_read does the shared CB-match +
UMI check once then assigns a gene per feature — so one pass writes both
Solo.out/Gene/raw/ and Solo.out/GeneFull/raw/. A purely intronic read now counts
under GeneFull but not Gene.

3. EmptyDrops_CR cell caller — in Rust (src/bin/emptydrops.rs)

Standalone binary porting STAR's SoloFeature_emptyDrops_CR.cpp / CellRanger's
EmptyDrops: CR2.2 knee for guaranteed cells → ambient profile (Good-Turing P0) →
per-candidate multinomial log-likelihood → Monte-Carlo p-values → Benjamini-Hochberg.
Reads any raw MatrixMarket matrix (plain or .gz), so it can call cells on
rustar / STARsolo / CellRanger output with one algorithm for an apples-to-apples
filtered comparison. Defaults mirror --soloCellFilter EmptyDrops_CR 3000 0.99 10 45000 90000 500 0.01 20000.


Accuracy results (mouse 5k-PBMC, 5′ GEM-X, 10M-read subsample)

GeneFull raw counts — rustar ≈ STARsolo, and the intron gap to CellRanger closes:

GeneFull raw rustar STARsolo CellRanger
total UMI 4,981,160 4,965,170 4,843,682
genes detected 17,539 17,448 17,258
  • rustar ↔ STARsolo per-cell UMI r = 0.999997, 93.0% exact-total match.
  • rustar ↔ CellRanger r = 0.999792; STARsolo ↔ CellRanger r = 0.999805.
  • GeneFull adds +18% UMI over Gene → within 2.8% of CellRanger (Gene-only was ~14% short).

EmptyDrops cell calling — my Rust port vs CellRanger's native filtered set:

  • rustar-ED ↔ STARsolo-ED: Jaccard 0.9993.
  • Each captures 3,857 / 3,858 of CellRanger's native filtered cells; per-cell UMI r = 0.9995 on shared cells.

Caveat: on this subsample STAR's default umiMin=500 leaves no sub-knee
candidates, so --umi-min 100 was used to exercise the Monte-Carlo rescue
(over-calls ~600 low-count droplets beyond CellRanger's tuned cutoff — a
threshold choice, not a counting discrepancy).


Performance & memory

Index mmap (no OOM). Loading used std::fs::read, pinning the ~21 GB SA +
~1.8 GB SAindex + ~5.6 GB genome as un-reclaimable anonymous Vecs (froze the
host on a real run). PackedArray and Genome.sequence are now Owned(Vec) | Mapped(Arc<Mmap>); load memory-maps the SA, SAindex, and forward genome strand
(reverse-complement computed on access). File-backed pages are reclaimable → a
multi-GB index no longer pins RAM. The run completes under a 24 GB cap
(OOMKilled=false) where the old ~37 GB anonymous footprint was OOM-killed;
output byte-identical.

Matrix output. Buffered the raw-matrix writers (was one write(2) per line,
~3.7M syscalls): raw-matrix write 1306 s → 3 s. Streaming per-cell
build_matrix bounds output memory to a single cell. Both byte-identical.

Two new speedups (commit perf:, matrices verified byte-identical):

  • --outSAMtype None (previously errored "not yet implemented") now skips
    building SAM records entirely for count-only runs via params.emits_alignments().
    ~12% faster natively (clean 3M: 29.5 s vs 33.4 s) and avoids the multi-GB
    Aligned.out.sam — a much larger win in a container where that write crosses virtiofs.
  • Thread-local scratch buffers in gene assignment (ran 2×/read for Gene GeneFull, allocating fresh Vecs each call); GeneAnnotation gains
    overlapping_genes{,_full}_into(tr, &mut buf).

On the "rustar looked ~5× slower" question. That ratio came from running
everything in an amd64 Docker container (CellRanger is x86-only) under Rosetta,
with the index mmap'd over virtiofs. Native arm64 rustar does the same 10M reads
in ~85–97 s (~100k+ reads/s)
— ~10–15× faster than its own emulated run, so the
true native gap vs STAR is far smaller than 5×. (STAR can't run native on Apple
Silicon, so a native-vs-native number isn't available.)


Testing — what's verified and how

  • Unit/integration: 480 lib + 11 integration tests pass, 0 clippy, cargo fmt
    clean
    , MSRV check + cargo audit green. Integration tests build a synthetic 20 kb
    genome with a planted GT-AG intron and exercise BAM/PE/spliced/BySJout/GeneCounts/
    unmapped/two-pass; genefull_counts_intronic_read covers GeneFull semantics.
  • Differential vs real STARsolo: Linux-container harness
    (test/solo_cellranger_diff.py + Dockerfile.solodiff) confirms the CellRanger-flag
    matrix is byte-identical (3/3 deterministic).
  • 3-way benchmark + h5 comparison: test/solo_bench.py (CellRanger/STARsolo/rustar
    runtime + counts), test/solo_genefull_compare.py / solo_genefull_h5_compare.py
    (GeneFull parity + EmptyDrops cell-set overlap), test/solo_compare_h5ad.py
    (knee/EmptyDrops-filtered h5ad + correlations).
  • Behavior preservation: the GeneFull Gene/GeneFull matrices and the
    --outSAMtype None / scratch-buffer changes were re-run on the full 10M set and
    diff-checked byte-identical against the prior output.
  • Cross-platform: fixed a Unix-only madvise(MADV_RANDOM) build break on
    windows-x86_64 (gated behind cfg(unix) with a no-op stub) — all CI platforms green.

Notable files

src/solo/{mod,whitelist,gene,count}.rs, src/bin/emptydrops.rs,
src/quant/mod.rs (gene-body index), src/genome/mod.rs + src/index/{io,packed_array}.rs
(mmap), src/lib.rs (align_reads_solo, --outSAMtype None), test/solo_*.py,
docs-old/phase14_starsolo.md + phase14_benchmark.md.

🤖 Generated with Claude Code

Ian Driver and others added 13 commits June 12, 2026 22:22
Implements STARsolo Phases 14.1–14.4 (the 10x Chromium Gene-count MVP) plus
the CellRanger 4.x/5.x-matching flag set, all ported faithfully from STAR's
source and verified byte-identical against real STARsolo.

New `src/solo/`:
- mod.rs    — SoloType/params plumbing, barcode-read input (SoloReadReader),
              SoloContext, per-read processing, CellRanger4 adapter clip
- whitelist.rs — 2-bit barcode packing + sorted-whitelist load + read-stage
              CB matching (exact/1MM/1MM_multi) + UMI checks (STAR
              SoloReadBarcode_getCBandUMI.cpp)
- gene.rs   — per-read gene assignment (--soloStrand), reuses
              quant::overlapping_genes
- count.rs  — UMI dedup (Exact/NoDedup/1MM_All/1MM_Directional/1MM_CR),
              MultiGeneUMI(_CR) filtering, 1MM_multi_Nbase_pseudocounts CB
              posterior, raw matrix.mtx/barcodes.tsv/features.tsv writer

Driver: new align_reads_solo loop in lib.rs (reads cDNA + barcode in lockstep,
aligns cDNA, quantifies per cell); solo params + validation in params/mod.rs.

CellRanger-matching flags (--clipAdapterType CellRanger4, --outFilterScoreMin,
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts, --soloUMIfiltering
MultiGeneUMI_CR, --soloUMIdedup 1MM_CR) produce a matrix byte-identical to real
STARsolo (3/3 deterministic) — verified via a Linux-container differential
harness (test/solo_cellranger_diff.py + Dockerfile.solodiff + solo_diff_docker.sh),
since STAR 2.7.11b reads 0 reads on Apple-Silicon macOS.

Also adds a CellRanger-vs-STARsolo-vs-rustar runtime/stats benchmark scaffold
(test/solo_bench.py + Dockerfile.bench).

Tests: 479 lib + 11 integration (incl. test_starsolo_cellranger_style_matrix),
0 clippy warnings. Docs in docs-old/phase14_starsolo.md + ROADMAP.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Performance + scalability fixes for STARsolo matrix output, validated by a
CellRanger-vs-STARsolo-vs-rustar benchmark on a real 5' mouse 10x dataset.

- Buffered I/O: write_barcodes / write_matrix_mtx / write_features wrote to a
  raw std::fs::File (one syscall per line). For the full 3.69M-barcode
  barcodes.tsv that was ~3.7M syscalls, dominating runtime (esp. over virtiofs).
  Wrap in BufWriter + add CbWhitelist::unpack_barcode_into (no per-line String
  alloc). On the mouse bench, raw-matrix write dropped 1306s -> 3s, output
  byte-identical.
- build_matrix Step 1 (per-cell): sort the flat record list by cell barcode and
  process one cell's contiguous slice at a time, so peak memory is a single
  cell's umi->gene maps rather than a global cell->umi->gene nested map over all
  records (the previous version's tens-of-GB allocator blowup). Mirrors STAR
  SoloFeature_collapseUMIall.cpp. Output identical.
- Benchmark harness test/solo_bench.py (+ Dockerfile.bench): runs all three
  tools under /usr/bin/time -v and reports runtime/peak-RSS/matrix stats;
  results + method in docs-old/phase14_benchmark.md.

479 lib + 11 integration tests, 0 clippy warnings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_matrix previously materialized a global cell -> (gene -> count) map for
the whole raw matrix before writing. Replace it with stream_matrix: process one
cell at a time (records already sorted by cb) and write each cell's entries
straight to a temporary MatrixMarket body, counting nnz on the fly, then emit
matrix.mtx as the header (rows cols nnz) followed by the body (the BySJout
temp-file pattern). The global output map is never built, so matrix-output
memory is bounded by a single cell regardless of how many cells the raw
whitelist matrix spans.

Verified byte-identical to the prior output on the real 10M-read mouse benchmark
(matrix.mtx / barcodes.tsv / features.tsv all identical, 2,650,591 entries).
Note: this bounds matrix-output memory for large raw matrices but does not cut
the current ~37 GB peak, which is dominated by the loaded index (~27 GB SA) plus
alignment buffers — mmap'ing the index is the lever for that.

479 lib + 11 integration tests, 0 clippy warnings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Loading the suffix array and SAindex with std::fs::read materialized them as
anonymous Vecs (~21 GB + ~1.8 GB for mouse). Anonymous memory is not
reclaimable, so under pressure the OS swaps it — which froze the host (Jetsam)
on a real solo run.

Back PackedArray with PackedBytes { Owned(Vec) | Mapped(Arc<Mmap>) }: build still
uses Owned (write() requires it); load now memory-maps the SA and the SAindex
packed-data region (read normally past the small header) with MADV_RANDOM. The
read path is unchanged (operates on &[u8] either way). Arc<Mmap> keeps the
two-pass GenomeIndex clone cheap.

mmap pages are file-backed: demand-loaded and reclaimable (dropped, never
swapped) under pressure, so a multi-GB SA no longer pins anonymous RAM and the
run degrades to paging instead of OOM-killing.

Validated on the real 10M-read mouse benchmark (GRCm39, 21 GB SA): matrix.mtx /
barcodes.tsv / features.tsv byte-identical to the in-RAM result, and the run
COMPLETES under a 24 GB container cap (peak RSS 23.4 GB, OOMKilled=false) where
the previous ~37 GB anonymous footprint would have been OOM-killed. (Genome's
forward+RC buffer, ~5.6 GB, is still an owned Vec — Phase 2.)

479 lib + 11 integration tests, 0 clippy warnings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…se 2)

load_genome built a 2*n_genome Vec (~5.6 GB for mouse) holding the forward
strand plus a precomputed reverse complement. Replace Genome.sequence's type
with GenomeSeq { Owned(Vec) | Mapped { fwd: Arc<Mmap>, n_genome } }:

- Build stays Owned (the full forward+RC buffer is needed for SA construction
  and the on-disk write, and only Owned supports slicing/mutation).
- Load memory-maps the on-disk Genome file, which holds ONLY the forward strand
  (n_genome bytes). The reverse-complement half is computed on access in
  GenomeSeq::base(i) = complement(fwd[2n-1-i]) for i >= n. So the ~2.8 GB RC
  buffer is never materialized, and the forward bytes are reclaimable
  file-backed pages instead of an anonymous Vec.

Hot alignment access was already single-byte (seed extension, SAindex lookup),
so it routes through base()/get(); build-time slicing routes through as_slice()
(always Owned). PartialEq/Debug are hand-implemented (memmap2::Mmap has neither).

Combined with the SA + SAindex mmap (commit 4480806), all three large index
components are now reclaimable file-backed memory; the anonymous floor of a solo
run is just the alignment working buffers.

Validated on the real 10M-read mouse benchmark (GRCm39, 2.79 Gb genome):
matrix.mtx / barcodes.tsv / features.tsv byte-identical to the precomputed-RC
result — i.e. RC-on-access is correct for every reverse-strand alignment —
with OOMKilled=false.

479 lib + 11 integration tests, 0 clippy warnings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
STARsolo's `--soloFeatures GeneFull` (CellRanger's include-introns default):
a read counts toward a gene if it overlaps the gene's full body (exons +
introns), so purely intronic reads are counted — unlike exonic `Gene`.

- GeneAnnotation gains chr_gene_body (one [min exon start, max exon end) span
  per gene) + overlapping_genes_full(); overlapping_genes() refactored to a
  shared interval-overlap helper.
- gene.rs: SoloFeature{Gene,GeneFull}; assign_gene_se() takes the feature and
  picks exon vs gene-body overlap.
- SoloContext now holds a list of features, each with its own SoloRecorder;
  process_read does the shared CB-match + UMI check once and assigns a gene per
  feature. write_gene_matrix writes one Solo.out/<feature>/raw/ per feature, so
  `--soloFeatures Gene GeneFull` produces both matrices in one pass.
- params: validate soloFeatures (only Gene/GeneFull supported).

480 lib + 11 integration tests (incl. genefull_counts_intronic_read), 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Faithful port of STAR's SoloFeature_emptyDrops_CR.cpp (CellRanger's EmptyDrops
variant) as a standalone `emptydrops` binary that post-processes any raw count
matrix (MatrixMarket genes×cells + barcodes/features tsv, plain or .gz):

  1. guaranteed cells from the CellRanger-2.2 knee (nExpectedCells/maxPct/ratio);
  2. ambient RNA profile from barcode ranks [indMin,indMax) with a Good-Turing
     P0 unseen-mass correction (approximating STAR's SGT smoothing);
  3. candidate barcodes (rank >= nSimple, total >= max(umiMin, frac*median));
  4. per-candidate multinomial log-likelihood under the ambient profile;
  5. Monte-Carlo p-values — simN ambient-drawn barcodes, running log-prob per
     count reused across candidates; p = (1+#sim<obs)/(1+simN);
  6. Benjamini-Hochberg FDR; cells with padj <= FDR added to the guaranteed set.

Defaults mirror `--soloCellFilter EmptyDrops_CR 3000 0.99 10 45000 90000 500
0.01 20000` with seed 19760110 and simN 10000. Writes barcodes.tsv + cells.txt
+ emptydrops.json. Lets us call cells on rustar/STARsolo/CellRanger raw matrices
with one algorithm for an apples-to-apples filtered comparison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- solo_genefull_compare.py: GeneFull raw-count parity (total UMI / genes /
  per-cell correlation) across rustar/STARsolo/CellRanger + EmptyDrops cell-set
  overlap (Jaccard) vs CellRanger's filtered barcodes.
- solo_compare_h5ad.py: optional --{rustar,starsolo,cellranger}-cells to filter
  each raw matrix by an EmptyDrops cells.txt instead of the CR2.2 knee, so the
  written h5ad reflect EmptyDrops cell calls for the CellRanger comparison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
solo_genefull_h5_compare.py loads matrices one at a time (memory-careful) and
reports the intron effect (Gene vs GeneFull total UMI), raw-count parity vs
CellRanger, EmptyDrops cell-set agreement (same-algorithm + vs CR native
filtered), per-cell UMI correlation on shared cells, and writes the
EmptyDrops-filtered h5ad for both tools.

Result on mouse 5k-PBMC (10M subsample): GeneFull +18% UMI over Gene, within
2.8% of CellRanger; rustar-ED captures 3857/3858 CR native cells; per-cell
UMI r=0.9995; same-algorithm Jaccard 0.9749.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CellRanger appends '-1' to barcodes; STARsolo/rustar do not. Normalize on load
so the rustar/STARsolo vs CellRanger per-cell correlations match cells instead
of reporting shared=0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fers

Two behavior-preserving speedups (count matrices verified byte-identical on the
mouse 5k-PBMC 10M-read set):

1. `--outSAMtype None` (previously errored "not yet implemented") now routes to
   NullWriter, and the SE/PE/solo loops skip building SAM RecordBufs entirely
   via params.emits_alignments(). Quant/solo runs that only need the count
   matrix no longer pay SAM text formatting + the multi-GB Aligned.out.sam
   write. ~12-15% faster natively; far more in a container where the SAM write
   crosses virtiofs.

2. Gene assignment (assign_gene_se) ran once per feature per read — twice/read
   for `Gene GeneFull` — allocating fresh Vecs each call. It now reuses two
   thread-local scratch buffers, and GeneAnnotation gains
   overlapping_genes{,_full}_into(transcript, &mut buf) so the overlap query
   writes into a caller-provided buffer instead of allocating.

480 lib + 11 integration tests pass, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`memmap2::Advice` / `Mmap::advise` (madvise) are Unix-only, so the SA + SAindex
mmap load failed to build on windows-x86_64. Wrap the best-effort
`advise(Advice::Random)` in a cfg(unix) `advise_random` helper with a no-op
cfg(not(unix)) stub. No behavior change on Unix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@iandriver iandriver changed the title Phase 14: STARsolo single-cell support (MVP + CellRanger matching) Phase 14: STARsolo single-cell (Gene + GeneFull, EmptyDrops_CR, CellRanger matching) Jun 16, 2026
Ian Driver and others added 10 commits June 16, 2026 21:06
Each `Solo.out/<feature>/Summary.csv` now reports the sequencing/mapping funnel
and per-cell statistics, matching STARsolo's format:
- Number of Reads, Reads With Valid Barcodes, Sequencing Saturation
- Reads Mapped to Genome (Unique+Multiple / Unique), Reads Mapped to <feature>
- genome -> Exonic / Intronic / Intergenic confident-mapping split (computed from
  the Gene vs GeneFull read counts: exonic=Gene, intronic=GeneFull-Gene,
  intergenic=genome_unique-GeneFull)
- Estimated Number of Cells (CR2.2 knee), reads/UMIs in cells, fraction in cells,
  mean/median reads/UMI/genes per cell, total genes detected

Mechanism: SoloContext gains per-feature `feature_reads` atomics (incremented in
process_read on a unique gene assignment); stream_matrix returns per-cell
(reads, UMIs, genes) + genes-detected; write_gene_matrix takes AlignmentStats and
writes the summary. Matrix output is unchanged (byte-identical verified).

test/solo_summary_compare.py cross-tabulates rustar / STARsolo Summary.csv and
CellRanger metrics_summary.csv. On mouse 5k-PBMC (10M): saturation 14.5/14.5/14.6%,
genome 92.0/92.0/91.9%, median genes/cell 601/601/599, median UMI/cell
915/916/908; rustar exonic/intronic/intergenic 49.5/8.8/16.9% vs CellRanger
53.5/11.4/15.9%.

482 lib tests (+median/knee helpers), 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sense)

The Summary.csv mapping funnel previously derived exonic/intronic from the
unique-gene-assignment counts (Gene, GeneFull), which dropped multi-gene
(ambiguous) reads into the intergenic residual and folded antisense reads in too.
Replace it with CellRanger's method: classify each uniquely-mapped read by
position (exonic if it overlaps any exon, else intronic if in a gene body, else
intergenic) independent of strand, and report antisense (maps to a gene body on
the opposite strand) as a separate orientation metric.

`classify_read` does this in a single pass, sharing the two overlap queries with
the per-feature gene assignment, so process_read does no more gene-model work
than before (the two assign_gene_se calls are replaced by one classify_read).
Region/antisense are counted over uniquely-mapped reads regardless of barcode
(CellRanger's "confidently mapped" = MAPQ 255); matrix output is unchanged
(byte-identical verified).

Result vs CellRanger (mouse 5k-PBMC, 10M): intronic 10.8% vs 11.4%, intergenic
16.1% vs 15.9%, antisense 5.9% vs 6.0% — all within ~0.5pp (were 8.8/16.9/—).
The residual exonic gap (48.4 vs 53.5%) is the genome-mapping denominator (STAR
unique 75.2% vs CellRanger confident 80.8%) + CellRanger's >=50%-exon /
splice-compatibility rule, not the binning method. Runtime unchanged (~78s/10M),
483 lib tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 14.6 — write Solo.out/<feature>/filtered/ (the called cells), not just
raw/. `--soloCellFilter`:
  - CellRanger2.2 <nExp> <maxPct> <ratio> (default) — CR2.2 knee
  - TopCells <N> — top-N barcodes by UMI
  - None — skip filtered/
  - EmptyDrops_CR — knee-guaranteed cells (Monte-Carlo rescue stays in the
    standalone `emptydrops` binary; logged)

Mechanism: stream_matrix is split into build_matrix_body (per-cell dedup → shared
plain temp body + per-cell CellStat{cb,reads,umis,genes}) and finalize_matrix
(header + verbatim body for raw, or cb-remapped/renumbered body for filtered).
Raw and filtered share the one streaming pass; the filtered matrix re-reads the
temp body filtering by called cb. Raw output byte-identical (verified).

`--soloOutGzip yes` gzips matrix.mtx/barcodes.tsv/features.tsv (raw + filtered)
and appends .gz (CellRanger-style); default `no` keeps STARsolo's plain files so
the byte-for-byte STARsolo comparison still holds. write_file() finishes the gzip
stream explicitly.

Verified on mouse 5k-PBMC (10M): GeneFull filtered 3821 cells vs STARsolo 3820
(Jaccard 0.9997, 0 STARsolo-only); filtered UMI sum == Summary "UMIs in Cells";
all 6 .gz files pass gunzip -t; runtime unchanged (~85s/10M). 484 lib + 12
integration tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`--soloFeatures SJ` writes Solo.out/SJ/raw/{features.tsv,barcodes.tsv,matrix.mtx}
where rows are the SJ.out.tab junctions (STARsolo symlinks features → SJ.out.tab;
rustar writes the identical 9-column lines in the same chr/start/end-sorted
order). Per uniquely-mapped read the crossed junctions (absolute intron coords
from extract_junction_keys) are recorded with the resolved CB+UMI, then collapsed
per (cell, junction) by --soloUMIdedup; junctions filtered out of SJ.out.tab are
dropped.

Plumbing: SoloContext gains sj_enabled + sj_records; process_read takes the read's
junctions and emits SjCountRecord; the solo loop extracts junctions for unique
spliced reads. SpliceJunctionStats gains sj_feature_order() (row order) and a
shared write_sj_lines() (used by both SJ.out.tab and the SJ features.tsv). The
post-run solo output moved into run_single_pass (write_solo_output) where sj_stats
is live. Respects --soloOutGzip.

Verified on mouse 5k-PBMC (10M): SJ features.tsv byte-identical to SJ.out.tab
(100,970 junctions), matrix 100,970 × 3,686,400, 1.41M entries / 1.87M molecules,
all rows/cols in range; runtime unchanged (~82s). Integration test
test_starsolo_sj_feature plants a spliced read across the GT-AG intron and asserts
the one-junction matrix. 501 tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reads mapping to multiple genes (gene-ambiguous), previously dropped, are now
distributed across their gene set and written as real-valued
UniqueAndMult-<method>.mtx alongside the unique matrix.mtx in raw/.

- classify_read gains want_multi → returns the sense gene set for ambiguous reads
  (Gene + GeneFull); process_read records a MultiGeneRecord (resolved CB) per
  ambiguous read into recorder.multi_gene.
- build_multi_matrices re-reads the raw matrix body (per-cell unique counts u_g,
  cb-ascending) and merges each cell with its ambiguous molecules (deduped by UMI,
  gene set = union), then per method:
    Uniform    — 1/N to each gene in the set
    PropUnique — proportional to unique counts (uniform fallback if none)
    Rescue     — proportional to unique + a uniform multi prior
    EM         — iterate theta_g = u_g + (multi ∝ theta) to convergence
  Each ambiguous molecule contributes total mass 1.0, so every method conserves
  the grand total (only the spread differs). Cells with only multi reads are
  skipped (no unique row).

Verified on mouse 5k-PBMC (10M, Gene): all four matrices sum to 4,527,229
(= 4,219,608 unique + 307,621 multi molecules); Uniform/Rescue spread wider (2.80M
nnz) than PropUnique/EM (2.77M); values real (e.g. 0.5, 1.33333). Runtime
unchanged (~79s, multi is a second pass over the raw body). Unit test
distribute_multi_methods + integration test test_starsolo_multimappers (overlapping
genes G1/G3). 503 tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
solo_sj_multi_compare.py compares the SJ matrix (junctions coord-matched, since
each tool has its own SJ.out.tab; STARsolo symlinks features→SJ.out.tab) and the
UniqueAndMult-*.mtx matrices (genes/barcodes align by GTF/whitelist order).

Result on mouse 5k-PBMC (10M, CellRanger params): multiMappers correlate r=0.98-0.998
vs STARsolo; SJ junction sets overlap 99.9% with per-junction r=0.93. Totals run
~4% (Gene) / ~10% (SJ) higher — the known "rustar maps slightly more reads" gap,
amplified for splice-spanning reads — not a distribution-logic difference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Plate-well full-length protocols (Smart-seq2): no cell barcodes or UMIs in the
reads. A --readFilesManifest TSV (read1<TAB>read2<TAB>cellID) lists one library
per cell; reads are aligned, gene-assigned (Gene/exon), and counted per cell with
no UMI dedup (each uniquely-gene-assigned read = a count).

- src/solo/smartseq.rs: manifest parser (single-end MVP; read2 must be '-'),
  SmartSeqCounts (per-cell gene→count), and the genes×cells matrix writer
  (barcodes.tsv = manifest cell IDs; respects --soloOutGzip).
- run_smartseq in lib.rs builds the gene model from --sjdbGTFfile, iterates each
  cell's reads (rayon per batch), and writes Solo.out/Gene/raw/. Dispatched ahead
  of the droplet solo_ctx (which needs a whitelist).
- params: --readFilesManifest; SmartSeq exempt from the --readFilesIn /
  whitelist-correction / two-read-file rules; requires the manifest.

Verified on mouse cDNA reads (2 synthetic cells → per-cell read counts) and an
integration test test_starsolo_smartseq (Exon1 reads → G1 counts 5/3). 504 tests,
0 clippy. PE SmartSeq is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Chemistries whose cell barcode is split into several fixed-position segments
(sci-RNA-seq, SPLiT-seq, …), each with its own whitelist.

- SoloBarcodeLayout is now an enum: Simple (CB_UMI_Simple, unchanged) and Complex,
  which parses --soloCBposition/--soloUMIposition (startAnchor_startDist_endAnchor_
  endDist; read-start anchoring) and assembles the CB by concatenating the segment
  slices from the barcode read.
- CbWhitelist::load_complex builds the combined whitelist as the cartesian product
  of the per-segment whitelists (concatenated, packed). Matching the assembled CB
  against this is equivalent to STARsolo's per-segment matching for both Exact and
  1MM (a 1MM in the concatenation is a 1MM in exactly one segment), so the rest of
  the pipeline (correction, UMI dedup, matrix) is reused unchanged.
- params: --soloCBposition / --soloUMIposition; CB_UMI_Complex requires one
  position + one whitelist per segment. Adapter-anchored / variable positions are
  a follow-up (read-start only for now).

Unit tests (parse_position, complex_layout_assembles_segments) + integration test
test_starsolo_cb_umi_complex (2 segments × 2 whitelists → 4-cell product, CB AAGG
matched, 1 molecule). 507 tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--soloCellFilter EmptyDrops_CR now writes filtered/ with the full rescue (knee
guaranteed cells + cells whose profile is significantly non-ambient), instead of
just the knee. emptydrops_called re-reads the raw matrix body for the ambient
(rank [indMin,indMax)) and candidate (rank >= nSimple, total >= minUMI) cell
profiles, then runs the same multinomial Monte-Carlo + Benjamini-Hochberg as the
standalone `emptydrops` binary (seed 19760110, Good-Turing P0 ambient). Params:
EmptyDrops_CR nExpected maxPct maxMinRatio indMin indMax umiMin umiMinFracMedian
candMaxN FDR [simN].

Verified on mouse 5k-PBMC (10M, GeneFull, umiMin=100): 3821 knee + 629 rescued =
4450 filtered cells — identical to the standalone binary's result on the same
matrix. With STAR's default umiMin=500 there are no sub-knee candidates so it
reduces to the knee. 507 tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--readFilesManifest read2 may now be a mate-2 file (not just '-'). For PE cells,
run_smartseq reads mate pairs in lockstep (PairedFastqReader), aligns each with
align_paired_read, and counts the fragment once toward the gene from the union of
both mates' overlaps (both-mapped → both transcripts; half-mapped → the mapped
mate). SE manifests are unchanged (read counts).

Verified end-to-end on real Smart-seq2 mouse data (GEO GSE228456 / SRP429940, 3
sorted single monocytes, 1M reads each): PE fragment counts run slightly below SE
read counts (stricter proper-pairing, one count/fragment) and detect more genes
(read2 covers extra regions) — as expected. Integration test
test_starsolo_smartseq_paired (mate1 Exon1 + mate2 rc(Exon2) → proper FR pair on
G1, 4 fragments). 508 tests, 0 clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Psy-Fer Psy-Fer self-assigned this Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants