Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
9fed182
Phase 14: STARsolo single-cell support (MVP + CellRanger matching)
Jun 13, 2026
5bbdeeb
solo: buffered matrix output + per-cell build_matrix + 3-way benchmark
Jun 15, 2026
2088e16
solo: stream count matrix to temp file (Step 2; bounded output memory)
Jun 15, 2026
7ba3a9d
Merge branch 'main' into phase14-starsolo
iandriver Jun 15, 2026
4480806
index: mmap the SA + SAindex on load (reclaimable, no OOM)
Jun 15, 2026
9ec634e
index: mmap the genome too, compute reverse-complement on access (Pha…
Jun 16, 2026
6757905
solo: GeneFull feature (intron-inclusive) + multi-feature output
Jun 16, 2026
d2da941
solo: EmptyDrops_CR cell caller (standalone Rust binary)
Jun 16, 2026
63dacb3
test: GeneFull + EmptyDrops comparison scripts
Jun 16, 2026
20d839a
test: rustar-vs-CellRanger GeneFull + EmptyDrops h5 comparison
Jun 16, 2026
b9ebc4b
test: strip 10x '-1' suffix in solo_genefull_compare barcode matching
Jun 16, 2026
0c74045
perf: implement --outSAMtype None + reuse gene-assignment scratch buf…
Jun 16, 2026
e7fb2df
fix(windows): gate mmap MADV_RANDOM behind cfg(unix)
Jun 16, 2026
0e16f59
solo: CellRanger-style Summary.csv per feature + cross-tool compare
Jun 17, 2026
d334cae
solo: CellRanger-style region funnel (exonic/intronic/intergenic/anti…
Jun 17, 2026
ca8f579
solo: filtered/ cell-called matrix output + --soloOutGzip
Jun 17, 2026
f16c1dd
solo: SJ (splice-junction) feature matrix
Jun 17, 2026
441d4a3
solo: --soloMultiMappers (Uniform/PropUnique/Rescue/EM)
Jun 17, 2026
5bb7df5
test: rustar-vs-STARsolo SJ + multiMapper diff harness
Jun 18, 2026
19a8355
solo: --soloType SmartSeq (plate-based, manifest, no UMI)
Jun 18, 2026
76fb51a
solo: --soloType CB_UMI_Complex (multi-segment barcodes)
Jun 18, 2026
1cfc5dd
solo: EmptyDrops_CR Monte-Carlo rescue in the filtered/ writer
Jun 18, 2026
2257e29
solo: paired-end SmartSeq (fragment counts)
Jun 18, 2026
b66b377
solo: Velocyto feature (spliced/unspliced/ambiguous) per Sullivan 2025
Jun 19, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -36,4 +36,10 @@ target
*.out
*.log
*.tab
*.sam
*.sam

# Linux build dir used by the solo Docker benchmark/diff (CARGO_TARGET_DIR)
/target-linux/

# amd64 Linux build dir for the benchmark container
/target-amd64/
39 changes: 35 additions & 4 deletions ROADMAP.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ Phase 1 (CLI) ✅
└→ Phase 17.B (per-mate seeding) [planned]
└→ Phase 17.1 (Log.final.out) ✅
└→ Phase 17.2+ (features + polish)
└→ Phase 14 (STARsolo) [DEFERRED]
└→ Phase 14 (STARsolo) 🚧 14.1 done
```

**Phase ordering rationale**: Threading (Phase 9) done first to establish parallel architecture.
Expand Down Expand Up @@ -55,7 +55,7 @@ Paired-end (Phase 8) builds on threaded infrastructure. GTF/junctions (Phase 7)
| [15](docs-old/phase15_sam_tags.md) | SAM Tags + PE Fix | ✅ | 235 | NH/HI/AS/NM/nM/XS/jM/jI/MD, PE fix |
| [16](docs-old/phase16_algorithm.md) | Algorithm Parity | ✅* | 268 | SE: **8613/8926 (0 STAR-only, 99.815% tie-adj)**, 2.2% splice; PE: **8390/8390 exact**, **99.883% tie-adj PE faithfulness**, 0 MAPQ inflate/deflate, 0 NH diffs (Phase G2) |
| [17](docs-old/phase17_features.md) | Features + Polish | ✅* | 396 | Log.final.out, GeneCounts, TranscriptomeSAM, SJDB insertion, --outSAMattrRGline, --runRNGseed, combined-read PE seeding (Phase E2), scoreSeedBest (17.A), sorted BAM (17.2), outReadsUnmapped (17.4), outStd (17.6), PE chimeric (17.3), WithinBAM (17.11), GTF tag params (17.7), outBAMcompression+limitBAMsortRAM (17.9), chimeric Tier 1b soft-clip re-seed (12.2), chimeric Tier 3 residual re-seed (17.10) |
| 14 | STARsolo | DEFERRED | — | Waiting for accuracy parity |
| [14](docs-old/phase14_starsolo.md) | STARsolo (single-cell) | 🚧 In progress | 475 | **MVP done (14.1–14.4)**: 10x Gene count matrix end-to-end (barcode plumbing, CB correction, gene assignment, UMI dedup, raw matrix.mtx) |

*Partially complete — see linked docs for sub-phase status.

Expand Down Expand Up @@ -308,6 +308,37 @@ See [docs-old/phase17_features.md](docs-old/phase17_features.md) for sub-phase t

---

## Phase 14: STARsolo (Single-Cell) — DEFERRED
## Phase 14: STARsolo (Single-Cell) — IN PROGRESS

Waiting for accuracy parity (position agreement >99%).
**Prerequisite met**: position agreement >99% (SE 99.815% tie-adj, PE 99.883%). Phase unblocked 2026-06-10.

Single-cell quantification layered around the existing aligner: the cDNA read aligns through the normal SE path; a paired **barcode read** (R1 = cell barcode + UMI) is parsed, corrected against a whitelist, assigned to a gene, UMI-deduplicated, and emitted as a sparse per-cell count matrix. Target: faithful port of STARsolo (all features). See [docs-old/phase14_starsolo.md](docs-old/phase14_starsolo.md) for the full design and sub-phase tracking.

| Sub-phase | Description | Status |
|-----------|-------------|--------|
| 14.1 | `--solo*` params + barcode-read input plumbing (`src/solo/`, CB/UMI extraction, SE dispatch) | ✅ Complete |
| 14.2 | Whitelist load + CB correction (`--soloCBmatchWLtype`) + UMI checks | ✅ Complete |
| 14.3 | Per-read gene assignment + CB/UMI threaded into the alignment loop | ✅ Complete |
| 14.4 | UMI dedup + raw `matrix.mtx` (**MVP complete**) | ✅ Complete |
| 14.CR | CellRanger 4/5-matching flags (`1MM_CR`, `MultiGeneUMI_CR`, `1MM_multi_Nbase_pseudocounts`, `CellRanger4` clip) | ✅ Complete |
| 14.5 | `Summary.csv` / `Barcodes.stats` / `Features.stats` | ⬜ Planned |
| 14.6 | Cell filtering (`--soloCellFilter`: CellRanger2.2, EmptyDrops_CR) | ⬜ Planned |
| 14.7 | `CB`/`UB`/`GX`/`GN` SAM tags + `CB_samTagOut` | ⬜ Planned |
| 14.8 | More features: GeneFull, SJ, Velocyto | ⬜ Planned |
| 14.9 | Multi-gene resolution (`--soloMultiMappers`) | ⬜ Planned |
| 14.10 | Other chemistries: CB_UMI_Complex, SmartSeq | ⬜ Planned |
| 14.11 | Differential test harness vs STARsolo + synthetic integration tests | ⬜ Planned |

**Phase 14.1** (2026-06-10): `SoloType` enum + 12 `--solo*` params in `src/params/mod.rs`; new `src/solo/mod.rs` (`SoloBarcodeLayout` geometry, `CellBarcode` CB/UMI extraction, `SoloReadReader` lockstep cDNA+barcode FASTQ reader); solo validation (2 read files, GTF for Gene/GeneFull, CB/UMI length); `run_single_pass` + `run_pass1` dispatch routes solo runs to the SE cDNA path (file 0). 447 lib tests (+6 solo), 0 clippy warnings.

**Phase 14.2** (2026-06-11): new `src/solo/whitelist.rs` — faithful port of STAR's `SoloReadBarcode_getCBandUMI.cpp` read stage. 2-bit barcode packing (`seq[0]` high bits, N-detection: 0/1/>1), sorted-array whitelist load (plain/gz), `match_cb` (exact → single-N → 1MM enumeration) honoring `--soloCBmatchWLtype` (Exact/1MM/1MM_multi/…); multi-match reads record all candidate WL indices + mismatch quality (`CbMatch::Multi`) for the Phase 14.4 posterior; exact-match count table accumulated as the posterior prior; UMI checks (N → reject, homopolymer → reject); `CbMatchStats` with STAR's cbMatch categories. Params: `--soloCBmatchWLtype` validation, `solo_cb_match_type()` / `solo_cb_whitelist_path()` helpers, None-whitelist-requires-Exact rule, CBlen≤32 guard. 460 lib tests (+13 solo), 0 clippy warnings.

**Phase 14.3** (2026-06-11): per-read gene assignment + barcode threading into the alignment loop. New `src/solo/gene.rs` — `SoloStrand` (`--soloStrand`), `assign_gene_se` (union of strand-filtered `overlapping_genes` across all loci → `Gene`/`NoFeature`/`Ambiguous`/`Unmapped`; multi-locus-same-gene stays unique). `src/solo/mod.rs` gains `SoloContext` (whitelist + gene model + stats + recorder, `build()` from params), `SoloRecorder` (thread-safe `SoloCountRecord` / deferred `SoloMultiRecord`), and `process_read` (CB match → UMI check → gene assign → record). New `align_reads_solo` loop in `lib.rs` reads cDNA + barcode in lockstep (`SoloReadReader`), aligns the cDNA, writes SAM/BAM, and collects per-cell records; `run_single_pass`/`run_two_pass` thread `solo_ctx`. 467 lib + 10 integration tests, 0 clippy warnings.

**Phase 14.CR — CellRanger 4.x/5.x matching** (2026-06-12): implemented the STARsolo.md CellRanger-matching flag set faithfully from STAR source. `--soloUMIdedup 1MM_CR` (`umiArrayCorrect_CR`: each UMI corrected to its highest-count 1MM neighbor, non-transitive, count = distinct corrected). `--soloUMIfiltering MultiGeneUMI_CR` (keep the top-read-count gene of a multi-gene UMI) + `MultiGeneUMI`; `build_matrix` restructured to per-cell `umi → gene → readcount`. `--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts` adds a +1 pseudocount to the CB posterior prior. `--clipAdapterType CellRanger4` (TSO 5' clip + polyA 3' trim, conservative no-op on adapter-free reads). All validated in params. Differential harness `test/solo_cellranger_diff.py` runs the full CellRanger flag set on both rustar-aligner and real STAR and compares decoded `{(barcode, gene_id): count}` matrices; committed cargo test `test_starsolo_cellranger_style_matrix` asserts the matrix (incl. 1MM_CR collapse) always.

**Three-way benchmark** (see [docs-old/phase14_benchmark.md](docs-old/phase14_benchmark.md)): CellRanger 10.0.0 vs STARsolo 2.7.10b vs rustar-aligner on 10M reads of a real 5′ mouse 10x dataset (GRCm39-2024-A), all x86_64 in Docker. rustar produces a correct matrix (4.22M UMIs, exonic Gene, ~4% above STARsolo's 4.07M; CellRanger's 4.84M includes introns). After a buffered-I/O fix (raw-matrix write 1306s → 3s; barcodes.tsv was unbuffered), rustar's count is 670s vs STARsolo 152s / CellRanger 356s; index build 2801s (faster than STAR's 3626s under emulation). Peak RSS 37GB (index-dominated). `build_matrix` Step 1 (per-cell processing) bounds matrix-build memory.

**Live verification — PASS:** rustar-aligner's `Gene/raw` matrix is **byte-identical to real STARsolo's** for the CellRanger-style run, confirmed deterministically (3/3 runs). The reference STAR (2.7.10b) and a Linux build of rustar-aligner run in a consistent Linux container (`test/Dockerfile.solodiff` + `test/solo_diff_docker.sh`, via colima — no Docker Desktop). This was necessary because STAR 2.7.11b reads 0 input reads on Apple-Silicon macOS (a known STAR/macOS bug, `nextChar=-1`). 479 lib + 11 integration tests, 0 clippy warnings.

**Phase 14.4 — MVP COMPLETE** (2026-06-11): UMI deduplication + raw count-matrix output. New `src/solo/count.rs`: `UmiDedup` (`--soloUMIdedup`: Exact / NoDedup / 1MM_All [default, connected-components within Hamming-1] / 1MM_Directional / 1MM_Directional_UMItools, `dirCountAdd` 0/−1); deferred 1MM_multi CB resolution via STAR's count+quality posterior (weight = `exactCount·10^(−q/10)`, prior from `whitelist.exact_count_snapshot()`); `build_matrix` groups reads by (cell,gene), collapses UMIs, and `write_gene_matrix` writes `Solo.out/Gene/raw/{matrix.mtx, barcodes.tsv, features.tsv}` (MatrixMarket `nFeatures nBarcodes nEntries`, entries `gene+1 cell+1 count`, 1-based; CellRanger-v3 3-column features.tsv; whitelist-sorted barcodes.tsv). Wired into `align_reads` post-alignment. `--soloUMIdedup` validation in params. End-to-end test (`test_starsolo_gene_matrix`): 8 reads, one cell, two Hamming-distant UMI clouds → 2 deduped molecules → matrix `1 1 2`. **A working 10x Chromium Gene count matrix.** 475 lib + 10 integration tests, 0 clippy warnings.
88 changes: 88 additions & 0 deletions docs-old/phase14_benchmark.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
[← Back to ROADMAP](../ROADMAP.md) · [Phase 14](phase14_starsolo.md)

# Phase 14 Benchmark: CellRanger vs STARsolo vs rustar-aligner

Runtime + output-stats comparison of the three single-cell quantifiers on a real
10x mouse dataset, run in one consistent Linux/x86_64 environment.

## Setup

- **Reference**: CellRanger mouse `refdata-gex-GRCm39-2024-A` (genome 2.79 Gb, 61
contigs, 33,696 genes). STAR + rustar build their indexes from the refdata
`fasta/genome.fa` + `genes/genes.gtf` (`--sjdbOverhang 89`); CellRanger uses
the refdata directly.
- **Data**: 5k Mouse PBMCs, **5′ GEM-X** (SC5P-R2-v3); first **10,000,000 read
pairs** of the GEX library — identical reads for all three tools.
- **Solo params** (CellRanger-matching, 5′): `--soloType CB_UMI_Simple`,
CB 16 / UMI 12, `--soloStrand Reverse`, whitelist `3M-5pgex-jan-2023`,
`--soloFeatures Gene`, `--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts`,
`--soloUMIfiltering MultiGeneUMI_CR`, `--soloUMIdedup 1MM_CR`.
- **Environment**: Docker (colima) on Apple-Silicon macOS, **everything x86_64
via Rosetta** (CellRanger is x86_64-only), 14 cores / 40 GB. All absolute
times are inflated ~2–3× by emulation; the *relative* picture holds.
- **Tooling**: CellRanger 10.0.0, STAR 2.7.10b, rustar-aligner (this branch).
Driver: [`test/solo_bench.py`](../test/solo_bench.py) (each step under
`/usr/bin/time -v`), image [`test/Dockerfile.bench`](../test/Dockerfile.bench).

## Results

| Tool | Index build | Count (align+quant) | Peak RSS | Raw barcodes | Genes | Total UMIs |
|------|------------:|--------------------:|---------:|-------------:|------:|-----------:|
| **CellRanger 10.0.0** | (prebuilt) | 356 s | 12.5 GB | 161,465 | 17,258 | 4,843,682 |
| **STARsolo 2.7.10b** | 3,626 s | 152 s | 30 GB | 143,490 | 15,675 | 4,067,946 |
| **rustar-aligner** | 2,801 s | **670 s** | 37 GB | 156,258 | 16,278 | 4,219,582 |

CellRanger reported: 3,858 cells, 599 median genes/cell, 88.5 % valid barcodes,
58.5 % reads mapped to transcriptome.

### Correctness

On identical reads, rustar's raw matrix is in line with the references:
**4,219,582 UMIs** (exonic `Gene`), ~4 % above STARsolo's 4,067,946 (also exonic
`Gene`). CellRanger's 4,843,682 is higher because it counts **intronic** reads by
default (`include-introns`), whereas `--soloFeatures Gene` is exonic-only.
rustar's read-stage barcode match rate was **86 % exact** on this real data.

### The buffered-I/O fix

The first rustar count run took 1,774 s. A breakdown showed the raw-matrix write
dominated:

```
before after
matrix write: 1,306 s → 3 s (~435×; byte-identical output)
align (10M): 402 s → 627 s (unchanged logic; emulation variance)
count total: 1,774 s → 670 s
```

Cause: `write_barcodes` / `write_matrix_mtx` wrote to a raw `std::fs::File`
(unbuffered) — one `write(2)` syscall per line, so `barcodes.tsv` (the full
3,686,400-barcode whitelist) cost ~3.7M syscalls, amplified by Rosetta+virtiofs.
Fix: wrap the writers in `BufWriter` + a no-alloc barcode unpack
(`unpack_barcode_into`). The write dropped to ~3 s.

## Notes & limitations

- **Index build**: rustar (2,801 s) was *faster* than STARsolo (3,626 s) under
emulation; CellRanger ships a prebuilt index (its 356 s "count" includes the
internal STAR alignment + cell calling + full metrics).
- **Memory**: rustar's 37 GB peak is dominated by the **loaded index (~27 GB:
5.4 B-entry SA for the 2.79 Gb genome)** plus the alignment working set — *not*
the matrix build (Step 1 per-cell `build_matrix` already bounds that). Reducing
the peak further is about the SA representation and alignment buffers, not the
matrix.
- **Read count**: 10M (of ~200M total) keeps the run tractable and memory under
the 40 GB cap. Stats scale with depth (CellRanger called 3,858 cells at this
subsample vs the dataset's ~4,725).

## Reproduce

```bash
brew install colima docker && colima start --cpu 14 --memory 40 --vm-type vz --vz-rosetta
# build the amd64 image (colima can't build amd64 directly; run+commit a base):
docker run --platform linux/amd64 --name b rust:1-bookworm \
bash -c "apt-get update -qq && apt-get install -y -qq rna-star python3 procps time"
docker commit b rustar-bench-amd64 && docker rm -f b
# then run test/solo_bench.py inside it with the ref/whitelist/fastqs mounted
# (see test/solo_bench.py header for the full argument list).
```
Loading
Loading