V4 by JesusGF1 · Pull Request #6 · braingeneers/SIMS

JesusGF1 · 2026-04-08T04:22:57Z

No description provided.

Phase 1 (packaging):
- Replace setup.py / setup.cfg with PEP 621 pyproject.toml
- Trim runtime deps from 106 over-pinned packages down to the 10 the package actually imports (numpy, pandas, scipy, scikit-learn, torch, lightning, torchmetrics, pytorch-tabnet, anndata, tqdm)
- Use lower bounds (>=) instead of strict pins so the resolver can pick current wheels on Python 3.10/3.11/3.12/3.13
- Move boto3 to an optional `s3` extra; scanpy to a `test` extra
- Bump version 3.0.6 -> 4.0.0
- Fix license metadata: setup.py advertised GPL v2 but LICENSE is MIT
- Drop dead Python 3.6/3.7/3.8 classifiers; require Python >= 3.10
- requirements.txt becomes a thin shim that installs the package itself

Phase 2 (source upgrades for the modern stack):
- Switch all `import pytorch_lightning as pl` to `import lightning.pytorch as pl` (canonical path in lightning >= 2.x)
- Replace removed private import torchmetrics.functional.classification.stat_scores._stat_scores_update with the public stat_scores API. The tp/fp/fn it produced were dead values never read by the training loop, so the call is dropped entirely.
- Move metric tracking to torchmetrics.MetricCollection registered as submodules, so Lightning auto-moves them to the correct device. Drops the manual `.to(device)` pattern and the module-level `device = torch.device("cuda:0" ...)` constant.
- Fix `aggregate_metrics` to handle binary tasks correctly: torchmetrics >= 1.0 rejects `average=` for binary metrics. Multiclass metric names unchanged; binary now exposes a single `accuracy` instead of three invalid micro/macro/weighted entries.
- Fix latent mutation bug in `configure_optimizers`: previous code did `self.optim_params.pop(...)` and `self.scheduler_params.pop(...)`, which broke any second call (e.g. re-fitting in the same process). Now copies before popping.
- Wrap pytorch-tabnet 4.x's new `group_attention_matrix` parameter: the default `[]` triggers a crash inside EmbeddingGenerator. Build a proper identity grouping via `create_group_matrix` so each gene is its own group when no explicit grouping is supplied.
- Add `grouped_features` kwarg to SIMSClassifier.__init__ for users who want explicit feature grouping. Backwards-compatible: default None reproduces the previous one-feature-per-group behaviour.
- Tolerate torch >= 2.6's `weights_only=True` default in load_from_checkpoint by passing `weights_only=False` from the SIMS facade. SIMS checkpoints serialize a sklearn LabelEncoder + numpy arrays via Lightning's save_hyperparameters(); they're trusted artifacts. Verified against the legacy MGE_cortex.ckpt shipped with sims_app.
- Fix model_size docstring/error mismatch in SIMS.setup_model: documented values (tall/grande/venti/trenta) didn't match the assertion message. Replaced the assertion with a real ValueError.
- Fix np.array -> np.ndarray annotation on SIMSClassifier.predict.
- Make UploadCallback import optional in scsims/__init__.py since boto3 is now an optional extra. Phase 5 will move it to scsims.contrib; this is the interim shim.
- Add scsims.__version__ = "4.0.0".
- Drop dead `_inference_device` attribute (set once, never read).
- Add a deprecation shim for `DataModule(device=...)`: kept as an ignored kwarg with DeprecationWarning so code passing it through `SIMS(**kwargs)` doesn't break.

Verified end-to-end on Python 3.13 with torch 2.11, lightning 2.6.1, torchmetrics 1.9, numpy 2.4, pandas 2.3, anndata 0.12:
- `from scsims import SIMS` works without boto3 installed
- SIMS(data=...).train() runs to completion
- The resulting checkpoint round-trips through SIMS(weights_path=...)
- .predict() returns the canonical pred_0/prob_0 columns
- The legacy MGE_cortex.ckpt from sims_app loads and forward-passes

Phases 3-8 (API standardization, real pretraining implementation, contrib reorg, tests, sims_app update, docs) are deferred to follow-up commits on this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
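
The `configure_optimizers` fix in Phase 2 comes down to not mutating instance state between calls. A minimal sketch of that pattern follows; the `Model` class, the `"optimizer"` key, and `FakeOptimizer` are hypothetical stand-ins for illustration, not the actual scsims code.

```python
class FakeOptimizer:
    """Hypothetical stand-in for an optimizer class; it just records its kwargs."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class Model:
    """Hypothetical model holding optimizer settings in a dict."""

    def __init__(self, optim_params):
        self.optim_params = optim_params

    def configure_buggy(self):
        # Pops from the instance dict, so a second call raises KeyError.
        optimizer_cls = self.optim_params.pop("optimizer")
        return optimizer_cls(**self.optim_params)

    def configure_fixed(self):
        # Pop from a shallow copy; the instance dict stays intact.
        params = dict(self.optim_params)
        optimizer_cls = params.pop("optimizer")
        return optimizer_cls(**params)


model = Model({"optimizer": FakeOptimizer, "lr": 0.01})
first = model.configure_fixed()
second = model.configure_fixed()  # safe to call again, e.g. when re-fitting
```

With the copy in place, `model.optim_params` still contains the `"optimizer"` entry after any number of calls.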

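The `DataModule(device=...)` deprecation shim from Phase 2 can be pictured as an accepted-but-ignored keyword argument. The sketch below is an assumption about the shape of such a shim: the class body, the `batch_size` parameter, and the warning text are illustrative and not taken from scsims.

```python
import warnings


class DataModule:
    """Hypothetical stand-in; only the deprecated-kwarg handling is shown."""

    def __init__(self, batch_size=32, device=None):
        if device is not None:
            # The value is accepted for compatibility but never stored or used.
            warnings.warn(
                "`device` is ignored and deprecated",
                DeprecationWarning,
                stacklevel=2,
            )
        self.batch_size = batch_size


# Old call sites that still pass `device` keep working and only see a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dm = DataModule(batch_size=16, device="cuda:0")
```
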
Phase 3 - API standardization on SIMSClassifier.predict():
- Add `top_k: int = 3` parameter. Was previously hard-coded to `min(3, num_classes)` in two unrelated places (predict() and predict_step()) which had to stay in sync manually.
- Cap top_k at the number of training classes so callers can pass any positive integer without crashing.
- Wrap the inference loop in `torch.no_grad()` for a meaningful speedup on CPU and to drop the autograd graph from inference.
- Refactor the topk math out of `predict_step` (Lightning's hook, fixed signature) into a private `_inference_batch(batch, top_k)` helper that both `predict()` and `predict_step()` call. This removes the duplication and lets `predict()` actually honor `top_k`.
- Tighten the `predict()` docstring to document the canonical output column scheme: `pred_0 .. pred_{top_k-1}` (decoded labels) and `prob_0 .. prob_{top_k-1}` (softmax probabilities).
- Hoist `np.empty + nan-fill` into `np.full(..., nan)` for clarity.
- Properly forward labels per-row instead of per-batch when the inference dataloader yields (features, labels) tuples; the old per-batch indexing dropped any partial last batch.

Phase 4 - Real TabNet self-supervised pretraining:

This is the new feature, not just dep modernization. The previous `scsims/pretraining.py` was a non-functional stub: `pretrain_model` had body `pass`, `_compute_metrics` referenced an undefined `metric` symbol and a non-existent `self.output_dim`, `validation_step` was missing entirely, no optimizer was configured. The rewrite:
- Wraps `pytorch_tabnet.tab_network.TabNetPretraining` in a real `pl.LightningModule` (`SIMSPretrainer`) so users get the full Lightning training surface (DDP, mixed precision, callbacks, loggers) for the unsupervised stage.
- Implements TabNet's reconstruction loss from the original paper (random feature obfuscation, then reconstruct the masked positions weighted by per-feature variance). Kept as a public function `unsupervised_reconstruction_loss` so it can be unit-tested.
- Plumbs the same `group_attention_matrix` workaround as SIMSClassifier so pytorch-tabnet 4.x doesn't crash on default args.
- Mirrors `SIMSClassifier`'s `__init__` signature so a pretrainer and a classifier built from the same hyperparameters share the same architecture, which is what makes encoder transfer trivial.
- Adds `transfer_pretrained_weights(classifier, pretrainer)` that replicates pytorch-tabnet's `load_weights_from_unsupervised`: iterate the pretrainer's state_dict, rewrite `encoder.*` keys to `tabnet.encoder.*` and pass `embedder.*` through unchanged, copy any tensor whose target shape matches. Returns the count of transferred tensors so callers can sanity-check the warm-start. Layers that exist only in the supervised model (the classification head) are left at their random initialization.

Wired into the SIMS facade in scvi_api.py:
- New `SIMS.pretrain(...)` method. Builds its own pretrainer + Trainer (separate from `setup_trainer`/`train`) and runs unsup fitting on the existing DataModule. Defaults to a sensible `pretrain_loss`-monitored ModelCheckpoint in `./sims_pretrain_checkpoints`.
- New `SIMS.load_pretrainer(weights_path)` method for the two-process workflow: pretrain in process A, save checkpoint, load it in process B, fine-tune with `train()`.
- `SIMS.train()` now detects an attached pretrainer (either from `pretrain()` or `load_pretrainer()`) and warm-starts the classifier from its encoder weights before fitting, with a log line showing how many tensors were transferred.
- Refactored `setup_trainer()` to drop duplicated callback-init branches; `callbacks` no longer needs to exist in `kwargs` for the no-callback path to work.
- Added `checkpoint_dir` and `monitor` parameters to `setup_trainer` so different runs can use distinct dirs (the pretrain vs fine-tune workflow needs this).

`scsims/__init__.py` exports `SIMSPretrainer` and `transfer_pretrained_weights` at the top level.

Verified:
- Stage 1 (pretrain) → Stage 2 (fine-tune with warm start) → Stage 3 (predict with top_k) end-to-end on synthetic blobs data. 127 tensors transferred, matching the expected encoder + embedder weight count for `n_steps=3, n_d=8, n_a=8`.
- Two-process variant: pretrain → save .ckpt → fresh SIMS in a separate process → `load_pretrainer(ckpt)` → `train()` → same 127-tensor warm start.
- top_k smoke test: `top_k=1, 3, 5, 99` all produce correctly-shaped output frames (top_k=99 caps at 5 since synthetic data has 5 classes).
- Legacy MGE_cortex.ckpt regression: still loads, still forward-passes to `(batch, 15)` logits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
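
To make the `top_k` cap and the `pred_i` / `prob_i` column scheme from Phase 3 concrete, here is a small NumPy sketch. The function name `topk_columns`, the example class names, and the use of plain arrays instead of torch tensors are assumptions for illustration; this is not the `_inference_batch` helper itself.

```python
import numpy as np


def topk_columns(logits, class_names, top_k=3):
    """Build pred_i / prob_i columns for i < min(top_k, number of classes)."""
    logits = np.asarray(logits, dtype=float)
    n_rows, n_classes = logits.shape
    k = min(top_k, n_classes)  # cap so any positive top_k is safe

    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Class indices sorted by descending probability, truncated to k.
    order = np.argsort(-probs, axis=1)[:, :k]
    names = np.asarray(class_names)
    rows = np.arange(n_rows)

    columns = {}
    for i in range(k):
        columns[f"pred_{i}"] = names[order[:, i]]
        columns[f"prob_{i}"] = probs[rows, order[:, i]]
    return columns


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
out = topk_columns(logits, ["astro", "neuron", "glia"], top_k=99)
```

Here `top_k=99` is capped at 3 because only three classes exist, so `out` holds `pred_0..pred_2` and `prob_0..prob_2`.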

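The key-rewriting rule described for `transfer_pretrained_weights` can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions: dicts of NumPy arrays stand in for torch `state_dict`s, and the helper name `transfer_matching` and all key names in the example are hypothetical.

```python
import numpy as np


def transfer_matching(source_state, target_state):
    """Copy shape-compatible arrays into target_state in place; return the count."""
    transferred = 0
    for key, value in source_state.items():
        # encoder.* maps to tabnet.encoder.*; every other key is tried unchanged.
        new_key = "tabnet." + key if key.startswith("encoder.") else key
        target = target_state.get(new_key)
        if target is not None and target.shape == value.shape:
            target_state[new_key] = value.copy()
            transferred += 1
    return transferred


pretrainer_state = {
    "encoder.w": np.ones((2, 3)),
    "embedder.e": np.ones(4),
    "decoder.d": np.ones(5),    # no counterpart in the classifier: skipped
    "encoder.bad": np.ones(2),  # shape mismatch in the classifier: skipped
}
classifier_state = {
    "tabnet.encoder.w": np.zeros((2, 3)),
    "embedder.e": np.zeros(4),
    "tabnet.encoder.bad": np.zeros(3),
    "head.w": np.zeros(7),      # supervised-only layer: left untouched
}
count = transfer_matching(pretrainer_state, classifier_state)
```

Returning the count gives callers the same sanity check the commit mentions: a count of zero would signal that the architectures did not line up.
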
Create scsims/contrib/ as a home for optional integrations that depend on extras (boto3, in this case). Move the S3 ModelCheckpoint upload helper from scsims/networking.py to scsims/contrib/upload.py.

Improvements over the original:
- Bucket and endpoint_url are now explicit constructor arguments. The old version hard-coded `bucket="braingeneersdev"` and the Nautilus endpoint, which made the callback unusable for anyone outside the braingeneers group.
- AWS credentials are read from explicit kwargs first, then env vars, then boto3's default credential chain. The old version did `os.environ["AWS_SECRET_ACCESS_KEY"]` and KeyError'd at __init__ time if the var wasn't set, which made just *importing* the module from scsims/__init__.py crash for any CI environment without AWS creds.
- Fixed a latent bug in on_train_end: the previous code did `os.path.join(self.path, self.checkpoint_callback.best_model_path)` but ModelCheckpoint.best_model_path is already an absolute path rooted at self.path, so the join produced `model_checkpoints/model_checkpoints/...` which didn't exist on disk and made the upload silently fail.
- Skip upload gracefully if best_model_path is empty (no checkpoint was actually saved), instead of trying to upload an empty path.

Backwards compatibility:
- scsims/networking.py is now a deprecation shim. `from scsims.networking import UploadCallback` still works for one release; importing it emits a DeprecationWarning. The shim subclasses the new contrib version with the legacy braingeneersdev defaults pre-filled, so existing scripts that constructed `UploadCallback(desc="foo")` keep working unchanged.
- The shim will be deleted in scsims 5.0.

Top-level scsims namespace:
- `UploadCallback` is no longer re-exported from `scsims`. Callers must use `from scsims.contrib import UploadCallback`. This is a real (small) breaking change, but no consumer in the working tree imported `UploadCallback` from the top-level namespace, and the major version bump is the right time to clean it up.
- Removed `UploadCallback` from `scsims.__all__`.
- Removed the try/except ModuleNotFoundError shim from scsims/__init__.py since UploadCallback is no longer imported there.

Verified:
- `from scsims.contrib import UploadCallback` works with boto3 installed
- `from scsims.networking import UploadCallback` raises a DeprecationWarning and constructs a working callback with the legacy braingeneersdev defaults
- Without boto3, importing `scsims` and `scsims.contrib` succeeds (UploadCallback is silently absent from contrib's namespace)
- Without boto3, `from scsims.networking import UploadCallback` raises ModuleNotFoundError instead of a confusing AWS env-var KeyError
- Full v4 regression suite (legacy ckpt load, train→reload→predict, pretrain→warm-start fine-tune) still passes after the reorg

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
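
The credential lookup order described above (explicit kwargs, then environment variables, then boto3's default chain) could look roughly like the sketch below. The function name `resolve_credentials` and its parameters are hypothetical; the sketch deliberately avoids importing boto3 and only shows how a missing variable yields `None` instead of a `KeyError`.

```python
import os


def resolve_credentials(access_key=None, secret_key=None, env=None):
    """Prefer explicit arguments, then environment variables, else None."""
    env = os.environ if env is None else env
    # .get() returns None for a missing variable instead of raising KeyError.
    access_key = access_key or env.get("AWS_ACCESS_KEY_ID")
    secret_key = secret_key or env.get("AWS_SECRET_ACCESS_KEY")
    # A (None, None) result is meant to be handed to boto3 as-is, so that
    # boto3 falls back to its own default credential chain.
    return access_key, secret_key


# No credentials anywhere: nothing raises, both values are None.
missing = resolve_credentials(env={})

# Environment variables are used when no explicit arguments are given.
from_env = resolve_credentials(
    env={"AWS_ACCESS_KEY_ID": "env-id", "AWS_SECRET_ACCESS_KEY": "env-secret"}
)

# Explicit arguments win over the environment.
explicit = resolve_credentials(
    access_key="arg-id",
    secret_key="arg-secret",
    env={"AWS_ACCESS_KEY_ID": "env-id", "AWS_SECRET_ACCESS_KEY": "env-secret"},
)
```

Deferring the lookup to call time, rather than indexing `os.environ` during import or construction, is what keeps credential-free CI environments from crashing.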

The repo shipped four "test" files that hadn't run since at least 2022:
- tests/test_data.py and tests/test_model.py both did `sys.path.append(.../src)` and then `from data import *`. There is no src/ directory in the repo (it was renamed to scsims/ four years ago) and `generate_single_dataset` doesn't exist anywhere in the codebase.
- tests/test_model.py builds a Trainer with `gpus=` and `auto_lr_find=`, both of which were removed in pytorch-lightning 2.0.
- tests/test_helpers.py is empty.
- tests/tests.py is unrelated noise from a different file layout.

Plus scsims/tests.py held three "smoke" tests that *did* work but had two latent bugs (referenced columns `first_pred`/`first_prob` that scsims has never produced, and wrote to literal cwd-relative checkpoint dirs which made parallel test runs racy). They also weren't discoverable by pytest because they lived inside the package, not the tests/ tree.

Replace all of the above with a real pytest suite under tests/:
- conftest.py: shared synthetic-AnnData fixtures using scanpy.datasets.blobs (10 genes / 100 cells / 5 classes by default; small enough that the full suite runs in <5s on CPU). Also silences the firehose of FutureWarnings/DeprecationWarnings the upstream libraries emit.
- test_smoke.py: ports the three working tests from scsims/tests.py with the column-name and tmp_path fixes, plus two new tests for the v4 `top_k` parameter and the `model.explain()` matrix.
- test_pretraining.py: covers the new v4 self-supervised pretraining feature end-to-end. Tests the unsupervised reconstruction loss in isolation, asserts that `transfer_pretrained_weights` actually copies tensors bit-for-bit (not just reports a count), and exercises the two-process pretrain -> save -> load -> fine-tune workflow.
- test_legacy_checkpoint.py: opt-in regression test for the v3-era checkpoint format. Skipped by default; set SIMS_LEGACY_CKPT to a path on disk to enable. Verified locally against the MGE_cortex.ckpt shipped with sims_app (loads cleanly under torch>=2.6's weights_only=True default, forward-passes to the right shape).

Add .github/workflows/ci.yml: pytest on Python 3.10, 3.11, 3.12 over ubuntu-latest. Uses the `test` extra defined in pyproject.toml so scanpy and pytest get pulled in. Concurrency-cancels old runs when a new commit lands on the same branch.

Verified: 10 passed, 1 skipped (legacy ckpt opt-in), 0 failed, 4.6s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CHANGELOG.md
- New Keep-a-Changelog file documenting every change in v4.
- Added: SIMSPretrainer, SIMS.pretrain / load_pretrainer, transfer_pretrained_weights, top_k parameter, grouped_features parameter, scsims.contrib subpackage, pyproject.toml, GitHub Actions CI, real pytest suite.
- Changed: dependency stack lower-bounded to current versions, boto3 and scanpy moved to extras, lightning import path, predict() output columns, MetricCollection-based metric tracking, setup_trainer accepts checkpoint_dir/monitor, model_size raises ValueError, UploadCallback moved to scsims.contrib.
- Fixed: pkg_resources crash on Streamlit Cloud, torch.load weights_only on legacy checkpoints, pytorch-tabnet 4.x group_attention_matrix default crash, configure_optimizers mutation bug, UploadCallback AWS env-var crash, UploadCallback double-join bug, np.array type annotation, MIT/GPL license metadata mismatch, Python version classifiers, model_size docstring lie.
- Removed: setup.py/setup.cfg, dead tests/test_*.py files, scsims/tests.py, top-level UploadCallback re-export, module-level device constants, dead _inference_device attribute.

MIGRATION.md
- New v3 -> v4 upgrade guide. TL;DR: most users only need to change `predictions["first_pred"]` to `predictions["pred_0"]`. Covers the install change (now `scsims>=4.0`), the predict() column rename, Lightning import path migration, the new setup_trainer kwargs, the model_size error change, the UploadCallback move, the new working pretraining feature, and explicit "no action required" callouts for legacy checkpoint loading.

README.md
- Modernized installation section: Python 3.10+, current pip install one-liners with explicit notes about the [s3] and [dev] extras.
- Fixed `from pytorch_lightning...` imports in the code samples to use `from lightning.pytorch...`.
- Updated the predict() example to use the canonical pred_0/prob_0 column names and demonstrate the new top_k parameter.
- Updated explain() example to unpack the (matrix, labels) tuple.
- Added a new "Self-supervised pretraining (new in v4)" section with a worked two-stage example.
- Added a banner pointer to MIGRATION.md for upgraders.
- Fixed `num_epochs=` -> `max_epochs=` in the custom-training example (the old kwarg was wrong; PL Trainer has always taken `max_epochs`).

SIMS_tutorial.ipynb
- Three targeted edits, preserved as a clean 3-line diff:
  1. Updated the "Before you begin" cell to say Python 3.10+ instead of "between 3.8 and 3.11" (which became false starting with the v4 stack on Python 3.13).
  2. Switched the EarlyStopping/ModelCheckpoint import from `pytorch_lightning.callbacks` to `lightning.pytorch.callbacks`.
  3. Same import path fix in the WandbLogger note in cell 28.
- The tutorial already used the canonical pred_0/prob_0 column names, so no other changes were needed there. The full smoke test (train -> predict -> explain) demonstrated in the notebook continues to match what scsims actually does.

Verified: full pytest suite (10 passed, 1 skipped) still green after all docs work; sims_app streamlit boots clean against scsims v4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v4 has a lot of changes; pyproject.toml now reflects that.

JesusGF1 and others added 6 commits April 7, 2026 17:30

Update pyproject.toml author for v4

3e4d432

JesusGF1 merged commit b7851e6 into main Apr 8, 2026
3 checks passed

JesusGF1 merged 6 commits into main from v4

1 participant
