
V4 #6

Merged: JesusGF1 merged 6 commits into main from v4 on Apr 8, 2026

JesusGF1 commented Apr 8, 2026

No description provided.

JesusGF1 and others added 6 commits April 7, 2026 17:30
Phase 1 (packaging):
- Replace setup.py / setup.cfg with PEP 621 pyproject.toml
- Trim runtime deps from 106 over-pinned packages down to the 10 the
  package actually imports (numpy, pandas, scipy, scikit-learn, torch,
  lightning, torchmetrics, pytorch-tabnet, anndata, tqdm)
- Use lower bounds (>=) instead of strict pins so the resolver can pick
  current wheels on Python 3.10/3.11/3.12/3.13
- Move boto3 to an optional `s3` extra; scanpy to a `test` extra
- Bump version 3.0.6 -> 4.0.0
- Fix license metadata: setup.py advertised GPL v2 but LICENSE is MIT
- Drop dead Python 3.6/3.7/3.8 classifiers; require Python >= 3.10
- requirements.txt becomes a thin shim that installs the package itself

Phase 2 (source upgrades for the modern stack):
- Switch all `import pytorch_lightning as pl` to
  `import lightning.pytorch as pl` (canonical path in lightning >= 2.x)
- Drop the private import
  torchmetrics.functional.classification.stat_scores._stat_scores_update,
  which was removed upstream. Its public replacement is the stat_scores
  API, but the tp/fp/fn it produced were dead values never read by the
  training loop, so the call is dropped entirely rather than ported.
- Move metric tracking to torchmetrics.MetricCollection registered as
  submodules, so Lightning auto-moves them to the correct device. Drops
  the manual `.to(device)` pattern and the module-level
  `device = torch.device("cuda:0" ...)` constant.
- Fix `aggregate_metrics` to handle binary tasks correctly: torchmetrics
  >= 1.0 rejects `average=` for binary metrics. Multiclass metric names
  unchanged; binary now exposes a single `accuracy` instead of three
  invalid micro/macro/weighted entries.
- Fix latent mutation bug in `configure_optimizers`: previous code did
  `self.optim_params.pop(...)` and `self.scheduler_params.pop(...)`,
  which broke any second call (e.g. re-fitting in the same process).
  Now copies before popping.
- Wrap pytorch-tabnet 4.x's new `group_attention_matrix` parameter:
  the default `[]` triggers a crash inside EmbeddingGenerator. Build a
  proper identity grouping via `create_group_matrix` so each gene is
  its own group when no explicit grouping is supplied.
- Add `grouped_features` kwarg to SIMSClassifier.__init__ for users
  who want explicit feature grouping. Backwards-compatible: default
  None reproduces the previous one-feature-per-group behaviour.
- Tolerate torch >= 2.6's `weights_only=True` default in
  `load_from_checkpoint` by passing `weights_only=False` from the SIMS
  facade. SIMS checkpoints serialize a sklearn LabelEncoder + numpy
  arrays via Lightning's save_hyperparameters(); they're trusted
  artifacts. Verified against the legacy MGE_cortex.ckpt shipped with
  sims_app.
- Fix model_size docstring/error mismatch in SIMS.setup_model:
  documented values (tall/grande/venti/trenta) didn't match the
  assertion message. Replaced the assertion with a real ValueError.
- Fix np.array -> np.ndarray annotation on SIMSClassifier.predict.
- Make UploadCallback import optional in scsims/__init__.py since
  boto3 is now an optional extra. Phase 5 will move it to
  scsims.contrib; this is the interim shim.
- Add scsims.__version__ = "4.0.0".
- Drop dead `_inference_device` attribute (set once, never read).
- Add a deprecation shim for `DataModule(device=...)`: kept as an
  ignored kwarg with DeprecationWarning so code passing it through
  `SIMS(**kwargs)` doesn't break.
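
The `configure_optimizers` fix in the list above is an instance of a general Python pitfall: popping from a dict stored on `self` mutates it, so a second call sees different contents. The following is a minimal sketch of the bug and the copy-before-pop fix. The class names and the `"optimizer"` key are hypothetical stand-ins chosen for illustration, not the actual SIMS code.

```python
class BuggyModel:
    """Hypothetical stand-in: pops directly from the stored dict."""

    def __init__(self, optim_params):
        self.optim_params = optim_params

    def configure_optimizers(self):
        # Mutates self.optim_params: "optimizer" is gone after the first call.
        optimizer_name = self.optim_params.pop("optimizer")
        return optimizer_name, self.optim_params


class FixedModel:
    """Hypothetical stand-in: copies before popping."""

    def __init__(self, optim_params):
        self.optim_params = optim_params

    def configure_optimizers(self):
        # A shallow copy is enough here because only top-level keys are removed.
        params = dict(self.optim_params)
        optimizer_name = params.pop("optimizer")
        return optimizer_name, params
```

Calling `BuggyModel.configure_optimizers()` twice raises `KeyError` on the second call, while `FixedModel` returns the same result every time.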

Verified end-to-end on Python 3.13 with torch 2.11, lightning 2.6.1,
torchmetrics 1.9, numpy 2.4, pandas 2.3, anndata 0.12:
  - `from scsims import SIMS` works without boto3 installed
  - SIMS(data=...).train() runs to completion
  - The resulting checkpoint round-trips through SIMS(weights_path=...)
  - .predict() returns the canonical pred_0/prob_0 columns
  - The legacy MGE_cortex.ckpt from sims_app loads and forward-passes

Phases 3-8 (API standardization, real pretraining implementation,
contrib reorg, tests, sims_app update, docs) are deferred to follow-up
commits on this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase 3 (API standardization on SIMSClassifier.predict()):

- Add `top_k: int = 3` parameter. Was previously hard-coded to
  `min(3, num_classes)` in two unrelated places (predict() and
  predict_step()) which had to stay in sync manually.
- Cap top_k at the number of training classes so callers can pass
  any positive integer without crashing.
- Wrap the inference loop in `torch.no_grad()` for a meaningful
  speedup on CPU and to drop the autograd graph from inference.
- Refactor the topk math out of `predict_step` (Lightning's hook,
  fixed signature) into a private `_inference_batch(batch, top_k)`
  helper that both `predict()` and `predict_step()` call. This
  removes the duplication and lets `predict()` actually honor
  `top_k`.
- Tighten the `predict()` docstring to document the canonical
  output column scheme: `pred_0 .. pred_{top_k-1}` (decoded labels)
  and `prob_0 .. prob_{top_k-1}` (softmax probabilities).
- Hoist `np.empty + nan-fill` into `np.full(..., nan)` for clarity.
- Properly forward labels per-row instead of per-batch when the
  inference dataloader yields (features, labels) tuples. The old
  per-batch indexing dropped any partial last batch.
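
The `top_k` capping and the `pred_i` / `prob_i` column scheme described above can be sketched in plain Python. This is an illustration under stated assumptions, not the SIMS implementation: `top_k_predictions` is a hypothetical helper that handles a single row of logits, whereas the real code works on batched tensors.

```python
import math


def top_k_predictions(logits, class_names, top_k=3):
    """Return {pred_i: label, prob_i: probability} for one row of logits."""
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    # Cap at the number of classes so a large top_k cannot crash.
    k = min(top_k, len(class_names))

    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable classes, best first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    row = {}
    for rank, idx in enumerate(ranked):
        row[f"pred_{rank}"] = class_names[idx]
        row[f"prob_{rank}"] = probs[idx]
    return row
```

With three classes, `top_k=99` yields exactly three `pred_*` / `prob_*` pairs, which mirrors the capping behaviour this commit describes.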

Phase 4 (real TabNet self-supervised pretraining):

This is a new feature, not just dependency modernization. The previous
`scsims/pretraining.py` was a non-functional stub: `pretrain_model`
had body `pass`, `_compute_metrics` referenced an undefined `metric`
symbol and a non-existent `self.output_dim`, `validation_step` was
missing entirely, and no optimizer was configured.

The rewrite:

- Wraps `pytorch_tabnet.tab_network.TabNetPretraining` in a real
  `pl.LightningModule` (`SIMSPretrainer`) so users get the full
  Lightning training surface (DDP, mixed precision, callbacks,
  loggers) for the unsupervised stage.
- Implements TabNet's reconstruction loss from the original paper
  (random feature obfuscation, then reconstruct the masked positions
  weighted by per-feature variance). Kept as a public function
  `unsupervised_reconstruction_loss` so it can be unit-tested.
- Plumbs the same `group_attention_matrix` workaround as
  SIMSClassifier so pytorch-tabnet 4.x doesn't crash on default args.
- Mirrors `SIMSClassifier`'s `__init__` signature so a pretrainer
  and a classifier built from the same hyperparameters share the
  same architecture, which is what makes encoder transfer trivial.
- Adds `transfer_pretrained_weights(classifier, pretrainer)` that
  replicates pytorch-tabnet's `load_weights_from_unsupervised`:
  iterate the pretrainer's state_dict, rewrite `encoder.*` keys to
  `tabnet.encoder.*` and pass `embedder.*` through unchanged, copy
  any tensor whose target shape matches. Returns the count of
  transferred tensors so callers can sanity-check the warm-start.
  Layers that exist only in the supervised model (the classification
  head) are left at their random initialization.
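
The key-rewriting rule in the last bullet can be sketched without torch. This is a simplified illustration, not the code in this PR: `transfer_matching_weights` is a hypothetical name, plain lists stand in for tensors, and `len()` stands in for a tensor shape comparison.

```python
def transfer_matching_weights(pretrainer_state, classifier_state):
    """Copy compatible entries into classifier_state in place; return the count."""
    transferred = 0
    for key, value in pretrainer_state.items():
        # encoder.* lives under tabnet.encoder.* in the supervised model;
        # every other key (for example embedder.*) is looked up unchanged.
        target_key = f"tabnet.{key}" if key.startswith("encoder.") else key

        target = classifier_state.get(target_key)
        # Skip keys the classifier does not have, and any size mismatch.
        if target is None or len(target) != len(value):
            continue

        classifier_state[target_key] = list(value)
        transferred += 1
    return transferred
```

Entries that exist only in the classifier (such as a classification head) are never visited by the loop, so they keep their initial values. The returned count is what a caller would log to sanity-check the warm start.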

Wired into the SIMS facade in scvi_api.py:

- New `SIMS.pretrain(...)` method. Builds its own pretrainer +
  Trainer (separate from `setup_trainer`/`train`) and runs
  unsupervised fitting on the existing DataModule. By default it
  attaches a ModelCheckpoint that monitors `pretrain_loss` and writes
  to `./sims_pretrain_checkpoints`.
- New `SIMS.load_pretrainer(weights_path)` method for the
  two-process workflow: pretrain in process A, save checkpoint,
  load it in process B, fine-tune with `train()`.
- `SIMS.train()` now detects an attached pretrainer (either from
  `pretrain()` or `load_pretrainer()`) and warm-starts the
  classifier from its encoder weights before fitting, with a log
  line showing how many tensors were transferred.
- Refactored `setup_trainer()` to drop duplicated callback-init
  branches; `callbacks` no longer needs to exist in `kwargs` for
  the no-callback path to work.
- Added `checkpoint_dir` and `monitor` parameters to
  `setup_trainer` so different runs can use distinct dirs (the
  pretrain vs fine-tune workflow needs this).

`scsims/__init__.py` exports `SIMSPretrainer` and
`transfer_pretrained_weights` at the top level.

Verified:

- Stage 1 (pretrain) → Stage 2 (fine-tune with warm start) → Stage 3
  (predict with top_k) end-to-end on synthetic blobs data. 127
  tensors transferred, matching the expected encoder + embedder
  weight count for `n_steps=3, n_d=8, n_a=8`.
- Two-process variant: pretrain → save .ckpt → fresh SIMS in a
  separate process → `load_pretrainer(ckpt)` → `train()` → same
  127-tensor warm start.
- top_k smoke test: `top_k=1, 3, 5, 99` all produce correctly-shaped
  output frames (top_k=99 caps at 5 since synthetic data has 5
  classes).
- Legacy MGE_cortex.ckpt regression: still loads, still
  forward-passes to `(batch, 15)` logits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Create scsims/contrib/ as a home for optional integrations that depend
on extras (boto3, in this case). Move the S3 ModelCheckpoint upload
helper from scsims/networking.py to scsims/contrib/upload.py.

Improvements over the original:

- Bucket and endpoint_url are now explicit constructor arguments. The
  old version hard-coded `bucket="braingeneersdev"` and the Nautilus
  endpoint, which made the callback unusable for anyone outside the
  braingeneers group.
- AWS credentials are read from explicit kwargs first, then env vars,
  then boto3's default credential chain. The old version did
  `os.environ["AWS_SECRET_ACCESS_KEY"]` and KeyError'd at __init__ time
  if the var wasn't set, which made just *importing* the module from
  scsims/__init__.py crash for any CI environment without AWS creds.
- Fixed a latent bug in on_train_end: the previous code did
  `os.path.join(self.path, self.checkpoint_callback.best_model_path)`
  but ModelCheckpoint.best_model_path already includes the checkpoint
  directory (self.path) as its prefix, so the join produced
  `model_checkpoints/model_checkpoints/...` which didn't exist on disk
  and made the upload silently fail.
- Skip upload gracefully if best_model_path is empty (no checkpoint
  was actually saved), instead of trying to upload an empty path.
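
The double-join symptom can be reproduced with the standard library alone. Note that `os.path.join` discards earlier components when a later one is absolute, so a doubled `model_checkpoints/model_checkpoints/...` result implies the second argument was a relative path that already carried the directory prefix. The file name and the `upload_source` helper below are hypothetical, used only to illustrate the two fixes described in the list above.

```python
import os.path

checkpoint_dir = "model_checkpoints"

# Hypothetical value: a relative path that already includes the directory.
best_model_path = os.path.join("model_checkpoints", "epoch=3.ckpt")

# Joining again duplicates the directory prefix.
doubled = os.path.join(checkpoint_dir, best_model_path)

# With an absolute second argument, the first component is discarded instead.
absolute = os.path.abspath(best_model_path)
unchanged = os.path.join(checkpoint_dir, absolute)


def upload_source(path):
    """Return the path to upload as-is, or None when no checkpoint was saved."""
    return path or None
```

Using `best_model_path` directly, and returning early when it is empty, avoids both the doubled prefix and the attempt to upload an empty path.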

Backwards compatibility:

- scsims/networking.py is now a deprecation shim.
  `from scsims.networking import UploadCallback` still works for one
  release; importing it emits a DeprecationWarning. The shim subclasses
  the new contrib version with the legacy braingeneersdev defaults
  pre-filled, so existing scripts that constructed
  `UploadCallback(desc="foo")` keep working unchanged.
- The shim will be deleted in scsims 5.0.
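
A subclass-based deprecation shim of the kind described above might look roughly like the sketch below. Both classes are hypothetical stand-ins with assumed constructor signatures. The legacy endpoint value is not given in this PR, so it is left as `None`. For simplicity this sketch warns at construction time, whereas the shim described above warns at import time, which would take a module-level `warnings.warn(...)` call instead.

```python
import warnings


class UploadCallback:
    """Hypothetical stand-in for the relocated callback (explicit arguments)."""

    def __init__(self, desc, bucket, endpoint_url=None):
        self.desc = desc
        self.bucket = bucket
        self.endpoint_url = endpoint_url


class LegacyUploadCallback(UploadCallback):
    """Shim that keeps the old one-argument constructor working."""

    def __init__(self, desc, bucket="braingeneersdev", endpoint_url=None):
        warnings.warn(
            "this import path is deprecated; use the contrib version instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the shim
        )
        super().__init__(desc, bucket=bucket, endpoint_url=endpoint_url)
```

Because the shim only pre-fills defaults and delegates to the parent, existing one-argument calls keep working while new code passes `bucket` explicitly.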

Top-level scsims namespace:

- `UploadCallback` is no longer re-exported from `scsims`. Callers
  must use `from scsims.contrib import UploadCallback`. This is a real
  (small) breaking change, but no consumer in the working tree
  imported `UploadCallback` from the top-level namespace, and the
  major version bump is the right time to clean it up.
- Removed `UploadCallback` from `scsims.__all__`.
- Removed the try/except ModuleNotFoundError shim from
  scsims/__init__.py since UploadCallback is no longer imported there.

Verified:
- `from scsims.contrib import UploadCallback` works with boto3 installed
- `from scsims.networking import UploadCallback` emits a
  DeprecationWarning and constructs a working callback with the
  legacy braingeneersdev defaults
- Without boto3, importing `scsims` and `scsims.contrib` succeeds
  (UploadCallback is silently absent from contrib's namespace)
- Without boto3, `from scsims.networking import UploadCallback` raises
  ModuleNotFoundError instead of a confusing AWS env-var KeyError
- Full v4 regression suite (legacy ckpt load, train→reload→predict,
  pretrain→warm-start fine-tune) still passes after the reorg
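
The "silently absent without boto3" behaviour verified above is commonly achieved with a guarded import. The helper below is a generic sketch of that pattern, not the contents of `scsims/contrib/__init__.py`; `optional_export` is a hypothetical name.

```python
import importlib


def optional_export(module_name, attr):
    """Return module.attr, or None if the module (or one of its imports) is missing."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The optional dependency is not installed: expose nothing.
        return None
    return getattr(module, attr)
```

A package `__init__` could add the returned object to its namespace only when it is not `None`, which keeps a plain import of the package working in environments that lack the extra.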

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The repo shipped four "test" files that hadn't run since at least 2022:

- tests/test_data.py and tests/test_model.py both did
  `sys.path.append(.../src)` and then `from data import *`. There is no
  src/ directory in the repo (it was renamed to scsims/ four years ago)
  and `generate_single_dataset` doesn't exist anywhere in the codebase.
- tests/test_model.py builds a Trainer with `gpus=` and
  `auto_lr_find=`, both of which were removed in pytorch-lightning 2.0.
- tests/test_helpers.py is empty.
- tests/tests.py is unrelated noise from a different file layout.

Plus scsims/tests.py held three "smoke" tests that *did* work but had
two latent bugs (referenced columns `first_pred`/`first_prob` that
scsims has never produced, and wrote to literal cwd-relative checkpoint
dirs which made parallel test runs racy). They also weren't discoverable
by pytest because they lived inside the package, not the tests/ tree.

Replace all of the above with a real pytest suite under tests/:

- conftest.py: shared synthetic-AnnData fixtures using
  scanpy.datasets.blobs (10 genes / 100 cells / 5 classes by default;
  small enough that the full suite runs in <5s on CPU). Also silences
  the firehose of FutureWarnings/DeprecationWarnings the upstream
  libraries emit.
- test_smoke.py: ports the three working tests from scsims/tests.py
  with the column-name and tmp_path fixes, plus two new tests for the
  v4 `top_k` parameter and the `model.explain()` matrix.
- test_pretraining.py: covers the new v4 self-supervised pretraining
  feature end-to-end. Tests the unsupervised reconstruction loss in
  isolation, asserts that `transfer_pretrained_weights` actually
  copies tensors bit-for-bit (not just reports a count), and exercises
  the two-process pretrain -> save -> load -> fine-tune workflow.
- test_legacy_checkpoint.py: opt-in regression test for the v3-era
  checkpoint format. Skipped by default; set SIMS_LEGACY_CKPT to a
  path on disk to enable. Verified locally against the MGE_cortex.ckpt
  shipped with sims_app (loads cleanly despite torch>=2.6's
  weights_only=True default, forward-passes to the right shape).
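
The opt-in gate for the legacy-checkpoint test can be sketched as a small helper that reads `SIMS_LEGACY_CKPT` and reports why the test should be skipped. This is an assumed shape, not the code in `test_legacy_checkpoint.py`. The `legacy_checkpoint` name and its return convention are hypothetical, and the missing-file check is an addition the commit message does not mention.

```python
import os
from pathlib import Path


def legacy_checkpoint(env=None):
    """Return (path, skip_reason); exactly one of the two is None."""
    env = os.environ if env is None else env
    raw = env.get("SIMS_LEGACY_CKPT", "")
    if not raw:
        return None, "SIMS_LEGACY_CKPT is not set"
    path = Path(raw)
    if not path.is_file():
        return None, f"SIMS_LEGACY_CKPT does not point to a file: {path}"
    return path, None
```

In a pytest module, the test body could call `pytest.skip(reason)` whenever the returned reason is not `None`, which would produce the "1 skipped" result on machines without the checkpoint.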

Add .github/workflows/ci.yml: pytest on Python 3.10, 3.11, 3.12 over
ubuntu-latest. Uses the `test` extra defined in pyproject.toml so
scanpy and pytest get pulled in. Concurrency-cancels old runs when a
new commit lands on the same branch.

Verified: 10 passed, 1 skipped (legacy ckpt opt-in), 0 failed, 4.6s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CHANGELOG.md
- New Keep-a-Changelog file documenting every change in v4.
- Added: SIMSPretrainer, SIMS.pretrain / load_pretrainer,
  transfer_pretrained_weights, top_k parameter, grouped_features
  parameter, scsims.contrib subpackage, pyproject.toml, GitHub Actions
  CI, real pytest suite.
- Changed: dependency stack lower-bounded to current versions, boto3
  and scanpy moved to extras, lightning import path, predict() output
  columns, MetricCollection-based metric tracking, setup_trainer
  accepts checkpoint_dir/monitor, model_size raises ValueError,
  UploadCallback moved to scsims.contrib.
- Fixed: pkg_resources crash on Streamlit Cloud, torch.load
  weights_only on legacy checkpoints, pytorch-tabnet 4.x
  group_attention_matrix default crash, configure_optimizers mutation
  bug, UploadCallback AWS env-var crash, UploadCallback double-join
  bug, np.array type annotation, MIT/GPL license metadata mismatch,
  Python version classifiers, model_size docstring lie.
- Removed: setup.py/setup.cfg, dead tests/test_*.py files,
  scsims/tests.py, top-level UploadCallback re-export, module-level
  device constants, dead _inference_device attribute.

MIGRATION.md
- New v3 -> v4 upgrade guide. TL;DR: most users only need to change
  `predictions["first_pred"]` to `predictions["pred_0"]`. Covers the
  install change (now `scsims>=4.0`), the predict() column rename,
  Lightning import path migration, the new setup_trainer kwargs, the
  model_size error change, the UploadCallback move, the new working
  pretraining feature, and explicit "no action required" callouts
  for legacy checkpoint loading.

README.md
- Modernized installation section: Python 3.10+, current pip install
  one-liners with explicit notes about the [s3] and [dev] extras.
- Fixed `from pytorch_lightning...` imports in the code samples to use
  `from lightning.pytorch...`.
- Updated the predict() example to use the canonical pred_0/prob_0
  column names and demonstrate the new top_k parameter.
- Updated explain() example to unpack the (matrix, labels) tuple.
- Added a new "Self-supervised pretraining (new in v4)" section with a
  worked two-stage example.
- Added a banner pointer to MIGRATION.md for upgraders.
- Fixed `num_epochs=` -> `max_epochs=` in the custom-training example
  (the old kwarg was wrong; PL Trainer has always taken `max_epochs`).

SIMS_tutorial.ipynb
- Three targeted edits, preserved as a clean 3-line diff:
  1. Updated the "Before you begin" cell to say Python 3.10+ instead
     of "between 3.8 and 3.11" (which became false starting with the
     v4 stack on Python 3.13).
  2. Switched the EarlyStopping/ModelCheckpoint import from
     `pytorch_lightning.callbacks` to `lightning.pytorch.callbacks`.
  3. Same import path fix in the WandbLogger note in cell 28.
- The tutorial already used the canonical pred_0/prob_0 column names,
  so no other changes were needed there. The full smoke test
  (train -> predict -> explain) demonstrated in the notebook continues
  to match what scsims actually does.

Verified: full pytest suite (10 passed, 1 skipped) still green after
all docs work; sims_app streamlit boots clean against scsims v4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v4 has a lot of changes; pyproject.toml now reflects that.
JesusGF1 merged commit b7851e6 into main on Apr 8, 2026
3 checks passed