Skip to content

optional viz videos on train dataset#476

Open
RyanPCo wants to merge 1 commit into
ryanco/valid-per-taskfrom
ryanco/train-viz
Open

optional viz videos on train dataset#476
RyanPCo wants to merge 1 commit into
ryanco/valid-per-taskfrom
ryanco/train-viz

Conversation

@RyanPCo

@RyanPCo RyanPCo commented May 27, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@RyanPCo RyanPCo mentioned this pull request May 27, 2026

RyanPCo commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@RyanPCo RyanPCo force-pushed the ryanco/train-viz branch from 7ce68e7 to d67ed04 Compare June 11, 2026 17:45
@RyanPCo RyanPCo force-pushed the ryanco/valid-per-task branch from 8ad9764 to 35ef260 Compare June 11, 2026 17:45
@github-actions

Copy link
Copy Markdown

Claude Code Review

Summary

Adds an optional second validation pass that runs the configured evaluator against a train-sampled dataloader, writing videos to videos_train_viz/ and prefixing metrics with train_viz/. Also adds a standalone norm-stats CLI, a CPU Submitit launcher, two Mecka data configs, and an authoritative task override on ZarrDataset from the SQL registry.

Key concerns

  1. check_val_every_n_epoch halved (200 → 100) in trainer/ddp_pi.yaml. This is a global change to the canonical Pi trainer config — every active Pi run now validates twice as often, which doubles eval cost and changes wall-clock for in-flight experiments. This seems unrelated to the PR's stated purpose. If it's intentional (because train-viz makes val more useful), it deserves a call-out; otherwise revert.

  2. Backward-incompat return type change in S3EpisodeResolver.sync_from_filters and _get_filtered_paths. Both now return a tuple (paths, hash_to_task) instead of paths. These are public-looking classmethods — if anything outside zarr_dataset_multi.py (scripts, notebooks, downstream tools) calls them, it will break silently with a confusing unpacking error. Quick rg "sync_from_filters\|_get_filtered_paths" to confirm, and consider returning a small dataclass or making hash_to_task an attribute set on the resolver instead.

  3. mecka_pi_eval.yaml mode flip: train/validtotal on both train and valid datasets. This means the train and valid loaders now iterate the same episodes (the full set), which defeats train/val separation for that eval config. Is this intentional for an eval-only config? If so, add a comment; if not, this is a silent leak.

  4. CombinedLoader unwrap in validation_step is fragile. The isinstance(batch, tuple) and len(batch) == 3 and isinstance(batch[0], dict) heuristic relies on Lightning's internal behavior when val_dataloader() returns a list of CombinedLoaders. This is undocumented and has changed across Lightning versions. At minimum:

    • Pin/document the Lightning version this was tested against.
    • Add an assertion or test that exercises this path on a tiny dummy datamodule.
    • Consider returning a dict of CombinedLoaders or using CombinedLoader([...], "sequential") explicitly at the outer level rather than relying on Lightning to wrap a list.
  5. compute_norm_stats.py reuses train_zarr_cartesian.yaml as config but only consumes cfg.data + cfg.norm_stats. It still requires model= / trainer= to resolve (Hydra will instantiate the schema), which is fine, but the docstring claim "model is never instantiated" should be verified — extras(cfg) and OmegaConf resolution can pull on ${...} interpolations. Worth a smoke test that confirms no GPU is touched.

Suggestions

  • Revert the check_val_every_n_epoch change or split it into its own PR with rationale.
  • In pl_data_utils.py, the val_dataloader() returning [valid, train_viz] couples to dataloader_idx=0/1 magic numbers in ModelWrapper.validation_step. Use a named structure (e.g., a dict {"valid": ..., "train_viz": ...} — Lightning supports this) so the wiring is explicit and dataloader_idx == 1 doesn't become load-bearing.
  • TrainVizEvalVideo calls super().__init__(...) with a fresh EvalVideo, but then delegates compute_metrics_and_viz to self.base. The wrapper's own EvalVideo state (buffers, etc.) and base's state will both exist — double-check that on_validation_start/on_validation_end don't double-flush or double-init wandb tables.
  • Add a unit test for the new task override path in ZarrDataset.__getitem__ — easy to test in isolation, and it's a correctness-critical change (downstream one_video_per_task grouping depends on it).
  • submitit_cpu_pace.yaml has a # confirm before first submit comment on the partition name — confirm before merge, since a wrong partition will fail at submit time, not config time.
  • In _get_filtered_paths, the str(t) != "nan" check for pandas NaN is brittle (it relies on string repr); use pd.isna(t).

Verdict: Request Changes

Main blockers: the trainer config change to check_val_every_n_epoch, the mecka_pi_eval mode flip to total, and the unannounced return-signature change on sync_from_filters. Everything else is fixable in a follow-up but the validation_step unwrap deserves a test.


Reviewed by Claude · Review workflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant