optional viz videos on train dataset by RyanPCo · Pull Request #476 · GaTech-RL2/EgoVerse

RyanPCo · 2026-05-27T23:58:57Z

No description provided.

RyanPCo · 2026-05-27T23:59:13Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

remove rotation bounds checks for norm stats #480: 2 dependent PRs (#481, #490)
optional viz videos on train dataset #476 👈 (View in Graphite)
optional flag to generate validation videos for each task #475
pi train human new #473
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

github-actions · 2026-06-11T17:46:34Z

Claude Code Review

Summary

Adds an optional second validation pass that runs the configured evaluator against a train-sampled dataloader, writing videos to videos_train_viz/ and prefixing metrics with train_viz/. Also adds a standalone norm-stats CLI, a CPU Submitit launcher, two Mecka data configs, and an authoritative task override on ZarrDataset from the SQL registry.

Key concerns

check_val_every_n_epoch halved (200 → 100) in trainer/ddp_pi.yaml. This is a global change to the canonical Pi trainer config — every active Pi run now validates twice as often, which doubles eval cost and changes wall-clock for in-flight experiments. This seems unrelated to the PR's stated purpose. If it's intentional (because train-viz makes val more useful), it deserves a call-out; otherwise revert.
Backward-incompatible return type change in S3EpisodeResolver.sync_from_filters and _get_filtered_paths. Both now return a tuple (paths, hash_to_task) instead of paths. These are public-looking classmethods; any caller outside zarr_dataset_multi.py (scripts, notebooks, downstream tools) that still treats the result as a list of paths will break, either with a confusing unpacking error or by quietly iterating over the two-element tuple. Run a quick rg "sync_from_filters|_get_filtered_paths" to confirm, and consider returning a small dataclass or making hash_to_task an attribute set on the resolver instead.
mecka_pi_eval.yaml mode flip: train/valid → total on both train and valid datasets. This means the train and valid loaders now iterate the same episodes (the full set), which defeats train/val separation for that eval config. Is this intentional for an eval-only config? If so, add a comment; if not, this is a silent leak.
CombinedLoader unwrap in validation_step is fragile. The isinstance(batch, tuple) and len(batch) == 3 and isinstance(batch[0], dict) heuristic relies on Lightning's internal behavior when val_dataloader() returns a list of CombinedLoaders. This is undocumented and has changed across Lightning versions. At minimum:
- Pin/document the Lightning version this was tested against.
- Add an assertion or test that exercises this path on a tiny dummy datamodule.
- Consider returning a dict of CombinedLoaders or using CombinedLoader([...], "sequential") explicitly at the outer level rather than relying on Lightning to wrap a list.
compute_norm_stats.py reuses train_zarr_cartesian.yaml as config but only consumes cfg.data + cfg.norm_stats. It still requires model= / trainer= to resolve (Hydra composes the full config, including those groups), which is fine, but the docstring claim "model is never instantiated" should be verified: extras(cfg) and OmegaConf resolution can pull on ${...} interpolations. Worth a smoke test that confirms no GPU is touched.
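
For the return-type concern above, a minimal sketch of the "small dataclass" option. The class name FilteredEpisodes is hypothetical, and it assumes paths are plain strings and hash_to_task maps an episode hash to a task name; neither is confirmed by the diff. The sequence methods are one possible design choice so that callers written against the old list-of-paths return keep working:

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class FilteredEpisodes:
    """Hypothetical return type: episode paths plus a hash -> task mapping."""

    paths: List[str]  # assumption: paths are plain strings
    hash_to_task: Dict[str, str] = field(default_factory=dict)

    # Iterating, indexing, and len() behave like the old list of paths,
    # so existing callers do not need to change.
    def __iter__(self) -> Iterator[str]:
        return iter(self.paths)

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index):
        return self.paths[index]


result = FilteredEpisodes(paths=["ep_a", "ep_b"], hash_to_task={"h1": "fold"})
old_style_paths = list(result)      # old callers: treat the result as paths
task_lookup = result.hash_to_task   # new callers: named access, no unpacking
```

The trade-off is that tuple-style unpacking (paths, hash_to_task = ...) no longer works with this shape, so zarr_dataset_multi.py would read the two fields by name instead.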

Suggestions

Revert the check_val_every_n_epoch change or split it into its own PR with rationale.
In pl_data_utils.py, the val_dataloader() returning [valid, train_viz] couples to dataloader_idx=0/1 magic numbers in ModelWrapper.validation_step. Use a named structure (e.g., a dict {"valid": ..., "train_viz": ...} — Lightning supports this) so the wiring is explicit and dataloader_idx == 1 doesn't become load-bearing.
TrainVizEvalVideo calls super().__init__(...) with a fresh EvalVideo, but then delegates compute_metrics_and_viz to self.base. The wrapper's own EvalVideo state (buffers, etc.) and base's state will both exist — double-check that on_validation_start/on_validation_end don't double-flush or double-init wandb tables.
Add a unit test for the new task override path in ZarrDataset.__getitem__ — easy to test in isolation, and it's a correctness-critical change (downstream one_video_per_task grouping depends on it).
submitit_cpu_pace.yaml has a # confirm before first submit comment on the partition name — confirm before merge, since a wrong partition will fail at submit time, not config time.
In _get_filtered_paths, the str(t) != "nan" check for pandas NaN is brittle (it relies on string repr); use pd.isna(t).
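
To illustrate the last suggestion, a small sketch comparing the two checks. build_hash_to_task and the sample values are hypothetical stand-ins, not the code in _get_filtered_paths; the point is only that the string comparison lets None and pd.NA through, while pd.isna treats all three missing-value forms the same way:

```python
import pandas as pd


def build_hash_to_task(hashes, tasks):
    """Map episode hash -> task name, skipping rows whose task is missing."""
    return {h: str(t) for h, t in zip(hashes, tasks) if not pd.isna(t)}


hashes = ["h1", "h2", "h3", "h4"]
tasks = ["fold", float("nan"), None, pd.NA]

# String-based check: only float NaN is filtered out, because
# str(None) is "None" and str(pd.NA) is "<NA>".
string_check = {h: t for h, t in zip(hashes, tasks) if str(t) != "nan"}

# pd.isna-based check: NaN, None, and pd.NA are all treated as missing.
isna_check = build_hash_to_task(hashes, tasks)
```

With these inputs the string check keeps h1, h3, and h4, while the pd.isna version keeps only h1.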

Verdict: Request Changes

Main blockers: the trainer config change to check_val_every_n_epoch, the mecka_pi_eval mode flip to total, and the unannounced return-signature change on sync_from_filters. Everything else is fixable in a follow-up but the validation_step unwrap deserves a test.
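
As a starting point for that test, a sketch that pulls the unwrap heuristic into a standalone function and checks it without Lightning. unwrap_batch is a hypothetical name, and the triple shape (inner dict first, two more items after it) is assumed from the heuristic quoted above rather than from Lightning's documentation. This only pins down the heuristic itself; it does not replace a test that drives a tiny dummy datamodule through the Trainer:

```python
def unwrap_batch(batch):
    """Return the inner dict if `batch` looks like a wrapped 3-tuple, else `batch`.

    Mirrors the heuristic from the review: a tuple of length 3 whose first
    element is a dict is treated as a wrapped batch.
    """
    if isinstance(batch, tuple) and len(batch) == 3 and isinstance(batch[0], dict):
        return batch[0]
    return batch


def test_unwrap_batch():
    plain = {"obs": [1, 2, 3]}
    # A plain dict batch passes through untouched.
    assert unwrap_batch(plain) is plain
    # A wrapped triple yields the inner dict.
    assert unwrap_batch((plain, 0, 1)) is plain
    # Tuples that do not match the heuristic are returned unchanged.
    pair = (plain, 0)
    assert unwrap_batch(pair) is pair
    not_dict_first = ([1, 2], 0, 1)
    assert unwrap_batch(not_dict_first) is not_dict_first


test_unwrap_batch()
```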

Reviewed by Claude · Review workflow

RyanPCo mentioned this pull request May 27, 2026

pi train human new #473

Open

RyanPCo mentioned this pull request May 27, 2026

optional flag to generate validation videos for each task #475

Open

RyanPCo force-pushed the ryanco/train-viz branch 2 times, most recently from 99afe9e to d62b8a6 Compare May 29, 2026 01:37

This was referenced Jun 1, 2026

add custom configs to specify tasks for viz #481

Closed

remove rotation bounds checks for norm stats #480

Open

6D rotation normalization to prevent large changes from euler #490

Open

RyanPCo force-pushed the ryanco/train-viz branch from d62b8a6 to dafbc54 Compare June 6, 2026 21:45

RyanPCo force-pushed the ryanco/valid-per-task branch 2 times, most recently from d6ea265 to 561fa3d Compare June 7, 2026 19:43

RyanPCo force-pushed the ryanco/train-viz branch from dafbc54 to 4e9d681 Compare June 7, 2026 19:43

RyanPCo mentioned this pull request Jun 8, 2026

exp short state history memory #492

Draft

RyanPCo force-pushed the ryanco/valid-per-task branch from 561fa3d to 8ad9764 Compare June 8, 2026 20:45

RyanPCo force-pushed the ryanco/train-viz branch from 4e9d681 to 7ce68e7 Compare June 8, 2026 20:45

RyanPCo marked this pull request as ready for review June 8, 2026 20:47

RyanPCo mentioned this pull request Jun 8, 2026

local: personal env overrides — Do not merge #493

Draft

optional viz videos on train dataset

d67ed04

RyanPCo force-pushed the ryanco/train-viz branch from 7ce68e7 to d67ed04 Compare June 11, 2026 17:45

RyanPCo force-pushed the ryanco/valid-per-task branch from 8ad9764 to 35ef260 Compare June 11, 2026 17:45

This was referenced Jun 11, 2026

data-quality: episode review tool + mecka folding_clothes labels #494

Draft

exp task segment labels #498

Draft

better val metrics #499

Draft

RyanPCo wants to merge 1 commit into ryanco/valid-per-task from ryanco/train-viz
