Skip to content

optional flag to generate validation videos for each task#475

Open
RyanPCo wants to merge 1 commit into
ryanco/pi-train-human-newfrom
ryanco/valid-per-task
Open

optional flag to generate validation videos for each task#475
RyanPCo wants to merge 1 commit into
ryanco/pi-train-human-newfrom
ryanco/valid-per-task

Conversation

@RyanPCo

@RyanPCo RyanPCo commented May 27, 2026

Copy link
Copy Markdown
Contributor

No description provided.

RyanPCo commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@RyanPCo RyanPCo force-pushed the ryanco/pi-train-human-new branch from 9fcdebd to c66446a Compare June 11, 2026 17:45
@RyanPCo RyanPCo force-pushed the ryanco/valid-per-task branch from 8ad9764 to 35ef260 Compare June 11, 2026 17:45
@github-actions

Copy link
Copy Markdown

Claude Code Review

Summary

Adds an opt-in one_video_per_task mode to EvalVideo that buckets validation frames by (embodiment, task) and emits one mp4 per task, plumbs a per-sample task string from ZarrDataset through the PI batch processor, and ships a corresponding eval-only Mecka data config plus a small SQL reporting script.

Key concerns

  1. task collation through the DataLoader is fragile. ZarrDataset.__getitem__ now adds data["task"] = self.metadata.get("task_name", "unknown") (a Python str). PyTorch's default collate will turn a batch of strings into a tuple[str, ...] of length B, which is what eval_video.py iterates over — good. But other consumers (e.g., anything that does batch[emb].to(device) or tensor-only assumptions) may now choke on a non-tensor key. In pi.py you guard with if "task" in _batch, but the loop right below it (if isinstance(value, torch.Tensor)) correctly skips non-tensors — confirm the same is true for act.py/hpt.py paths, otherwise non-PI runs may break. Worth a quick grep for .items() over processed_batch.

  2. Key name mismatch: task_name vs task. The script top_tasks_by_hours.py groups by df["task"] (SQL column), but ZarrDataset reads self.metadata.get("task_name", ...). Please confirm the zarr attr is actually called task_name (matches CONTRIBUTING_DATA.md schema) — if the canonical name is task, this will silently produce "unknown" for every sample and you'll get a single unknown.mp4 per embodiment with no error.

  3. Filter lambdas hard-code episode hashes inline. The mecka_pi_eval.yaml filter pins 5 specific episode_hash values via a lambda. This is fine for a one-off eval recipe, but (a) the hashes look truncated (24 chars vs the usual 26-char UTC-derived format YYYY-MM-DD-HH-MM-SS-ffffff) — please double-check these are real episode_hash values and not Mongo-style ids from another table, and (b) the comment maps them to task names — consider filtering by task directly so the config self-documents and survives episode re-ingestion.

  4. max_frames_per_task=1000 default + no mid-flush. In per-task mode you removed the 1000-frame mid-flush "because limit_val_batches bounds it." With limit_val_batches=400 and batch_size=32 that's up to 12,800 frames per task before the cap. The 1000 cap saves you, but if a user sets max_frames_per_task=null to "disable" they'll buffer all 12k+ frames per task in CPU RAM as a Python list of tensors before the single torch.stack at on_validation_end. Worth a comment in the config that null is dangerous, or just keep a hard mid-flush ceiling.

  5. CLAUDE.md committed to repo. Not a blocker, but confirm the team wants this tracked rather than gitignored — it duplicates CONTRIBUTING_DATA.md content and will drift.

Suggestions

  • In eval_video.py, after tasks = batch[embodiment_id].get("task"), assert len(tasks) == frames_tensor.shape[0] — silent mis-alignment here would assign frames to the wrong task bucket with no error.
  • The _sanitize_task fallback to "unknown" will collide all unlabeled episodes into one mp4. Consider including episode_hash[:8] in the filename when task is "unknown" so debugging is easier.
  • mecka_pi_eval.yaml: the valid_datasets block re-uses _target_ via interpolation but redeclares filters. Cleaner to use Hydra's ${...} for the resolver + override only filters and mode — which is what you did, good. But _target_: ${data.train_datasets.mecka_bimanual._target_} is unusual; just write the literal string to avoid resolver-order surprises with +evaluator=.
  • top_tasks_by_hours.py: add --lab / --embodiment as first-class flags rather than free-form --filter col=val — protects against typos like lab=Mecka silently returning empty.
  • The train filter and valid filter in mecka_pi_eval.yaml are identical. Either (a) the train side should be narrower (the docstring says "train side is unchanged (single task)" but it's not — it's all 5), or (b) update the comment. As written, train and valid sample the same episodes, which will produce optimistic eval numbers.

Verdict

Request Changes — primarily to resolve (2) the task_name vs task key check and (5) the train/valid filter mismatch with the stated intent. The core EvalVideo change is clean and well-scoped.


Reviewed by Claude · Review workflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant