Skip to content

Pi rollout: load tokenizer-collate flags from .hydra/config.yaml#401

Open
rl2aloha wants to merge 1 commit into
elmo/norm-collatefrom
elmo/pi-rollout-fix
Open

Pi rollout: load tokenizer-collate flags from .hydra/config.yaml#401
rl2aloha wants to merge 1 commit into
elmo/norm-collatefrom
elmo/pi-rollout-fix

Conversation

@rl2aloha

@rl2aloha rl2aloha commented May 8, 2026

Copy link
Copy Markdown
Contributor

The trained pi prompt format (Task: ...; Action: ...) is governed by
data.{proprio,embodiment_label,control_mode,state_num_bins} in the
hydra config. The .ckpt only stores the model: subtree (see
trainHydra._build_model_config_tree), so read the flags from the
run-output snapshot at <run_dir>/.hydra/config.yaml instead.

  • rollout.py: build_tokenized_collate now wired from the run snapshot
    rather than hardcoded args. data["embodiment"] is set to the
    integer id; previously it was the string "eva_bimanual", which
    _embodiment_name's int(eid) cast would have crashed on.
  • hydra_utils.py: new find_run_snapshot_path / load_run_snapshot
    helpers for reading hydra run-output config.yaml without
    re-resolving runtime interpolations like
    ${data.dataset.data_schematic}.

Debug prints kept (TODO-marked) to ease verification on the rollout
host.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

rl2aloha commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@rl2aloha rl2aloha changed the base branch from main to graphite-base/401 May 24, 2026 07:44
@Boey-li Boey-li force-pushed the elmo/pi-rollout-fix branch from 25554ba to 1ce450b Compare May 24, 2026 07:44
@rl2aloha rl2aloha changed the base branch from graphite-base/401 to elmo/norm-collate May 24, 2026 07:44
@rl2aloha rl2aloha marked this pull request as ready for review May 24, 2026 07:44
The trained pi prompt format (`Task: ...; Action: ...`) is governed by
`data.{proprio,embodiment_label,control_mode,state_num_bins}` in the
hydra config. The `.ckpt` only stores the `model:` subtree (see
`trainHydra._build_model_config_tree`), so read the flags from the
run-output snapshot at `<run_dir>/.hydra/config.yaml` instead.

- rollout.py: `build_tokenized_collate` now wired from the run snapshot
  rather than hardcoded args. `data["embodiment"]` is set to the
  integer id; previously it was the string `"eva_bimanual"`, which
  `_embodiment_name`'s `int(eid)` cast would have crashed on.
- hydra_utils.py: new `find_run_snapshot_path` / `load_run_snapshot`
  helpers for reading hydra run-output `config.yaml` without
  re-resolving runtime interpolations like
  `${data.dataset.data_schematic}`.

Debug prints kept (TODO-marked) to ease verification on the rollout
host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Claude Code Review

PR Review: Pi rollout tokenizer-collate flags from snapshot

Summary

Reads the pi prompt-formatting flags (proprio, embodiment_label, control_mode, state_num_bins) from the run's .hydra/config.yaml snapshot rather than hardcoding them at rollout time, and fixes a latent crash where data["embodiment"] was a string instead of the integer embodiment id expected by _embodiment_name.

Key concerns

  1. _load_policy calls the snapshot loader twice. find_run_snapshot_path(...) is called once for the log line, then load_run_snapshot(self.policy_path) walks up again. Minor, but you already have snapshot_path — just OmegaConf.to_container(OmegaConf.load(snapshot_path)) directly, or have load_run_snapshot accept a precomputed path. Avoids a possible inconsistency if the two walks ever diverge.

  2. Silent fallback to defaults when snapshot is missing. If .hydra/config.yaml isn't found, _ckpt_cfg is None and _build_collate_from_checkpoint_cfg silently uses proprio=False, embodiment_label=False, control_mode=None, state_num_bins=256. For a model trained with proprio=True, this will tokenize a structurally different prompt than training and the rollout will look like it works but produce garbage actions. This should be a hard error, not a warning — at minimum when any of these flags can't be inferred.

  3. model_name lookup path is wrong. data_cfg.get("model_name")model_name is a model-side config, not under data:. The default "google/paligemma-3b-mix-224" will always be used. If that's intentional (pi 0.5 always uses paligemma tokenizer), drop the lookup to avoid implying it's configurable; otherwise pull it from cfg["model"].

  4. proprio_keys default is None. If proprio=True was set during training but proprio_keys isn't present at this path in the snapshot, build_tokenized_collate will get None and likely crash or silently skip proprio tokens. Worth asserting proprio_keys is not None when proprio is true.

  5. Debug prints in hot path. The print(f"[rollout][debug] sampled_prompt: ...") inside what looks like the per-step rollout loop (line ~421) will spam stdout every control step. The TODO is noted, but please gate behind a self.debug flag or remove before merging — easy to forget and it'll degrade rollout latency on the robot.

Suggestions

  • Make missing snapshot fatal (or at least fatal when arm/embodiment implies pi 0.5):
    if snapshot_path is None:
        raise RuntimeError(f"No .hydra/config.yaml found near {self.policy_path}; cannot reconstruct tokenizer collate flags")
  • Pass the resolved snapshot_path into load_run_snapshot to avoid the double walk.
  • Add a brief sanity log of the constructed collate config (the flags actually used), not just what was read — so it's obvious in logs when defaults kicked in.
  • Consider a small unit test for find_run_snapshot_path (checkpoint nested under checkpoints/, snapshot at run root) — it's the kind of path-walking code that breaks quietly.
  • The embodiment_id fix is correct and good catch; worth a one-line comment pointing at build_tokenized_collate._embodiment_name so the next reader doesn't "fix" it back.

Verdict: Request Changes

Mainly for concern #2 (silent defaults are dangerous for prompt format — a mismatched tokenization is exactly the kind of bug that wastes a day of rollout debugging) and #3 (model_name lookup is in the wrong subtree). The core approach — reading from .hydra/config.yaml rather than the ckpt — is right, and the embodiment-id fix is a real bugfix.


Reviewed by Claude · Review workflow

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Claude Code Review

Review

Summary

Fixes pi rollout by loading tokenizer-collate flags from the hydra run snapshot (since .ckpt only stores the model: subtree) and corrects an embodiment field type bug that would have crashed _embodiment_name's int() cast.

Key concerns

  1. load_run_snapshot re-walks the tree. In _load_policy you call find_run_snapshot_path(self.policy_path) for the log message, then call load_run_snapshot(self.policy_path) which walks again. Minor, but you can just pass snapshot_path directly to OmegaConf.load, or have load_run_snapshot accept a pre-resolved path. Also a race: if the snapshot disappears between the two calls you'd log a path then return None. Not realistic but trivially avoidable.

  2. proprio_keys silently defaults to None. data_cfg.get("proprio_keys") will return None if the key is missing from the snapshot (e.g. older runs). If proprio=True but proprio_keys is None, does build_tokenized_collate handle that gracefully, or will it crash deep in collation at rollout time? Worth either asserting up front or logging a clear warning when proprio=True and proprio_keys is None.

  3. Silent fallback when snapshot is missing. If .hydra/config.yaml isn't found, you print a warning and build the collate with all defaults (proprio=False, embodiment_label=False, etc.). For a model trained with proprio=True, this will silently produce wrong prompts and bad rollouts. Recommend hard-failing when the snapshot is missing for a pi checkpoint — silent prompt-format mismatch is exactly the kind of thing that wastes a day of debugging on the robot.

  4. self.embodiment_id provenance. The diff assumes self.embodiment_id is set and is an int matching what training used. Worth confirming this is set in __init__ for all three arm modes (both/right/left) and that the int matches the embodiment enum the model was trained against — otherwise the prompt's embodiment token will be wrong.

  5. Debug prints. Acknowledged as TODO, but please file a follow-up issue or wrap them behind a --verbose flag before this lands on a shared rollout host — print on every step (the sampled_prompt print is inside the action loop) will spam stdout.

Suggestions

  • Pass the resolved snapshot_path into load_run_snapshot to avoid the double walk, or just inline OmegaConf.to_container(OmegaConf.load(snapshot_path), resolve=False) in _load_policy.
  • Raise (not warn) when snapshot is missing on a pi checkpoint — gate by checking the algo type if you want it lenient for non-pi rollouts.
  • Add an assertion: if data_cfg.get("proprio") and not data_cfg.get("proprio_keys"): raise ....
  • Move the in-loop sampled_prompt print behind if self.debug: or log it once at startup.
  • Consider a small unit test for find_run_snapshot_path (root, missing, nested ckpt dir) — it's pure path logic and trivially testable.
  • Nit: _load_checkpoint_cfg loads the full checkpoint just to read the config tree; _patch_checkpoint_paths then re-saves it. Fine, but since _ckpt_cfg is now sourced from the snapshot, the only remaining reason to read the ckpt's config_tree is the weight-path patch — that's the right factoring, just noting it.

Verdict

Request Changes — mainly to convert the silent-fallback behavior (concerns #2, #3) into a hard failure, and to demote the per-step debug print. The core fix (snapshot loading + int embodiment id) is correct and addresses a real crash.


Reviewed by Claude · Review workflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant