
Add compute_episodic_return_on_last_step#1830

Open
QuantuMope wants to merge 1 commit into pytorch from PR/andrew/compute_episodic_return_on_last_step

Conversation

@QuantuMope
Contributor

Currently, MC returns are only computed when a discount of 0 is encountered. This means MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to compute returns upon step_type.LAST instead.
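To make the behavioral difference concrete, here is a minimal, illustrative sketch of backward Monte Carlo return accumulation with the two episode-boundary conventions discussed in this PR. This is not ALF's actual implementation; the `step_type` constants and the function signature are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical step_type encoding, mirroring FIRST/MID/LAST conventions.
FIRST, MID, LAST = 0, 1, 2

def episodic_returns(rewards, discounts, step_types, gamma=0.99,
                     compute_episodic_return_on_last_step=False):
    """Accumulate Monte Carlo returns backward over a flat trajectory.

    An episode boundary is detected either via discount == 0 (the
    default behavior described in the PR) or via step_type == LAST
    when the flag is set, so returns can still be computed when
    episodes end by timeout with discount kept at 1.
    """
    T = len(rewards)
    returns = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        if compute_episodic_return_on_last_step:
            is_boundary = step_types[t] == LAST
        else:
            is_boundary = discounts[t] == 0.0
        if is_boundary:
            acc = 0.0  # restart accumulation at the episode boundary
        acc = rewards[t] + gamma * discounts[t] * acc
        returns[t] = acc
    return returns
```

With the flag set, a timeout trajectory whose discounts stay at 1 throughout still gets well-defined per-step returns, because the boundary comes from step_type.LAST rather than the discount.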

@emailweixu
Contributor

> Currently, MC returns only get computed when encountering a discount of 0. This makes it so that MC returns cannot be computed if we do not terminate on success. This PR adds a flag compute_episodic_return_on_last_step to instead compute returns upon step_type.LAST instead.

Why don't we set discount to 0 for LAST step?

@Haichao-Zhang
Contributor

> This makes it so that MC returns cannot be computed if we do not terminate on success

  1. Keeping adding more flags might not be ideal and could be a bit confusing to users?
  2. "This makes it so that MC returns cannot be computed if we do not terminate on success" --> not sure what kind of use case you are considering, but typically, upon task success, we set the step type to LAST and the discount to 0?

@QuantuMope
Contributor Author

> > Currently, MC returns only get computed when encountering a discount of 0. This makes it so that MC returns cannot be computed if we do not terminate on success. This PR adds a flag compute_episodic_return_on_last_step to instead compute returns upon step_type.LAST instead.
>
> Why don't we set discount to 0 for LAST step?

This is for RL setups where we don't terminate on success and only terminate upon timeout.
For EmbodiedGen resets, batch-resetting all environments at the same step is often more efficient and sometimes the only possible strategy (e.g., when lighting randomization is turned on, individual envs cannot be reset on the GPU).

@QuantuMope
Contributor Author

QuantuMope commented Mar 27, 2026

> > This makes it so that MC returns cannot be computed if we do not terminate on success
>
> 1. keep adding more flags might not be ideal and might be a bit confusing to users?
> 2. This makes it so that MC returns cannot be computed if we do not terminate on success --> not sure what kind of use case you are considering, but typically, upon task success, we set step type as LAST + discount as 0?

  1. Would making this the default behavior be better?
  2. Most of our OpenVLAPPO, PPOFlow, and SACFlow RL training uses timeout termination only. This models an infinite-horizon RL formulation.

@emailweixu
Contributor

emailweixu commented Mar 28, 2026

> Why can't we set discount to 0 at the time of setting step type to LAST?

I think in practice we could, but it would be less correct to do so. We're essentially modeling an infinite-horizon problem. Applying a discount of 0 at timeout, when there may have been numerous successful idle steps prior, could confuse the model?
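The infinite-horizon concern above can be illustrated with a simple value-target calculation. This is a hedged sketch with made-up numbers, not code from this PR: it shows how forcing discount to 0 at a timeout step drops the critic's estimate of future value, whereas keeping discount at 1 bootstraps through the artificial episode cut.

```python
gamma = 0.99
reward_T = 1.0     # reward received at the timeout step (illustrative)
value_next = 50.0  # critic's estimate of value after the cut (illustrative)

# Terminal-style timeout (discount forced to 0): future value is discarded,
# as if the task truly ended, even though the agent was still succeeding.
target_terminal = reward_T + gamma * 0.0 * value_next

# Infinite-horizon timeout (discount kept at 1): bootstrap through the cut,
# so the target still reflects the value of continuing.
target_bootstrap = reward_T + gamma * 1.0 * value_next
```

The two targets differ substantially (1.0 vs 50.5 here), which is why zeroing the discount at an arbitrary timeout can mislead value learning under an infinite-horizon formulation.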
