Currently, MC returns only get computed when encountering a discount of 0, which means MC returns cannot be computed if we do not terminate on success. This PR adds a flag `compute_episodic_return_on_last_step` to compute returns upon `step_type.LAST` instead.
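A minimal sketch of the behavior change, for illustration only: the function and step-type constants below are hypothetical stand-ins, not the library's actual API; only the flag name `compute_episodic_return_on_last_step` comes from this PR. It does a backward scan computing Monte Carlo returns, treating a step as an episode boundary either when its discount is 0 or, when the flag is set, when its step type is `LAST`:

```python
# Simplified step-type codes (hypothetical; mirroring FIRST/MID/LAST semantics).
FIRST, MID, LAST = 0, 1, 2

def mc_returns(rewards, discounts, step_types,
               compute_episodic_return_on_last_step=False):
    """Backward scan computing Monte Carlo returns per step.

    An episode boundary is a step with discount == 0, or (when the flag
    is set) any step with step_type == LAST. The flag lets returns be
    computed even when episodes end by timeout with discount == 1.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        boundary = (discounts[t] == 0.0 or
                    (compute_episodic_return_on_last_step
                     and step_types[t] == LAST))
        if boundary:
            running = 0.0  # do not accumulate across episode boundaries
        running = rewards[t] + discounts[t] * running
        returns[t] = running
    return returns
```

With two back-to-back timeout-terminated episodes (all discounts 1), the flag prevents returns from leaking across the boundary; without it, the second episode's rewards bleed into the first episode's returns.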
> This makes it so that MC returns cannot be computed if we do not terminate on success
Adding more and more flags might not be ideal and might be a bit confusing to users?
> This makes it so that MC returns cannot be computed if we do not terminate on success

Not sure what kind of use case you are considering, but typically, upon task success, we set the step type to LAST and the discount to 0?
Why don't we set the discount to 0 on the LAST step?
This is for RL setups where we don't terminate on success and only terminate upon timeout.
For EmbodiedGen resets, batch-resetting all environments at the same step is often more efficient and sometimes the only possible strategy (e.g., if lighting randomization is turned on, individual envs cannot be reset on the GPU).
Would making this the default behavior be better?
Most of our OpenVLAPPO, PPOFlow, and SACFlow RL training uses timeout-only termination. This models an infinite-horizon RL formulation.
Why can't we set the discount to 0 at the time of setting the step type to LAST?
I think in practice we could, but it would be less correct to do so? We're essentially modeling an infinite-horizon problem; applying a discount of 0 at timeout, when there were numerous successful idle steps beforehand, may confuse the model.