
Add compute_episodic_return_on_last_step#1830

Merged
QuantuMope merged 2 commits into pytorch from PR/andrew/compute_episodic_return_on_last_step
Apr 15, 2026

Conversation

@QuantuMope
Contributor

Currently, MC returns only get computed when encountering a discount of 0. This makes it so that MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to instead compute returns upon step_type.LAST.
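For intuition, here is a minimal sketch of what the flag changes. The names (MID, LAST, the function itself) are assumed for illustration and are not from the actual codebase: Monte Carlo returns are accumulated backward over a trajectory, and the flag adds step_type.LAST as an episode boundary in addition to discount == 0.

```python
import numpy as np

# Hypothetical step-type constants for illustration; the real StepType
# enum in the codebase may differ.
MID, LAST = 0, 1

def monte_carlo_returns(rewards, discounts, step_types, gamma=0.99,
                        compute_episodic_return_on_last_step=False):
    """Backward-accumulate Monte Carlo returns over a trajectory.

    By default an episode boundary is a step whose discount is 0 (the
    gamma * discount * acc term then vanishes on its own). With the flag
    enabled, a step_type of LAST also zeroes the accumulated future, so
    returns can still be computed when episodes end only by timeout and
    the discount never reaches 0.
    """
    returns = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        if compute_episodic_return_on_last_step and step_types[t] == LAST:
            acc = 0.0  # cut the return at the episode's last step
        acc = rewards[t] + gamma * discounts[t] * acc
        returns[t] = acc
    return returns
```

With two timeout-terminated episodes concatenated (all discounts 1), the default accumulation leaks across the episode boundary, while the flag resets it at each LAST step.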

@emailweixu
Contributor

Currently, MC returns only get computed when encountering a discount of 0. This makes it so that MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to instead compute returns upon step_type.LAST.

Why don't we set discount to 0 for LAST step?

@Haichao-Zhang
Contributor

This makes it so that MC returns cannot be computed if we do not terminate on success

  1. Adding more and more flags might not be ideal and might be a bit confusing to users?
  2. "This makes it so that MC returns cannot be computed if we do not terminate on success" --> not sure what kind of use case you are considering, but typically, upon task success, we set the step type to LAST and the discount to 0?

@QuantuMope
Contributor Author

Currently, MC returns only get computed when encountering a discount of 0. This makes it so that MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to instead compute returns upon step_type.LAST.

Why don't we set discount to 0 for LAST step?

This is for RL setups where we don't terminate on success and only terminate upon timeout.
For EmbodiedGen resets, batch resetting all environments at the same step is often more efficient and sometimes the only possible strategy (e.g., if lighting randomization is turned on, individual envs cannot be reset on the GPU).

@QuantuMope
Contributor Author

QuantuMope commented Mar 27, 2026

This makes it so that MC returns cannot be computed if we do not terminate on success

  1. Adding more and more flags might not be ideal and might be a bit confusing to users?
  2. "This makes it so that MC returns cannot be computed if we do not terminate on success" --> not sure what kind of use case you are considering, but typically, upon task success, we set the step type to LAST and the discount to 0?
  1. Would making this the default behavior be better?
  2. Most of our OpenVLAPPO, PPOFlow, and SACFlow RL training uses timeout termination only. This models an infinite-horizon RL formulation.
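To make the distinction concrete, here is a sketch (the function name is assumed, not from the PR) of a TD(0) target, contrasting true termination with a pure timeout in the infinite-horizon formulation:

```python
def td_target(reward, discount, gamma, next_value):
    # discount == 0 encodes true termination, so the bootstrap term
    # vanishes; at a timeout-only LAST step the discount stays 1 and we
    # still bootstrap from V(s'), which is why zeroing the discount
    # there would bias value learning.
    return reward + gamma * discount * next_value

# True termination on success: the target is just the reward.
terminal_target = td_target(reward=1.0, discount=0.0, gamma=0.99, next_value=5.0)
# Timeout in an infinite-horizon task: the bootstrap term survives.
timeout_target = td_target(reward=1.0, discount=1.0, gamma=0.99, next_value=5.0)
```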

@emailweixu
Contributor

emailweixu commented Mar 28, 2026

Why can't we set discount to 0 at the time of setting step type to LAST?

@QuantuMope
Contributor Author

QuantuMope commented Apr 1, 2026

Why can't we set discount to 0 at the time of setting step type to LAST?

Accidentally edited the question comment instead of replying. Simply fixing that.

I think in practice we could, but it would be less correct to do so. We're essentially modeling an infinite-horizon problem. Applying a discount of 0 at timeout, when there were numerous idle success steps beforehand, may confuse the model.

@QuantuMope
Contributor Author

After some investigation, this flag is necessary to maintain correctness during TD bootstrapping for the infinite-horizon RL formulation.

There is an important caveat: if compute_episodic_return_on_last_step==True, the computed MC return can be heavily biased. Added an explanation of why in commit 35a6b6d.
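A toy calculation (my own numbers, not the analysis from commit 35a6b6d) shows where the bias comes from: truncating the MC return at a timeout drops the entire discounted tail of the infinite-horizon return.

```python
# With a constant reward of 1 and gamma = 0.99, the true infinite-horizon
# return is the geometric series 1 / (1 - gamma) = 100, but an MC return
# truncated at a timeout after T steps only sums the first T discounted
# rewards, so the earlier the cut, the larger the underestimate.
gamma = 0.99
true_return = 1.0 / (1.0 - gamma)              # 100.0
T = 10
truncated = sum(gamma ** k for k in range(T))  # well below true_return
bias = true_return - truncated
```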

@QuantuMope QuantuMope merged commit a02762b into pytorch Apr 15, 2026
2 checks passed
@QuantuMope QuantuMope deleted the PR/andrew/compute_episodic_return_on_last_step branch April 15, 2026 05:25
