Add compute_episodic_return_on_last_step by QuantuMope · Pull Request #1830 · HorizonRobotics/alf

QuantuMope · 2026-03-27T17:07:34Z

Currently, MC returns are only computed when a discount of 0 is encountered. As a result, MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to compute returns upon step_type.LAST instead.
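To make the two triggers concrete, here is a minimal sketch in plain Python. It is not ALF's implementation: the `FIRST`/`MID`/`LAST` constants, the `episodic_returns` helper, and its `on_last_step` argument are hypothetical stand-ins for the step types and the new flag, and the reward indexing is simplified.

```python
# Hypothetical step-type constants for illustration only (not ALF's definitions).
FIRST, MID, LAST = 0, 1, 2


def episodic_returns(rewards, discounts, step_types, gamma=0.99,
                     on_last_step=False):
    """Compute per-step discounted MC returns for each finished episode.

    An episode counts as finished at step t when either
      * discounts[t] == 0 (on_last_step=False), or
      * step_types[t] == LAST (on_last_step=True).
    Returns a list of (end_index, returns) pairs.
    """
    completed = []
    start = 0
    for t in range(len(rewards)):
        if on_last_step:
            done = step_types[t] == LAST
        else:
            done = discounts[t] == 0.0
        if not done:
            continue
        # Backward pass over the episode: G_k = r_k + gamma * G_{k+1}.
        g = 0.0
        returns = []
        for k in range(t, start - 1, -1):
            g = rewards[k] + gamma * g
            returns.append(g)
        returns.reverse()
        completed.append((t, returns))
        start = t + 1
    return completed


# A timeout-only episode: the last step is LAST but its discount stays 1.
rewards = [0.0, 1.0, 1.0, 1.0]
discounts = [1.0, 1.0, 1.0, 1.0]
step_types = [FIRST, MID, MID, LAST]

by_discount = episodic_returns(rewards, discounts, step_types, gamma=0.5)
by_last = episodic_returns(rewards, discounts, step_types, gamma=0.5,
                           on_last_step=True)
```

With the discount-based trigger, the timeout-only episode never produces a return; with the LAST-based trigger, it does.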

emailweixu · 2026-03-27T20:54:02Z

> Currently, MC returns are only computed when a discount of 0 is encountered. As a result, MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to compute returns upon step_type.LAST instead.

Why don't we set discount to 0 for LAST step?

Haichao-Zhang · 2026-03-27T21:00:47Z

> This makes it so that MC returns cannot be computed if we do not terminate on success

Continuing to add flags might not be ideal and could be confusing to users?
I'm not sure what kind of use case you are considering, but typically, upon task success, we set the step type to LAST and the discount to 0?

QuantuMope · 2026-03-27T21:01:20Z

> > Currently, MC returns are only computed when a discount of 0 is encountered. As a result, MC returns cannot be computed if we do not terminate on success. This PR adds a flag, compute_episodic_return_on_last_step, to compute returns upon step_type.LAST instead.
>
> Why don't we set discount to 0 for LAST step?

This is for RL setups where we don't terminate on success and only terminate upon timeout.
For EmbodiedGen resets, batch resetting all environments at the same step is often more efficient and sometimes the only possible strategy (e.g., if lighting randomization is turned on, individual envs cannot be reset on GPU).

QuantuMope · 2026-03-27T21:02:58Z

> > This makes it so that MC returns cannot be computed if we do not terminate on success
>
> Continuing to add flags might not be ideal and could be confusing to users?
>
> I'm not sure what kind of use case you are considering, but typically, upon task success, we set the step type to LAST and the discount to 0?

Would making this the default behavior be better?
Most of our OpenVLAPPO, PPOFlow, and SACFlow RL training uses timeout termination only. This would model an infinite-horizon RL formulation.

emailweixu · 2026-03-28T00:18:09Z

Why can't we set discount to 0 at the time of setting step type to LAST?

QuantuMope · 2026-04-01T20:41:11Z

> Why can't we set discount to 0 at the time of setting step type to LAST?

Accidentally edited the question comment instead of replying. Simply fixing that.

I think in practice we could, but it would be less correct to do so? We're essentially modeling an infinite-horizon problem. Applying a discount of 0 at timeout, when there were numerous idle post-success steps beforehand, may confuse the model?
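The concern can be illustrated with a one-step TD target. This is a sketch under assumptions, not code from the PR: `td_target` is a hypothetical helper, and the numbers assume an idle post-success state that yields a reward of 1 per step with gamma = 0.9, so its infinite-horizon value is 1 / (1 - 0.9) = 10.

```python
def td_target(reward, discount, next_value, gamma=0.9):
    """One-step TD target: r + gamma * discount * V(s').

    `discount` is the environment-provided discount at the next step
    (0 cuts off bootstrapping, 1 keeps it).
    """
    return reward + gamma * discount * next_value


# Assumed value of the idle post-success state under an infinite horizon.
idle_value = 10.0

# Timeout step keeps discount 1: the target still bootstraps from V(s').
target_bootstrapped = td_target(1.0, 1.0, idle_value)

# Timeout step forces discount 0: the target collapses to the reward alone,
# even though the state looks the same as earlier idle steps.
target_cut = td_target(1.0, 0.0, idle_value)
```

Under these assumptions, the same-looking state receives two very different targets depending only on how close the timeout is, which is the inconsistency the comment above is worried about.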

QuantuMope · 2026-04-01T20:53:41Z

After some investigation, this flag is necessary to maintain correctness during TD bootstrapping for the infinite-horizon RL formulation.

There is an important caveat: if compute_episodic_return_on_last_step is True, the computed MC return can be heavily biased. Added an explanation of why in commit 35a6b6d.
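The commit's own explanation is not reproduced in this thread. One plausible source of bias, stated here as an assumption, is truncation: a return computed at a timeout omits all rewards that an infinite-horizon episode would have collected afterwards. The sketch below uses a hypothetical `discounted_return` helper and a constant reward of 1 per step to show the size of that gap.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9
steps_to_timeout = 3  # assumed number of steps left before the timeout

# MC return seen by a state 3 steps before the timeout (reward 1 per step).
truncated = discounted_return([1.0] * steps_to_timeout, gamma)

# Infinite-horizon return of the same reward stream: 1 / (1 - gamma).
infinite = 1.0 / (1.0 - gamma)

# The missing tail equals gamma**n / (1 - gamma) for n remaining steps.
gap = infinite - truncated
expected_gap = gamma ** steps_to_timeout / (1.0 - gamma)
```

The gap shrinks geometrically with the number of remaining steps, so under this assumption states close to the timeout are underestimated the most.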

add compute_episodic_return_on_last_step

9ffa313

QuantuMope requested review from Haichao-Zhang and emailweixu March 27, 2026 17:07

Add caveat of new arg

35a6b6d

emailweixu approved these changes Apr 9, 2026


QuantuMope merged commit a02762b into pytorch Apr 15, 2026
2 checks passed

QuantuMope deleted the PR/andrew/compute_episodic_return_on_last_step branch April 15, 2026 05:25

QuantuMope merged 2 commits into pytorch from PR/andrew/compute_episodic_return_on_last_step