Add compute_episodic_return_on_last_step #1830
Why don't we set discount to 0 for LAST step?
This is for RL setups where we don't terminate on success and only terminate upon timeout.
Why can't we set discount to 0 at the time of setting step type to LAST?
Accidentally edited the question comment instead of replying; fixing that now. I think in practice we could, but it would be less correct to do so. We're essentially modeling an infinite time horizon problem. Applying a discount of 0 at timeout, when there were numerous idle steps after success beforehand, may confuse the model.
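To make the concern concrete, here is a minimal sketch of a one-step TD target, assuming the common convention that the step's discount multiplies the bootstrapped next-state value. `td_target` and its arguments are illustrative names, not code from this PR.

```python
def td_target(reward: float, discount: float, next_value: float,
              gamma: float = 0.99) -> float:
    """One-step TD target: r + gamma * discount * V(s').

    Assumed convention: discount == 0 cuts off bootstrapping entirely,
    discount == 1 bootstraps from the next state's value as usual.
    """
    return reward + gamma * discount * next_value


# Timeout step with the discount left at 1: the target still bootstraps,
# which matches the infinite-horizon view of the task.
keeps_bootstrap = td_target(reward=1.0, discount=1.0, next_value=10.0, gamma=0.5)

# Same step with the discount forced to 0: the future value is dropped,
# so the timeout looks like a true terminal state.
cuts_bootstrap = td_target(reward=1.0, discount=0.0, next_value=10.0, gamma=0.5)
```

Under these assumptions, forcing the discount to 0 at timeout changes the target even though nothing about the task actually ended at that step.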
After some investigation, this flag is necessary to maintain correctness during TD bootstrapping for the infinite horizon RL formulation. There is an important caveat where if
Currently, MC returns are only computed when a discount of 0 is encountered, so they cannot be computed if we do not terminate on success. This PR adds a flag, `compute_episodic_return_on_last_step`, to compute returns upon `step_type.LAST` instead.
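A rough sketch of the behavior described above, not the actual implementation in this PR. The function name `episodic_returns`, the integer step-type codes, and the convention that the return at the final step equals that step's reward are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Hypothetical step-type codes, for illustration only.
FIRST, MID, LAST = 0, 1, 2


def episodic_returns(rewards: Sequence[float],
                     discounts: Sequence[float],
                     step_types: Sequence[int],
                     gamma: float = 0.99,
                     compute_on_last_step: bool = False
                     ) -> List[Optional[float]]:
    """Backward pass computing discounted Monte Carlo returns.

    A step closes an episode when its discount is 0 (default), or when its
    step type is LAST if compute_on_last_step is set. Steps that are not
    followed by any closing step get None, since no return is available.
    """
    returns: List[Optional[float]] = [None] * len(rewards)
    running: Optional[float] = None  # None until a closing step is seen
    for t in reversed(range(len(rewards))):
        if compute_on_last_step:
            closes_episode = step_types[t] == LAST
        else:
            closes_episode = discounts[t] == 0.0
        if closes_episode:
            running = rewards[t]
        elif running is not None:
            running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# An episode that only ends by timeout: the discount never drops to 0.
rewards = [0.0, 1.0, 1.0, 1.0]
discounts = [1.0, 1.0, 1.0, 1.0]
step_types = [FIRST, MID, MID, LAST]

without_flag = episodic_returns(rewards, discounts, step_types, gamma=0.5)
with_flag = episodic_returns(rewards, discounts, step_types, gamma=0.5,
                             compute_on_last_step=True)
```

In this sketch `without_flag` contains only `None`, because no discount of 0 ever appears, while `with_flag` yields a return for every step of the timed-out episode.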