
Fatal bug in implementation of GAE #21

@SupernovaTitanium

Description


https://github.com/uvipen/Super-mario-bros-PPO-pytorch/blob/ab4248d715346c6adc33c2157455e2b98c130bcc/train.py#L119
It should be

gae = gae * opt.gamma * opt.tau * (1 - done)

Suppose worker 1 has to sample 500 steps and the game prematurely ends at step 250. The worker will restart the game and continue sampling for another 250 steps, so the trajectory is s1, s2, ..., s250, s1', s2', ..., s250'.
The current implementation forgets to reset the GAE to zero when computing the GAE of s250, so the GAE becomes larger than it should be. The advantage of s250 therefore keeps growing, which makes the network believe it should output a250 whenever it sees s250. (This is not true: performing a250 at s250 makes you die.)

As a result, the critic loss diverges (the advantage keeps growing and the network cannot predict it), the agent gets stuck on the action that makes it die, and it does not learn anything.
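
For reference, here is a minimal self-contained sketch of a backward GAE loop with the proposed (1 - done) mask applied. The function and variable names are illustrative, not the repository's actual code; gamma and tau stand in for opt.gamma and opt.tau.

import itertools

def compute_gae(rewards, values, dones, next_value, gamma=0.9, tau=1.0):
    """Return per-step return targets (advantage + value) for one rollout.

    rewards, values, dones are per-step lists from a single worker;
    next_value is the critic's estimate for the state after the last step.
    """
    gae = 0.0
    return_targets = []
    for reward, value, done in reversed(list(zip(rewards, values, dones))):
        # The fix: mask the carried-over gae with (1 - done) so it is reset
        # to zero at an episode boundary (e.g. it does not leak from s1'
        # back into s250).
        gae = gae * gamma * tau * (1 - done)
        # TD error; the bootstrap term is likewise masked when the episode ended.
        gae = gae + reward + gamma * next_value * (1 - done) - value
        next_value = value
        return_targets.insert(0, gae + value)
    return return_targets

# Toy check with two back-to-back episodes of length 2: the mask keeps the
# second episode from inflating the advantage of the first episode's last step.
print(compute_gae(rewards=[0.0, 1.0, 0.0, 1.0],
                  values=[0.1, 0.2, 0.1, 0.2],
                  dones=[0, 1, 0, 1],
                  next_value=0.0))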
