World Models #30

nagataka opened this issue Oct 22, 2019

Summary

Link

Author/Institution

David Ha, Jürgen Schmidhuber

What is this

Proposes a novel reinforcement learning agent architecture consisting of a world model, which learns a compressed spatial and temporal representation of the environment in an unsupervised manner, and a controller.

The network architecture is as follows (a minimal code sketch follows the figure below):

  • VAE (Vision)
    • compress what the agent sees at each time frame
    • encodes the high-dimensional observation into a low-dimensional latent vector $z$
  • MDN-RNN (Memory)
    • compress what happens over time
    • The M model serves as a predictive model of the future z vectors that V is expected to produce
    • MDN: a Mixture Density Network combined with an RNN
      • the MDN outputs the parameters of a mixture of Gaussians, from which a prediction of the next latent vector z is sampled
    • integrates the historical codes to create a representation that can predict future states
  • Controller model
    • select good actions using the representations from both V and M
    • $a_t = W_c [z_t h_t] + b_c$
      • C is kept deliberately compact so that most of the model’s complexity and parameters reside in V and M
    • Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen, 2016; Hansen & Ostermeier, 2001) is used to optimize the parameters of C since it is known to work well for solution spaces of up to a few thousand parameters

[Figure: WorldModels_Architecture]
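
A minimal PyTorch sketch of the three components, assuming the paper's CarRacing sizes (z ∈ R^32, 256 LSTM units, 5 mixture components, 3-dimensional actions). The exact conv stack, the per-dimension mixture parameterization, and the tanh squashing are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class VisionVAE(nn.Module):
    """V: encodes a 64x64x3 frame into a latent vector z (decoder omitted)."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(2 * 2 * 256, z_dim)
        self.fc_logvar = nn.Linear(2 * 2 * 256, z_dim)

    def encode(self, obs):                         # obs: (B, 3, 64, 64)
        h = self.enc(obs)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sampled z


class MDNRNN(nn.Module):
    """M: predicts a Gaussian mixture over z_{t+1} from (z_t, a_t) and its hidden state."""
    def __init__(self, z_dim=32, action_dim=3, hidden=256, n_mix=5):
        super().__init__()
        self.rnn = nn.LSTM(z_dim + action_dim, hidden, batch_first=True)
        # Per latent dimension: n_mix mixture logits, means, and log-sigmas.
        self.mdn = nn.Linear(hidden, 3 * n_mix * z_dim)

    def forward(self, z, a, state=None):           # z: (B, T, 32), a: (B, T, 3)
        h, state = self.rnn(torch.cat([z, a], dim=-1), state)
        return self.mdn(h), h, state               # mixture params, hidden h_t


class Controller(nn.Module):
    """C: a_t = W_c [z_t h_t] + b_c, a single linear layer."""
    def __init__(self, z_dim=32, hidden=256, action_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + hidden, action_dim)

    def forward(self, z_t, h_t):
        return torch.tanh(self.fc(torch.cat([z_t, h_t], dim=-1)))  # squashed actions
```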

Comparison with previous research. What are the novelties/good points?

Most existing model-based approaches learn a model of the RL environment but still train on the actual environment. In this work, the authors explored fully replacing the actual RL environment with a generated one: training the agent’s controller only inside the environment generated by its own internal world model, and then transferring this policy back into the actual environment.

Key points

  1. Collect 10,000 rollouts from a random policy
  2. Train the VAE (V) to encode each frame into a latent vector $z \in \mathbb{R}^{32}$
  3. Train the MDN-RNN (M) to model $P(z_{t+1} | a_t, z_t, h_t)$
  4. Define the Controller (C) as $a_t = W_c [z_t h_t] + b_c$
  5. Use CMA-ES to solve for the $W_c$ and $b_c$ that maximize the expected cumulative reward (a sketch follows this list)
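
Step 5 can be sketched with Hansen's pycma package. Here `rollout_return` is a hypothetical helper (its parameter unpacking and the population size of 64 are assumptions) that loads a flat parameter vector into C, runs one episode in the real or dreamed environment, and returns the cumulative reward:

```python
import numpy as np
import cma   # Hansen's pycma: pip install cma

N_PARAMS = (32 + 256) * 3 + 3            # W_c is (3, 288), b_c is (3,)

def rollout_return(params: np.ndarray) -> float:
    """Hypothetical helper: unpack params into C, run one episode, return total reward."""
    W_c, b_c = params[:-3].reshape(3, 288), params[-3:]
    return 0.0                           # placeholder; replace with a real/dream rollout

es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.1, {"popsize": 64})
for generation in range(100):            # or: while not es.stop()
    candidates = es.ask()                # sample a population of parameter vectors
    # CMA-ES minimizes, so feed it the negative cumulative reward.
    fitnesses = [-rollout_return(np.asarray(p)) for p in candidates]
    es.tell(candidates, fitnesses)

best_controller_params = es.result.xbest  # best W_c, b_c found
```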

In this experiment, the world model (V and M) has no knowledge about the actual reward signals from the environment. Its task is simply to compress and predict the sequence of image frames observed. Only the Controller (C) Model has access to the reward information from the environment.
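
The "predict" half of that objective is the mixture negative log-likelihood of the next latent vector (V itself is trained with the usual VAE reconstruction + KL objective). A sketch consistent with the illustrative MDN head above, whose per-dimension logits/means/log-sigmas layout is an assumption rather than the authors' exact parameterization:

```python
import math
import torch

def mdn_nll(mdn_out, z_next, z_dim=32, n_mix=5):
    """Negative log-likelihood of z_{t+1} under M's predicted Gaussian mixture."""
    B, T, _ = mdn_out.shape
    params = mdn_out.view(B, T, z_dim, 3 * n_mix)
    logits, mu, log_sigma = params.split(n_mix, dim=-1)
    z = z_next.unsqueeze(-1)                                  # (B, T, z_dim, 1)
    log_gauss = -0.5 * ((z - mu) / log_sigma.exp()) ** 2 \
                - log_sigma - 0.5 * math.log(2 * math.pi)
    log_mix = torch.log_softmax(logits, dim=-1) + log_gauss   # weight each component
    return -torch.logsumexp(log_mix, dim=-1).mean()           # average over B, T, z_dim
```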

How did the authors prove the effectiveness of the proposal?

Any discussions?

Scalability?
How far into the future can this model predict?

What should I read next?
