World Models #30

nagataka opened this issue Oct 22, 2019

Summary

Link

Author/Institution

David Ha, Jürgen Schmidhuber

What is this

Proposes a novel reinforcement learning agent architecture consisting of a world model, which learns a compressed spatial and temporal representation of the environment in an unsupervised manner, and a controller.

The network architecture is as follows (a minimal code sketch follows the figure below):

  • VAE (Vision)
    • compress what the agent sees at each time frame
    • encodes the high-dimensional observation into a low-dimensional latent vector $z$
  • MDN-RNN (Memory)
    • compress what happens over time
    • The M model serves as a predictive model of the future z vectors that V is expected to produce
    • MDN: a Mixture Density Network combined with an RNN
      • the MDN outputs the parameters of a mixture of Gaussians, from which a prediction of the next latent vector z is sampled
    • integrates the historical codes to create a representation that can predict future states
  • Controller model
    • select good actions using the representations from both V and M
    • $a_t = W_c [z_t h_t] + b_c$
      • C is kept deliberately compact so that most of the model’s complexity and parameters reside in V and M
    • Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen, 2016; Hansen & Ostermeier, 2001) is used to optimize the parameters of C since it is known to work well for solution spaces of up to a few thousand parameters

[Figure: WorldModels_Architecture]
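
A minimal PyTorch sketch of the three components, assuming the paper's CarRacing sizes (z ∈ R^32, 256 LSTM units, 5 mixture components, 3-dimensional actions). The exact conv stack, the per-dimension mixture parameterization, and the tanh squashing are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class VisionVAE(nn.Module):
    """V: encodes a 64x64x3 frame into a latent vector z (decoder omitted)."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(2 * 2 * 256, z_dim)
        self.fc_logvar = nn.Linear(2 * 2 * 256, z_dim)

    def encode(self, obs):                         # obs: (B, 3, 64, 64)
        h = self.enc(obs)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sampled z


class MDNRNN(nn.Module):
    """M: predicts a Gaussian mixture over z_{t+1} from (z_t, a_t) and its hidden state."""
    def __init__(self, z_dim=32, action_dim=3, hidden=256, n_mix=5):
        super().__init__()
        self.rnn = nn.LSTM(z_dim + action_dim, hidden, batch_first=True)
        # Per latent dimension: n_mix mixture logits, means, and log-sigmas.
        self.mdn = nn.Linear(hidden, 3 * n_mix * z_dim)

    def forward(self, z, a, state=None):           # z: (B, T, 32), a: (B, T, 3)
        h, state = self.rnn(torch.cat([z, a], dim=-1), state)
        return self.mdn(h), h, state               # mixture params, hidden h_t


class Controller(nn.Module):
    """C: a_t = W_c [z_t h_t] + b_c, a single linear layer."""
    def __init__(self, z_dim=32, hidden=256, action_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + hidden, action_dim)

    def forward(self, z_t, h_t):
        return torch.tanh(self.fc(torch.cat([z_t, h_t], dim=-1)))  # squashed actions
```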

Comparison with previous research. What are the novelties/good points?

Most existing model-based approaches learn a model of the RL environment but still train on the actual environment. In this work, the authors explored fully replacing the actual RL environment with a generated one: training the agent’s controller only inside the environment generated by its own internal world model, and then transferring this policy back into the actual environment.

Key points

  1. Collect 10,000 rollouts from a random policy
  2. Train the VAE (V) to encode each frame into a latent vector $z \in \mathbb{R}^{32}$
  3. Train the MDN-RNN (M) to model $P(z_{t+1} | a_t, z_t, h_t)$
  4. Define the Controller (C) as $a_t = W_c [z_t h_t] + b_c$
  5. Use CMA-ES to solve for the $W_c$ and $b_c$ that maximize the expected cumulative reward (a sketch follows this list)
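
Step 5 can be sketched with Hansen's pycma package. Here `rollout_return` is a hypothetical helper (its parameter unpacking and the population size of 64 are assumptions) that loads a flat parameter vector into C, runs one episode in the real or dreamed environment, and returns the cumulative reward:

```python
import numpy as np
import cma   # Hansen's pycma: pip install cma

N_PARAMS = (32 + 256) * 3 + 3            # W_c is (3, 288), b_c is (3,)

def rollout_return(params: np.ndarray) -> float:
    """Hypothetical helper: unpack params into C, run one episode, return total reward."""
    W_c, b_c = params[:-3].reshape(3, 288), params[-3:]
    return 0.0                           # placeholder; replace with a real/dream rollout

es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.1, {"popsize": 64})
for generation in range(100):            # or: while not es.stop()
    candidates = es.ask()                # sample a population of parameter vectors
    # CMA-ES minimizes, so feed it the negative cumulative reward.
    fitnesses = [-rollout_return(np.asarray(p)) for p in candidates]
    es.tell(candidates, fitnesses)

best_controller_params = es.result.xbest  # best W_c, b_c found
```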

In this experiment, the world model (V and M) has no knowledge about the actual reward signals from the environment. Its task is simply to compress and predict the sequence of image frames observed. Only the Controller (C) Model has access to the reward information from the environment.
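
The "predict" half of that objective is the mixture negative log-likelihood of the next latent vector (V itself is trained with the usual VAE reconstruction + KL objective). A sketch consistent with the illustrative MDN head above, whose per-dimension logits/means/log-sigmas layout is an assumption rather than the authors' exact parameterization:

```python
import math
import torch

def mdn_nll(mdn_out, z_next, z_dim=32, n_mix=5):
    """Negative log-likelihood of z_{t+1} under M's predicted Gaussian mixture."""
    B, T, _ = mdn_out.shape
    params = mdn_out.view(B, T, z_dim, 3 * n_mix)
    logits, mu, log_sigma = params.split(n_mix, dim=-1)
    z = z_next.unsqueeze(-1)                                  # (B, T, z_dim, 1)
    log_gauss = -0.5 * ((z - mu) / log_sigma.exp()) ** 2 \
                - log_sigma - 0.5 * math.log(2 * math.pi)
    log_mix = torch.log_softmax(logits, dim=-1) + log_gauss   # weight each component
    return -torch.logsumexp(log_mix, dim=-1).mean()           # average over B, T, z_dim
```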

How did the authors prove the effectiveness of the proposal?

Any discussions?

Scalability?
How far into the future can this model predict?

What should I read next?
