Summary

Link

Author/Institution
David Ha, Jürgen Schmidhuber

What is this
Proposed a novel reinforcement learning agent architecture consisting of a world model, which learns a compressed spatial and temporal representation of the environment in an unsupervised manner, and a controller.
The network architecture is as follows:
VAE (Vision): the V model
- Compresses what the agent sees at each time frame.
- Encodes the high-dimensional observation into a low-dimensional latent vector $z$.
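For illustration, a minimal PyTorch sketch of what the V encoder could look like; the 64x64 RGB input and 32-dimensional $z$ match the setup described below, but the layer sizes and names here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Sketch of the V model's encoder: a 64x64 RGB frame -> 32-dim latent z.
    Layer sizes are illustrative assumptions, not the paper's exact config."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64x64 -> 31x31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31x31 -> 14x14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14x14 -> 6x6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # 6x6   -> 2x2
        )
        self.fc_mu = nn.Linear(256 * 2 * 2, z_dim)
        self.fc_logvar = nn.Linear(256 * 2 * 2, z_dim)

    def forward(self, x):                        # x: (batch, 3, 64, 64)
        h = self.conv(x).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterisation trick: sample z ~ N(mu, sigma^2)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```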
MDN-RNN (Memory): the M model
- Compresses what happens over time.
- M serves as a predictive model of the future $z$ vectors that V is expected to produce.
- MDN: a Mixture Density Network combined with an RNN.
- The MDN outputs the parameters of a mixture of Gaussians, from which a prediction of the next latent vector $z$ is sampled.
- Integrates the historical codes into a representation that can predict future states.
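A rough PyTorch sketch of M, assuming an LSTM followed by a mixture-density head; the hidden width, action size, number of mixture components, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    """Sketch of the M model: given (z_t, a_t) and the RNN state, output
    mixture-of-Gaussians parameters for z_{t+1}. All sizes are assumptions."""
    def __init__(self, z_dim=32, action_dim=3, hidden_dim=256, n_mix=5):
        super().__init__()
        self.rnn = nn.LSTM(z_dim + action_dim, hidden_dim, batch_first=True)
        # For each latent dimension: n_mix mixture logits, means, and log-stddevs.
        self.mdn_head = nn.Linear(hidden_dim, 3 * n_mix * z_dim)
        self.z_dim, self.n_mix = z_dim, n_mix

    def forward(self, z, a, hidden=None):     # z: (B, T, z_dim), a: (B, T, action_dim)
        out, hidden = self.rnn(torch.cat([z, a], dim=-1), hidden)
        logit_pi, mu, log_sigma = self.mdn_head(out).chunk(3, dim=-1)
        shape = out.shape[:-1] + (self.n_mix, self.z_dim)
        return (logit_pi.view(shape), mu.view(shape), log_sigma.view(shape)), hidden

def sample_next_z(logit_pi, mu, log_sigma, temperature=1.0):
    """Sample a prediction of z_{t+1} from the mixture, one component per latent
    dimension. `temperature` is a simplified stand-in for the paper's tau."""
    pi = torch.softmax(logit_pi / temperature, dim=-2)
    idx = torch.distributions.Categorical(probs=pi.transpose(-1, -2)).sample()
    idx = idx.unsqueeze(-2)                   # (..., 1, z_dim), indexes the mixture dim
    mu_sel = torch.gather(mu, -2, idx).squeeze(-2)
    sigma_sel = torch.gather(log_sigma, -2, idx).squeeze(-2).exp()
    return mu_sel + sigma_sel * torch.randn_like(mu_sel)
```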
Controller: the C model
- Selects good actions using the representations from both V and M.
- $a_t = W_c [z_t h_t] + b_c$
- C is deliberately kept very compact, so that most of the model's complexity and parameters reside in V and M.
- The Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen, 2016; Hansen & Ostermeier, 2001) is used to optimize the parameters of C, since it is known to work well for solution spaces of up to a few thousand parameters.
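Since C is a single linear layer, its full parameter vector is small enough for an evolution strategy. A sketch of the optimization loop using the third-party `cma` package; the dimensions and the `rollout_return` function are hypothetical placeholders.

```python
import numpy as np
import cma  # third-party package: pip install cma

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3                  # assumed sizes, for illustration
N_PARAMS = (Z_DIM + H_DIM) * ACTION_DIM + ACTION_DIM   # W_c and b_c: 867 values here

def controller_action(params, z, h):
    """a_t = W_c [z_t h_t] + b_c, with W_c and b_c unpacked from one flat vector."""
    W = params[:-ACTION_DIM].reshape(ACTION_DIM, Z_DIM + H_DIM)
    b = params[-ACTION_DIM:]
    return W @ np.concatenate([z, h]) + b

def rollout_return(params):
    """Hypothetical placeholder: run one episode with this controller
    (in the real or the generated environment) and return its cumulative reward."""
    raise NotImplementedError

# CMA-ES maximises the expected return by minimising its negative.
es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.1)
while not es.stop():
    candidates = es.ask()                              # a population of flat W_c/b_c vectors
    es.tell(candidates, [-rollout_return(p) for p in candidates])
```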
Comparison with previous research. What are the novelties/good points?
Most existing model-based approaches learn a model of the RL environment but still train the policy in the actual environment. In this work, the authors explore fully replacing the actual RL environment with a generated one: the agent's controller is trained only inside the environment generated by its own internal world model, and the learned policy is then transferred back into the actual environment.
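Conceptually, a rollout in the generated ("dream") environment never touches the real simulator: the world model's sampled $z_{t+1}$ replaces the real next observation at every step. A minimal sketch of that loop, where `predict_next_z` and `act` are hypothetical callables standing in for M and C, and reward handling inside the dream is omitted:

```python
def dream_rollout(predict_next_z, act, z0, horizon=1000):
    """Roll a policy entirely inside the learned world model.
    predict_next_z(z, a, state) -> (next_z, next_state)  # stands in for M
    act(z, state) -> action                               # stands in for C
    Both are hypothetical placeholders, not APIs from the paper's code."""
    z, state, trajectory = z0, None, []
    for _ in range(horizon):
        a = act(z, state)
        z, state = predict_next_z(z, a, state)   # dreamed next observation
        trajectory.append((z, a))
    return trajectory
```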
Key points
- Collect 10,000 rollouts from a random policy.
- Train the VAE (V) to encode each frame into a latent vector $z \in \mathbb{R}^{32}$.
- Train the MDN-RNN (M) to model $P(z_{t+1} \mid a_t, z_t, h_t)$.
- Define the Controller (C) as $a_t = W_c [z_t h_t] + b_c$.
- Use CMA-ES to solve for the $W_c$ and $b_c$ that maximize the expected cumulative reward.
In this experiment, the world model (V and M) has no knowledge of the actual reward signals from the environment; its task is simply to compress and predict the sequence of observed image frames. Only the Controller (C) model has access to the reward information from the environment.
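Because V and M never see the reward, their training signals come purely from the frames: V uses the standard VAE reconstruction plus KL objective, and M is trained to maximize the likelihood of the next latent code under its predicted mixture. A sketch of that negative log-likelihood, assuming the tensor shapes of the MDN-RNN sketch above:

```python
import math
import torch

def mdn_nll(logit_pi, mu, log_sigma, z_next):
    """Negative log-likelihood of the observed z_{t+1} under the mixture of
    Gaussians predicted by M. Mixture parameters: (..., n_mix, z_dim);
    target z_next: (..., z_dim). No reward signal is involved."""
    z = z_next.unsqueeze(-2)                   # broadcast against the mixture dim
    log_pi = torch.log_softmax(logit_pi, dim=-2)
    log_gauss = -0.5 * (((z - mu) / log_sigma.exp()) ** 2
                        + 2.0 * log_sigma
                        + math.log(2.0 * math.pi))
    # Sum the mixture in log-space, then over latent dims, then average the batch.
    return -torch.logsumexp(log_pi + log_gauss, dim=-2).sum(dim=-1).mean()
```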
How did the authors prove the effectiveness of the proposal?
Any discussions?
Scalability?
How far can this model predict?
What should I read next?