World Models

Ha, David; Schmidhuber, Jürgen

doi:10.5281/zenodo.1207631

Acknowledgments

We would like to thank Blake Richards, Kory Mathewson, Kyle McDonald, Kai Arulkumaran, Ankur Handa, Denny Britz, Elwin Ha and Natasha Jaques for their thoughtful feedback on this article, and for offering their valuable perspectives and insights from their areas of expertise.

The interactive demos in this article were all built using p5.js. Deploying all of these machine learning models in a web browser was made possible with deeplearn.js, a hardware-accelerated machine learning framework for the browser, developed by the People+AI Research Initiative (PAIR) team at Google. A special thanks goes to Nikhil Thorat and Daniel Smilkov for their support.

We would like to thank Chris Olah and the rest of the Distill editorial team for their valuable feedback and generous editorial support, in addition to supporting the use of their distill.pub technology.

We would like to extend our thanks to Alex Graves, Douglas Eck, Mike Schuster, Rajat Monga, Vincent Vanhoucke, Jeff Dean and the Google Brain team for helpful feedback and for encouraging us to explore this area of research.

Any errors here are our own and do not reflect opinions of our proofreaders and colleagues. If you see mistakes or want to suggest changes, feel free to contribute feedback by participating in the discussion forum for this article.

The experiments in this article were performed on both a P100 GPU and a 64-core CPU Ubuntu Linux virtual machine provided by Google Cloud Platform, using TensorFlow and OpenAI Gym.

Citation

For attribution in academic contexts, please cite this work as

Ha and Schmidhuber, "World Models", 2018. https://doi.org/10.5281/zenodo.1207631

BibTeX citation

@article{Ha2018WorldModels,
  author = {Ha, D. and Schmidhuber, J.},
  title  = {World Models},
  eprint = {arXiv:1803.10122},
  doi    = {10.5281/zenodo.1207631},
  url    = {https://worldmodels.github.io},
  year   = {2018}
}

Open Source Code

The code to reproduce the experiments in this work, as well as IPython notebooks for training and visualizing the VAE and MDN-RNN models, will be made available at a later date.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their caption.

Appendix

In this section we describe in more detail the models and training methods used in this work.

Variational Autoencoder

We trained a Convolutional Variational Autoencoder (ConvVAE) model as the V Model of our agent. Unlike in a vanilla autoencoder, enforcing a Gaussian prior over the latent vector $z$ limits its information capacity for compressing each frame, but this Gaussian prior also makes the world model more robust to unrealistic $z$ vectors generated by the M Model. As the environment may give us observations as high-dimensional pixel images, we first resize each image to 64x64 pixels before using this resized image as the V Model’s observation. Each pixel is stored as three floating point values between 0 and 1 to represent each of the RGB channels. The ConvVAE takes in this 64x64x3 input tensor and passes this data through 4 convolutional layers to encode it into low-dimensional vectors $\mu$ and $\sigma$, each of size $N_z$. The latent vector $z$ is sampled from the Gaussian distribution $N(\mu, \sigma I)$. In the Car Racing task [48], $N_z$ is 32, while for the Doom task $N_z$ is 64. The latent vector $z$ is then passed through 4 deconvolution layers to decode and reconstruct the image.
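The sampling step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes $\sigma$ is a per-dimension standard deviation and uses the usual reparameterization $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$. The function name `sample_latent` is hypothetical.

```python
import random

def sample_latent(mu, sigma, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension.

    Assumption: sigma holds standard deviations, one per latent dimension.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

N_Z = 32  # latent size used for the Car Racing task in the text
rng = random.Random(0)
z = sample_latent([0.0] * N_Z, [1.0] * N_Z, rng)
print(len(z))
```

Writing the sample as a deterministic function of $\mu$, $\sigma$ and independent noise is what lets gradients flow through the encoder during training.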

In the following diagram, we describe the shape of our tensor at each layer of the ConvVAE and also describe the details of each layer:

Convolutional Variational Autoencoder
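Since the diagram is not reproduced here, the spatial size after each encoder layer can be estimated with the standard output-size formula for an unpadded convolution. The sketch below assumes 4x4 filters and no padding, which are assumptions for illustration; only the 64x64 input, the four layers and the stride of 2 come from the text.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1

size = 64          # input height and width, from the text
sizes = [size]
for _ in range(4):  # four convolutional layers, from the text
    size = conv_output_size(size, kernel=4, stride=2)  # kernel=4 is an assumption
    sizes.append(size)

print(sizes)  # [64, 31, 14, 6, 2]
```

Under these assumptions the image shrinks to a 2x2 spatial grid before being mapped to $\mu$ and $\sigma$.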

Each convolution and deconvolution layer uses a stride of 2. The layers are indicated in the diagram in italics as Activation-type Output Channels x Filter Size. All convolutional and deconvolutional layers use ReLU activations except for the output layer, as we need the output to be between 0 and 1. We trained the model for 1 epoch over the data collected from a random policy, using the $L^2$ distance between the input image and the reconstruction to quantify the reconstruction loss we optimize for, in addition to the KL loss.
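The two loss terms can be written out explicitly. The following is a sketch under stated assumptions rather than the training code: the reconstruction term is taken to be the squared $L^2$ distance over all pixels, the encoder is assumed to output a log-variance, and the two terms are summed with equal weight. The function names are illustrative.

```python
import math

def reconstruction_loss(image, reconstruction):
    """Squared L2 distance between two flattened images with values in [0, 1]."""
    return sum((x - y) ** 2 for x, y in zip(image, reconstruction))

def kl_loss(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def vae_loss(image, reconstruction, mu, log_var):
    """Total loss: reconstruction term plus KL term (equal weighting is an assumption)."""
    return reconstruction_loss(image, reconstruction) + kl_loss(mu, log_var)

# A posterior equal to the prior contributes no KL penalty.
print(kl_loss([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean by 1 in one dimension costs 0.5 nats.
print(kl_loss([1.0], [0.0]))
```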

Recurrent Neural Network

For the M Model, we use an LSTM [115] recurrent neural network combined with a Mixture Density Network [38][39] as the output layer. We use this network to model the probability distribution of $z$ in the next time step as a mixture of Gaussians. This approach is very similar to the Unconditional Handwriting Generation section of Graves’ Generating Sequences with RNNs [40] and to the decoder-only section of Sketch-RNN [36]. The only difference is that we did not model the correlation parameter between each element of $z$, and instead had the MDN-RNN output a diagonal covariance matrix of a factored Gaussian distribution.
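Because the distribution is factored, the training objective reduces to a sum of one-dimensional mixture log-likelihoods. The sketch below assumes that each latent dimension has its own set of mixture weights, means and standard deviations; that layout and the function names are assumptions made for illustration.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_nll(z, log_pi, mu, sigma):
    """Negative log-likelihood of z under a per-dimension mixture of Gaussians.

    log_pi, mu and sigma each have one entry per latent dimension,
    and every entry is a list with one value per mixture component.
    """
    total = 0.0
    for z_d, lp_d, mu_d, sig_d in zip(z, log_pi, mu, sigma):
        terms = [lp + gaussian_log_pdf(z_d, m, s) for lp, m, s in zip(lp_d, mu_d, sig_d)]
        total -= log_sum_exp(terms)
    return total

# One dimension, one standard-normal component, evaluated at its mean.
print(mixture_nll([0.0], [[0.0]], [[0.0]], [[1.0]]))
```

With no correlation term, the covariance is diagonal and no cross-dimension terms appear in the loss.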

MDN-RNN[36]

Unlike the handwriting and sketch generation works, rather than using the MDN-RNN to model the pdf of the next pen stroke, we model instead the pdf of the next latent vector $z$. We sample from this pdf at each timestep to generate the hallucinated environments. In the Doom task, we also use the MDN-RNN to predict the probability of whether the agent has died in this frame. If that probability is above 50%, then we set done to be True in the virtual dream environment. Given that death is a low-probability event at each timestep, we find the cutoff approach to be more stable compared to sampling from the Bernoulli distribution.
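A minimal sketch of one sampling step follows. It assumes the mixture parameters have already been produced by the network and that each latent dimension has its own mixture; `sample_next_z` and `is_done` are hypothetical names.

```python
import random

def sample_next_z(pi, mu, sigma, rng):
    """Sample each latent dimension from its own 1-D mixture of Gaussians."""
    z = []
    for pi_d, mu_d, sigma_d in zip(pi, mu, sigma):
        k = rng.choices(range(len(pi_d)), weights=pi_d)[0]  # pick a mixture component
        z.append(rng.gauss(mu_d[k], sigma_d[k]))            # sample from that component
    return z

def is_done(death_probability, threshold=0.5):
    """Deterministic cutoff used instead of sampling a Bernoulli variable."""
    return death_probability > threshold

rng = random.Random(0)
z = sample_next_z([[0.7, 0.3]], [[-1.0, 1.0]], [[0.1, 0.1]], rng)
print(z, is_done(0.02), is_done(0.6))
```

To see why the cutoff helps, take a hypothetical predicted death probability of 0.02 at every step: Bernoulli sampling would end the episode somewhere in the first 100 steps with probability $1 - 0.98^{100} \approx 0.87$, whereas the cutoff never ends it.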

The MDN-RNNs were trained for 20 epochs on the data collected from a random policy agent. In the Car Racing task, the LSTM used 256 hidden units, while the Doom task used 512 hidden units. In both tasks, we used 5 Gaussian mixtures and did not model the correlation parameter $\rho$, hence $z$ is sampled from a factored mixture of Gaussians.
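These sizes allow a rough parameter count for the Car Racing configuration. The arithmetic below rests on two assumptions that this appendix does not state: the LSTM input is $z$ concatenated with a 3-dimensional action, and the output layer emits a mixture weight, a mean and a standard deviation for each of the 5 mixtures in each of the 32 latent dimensions.

```python
def lstm_parameter_count(input_size, hidden_size):
    """Weights and biases of a standard LSTM layer with four gates."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def mdn_output_size(latent_size, num_mixtures):
    """One weight, mean and standard deviation per mixture and latent dimension."""
    return 3 * num_mixtures * latent_size

N_Z, HIDDEN, MIXTURES = 32, 256, 5   # from the text
ACTION_SIZE = 3                      # assumption: steering, gas, brake

lstm_params = lstm_parameter_count(N_Z + ACTION_SIZE, HIDDEN)
outputs = mdn_output_size(N_Z, MIXTURES)
output_layer_params = (HIDDEN + 1) * outputs  # dense layer with bias

print(lstm_params, outputs, output_layer_params, lstm_params + output_layer_params)
# 299008 480 123360 422368
```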

When training the MDN-RNN using teacher forcing from the recorded data, we store a pre-computed set of $\mu$ and $\sigma$ for each of the frames, and sample an input $z \sim N(\mu, \sigma)$ each time we construct a training batch, to prevent overfitting our MDN-RNN to a specific sampled $z$.
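The resampling step can be sketched as follows. This is an illustration only: it treats $\sigma$ as a standard deviation, omits the recorded actions that would also be fed to the network, and uses a hypothetical function name.

```python
import random

def make_training_pair(mu_seq, sigma_seq, rng):
    """Resample z for every frame, then pair each z_t with its target z_{t+1}."""
    z_seq = [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_t, sigma_t)]
        for mu_t, sigma_t in zip(mu_seq, sigma_seq)
    ]
    return z_seq[:-1], z_seq[1:]  # teacher forcing: inputs and shifted targets

rng = random.Random(0)
mu_seq = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]     # stored per-frame means
sigma_seq = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]  # stored per-frame deviations

inputs_a, targets_a = make_training_pair(mu_seq, sigma_seq, rng)
inputs_b, targets_b = make_training_pair(mu_seq, sigma_seq, rng)
print(inputs_a != inputs_b)  # each batch sees a fresh sample of z
```

Storing $\mu$ and $\sigma$ rather than a fixed $z$ means the same recorded episode yields a slightly different sequence every time it is used.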

Controller

For both environments, we applied $\tanh$ nonlinearities to clip and bound the action space to the appropriate ranges. For instance, in the Car Racing task, the steering wheel has a range from -1 to 1, the acceleration pedal from 0 to 1, and the brakes from 0 to 1. In the Doom environment, we converted the discrete actions into a continuous action space between -1 and 1, and divided this range into thirds to indicate whether the agent is moving left, staying where it is, or moving to the right. We give the C Model a feature vector as its input, consisting of $z$ and the hidden state of the MDN-RNN. In the Car Racing task, this hidden state is the output vector $h$ of the LSTM, while for the Doom task it is both the cell vector $c$ and the output vector $h$ of the LSTM.
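One way to realize these ranges is sketched below. It assumes a single linear layer followed by $\tanh$, and it assumes the pedals are mapped from $[-1, 1]$ to $[0, 1]$ by an affine rescaling; the text does not specify how the bounding is done, so both choices and all names are illustrative.

```python
import math

def car_racing_action(features, weights, biases):
    """Linear controller followed by tanh, then rescaled to each control's range."""
    raw = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]
    steering = raw[0]             # already in [-1, 1]
    gas = (raw[1] + 1.0) / 2.0    # assumed rescaling to [0, 1]
    brake = (raw[2] + 1.0) / 2.0  # assumed rescaling to [0, 1]
    return [steering, gas, brake]

def doom_action(value):
    """Split [-1, 1] into thirds: left, stay, right."""
    if value < -1.0 / 3.0:
        return "left"
    if value > 1.0 / 3.0:
        return "right"
    return "stay"

features = [0.0, 0.0]  # stands in for the concatenation of z and h
print(car_racing_action(features, [[1.0, 1.0]] * 3, [0.0, 0.0, 0.0]))  # [0.0, 0.5, 0.5]
print(doom_action(-0.9), doom_action(0.0), doom_action(0.9))
```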

Evolution Strategies

We used Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) [46], an Evolution Strategy [45], to evolve the weights for our C Model. Following the approach described in Evolving Stable Strategies [100], we used a population size of 64, and had each agent perform the task 16 times with different initial random seeds. The fitness value for the agent is the average cumulative reward of the 16 random rollouts. The diagram below charts the best performer, worst performer, and mean fitness of the population of 64 agents at each generation:
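The evaluation scheme (64 candidates, each scored by the mean of 16 seeded rollouts) can be illustrated with a much simpler optimizer. The sketch below is not CMA-ES: it uses a basic evolution strategy with a fixed step size and no covariance adaptation, and the `rollout` function is a hypothetical stand-in for running an episode.

```python
import random

POPULATION_SIZE = 64  # from the text
NUM_ROLLOUTS = 16     # from the text

def rollout(params, seed):
    """Hypothetical episode: a noisy reward that peaks when every parameter is 0.5."""
    rng = random.Random(seed)
    return -sum((p - 0.5) ** 2 for p in params) + rng.gauss(0.0, 0.01)

def fitness(params, seeds):
    """Average cumulative reward over several rollouts with different seeds."""
    return sum(rollout(params, s) for s in seeds) / len(seeds)

def evolve(num_params=3, generations=30, sigma=0.1, seed=0):
    """Simple elitist evolution strategy (a stand-in, not CMA-ES)."""
    rng = random.Random(seed)
    mean = [0.0] * num_params
    for _ in range(generations):
        seeds = [rng.randrange(2 ** 31) for _ in range(NUM_ROLLOUTS)]
        population = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(POPULATION_SIZE)
        ]
        ranked = sorted(population, key=lambda p: fitness(p, seeds), reverse=True)
        elite = ranked[: POPULATION_SIZE // 4]
        # Move the search mean to the average of the best quarter of the population.
        mean = [sum(p[i] for p in elite) / len(elite) for i in range(num_params)]
    return mean

best = evolve()
print([round(v, 2) for v in best])
```

Averaging over several seeds reduces the chance that a candidate is selected because of one lucky rollout, which matters when the environment is randomly generated.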

Training of CarRacing-v0 [48]

Since the requirement of this environment is to have an agent achieve an average score above 900 over 100 random rollouts, we took the best performing agent at the end of every 25 generations, and tested that agent over 1024 random rollout scenarios to record this average on the red line. After 1800 generations, an agent was able to achieve an average score of 900.46 over 1024 random rollouts. We used 1024 random rollouts rather than 100 because each process of the 64-core machine had been configured to run 16 times already, effectively using a full generation of compute after every 25 generations to evaluate the best agent 1024 times. Below, we plot the results of the same agent evaluated over 100 rollouts:

Histogram of cumulative rewards. Average score is 906 ± 21.

We also experimented with an agent that has access to only the $z$ vector from the VAE, without letting it see the RNN’s hidden states. We tried 2 variations. In the first variation, the C Model mapped $z$ directly to the action space $a$. In the second variation, we added a hidden layer with 40 $\tanh$ activations between $z$ and $a$, increasing the number of model parameters of the C Model to 1443, making it more comparable with the original setup.
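The figure of 1443 parameters can be checked by counting weights and biases of fully connected layers. The other two counts below assume that the direct and full controllers are single linear layers with biases, and that the Car Racing action has 3 components; they are included only for comparison.

```python
def dense_parameters(inputs, outputs):
    """Weights plus biases of one fully connected layer."""
    return inputs * outputs + outputs

N_Z, HIDDEN_STATE, ACTIONS, HIDDEN_UNITS = 32, 256, 3, 40

z_only = dense_parameters(N_Z, ACTIONS)
z_hidden = dense_parameters(N_Z, HIDDEN_UNITS) + dense_parameters(HIDDEN_UNITS, ACTIONS)
full = dense_parameters(N_Z + HIDDEN_STATE, ACTIONS)

print(z_only, z_hidden, full)  # 99 1443 867
```

The hidden-layer variant reproduces the 1443 quoted in the text, which suggests this counting convention matches the one used there.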

When the agent sees only $z_t$, average score is 632 ± 251.

When the agent sees only $z_t$, with a hidden layer, average score is 788 ± 141.

DoomRNN

We conducted a similar experiment on the hallucinated Doom environment, which we call DoomRNN. Please note that we have not actually attempted to train our agent on the actual VizDoom [34] environment, and only used VizDoom for the purpose of collecting training data using a random policy. DoomRNN is more computationally efficient than VizDoom, as it operates only in latent space without the need to render a screenshot at each timestep, and does not require running the actual Doom game engine.
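The structure of such a latent-space environment can be sketched as a small class with `reset` and `step` methods. Everything here is hypothetical: the `model_step` callable stands in for the trained MDN-RNN, the reward of one unit per surviving timestep is an assumption, and the interface only loosely mirrors a Gym environment.

```python
class DreamEnv:
    """Minimal latent-space environment driven by a learned model (hypothetical interface)."""

    def __init__(self, model_step, initial_z, initial_state):
        self.model_step = model_step
        self.initial_z = list(initial_z)
        self.initial_state = initial_state
        self.reset()

    def reset(self):
        self.z = list(self.initial_z)
        self.state = self.initial_state
        return self.z

    def step(self, action):
        # model_step returns the next latent vector, the next recurrent state,
        # and the predicted probability that the agent has died.
        self.z, self.state, death_probability = self.model_step(self.z, self.state, action)
        done = death_probability > 0.5  # cutoff rather than Bernoulli sampling
        reward = 1.0                    # assumption: one unit per timestep survived
        return self.z, reward, done


def toy_model_step(z, state, action):
    """Stand-in for the MDN-RNN: the agent 'dies' once five steps have passed."""
    next_state = state + 1
    return [v + action for v in z], next_state, (0.9 if next_state >= 5 else 0.1)


env = DreamEnv(toy_model_step, initial_z=[0.0, 0.0], initial_state=0)
total_reward, done = 0.0, False
while not done:
    _, reward, done = env.step(action=1.0)
    total_reward += reward
print(total_reward)  # 5.0
```

No pixels are produced at any point, which is where the efficiency gain over running the game engine comes from.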

Training of DoomRNN

In the virtual DoomRNN environment we constructed, we increased the temperature slightly and used $\tau=1.15$ to make the agent learn in a more challenging environment. The best agent managed to obtain an average score of 959 over 1024 random rollouts (the highest score of the red line in the diagram). This same agent achieved an average score of 1092 $\pm$ 556 over 100 random rollouts when deployed to the actual environment DoomTakeCover-v0 [35].
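How a temperature parameter can act on a mixture density output is sketched below. This follows the convention used in Sketch-RNN, where the mixture logits are divided by $\tau$ and each variance is multiplied by $\tau$; treating that as the mechanism here is an assumption, since this appendix only gives the value $\tau=1.15$.

```python
import math

def apply_temperature(logits, sigma, tau):
    """Return mixture weights and standard deviations adjusted by temperature tau."""
    scaled = [l / tau for l in logits]           # flatten or sharpen the mixture weights
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]     # stable softmax
    total = sum(exps)
    pi = [e / total for e in exps]
    return pi, [s * math.sqrt(tau) for s in sigma]  # variance scaled by tau

base_pi, base_sigma = apply_temperature([2.0, 0.0], [1.0, 1.0], tau=1.0)
hot_pi, hot_sigma = apply_temperature([2.0, 0.0], [1.0, 1.0], tau=1.15)
print(base_pi, base_sigma)
print(hot_pi, hot_sigma)
```

Raising $\tau$ above 1 spreads probability across mixture components and widens each Gaussian, so the sampled environment becomes less predictable.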

Histogram of timesteps survived in the actual environment over 100 consecutive trials.

Footnotes

In many RL problems, the feedback (positive or negative reward) is given at the end of a sequence of steps. The credit assignment problem is that of figuring out which steps caused the resulting feedback: which steps should receive credit or blame for a final result?
Typical model-free RL models have on the order of $10^3$ to $10^6$ model parameters. We look at training models on the order of $10^7$ parameters, which is still rather small compared to state-of-the-art deep learning models with $10^8$ to even $10^{9}$ parameters. In principle, the procedure described in this article can take advantage of these larger networks if we wanted to use them.
To be clear, the prediction of $z_{t+1}$ is not fed into the controller C directly; only the hidden state $h_t$ and $z_t$ are. This is because $h_t$ has all the information needed to generate the parameters of a mixture of Gaussians, if we want to sample $z_{t+1}$ to make a prediction.
We find this task interesting because although it is not difficult to train an agent to wobble around randomly generated tracks and obtain a mediocre score, CarRacing-v0 defines “solving” as getting an average reward of 900 over 100 consecutive trials, which means the agent can only afford very few driving mistakes.
We will discuss an iterative training procedure later on for more complicated environments where a random policy is not sufficient.
In principle, we can train both models together in an end-to-end manner, although we found that training each separately is more practical, and also achieves satisfactory results. Training each model required less than an hour of computation time using a single NVIDIA P100 GPU. We can also train individual VAE and MDN-RNN models without having to exhaustively tune hyperparameters.
We will discuss how this score compares to other models later on.
In Learning to Think, it is acceptable that the RNN M isn’t always a reliable predictor. A (potentially evolution-based) RNN C can in principle learn to ignore a flawed M, or exploit certain useful parts of M for arbitrary computational purposes, including hierarchical planning. This is not what we do here, though: our present approach is still closer to some of the old systems, where an RNN M is used to predict and plan ahead step by step. Unlike this early work, however, we use evolution for C (as in Learning to Think) rather than traditional RL combined with RNNs, which has the advantage of both simplicity and generality.
Another related connection is to muscle memory. For instance, as you learn to do something like play the piano, you no longer have to spend working memory capacity on translating individual notes to finger motions — this all becomes encoded at a subconscious level.

References

OpenAI Gym [PDF]
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W., 2016. ArXiv preprint.
Understanding Comics: The Invisible Art [link]
McCloud, S., 1993. Tundra Publishing.
More thoughts from Understanding Comics by Scott McCloud [link]
E, M., 2012. Tumblr.
Counterintuitive behavior of social systems [link]
Forrester, J.W., 1971. Technology Review.
The Code for Facial Identity in the Primate Brain [link]
Chang, L. and Tsao, D., 2017. Cell. DOI: 10.1016/j.cell.2017.05.011
Invariant visual representation by single neurons in the human brain [HTML]
Quiroga, R., Reddy, L., Kreiman, G., Koch, C. and Fried, I., 2005. Nature. DOI: 10.1038/nature03687
Primary Visual Cortex Represents the Difference Between Past and Present [link]
Nortmann, N., Rekauzke, S., Onat, S., König, P. and Jancke, D., 2015. Cerebral Cortex, Vol 25(6), pp. 1427-1440. DOI: 10.1093/cercor/bht318
Motion-Dependent Representation of Space in Area MT+ [link]
Maus, G., Fischer, J. and Whitney, D., 2013. Neuron. DOI: 10.1016/j.neuron.2013.03.010
Akiyoshi’s Illusion Pages [HTML]
Kitaoka, A., 2002. Kanzen.
Peripheral drift illusion [link]
Authors, W., 2017. Wikipedia.
Illusory Motion Reproduced by Deep Neural Networks Trained for Prediction [link]
Watanabe, E., Kitaoka, A., Sakamoto, K., Yasugi, M. and Tanaka, K., 2018. Frontiers in Psychology, Vol 9, pp. 345. DOI: 10.3389/fpsyg.2018.00345
Sensorimotor Mismatch Signals in Primary Visual Cortex of the Behaving Mouse [link]
Keller, G., Bonhoeffer, T. and Hübener, M., 2012. Neuron, Vol 74(5), pp. 809-815. DOI: 10.1016/j.neuron.2012.03.040
A Sensorimotor Circuit in Mouse Cortex for Visual Flow Predictions [link]
Leinweber, M., Ward, D.R., Sobczak, J.M., Attinger, A. and Keller, G.B., 2017. Neuron, Vol 95(6), pp. 1420-1432.e5. DOI: 10.1016/j.neuron.2017.08.036
The ecology of human fear: survival optimization and the nervous system. [link]
Mobbs, D., Hagan, C.C., Dalgleish, T., Silston, B. and Prévost, C., 2015. Frontiers in Neuroscience. DOI: 10.3389/fnins.2015.00055
Baseball Icon Design (CC 3.0) [link]
Sotil, G., 2018. The Noun Project.
Tracking Fastballs [link]
Hirshon, B., 2013. Science Update Interview.
Reinforcement learning: a survey
Kaelbling, L.P., Littman, M.L. and Moore, A.W., 1996. Journal of AI research, Vol 4, pp. 237—285.
Introduction to Reinforcement Learning [PDF]
Sutton, R.S. and Barto, A.G., 1998. MIT Press.
Reinforcement Learning
Wiering, M. and van Otterlo, M., 2012. Springer.
Learning How the World Works: Specifications for Predictive Networks in Robots and Brains
Werbos, P.J., 1987. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, N.Y..
David Silver’s Lecture on Integrating Learning and Planning [PDF]
Silver, D., 2017.
Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments [PDF]
Schmidhuber, J., 1990.
An on-line algorithm for dynamic reinforcement learning and planning in reactive environments [link]
Schmidhuber, J., 1990. 1990 IJCNN International Joint Conference on Neural Networks, pp. 253-258 vol.2. DOI: 10.1109/IJCNN.1990.137723
Reinforcement Learning in Markovian and Non-Markovian Environments [PDF]
Schmidhuber, J., 1991. Advances in Neural Information Processing Systems 3, pp. 500—506. Morgan-Kaufmann.
The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors
Linnainmaa, S., 1970.
Gradient Theory of Optimal Flight Paths
Kelley, H.J., 1960. ARS Journal, Vol 30(10), pp. 947-954.
Applications of advances in nonlinear sensitivity analysis
Werbos, P.J., 1982. System modeling and optimization, pp. 762—770. Springer.
Deep Reinforcement Learning: A Brief Survey [PDF]
Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A., 2017. IEEE Signal Processing Magazine, Vol 34(6), pp. 26-38. DOI: 10.1109/MSP.2017.2743240
Deep Learning in Neural Networks: An Overview
Schmidhuber, J., 2015. Neural Networks, Vol 61, pp. 85-117. DOI: 10.1016/j.neunet.2014.09.003
A Possibility for Implementing Curiosity and Boredom in Model-building Neural Controllers [PDF]
Schmidhuber, J., 1990. Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, pp. 222—227. MIT Press.
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models [PDF]
Schmidhuber, J., 2015. ArXiv preprint.
Auto-Encoding Variational Bayes [PDF]
Kingma, D. and Welling, M., 2013. ArXiv preprint.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models [PDF]
Jimenez Rezende, D., Mohamed, S. and Wierstra, D., 2014. ArXiv preprint.
ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning [PDF]
Kempka, M., Wydmuch, M., Runc, G., Toczek, J. and Jaskowski, W., 2016. IEEE Conference on Computational Intelligence and Games, pp. 341—348. IEEE.
DoomTakeCover-v0 [link]
Paquette, P., 2016.
A Neural Representation of Sketch Drawings [link]
Ha, D. and Eck, D., 2017. ArXiv preprint.
Draw Together with a Neural Network [link]
Ha, D., Jongejan, J. and Johnson, I., 2017. Google AI Experiments.
Mixture density networks [link]
Bishop, C.M., 1994. Technical Report. Aston University.
Mixture Density Networks with TensorFlow [link]
Ha, D., 2015. blog.otoro.net.
Generating sequences with recurrent neural networks [PDF]
Graves, A., 2013. ArXiv preprint.
Recurrent Neural Network Tutorial for Artists [link]
Ha, D., 2017. blog.otoro.net.
Experiments in Handwriting with a Neural Network [link]
Carter, S., Ha, D., Johnson, I. and Olah, C., 2016. Distill. DOI: 10.23915/distill.00004
Evolutionsstrategie: optimierung technischer systeme nach prinzipien der biologischen evolution [link]
Rechenberg, I., 1973. Frommann-Holzboog.
Numerical Optimization of Computer Models [link]
Schwefel, H., 1977. John Wiley and Sons, Inc.
A Visual Guide to Evolution Strategies [link]
Ha, D., 2017. blog.otoro.net.
The CMA Evolution Strategy: A Tutorial [PDF]
Hansen, N., 2016. ArXiv preprint.
Completely Derandomized Self-Adaptation in Evolution Strategies [PDF]
Hansen, N. and Ostermeier, A., 2001. Evolutionary Computation, Vol 9(2), pp. 159—195. MIT Press. DOI: 10.1162/106365601750190398
CarRacing-v0 [link]
Klimov, O., 2016.
Self-driving cars in the browser [link]
Hünermann, J., 2017.
Mar I/O Kart [link]
Bling, S., 2015.
Using Keras and Deep Deterministic Policy Gradient to play TORCS [HTML]
Lau, B., 2016.
Car Racing using Reinforcement Learning [PDF]
Khan, M. and Elibol, O., 2016.
Reinforcement Car Racing with A3C [link]
Jang, S., Min, J. and Lee, C., 2017.
Deep-Q Learning for Box2D Racecar RL problem. [link]
Prieur, L., 2017. GitHub.
Video Game Exploits [link]
Wikipedia, A., 2017. Wikipedia.
Action-Conditional Video Prediction using Deep Networks in Atari Games [PDF]
Oh, J., Guo, X., Lee, H., Lewis, R. and Singh, S., 2015. ArXiv preprint.
Recurrent Environment Simulators [PDF]
Chiappa, S., Racaniere, S., Wierstra, D. and Mohamed, S., 2017. ArXiv preprint.
PILCO: A Model-Based and Data-Efficient Approach to Policy Search [PDF]
Deisenroth, M. and Rasmussen, C., 2011. In Proceedings of the International Conference on Machine Learning.
Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning [PDF]
Nagabandi, A., Kahn, G., Fearing, R. and Levine, S., 2017. ArXiv preprint.
Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). [HTML]
Schmidhuber, J., 2010. IEEE Trans. Autonomous Mental Development.
Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts
Schmidhuber, J., 2006. Connection Science, Vol 18(2), pp. 173—187.
Curious Model-Building Control Systems
Schmidhuber, J., 1991. In Proc. International Joint Conference on Neural Networks, Singapore, pp. 1458—1463. IEEE.
Curiosity-driven Exploration by Self-supervised Prediction [link]
Pathak, D., Agrawal, P., Efros, A. and Darrell, T., 2017. ArXiv preprint.
Intrinsic Motivation Systems for Autonomous Mental Development [PDF]
Oudeyer, P., Kaplan, F. and Hafner, V., 2007. Trans. Evol. Comp. IEEE Press. DOI: 10.1109/TEVC.2006.890271
Reinforcement driven information acquisition in nondeterministic environments
Schmidhuber, J., Storck, J. and Hochreiter, S., 1994.
Information-seeking, curiosity, and attention: computational and neural mechanisms [PDF]
Gottlieb, J., Oudeyer, P., Lopes, M. and Baranes, A., 2013. Cell. DOI: 10.1016/j.tics.2013.09.001
Abandoning objectives: Evolution through the search for novelty alone [link]
Lehman, J. and Stanley, K., 2011. Evolutionary Computation, Vol 19(2), pp. 189-223. MIT Press.
Memory Consolidation [link]
Authors, W., 2017. Wikipedia.
Replay Comes of Age [link]
Foster, D.J., 2017. Annual Review of Neuroscience, Vol 40(1), pp. 581-602. DOI: 10.1146/annurev-neuro-072116-031538
Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play [PDF]
Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A. and Fergus, R., 2017. ArXiv preprint.
Emergent Complexity via Multi-Agent Competition [PDF]
Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. and Mordatch, I., 2017. ArXiv preprint.
Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments [PDF]
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I. and Abbeel, P., 2017. ArXiv preprint.
PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem [link]
Schmidhuber, J., 2013. Frontiers in Psychology, Vol 4, pp. 313. DOI: 10.3389/fpsyg.2013.00313
First Experiments with PowerPlay [PDF]
Srivastava, R., Steunebrink, B. and Schmidhuber, J., 2012. ArXiv preprint.
Optimal Ordered Problem Solver [PDF]
Schmidhuber, J., 2002. ArXiv preprint.
A Dual Back-Propagation Scheme for Scalar Reinforcement Learning
Munro, P.W., 1987. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pp. 165-176.
Dynamic Reinforcement Driven Error Propagation Networks with Application to Game Playing
Robinson, T. and Fallside, F., 1989. CogSci 89.
Neural Networks for Control and System Identification
Werbos, P.J., 1989. Proceedings of IEEE/CDC Tampa, Florida.
The truck backer-upper: An example of self learning in neural networks
Nguyen, N. and Widrow, B., 1989. Proceedings of the International Joint Conference on Neural Networks, pp. 357-363. IEEE Press.
Lecture Slides on PILCO [PDF]
Duvenaud, D., 2016. CSC 2541 Course at University of Toronto.
Data-Efficient Reinforcement Learning in Continuous-State POMDPs [PDF]
McAllister, R. and Rasmussen, C., 2016. ArXiv preprint.
Improving PILCO with Bayesian Neural Network Dynamics Models [PDF]
Gal, Y., McAllister, R. and Rasmussen, C., 2016. ICML Workshop on Data-Efficient Machine Learning.
Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks [PDF]
Depeweg, S., Hernandez-Lobato, J., Doshi-Velez, F. and Udluft, S., 2016. ArXiv preprint.
A Benchmark Environment Motivated by Industrial Control Problems [PDF]
Hein, D., Depeweg, S., Tokic, M., Udluft, S., Hentschel, A., Runkler, T. and Sterzing, V., 2017. ArXiv preprint.
Learning to Generate Artificial Fovea Trajectories for Target Detection [PDF]
Schmidhuber, J. and Huber, R., 1991. International Journal of Neural Systems, Vol 2(1-2), pp. 125—134. DOI: 10.1142/S012906579100011X
Learning deep dynamical models from image pixels [PDF]
Wahlström, N., Schön, T. and Deisenroth, M., 2014. ArXiv preprint.
From Pixels to Torques: Policy Learning with Deep Dynamical Models [PDF]
Wahlström, N., Schön, T. and Deisenroth, M., 2015. ArXiv preprint.
Deep Spatial Autoencoders for Visuomotor Learning [PDF]
Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S. and Abbeel, P., 2015. ArXiv preprint.
Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images [PDF]
Watter, M., Springenberg, J., Boedecker, J. and Riedmiller, M., 2015. ArXiv preprint.
Model-Based RL Lecture at Deep RL Bootcamp 2017 [link]
Finn, C., 2017.
Game Engine Learning from Video [link]
Guzdial, M., Li, B. and Riedl, M., 2017. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3707-3713. DOI: 10.24963/ijcai.2017/518
Learning to Act by Predicting the Future [PDF]
Dosovitskiy, A. and Koltun, V., 2016. ArXiv preprint.
Hallucination with Recurrent Neural Networks [link]
Graves, A., 2015.
Unsupervised Learning of Disentangled Representations from Video [PDF]
Denton, E. and Birodkar, V., 2017. ArXiv preprint.
The Predictron: End-To-End Learning and Planning [PDF]
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A. and Degris, T., 2016. ArXiv preprint.
Imagination-Augmented Agents for Deep Reinforcement Learning [PDF]
Weber, T., Racanière, S., Reichert, D., Buesing, L., Guez, A., Rezende, D., Badia, A., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Silver, D. and Wierstra, D., 2017. ArXiv preprint.
Visual Interaction Networks [PDF]
Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P. and Zoran, D., 2017. ArXiv preprint.
PathNet: Evolution Channels Gradient Descent in Super Neural Networks [PDF]
Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A., Pritzel, A. and Wierstra, D., 2017. ArXiv preprint.
Evolution Strategies as a Scalable Alternative to Reinforcement Learning [PDF]
Salimans, T., Ho, J., Chen, X., Sidor, S. and Sutskever, I., 2017. ArXiv preprint.
Evolving Stable Strategies [link]
Ha, D., 2017. blog.otoro.net.
Welcoming the Era of Deep Neuroevolution [link]
Stanley, K. and Clune, J., 2017. Uber AI Research.
Playing Atari with Deep Reinforcement Learning [PDF]
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. ArXiv preprint.
Evolving Neural Networks Through Augmenting Topologies [link]
Stanley, K.O. and Miikkulainen, R., 2002. Evolutionary Computation, Vol 10(2), pp. 99-127.
Accelerated Neural Evolution Through Cooperatively Coevolved Synapses [PDF]
Gomez, F., Schmidhuber, J. and Miikkulainen, R., 2008. Journal of Machine Learning Research, Vol 9, pp. 937—965. JMLR.org.
Co-evolving Recurrent Neurons Learn Deep Memory POMDPs [PDF]
Gomez, F. and Schmidhuber, J., 2005. Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pp. 491—498. ACM. DOI: 10.1145/1068009.1068092
Autonomous Evolution of Topographic Regularities in Artificial Neural Networks [PDF]
Gauci, J. and Stanley, K.O., 2010. Neural Computation, Vol 22(7), pp. 1860—1898. MIT Press. DOI: 10.1162/neco.2010.06-09-1042
Parameter-exploring policy gradients [link]
Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J. and Schmidhuber, J., 2010. Neural Networks, Vol 23(4), pp. 551-559. DOI: 10.1016/j.neunet.2009.12.004
Evolving Neural Networks [PDF]
Miikkulainen, R., 2013. IJCNN.
Evolving Large-scale Neural Networks for Vision-based Reinforcement Learning [HTML]
Koutnik, J., Cuccu, G., Schmidhuber, J. and Gomez, F., 2013. Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 1061—1068. ACM. DOI: 10.1145/2463372.2463509
A Neuroevolution Approach to General Atari Game Playing [link]
Hausknecht, M., Lehman, J., Miikkulainen, R. and Stone, P., 2013. IEEE Transactions on Computational Intelligence and AI in Games.
Neuro-Visual Control in the Quake II Environment [PDF]
Parker, M. and Bryant, B., 2012. IEEE Transactions on Computational Intelligence and AI in Games.
Autoencoder-augmented Neuroevolution for Visual Doom Playing [PDF]
Alvernaz, S. and Togelius, J., 2017. ArXiv preprint.
Cortical interneurons that specialize in disinhibitory control [link]
Pi, H., Hangya, B., Kvitsiani, D., Sanders, J., Huang, Z. and Kepecs, A., 2013. Nature. DOI: 10.1038/nature12676
SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control [PDF]
Byravan, A., Leeb, F., Meier, F. and Fox, D., 2017. ArXiv preprint.
Long short-term memory [PDF]
Hochreiter, S. and Schmidhuber, J., 1997. Neural Computation. MIT Press.
Learning to Forget: Continual Prediction with LSTM [PDF]
Gers, F., Schmidhuber, J. and Cummins, F., 2000. Neural Computation, Vol 12(10), pp. 2451—2471. MIT Press. DOI: 10.1162/089976600300015015
Nanoconnectomic upper bound on the variability of synaptic plasticity [link]
Bartol, T.M., Bromer, C., Kinney, J., Chirillo, M.A., Bourne, J.N., Harris, K.M. and Sejnowski, T.J., 2015. eLife Sciences Publications, Ltd. DOI: 10.7554/eLife.10778
Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.
Ratcliff, R.M., 1990. Psychological Review, Vol 97(2), pp. 285-308.
Catastrophic interference in connectionist networks: Can It Be predicted, can It be prevented? [PDF]
French, R.M., 1994. Advances in Neural Information Processing Systems 6, pp. 1176—1177. Morgan-Kaufmann.
Overcoming catastrophic forgetting in neural networks [PDF]
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.M., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D. and Hadsell, R., 2016. ArXiv preprint.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [PDF]
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. and Dean, J., 2017. ArXiv preprint.
HyperNetworks [PDF]
Ha, D., Dai, A. and Le, Q., 2016. ArXiv preprint.
Language Modeling with Recurrent Highway Hypernetworks [PDF]
Suarez, J., 2017. Advances in Neural Information Processing Systems 30, pp. 3269—3278. Curran Associates, Inc.
WaveNet: A Generative Model for Raw Audio [PDF]
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K., 2016. ArXiv preprint.
Attention Is All You Need [PDF]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. and Polosukhin, I., 2017. ArXiv preprint.
Generative Temporal Models with Memory [PDF]
Gemici, M., Hung, C., Santoro, A., Wayne, G., Mohamed, S., Rezende, D., Amos, D. and Lillicrap, T., 2017. ArXiv preprint.
One Big Net For Everything [PDF]
Schmidhuber, J., 2018. Preprint arXiv:1802.08864 [cs.AI].
Learning Complex, Extended Sequences Using the Principle of History Compression
Schmidhuber, J., 1992. Neural Computation, Vol 4(2), pp. 234-242.

Method	Average Score over 100 Random Tracks
DQN [54]	343 $\pm$ 18
A3C (continuous) [53]	591 $\pm$ 45
A3C (discrete) [52]	652 $\pm$ 10
ceobillionaire’s algorithm (unpublished) [48]	838 $\pm$ 11
V model only, $z$ input	632 $\pm$ 251
V model only, $z$ input with a hidden layer	788 $\pm$ 141
Full World Model, $z$ and $h$	906 $\pm$ 21

Temperature $\tau$	Score in Virtual Environment	Score in Actual Environment
0.10	2086 $\pm$ 140	193 $\pm$ 58
0.50	2060 $\pm$ 277	196 $\pm$ 50
1.00	1145 $\pm$ 690	868 $\pm$ 511
1.15	918 $\pm$ 546	1092 $\pm$ 556
1.30	732 $\pm$ 269	753 $\pm$ 139
Random Policy Baseline	N/A	210 $\pm$ 108
Gym Leaderboard [35]	N/A	820 $\pm$ 58
