[Paper Notes] DayDreamer: World Models for Physical Robot Learning - CoRL 2022

Key information

  • This paper learns two models: a world model trained on off-policy experience sequences through supervised learning, and an actor-critic model that learns behaviors from trajectories imagined by the learned world model.
  • Data collection and learning updates are decoupled, enabling fast training without waiting for the environment: a learner thread continuously trains the world model and the actor-critic behavior, while a parallel actor thread computes actions for environment interaction (see the sketch below).
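
Below is a minimal sketch of this decoupled setup. The `StubEnv` and `StubAgent` classes are hypothetical placeholders, not the paper's code; the sketch only illustrates that the actor never blocks on gradient updates.

```python
import random
import threading
import time
from collections import deque

replay = deque(maxlen=10_000)       # shared replay buffer of transitions
lock = threading.Lock()

class StubEnv:
    """Placeholder standing in for the real robot environment."""
    def reset(self):
        return 0.0
    def step(self, action):
        # returns (next_obs, reward, done)
        return random.random(), random.random(), random.random() < 0.1

class StubAgent:
    """Placeholder for the world model plus actor-critic."""
    def act(self, obs):
        return random.random()
    def train_step(self, batch):
        time.sleep(0.01)            # stands in for one gradient update

def actor_loop(env, agent):
    """Collect experience continuously, never waiting for training."""
    obs = env.reset()
    while True:
        action = agent.act(obs)     # uses the latest policy parameters
        next_obs, reward, done = env.step(action)
        with lock:
            replay.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs

def learner_loop(agent):
    """Train from replayed experience as fast as the hardware allows."""
    while True:
        with lock:
            batch = random.sample(list(replay), min(len(replay), 16))
        if batch:
            agent.train_step(batch)

env, agent = StubEnv(), StubAgent()
threading.Thread(target=actor_loop, args=(env, agent), daemon=True).start()
threading.Thread(target=learner_loop, args=(agent,), daemon=True).start()
time.sleep(1.0)                     # let both threads run briefly
```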

World model learning

  • The world model can be thought of as a fast simulator of the environment that the robot learns autonomously, even though the physical robot itself operates in the real world.
  • The world model is based on the Recurrent State-Space Model (RSSM), which consists of encoder, decoder, dynamics, and reward networks.
  • The encoder network fuses all sensory inputs $x_t$ into stochastic representations $z_t$. The dynamics model learns to predict the sequence of stochastic representations using its recurrent state $h_t$. The reward network predicts task rewards and is trained from the robot's real-world interaction. (The decoder reconstructs the sensory inputs; it serves as a training signal but is not needed when imagining trajectories.)
  • All components of the world model are jointly optimized by stochastic backpropagation; see the sketch after this list.
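
The sketch below illustrates one RSSM latent step under simplifying assumptions: Gaussian latents and plain linear heads. DayDreamer builds on the discrete-latent RSSM of DreamerV2, so treat the sizes and distributions here as illustrative only.

```python
import torch
import torch.nn as nn
import torch.distributions as td

class RSSM(nn.Module):
    """Minimal Recurrent State-Space Model sketch (illustrative sizes)."""
    def __init__(self, obs_dim=64, act_dim=4, hid=200, z_dim=30):
        super().__init__()
        self.cell = nn.GRUCell(z_dim + act_dim, hid)     # recurrent state h_t
        self.prior = nn.Linear(hid, 2 * z_dim)           # dynamics: p(z_t | h_t)
        self.post = nn.Linear(hid + obs_dim, 2 * z_dim)  # encoder: q(z_t | h_t, x_t)
        self.decoder = nn.Linear(hid + z_dim, obs_dim)   # reconstructs x_t
        self.reward = nn.Linear(hid + z_dim, 1)          # predicts r_t

    def _dist(self, params):
        mean, std = params.chunk(2, dim=-1)
        return td.Normal(mean, nn.functional.softplus(std) + 0.1)

    def step(self, h, z, a, x=None):
        """One latent step: with x, infer the posterior (training);
        without x, sample from the prior (imagination)."""
        h = self.cell(torch.cat([z, a], -1), h)
        dist = (self._dist(self.post(torch.cat([h, x], -1)))
                if x is not None else self._dist(self.prior(h)))
        z = dist.rsample()           # reparameterized sample: stochastic backprop
        feat = torch.cat([h, z], -1)
        return h, z, self.decoder(feat), self.reward(feat)
```

During training the posterior branch is used and all heads are optimized jointly; during imagination the prior branch rolls the latent state forward without any sensory input.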

Actor-critic learning

The actor-critic algorithm learns a behavior specific to the task at hand: the actor network decides which action to take in a given state so as to maximize returns, while the critic network evaluates actions by regressing the returns.
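
As a rough illustration, here is a sketch of the λ-return targets that Dreamer-style agents compute over imagined trajectories; the random tensors and loss terms below are stand-ins, not the paper's code.

```python
import torch

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """TD(lambda) targets over an imagined trajectory.
    rewards: [H]; values: [H + 1], where values[-1] bootstraps the tail."""
    returns = torch.zeros_like(rewards)
    last = values[-1]
    for t in reversed(range(len(rewards))):
        last = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * last)
        returns[t] = last
    return returns

H = 15                                          # imagination horizon
rewards = torch.rand(H)                         # predicted by the reward network
values = torch.rand(H + 1, requires_grad=True)  # critic predictions v(s_t)

targets = lambda_returns(rewards, values.detach())   # stop-grad targets
critic_loss = ((values[:-1] - targets) ** 2).mean()  # critic regresses returns
actor_loss = -targets.mean()  # actor maximizes returns; in Dreamer the gradient
                              # reaches the actor through the reparameterized
                              # latent rollout, not shown in this toy example
```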

Experiments

Experiments are carried out on four different robots with different tasks: a Unitree A1 quadruped learning to walk, a UR5 arm on multi-object visual pick-and-place, an XArm on visual pick-and-place, and a Sphero Ollie on visual navigation.

My takeaways

The paper presents a method for training RL directly in real environments rather than in a simulator. A world model is trained and used for quick updates, and data collection is decoupled from learning updates. These techniques provide insightful ideas for future reinforcement learning architecture design.

Reproduction of this paper

  • Physical robots (may require MuJoCo or Isaac to reproduce, due to lack of hardware)
  • Games (easier to reproduce)