[Paper Notes] DayDreamer: World Models for Physical Robot Learning - CoRL 2022
Published:
Key information
- The method learns two components: a world model trained on off-policy sequences via supervised learning, and an actor-critic that learns behaviors from trajectories predicted (imagined) by the world model.
- Data collection and learning updates are decoupled, enabling fast training without waiting for the environment: a learner thread continuously trains the world model and the actor-critic behavior, while a parallel actor thread computes actions for environment interaction (see the sketch after this list).
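A minimal sketch of the decoupled actor/learner setup, assuming a generic `env`, `policy`, and update callables (hypothetical names; the actual DayDreamer implementation differs in detail):

```python
# Decoupled data collection (actor thread) and training (learner thread)
# sharing a replay buffer; a simplified illustration, not the paper's code.
import threading
import random
import time
from collections import deque

replay_buffer = deque(maxlen=10_000)   # shared storage of environment transitions
buffer_lock = threading.Lock()
stop_flag = threading.Event()

def actor_thread(env, policy):
    """Collects experience at the robot's control rate, never waiting for the learner."""
    obs = env.reset()
    while not stop_flag.is_set():
        action = policy(obs)                       # latest policy snapshot
        next_obs, reward, done = env.step(action)
        with buffer_lock:
            replay_buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

def learner_thread(update_world_model, update_actor_critic, batch_size=32):
    """Continuously trains the world model and behavior from replayed experience."""
    while not stop_flag.is_set():
        with buffer_lock:
            batch = (random.sample(replay_buffer, batch_size)
                     if len(replay_buffer) >= batch_size else None)
        if batch is None:
            time.sleep(0.01)                       # wait until enough data is collected
            continue
        update_world_model(batch)                  # supervised learning on off-policy data
        update_actor_critic(batch)                 # behavior learned from imagined rollouts
```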
World model learning
- The world model can be thought of as a fast simulator of the environment that the robot learns autonomously, even though the physical robot operates in the real environment.
- The world model is based on the Recurrent State-Space Model (RSSM) which consists of encoder, decoder, dynamics and reward networks.
- The encoder network fuses all sensory inputs $x_t$ into stochastic representations $z_t$. The dynamics model learns to predict the sequence of stochastic representations using its recurrent state $h_t$. The reward network predicts task rewards from the model state, trained on rewards observed while the robot interacts with the real world. (The decoder reconstructs the inputs to provide a learning signal for the representations, but is not needed when imagining trajectories.)
- All components of the world model are jointly optimized by stochastic backpropagation (a simplified sketch follows this list).
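A rough PyTorch sketch of an RSSM-style world model with encoder, dynamics, decoder, and reward heads trained jointly. The layer sizes, Gaussian latents, and loss weighting are assumptions for illustration, not the paper's exact design:

```python
# Simplified RSSM: deterministic recurrent state h_t, stochastic latent z_t,
# trained end-to-end with reconstruction, reward, and KL losses.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as D

class RSSM(nn.Module):
    def __init__(self, obs_dim, act_dim, hid=200, stoch=30):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hid)             # x_t -> embedding
        self.rnn = nn.GRUCell(stoch + act_dim, hid)        # recurrent state h_t
        self.prior = nn.Linear(hid, 2 * stoch)             # p(z_t | h_t)
        self.posterior = nn.Linear(2 * hid, 2 * stoch)     # q(z_t | h_t, x_t)
        self.decoder = nn.Linear(hid + stoch, obs_dim)     # reconstruct x_t
        self.reward = nn.Linear(hid + stoch, 1)            # predict r_t

    @staticmethod
    def _dist(params):
        mean, std = params.chunk(2, -1)
        return D.Independent(D.Normal(mean, F.softplus(std) + 0.1), 1)

    def loss(self, obs, actions, rewards):
        # obs: (T, B, obs_dim), actions: (T, B, act_dim), rewards: (T, B, 1)
        T, B = obs.shape[:2]
        h = torch.zeros(B, self.rnn.hidden_size)
        z = torch.zeros(B, self.prior.out_features // 2)
        total = 0.0
        for t in range(T):
            h = self.rnn(torch.cat([z, actions[t]], -1), h)
            prior = self._dist(self.prior(h))
            embed = self.encoder(obs[t])
            post = self._dist(self.posterior(torch.cat([h, embed], -1)))
            z = post.rsample()                              # stochastic backpropagation
            feat = torch.cat([h, z], -1)
            recon = F.mse_loss(self.decoder(feat), obs[t])
            rew = F.mse_loss(self.reward(feat), rewards[t])
            kl = D.kl_divergence(post, prior).mean()
            total = total + recon + rew + kl
        return total / T
```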
Actor-critic learning
The actor-critic algorithm learns a behavior specific to the task at hand: the actor network decides which action to take in a given state so as to maximize predicted returns, while the critic network evaluates states by regressing the returns. A rough sketch of learning from imagined trajectories follows.
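The sketch below assumes a hypothetical `world_model.imagine_step` rollout API and uses plain discounted returns; the paper's actual objective uses lambda-returns and additional details:

```python
# Actor-critic update on trajectories imagined by the learned world model.
import torch

def imagine_and_update(world_model, actor, critic, actor_opt, critic_opt,
                       start_feat, horizon=15, gamma=0.99):
    """start_feat: batch of model states (h_t, z_t concatenated) from replayed data."""
    feats, rewards = [], []
    feat = start_feat
    for _ in range(horizon):
        action = actor(feat)                                   # act in latent space
        feat, reward = world_model.imagine_step(feat, action)  # assumed rollout API
        feats.append(feat)
        rewards.append(reward)

    # Discounted return of the imagined trajectory (simplified; no bootstrapping).
    ret = torch.zeros_like(rewards[-1])
    returns = []
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.insert(0, ret)
    returns = torch.stack(returns)
    feats = torch.stack(feats)

    # Actor maximizes imagined returns; gradients flow through the learned dynamics.
    actor_loss = -returns.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic regresses the imagined returns (targets detached from the model graph).
    critic_loss = ((critic(feats.detach()) - returns.detach()) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```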
Experiments
Experiments are carried out on four different robots with different tasks.
- Unitree A1 Quadruped Walking
- UR5 Multi-Object Visual Pick and Place
- XArm Visual Pick and Place
- Sphero Navigation
My takeaways
The paper presents a method for training RL directly in the real world rather than in a simulator. A world model is trained and used for quick policy updates, and data collection and learning updates are decoupled. These techniques offer useful ideas for future reinforcement learning architecture design.
Reproduction of this paper
- Physical robots (may require MuJoCo or Isaac to reproduce, given the lack of hardware)
- Games (easier to reproduce)