The paper trains two models: a world model, learned from off-policy experience sequences by supervised learning, and an actor-critic that learns behaviors from trajectories predicted by the learned world model.
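A minimal sketch of the two updates, not the paper's actual architecture or losses: here the world model, actor, and critic are small MLP stand-ins over a low-dimensional state, and the dimensions, horizon, and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, horizon = 8, 3, 5

# Stand-in networks: world model predicts (next state, reward) from (state, action).
world_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ELU(),
                            nn.Linear(64, state_dim + 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ELU(), nn.Linear(64, action_dim))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ELU(), nn.Linear(64, 1))

wm_opt = torch.optim.Adam(world_model.parameters(), lr=3e-4)
ac_opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=8e-5)

def world_model_update(states, actions, next_states, rewards):
    """Supervised learning on off-policy sequences sampled from replayed experience."""
    pred = world_model(torch.cat([states, actions], dim=-1))
    pred_next, pred_reward = pred[..., :state_dim], pred[..., state_dim]
    loss = F.mse_loss(pred_next, next_states) + F.mse_loss(pred_reward, rewards)
    wm_opt.zero_grad(); loss.backward(); wm_opt.step()

def actor_critic_update(start_states, gamma=0.99):
    """Actor-critic update on trajectories predicted by the (frozen) world model."""
    state, ret, log_probs = start_states, torch.zeros(start_states.shape[0]), []
    for t in range(horizon):
        dist = torch.distributions.Categorical(logits=actor(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        onehot = F.one_hot(action, action_dim).float()
        pred = world_model(torch.cat([state, onehot], dim=-1)).detach()
        state, reward = pred[..., :state_dim], pred[..., state_dim]
        ret = ret + (gamma ** t) * reward
    ret = ret + (gamma ** horizon) * critic(state).squeeze(-1).detach()  # bootstrap
    value = critic(start_states).squeeze(-1)
    critic_loss = F.mse_loss(value, ret)
    actor_loss = -(torch.stack(log_probs).sum(0) * (ret - value).detach()).mean()
    ac_opt.zero_grad(); (actor_loss + critic_loss).backward(); ac_opt.step()

# One illustrative update, with random tensors standing in for replay-buffer samples.
batch = torch.randn(16, state_dim)
world_model_update(batch,
                   F.one_hot(torch.randint(0, action_dim, (16,)), action_dim).float(),
                   torch.randn(16, state_dim), torch.randn(16))
actor_critic_update(batch)
```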
Data collection and learning updates are decoupled, so training does not have to wait for the environment: a learner thread continuously trains the world model and the actor-critic, while a parallel actor thread computes actions for environment interaction.
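A minimal sketch of this decoupling, assuming Python threads and a shared in-memory replay buffer; the paper's actual mechanism (queues, parameter serving, hardware placement) is not specified here, and `train_step`, `select_action`, and `env_step` are hypothetical stand-ins.

```python
import random
import threading
import time
from collections import deque

replay_buffer = deque(maxlen=100_000)   # written by the actor, sampled by the learner
buffer_lock = threading.Lock()
stop_event = threading.Event()

def train_step(batch):
    time.sleep(0.01)                    # stand-in for a world-model + actor-critic update

def select_action(observation):
    return random.randint(0, 2)         # stand-in for sampling from the current policy

def env_step(action):
    return random.random(), random.random()   # stand-in for (observation, reward)

def learner():
    """Continuously trains on sampled experience without waiting for the environment."""
    while not stop_event.is_set():
        with buffer_lock:
            batch = random.sample(list(replay_buffer), 32) if len(replay_buffer) >= 32 else None
        if batch is None:
            time.sleep(0.01)            # wait until the actor has collected enough data
            continue
        train_step(batch)

def actor():
    """Interacts with the environment in parallel and appends experience to the buffer."""
    observation = 0.0
    while not stop_event.is_set():
        action = select_action(observation)
        observation, reward = env_step(action)
        with buffer_lock:
            replay_buffer.append((observation, action, reward))

threads = [threading.Thread(target=learner), threading.Thread(target=actor)]
for t in threads:
    t.start()
time.sleep(1.0)                         # let both threads run concurrently for a moment
stop_event.set()
for t in threads:
    t.join()
```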