[Paper Notes] Discovering state-of-the-art reinforcement learning algorithms
TL;DR
DiscoRL turns RL algorithm design itself into a meta-learning problem. Instead of hand-writing TD targets, policy losses, or auxiliary objectives, the paper learns a meta-network that emits update targets for an agent’s policy and internal predictions.
This is not “meta-RL” in the usual fast-adaptation sense. The thing being meta-learned is the learning rule itself.
The headline result is strong:
- Disco57 reaches IQM 13.86 on Atari, outperforming prior hand-designed RL rules on this benchmark
- The discovered rule also transfers to unseen benchmarks such as ProcGen, Crafter, and NetHack
- A broader discovery run, Disco103, gets even stronger by training on a more diverse set of environments
The big message is that RL algorithm discovery can scale with environment diversity and compute, much like model pretraining scales with data.
Paper Info
- Title: Discovering state-of-the-art reinforcement learning algorithms
- Authors: Junhyuk Oh, Gregory Farquhar, Iurii Kemaev, Dan A. Calian, Matteo Hessel, Luisa Zintgraf, Satinder Singh, Hado van Hasselt, David Silver
- Affiliation: Google DeepMind
- Venue: Nature, Vol. 648, 11 December 2025 issue
- Published online: 2025-10-22
- DOI: 10.1038/s41586-025-09761-x
- Code: google-deepmind/disco_rl
1. Motivation
Most RL progress still comes from humans inventing better update rules: TD learning, Q-learning, PPO, auxiliary losses, distributional targets, and so on. Previous “automatic discovery” work usually searched a much narrower space:
- tune a few hyperparameters
- learn a scalar objective
- meta-train in toy environments
This paper asks a more ambitious question:
Can we directly discover the RL rule that updates the agent, using only the cumulative experience of many agents interacting with complex environments?
That framing is what makes the paper interesting. It is not just learning a better policy. It is trying to learn how policies and predictions should be learned.
2. Core Idea
2.1 Agent outputs
The agent produces more than a policy:
- a policy \pi(s)
- an observation-conditioned prediction vector y(s)
- an action-conditioned prediction vector z(s, a)
- an action-value head q(s, a)
- an auxiliary policy prediction p(s, a)
The important part is that y and z do not have predefined semantics. They are slots in which the meta-learning process can invent useful internal predictions.
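The head structure above can be sketched in plain numpy. Everything here is illustrative: the dimensions, the linear heads, and the shared-feature input are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the paper's actual head dimensions differ.
NUM_ACTIONS, PRED_DIM, FEAT_DIM = 4, 8, 16

# One weight matrix per head; a real agent would share a torso network.
W = {name: rng.normal(scale=0.1, size=(FEAT_DIM, out_dim))
     for name, out_dim in [
         ("pi", NUM_ACTIONS),                # policy logits
         ("y", PRED_DIM),                    # observation-conditioned predictions
         ("z", NUM_ACTIONS * PRED_DIM),      # action-conditioned predictions
         ("q", NUM_ACTIONS),                 # action-value head
         ("p", NUM_ACTIONS * NUM_ACTIONS),   # auxiliary policy prediction
     ]}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def agent_outputs(features):
    """Map torso features to the five heads; y and z have no fixed semantics."""
    return {
        "pi": softmax(features @ W["pi"]),
        "y": features @ W["y"],
        "z": (features @ W["z"]).reshape(NUM_ACTIONS, PRED_DIM),
        "q": features @ W["q"],
        "p": softmax((features @ W["p"]).reshape(NUM_ACTIONS, NUM_ACTIONS)),
    }

out = agent_outputs(rng.normal(size=FEAT_DIM))
```

The point of the sketch is only the interface: y and z are extra output slots with learnable meaning, alongside the familiar policy and value heads.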
2.2 Meta-network as the discovered RL rule
A backward LSTM-based meta-network reads short trajectory segments containing:
- agent outputs over time
- rewards
- episode termination signals
It then emits targets for the current policy and predictions:
\hat{\pi}, \hat{y}, \hat{z}
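A minimal numpy sketch of this backward scan, with a tanh cell standing in for the paper's LSTM and a single small target vector per step as a simplification:

```python
import numpy as np

rng = np.random.default_rng(1)
H, IN = 8, 6  # hidden size and per-step input size (both assumptions)

# Toy recurrent cell standing in for the paper's LSTM meta-network.
Wx = rng.normal(scale=0.1, size=(IN, H))
Wh = rng.normal(scale=0.1, size=(H, H))
Wout = rng.normal(scale=0.1, size=(H, 3))  # simplified (pi_hat, y_hat, z_hat) targets

def meta_targets(segment):
    """Scan a trajectory segment BACKWARD so each step's target can depend on
    future rewards, terminations and agent outputs (enabling bootstrapping)."""
    h = np.zeros(H)
    targets = []
    for step in reversed(segment):          # backward in time
        x = np.concatenate([step["agent_out"], [step["reward"], step["done"]]])
        h = np.tanh(x @ Wx + h @ Wh)
        targets.append(h @ Wout)            # emit targets for this step
    return list(reversed(targets))          # re-align with forward time

segment = [{"agent_out": rng.normal(size=IN - 2),
            "reward": float(t == 3), "done": float(t == 3)}
           for t in range(4)]
targets = meta_targets(segment)
```

The backward direction is the design choice that matters: it gives every step's target access to the future of the segment, the same role the discount-and-bootstrap machinery plays in hand-written TD rules.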
The agent is updated to move toward those targets. In simplified form:
\[\mathcal{L}_{\theta} = \mathbb{E}\left[ D(\hat{\pi}, \pi_{\theta}) + D(\hat{y}, y_{\theta}) + D(\hat{z}, z_{\theta}) + \mathcal{L}_{\text{aux}} \right]\]where the paper uses KL-style distances and auxiliary targets for the value and auxiliary-policy heads.
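A toy version of this update, using KL for the policy head and squared error as a stand-in distance for the unconstrained prediction heads (the paper's exact per-head distances are not reproduced here):

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def disco_loss(agent, target):
    """Sum of per-head distances to the meta-network's targets.
    KL for the policy; squared error stands in for D on the y/z heads."""
    return (kl(target["pi"], agent["pi"])
            + float(np.mean((np.asarray(target["y"]) - np.asarray(agent["y"])) ** 2))
            + float(np.mean((np.asarray(target["z"]) - np.asarray(agent["z"])) ** 2)))

agent = {"pi": [0.25, 0.75], "y": np.zeros(4), "z": np.zeros(4)}
target = {"pi": [0.5, 0.5], "y": np.ones(4), "z": np.ones(4)}
loss = disco_loss(agent, target)
```

In a real agent, the targets are treated as constants (stop-gradient) and the loss is minimized over the agent parameters \theta only.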
2.3 Why this is more expressive than tuning a loss scalar
The meta-network outputs targets, not just one scalar loss. That matters because it naturally includes:
- bootstrapping from future predictions
- joint policy-and-prediction updates
- the possibility of inventing new prediction semantics beyond value functions
The authors argue this search space is strictly more expressive than only meta-learning a scalar objective.
2.4 Meta-optimization
The meta-parameters \eta are improved so that agents trained under the discovered rule achieve higher return. In simplified form:
\[\eta^{*} = \arg\max_{\eta}\, \mathbb{E}\left[ G(\theta_{\eta}) \right]\]where G(\theta_{\eta}) is the return of an agent whose parameters \theta were trained using the update rule defined by \eta.
In practice, they backpropagate through a window of agent updates and use a large population of agents across many environments. This is what lets the “algorithm designer” receive gradient signal from actual downstream learning performance.
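The two-level structure can be illustrated on a toy problem. Here finite differences stand in for backprop through the window of agent updates, and the "environment" is a 1-D quadratic; every name and number is a pedagogical simplification:

```python
# Toy stand-in: theta is the agent, eta parameterizes the update rule,
# and "return" is highest when the trained theta reaches 1.0.

def inner_train(eta, steps=5, lr=0.5):
    """Train an agent under an eta-parameterized update rule."""
    theta = 0.0
    for _ in range(steps):
        # The "discovered rule": move theta toward an eta-defined target.
        theta = theta + lr * (eta - theta)
    return theta

def meta_return(eta):
    """Downstream performance of the agent trained under eta."""
    theta = inner_train(eta)
    return -(theta - 1.0) ** 2

def meta_gradient(eta, h=1e-4):
    """Finite-difference stand-in for backprop through the inner updates."""
    return (meta_return(eta + h) - meta_return(eta - h)) / (2 * h)

eta = 0.0
for _ in range(100):
    eta += 0.1 * meta_gradient(eta)  # improve the update rule itself
```

The key property survives the simplification: the outer loop never sees rewards directly; it only sees how well agents *trained under the rule* end up performing.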
3. What DiscoRL Seems to Discover
One of the most interesting parts of the paper is that the discovered predictions are not just hidden value functions with different names.
The analysis suggests:
- the discovered predictions often spike before salient events, such as large rewards
- they encode information about future policy entropy
- they attend to observation regions that are different from what the policy or value function focuses on
- future predictions strongly affect current targets, showing an emergent bootstrapping mechanism
The takeaway is that DiscoRL seems to invent internal predictive signals that complement the usual policy/value decomposition rather than simply rediscovering it.
4. Main Empirical Results
4.1 Atari
The strongest single claim is on Atari:
- Disco57 is discovered from the 57 Atari games themselves
- it reaches IQM 13.86
- it outperforms prior published hand-designed RL rules on Atari, including strong model-based and actor-critic baselines
The paper also emphasizes improved wall-clock efficiency relative to MuZero in this setting.
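For reference, IQM (interquartile mean) aggregates per-game scores by averaging the middle 50%, discarding the top and bottom quarters. A small sketch with made-up scores:

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of per-game scores,
    a robust aggregate commonly used for Atari-wide comparisons."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    trimmed = s[n // 4 : n - n // 4]   # drop the bottom and top quarter
    return float(trimmed.mean())

# Hypothetical human-normalized scores, for illustration only.
scores = [0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 10.0, 100.0]
val = iqm(scores)
```

Unlike the mean, IQM is not dominated by a few outlier games (note how the 100.0 above is ignored); unlike the median, it still uses half the data.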
4.2 Generalization to unseen benchmarks
Even though Disco57 is discovered on Atari, it transfers to environments it never saw during discovery:
- ProcGen: beats existing published methods in the paper’s comparison
- Crafter: competitive performance
- NetHack Challenge: reaches third place on the NeurIPS 2021 challenge leaderboard without domain-specific shaping or handcrafted subtasks
This is one of the paper’s most convincing results. A discovered rule is only interesting if it generalizes beyond the exact benchmarks used to create it.
4.3 More diverse discovery environments help
The second discovered rule, Disco103, uses a broader discovery set:
- Atari
- ProcGen
- DMLab-30
Compared with Disco57, it achieves:
- similar Atari performance
- better scores on every other seen and unseen benchmark reported in the main figure
- human-level performance on Crafter
- near-MuZero state-of-the-art performance on Sokoban
This supports the paper’s key scaling claim: stronger and more diverse discovery environments produce a better RL rule.
4.4 Discovery efficiency and compute
The paper frames discovery as surprisingly efficient relative to manual algorithm design:
- the best Atari rule emerges within roughly 600 million steps per game
- this is about three full experiments per Atari game
But the absolute compute is still large:
- Disco57: 128 agents, 1,024 TPUv3 cores for 64 hours
- Disco103: 206 agents, 2,048 TPUv3 cores for 60 hours
So the story is not “cheap discovery”. It is “expensive, but now plausibly worth scaling”.
5. Why the Paper Matters
- It expands the discovery space beyond hyperparameter tuning or minor loss shaping
- It shows real benchmark competitiveness, not just toy-environment meta-learning
- It provides evidence for novel internal predictions, which is more interesting than merely matching known RL rules
- It turns algorithm design into a scaling problem over environments and compute
For me, that last point is the most important. The paper suggests that at least part of RL algorithm research may become a systems-and-scaling problem, not only a human-theory problem.
6. Limitations and Open Questions
- Very high discovery cost: the method is still far from cheap or easy to reproduce
- Handcrafted scaffolding remains: the agent architecture, auxiliary heads, KL-style update form, and meta-optimization recipe are all still designed by humans
- Benchmark domain bias: the strongest evidence is on discrete-action game-like environments; the paper does not yet prove the same effect for continuous control or robotics
- Interpretability remains partial: we know the discovered predictions matter, but we still do not have a clean semantic theory for what each learned signal represents
- Commercial framing: the paper notes pending patent applications and Google ownership, which is worth keeping in mind for long-term openness
7. Takeaways
- Learning rules themselves are now viable targets for scaling. This paper makes RL algorithm discovery feel more like pretraining than like manual feature engineering.
- Searching over target-generating update rules is powerful. It is richer than tuning coefficients on a fixed PPO-style or TD-style loss.
- Complex discovery environments matter. Toy meta-training is not enough if the goal is to invent algorithms that work in hard settings.
- New predictive semantics may be a core ingredient of stronger RL. DiscoRL’s hidden predictions seem to carry information that standard policy/value heads miss.
- This is a milestone, not the endpoint. The paper shows that machine-discovered RL rules can beat strong human-designed baselines, but it does not yet mean human algorithm design is obsolete.
