[Paper Notes] CFGRL: Diffusion Guidance Is a Controllable Policy Improvement Operator
TL;DR
CFGRL derives a direct, principled connection between classifier-free guidance (CFG) in diffusion models and policy improvement in RL. The key insight: a policy can be factored as prior × optimality, and the guidance weight w in CFG directly controls the degree of policy improvement — provably. This means:
- Train with the simplicity of supervised learning (conditional diffusion/flow matching)
- Get policy improvement for free at test time by tuning the guidance weight `w` — no retraining needed
- Works as a drop-in replacement for advantage-weighted regression (AWR) in offline RL
- Especially powerful for goal-conditioned BC (GCBC): standard GCBC is just CFGRL with `w=1`; setting `w>1` provably improves the policy, often doubling success rates — without learning any value function
Paper Info
- Title: Diffusion Guidance Is a Controllable Policy Improvement Operator
- Authors: Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine
- Affiliation: UC Berkeley
- Date: 2025-05-29
- arXiv: 2505.23458
- Code: github.com/kvfrans/cfgrl
1. Motivation
Two ends of the spectrum for learning from offline data:
| Approach | Training | Optimality | Scalability |
|---|---|---|---|
| Behavioral cloning | Simple (supervised) | Only as good as data | Excellent |
| Offline RL | Complex (value functions, policy gradients) | Can improve beyond data | Notoriously tricky |
Can we get policy improvement while keeping the simplicity of supervised learning?
2. Core Idea: Product Policies
Define an improved policy as a product of two factors:
\[\pi(a|s) \propto \hat{\pi}(a|s) \cdot f(A^{\hat{\pi}}(s,a))\]where $\hat{\pi}$ is the reference (behavior) policy and $f$ is a non-negative, monotonically increasing function of the advantage $A^{\hat{\pi}}(s,a)$.
Theorem 1 (Policy Improvement): If $f$ is non-negative and non-decreasing in advantage, then the product policy $\pi$ is guaranteed to improve over $\hat{\pi}$:
\[J(\pi) \geq J(\hat{\pi})\]
Theorem 2 (Controllable Improvement): For $0 \leq w_1 < w_2$, the attenuated product $\pi_{w_2}(a|s) \propto \hat{\pi}(a|s)\, f(A^{\hat{\pi}}(s,a))^{w_2}$ is a further improvement over $\pi_{w_1}$:
\[J(\pi_{w_2}) \geq J(\pi_{w_1})\]
Higher w = more improvement (but also more divergence from reference → eventual distribution shift).
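To make the family of product policies concrete, take the standard exponential choice $f(A) = \exp(A)$ (an illustrative choice; any non-negative, non-decreasing $f$ satisfies the theorems):

```latex
\pi_w(a \mid s) \;\propto\; \hat{\pi}(a \mid s)\,
  \exp\!\big(w \cdot A^{\hat{\pi}}(s,a)\big)
```

This is the AWR target policy with inverse temperature $1/\beta = w$ — except that in CFGRL, $w$ is a sampling-time knob rather than a hyperparameter baked into training.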
3. Connection to Diffusion Guidance
The product policy’s score function decomposes additively:
\[\nabla_a \log \pi(a|s) = \nabla_a \log \hat{\pi}(a|s) + \nabla_a \log p(o|s,a)\]taking $f$ to be the likelihood of a binary optimality label $o$. Applying Bayes' rule (the classifier-free guidance trick) and introducing the weight $w$, this becomes:
\[\nabla_a \log \hat{\pi}(a|s) + w \cdot (\nabla_a \log \hat{\pi}(a|s, o) - \nabla_a \log \hat{\pi}(a|s))\]This is exactly the standard CFG formula! The guidance weight w directly controls the degree of policy improvement. Both the unconditional and conditional scores come from the same network trained with a simple flow matching objective:
\[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,o),\, t,\, \epsilon}\left[\, \left\| v_\theta(a_t, t \mid s, o) - (a - \epsilon) \right\|^2 \,\right], \qquad a_t = t\,a + (1-t)\,\epsilon\]where $o \in \{\emptyset, 0, 1\}$ is the optimality label (dropped to $\emptyset$ 10% of the time for unconditional training).
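The guided sampling loop is easy to sketch. Below is a minimal, runnable illustration; `velocity` is a toy stand-in for the learned network $v_\theta$ (a real model replaces it), but the CFG combination and Euler integration are the actual mechanics:

```python
import numpy as np

def velocity(a, t, s, o):
    """Toy stand-in for the learned flow network v_theta(a_t, t | s, o).

    o=None means the unconditional (label-dropped) branch. This stand-in
    just flows samples toward o (or 0 when unconditional) so the script
    runs end to end.
    """
    target = 0.0 if o is None else float(o)
    return target - a  # straight-line flow toward the target

def sample_action(s, w=3.0, steps=20):
    """Euler-integrate the CFG-combined velocity field from noise to action."""
    a = np.random.randn()  # a_0 ~ N(0, 1)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_uncond = velocity(a, t, s, None)      # score of pi_hat(a | s)
        v_cond = velocity(a, t, s, 1)           # score of pi_hat(a | s, o=1)
        v = v_uncond + w * (v_cond - v_uncond)  # standard CFG combination
        a = a + dt * v
    return a
```

Note that `w` appears only inside the sampling loop: sweeping it requires no change to the trained network.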
Why this matters vs. AWR
| Property | AWR | CFGRL |
|---|---|---|
| Temperature/weight | $1/\beta$ (fixed at train time) | $w$ (tunable at test time) |
| Gradient distribution | Dominated by outlier high-advantage samples | Even weighting across batch |
| Retraining needed to tune? | Yes | No |
| Empirical scaling | Saturates around $1/\beta = 10$ | Continues improving beyond |
4. Special Case: Goal-Conditioned BC (GCBC)
This is where CFGRL truly shines. Standard GCBC trains a goal-conditioned policy $\pi(a|s,g)$. By Bayes' rule, the GCBC objective implicitly creates a product policy:
\[\pi(a|s,g) \propto \hat{\pi}(a|s) \cdot p(g|s,a)\]
The second factor satisfies the conditions of Theorem 1. Therefore:
- Standard GCBC = CFGRL with w=1 (implicit, no improvement)
- CFGRL with w>1 = provably improved GCBC — for free!
The sampling formula is simply:
\[\nabla_a \log \hat{\pi}(a|s) + w \cdot (\nabla_a \log \hat{\pi}(a|s, g) - \nabla_a \log \hat{\pi}(a|s))\]No value function needed. Just train a goal-conditioned flow policy and an unconditional flow policy (same network with goal dropout), then tune w at test time.
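A single flow-matching training step with goal dropout might look like the sketch below. The `model` callable is hypothetical, and the 10% dropout rate is borrowed from the optimality-label setup described earlier; the point is that dropping the goal trains the unconditional branch $\hat{\pi}(a|s)$ inside the same network:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, s, a, g, drop_prob=0.1):
    """One conditional flow-matching step with goal dropout.

    With probability drop_prob the goal is replaced by None, so the same
    network learns both the goal-conditioned and unconditional velocity
    fields needed by the guidance formula. `model` is a hypothetical
    callable: model(a_t, t, s, g) -> predicted velocity.
    """
    g_used = None if rng.random() < drop_prob else g  # goal dropout
    eps = rng.standard_normal(a.shape)                # noise sample a_0
    t = rng.random()                                  # time in [0, 1]
    a_t = t * a + (1.0 - t) * eps                     # linear interpolation
    target = a - eps                                  # straight-line velocity
    pred = model(a_t, t, s, g_used)
    return np.mean((pred - target) ** 2)
```

No advantage, value function, or reward label appears anywhere in the loss — it is plain supervised regression on dataset actions.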
5. Key Results
Offline RL (ExORL benchmark, with learned value function)
CFGRL outperforms AWR across the reported tasks (selected results):
| Task | AWR | CFGRL |
|---|---|---|
| walker-stand | 603 | 782 |
| walker-walk | 444 | 608 |
| walker-run | 247 | 282 |
| quadruped-run | 485 | 571 |
| cheetah-run | 168 | 216 |
| cheetah-run-backward | 146 | 262 |
| jaco-reach-top-right | 33 | 72 |
Goal-Conditioned BC (OGBench, no value function)
CFGRL as drop-in GCBC improvement (selected results, flat policies):
| Task | Flow GCBC | CFGRL | Improvement |
|---|---|---|---|
| pointmaze-large-navigate | 74 | 77 | +4% |
| pointmaze-giant-navigate | 4 | 30 | 7.5x |
| antmaze-medium-navigate | 42 | 53 | +26% |
| humanoidmaze-medium-navigate | 8 | 19 | 2.4x |
| visual-cube-single-play | 13 | 37 | 2.8x |
| visual-scene-play | 25 | 40 | +60% |
With hierarchical policies (HCFGRL), gains are even larger:
| Task | Flow HGCBC | HCFGRL |
|---|---|---|
| antmaze-medium-navigate | 67 | 90 |
| antmaze-large-navigate | 61 | 78 |
| antmaze-giant-navigate | 14 | 38 |
| humanoidmaze-large-navigate | 11 | 38 |
| cube-double-play | 21 | 42 |
Scaling behavior
The guidance weight w provides a reliable knob:
- Performance steadily increases with `w` up to a divergence point
- The divergence point is further out than AWR's temperature saturation
- `w` can be swept without retraining — just re-run inference
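Because the policy is frozen, the sweep is just repeated inference. A schematic of the procedure (the `success_rate` function here is a synthetic stand-in shaped like the paper's curves — rising, then degrading past a divergence point — standing in for real rollouts of the guided policy):

```python
def success_rate(w):
    """Synthetic stand-in for: roll out the frozen guided policy at
    weight w and measure task success. Real numbers come from
    environment rollouts; this quadratic only mimics the rise-then-fall
    shape of the reported scaling curves."""
    return max(0.0, 0.4 + 0.15 * w - 0.02 * w ** 2)

# Sweep guidance weights with no retraining: each candidate w only
# changes how the two score branches are combined at sampling time.
weights = [1.0, 1.5, 2.0, 3.0, 5.0, 8.0]
scores = {w: success_rate(w) for w in weights}
best_w = max(scores, key=scores.get)
```

Contrast this with AWR, where each candidate temperature $\beta$ requires training a new policy from scratch.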
6. Strengths
- Elegant theory: clean, provable connection between CFG and policy improvement with formal guarantees (Theorems 1 & 2)
- Extreme simplicity: train a conditional diffusion/flow model with a supervised loss → tune `w` at test time → done
- No value function needed for the GCBC setting — improvement literally comes for free
- Test-time tunable: unlike AWR, where $\beta$ is baked into training, `w` can be swept without retraining
- Fixes AWR's gradient issue: even gradient magnitudes across the batch vs. outlier-dominated in AWR
- Broad applicability: state-based, visual, hierarchical, offline RL, goal-conditioned settings
7. Limitations
- One-step improvement only: CFGRL provides one step of policy improvement over the reference, not iterative optimization — not a full RL algorithm
- Distribution shift at high `w`: the theoretical guarantees hold, but practical performance degrades when `w` is too large (divergence from the reference)
- Requires offline data quality: still fundamentally limited by the support of the dataset — can improve on suboptimal data but can't discover entirely new behaviors
- Not SOTA offline RL: the authors explicitly note CFGRL is a tool (replacing AWR), not a complete offline RL system — advanced methods like policy gradients could extrapolate further
8. Takeaways
- CFG = policy improvement: diffusion guidance weight directly and provably controls the degree of policy improvement — one of the cleanest theory-to-practice connections in recent RL
- GCBC users: use guidance! Standard GCBC is CFGRL with `w=1`. Setting `w>1` is a free lunch — no value function, no retraining, often a 2-3x success rate
- Test-time controllability is underrated: being able to sweep the optimality-regularization tradeoff without retraining is enormously practical
- Supervised learning + guidance can go beyond imitation: this bridges the gap between behavioral cloning and RL in an elegant way
- Implications for robotics: diffusion/flow policies are already dominant in robot learning (pi0, etc.) — CFGRL suggests that guidance could be a simple way to improve them beyond the demonstration data
References
- [Paper] arXiv:2505.23458
- [Code] github.com/kvfrans/cfgrl
