[Paper Notes] CFGRL: Diffusion Guidance Is a Controllable Policy Improvement Operator
Published:
TL;DR
CFGRL derives a direct, principled connection between classifier-free guidance (CFG) in diffusion models and policy improvement in RL. The key insight: a policy can be factored as prior × optimality, and the guidance weight w in CFG provably controls the degree of policy improvement. This means:
- Train with the simplicity of supervised learning (conditional diffusion/flow matching)
- Get policy improvement for free at test time by tuning the guidance weight w, with no retraining needed
- Works as a drop-in replacement for advantage-weighted regression (AWR) in offline RL
- Especially powerful for goal-conditioned BC (GCBC): standard GCBC is just CFGRL with w=1; setting w>1 provably improves the policy, often doubling success rates, without learning any value function
Paper Info
- Title: Diffusion Guidance Is a Controllable Policy Improvement Operator
- Authors: Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine
- Affiliation: UC Berkeley
- Date: 2025-05-29
- Venue: arXiv preprint, under review
- arXiv: 2505.23458
- Code: github.com/kvfrans/cfgrl
1. Motivation
Two ends of the spectrum for learning from offline data:
| Approach | Training | Optimality | Scalability |
|---|---|---|---|
| Behavioral cloning | Simple (supervised) | Only as good as data | Excellent |
| Offline RL | Complex (value functions, policy gradients) | Can improve beyond data | Notoriously tricky |
Can we get policy improvement while keeping the simplicity of supervised learning?
2. Core Idea: Product Policies
Define an improved policy as a product of two factors:
\(\pi(a \mid s) \propto \hat{\pi}(a \mid s) \cdot f(A^{\hat{\pi}}(s,a))\)
where \(\hat{\pi}\) is the reference (behavior) policy and \(f\) is a non-negative, monotonically increasing function of the advantage \(A^{\hat{\pi}}(s,a)\).
Theorem 1 (Policy Improvement): If \(f\) is non-negative and non-decreasing in advantage, then the product policy \(\pi\) is guaranteed to improve over \(\hat{\pi}\):
\(J(\pi) \geq J(\hat{\pi})\)
Theorem 2 (Controllable Improvement): For \(0 \leq w_1 < w_2\), the attenuated product \(\pi_{w_2}(a \mid s) \propto \hat{\pi}(a \mid s) f(A(s,a))^{w_2}\) is a further improvement over \(\pi_{w_1}\):
\(J(\pi_{w_1}) \leq J(\pi_{w_2})\)
Higher w = more improvement (but also more divergence from reference → eventual distribution shift).
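To make Theorems 1 and 2 concrete, here is a minimal sketch on a single-state problem with three actions. The rewards, the reference probabilities, and the choice \(f = \exp\) are illustrative assumptions rather than values from the paper. In a single-state problem, \(J(\pi)\) is just the expected reward.

```python
import math

def product_policy(ref_probs, advantages, w):
    """Return pi_w(a) proportional to ref_probs[a] * exp(advantages[a]) ** w."""
    unnorm = [p * math.exp(w * adv) for p, adv in zip(ref_probs, advantages)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_reward(probs, rewards):
    """J(pi) for a single-state problem: the reward averaged over actions."""
    return sum(p * r for p, r in zip(probs, rewards))

# Toy problem with made-up numbers (assumption, not from the paper).
rewards = [0.0, 0.5, 1.0]
ref_probs = [0.6, 0.3, 0.1]                      # reference policy pi_hat
baseline = expected_reward(ref_probs, rewards)   # value of pi_hat
advantages = [r - baseline for r in rewards]     # A(s, a) = Q(s, a) - V(s)

# J(pi_w) for increasing w; w = 0 recovers the reference policy.
returns = [expected_reward(product_policy(ref_probs, advantages, w), rewards)
           for w in (0.0, 1.0, 2.0, 4.0)]
```

With these numbers, `returns` increases with w, and the w = 0 entry equals the value of the reference policy. The toy has no notion of distribution shift, so it shows only the improvement side of the tradeoff.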
3. Connection to Diffusion Guidance
The product policy’s score function decomposes additively:
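\(\nabla_a \log \pi(a \mid s) = \nabla_a \log \hat{\pi}(a \mid s) + \nabla_a \log p(o \mid s,a)\)
Here \(o\) is an optimality variable whose likelihood plays the role of the second factor, \(p(o \mid s,a) \propto f(A^{\hat{\pi}}(s,a))\). By Bayes’ rule, \(\nabla_a \log p(o \mid s,a) = \nabla_a \log \hat{\pi}(a \mid s, o) - \nabla_a \log \hat{\pi}(a \mid s)\), which is the classifier-free guidance trick. Raising the optimality factor to the power \(w\), as in Theorem 2, scales that term by \(w\):
\(\nabla_a \log \hat{\pi}(a \mid s) + w \cdot (\nabla_a \log \hat{\pi}(a \mid s, o) - \nabla_a \log \hat{\pi}(a \mid s))\)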
This is exactly the standard CFG formula! The guidance weight w directly controls the degree of policy improvement. Both the unconditional and conditional scores come from the same network trained with a simple flow matching objective:
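\(\mathcal{L}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D},\, a_0 \sim \mathcal{N}(0, I),\, t \sim \mathcal{U}(0,1)} \left\| v_\theta(a_t, t, s, o) - (a - a_0) \right\|^2\)
where \(a_t = (1 - t)\, a_0 + t\, a\) interpolates between the noise sample \(a_0\) and the dataset action \(a\), and \(o \in \{\emptyset, 0, 1\}\) is the optimality label (dropped to \(\emptyset\) 10% of the time so the same network also learns the unconditional model).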
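The sketch below builds one training example for this objective. It assumes a linear noise-to-action path and standard Gaussian noise. The helper names (`optimality_label`, `make_flow_example`, `squared_error`) and the labelling rule "o = 1 when the advantage is non-negative" are illustrative assumptions, not taken from the released code.

```python
import random

def optimality_label(advantage):
    """Assumed labelling rule: o = 1 for non-negative advantage, else 0."""
    return 1 if advantage >= 0 else 0

def make_flow_example(action, label, rng, drop_prob=0.1):
    """Build one regression example for the conditional flow-matching loss.

    action: list of floats (the dataset action a)
    label:  0 or 1 (or a goal); replaced by None, the empty label, with
            probability drop_prob so the network also learns the
            unconditional velocity.
    """
    a0 = [rng.gauss(0.0, 1.0) for _ in action]            # noise sample a_0
    t = rng.random()                                      # time in [0, 1)
    a_t = [(1.0 - t) * n + t * a for n, a in zip(a0, action)]
    target = [a - n for n, a in zip(a0, action)]          # velocity a - a_0
    if rng.random() < drop_prob:
        label = None
    return {"a_t": a_t, "t": t, "label": label, "target": target}

def squared_error(pred, target):
    """Per-example loss ||v_theta - (a - a_0)||^2 for a predicted velocity."""
    return sum((p - y) ** 2 for p, y in zip(pred, target))

rng = random.Random(0)
example = make_flow_example([0.2, -0.4], optimality_label(0.7), rng)
```

A real training loop would feed `a_t`, `t`, the state, and `label` to the velocity network and average `squared_error` over a batch. Every example carries unit weight regardless of its advantage.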
Why this matters vs. AWR
| Property | AWR | CFGRL |
|---|---|---|
| Temperature/weight | \(1/\beta\) (fixed at train time) | \(w\) (tunable at test time) |
| Gradient distribution | Dominated by outlier high-advantage samples | Even weighting across batch |
| Retraining needed to tune? | Yes | No |
| Empirical scaling | Saturates around \(1/\beta = 10\) | Continues improving beyond |
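The "gradient distribution" row can be illustrated with a small calculation. The sketch below assumes the usual exponential AWR weighting, \(\exp(A / \beta)\), normalized over a batch, and uses made-up advantages. The effective sample size is used here as an illustrative diagnostic for how many samples actually carry the gradient; it is not a metric from the paper.

```python
import math

def awr_weights(advantages, inv_temperature):
    """Normalized AWR weights, proportional to exp(A * (1 / beta))."""
    top = max(advantages)  # subtract the max for numerical stability
    raw = [math.exp(inv_temperature * (a - top)) for a in advantages]
    total = sum(raw)
    return [r / total for r in raw]

def effective_sample_size(weights):
    """(sum w)^2 / sum w^2: equals the batch size when weights are uniform."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

advantages = [-1.0, 0.0, 0.5, 2.0]   # made-up batch of advantages
for inv_temp in (0.0, 1.0, 10.0):
    ess = effective_sample_size(awr_weights(advantages, inv_temp))
    print(f"1/beta = {inv_temp:>4}: effective sample size = {ess:.2f}")
```

As \(1/\beta\) grows, the effective sample size falls from the full batch toward a single sample, which is the outlier-dominated regime the table describes. CFGRL avoids this by keeping unit weights in the loss and moving the advantage information into the conditioning label.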
4. Special Case: Goal-Conditioned BC (GCBC)
This is where CFGRL truly shines. Standard GCBC trains a goal-conditioned policy \(\pi(a \mid s,g)\). The paper makes the connection more explicit than most prior GCBC writeups:
\(\pi(a \mid s, g) = \frac{\hat{\pi}(a \mid s)\, p_\gamma(g \mid s,a)}{p_\gamma(g \mid s)} \propto \hat{\pi}(a \mid s) \cdot Q^{\hat{\pi}}(s, a, g)\)
The second factor satisfies the conditions of Theorem 1. Therefore:
- Standard GCBC = CFGRL with w=1 (implicit, no improvement)
- CFGRL with w>1 = provably improved GCBC — for free!
The sampling formula is simply:
\(\nabla_a \log \hat{\pi}(a \mid s) + w \cdot (\nabla_a \log \pi(a \mid s, g) - \nabla_a \log \hat{\pi}(a \mid s))\)
No value function needed. Just train a goal-conditioned flow policy and an unconditional flow policy (same network with dropout), then tune w at test time.
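A rough sketch of guided sampling, assuming the guidance formula is applied to flow velocities instead of scores and integrated with plain Euler steps. To stay self-contained, the trained network is replaced by a closed-form stand-in: the exact velocity field for a 1D unit-variance Gaussian action distribution under the path \(a_t = (1 - t)\, a_0 + t\, a\). The means and all function names are hypothetical.

```python
def gaussian_velocity(a_t, t, mean):
    """Exact flow-matching velocity for data ~ N(mean, 1) and noise ~ N(0, 1),
    with the linear path a_t = (1 - t) * a_0 + t * a."""
    var_t = (1.0 - t) ** 2 + t ** 2
    return mean + (2.0 * t - 1.0) / var_t * (a_t - t * mean)

def guided_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance applied to velocities."""
    return v_uncond + w * (v_cond - v_uncond)

def sample_action(a0, w, uncond_mean=0.0, goal_mean=1.0, steps=100):
    """Euler-integrate the guided flow from noise a0 (t=0) to an action (t=1)."""
    a, dt = a0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_u = gaussian_velocity(a, t, uncond_mean)   # label dropped
        v_c = gaussian_velocity(a, t, goal_mean)     # conditioned on goal g
        a += dt * guided_velocity(v_u, v_c, w)
    return a

# Same "model", three guidance weights, no retraining.
actions = {w: sample_action(a0=0.0, w=w) for w in (0.0, 1.0, 3.0)}
```

In this toy case the guided velocity is exactly the velocity field of a Gaussian centered at `uncond_mean + w * (goal_mean - uncond_mean)`. So w = 0 samples the unconditional policy, w = 1 recovers plain GCBC, and w > 1 extrapolates further in the direction the goal conditioning already points. A trained network will not have this closed form, which is why very large w eventually leaves the data distribution.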
5. Key Results
Offline RL (ExORL benchmark, with learned value function)
CFGRL outperforms AWR on most tasks:
| Task | AWR | CFGRL |
|---|---|---|
| walker-stand | 603 | 782 |
| walker-walk | 444 | 608 |
| walker-run | 247 | 282 |
| quadruped-run | 485 | 571 |
| cheetah-run | 168 | 216 |
| cheetah-run-backward | 146 | 262 |
| jaco-reach-top-right | 33 | 72 |
Goal-Conditioned BC (OGBench, no value function)
CFGRL as drop-in GCBC improvement (selected results, flat policies):
| Task | Flow GCBC | CFGRL | Improvement |
|---|---|---|---|
| pointmaze-large-navigate | 74 | 77 | +4% |
| pointmaze-giant-navigate | 4 | 30 | 7.5x |
| antmaze-medium-navigate | 42 | 53 | +26% |
| humanoidmaze-medium-navigate | 8 | 19 | 2.4x |
| visual-cube-single-play | 13 | 37 | 2.8x |
| visual-scene-play | 25 | 40 | +60% |
One detail I found especially convincing in the PDF is that these gains are not coming from heavy per-task retuning. For the main GCBC table, the authors use a single fixed guidance strength of w = 3 and still get broad improvements across state-based and pixel-based tasks.
With hierarchical policies (HCFGRL), gains are even larger:
| Task | Flow HGCBC | HCFGRL |
|---|---|---|
| antmaze-medium-navigate | 67 | 90 |
| antmaze-large-navigate | 61 | 78 |
| antmaze-giant-navigate | 14 | 38 |
| humanoidmaze-large-navigate | 11 | 38 |
| cube-double-play | 21 | 42 |
Scaling behavior
The guidance weight w provides a reliable knob:
- Performance steadily increases with w up to a divergence point
- The divergence point is further out than AWR’s temperature saturation
- w can be swept without retraining; just re-run inference
6. Strengths
- Elegant theory: clean, provable connection between CFG and policy improvement with formal guarantees (Theorems 1 & 2)
- Extreme simplicity: train conditional diffusion/flow model with supervised loss → tune w at test time → done
- No value function needed for GCBC setting; improvement literally comes for free
- Test-time tunable: unlike AWR where \(\beta\) is baked into training, w can be swept without retraining
- Fixes AWR’s gradient issue: even gradient magnitudes across batch vs. outlier-dominated in AWR
- Broad applicability: state-based, visual, hierarchical, offline RL, goal-conditioned settings
7. Limitations
- One-step improvement only: CFGRL provides one step of policy improvement over the reference, not iterative optimization — not a full RL algorithm
- Distribution shift at high w: theoretical guarantees hold but practical performance degrades when w is too large (divergence from reference)
- Assumes a separate value-learning story in offline RL: in the AWR-style setting, CFGRL replaces policy extraction, not the upstream Q / V training pipeline
- Requires offline data quality: still fundamentally limited by the support of the dataset — can improve suboptimal data but can’t discover entirely new behaviors
- Not SOTA offline RL: the authors explicitly note CFGRL is a tool (replacing AWR), not a complete offline RL system — advanced methods like policy gradients could extrapolate further
8. Takeaways
- CFG = policy improvement: diffusion guidance weight directly and provably controls the degree of policy improvement — one of the cleanest theory-to-practice connections in recent RL
- GCBC users: use guidance! Standard GCBC is CFGRL with w=1. Setting w>1 is a free lunch: no value function, no retraining, often 2-3x success rate
- Test-time controllability is underrated: being able to sweep the optimality-regularization tradeoff without retraining is enormously practical
- Supervised learning + guidance can go beyond imitation: this bridges the gap between behavioral cloning and RL in an elegant way
- Implications for robotics: diffusion/flow policies are already dominant in robot learning (pi0, etc.) — CFGRL suggests that guidance could be a simple way to improve them beyond the demonstration data
References
- [Paper] arXiv:2505.23458
- [Code] github.com/kvfrans/cfgrl
