Why On-Policy Data Matters in Post-Training

TL;DR

Modern LLM post-training is increasingly organized around a rough pipeline:

Pretraining -> SFT -> RL -> OPD

SFT teaches instruction following, response style, and output format. RL improves behaviors that can be verified by reward, such as math and code. OPD, or On-Policy Distillation, is becoming useful for merging expert capabilities into a final model. The central question is not just which algorithm is stronger. It is why RL and OPD often generalize better and forget less than SFT.

My takeaway from this blog post by wh is that the common ingredient is probably on-policy data. SFT trains on a fixed external dataset, while RL and OPD train on states visited by the current student model. That difference aligns the training distribution with the inference distribution and mitigates the classic distribution-shift problem in imitation learning: once the model deviates from the teacher’s trajectory at an early step, later prefixes can land in states the teacher never covered, and errors accumulate.

SFT: Dense Imitation of an External Distribution

SFT is usually cross-entropy training on a fixed set of demonstrations. In distribution terms, the model is pulled toward an external target distribution defined by the dataset. This is powerful when the model needs to learn a new format or basic instruction-following behavior, but it also explains why SFT can be blunt.

Every target token receives gradient. A token can be a task-critical reasoning step, an operator in a proof, or a style marker such as “therefore” or “alright”; SFT pushes the model to increase its probability either way. The supervision is dense and fairly uniform. Many parameters are updated across many tokens, including tokens that may only reflect the teacher’s incidental phrasing. This broad pressure can overwrite earlier representations and cause catastrophic forgetting.

From the KL perspective, SFT is often described as minimizing forward KL against the dataset distribution. That framing is useful, but the practical point is simpler: SFT keeps telling the model what exact sequence it should say. It does not directly optimize whether the model can reach a successful outcome under its own rollouts.
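The forward-KL framing can be checked numerically. The sketch below is a minimal illustration, not code from the blog: `p_data` and `q_model` are made-up next-token distributions over a three-token vocabulary. It shows that cross-entropy differs from forward KL only by the entropy of the data, which does not depend on the model, so minimizing one minimizes the other.

```python
import math

def cross_entropy(p_data, q_model):
    """Expected negative log-likelihood of the model under the data distribution."""
    return -sum(p * math.log(q) for p, q in zip(p_data, q_model) if p > 0)

def entropy(p):
    """Shannon entropy of a distribution, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def forward_kl(p_data, q_model):
    """KL(p_data || q_model): the expectation is taken under the data."""
    return sum(p * math.log(p / q) for p, q in zip(p_data, q_model) if p > 0)

# Hypothetical next-token distributions (illustrative values only).
p_data = [0.7, 0.2, 0.1]   # what the dataset says
q_model = [0.4, 0.4, 0.2]  # what the model currently predicts

ce = cross_entropy(p_data, q_model)
kl = forward_kl(p_data, q_model)
h = entropy(p_data)

# cross-entropy = forward KL + data entropy, and the entropy term is
# constant with respect to the model.
print(abs(ce - (kl + h)) < 1e-12)
```

Note that the expectation is taken over dataset tokens, never over the model’s own samples, which is the sense in which SFT is off-policy.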

RL: Success Over Exact Imitation

RL changes the target. In RLVR-style settings, the reward comes from whether the final answer is correct, whether the code passes tests, or whether some verifiable objective is achieved. The model is no longer required to match a reference answer token by token. It can find different reasoning paths as long as the behavior succeeds.

This is one reason RL tends to preserve more capability than SFT. The update is attached to sampled behavior and reward, not to every token in a fixed external corpus. Some papers interpret RL as reverse-KL-like or mode-seeking, and that intuition helps. But the blog argues that KL direction alone is an incomplete explanation: RLVR often continues to show relatively strong anti-forgetting behavior even when explicit KL penalties are weakened or removed.

The deeper reason may be that RL samples from the current policy. The model learns on states it actually visits. If a rollout reaches a wrong intermediate state, training can still expose the model to that region and adjust behavior there. This makes the training process closer to deployment, where the model must condition on its own previous tokens.
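A toy example may make the on-policy part concrete. The sketch below is an assumption-laden illustration rather than any particular RLVR recipe: the "policy" is a softmax over three candidate answers instead of a token-level language model, `verifier` is a hypothetical exact-match check, and the update is plain REINFORCE with a batch-mean baseline. The point is only that every sample the update sees is drawn from the current policy, and that reward depends on the outcome rather than on matching a reference.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def verifier(answer, correct_answer):
    """Hypothetical verifiable reward: 1 if the answer is right, else 0."""
    return 1.0 if answer == correct_answer else 0.0

def reinforce_step(logits, correct_answer, rng, batch_size=16, lr=0.5):
    probs = softmax(logits)
    # On-policy: samples come from the current policy, not a fixed dataset.
    samples = rng.choices(range(len(logits)), weights=probs, k=batch_size)
    rewards = [verifier(a, correct_answer) for a in samples]
    baseline = sum(rewards) / batch_size
    grad = [0.0] * len(logits)
    for a, r in zip(samples, rewards):
        advantage = r - baseline
        for i in range(len(logits)):
            # d log pi(a) / d logit_i = 1[i == a] - pi(i)
            grad[i] += advantage * ((1.0 if i == a else 0.0) - probs[i])
    return [l + lr * g / batch_size for l, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # start from a uniform policy
p_before = softmax(logits)[2]
for _ in range(200):
    logits = reinforce_step(logits, correct_answer=2, rng=rng)
p_after = softmax(logits)[2]
print(p_before, p_after)
```

Only answers the policy actually sampled contribute to the gradient, so the update stays tied to behavior the model currently produces.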

OPD: Distillation on the Student’s Own States

OPD sits between SFT and RL. It resembles SFT because it uses token-level teacher supervision. It resembles RL because the data is generated by the student itself. The student first samples trajectories; then a teacher provides distributions on the student’s prefixes; training often looks like reverse-KL-style matching between student and teacher.
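The sample-then-score loop described above can be sketched as follows. Everything here is hypothetical: `student_policy` and `teacher_policy` are hand-written toy distributions over a three-token vocabulary, and the loss is the full-distribution reverse KL at each prefix. Real implementations may instead estimate it from the log-ratio on the sampled token; the source does not specify which.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

def student_policy(prefix):
    # Hypothetical toy student: same distribution at every prefix.
    return [0.5, 0.3, 0.2]

def teacher_policy(prefix):
    # Hypothetical toy teacher: prefers "b" right after an "a".
    if prefix and prefix[-1] == "a":
        return [0.1, 0.7, 0.2]
    return [0.6, 0.2, 0.2]

def sample_rollout(policy, rng, max_len=8):
    """Sample a trajectory from the given policy until <eos> or max_len."""
    prefix = []
    while len(prefix) < max_len:
        tok = rng.choices(VOCAB, weights=policy(tuple(prefix)), k=1)[0]
        prefix.append(tok)
        if tok == "<eos>":
            break
    return prefix

def opd_per_token_loss(rollout):
    """Teacher scores the prefixes the student itself produced."""
    losses = []
    for t in range(len(rollout)):
        prefix = tuple(rollout[:t])
        losses.append(reverse_kl(student_policy(prefix), teacher_policy(prefix)))
    return losses

rng = random.Random(0)
rollout = sample_rollout(student_policy, rng)  # step 1: student samples
per_token = opd_per_token_loss(rollout)        # step 2: teacher scores prefixes
print(rollout, per_token)
```

The teacher never generates text here; it is only queried on prefixes drawn from the student, which is what separates this from SFT on teacher outputs.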

This combination matters. OPD has dense supervision, but the supervision is applied on on-policy prefixes. The teacher is not simply showing a perfect trajectory from its own distribution. It is giving advice on states the student currently reaches. That is why OPD can inherit part of RL’s anti-forgetting behavior even when the teacher was produced by SFT.

The blog discusses OPSD, where high per-token KL can appear on style tokens such as “wait” or “alright” rather than mathematical tokens. If updates follow raw KL strength too aggressively, the student may learn style instead of capability. This is why per-token clipping becomes important: the dense signal is useful, but it must not let incidental style dominate the update.
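As a rough illustration of why clipping helps, the sketch below caps each token’s contribution to the loss. The clipping rule, the threshold `max_kl`, the token list, and the KL values are all assumptions made up for this example; the source does not say exactly how the clipping is implemented.

```python
def clip_per_token(per_token_kl, max_kl):
    """Cap each token's loss contribution at max_kl (assumed clipping rule)."""
    return [min(k, max_kl) for k in per_token_kl]

def style_share(tokens, per_token_kl, style_tokens):
    """Fraction of the total loss that comes from style tokens."""
    total = sum(per_token_kl)
    style = sum(k for tok, k in zip(tokens, per_token_kl) if tok in style_tokens)
    return style / total

# Hypothetical rollout and per-token KL values (illustrative only).
tokens = ["wait", "x", "=", "3", "alright"]
per_token_kl = [2.4, 0.3, 0.1, 0.5, 1.9]
style_tokens = {"wait", "alright"}

clipped = clip_per_token(per_token_kl, max_kl=0.6)
share_before = style_share(tokens, per_token_kl, style_tokens)
share_after = style_share(tokens, clipped, style_tokens)
print(share_before, share_after)
```

In this made-up case the style tokens still contribute after clipping, but they no longer account for most of the update.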

Why Can the Student Beat the Teacher?

One surprising result is that OPD students can sometimes outperform their teachers. In ordinary distillation, the student is expected to copy the teacher, so surpassing the teacher feels unintuitive. OPD changes the setting because the teacher supervises the student’s own state distribution.

If the student makes mistakes the teacher rarely makes, teacher-generated trajectories may not train the student on the right prefixes. OPD does. The teacher corrects the states the student actually visits. In addition, KL matching transfers more than a single greedy answer. It can transfer uncertainty, alternative continuations, reasoning structure, and style. The student distribution may be reshaped in ways that improve sampling behavior without merely copying the teacher’s most likely output.

This is the distributional-shaping view. Post-training does not only edit a few examples. It changes the shape of a very high-dimensional probability distribution. Capability can improve because good trajectories become more likely, bad trajectories become less likely, and reasoning modes are reorganized, even when individual training samples do not look obviously optimal.

On-Policy Data as the Load-Bearing Ingredient

The key difference between SFT and on-policy methods is distribution alignment. In SFT, the model sees only states visited by the teacher or dataset writer. At inference time, however, generation is autoregressive: the model conditions on its own previous tokens. Once it makes an early deviation, the remaining prefix may lie outside the teacher’s state distribution, and errors can compound.
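A back-of-envelope calculation shows how quickly this can bite. The sketch assumes, as a deliberate simplification, that each generated token independently leaves the teacher-covered region with a fixed probability `eps`; real errors are neither independent nor uniform, and both `eps` and the horizon are illustrative numbers.

```python
def prob_on_teacher_states(eps, horizon):
    """Chance of never leaving teacher-covered states over `horizon` tokens,
    assuming an independent per-token deviation probability `eps`."""
    return (1.0 - eps) ** horizon

# Even a 1% per-token deviation rate leaves the model off-distribution
# for most long generations under this simplified model.
for horizon in (10, 100, 1000):
    print(horizon, prob_on_teacher_states(0.01, horizon))
```

Under these assumptions, a 1% per-token deviation rate leaves only about a 37% chance of staying on teacher-covered states for 100 tokens, which is why training on the model’s own prefixes matters more as generations get longer.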

RL and OPD reduce this mismatch because training data comes from the current model. The model is trained on its own prefixes, its own near-misses, and its own reachable states. This creates an implicit regularization: the update is constrained to regions near the current policy. Instead of dragging the model toward an arbitrary external distribution, on-policy training tends to move it toward a nearby task-solving policy.

This also explains why OPD can help merge expert capabilities. A specialized teacher may contain useful behavior, but the student should not simply inherit the teacher’s entire distribution. OPD lets the student ask, in effect: “Given the states I actually reach, how should my distribution be reshaped?”

What the Next Algorithm Might Need

The full post-training pipeline is still imperfect. SFT is useful for format and instruction following. RL is powerful when rewards are verifiable, but outcome rewards are sparse and expensive. OPD provides dense token-level signal, but teacher logits can be biased and can overemphasize style tokens.

The desired algorithm would combine three properties: dense supervision like distillation, low-bias optimization like RL on verifiable outcomes, and the distribution alignment of on-policy data. The exact algorithm is still unclear. But the shape of the problem is becoming clearer: the next step in post-training probably cannot abandon on-policy learning. On-policy data may be the part that lets models gain capability without paying for it through broad forgetting.
