Why On-Policy Data Matters in Post-Training
TL;DR
Modern LLM post-training is increasingly organized around a rough pipeline:
Pretraining -> SFT -> RL -> OPD
SFT teaches instruction following, response style, and output format. RL improves behaviors that can be verified by reward, such as math and code. OPD, or On-Policy Distillation, is becoming useful for merging expert capabilities into a final model. The central question is not just which algorithm is stronger. It is why RL and OPD often generalize better and forget less than SFT.
My takeaway from this blog post by wh is that the common ingredient is probably on-policy data. SFT trains on a fixed external dataset. RL and OPD train on states visited by the current student model. That difference aligns the training distribution with the inference distribution, reducing the classic imitation-learning problem where small early mistakes push the model into states the teacher never covered.
SFT: Dense Imitation of an External Distribution
SFT is usually cross-entropy training on a fixed set of demonstrations. In distribution terms, the model is pulled toward an external target distribution defined by the dataset. This is powerful when the model needs to learn a new format or basic instruction-following behavior, but it also explains why SFT can be blunt.
Every target token receives a gradient. A token can be a task-critical reasoning step, an operator in a proof, or a style marker such as “therefore” or “alright”; SFT still pushes the model to increase its probability. The supervision is dense and fairly uniform. Many parameters are updated across many tokens, including tokens that may only reflect the teacher’s incidental phrasing. This broad pressure can overwrite earlier representations and cause catastrophic forgetting.
From the KL perspective, SFT is often described as minimizing forward KL against the dataset distribution, since cross-entropy on dataset samples equals that forward KL up to a constant that does not depend on the model. That framing is useful, but the practical point is simpler: SFT keeps telling the model exactly which sequence to say. It does not directly optimize whether the model can reach a successful outcome under its own rollouts.
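To make the "every token gets pushed" point concrete, here is a minimal sketch in plain Python. The three-token vocabulary, the logits, and the function name `sft_token_loss_and_grad` are illustrative assumptions, not taken from the blog. It relies on the standard result that the gradient of cross-entropy with respect to the logits is the softmax output minus the one-hot target.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_token_loss_and_grad(logits, target_index):
    """Cross-entropy for one target token and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    # d(loss)/d(logit_i) = p_i - 1[i == target]
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return loss, grad

# Hypothetical vocabulary: the target here is a pure style token.
vocab = ["therefore", "so", "="]
logits = [0.0, 1.0, 0.5]  # toy student logits at one position (made up)
loss, grad = sft_token_loss_and_grad(logits, target_index=vocab.index("therefore"))

for token, g in zip(vocab, grad):
    print(f"{token:10s} grad={g:+.3f}")
```

Even though the target is only a style marker, every logit at that position receives a nonzero gradient. The loss has no notion of whether the token mattered for the task.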
RL: Success Over Exact Imitation
RL changes the target. In RLVR-style settings (reinforcement learning with verifiable rewards), the reward comes from whether the final answer is correct, whether the code passes tests, or whether some verifiable objective is achieved. The model is no longer required to match a reference answer token by token. It can find different reasoning paths as long as the behavior succeeds.
This is one reason RL tends to preserve more capability than SFT. The update is attached to sampled behavior and reward, not to every token in a fixed external corpus. Some papers interpret RL as reverse-KL-like or mode-seeking, and that intuition helps. But the blog argues that KL direction alone is incomplete. RLVR often continues to show relatively strong anti-forgetting behavior even when explicit KL penalties are weakened.
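The mode-seeking intuition can be checked numerically. The sketch below uses a made-up three-outcome target and two made-up candidate models, and simply compares the two KL directions. It illustrates the general property of forward versus reverse KL, not any experiment from the blog.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target: two good modes (outcomes 0 and 2) and a rare middle outcome.
target = [0.495, 0.01, 0.495]

# Two candidate models: one spreads mass everywhere, one commits to a single mode.
q_cover = [1 / 3, 1 / 3, 1 / 3]
q_mode = [0.98, 0.01, 0.01]

forward = {"cover": kl(target, q_cover), "mode": kl(target, q_mode)}
reverse = {"cover": kl(q_cover, target), "mode": kl(q_mode, target)}

forward_prefers = min(forward, key=forward.get)
reverse_prefers = min(reverse, key=reverse.get)
print(forward_prefers, reverse_prefers)  # prints "cover mode"
```

Forward KL penalizes the model for missing any region the target covers, so it favors the spread-out candidate. Reverse KL penalizes the model for putting mass where the target has little, so it favors committing to one mode.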
The deeper reason may be that RL samples from the current policy. The model learns on states it actually visits. If a rollout reaches a wrong intermediate state, training can still expose the model to that region and adjust behavior there. This makes the training process closer to deployment, where the model must condition on its own previous tokens.
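As a rough illustration of "sample from the current policy, reward success", here is a toy sketch. It uses a REINFORCE-style policy gradient with a group-mean baseline, which is a stand-in chosen for simplicity and not necessarily the algorithm the blog has in mind. The prompt, the four candidate rollouts, and all hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical rollouts for the prompt "7 + 5 = ?". Two different paths reach 12.
ROLLOUTS = ["7+5 -> 12", "5+7 -> 12", "7+5 -> 13", "7*5 -> 35"]

def verify(rollout):
    """Verifiable reward: 1.0 if the final answer is 12, regardless of the path."""
    return 1.0 if rollout.split("->")[1].strip() == "12" else 0.0

def rl_step(logits, rng, group_size=8, lr=0.5):
    probs = softmax(logits)
    # On-policy: sample from the current policy, not from a fixed dataset.
    samples = rng.choices(range(len(ROLLOUTS)), weights=probs, k=group_size)
    rewards = [verify(ROLLOUTS[i]) for i in samples]
    baseline = sum(rewards) / group_size  # group-mean baseline
    new_logits = list(logits)
    for i, r in zip(samples, rewards):
        advantage = r - baseline
        # Gradient of log pi(i) w.r.t. the logits is onehot(i) - probs.
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            new_logits[j] += lr * advantage * (indicator - probs[j]) / group_size
    return new_logits

def success_prob(logits):
    probs = softmax(logits)
    return sum(p for p, rollout in zip(probs, ROLLOUTS) if verify(rollout) == 1.0)

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # uniform toy policy
before = success_prob(logits)
for _ in range(200):
    logits = rl_step(logits, rng)
after = success_prob(logits)
print(f"success probability: {before:.2f} -> {after:.2f}")
```

Note that the reward never names a reference path. Both rollouts that end in 12 are reinforced, and the update only touches rollouts the policy actually sampled.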
OPD: Distillation on the Student’s Own States
OPD sits between SFT and RL. It resembles SFT because it uses token-level teacher supervision. It resembles RL because the data is generated by the student itself. The student first samples trajectories; a teacher then provides next-token distributions on the student’s prefixes; and training often looks like reverse-KL-style matching between student and teacher.
This combination matters. OPD has dense supervision, but the supervision is applied on on-policy prefixes. The teacher is not simply showing a perfect trajectory from its own distribution. It is giving advice on states the student currently reaches. That is why OPD can inherit part of RL’s anti-forgetting behavior even when the teacher was produced by SFT.
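The data flow described above can be sketched as follows. Both "models" are hypothetical lookup functions over a three-token vocabulary, standing in for real networks, and the loss is a plain per-token reverse KL. This is a sketch of the general shape under those assumptions, not the blog's exact recipe.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

def student_probs(prefix):
    # Hypothetical toy student: the distribution depends only on the last token.
    last = prefix[-1] if prefix else None
    return [0.2, 0.5, 0.3] if last == "a" else [0.6, 0.3, 0.1]

def teacher_probs(prefix):
    # Hypothetical toy teacher, queried on whatever prefix it is given.
    last = prefix[-1] if prefix else None
    return [0.1, 0.3, 0.6] if last == "a" else [0.3, 0.5, 0.2]

def reverse_kl(student, teacher):
    """KL(student || teacher) at a single position."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def opd_rollout(rng, max_len=8):
    prefix, losses = [], []
    for _ in range(max_len):
        s = student_probs(prefix)
        t = teacher_probs(prefix)  # teacher scores the student's own prefix
        losses.append(reverse_kl(s, t))
        token = rng.choices(VOCAB, weights=s, k=1)[0]  # the student picks the next token
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix, losses

trajectory, per_token_kl = opd_rollout(random.Random(0))
for token, k in zip(trajectory, per_token_kl):
    print(f"{token:6s} reverse_kl={k:.3f}")
```

The teacher never generates a trajectory here. It only supplies a distribution at each prefix the student reached, which is the difference from SFT on teacher-written demonstrations.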
The blog discusses OPSD, where high per-token KL can appear on style tokens such as “wait” or “alright” rather than mathematical tokens. If updates follow raw KL strength too aggressively, the student may learn style instead of capability. This is why per-token clipping becomes important: the dense signal is useful, but it must not let incidental style dominate the update.
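One simple reading of per-token clipping is to cap each token's contribution to the loss. The sketch below assumes that reading; the tokens, the KL values, and the threshold are all made up for illustration, and the blog may use a different clipping rule.

```python
def clip_per_token(kl_values, max_kl):
    """Cap each token's KL contribution at max_kl (a hypothetical threshold)."""
    return [min(k, max_kl) for k in kl_values]

def style_share(kl_values, style_positions):
    """Fraction of the total loss that comes from the given positions."""
    return sum(kl_values[i] for i in style_positions) / sum(kl_values)

# Made-up per-token KL values: style tokens have the largest raw KL.
tokens = ["wait", "x", "=", "12", "alright"]
raw_kl = [2.4, 0.3, 0.2, 0.5, 1.9]
style_positions = [0, 4]

clipped_kl = clip_per_token(raw_kl, max_kl=0.5)

share_before = style_share(raw_kl, style_positions)
share_after = style_share(clipped_kl, style_positions)
print(f"style share of loss: {share_before:.2f} -> {share_after:.2f}")
```

In this toy case the style tokens go from most of the loss to half of it, so the mathematical tokens are no longer drowned out. A hard cap is only one option; it trades away some signal on tokens where a large KL is actually meaningful.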
Why Can the Student Beat the Teacher?
One surprising result is that OPD students can sometimes outperform their teachers. In ordinary distillation, the student is expected to copy the teacher, so surpassing the teacher feels counterintuitive. OPD changes the setting because the teacher supervises the student’s own state distribution.
If the student makes mistakes the teacher rarely makes, teacher-generated trajectories may not train the student on the right prefixes. OPD does. The teacher corrects the states the student actually visits. In addition, KL matching transfers more than a single greedy answer. It can transfer uncertainty, alternative continuations, reasoning structure, and style. The student distribution may be reshaped in ways that improve sampling behavior without merely copying the teacher’s most likely output.
This is the distributional-shaping view. Post-training does not only edit a few examples. It changes the shape of a very high-dimensional probability distribution. Capability can improve because good trajectories become more likely, bad trajectories become less likely, and reasoning modes are reorganized, even when individual training samples do not look obviously optimal.
On-Policy Data as the Load-Bearing Ingredient
The key difference between SFT and on-policy methods is distribution alignment. In SFT, the model sees states visited by the teacher or dataset writer. At inference time, however, generation is autoregressive: the model conditions on its own previous tokens. Once it deviates early, the rest of the prefix may lie outside the teacher’s state distribution, and errors can compound.
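A back-of-the-envelope calculation shows how quickly this can bite. The sketch assumes each token independently has a small chance of stepping off the teacher's distribution, which is a simplification; the 1% error rate and the sequence lengths are arbitrary.

```python
def prob_still_on_distribution(per_token_error, num_tokens):
    """Probability of no deviation in num_tokens steps, assuming independent errors."""
    return (1.0 - per_token_error) ** num_tokens

# Arbitrary illustrative values: 1% chance of deviating at each token.
for length in (10, 100, 500):
    p = prob_still_on_distribution(0.01, length)
    print(f"{length:4d} tokens: {p:.3f}")
```

Under this assumption, a short answer almost always stays in covered territory, while a long reasoning trace almost never does. Most of a long rollout therefore happens in states that a fixed demonstration set never showed the model.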
RL and OPD reduce this mismatch because training data comes from the current model. The model is trained on its own prefixes, its own near-misses, and its own reachable states. This creates an implicit regularization: the update is constrained to regions near the current policy. Instead of dragging the model toward an arbitrary external distribution, on-policy training tends to move it toward a nearby task-solving policy.
This also explains why OPD can help merge expert capabilities. A specialized teacher may contain useful behavior, but the student should not simply inherit the teacher’s entire distribution. OPD lets the student ask, in effect: “Given the states I actually reach, how should my distribution be reshaped?”
What the Next Algorithm Might Need
The full post-training pipeline is still imperfect. SFT is useful for format and instruction following. RL is powerful when rewards are verifiable, but outcome rewards are sparse and expensive. OPD provides dense token-level signal, but teacher logits can be biased and can overemphasize style tokens.
The desired algorithm would combine three properties: dense supervision like distillation, low-bias optimization like RL on verifiable outcomes, and the distribution alignment of on-policy data. The exact algorithm is still unclear. But the shape of the problem is becoming clearer: the next step in post-training probably cannot abandon on-policy learning. On-policy data may be the part that lets models gain capability without paying for it through broad forgetting.
