[Paper Notes] ARM: Advantage Reward Modeling for Long-Horizon Manipulation
TL;DR
ARM trains a reward model for long-horizon manipulation, but the important move is subtle: it avoids asking humans or VLMs to assign an absolute progress score to every frame. Instead, it asks a simpler relative question: over a short interval, did the robot make progress, regress, or stay stagnant?
This gives the paper its core primitive:
\[y \in \{-1, 0, +1\}\]
where \(+1\) means Progressing, \(0\) means Stagnant, and \(-1\) means Regressing. A MIMO (multi-input multi-output) temporal transformer then turns these tri-state advantage labels into dense progress curves. Those curves are used to weight imitation learning data through Advantage-Weighted Behavior Cloning (AW-BC).
The result is strong on a real long-horizon towel-folding task. Standard BC with GR00T-N1.5 reaches 62.1% success. RA-BC with SARM reaches 78.5%. AW-BC with ARM reaches 99.4%, with better throughput and folding precision. My read: ARM is less about a new policy architecture and more about a scalable reward-supervision interface for messy, non-monotonic robot demonstrations.
Paper Info
The paper is “ARM: Advantage Reward Modeling for Long-Horizon Manipulation” by Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, and Hua Chen.
It appears on arXiv as arXiv:2604.03037, with v2 dated April 21, 2026. The project page is aiming1998.github.io/ARM.
Problem and Motivation
Long-horizon manipulation is hard for reinforcement learning because sparse rewards are too thin for credit assignment. A binary success signal at the end of a 120-second towel-folding episode does not tell the policy which grasp, pull, fold, recovery, or placement mattered.
Dense progress rewards are the usual answer, but they are expensive and brittle. A human annotator can label subtask boundaries, but this requires careful temporal localization. A VLM can be prompted to segment videos, but it is noisy, slow, and often lacks the geometric grounding needed for fine robot state changes. A scalar progress score also assumes something that is often false: that progress increases monotonically over time.
Real robot demonstrations contain backtracking, corrections, pauses, and recovery motions. In towel folding, for example, a local adjustment can temporarily move the towel away from the final shape while still being necessary for the next fold. A progress model that only knows “later means better” will punish exactly the kind of recovery behavior that long-horizon manipulation needs.
ARM reframes the reward-labeling problem around relative advantage. Instead of asking “how complete is this frame?”, it asks “did this interval improve the state relative to the recent past?” That makes the annotation more local, more robust to non-monotonic behavior, and easier to scale.
Core Idea: Tri-State Advantage Labels
The labeling scheme has three classes:
- Progressing (+1): the state advances toward the task goal.
- Regressing (-1): the state deviates from the goal, hits an error, or moves toward failure.
- Stagnant (0): no meaningful progress is made, such as waiting or idle motion.
This is much easier than asking for frame-level normalized progress \(P \in [0, 1]\). Humans do not need to decide whether a towel is 0.43 or 0.47 complete. They only need to judge the local direction of change.
The paper also uses this tri-state supervision as a cold start. After the initial human labels, ARM can run over large unlabeled trajectory datasets and generate pseudo-labels automatically. This is where the approach becomes more scalable than manual subtask segmentation.
Advantage Reward Model Architecture
ARM is a multimodal temporal reward model. For each timestep, it combines:
- CLIP visual features from video frames,
- robot proprioceptive state,
- task instruction text.
Each input is projected into a shared latent space:
\[x_i = \mathrm{MLP}(v_i) + \mathrm{MLP}(s_i) + \mathrm{MLP}(g)\]
where \(v_i\) is the visual feature, \(s_i\) is the robot state, and \(g\) is the language task embedding.
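The additive fusion above can be sketched in a few lines of plain Python. This is an illustration only: each "MLP" is reduced to a single linear layer, and the helper names (`make_linear`, `apply_linear`, `fuse_tokens`) and the toy dimensions are my own assumptions, not taken from the paper.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Build (weights, bias) for one linear layer, a stand-in for a full MLP."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def apply_linear(layer, x):
    """Compute W x + b for a single input vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def fuse_tokens(visual, states, goal, proj_v, proj_s, proj_g):
    """Return one fused token x_i per timestep: proj(v_i) + proj(s_i) + proj(g)."""
    g = apply_linear(proj_g, goal)  # the instruction embedding is shared by all timesteps
    tokens = []
    for v_i, s_i in zip(visual, states):
        pv = apply_linear(proj_v, v_i)
        ps = apply_linear(proj_s, s_i)
        tokens.append([a + b + c for a, b, c in zip(pv, ps, g)])
    return tokens

# Toy usage: 3 timesteps, 8-dim visual features (hypothetical), 14-dim state, 6-dim text embedding.
rng = random.Random(0)
latent_dim = 4  # hypothetical latent size
proj_v = make_linear(8, latent_dim, rng)
proj_s = make_linear(14, latent_dim, rng)
proj_g = make_linear(6, latent_dim, rng)
tokens = fuse_tokens([[0.1] * 8] * 3, [[0.0] * 14] * 3, [1.0] * 6, proj_v, proj_s, proj_g)
```

Summing the projections keeps one token per timestep, so the sequence length seen by the Transformer depends only on the window size, not on the number of modalities.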
The model uses an 8-layer Transformer encoder over a causal window:
\[W_t = \{o_{t-4k}, \dots, o_t\}\]
The key architectural choice is MIMO: Multi-Input Multi-Output. Instead of producing one scalar progress estimate for one window, ARM predicts multiple interval advantage labels in one forward pass. This lets the model share temporal context across adjacent predictions and reduces redundant sliding-window inference.
ARM has two heads:
- Multi-frame advantage classification head. Predicts tri-state interval transitions \(\Delta \hat{y}\) between consecutive hidden states.
- Task completion head. Predicts whether a state is a successful terminal state.
The total loss is:
\[\mathcal{L}_{ARM} = \lambda_{int}\mathcal{L}_{int} + \lambda_{succ}\mathcal{L}_{succ}\]
The interval loss uses cross entropy over tri-state labels. The completion loss uses focal loss, because successful terminal states are rare in long continuous trajectories.
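To make the two-term objective concrete, here is a minimal sketch in plain Python. The focal loss follows the standard binary form; the values of `gamma`, `alpha`, `lambda_int`, and `lambda_succ` are placeholders I picked for illustration, since these notes do not record the paper's settings, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interval_cross_entropy(logits, label):
    """Cross entropy for one interval; label is -1, 0, or +1 (class index = label + 1)."""
    probs = softmax(logits)
    return -math.log(probs[label + 1])

def focal_loss(p_success, target, gamma=2.0, alpha=0.25):
    """Binary focal loss; gamma and alpha are assumed defaults, not the paper's values."""
    p_t = p_success if target == 1 else 1.0 - p_success
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def arm_loss(interval_logits, interval_labels, p_success, success_target,
             lambda_int=1.0, lambda_succ=1.0):
    """Weighted sum of the mean interval loss and the completion loss."""
    l_int = sum(
        interval_cross_entropy(z, y) for z, y in zip(interval_logits, interval_labels)
    ) / len(interval_labels)
    l_succ = focal_loss(p_success, success_target)
    return lambda_int * l_int + lambda_succ * l_succ
```

The `(1 - p_t) ** gamma` factor shrinks the loss on easy, confidently classified frames, which is why focal loss suits a head whose positive class (a successful terminal state) appears only rarely in each episode.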
Global Progress Reconstruction
Tri-state labels are local. Policy learning, however, benefits from a dense progress signal over the whole episode. ARM reconstructs such a curve in three steps.
First, the MIMO model runs over clipped video segments in parallel, predicting interval advantages efficiently.
Second, segments are aligned and padded when needed, with synthetic padding ignored during final aggregation.
Third, the predicted relative transitions are integrated into a global progress curve \(P_t\), anchored by the task completion head. A successful terminal frame provides the absolute anchor, such as \(P_T = 1.0\), while earlier frames are reconstructed from accumulated predicted gains.
This turns local labels such as “progressing” or “regressing” into a dense reward-like signal that can detect dips, pauses, and recoveries instead of forcing a staircase-shaped subtask curve.
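The integration step can be illustrated with a toy sketch. It assumes each predicted label contributes one unit of signed progress and that a detected successful terminal frame rescales the curve so the last frame equals 1.0. The paper's actual aggregation (segment alignment, padding masks, soft predictions) is more involved, and the function name `reconstruct_progress` is hypothetical.

```python
def reconstruct_progress(advantages, terminal_success):
    """Integrate tri-state interval labels (-1, 0, +1) into a per-frame progress curve.

    advantages: one label per interval, so T - 1 labels yield T frames.
    terminal_success: whether the completion head flags the last frame as success.
    """
    raw = [0.0]
    for a in advantages:
        raw.append(raw[-1] + a)  # cumulative signed progress
    if terminal_success and raw[-1] > 0:
        scale = 1.0 / raw[-1]  # anchor the final frame at P_T = 1.0
    else:
        # Assumption: without an anchor, normalize by the number of intervals.
        scale = 1.0 / max(len(advantages), 1)
    return [r * scale for r in raw]

# Two steps forward, a pause, one step back, then two steps forward.
curve = reconstruct_progress([1, 1, 0, -1, 1, 1], terminal_success=True)
```

In this toy episode the curve rises, plateaus during the pause, dips during the regression, and then recovers to end at 1.0, which is the non-monotonic shape a staircase-style subtask curve cannot express.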
AW-BC: Using ARM for Policy Improvement
The downstream policy training method is Advantage-Weighted Behavior Cloning (AW-BC). The idea is simple: use ARM’s reconstructed progress gains to weight action chunks. Good transitions get high weight; regressive or low-value transitions are suppressed.
For an action chunk with horizon \(H\), ARM defines a length-adaptive gain:
\[\Delta G_t = (P_{t+H} - P_t) \cdot \frac{L_{seq}}{\bar{L}}\]
Here, \(P_t\) is reconstructed progress, \(L_{seq}\) is the current episode length, and \(\bar{L}\) is the average episode length. This normalization helps avoid a bias where short episodes get artificially steep progress slopes.
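A small sketch of the gain computation, under one assumption the formula leaves open: near the end of an episode, where \(t + H\) runs past the last frame, the chunk end is clamped to the final frame. The function name `length_adaptive_gains` is illustrative.

```python
def length_adaptive_gains(progress, horizon, mean_length):
    """Compute (P[t+H] - P[t]) * L_seq / L_bar for every chunk start t."""
    seq_len = len(progress)
    scale = seq_len / mean_length  # L_seq / L_bar
    gains = []
    for t in range(seq_len):
        end = min(t + horizon, seq_len - 1)  # assumption: clamp at the final frame
        gains.append((progress[end] - progress[t]) * scale)
    return gains

# A 5-frame episode in a dataset whose average episode length is 10 frames.
gains = length_adaptive_gains([0.0, 0.25, 0.5, 0.75, 1.0], horizon=2, mean_length=10)
# gains == [0.25, 0.25, 0.25, 0.125, 0.0]
```

Without the scale factor, this short episode would report a gain of 0.5 per full chunk simply because it finishes in half the average time. The factor of 0.5 brings it in line with an average-length episode that makes the same overall progress.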
Weights are computed from the batch gain distribution. If \(\mu\) and \(\sigma\) are the mean and standard deviation of gains, the paper clips gains between \(b_{lower} = \mu - 2\sigma\) and \(b_{upper} = \mu + 2\sigma\), then maps them into \([0, 1]\):
\[\tilde{w}_i = \mathrm{clamp} \left( \frac{\Delta G_i - b_{lower}}{b_{upper} - b_{lower} + \epsilon}, 0, 1 \right)\]
where \(\epsilon\) is a small constant that guards against division by zero. The weighted BC objective is:
\[\mathcal{L}_{\text{AW-BC}}(\theta) = \mathbb{E}_{(s,a)\sim\mathcal{D}} [-\tilde{w}(s,a)\log \pi_\theta(a \mid s)]\]
This is offline policy improvement in the spirit of advantage-weighted regression: stay close to the data, but prioritize the parts of the data that actually move the task forward.
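The weighting and the weighted objective can be sketched as below. Two things are assumptions on my part: the use of the population standard deviation for \(\sigma\), and the representation of the policy by precomputed per-sample log-probabilities. The function names are hypothetical.

```python
import statistics

def advantage_weights(gains, eps=1e-8):
    """Map batch gains into [0, 1] using mu +/- 2 sigma clipping bounds."""
    mu = statistics.fmean(gains)
    sigma = statistics.pstdev(gains)  # assumption: population standard deviation
    b_lower = mu - 2.0 * sigma
    b_upper = mu + 2.0 * sigma
    return [
        min(max((g - b_lower) / (b_upper - b_lower + eps), 0.0), 1.0)
        for g in gains
    ]

def aw_bc_loss(weights, log_probs):
    """Batch estimate of E[-w * log pi(a|s)] from per-sample log-probabilities."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)

# One extreme gain among many zeros saturates at weight 1.0 instead of dominating the batch.
weights = advantage_weights([0.0] * 20 + [100.0])
loss = aw_bc_loss([1.0, 0.0], [-1.0, -5.0])  # only the first sample contributes
```

Because the bounds come from the batch statistics, a sample with average gain lands near weight 0.5, and only clearly regressive chunks are pushed toward zero.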
Task and Dataset
The main benchmark is a real-world bimanual towel-folding task using an AgileX ALOHA-style teleoperation setup. A successful episode requires an 8-stage sequence:
- extract exactly one towel from a cluttered pile;
- place it on the central tabletop;
- flatten it into a planar initial state;
- perform a bottom-to-up longitudinal fold;
- perform a top-to-bottom longitudinal fold;
- perform a right-to-center lateral fold;
- perform a left-to-right lateral fold;
- transport and deposit the folded towel into a target box.
The trial must complete within 120 seconds, with a single towel neatly folded and fully inside the box.
The dataset contains 972 towel-folding episodes, about 20 hours total:
- 809 expert demonstrations,
- 163 DAgger-augmented error-correction episodes.
Unlike approaches that discard slow or messy trajectories, ARM keeps them because they contain valuable recovery behavior.
The real robot setup uses three RGB views: a high global view plus left and right wrist cameras. The proprioceptive state and action are both 14-dimensional, covering bimanual joint positions and gripper states.
Reward Model Results
ARM is compared with SARM on reward reconstruction and terminal success identification.
| Metric | SARM | ARM |
|---|---|---|
| Progress reconstruction MSE ↓ | 0.0059 | 0.0014 |
| Standard episode terminal ID | 83.3% (10/12) | 100.0% (12/12) |
| Failure episode terminal ID | 91.6% (11/12) | 100.0% (12/12) |
Qualitatively, SARM produces stepped progress curves around subtask boundaries. ARM produces smoother dense curves and captures temporary downward dips during regressive adjustments. This is exactly what a long-horizon reward model should do: penalize regressions, but not erase them from the learning signal.
Labeling and Inference Efficiency
The annotation throughput comparison is one of the paper’s clearest practical wins:
| Annotation protocol | Samples per 8 hours |
|---|---|
| Human baseline subtask segmentation | 100 |
| Human tri-state labeling | 250 |
| VLM labeling with Qwen3-VL | about 3,000 |
| Auto tri-state labeling with ARM | more than 400,000 |
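Tri-state labeling is 2.5x faster for humans than subtask segmentation (250 vs. 100 samples per 8 hours). After automation, ARM labels more than 400,000 samples in the same time, over 130x the roughly 3,000 produced by Qwen3-VL.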
The MIMO architecture is also much faster:
| Method | Architecture | Throughput |
|---|---|---|
| Qwen3-VL | MISO | 1.03 it/s |
| SARM baseline | SISO | 3.9 it/s |
| ARM | MIMO | 14.1 it/s |
The paper computes ARM’s effective speed from 5 parallel outputs per input. This matters because reward labeling becomes a dataset-scale operation, not a small manual step.
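As a quick sanity check on the throughput table, the relative speedups work out as follows. The last term assumes the effective rate is simply the raw forward-pass rate multiplied by the 5 outputs per pass, which is my reading of the accounting rather than a formula stated in these notes.

\[\frac{14.1}{3.9} \approx 3.6\times \text{ (vs. SARM)}, \qquad \frac{14.1}{1.03} \approx 13.7\times \text{ (vs. Qwen3-VL)}, \qquad \frac{14.1}{5} \approx 2.8 \text{ forward passes/s}\]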
Downstream Policy Results
The downstream policy uses GR00T-N1.5-3B with a DiT flow-matching action head. Training uses 32 NVIDIA A100 GPUs, BF16 mixed precision, action horizon 32, and three \(224 \times 224\) camera views.
The performance comparison is:
| Method | Success Rate | Throughput | Folding Precision |
|---|---|---|---|
| BC baseline with GR00T-N1.5 | 62.1% | 18 episodes/hr | 2.2 |
| RA-BC with GR00T + SARM | 78.5% | 24 episodes/hr | 2.7 |
| AW-BC with GR00T + ARM | 99.4% | 32 episodes/hr | 3.6 |
The ablation separates the two main contributions:
| Method | Labeling | Training | Success |
|---|---|---|---|
| SARM | task segmentation | RA-BC | 78.5% |
| ARM | tri-state | RA-BC | 92.3% |
| ARM | tri-state | AW-BC | 99.4% |
This shows two effects. First, tri-state labels improve reward quality even under the older RA-BC training recipe, lifting success from 78.5% to 92.3% (+13.8 points). Second, AW-BC adds another 7.1 points, from 92.3% to 99.4%, by turning dense relative gains into more effective action weights.
Strengths
The main strength is the label interface. ARM asks annotators for a cognitively simple judgment that still captures the structure long-horizon manipulation needs. “Better, worse, or same” is much easier to scale than frame-level progress scoring or precise subtask boundary labeling.
The method also treats regressions as first-class events. This matters for real manipulation, where recovery is not an anomaly. If a robot adjusts a towel edge before folding, a reward model should recognize the temporary regression and still produce a coherent global signal.
The integration with policy training is clean. AW-BC does not require online environment interaction or hand-designed rewards. It turns a noisy demonstration dataset into a weighted imitation dataset where high-advantage chunks matter more.
Finally, the paper is strong on systems details: dataset size, annotation throughput, reward-model throughput, policy training hardware, camera views, action dimensions, and precision scoring protocol are all specified.
Limitations
The main limitation is task scope. The results are compelling, but they focus on one long-horizon towel-folding setup. It remains unclear how well the same tri-state reward model transfers across very different manipulation categories, objects, or embodiments.
ARM still needs an initial supervised seed. The paper reduces annotation cost, but does not eliminate human supervision entirely.
The downstream policy training is compute-heavy: GR00T-N1.5 policy training uses 32 A100 GPUs. The reward model itself is lighter, using 2 A100 GPUs, but the full pipeline is still a serious systems setup.
The method also depends on visual and proprioceptive observability. If a key task variable is hidden from both the cameras and the robot state, the tri-state reward model may misjudge progress.
Takeaways
My takeaway is that ARM is a practical answer to the reward engineering bottleneck in long-horizon robot learning. The paper’s best idea is not a more complicated reward scalar. It is making the supervision question simpler:
Did this short interval help, hurt, or do nothing?
That small shift makes reward modeling cheaper, less monotonicity-biased, and more compatible with messy demonstrations. For research taxonomy, I would label this paper:
Reward Modeling / Long-Horizon Manipulation / Advantage-Weighted Imitation / VLA Policy Refinement
The idea I would reuse first is tri-state advantage labeling. It feels like a nice middle ground between sparse success labels and full dense reward engineering: enough signal for credit assignment, but simple enough to scale.
