[Paper Notes] A Review of Learning-Based Dynamics Models for Robotic Manipulation (Science Robotics 2025)

Published: February 24, 2026

TL;DR

This Science Robotics review is a strong, robotics-centered survey of learning-based dynamics models for manipulation, with one especially useful organizing idea:

the design space is largely shaped by the state representation

The paper builds a clear taxonomy from less-structured to more-structured representations:

pixels
latent states
3D particles
keypoints
object-centric states

and analyzes trade-offs across:

perception difficulty
inductive bias / sample efficiency
generalization
interpretability
control-time computational cost

If you work on world models / model-based control for manipulation, this review is worth reading because it connects representation choice -> dynamics architecture -> control method -> task suitability in a very practical way.

Paper Info

Title: A review of learning-based dynamics models for robotic manipulation
Authors: Bo Ai, Stephen Tian, Haochen Shi, Yixuan Wang, Tobias Pfaff, Cheston Tan, Henrik I. Christensen, Hao Su, Jiajun Wu, Yunzhu Li
Venue: Science Robotics (Review article)
Publication date: 2025-09-17
DOI: 10.1126/scirobotics.adt1497

1. Why This Review Matters

There are many papers on world models, learned simulators, and model-based control, but they often focus on a single object type, a single sensing setup, or one task family. This review is valuable because it asks a broader robotics question:

How should we design learned dynamics models for manipulation when the environment is partially observable, contact-rich, and task-dependent?

The answer the paper emphasizes is not “use architecture X.” Instead, the authors argue that a major design choice is the state representation used by the perception+dynamics+control stack. That framing is practical and reusable.

2. Core Framework: Perception, Dynamics, Control

The paper formalizes manipulation under a POMDP and decomposes learned dynamics pipelines into three modules:

Perception module g: estimate a task-relevant state s_t from observations (and possibly action/history)
Dynamics module \hat{T}: predict state transitions s_t -> s_{t+1} given action a_t
Control module \pi: planning or policy learning using the learned dynamics model

This decomposition is simple but important. In practice, many failures come from the interaction between these modules, especially when a representation is good for one stage (e.g., efficient dynamics learning) but hard for another (e.g., robust state estimation).
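To make the decomposition concrete, here is a minimal sketch of the three modules as plain Python callables. Everything in it is an illustrative assumption rather than the paper's code: the class name `DynamicsPipeline`, the toy functions, and the exhaustive-scoring controller are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

State = List[float]
Action = List[float]
Observation = List[float]


@dataclass
class DynamicsPipeline:
    perceive: Callable[[Observation], State]   # plays the role of g
    predict: Callable[[State, Action], State]  # plays the role of T-hat

    def rollout(self, obs: Observation, actions: Sequence[Action]) -> List[State]:
        """Estimate the state once, then predict forward open loop."""
        state = self.perceive(obs)
        states = [state]
        for action in actions:
            state = self.predict(state, action)
            states.append(state)
        return states

    def choose(self, obs, candidates, cost):
        """Toy control module: score each candidate action sequence by the
        cost of its predicted final state and return the cheapest one."""
        return min(candidates, key=lambda acts: cost(self.rollout(obs, acts)[-1]))


def toy_perceive(obs: Observation) -> State:
    # Hypothetical perception: keep the 2D position, drop a distractor channel.
    return list(obs[:2])


def toy_predict(state: State, action: Action) -> State:
    # Hypothetical dynamics: the action is a displacement.
    return [s + a for s, a in zip(state, action)]


pipeline = DynamicsPipeline(toy_perceive, toy_predict)
goal = [1.0, 0.0]
candidates = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
best = pipeline.choose(
    [0.0, 1.0, 9.9],
    candidates,
    cost=lambda s: sum((x - g) ** 2 for x, g in zip(s, goal)),
)
```

Note that the controller only ever sees what `perceive` returns, so an error in state estimation propagates through every predicted step. This is the module interaction the paragraph above describes.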

3. Main Taxonomy: State Representations (The Most Useful Part)

The review organizes methods by state representation and repeatedly highlights a central trade-off:

more structure usually improves inductive bias and sample efficiency
but often makes perception/state estimation harder

3.1 Pixel representations (2D pixel space)

Pixel-based models treat dynamics learning as action-conditioned video prediction.

Strengths:

minimal explicit state-estimation pipeline
broad applicability to many modalities (RGB, depth, tactile images, density fields)
can leverage large-scale video modeling advances (transformers, diffusion)

Weaknesses:

high-dimensional prediction space -> data hungry
can hallucinate under partial observability
expensive for high-frequency control
standard video metrics often do not align with control quality

My takeaway: pixel models are attractive for their generality, but reliable control remains difficult without stronger priors or much more data.

3.2 Latent representations

Latent-state models compress observations into a lower-dimensional z_t, then predict dynamics in latent space.

The review nicely separates:

reconstruction-based representation learning
reconstruction-free objectives (e.g., inverse dynamics, contrastive, reward-predictive)

and discusses probabilistic vs deterministic latent dynamics (e.g., RSSMs vs MLP/CNN predictors).
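As a rough illustration of a reconstruction-free objective with a deterministic latent model, the sketch below computes a latent consistency loss: the predicted next latent is compared with the encoding of the next observation, and no pixels are decoded. The linear encoder `W`, the linear dynamics `A` and `B`, and all function names are assumptions made for this example, not components taken from the review.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(obs: Vector, W: Matrix) -> Vector:
    # Hypothetical linear encoder: z = W @ obs.
    return matvec(W, obs)


def latent_step(z: Vector, action: Vector, A: Matrix, B: Matrix) -> Vector:
    # Deterministic latent dynamics: z' = A @ z + B @ action.
    return [za + zb for za, zb in zip(matvec(A, z), matvec(B, action))]


def consistency_loss(obs, action, next_obs, W, A, B) -> float:
    # Squared error between predicted and encoded next latent (no decoder).
    predicted = latent_step(encode(obs, W), action, A, B)
    target = encode(next_obs, W)
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
W_good = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # ignores the third (distractor) channel
W_collapsed = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # maps every observation to zero

obs, action, next_obs = [0.0, 0.0, 7.0], [1.0, 2.0], [1.0, 2.0, -3.0]

good_loss = consistency_loss(obs, action, next_obs, W_good, identity, identity)
bad_loss = consistency_loss(obs, action, next_obs, W_good, identity, zeros)
collapsed_loss = consistency_loss(obs, action, next_obs, W_collapsed, identity, zeros)
```

The collapsed encoder reaches zero loss while carrying no information. That is one reason this kind of objective is usually paired with an extra signal such as inverse dynamics, a contrastive term, or reward prediction.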

Strengths:

efficient control-time inference
good sample efficiency when latent structure is well chosen
widely used in real-world model-based RL / manipulation

Weaknesses:

latent quality depends heavily on training objective
task-specific objectives may hurt transfer
generalization across object counts / scene configurations is still limited

3.3 3D particle representations

Particle representations explicitly encode geometry and local interactions, making them especially strong for deformable / nonrigid manipulation.

Common modeling choices:

GNN-based particle interaction models (e.g., DPI-Net / GNS-style families)
convolutional particle interaction architectures (e.g., SPNets-style)

Strengths:

strong physical inductive bias
good sample efficiency
good fit for deformable objects, granular materials, fluids
natural integration with multimodal sensing (vision + touch)

Weaknesses:

state estimation from observations is hard (occlusion, tracking, correspondences)
scalability/cost issues for dense graphs

This is a recurring theme in the review: particle models can be excellent dynamics models, but perception can become the bottleneck.
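For intuition, a single step of a particle interaction model can be sketched as: connect nearby particles, compute a message along each edge, sum the incoming messages, and update each particle. In the sketch below a hand-written `repulsion` function stands in for what would be a learned edge network in a GNN-style model. All names, the radius, and the step size are assumptions for illustration only.

```python
import math
from typing import Callable, List, Tuple

Vec = Tuple[float, float]


def build_edges(positions: List[Vec], radius: float) -> List[Tuple[int, int]]:
    """Connect every ordered pair of distinct particles within `radius`.
    The all-pairs loop is quadratic in the number of particles."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) <= radius:
                edges.append((i, j))  # particle j sends a message to particle i
    return edges


def step(positions: List[Vec], radius: float,
         message_fn: Callable[[Vec, Vec], Vec], dt: float = 0.1) -> List[Vec]:
    """One message-passing update: sum incoming messages, then move."""
    aggregated = [[0.0, 0.0] for _ in positions]
    for i, j in build_edges(positions, radius):
        mx, my = message_fn(positions[i], positions[j])
        aggregated[i][0] += mx
        aggregated[i][1] += my
    return [(p[0] + dt * a[0], p[1] + dt * a[1])
            for p, a in zip(positions, aggregated)]


def repulsion(receiver: Vec, sender: Vec) -> Vec:
    # Hypothetical stand-in for a learned edge function:
    # a unit push on the receiver, directed away from the sender.
    dx, dy = receiver[0] - sender[0], receiver[1] - sender[1]
    d = math.hypot(dx, dy) + 1e-8
    return (dx / d, dy / d)


particles = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
next_particles = step(particles, radius=1.0, message_fn=repulsion)
```

The two nearby particles push apart and the isolated one stays put. The sketch also assumes particle positions are already known, which is exactly the state-estimation step that becomes the bottleneck with real sensors.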

3.4 Keypoint representations

Keypoints are sparse, task-relevant points (2D/3D) with implicit or explicit semantics.

The review covers:

supervised keypoint learning
unsupervised keypoint discovery
zero-shot keypoint detection using vision foundation models (CLIP/DINO-style features, etc.)

Strengths:

compact and efficient
often good for control and real-time planning
can generalize across object instances when keypoints capture consistent task structure

Weaknesses:

sensitive to occlusion and temporal consistency errors
keypoint extraction quality is critical

3.5 Object-centric representations

Object-centric models represent scenes as discrete interacting entities and explicitly model relations.

Strengths:

good for multi-object reasoning and compositional generalization
natural fit for graph-based relational dynamics
high-level abstraction often matches rearrangement/manipulation tasks

Weaknesses:

difficult perception problem (instance segmentation, inverse rendering, object proposals)
less suitable for highly deformable/continuous materials

4. Representation Choice Is Really a Control Design Choice

One of the best messages in the paper is that representation choice is not just a perception or modeling preference. It directly affects control:

planning stability
computational cost
whether gradients are useful
how badly model errors are exploited during optimization

The review discusses two main control paradigms:

motion planning (path planning + trajectory optimization; e.g., random search, CEM, MPPI, gradient-based optimization)
policy learning (including model-based RL and goal-conditioned policy training from learned rollouts)

The practical insight is that different representations pair naturally with different control styles. For example:

compact latents/keypoints can support fast iterative control
particle models can offer better physical fidelity for deformables but may be heavier
object-centric models can help planning in multi-object tasks
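To show how a sampling-based planner consumes a dynamics model, here is a minimal cross-entropy method (CEM) sketch over a one-dimensional toy system. The dynamics and cost functions, the hyperparameters, and the names `cem_plan` and `rollout_cost` are all assumptions chosen for illustration. A real system would plug in its learned model and usually replan at every control step.

```python
import random
import statistics
from typing import Callable, List


def rollout_cost(dynamics: Callable[[float, float], float],
                 cost: Callable[[float], float],
                 state: float, actions: List[float]) -> float:
    """Sum the per-step cost of the states predicted by the model."""
    total = 0.0
    for action in actions:
        state = dynamics(state, action)
        total += cost(state)
    return total


def cem_plan(dynamics, cost, state, horizon=5, samples=64, elites=8,
             iterations=10, seed=0) -> List[float]:
    """Refit a diagonal Gaussian over action sequences to the lowest-cost samples."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(samples)]
        candidates.sort(key=lambda acts: rollout_cost(dynamics, cost, state, acts))
        elite = candidates[:elites]
        mean = [statistics.fmean(column) for column in zip(*elite)]
        # Small floor keeps the sampling distribution from collapsing to a point.
        std = [statistics.pstdev(column) + 1e-3 for column in zip(*elite)]
    return mean


def toy_dynamics(state: float, action: float) -> float:
    # Hypothetical stand-in for a learned model: the action nudges a 1D position.
    return state + 0.5 * action


def toy_cost(state: float) -> float:
    return (state - 1.0) ** 2  # squared distance to a goal at 1.0


plan = cem_plan(toy_dynamics, toy_cost, state=0.0)
planned_cost = rollout_cost(toy_dynamics, toy_cost, 0.0, plan)
baseline_cost = rollout_cost(toy_dynamics, toy_cost, 0.0, [0.0] * 5)
```

Each planning call here evaluates the model `samples * iterations * horizon` times, which is why compact representations with cheap forward passes suit this style of control.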

5. Representative Tasks Covered

The review summarizes how learned dynamics models are used across several task families:

object repositioning
deformable object manipulation (rope, cloth, dough, soft objects)
multi-object manipulation (packing, insertion, rearrangement)
tool-use manipulation

This section is useful because it maps task types to representation choices instead of treating “world model for robotics” as one homogeneous problem.

6. Future Directions (Well-Framed and Worth Reading)

The future-directions section is one of the strongest parts of the review. It is concrete rather than a generic call to “scale up the data.”

Some key directions the authors emphasize:

better handling of partial observability and robust state estimation
richer multimodal perception (vision + touch + audio, etc.)
more robust dynamics models under long-horizon planning and model exploitation
foundation dynamics models (and the data bottleneck for action-labeled interaction data)
using foundation-model priors for physical parameter estimation
importing new scene representations from graphics (e.g., NeRF/3DGS-inspired directions)
large-scale scene representations beyond tabletop settings
hierarchical dynamics modeling and planning
planning under imperfect models with stronger robustness / guarantees

I especially like the emphasis on hierarchical abstraction. It matches the review’s core thesis: one representation level is unlikely to be optimal for every decision scale.

7. Strengths of the Review

Clear robotics-centric framing (not just ML taxonomy)
Useful representation-first organization
Connects perception, dynamics learning, and control in one pipeline
Discusses practical task fit and deployment constraints
Balanced treatment of both structured and unstructured representations
Strong future-directions section with concrete open problems

8. Limitations / What This Review Is (and Is Not)

The authors explicitly scope out:

analytical (non-learned) dynamics models, which are not the main focus
differentiable-but-not-learned models
hybrid approaches beyond selected examples
learned dynamics work without demonstrated robotic manipulation applications

That scope makes the review focused and useful, but readers looking for a unified comparison with broader world-model literature (e.g., general RL world models, video world models without robotics deployment) will still need complementary reading.

9. My Takeaways

The most important design choice is often state representation, not just network architecture.
In robotics, stronger inductive bias often shifts difficulty from dynamics learning to perception/state estimation.
“Model quality” should be evaluated in the context of the control algorithm that uses it.
A universal manipulation dynamics model likely requires multi-level representations and hierarchical planning.
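A small numerical sketch of why one-step accuracy alone can mislead: the toy model below has a 5% gain error, which looks minor after one step but compounds over a 20-step open-loop rollout of the kind a planner would use. The system, the size of the error, and the function names are assumptions made up for this illustration.

```python
from typing import Callable, List


def rollout(step_fn: Callable[[float, float], float],
            state: float, actions: List[float]) -> List[float]:
    """Apply step_fn repeatedly and return every visited state."""
    states = [state]
    for action in actions:
        state = step_fn(state, action)
        states.append(state)
    return states


def true_step(state: float, action: float) -> float:
    return state + action


def model_step(state: float, action: float) -> float:
    # Hypothetical learned model with a 5% error in the state gain.
    return 1.05 * state + action


actions = [0.1] * 20
true_states = rollout(true_step, 1.0, actions)
model_states = rollout(model_step, 1.0, actions)

one_step_error = abs(model_states[1] - true_states[1])
final_error = abs(model_states[-1] - true_states[-1])
```

Here the error at the end of the horizon is far larger than the one-step error, so a controller that replans every step and one that trusts the full rollout would experience very different "model quality" from the same model.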

If I were designing a new manipulation system, I would use this review as a checklist:

What representation matches the task physics?
Can I estimate that state robustly from my sensors?
What control method can exploit this model without overfitting to model errors?
What level of abstraction is actually needed for the decision horizon?

Lixin Xu
