[Paper Notes] A Review of Learning-Based Dynamics Models for Robotic Manipulation (Science Robotics 2025)
TL;DR
This Science Robotics review is a strong, robotics-centered survey of learning-based dynamics models for manipulation, with one especially useful organizing idea:
- the design space is largely shaped by the state representation
The paper builds a clear taxonomy from less-structured to more-structured representations:
- pixels
- latent states
- 3D particles
- keypoints
- object-centric states
and analyzes trade-offs across:
- perception difficulty
- inductive bias / sample efficiency
- generalization
- interpretability
- control-time computational cost
If you work on world models / model-based control for manipulation, this review is worth reading because it connects representation choice -> dynamics architecture -> control method -> task suitability in a very practical way.
Paper Info
- Title: A review of learning-based dynamics models for robotic manipulation
- Authors: Bo Ai, Stephen Tian, Haochen Shi, Yixuan Wang, Tobias Pfaff, Cheston Tan, Henrik I. Christensen, Hao Su, Jiajun Wu, Yunzhu Li
- Venue: Science Robotics (Review article)
- Publication date: 2025-09-17
- DOI: 10.1126/scirobotics.adt1497
1. Why This Review Matters
There are many papers on world models, learned simulators, and model-based control, but they often focus on a single object type, a single sensing setup, or one task family. This review is valuable because it asks a broader robotics question:
How should we design learned dynamics models for manipulation when the environment is partially observable, contact-rich, and task-dependent?
The answer the paper emphasizes is not “use architecture X.” Instead, the authors argue that a major design choice is the state representation used by the perception+dynamics+control stack. That framing is practical and reusable.
2. Core Framework: Perception, Dynamics, Control
The paper formalizes manipulation under a POMDP and decomposes learned dynamics pipelines into three modules:
- Perception module g: estimate a task-relevant state s_t from observations (and possibly action/history)
- Dynamics module \hat{T}: predict state transitions s_t -> s_{t+1} given action a_t
- Control module \pi: plan or learn a policy using the learned dynamics model
This decomposition is simple but important. In practice, many failures come from the interaction between these modules, especially when a representation is good for one stage (e.g., efficient dynamics learning) but hard for another (e.g., robust state estimation).
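To make the decomposition concrete, here is a minimal sketch of the perception -> dynamics -> control interface. All models are hypothetical linear stand-ins (random weights, dimensions chosen for illustration); real systems would use learned networks for each module.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, STATE_DIM, ACT_DIM = 16, 4, 2

# Perception g: map an observation o_t to an estimated state s_t.
W_g = rng.normal(size=(STATE_DIM, OBS_DIM)) * 0.1
def perceive(o):
    return W_g @ o

# Dynamics T_hat: predict s_{t+1} from (s_t, a_t).
A = np.eye(STATE_DIM) * 0.95
B = rng.normal(size=(STATE_DIM, ACT_DIM)) * 0.1
def predict(s, a):
    return A @ s + B @ a

# Control pi: pick the candidate action whose predicted next state is
# closest to a goal state -- one-step greedy planning with the model.
def act(s, goal, candidates):
    costs = [np.linalg.norm(predict(s, a) - goal) for a in candidates]
    return candidates[int(np.argmin(costs))]

o_t = rng.normal(size=OBS_DIM)
s_t = perceive(o_t)
goal = np.zeros(STATE_DIM)
candidates = [rng.normal(size=ACT_DIM) for _ in range(8)]
a_t = act(s_t, goal, candidates)
s_next = predict(s_t, a_t)
```

Note that the representation of s_t is the contract between all three functions, which is exactly why the review treats it as the central design choice.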
3. Main Taxonomy: State Representations (The Most Useful Part)
The review organizes methods by state representation and repeatedly highlights a central trade-off:
- more structure usually improves inductive bias and sample efficiency
- but often makes perception/state estimation harder
3.1 Pixel representations (2D pixel space)
Pixel-based models treat dynamics learning as action-conditioned video prediction.
Strengths:
- minimal explicit state-estimation pipeline
- broad applicability to many modalities (RGB, depth, tactile images, density fields)
- can leverage large-scale video modeling advances (transformers, diffusion)
Weaknesses:
- high-dimensional prediction space -> data hungry
- can hallucinate under partial observability
- expensive for high-frequency control
- standard video metrics often do not align with control quality
My takeaway: pixel models are attractive for generality, but control reliability remains difficult unless you add stronger priors or train on very large datasets.
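The interface of such a model is just (frame_t, action_t) -> predicted frame_{t+1}. A toy sketch with a single linear map over the flattened frame and action (random stand-in weights; a real model would be a video transformer or diffusion model):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8          # toy frame size
ACT_DIM = 2
IN_DIM = H * W + ACT_DIM

# Stand-in "network": one linear map from [flattened frame; action]
# to the flattened next frame.
W_pred = rng.normal(size=(H * W, IN_DIM)) * 0.05

def predict_next_frame(frame, action):
    x = np.concatenate([frame.ravel(), action])
    return (W_pred @ x).reshape(H, W)

frame_t = rng.random((H, W))
action_t = rng.normal(size=ACT_DIM)
frame_next = predict_next_frame(frame_t, action_t)
```

Even in this toy form, the drawback is visible: the prediction target has H*W dimensions, so the output space grows with resolution, which is why pixel models are data-hungry and costly for high-frequency control.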
3.2 Latent representations
Latent-state models compress observations into a lower-dimensional z_t, then predict dynamics in latent space.
The review nicely separates:
- reconstruction-based representation learning
- reconstruction-free objectives (e.g., inverse dynamics, contrastive, reward-predictive)
and discusses probabilistic vs deterministic latent dynamics (e.g., RSSMs vs MLP/CNN predictors).
Strengths:
- efficient control-time inference
- good sample efficiency when latent structure is well chosen
- widely used in real-world model-based RL / manipulation
Weaknesses:
- latent quality depends heavily on training objective
- task-specific objectives may hurt transfer
- generalization across object counts / scene configurations is still limited
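A minimal sketch of the deterministic latent-dynamics recipe: encode o_t once, then roll the model forward entirely in latent space (linear stand-ins for the encoder and transition; dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, Z_DIM, ACT_DIM = 32, 8, 2

E = rng.normal(size=(Z_DIM, OBS_DIM)) * 0.1   # encoder stand-in
A = np.eye(Z_DIM) * 0.9                       # latent transition
B = rng.normal(size=(Z_DIM, ACT_DIM)) * 0.1   # action effect

def encode(o):
    return E @ o

def latent_step(z, a):
    return A @ z + B @ a

# Multi-step rollout stays in the low-dimensional latent space, which
# is why control-time inference is cheap for these models.
def rollout(z0, actions):
    z, traj = z0, [z0]
    for a in actions:
        z = latent_step(z, a)
        traj.append(z)
    return np.stack(traj)

z0 = encode(rng.normal(size=OBS_DIM))
traj = rollout(z0, [rng.normal(size=ACT_DIM) for _ in range(5)])
```

The training objective (reconstruction, contrastive, reward-predictive, ...) determines what information z_t retains, which is the crux of the weaknesses listed above.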
3.3 3D particle representations
Particle representations explicitly encode geometry and local interactions, making them especially strong for deformable / nonrigid manipulation.
Common modeling choices:
- GNN-based particle interaction models (e.g., DPI-Net / GNS-style families)
- convolutional particle interaction architectures (e.g., SPNets-style)
Strengths:
- strong physical inductive bias
- sample efficiency
- good fit for deformable objects, granular materials, fluids
- natural integration with multimodal sensing (vision + touch)
Weaknesses:
- state estimation from observations is hard (occlusion, tracking, correspondences)
- scalability/cost issues for dense graphs
This is a recurring theme in the review: particle models can be excellent dynamics models, but perception can become the bottleneck.
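A single message-passing step over a particle graph, in the spirit of the GNN-based models above (DPI-Net / GNS style). Edges connect particles within a radius; each particle aggregates messages from neighbors' relative positions. The edge and node "networks" here are random linear stand-ins, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
pos = rng.random((N, D))
vel = np.zeros((N, D))

RADIUS = 0.4
W_msg = rng.normal(size=(D, D)) * 0.1   # edge "network" stand-in
W_upd = rng.normal(size=(D, D)) * 0.1   # node "network" stand-in

def step(pos, vel, dt=0.1):
    # Build edges by radius search (O(N^2) here; real systems use
    # spatial grids or trees -- this is the scalability cost above).
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist < RADIUS) & (dist > 0)
    # Edge messages from relative positions, summed over neighbors.
    msgs = np.einsum('ijd,ed->ije', diff * adj[..., None], W_msg).sum(axis=1)
    # Node update: predict acceleration, then integrate.
    acc = msgs @ W_upd.T
    vel_next = vel + dt * acc
    return pos + dt * vel_next, vel_next

pos1, vel1 = step(pos, vel)
```

The sketch also shows why perception is the bottleneck: the model consumes per-particle positions, which must first be estimated from observations under occlusion.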
3.4 Keypoint representations
Keypoints are sparse, task-relevant points (2D/3D) with implicit or explicit semantics.
The review covers:
- supervised keypoint learning
- unsupervised keypoint discovery
- zero-shot keypoint detection using vision foundation models (CLIP/DINO-style features, etc.)
Strengths:
- compact and efficient
- often good for control and real-time planning
- can generalize across object instances when keypoints capture consistent task structure
Weaknesses:
- sensitive to occlusion and temporal consistency errors
- keypoint extraction quality is critical
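As a sketch of the compactness argument: the state is just K sparse 2D keypoints, and the dynamics model predicts an action-conditioned displacement per keypoint. The detector and predictor weights are hypothetical stand-ins (real systems use learned or foundation-model detectors):

```python
import numpy as np

rng = np.random.default_rng(0)
K, ACT_DIM = 6, 2
kps = rng.random((K, 2))          # keypoints in normalized image coords

W_a = rng.normal(size=(2, ACT_DIM)) * 0.05  # predictor stand-in

def keypoint_step(kps, action):
    # Shared action-conditioned displacement for every keypoint,
    # clipped to stay inside the normalized image frame.
    delta = W_a @ action
    return np.clip(kps + delta, 0.0, 1.0)

kps_next = keypoint_step(kps, rng.normal(size=ACT_DIM))
```

With only K*2 numbers per timestep, rollouts and planning are cheap, but a single occluded or mistracked keypoint corrupts the whole state, matching the weaknesses above.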
3.5 Object-centric representations
Object-centric models represent scenes as discrete interacting entities and explicitly model relations.
Strengths:
- good for multi-object reasoning and compositional generalization
- natural fit for graph-based relational dynamics
- high-level abstraction often matches rearrangement/manipulation tasks
Weaknesses:
- difficult perception problem (instance segmentation, inverse rendering, object proposals)
- less suitable for highly deformable/continuous materials
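A sketch of relational dynamics over object slots: each object carries a small state vector, and the next state sums a self-term and pairwise interaction effects, as in graph-based relational models. The relation and update weights are random stand-ins for learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBJ, D = 4, 2
state = rng.random((N_OBJ, 2 * D))   # e.g. [pos, vel] per object slot

W_rel = rng.normal(size=(2 * D, 4 * D)) * 0.05   # relation "network"
W_self = rng.normal(size=(2 * D, 2 * D)) * 0.05  # self-dynamics

def relational_step(state):
    n = state.shape[0]
    effects = np.zeros_like(state)
    # Pairwise effects: each ordered object pair contributes a message.
    for i in range(n):
        for j in range(n):
            if i != j:
                pair = np.concatenate([state[i], state[j]])
                effects[i] += W_rel @ pair
    return state + state @ W_self.T + effects

state_next = relational_step(state)
```

Because the same relation function is shared across pairs, the model applies unchanged to scenes with more or fewer objects, which is the source of the compositional-generalization strength.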
4. Representation Choice Is Really a Control Design Choice
One of the best messages in the paper is that representation choice is not just a perception or modeling preference. It directly affects control:
- planning stability
- computational cost
- whether gradients are useful
- how badly model errors are exploited during optimization
The review discusses two main control paradigms:
- motion planning (path planning + trajectory optimization; e.g., random search, CEM, MPPI, gradient-based optimization)
- policy learning (including model-based RL and goal-conditioned policy training from learned rollouts)
The practical insight is that different representations pair naturally with different control styles. For example:
- compact latents/keypoints can support fast iterative control
- particle models can offer better physical fidelity for deformables but may be heavier
- object-centric models can help planning in multi-object tasks
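As one concrete pairing, here is a sketch of sampling-based planning (CEM, one of the trajectory-optimization methods listed above) on top of a learned dynamics model. The "learned model" is a hypothetical linear system; the planner only needs a black-box rollout-cost function:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM, H = 4, 2, 5   # H = planning horizon

A = np.eye(STATE_DIM) * 0.95
B = rng.normal(size=(STATE_DIM, ACT_DIM)) * 0.2

def rollout_cost(s0, actions, goal):
    s = s0
    for a in actions:
        s = A @ s + B @ a          # a learned model would go here
    return np.linalg.norm(s - goal)

def cem_plan(s0, goal, iters=5, pop=64, elite=8):
    mu = np.zeros((H, ACT_DIM))
    sigma = np.ones((H, ACT_DIM))
    for _ in range(iters):
        cand = mu + sigma * rng.standard_normal((pop, H, ACT_DIM))
        costs = np.array([rollout_cost(s0, c, goal) for c in cand])
        elites = cand[np.argsort(costs)[:elite]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-3
    return mu   # planned action sequence (execute first action, replan)

s0 = rng.normal(size=STATE_DIM)
plan = cem_plan(s0, np.zeros(STATE_DIM))
```

Each CEM iteration costs pop * H model evaluations, which is why compact latents and keypoints pair well with this style while dense particle models become heavy.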
5. Representative Tasks Covered
The review summarizes how learned dynamics models are used across several task families:
- object repositioning
- deformable object manipulation (rope, cloth, dough, soft objects)
- multi-object manipulation (packing, insertion, rearrangement)
- tool-use manipulation
This section is useful because it maps task types to representation choices instead of treating “world model for robotics” as one homogeneous problem.
6. Future Directions (Well-Framed and Worth Reading)
The future-directions section is one of the strongest parts of the review. It is concrete and not just generic “scale more data.”
Some key directions the authors emphasize:
- better handling of partial observability and robust state estimation
- richer multimodal perception (vision + touch + audio, etc.)
- more robust dynamics models under long-horizon planning and model exploitation
- foundation dynamics models (and the data bottleneck for action-labeled interaction data)
- using foundation-model priors for physical parameter estimation
- importing new scene representations from graphics (e.g., NeRF/3DGS-inspired directions)
- large-scale scene representations beyond tabletop settings
- hierarchical dynamics modeling and planning
- planning under imperfect models with stronger robustness / guarantees
I especially like the emphasis on hierarchical abstraction. It matches the review’s core thesis: one representation level is unlikely to be optimal for every decision scale.
7. Strengths of the Review
- Clear robotics-centric framing (not just ML taxonomy)
- Useful representation-first organization
- Connects perception, dynamics learning, and control in one pipeline
- Discusses practical task fit and deployment constraints
- Balanced treatment of both structured and unstructured representations
- Strong future-directions section with concrete open problems
8. Limitations / What This Review Is (and Is Not)
The authors explicitly scope out:
- analytical (non-learned) dynamics models
- differentiable-but-not-learned simulators
- hybrid approaches (beyond selected examples)
- learned-dynamics work without demonstrated robotic manipulation applications
That scope makes the review focused and useful, but readers looking for a unified comparison with broader world-model literature (e.g., general RL world models, video world models without robotics deployment) will still need complementary reading.
9. My Takeaways
- The most important design choice is often state representation, not just network architecture.
- In robotics, stronger inductive bias often shifts difficulty from dynamics learning to perception/state estimation.
- “Model quality” should be evaluated in the context of the control algorithm that uses it.
- A universal manipulation dynamics model likely requires multi-level representations and hierarchical planning.
If I were designing a new manipulation system, I would use this review as a checklist:
- What representation matches the task physics?
- Can I estimate that state robustly from my sensors?
- What control method can exploit this model without overfitting to model errors?
- What level of abstraction is actually needed for the decision horizon?
