[Paper Notes] RigidFormer: Learning Rigid Dynamics using Transformers
Published:
TL;DR
RigidFormer is a mesh-free learned simulator for multi-object rigid-body dynamics. It takes two recent object-level point-cloud states, a temporal step size, and optional control signals, then predicts the next point-cloud state. Instead of updating every point independently, it compresses each object into an object token, lets object tokens interact through a Transformer, advances a small set of anchors per object, and then recovers a rigid transform with differentiable Kabsch alignment.
The paper is strong as a simulator-style world model. It improves the efficiency and scalability of rigid contact rollout by moving from vertex-level interaction to object-level and anchor-level interaction. The key move is not only “use a Transformer”; it is the choice to respect the low-dimensional structure of rigid motion while still reading contact-relevant local geometry from point clouds.
The most useful critical lens here is to separate “world model” into two concepts:
Action-Conditioned World Model:
world state + action/control -> future world state
World-Action Model:
world observation/change + goal -> action-relevant belief, affordance, risk, or action
RigidFormer mostly belongs to the first category. It asks: if the physical scene continues under this dynamics/control condition, what will the future object state be? But a policy may not need a full future reconstruction. For action, the important question is often: what changed, what matters, and how should I react?
Paper Info
- Title: RigidFormer: Learning Rigid Dynamics using Transformers
- Authors: Zhiyang Dou, Minghao Guo, Haixu Wu, Doug Roble, Tuur Stuyck, Wojciech Matusik
- Affiliations: MIT and Meta
- arXiv: 2605.09196
- Project page: people.csail.mit.edu/frankzydou/projects/RigidFormer
- Code: Frank-ZY-Dou/Dynamics-Modeling, with the current README noting that code release is coming soon.
The Problem
Rigid-body dynamics matters for robotics, graphics, and embodied AI because manipulation and interaction are full of contact: blocks collide, objects slide, stacks fall, and articulated parts push against each other. Classical physics engines can simulate this well when we have clean meshes, calibrated material parameters, and carefully tuned contact models. In real perception pipelines, those assumptions often do not hold. We may have only point clouds, incomplete geometry, noisy segmentation, and approximate material parameters.
Prior learned simulators often rely on mesh connectivity or vertex-level message passing. That creates two bottlenecks:
- Point clouds do not naturally come with triangle connectivity.
- Vertex-level interaction becomes expensive as point resolution and object count grow.
RigidFormer’s answer is to treat rigid bodies as coherent objects. A rigid body may have many observed points, but its motion is low-dimensional. So the model should reason over object-level state and only use local point features where contact geometry matters.
Inputs and Outputs
The core dynamics model is:
\[ x_{t+1} = f_\theta(x_{t-1}, x_t, \Delta t) \]
Each scene has \(M\) rigid objects. Object \(i\) is represented by a point set:
\[ x_t^{(i)} \in \mathbb{R}^{N_v^{(i)} \times d} \]
For each point, the paper builds a 12-dimensional feature vector from:
- nearest-neighbor displacement to another object or the ground,
- per-step position increment \(x_t - x_{t-1}\), used as a discrete velocity proxy,
- offset from a reference shape/frame,
- physical parameters \([m, \mu, \epsilon]\): mass, friction, and restitution.
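As a rough illustration of how the four feature groups above could add up to 12 dimensions, here is a minimal NumPy sketch. The 3 + 3 + 3 + 3 split, the brute-force nearest-neighbor search, the omission of the ground plane, and the function name `point_features` are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def point_features(x_prev, x_curr, x_ref, other_pts, mass, friction, restitution):
    """Assemble a (N, 12) feature matrix for one object (assumed 3+3+3+3 layout)."""
    # Displacement from each point to its nearest point on other objects (brute force).
    diff = other_pts[None, :, :] - x_curr[:, None, :]            # (N, K, 3)
    nearest = np.argmin(np.linalg.norm(diff, axis=-1), axis=1)   # (N,)
    nn_disp = diff[np.arange(len(x_curr)), nearest]              # (N, 3)
    # Per-step position increment, used as a discrete velocity proxy.
    vel = x_curr - x_prev                                        # (N, 3)
    # Offset from a reference shape (assumed here to be a stored rest pose).
    ref_off = x_curr - x_ref                                     # (N, 3)
    # Physical parameters broadcast to every point of the object.
    phys = np.tile([mass, friction, restitution], (len(x_curr), 1))  # (N, 3)
    return np.concatenate([nn_disp, vel, ref_off, phys], axis=1)
```

The physical parameters are constant per object, so repeating them per point is redundant but keeps the encoder input a single dense matrix.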
The model also receives the integration step size \(\Delta t\). In the articulated-body extension, it can additionally receive high-level control commands such as target speed, target movement direction, and target facing direction.
The output is the next state of every object as full-resolution point positions:
input: object point clouds at t-1 and t, physics/contact features, step size, optional controls
output: object point clouds at t+1
Internally, the output path is more structured than direct point regression:
points -> object tokens -> object interaction Transformer
-> anchor queries -> anchor accelerations
-> Verlet integration -> candidate anchor positions
-> Kabsch rigid alignment -> rigid transform
-> all object points at t+1
This is important. The model predicts enough motion to determine a rigid transform, then applies that transform to all points. The final object cannot arbitrarily deform because the projection enforces rigid-body structure.
Method
1. Object-Level Interaction
RigidFormer first uses a PointNet-style encoder to compress each object’s point cloud into one object token. These object tokens, plus 16 learned register tokens, enter a Transformer decoder. Because there is no sequence-index positional embedding over objects, the object interaction stage is designed to be permutation-equivariant: reordering objects should reorder the outputs, not change the physics.
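The permutation-equivariance claim can be checked on a toy case. The sketch below is a single-head self-attention layer in NumPy with no positional embedding. It is a stand-in for the idea, not the paper's decoder, and the weight names are hypothetical.

```python
import numpy as np

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over object tokens, with no positional embedding."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))                 # 6 objects, 8-D tokens
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
perm = rng.permutation(6)

# Reordering the objects reorders the outputs in the same way.
out_then_perm = self_attention(tokens, wq, wk, wv)[perm]
perm_then_out = self_attention(tokens[perm], wq, wk, wv)
print(np.allclose(out_then_perm, perm_then_out))
```

Adding a sequence-index positional embedding to `tokens` would break this equality, which is why leaving it out matters for object sets.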
The Transformer is conditioned on the temporal step size through FiLM (feature-wise linear modulation). The conditioning code uses \((s, s^2)\), where \(s\) is the step size, mirroring the first-order and second-order terms that appear in motion integration. This lets a single model operate at different effective time resolutions.
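A minimal sketch of FiLM conditioning on the step size follows. It assumes the scale and shift are plain linear maps of the code \((s, s^2)\). The paper's actual conditioning network may differ, and `w_gamma` and `w_beta` are hypothetical parameter names.

```python
import numpy as np

def film(h, s, w_gamma, w_beta):
    """Feature-wise linear modulation of hidden features h by step size s."""
    cond = np.array([s, s ** 2])     # first-order and second-order terms
    gamma = 1.0 + cond @ w_gamma     # per-channel scale, identity when weights are zero
    beta = cond @ w_beta             # per-channel shift
    return gamma * h + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))         # 4 tokens, 16 channels
w_gamma = 0.01 * rng.normal(size=(2, 16))
w_beta = 0.01 * rng.normal(size=(2, 16))

h_small_step = film(h, 1.0, w_gamma, w_beta)
h_large_step = film(h, 10.0, w_gamma, w_beta)
```

The same weights produce different modulations for `s = 1` and `s = 10`, which is what lets one network serve several step sizes.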
2. Anchor-Based State Advance
Instead of predicting every point’s next position, RigidFormer selects a small number of anchors per object by farthest point sampling (FPS), with \(N_a = 4\) as the default. Each anchor attends to object tokens and receives local point features through Anchor-Vertex Pooling. This pooling aggregates nearby vertex features with a learnable distance kernel, giving anchors contact-local context without dense vertex-level attention.
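The two ingredients in this step can be sketched in a few lines of NumPy. Farthest point sampling is standard. The pooling function below uses a fixed Gaussian kernel as a stand-in for the learnable distance kernel, so `bandwidth` and the function name `anchor_vertex_pooling` are assumptions, not the paper's definition.

```python
import numpy as np

def farthest_point_sampling(points, n_anchors, start=0):
    """Greedily pick n_anchors indices that are mutually far apart."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_anchors - 1):
        nxt = int(np.argmax(dist))   # point farthest from the current anchor set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def anchor_vertex_pooling(anchors, points, feats, bandwidth=0.1):
    """Distance-weighted average of point features around each anchor.

    Assumes anchors are drawn from `points`, so every row has a nonzero weight.
    """
    d2 = ((anchors[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)  # (A, N)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # fixed Gaussian kernel (stand-in)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ feats                            # (A, F)

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
idx = farthest_point_sampling(pts, n_anchors=4)
pooled = anchor_vertex_pooling(pts[idx], pts, rng.normal(size=(200, 12)))
```

The cost is proportional to anchors times points per object, rather than points times points, which is where the saving over dense vertex-level attention comes from.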
The network predicts per-anchor acceleration. Candidate anchor positions are advanced with Verlet integration:
\[ \hat{q}_{t+1}^{(i,k)} = 2q_t^{(i,k)} - q_{t-1}^{(i,k)} + a_t^{(i,k)} \Delta t^2 \]
The paper then aligns reference anchors to predicted anchors using Kabsch alignment and applies the resulting \(SE(3)\) transform to the full object point set.
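The Verlet update and the rigid projection can be sketched as follows. This is the textbook SVD-based Kabsch solution in NumPy, which is not differentiable as written. The paper uses a differentiable variant inside the network, and the function names here are hypothetical.

```python
import numpy as np

def verlet_step(q_prev, q_curr, accel, dt):
    """Position Verlet: q_next = 2 q_curr - q_prev + a dt^2."""
    return 2.0 * q_curr - q_prev + accel * dt ** 2

def kabsch(src, dst):
    """Rotation R and translation t minimizing sum ||R src_k + t - dst_k||^2."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def apply_rigid(points, R, t):
    """Move every point of the object with the recovered rigid transform."""
    return points @ R.T + t
```

Typical use: advance the four anchors with `verlet_step`, fit `R, t = kabsch(reference_anchors, candidate_anchors)`, then call `apply_rigid` on the full point set. Even if the candidate anchors are slightly non-rigid, the output points stay rigid because only `R` and `t` are applied.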
3. Anchor-Based RoPE
The paper introduces Anchor-based Rotary Positional Embedding to inject 3D geometry into attention. The idea is to encode object geometry through sparse anchor positions rather than through a single centroid or all vertices. Mean-pooling anchor rotary descriptors makes the embedding invariant to anchor reindexing while still carrying shape extent and world-frame position information.
This is a small but meaningful design decision: a rigid object’s geometry matters for contact, but anchor identity should not become an arbitrary index dependency.
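The invariance argument can be illustrated with a toy descriptor. The sketch below builds sinusoidal phase features per anchor and mean-pools them. It shows only the pooling property, not how the paper applies rotary embeddings to queries and keys, and the frequency values are arbitrary assumptions.

```python
import numpy as np

def anchor_descriptor(anchors, freqs):
    """Mean-pooled sinusoidal features of anchor positions (toy stand-in)."""
    phases = anchors[:, :, None] * freqs[None, None, :]            # (A, 3, F)
    per_anchor = np.concatenate([np.cos(phases), np.sin(phases)], axis=-1)
    return per_anchor.reshape(len(anchors), -1).mean(axis=0)       # (6F,)

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 3))
freqs = np.array([1.0, 2.0, 4.0])

base = anchor_descriptor(anchors, freqs)
reindexed = anchor_descriptor(anchors[rng.permutation(4)], freqs)
shifted = anchor_descriptor(anchors + np.array([0.5, 0.0, 0.0]), freqs)

print(np.allclose(base, reindexed))   # anchor order does not matter
print(np.allclose(base, shifted))     # world-frame position does
```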
Results
On MOVi-A, MOVi-B, and MOVi-Sphere, RigidFormer matches or outperforms strong learned simulator baselines while using point inputs rather than mesh connectivity. The most relevant comparison is with HopNet, a strong prior rigid-dynamics baseline. On MOVi-B at 100 frames, the paper reports position / orientation RMSE improving from 0.176 m / 17.91 deg to 0.161 m / 15.33 deg under the matched step-size-1 setting.
The step-size experiments are especially interesting. Larger step sizes reduce long-horizon autoregressive error because the model makes fewer rollout calls over the same physical horizon. On MOVi-B at 100 frames, the reported errors are:
| Step size | Position RMSE | Orientation RMSE |
|---|---|---|
| 1 | 0.161 m | 15.33 deg |
| 5 | 0.136 m | 13.55 deg |
| 10 | 0.115 m | 10.85 deg |
This is not just a numerical trick. It says the learned simulator can expose sparse long-horizon futures cheaply, which is useful for planning when the planner does not need every high-frequency contact frame.
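The call-count argument is simple arithmetic on the table above. The sketch assumes that the step size counts frames, so a 100-frame horizon at step size 10 takes 10 autoregressive calls. The error values are copied from the table.

```python
import math

HORIZON_FRAMES = 100
# step size -> (position RMSE in m, orientation RMSE in deg), from the table above
reported = {1: (0.161, 15.33), 5: (0.136, 13.55), 10: (0.115, 10.85)}

def rollout_calls(horizon, step):
    """Number of autoregressive model calls needed to cover the horizon."""
    return math.ceil(horizon / step)

def relative_reduction(baseline, value):
    """Fractional error reduction relative to the baseline."""
    return (baseline - value) / baseline

for step, (pos, ori) in reported.items():
    calls = rollout_calls(HORIZON_FRAMES, step)
    pos_gain = relative_reduction(reported[1][0], pos)
    ori_gain = relative_reduction(reported[1][1], ori)
    print(f"step {step:>2}: {calls:>3} calls, "
          f"position -{pos_gain:.1%}, orientation -{ori_gain:.1%}")
```

Going from step size 1 to 10 cuts the number of calls tenfold and lowers both reported errors by a bit under 30 percent.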
The runtime comparison is also central to the paper’s claim:
| Method | Time per step (ms) | Frames per second |
|---|---|---|
| HopNet | 4228.7 | 0.2 |
| FIGNet | 336.0 | 3.0 |
| RigidFormer | 41.9 | 23.9 |
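The two columns are related by frames per second = 1000 / (ms per step), and the same numbers give the speedup factors. A small check, using only the values in the table:

```python
# Milliseconds per step, copied from the table above.
ms_per_step = {"HopNet": 4228.7, "FIGNet": 336.0, "RigidFormer": 41.9}

def frames_per_second(ms):
    """Convert a per-step latency in milliseconds to steps per second."""
    return 1000.0 / ms

def speedup(method, reference="RigidFormer"):
    """How many times slower `method` is than `reference` per step."""
    return ms_per_step[method] / ms_per_step[reference]

for name, ms in ms_per_step.items():
    print(f"{name:<12} {frames_per_second(ms):6.1f} frames/s, "
          f"{speedup(name):6.1f}x the RigidFormer step time")
```

By these figures RigidFormer is roughly 100 times faster per step than HopNet and roughly 8 times faster than FIGNet.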
The paper also shows scalability on WreckingBall scenes with up to 217 objects and a preliminary extension to command-conditioned articulated bodies, where body parts are treated as interacting object-level components.
What I Like
The paper has a clean structural bias. Rigid objects should move rigidly; object interactions should be object-level; local contact still needs local geometry. RigidFormer maps those intuitions into architecture:
- object tokens for global interaction,
- anchors for low-dimensional state advance,
- local anchor-vertex pooling for contact cues,
- rigid projection for stability,
- step-size conditioning for controllable rollout.
This makes the model feel less like a generic Transformer pasted onto physics and more like a learned simulator with the right pressure points exposed.
I also like the way it treats point clouds. The model does not require meshes for the dynamics interface, but it also does not pretend geometry can be collapsed to a centroid. The anchor representation is a compromise between dense geometry and low-dimensional physical state.
A Critical Reading: Simulator World Model vs Policy World Model
RigidFormer is valuable, and it is a good paper through which to examine this conceptual boundary. It is a simulator world model, or more specifically an action/control-conditioned world model:
\[ P(\text{world}_{future} \mid \text{world}_{now}, \text{action/control}) \]
Its job is to predict future object states. That is useful for physics rollout, data generation, model-predictive control, trajectory optimization, and counterfactual planning.
But a policy-facing world model may want something different:
\[ P(\text{action or action-relevant latent} \mid \text{world}_{now}, \text{goal}, \text{change}) \]
Call this a World-Action Model. It does not need to reconstruct every future point coordinate. It needs to decide what the current world means for action. The policy may care about:
- whether an object blocks the goal,
- whether it is graspable,
- whether it is sliding or falling,
- whether contact is about to matter,
- whether uncertainty is high enough to slow down,
- whether the scene deserves more prediction compute.
This is closer to how human perception often feels. We usually do not run a full internal physics renderer for every object point. We notice the relevant change, allocate attention, and react. If a cup starts slipping, the useful internal state is not a dense future point cloud. It is something like: “slipping, reachable, act now.”
So the question is not whether RigidFormer conflicts with world models. It does not. The question is: which world model is being built?
Simulator world model:
state_t, action_t -> state_t+1
Policy world model:
observation_t, goal_t, change_t -> action-relevant belief/action
RigidFormer is excellent evidence for the first direction. It does not directly solve the second. For embodied policy learning, the second may be the more central abstraction.
Why This Distinction Matters
If the goal is simulation, dense state prediction is sensible. We want rollout fidelity, physically plausible contact, and stable long-horizon trajectories. RigidFormer’s point-cloud output is a feature, not a burden.
If the goal is policy, dense prediction can become an expensive intermediate. The policy may not need to know every point’s future coordinate. It may only need a compressed, task-conditioned representation of interaction:
object location
motion trend
contact affordance
risk
goal relevance
uncertainty
compute budget
This suggests a possible research direction: use RigidFormer-like object/anchor structure, but train it not only to predict future geometry. Train it to produce policy-useful state abstractions:
- affordance fields over objects and anchors,
- event predictions such as collision, slip, fall, or blockage,
- adaptive rollout depth,
- uncertainty-aware “think more here” signals,
- action-conditioned summaries rather than full state reconstructions.
In this view, RigidFormer could become a component inside a larger embodied system:
perception -> object point clouds -> RigidFormer-style physical latent
-> world-action module -> action / planner / compute allocation
The simulator module answers “what would happen?” The policy module answers “what should I do with what is happening?”
Limitations
The paper is clear about several limitations:
- It assumes object labels that tell the model which points belong to which object.
- Partial point-cloud results are promising, but severe occlusion and real sensor noise remain difficult.
- Contact is learned from data rather than solved by an analytic complementarity/contact solver.
- The main setting is rigid objects; articulated bodies are treated as collections of object-level parts.
- Mixed rigid-deformable scenes and adaptive time stepping are left for future work.
From the policy-world-model perspective, I would add one more limitation: the output is still simulator-oriented. It is not wrong, but it is not the same as action understanding. A future embodied model may need to decide when full simulation is worth the cost and when a reactive abstraction is enough.
Takeaways
RigidFormer is a strong learned-simulation paper because it makes rigid dynamics cheaper and more stable from mesh-free point inputs. The architecture is well matched to the physical structure of the problem: objects interact, anchors move, rigidity is enforced by projection, and local geometry enters where contact needs it.
The broader lesson is conceptual. “World model” should not be a single overloaded phrase. Some world models predict the world forward; others translate the world into action. RigidFormer is a good example of the former. For policy learning, the next question is how to build the latter without losing the physical structure that makes RigidFormer effective.
