[Paper Notes] LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion
TL;DR
LDA-1B is a 1.6B-parameter robot foundation model that learns policy, forward dynamics, inverse dynamics, and visual forecasting in a shared DINO latent space. Instead of predicting future pixels, it predicts future structured visual features extracted by a frozen DINOv3 encoder, which lets the model focus more on object structure, contact-relevant regions, and action-induced state changes.
The paper’s central claim is not that DINO itself is a physics engine. Rather, DINO provides a cleaner and more semantic visual state space, making it easier for a large action-conditioned dynamics model to learn interaction physics from heterogeneous embodied data. LDA-1B scales this recipe to 30k+ hours of robot and human interaction data and reports strong results across simulation, real-world gripper manipulation, dexterous manipulation, and long-horizon tasks.
Paper Info
- Title: LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion
- Authors: Jiangran Lyu*, Kai Liu*, Xuheng Zhang*, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, Wenbo Cui, Senmao Qi, Shuo Wang, Yixin Zheng, Mi Yan, Xuesong Shi, Haoran Li, Dongbin Zhao, Ming-Yu Liu, Zhizheng Zhang, Li Yi, Yizhou Wang, He Wang
- Affiliations: Peking University, Galbot, CASIA, BAAI, Tsinghua University, Sun Yat-sen University, NVIDIA
- Date: 2026-02-12
- arXiv: 2602.12215
- Project page: pku-epic.github.io/LDA
1. Motivation
Most recent robot foundation models are still heavily behavior-cloning-centric: collect high-quality demonstrations, then train a policy to imitate expert actions. This works, but the paper points to several problems with this recipe and with pixel-space alternatives:
- High-quality robot data is expensive.
- Many large embodied datasets are heterogeneous, noisy, or actionless.
- Pure BC discards low-quality trajectories and actionless videos, even though they may contain useful dynamics knowledge.
- Pixel-space world models waste capacity on appearance details such as lighting, texture, and background clutter.
The paper argues that robot pretraining should use data according to its quality and supervision type:
| Data Type | Main Use |
|---|---|
| High-quality robot/human demonstrations | Policy + dynamics + visual forecasting |
| Lower-quality trajectories | Dynamics + visual forecasting, not direct policy imitation |
| Actionless human videos | Visual forecasting and scene-transition priors |
This is what the authors call Universal Embodied Data Ingestion.
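One way to read the table above is as a routing rule from data type to training objectives. The sketch below is a minimal illustration, not the authors' code: the enum, the objective names, and the choice to expand "dynamics" into forward and inverse dynamics are assumptions made for this example.

```python
from enum import Enum


class DataKind(Enum):
    HIGH_QUALITY_DEMO = "high_quality_demo"
    LOW_QUALITY_TRAJECTORY = "low_quality_trajectory"
    ACTIONLESS_VIDEO = "actionless_video"


# Hypothetical objective names. "Dynamics" from the table is expanded into
# forward and inverse dynamics, which is an assumption of this sketch.
OBJECTIVES_BY_KIND = {
    DataKind.HIGH_QUALITY_DEMO: (
        "policy", "forward_dynamics", "inverse_dynamics", "visual_forecasting"),
    DataKind.LOW_QUALITY_TRAJECTORY: (
        "forward_dynamics", "inverse_dynamics", "visual_forecasting"),
    DataKind.ACTIONLESS_VIDEO: ("visual_forecasting",),
}


def objectives_for(kind: DataKind) -> tuple:
    """Return the objectives a sample of this kind may contribute to."""
    return OBJECTIVES_BY_KIND[kind]


for kind in DataKind:
    print(f"{kind.value}: {', '.join(objectives_for(kind))}")
```

The point of the lookup is that no data type is dropped: each one still contributes to at least one objective, and only high-quality demonstrations feed the policy term.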
2. Core Idea
LDA-1B follows a Unified World Model (UWM) style formulation. Given current observation \(o_t\), future observations \(o_{t+1:t+k}\), action chunk \(a_{t+1:t+k}\), and language instruction \(\ell\), it jointly learns:
- Policy: \(p(a_{t+1:t+k} \mid o_t, \ell)\)
- Forward dynamics: \(p(o_{t+1:t+k} \mid o_t, a_{t+1:t+k}, \ell)\)
- Inverse dynamics: \(p(a_{t+1:t+k} \mid o_{t:t+k}, \ell)\)
- Visual planning / forecasting: \(p(o_{t+1:t+k} \mid o_t, \ell)\)
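A short check, using only the chain rule and the symbols above, shows how the four objectives relate. Writing \(a = a_{t+1:t+k}\) and \(o' = o_{t+1:t+k}\), and treating \(o_{t:t+k}\) as the pair \((o_t, o')\), the joint distribution over actions and future observations factorizes in two ways:

\[
p(a, o' \mid o_t, \ell)
= p(a \mid o_t, \ell)\, p(o' \mid o_t, a, \ell)
= p(o' \mid o_t, \ell)\, p(a \mid o_t, o', \ell)
\]

Policy and forward dynamics form the first factorization; visual forecasting and inverse dynamics form the second. This is a standard probability identity offered as a reading aid, not a derivation taken from the paper.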
The important twist is that future visual states are represented as DINO latent features, not pixels or VAE image latents.
RGB observation
-> frozen DINOv3-ViT-s encoder
-> structured visual latent tokens
-> MM-DiT predicts future DINO latents and/or action chunks
During pretraining, the DINO encoder and VLM are frozen. The trainable part is the MM-DiT plus action encoder/decoder.
3. Why DINO Latent Helps
The paper makes a strong case that representation quality is a scaling bottleneck for world models. Pixel-space or VAE-space prediction entangles many things:
- object geometry
- texture
- background
- illumination
- camera viewpoint
- task-relevant contact dynamics
This makes dynamics learning harder because the model must spend capacity predicting visual details that may not matter for control.
DINO features are different. They tend to preserve semantic and spatial structure while suppressing irrelevant low-level appearance variation. For robot manipulation, this means the latent state is closer to the things we care about:
- where the object is
- whether the object has moved
- which surface is contact-relevant
- whether a tool, hand, or gripper is aligned with the object
- whether the future state is coherent under the action
My read: DINO latent is not physics by itself; it is a better coordinate system for learning physics-like dynamics. The physics comes from training the temporal/action model on lots of interaction sequences.
4. Model Architecture
LDA uses a Multi-Modal Diffusion Transformer (MM-DiT). It jointly denoises action chunks and future DINO latent tokens.
The model receives:
- current observation and language encoded by a VLM
- current / past DINO visual features
- noisy future action chunks
- noisy future DINO latent tokens
- task embeddings indicating policy, forward dynamics, inverse dynamics, or visual forecasting
- diffusion timestep embeddings
Each MM-DiT block uses multi-modal self-attention over action and visual tokens, with modality-specific projections and output heads. Language tokens are injected through cross-attention.
Key model details from the appendix:
| Component | Value |
|---|---|
| VLM | Qwen3-VL-4B-Instruct |
| Observation encoder | DINOv3-ViT-s |
| Hidden size | 1536 |
| Layers | 16 |
| Attention heads | 32 |
| Image shape | 224 x 224 x 3 |
| DINO latent shape | 14 x 14 x 384 |
| Action chunk | 16 |
| Pretraining batch | 32 x 48 |
| Pretraining hardware | 48 NVIDIA H800 GPUs |
| Pretraining iterations | 400k |
| Compute cost | 4,608 GPU hours |
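A few of these numbers can be cross-checked against each other. The sketch below only does arithmetic on the table values. Reading "32 x 48" as 32 samples per GPU across 48 GPUs, and treating each iteration as one global batch, are assumptions of this example rather than statements from the paper.

```python
# Values copied from the table above.
IMAGE_HW, IMAGE_CHANNELS = 224, 3
LATENT_GRID, LATENT_DIM = 14, 384
NUM_GPUS, GPU_HOURS, ITERATIONS = 48, 4608, 400_000
BATCH_FACTORS = (32, 48)

# Patch size implied by mapping a 224-pixel side onto a 14 x 14 grid.
patch_size = IMAGE_HW // LATENT_GRID
tokens_per_frame = LATENT_GRID * LATENT_GRID
latent_values = tokens_per_frame * LATENT_DIM
pixel_values = IMAGE_HW * IMAGE_HW * IMAGE_CHANNELS

# Assumption: "32 x 48" means 32 samples per GPU across 48 GPUs.
global_batch = BATCH_FACTORS[0] * BATCH_FACTORS[1]
# Assumption: every iteration consumes exactly one global batch.
samples_seen = ITERATIONS * global_batch
# Assumption: all GPUs run in parallel for the whole job.
wall_clock_hours = GPU_HOURS / NUM_GPUS

print(f"patch size: {patch_size}")
print(f"tokens per frame: {tokens_per_frame}")
print(f"latent values per frame: {latent_values}")
print(f"pixel values per frame: {pixel_values}")
print(f"global batch: {global_batch}")
print(f"samples seen: {samples_seen}")
print(f"wall-clock hours: {wall_clock_hours}")
```

Under these assumptions, each frame becomes 196 tokens, the latent holds half as many numbers as the raw image, and the job corresponds to 96 hours (about four days) of wall-clock time.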
5. EI-30k Dataset
The paper builds EI-30k, an embodied interaction dataset with more than 30,000 hours of human and robot trajectories.
| Category | Hours |
|---|---|
| Real-world robot data | 8.03k |
| Simulated robot data | 8.6k |
| Ego human data with actions | 7.2k |
| Actionless ego human videos | 10k |
| Total | 30k+ |
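As a quick sanity check on the "30k+" total, the category rows can be summed directly. The dictionary keys are labels invented for this sketch, and the sum assumes the four categories do not overlap.

```python
# Hours in thousands, copied from the table above.
HOURS_K = {
    "real_robot": 8.03,
    "sim_robot": 8.6,
    "ego_human_with_actions": 7.2,
    "ego_human_actionless": 10.0,
}

total_k = sum(HOURS_K.values())
with_actions_k = total_k - HOURS_K["ego_human_actionless"]
actionless_share = HOURS_K["ego_human_actionless"] / total_k

print(f"total: {total_k:.2f}k hours")
print(f"with action labels: {with_actions_k:.2f}k hours")
print(f"actionless share: {actionless_share:.1%}")
```

The rows add up to roughly 33.8k hours, consistent with "30k+", and a bit under a third of that is actionless video that a pure behavior-cloning objective could not use.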
The dataset is standardized into the LeRobot format and includes aligned hand-centric action representations:
- robot 6-DoF end-effector pose plus gripper width or dexterous hand joints
- human 6-DoF wrist pose plus MANO hand parameters
- camera extrinsics retained to decouple hand motion from egocentric camera motion
- quality labels so lower-quality trajectories can be used for dynamics without forcing policy imitation
This dataset engineering is one of the paper’s quiet but important contributions. The model improvement is not just from architecture; it also comes from making heterogeneous data trainable in one pipeline.
6. Main Results
RoboCasa-GR1 Simulation
On RoboCasa-GR1, the paper evaluates 24 tabletop rearrangement and articulated-object manipulation tasks. The key ablation compares VAE latent state representations with DINO latent representations.
| Method | Visual Rep. | MM-DiT | VLM | Success Rate |
|---|---|---|---|---|
| GR00T-N1.6 | - | - | Cosmos | 47.6 |
| StarVLA | - | - | Qwen3-VL | 47.8 |
| GR00T-EI30k | - | - | Qwen3-VL | 51.3 |
| UWM-0.1B | VAE | no | - | 14.2 |
| UWM-1B | VAE | no | Qwen3-VL | 19.3 |
| UWM + MM-DiT | VAE | yes | Qwen3-VL | 20.0 |
| LDA + DiT | DINO | no | Qwen3-VL | 48.9 |
| LDA-0.5B | DINO | yes | Qwen3-VL | 50.7 |
| LDA-1B | DINO | yes | Qwen3-VL | 55.4 |
The striking number is the jump from 20.0 (UWM + MM-DiT, VAE latents) to 55.4 (LDA-1B, DINO latents): both rows use MM-DiT and Qwen3-VL, and the listed difference between them is the visual representation. The authors use this to argue that semantically structured latent spaces are critical for scalable dynamics learning.
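To make the ablation easier to scan, the sketch below computes a few row-to-row gaps in percentage points from the table. The pairings are my own choice, and the rows may differ in ways the table does not list (for example model size), so these gaps should not be read as controlled effect sizes.

```python
# Success rates (%) copied from the RoboCasa-GR1 table.
SUCCESS = {
    "GR00T-EI30k": 51.3,
    "UWM-1B": 19.3,        # VAE, no MM-DiT
    "UWM + MM-DiT": 20.0,  # VAE, MM-DiT
    "LDA + DiT": 48.9,     # DINO, no MM-DiT
    "LDA-0.5B": 50.7,      # DINO, MM-DiT
    "LDA-1B": 55.4,        # DINO, MM-DiT
}


def gain(after: str, before: str) -> float:
    """Difference in success rate, in percentage points."""
    return round(SUCCESS[after] - SUCCESS[before], 1)


COMPARISONS = [
    ("VAE -> DINO, with MM-DiT", "LDA-1B", "UWM + MM-DiT"),
    ("VAE -> DINO, without MM-DiT", "LDA + DiT", "UWM-1B"),
    ("adding MM-DiT on VAE latents", "UWM + MM-DiT", "UWM-1B"),
    ("0.5B -> 1B", "LDA-1B", "LDA-0.5B"),
    ("LDA-1B vs. GR00T-EI30k", "LDA-1B", "GR00T-EI30k"),
]

for label, after, before in COMPARISONS:
    print(f"{label}: {gain(after, before):+.1f} points")
```

Read this way, the representation swap accounts for gaps of roughly 30 to 35 points, while the architecture and scale changes in the table move the number by single digits.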
Real-World Gripper Manipulation
LDA-1B is evaluated on Galbot G1 with a two-finger gripper. It outperforms GR00T-N1.6 and \(\pi_{0.5}\) across pick-and-place, contact-rich, fine manipulation, and long-horizon tasks.
Examples:
- Pick and place: 80-90% success
- Flip box: 60% success vs. 40% / 20% baselines
- Wipe board: 72% success vs. 44% / 52% baselines
- Clean rubbish: 35% success while both baselines fail at 0%
The long-horizon and contact-rich tasks are especially relevant: they require temporal consistency, contact reasoning, and recovery from intermediate errors.
Dexterous Manipulation
The model is also tested on low-DoF BrainCo hands and high-DoF Sharpa hands. On tasks like Pull Nail and Flip Bread, LDA shows much stronger performance than baselines:
- Pull Nail: 80% for LDA vs. much lower baselines
- Flip Bread: 90% for LDA vs. 10% for \(\pi_{0.5}\)
The authors interpret this as evidence that large-scale human interaction data provides useful latent priors for dexterous control.
Generalization
On pick-and-place with perturbations:
| Method | Novel Objects | Variant Background | OOD Position |
|---|---|---|---|
| \(\pi_{0.5}\) | 26.7 | 20.0 | 6.7 |
| GR00T | 40.0 | 40.0 | 20.0 |
| LDA-1B | 60.0 | 60.0 | 40.0 |
This supports the paper’s claim that DINO-latent dynamics helps the model focus on task-relevant affordances rather than visual distractors.
Mixed-Quality Fine-Tuning
The most practically interesting result: LDA benefits from adding low-quality trajectories, while \(\pi_{0.5}\) degrades.
| Task | Split | \(\pi_{0.5}\) | LDA |
|---|---|---|---|
| Place pen into box | High only | 60 | 70 |
| Place pen into box | High + Low | 40 | 80 |
| Remove lid | High only | 50 | 50 |
| Remove lid | High + Low | 40 | 60 |
This is exactly the kind of behavior you want from a dynamics-centric pretraining method: imperfect trajectories are not simply noise; they can teach what happens under suboptimal actions.
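The effect of adding low-quality data is easiest to see as a per-task change. This small sketch just subtracts the "High only" column from the "High + Low" column in the table; the key `pi_0.5` is a plain-text stand-in for \(\pi_{0.5}\).

```python
# (high_only, high_plus_low) success rates in %, copied from the table above.
RESULTS = {
    ("Place pen into box", "pi_0.5"): (60, 40),
    ("Place pen into box", "LDA"): (70, 80),
    ("Remove lid", "pi_0.5"): (50, 40),
    ("Remove lid", "LDA"): (50, 60),
}

# Change in success rate when low-quality trajectories are added.
deltas = {
    key: high_plus_low - high_only
    for key, (high_only, high_plus_low) in RESULTS.items()
}

for (task, method), delta in deltas.items():
    print(f"{task} / {method}: {delta:+d} points")
```

On both tasks the baseline loses 10 to 20 points and LDA gains 10, which is the asymmetry the mixed-quality argument rests on.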
7. Does DINO Latent Learn Physics?
Short answer: it helps the model learn intuitive interaction physics, but it is not a substitute for explicit physical modeling.
DINO latent is useful because it gives the downstream dynamics model a structured state representation. It tends to preserve:
- object-level semantics
- spatial coherence
- foreground object structure
- contact-relevant regions
- motion-relevant visual changes
But DINO latent does not explicitly encode:
- mass
- friction
- force
- torque
- material stiffness
- conservation laws
- precise contact constraints
So the real recipe is:
structured visual state space
+ action-conditioned temporal prediction
+ large-scale interaction data
+ forward/inverse dynamics objectives
= better intuitive physics for robot control
This distinction matters. If a task requires precise force control, transparent objects, deformable materials, or hidden physical variables, DINO latent alone may be insufficient. The paper itself lists future work around jointly learning visual representations and latent dynamics, richer sensory modalities, and better data-role optimization.
8. Strengths
- Clear scaling direction: use more heterogeneous data, not just more expert demonstrations.
- Good representation choice: DINO latent avoids wasting capacity on pixel-level appearance modeling.
- Unified objectives: policy, forward dynamics, inverse dynamics, and visual forecasting reinforce each other.
- Practical data usage: lower-quality trajectories and actionless videos become useful instead of discarded.
- Strong robotics evaluation: simulation, real-world gripper tasks, dexterous hands, generalization tests, and mixed-quality fine-tuning.
9. Limitations
- Frozen DINO features may bottleneck generalization: if DINO misses a physical cue, the dynamics model may not recover it.
- Egocentric camera bias: the paper notes reliance on predominantly egocentric viewpoints.
- No guarantee of physical correctness: predictions are learned from data, not constrained by physical laws.
- Dataset and reproducibility: EI-30k and checkpoints are marked as coming soon on the project page at the time of writing.
- Real-world manipulation is still far from solved: some long-horizon tasks remain difficult (for example, Clean rubbish at 35% success).
10. Takeaways
- DINO latent is a strong state representation for robot world models because it is semantic, spatial, and less distracted by low-level visual variation.
- The dynamics model, not DINO alone, learns interaction physics through action-conditioned future prediction.
- Data quality should be role-aware: low-quality data may hurt behavior cloning but help dynamics learning.
- Actionless human videos are useful when the objective includes visual forecasting.
- For embodied intelligence, representation and data ingestion may matter as much as model size.
My personal read: this paper is a useful bridge between VLA policy learning and world-model-style robot pretraining. It suggests a practical direction: stop forcing all data into expert imitation, and instead let different data teach different parts of the robot’s world model.
