Before the Robot GPT-3.5 Moment: Latent Actions, World Models, and Embodied Memory
TL;DR
This essay starts from a larger question: what would it take for robotics to have its GPT-3.5 moment?
My current answer is that we are still before that moment. We have not yet done the robotics equivalent of strong pretraining. Many current VLA systems feel like promising pre-GPT-3.5 models: enough signal to be impressive in constrained settings, not enough generality to make prompt engineering, context engineering, or harness engineering truly shine. Before robotics can have sophisticated harnesses, it needs better pretrained representations of action, memory, physical consequence, and closed-loop interaction.
The argument is organized around four ideas:
- Latent action should become a bridge between human video, robot video, and embodiment-specific control. It should capture “what is being done” before it is forced into a particular robot’s joint space.
- World models should not be treated as pretty video generators. Their real value is as interactive simulators that let an agent try, observe, revise, and learn consequences.
- KV-cache memory suggests a different view of embodied memory: memory is not just text in a prompt, but a reusable computational state whose validity depends on how the world changes.
- Human memory is not photorealistic reconstruction. We do not rebuild the world pixel by pixel. As an environment unfolds around us again, we recognize whether it feels familiar, plausible, or wrong.
The long-term hypothesis is simple: robot intelligence will not be built from action prediction alone. It will need pretrained action abstractions, physical world models, structured memory, reward-guided closed-loop learning, and eventually a robot-specific harness that turns weak or partial intelligence into reliable work.
1. Latent Action: Action Before Embodiment
The first piece is latent action.
The appeal is that raw robot action is too tied to one embodiment. A 7-DoF gripper action, a dexterous hand action, a bimanual robot action, and a human hand motion all describe interaction with the world, but their low-level coordinates are not naturally aligned. If we train directly on embodiment-specific actions, we risk learning a brittle mapping from pixels to joints instead of a reusable concept of manipulation.
Latent action is an attempt to insert a middle layer:
visual change + task semantics -> latent action -> embodiment-specific control
The latent action should answer questions like:
- What object is being acted on?
- What physical relation is changing?
- Is the agent approaching, grasping, rotating, opening, pushing, placing, or stabilizing?
- Which parts of the motion are task-essential, and which parts are embodiment-specific style?
- Can this action be transferred from human hands to a gripper, from a single arm to a dual-arm setup, or from video data to robot execution?
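The middle layer above can be sketched in a few lines of Python. Everything in this block is a hypothetical illustration: the `LatentAction` fields, the two decoder functions, and the 7-number gripper command layout are assumptions made for the example, not an existing API. In a real system the latent would more likely be a learned vector than a set of readable fields; symbolic fields are used here only so the decoding step is easy to follow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LatentAction:
    """Hypothetical embodiment-agnostic description of 'what is being done'."""
    verb: str                 # e.g. "grasp", "rotate", "push"
    target: str               # object being acted on
    delta_pose: List[float]   # desired object-frame change [dx, dy, dz], metres


def decode_for_gripper(action: LatentAction) -> List[float]:
    """Toy decoder: 3 translation values, 3 zeroed rotation values, 1 gripper bit."""
    close = 1.0 if action.verb == "grasp" else 0.0
    return [*action.delta_pose, 0.0, 0.0, 0.0, close]


def decode_for_bimanual(action: LatentAction) -> Dict[str, List[float]]:
    """Toy decoder: one arm carries out the motion, the other holds still."""
    return {"left": decode_for_gripper(action), "right": [0.0] * 7}


# Registry mapping an embodiment name to its decoder (names are illustrative).
DECODERS: Dict[str, Callable[[LatentAction], object]] = {
    "gripper_7dof": decode_for_gripper,
    "bimanual": decode_for_bimanual,
}

# The same latent action is decoded for two different embodiments.
lift_mug = LatentAction(verb="grasp", target="mug", delta_pose=[0.0, 0.0, 0.1])
gripper_cmd = DECODERS["gripper_7dof"](lift_mug)
bimanual_cmd = DECODERS["bimanual"](lift_mug)
```

The design point is only that the description of the interaction is produced once and the embodiment-specific part is isolated in a decoder that can be swapped.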
The difficulty is evaluation. If latent action is only used as an intermediate state, how do we know whether the representation is good? One instinct is to supervise it directly, but that may be the wrong level of pressure. In language models, we usually do not demand that every hidden vector correspond to a human-readable concept. We optimize the model by whether it predicts and acts well.
For robotics, the practical version might be:
- Learn latent actions from human and robot videos.
- Use them to predict future visual states or action chunks.
- Decode them into embodiment-specific controls.
- Score them by downstream success, reward models, physical plausibility, and policy improvement.
This creates a tension between interpretability and utility. A highly interpretable latent action space is easier to debug, but a less interpretable one may be more powerful if it is optimized through closed-loop outcomes. My current leaning is that we should not over-supervise the latent space too early. We need probes and diagnostics, but the final pressure should come from whether the latent action improves task success and transfer.
There is also a scale issue. If latent action is learned only from adjacent frames, it may capture local optical flow while missing long-horizon intent. If it is learned from whole videos, it may become too semantic and lose contact-level detail. The right representation probably needs both local and global constraints: short-horizon motion for physical grounding, long-horizon task structure for meaning.
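One way to write the two constraints down, as a sketch rather than a tested objective: let $z_t$ be the latent action inferred between observations $o_t$ and $o_{t+1}$, let $\hat{o}_{t+1}$ be a predictor of the next observation, let $g$ be a hypothetical clip-level summary of the latent sequence, and let $y$ be a task label or description. All of these symbols, and the weight $\lambda$, are assumptions introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = \underbrace{\sum_{t=1}^{T-1} \bigl\| \hat{o}_{t+1}(o_t, z_t) - o_{t+1} \bigr\|^2}_{\text{local: short-horizon motion}}
  \;+\; \lambda \,
    \underbrace{\mathcal{L}_{\text{task}}\bigl( g(z_{1:T-1}),\, y \bigr)}_{\text{global: long-horizon task structure}}
```

Under this framing, a very small $\lambda$ would push the latent toward something like optical flow, while a very large $\lambda$ would reward semantic summaries that ignore contact-level detail.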
2. World Models: More Than Video Prediction
The second piece is the meaning of a world model.
In robotics, world models are often pulled toward video generation. This is understandable. Video is a convenient medium for checking whether a model has learned physical regularities: objects persist, hands make contact, doors swing, liquids pour, tools interact with surfaces, and occlusion must be handled. A good video model can provide a strong visual prior.
But a robot does not need a world model merely to watch a possible future. It needs a world model as an environment for interaction.
The difference is important:
video generator:
current observation + prompt -> plausible future video
robot world model:
current state + candidate action -> changed state + observations + rewards + uncertainty
A video generator can look physically impressive and still fail at the properties robots care about. It may not know whether an apple is heavy or light, whether a transparent object has a sharp boundary, whether contact force is enough, or whether an object hidden behind another object should still constrain the plan. Photorealistic motion is not the same as physical understanding.
This is why I increasingly think the useful robotics world model should be evaluated by closed-loop affordance, not just visual fidelity. Can the agent use it to choose better actions? Can it roll out counterfactuals? Can it discover that a plan fails before touching the real robot? Can it support reinforcement learning or best-of-N selection? Can it preserve enough physics that reward optimization inside the model does not exploit nonsense?
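To make "use the model to choose better actions" concrete, here is a minimal best-of-N sketch in Python. The world model is a deliberately trivial 1-D stand-in, and every name (`Step`, `toy_world_model`, `score_plan`, `best_of_n`, `risk_weight`) is hypothetical. The uncertainty term is an assumed heuristic, included only to show how a planner might avoid exploiting regions where the model is unreliable.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    state: float        # toy 1-D state: gripper position along a rail
    reward: float       # higher is better
    uncertainty: float  # model's own estimate of how unreliable this step is


def toy_world_model(state: float, action: float, goal: float = 1.0) -> Step:
    """Hypothetical action-conditioned model: state + action -> consequence."""
    next_state = state + action
    reward = -abs(goal - next_state)
    # Assumption: the model is less trustworthy for large, rarely seen actions.
    uncertainty = abs(action) ** 2
    return Step(next_state, reward, uncertainty)


def score_plan(state: float, plan: List[float], risk_weight: float = 0.5) -> float:
    """Roll a candidate plan through the model and penalize uncertain steps."""
    total = 0.0
    for action in plan:
        step = toy_world_model(state, action)
        total += step.reward - risk_weight * step.uncertainty
        state = step.state
    return total


def best_of_n(state: float, n: int, horizon: int, seed: int = 0) -> Tuple[List[float], float]:
    """Sample n random plans and keep the one the model scores highest."""
    rng = random.Random(seed)
    candidates = [[rng.uniform(-0.5, 0.5) for _ in range(horizon)] for _ in range(n)]
    scored = [(score_plan(state, plan), plan) for plan in candidates]
    best_score, best_plan = max(scored, key=lambda pair: pair[0])
    return best_plan, best_score


plan, score = best_of_n(state=0.0, n=64, horizon=4)
```

Nothing here touches a real robot: the plan is selected entirely inside the model, which is why the model's physics and its uncertainty estimates matter more than how its rollouts look.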
The most promising direction may not be choosing between “world action models” and “action-conditioned world models” as a permanent fork. A mature robotics model may need flexible conditioning:
- text-to-video-and-action mode for generating successful demonstrations;
- action-to-future mode for simulating candidate controls;
- video-to-latent-action mode for learning from human and robot data;
- reward-conditioned mode for optimizing behavior;
- memory-conditioned mode for using what the agent has already seen.
In that framing, video is not the final product. Video is one observation channel through which the model learns physical consequence.
3. KV-Cache Memory and Embodied Recall
The KEEP paper made the conversation about memory more concrete. In LLM agents, memory is often treated as text: retrieve some notes, paste them into the prompt, and let the model reason. That works, but it is expensive and conceptually limited. If every embodied planning step requires the model to reread a long memory prompt before outputting a short action, the agent spends most of its time re-processing old context.
KV-cache-centric memory suggests a sharper abstraction. A memory is not only a sentence. It is also a reusable computational state inside the model. If the model has already processed a stable part of memory, perhaps that key-value state should be cached and reused. But embodied memory complicates this because the world changes. If the robot picks up the cup, opens the drawer, or moves the knife, some memory is now stale and some cache is invalid.
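The invalidation problem can be illustrated with a small Python sketch. This is not the method of any particular paper: `CachedSegment`, `EmbodiedKVCache`, and the idea of tagging each cached segment with the world entities it describes are assumptions made for the example, and `kv_state` is a string placeholder standing in for real key-value tensors.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set


@dataclass
class CachedSegment:
    kv_state: str                 # placeholder for real key-value tensors
    depends_on: FrozenSet[str]    # world entities this memory segment describes


@dataclass
class EmbodiedKVCache:
    segments: Dict[str, CachedSegment] = field(default_factory=dict)

    def put(self, segment_id: str, kv_state: str, depends_on: Set[str]) -> None:
        self.segments[segment_id] = CachedSegment(kv_state, frozenset(depends_on))

    def get(self, segment_id: str) -> Optional[str]:
        """Return the cached state, or None if it must be recomputed."""
        segment = self.segments.get(segment_id)
        return segment.kv_state if segment else None

    def invalidate(self, changed_entities: Set[str]) -> Set[str]:
        """Drop every segment that mentions an entity the robot just changed."""
        stale = {sid for sid, seg in self.segments.items()
                 if seg.depends_on & changed_entities}
        for sid in stale:
            del self.segments[sid]
        return stale


cache = EmbodiedKVCache()
cache.put("kitchen_layout", "kv:layout", {"counter", "sink"})
cache.put("cup_location", "kv:cup", {"cup", "counter"})
cache.put("drawer_contents", "kv:drawer", {"drawer", "knife"})

# The robot picks up the cup: only memory that depends on the cup goes stale.
dropped = cache.invalidate({"cup"})
```

One caveat the sketch ignores: in a causal transformer, the key-value state of a segment depends on everything that preceded it in the context, so dropping one segment may force later segments to be recomputed as well.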
That gives a useful metaphor for embodied intelligence:
memory is not a static archive; it is a living index into a changing world.
A key-value view makes this intuition concrete. A key alone does not reconstruct the value. A place, an object, a posture, or a partial observation can trigger memory, but the full “value” is often recovered only when the environment reenacts enough of the context. You walk into a room and know you have seen this arrangement before. You pick up a tool and your hand remembers how it should feel. You ride a bicycle not by rendering a full internal movie, but by continuously matching sensory feedback against a learned control manifold.
This is very different from treating memory as a database of complete scene descriptions.
For robots, the implication is that memory should probably have multiple forms:
- textual memory for explicit facts and task instructions;
- visual memory for object identity, pose, and scene layout;
- action memory for what was tried and what happened;
- KV or latent memory for reusable model-internal state;
- environmental memory for facts that should be checked by re-observing the world rather than trusted from stale recall.
The last category may be the most important. A robot should not simply remember “the mug is on the table” forever. It should know how confident that memory is, when it was last verified, what actions may have invalidated it, and how to cheaply re-check it.
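A minimal Python sketch of such a memory entry follows. The `Belief` class, its field names, the exponential decay, and the 600-second half-life are all assumptions chosen for illustration; a real system would need decay rates and invalidation rules that depend on the type of fact.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Belief:
    fact: str                    # e.g. "mug is on the table"
    entities: Set[str]           # objects the fact is about
    confidence: float            # in [0, 1] at the time of last verification
    verified_at: float           # timestamp (seconds) of last observation
    half_life_s: float = 600.0   # assumed decay rate, not a measured value

    def current_confidence(self, now: float) -> float:
        """Confidence decays exponentially with time since last verification."""
        elapsed = max(0.0, now - self.verified_at)
        return self.confidence * 0.5 ** (elapsed / self.half_life_s)

    def on_action(self, touched: Set[str]) -> None:
        """An action on a related object invalidates the belief outright."""
        if self.entities & touched:
            self.confidence = 0.0

    def needs_recheck(self, now: float, threshold: float = 0.6) -> bool:
        return self.current_confidence(now) < threshold

    def reverify(self, now: float, observed_confidence: float) -> None:
        """Re-observing the world resets both the confidence and the clock."""
        self.confidence = observed_confidence
        self.verified_at = now


mug = Belief("mug is on the table", {"mug", "table"}, confidence=0.9, verified_at=0.0)
fresh = mug.needs_recheck(now=60.0)      # verified a minute ago
aged = mug.needs_recheck(now=1200.0)     # two half-lives later
mug.on_action({"mug"})                   # the robot moved the mug
after_move = mug.needs_recheck(now=1201.0)
```

Here `fresh` is `False`, while `aged` and `after_move` are both `True`: the belief is trusted shortly after it is observed, and is queued for a cheap re-check once enough time has passed or an action has touched the mug.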
4. Human Experience: Recognition, Not Reconstruction
The relevant intuition is:
A human’s sense of the world is not reconstruction. It is recognizing, as the environment unfolds again, that this is familiar or that this is right.
That feels closer to how embodied memory works. When we imagine a cup falling, we do not necessarily generate a high-resolution video in our head. We anticipate a structure of consequences: gravity, acceleration, collision, sound, possible breakage, changed location. When a real cup falls in a way that violates those expectations, it feels wrong before we can explain why.
This matters for world models because current AI systems often overemphasize visual reconstruction. If the generated video looks plausible, we are tempted to say the model understands the world. But human physical understanding is not primarily judged by photorealism. It is judged by whether the unfolding event remains coherent under action.
For robotics, the useful internal question may not be:
Can I reconstruct the next frame?
It may be:
If I act, will the world respond in a way that remains dynamically coherent, task-relevant, and recoverable?
This reframes perception. The robot does not need to know everything in the scene. It needs to know which aspects of the scene are decision-relevant, which uncertainties can be resolved by acting or moving the camera, and which physical constraints must not be violated. In closed-loop work, changing the viewpoint is not a cosmetic improvement; it is part of cognition. Occlusion, color ambiguity, transparent boundaries, mass, and contact are not defects to be solved once. They are reasons the agent must keep interacting with the world.
5. The Robot GPT-3.5 Moment
The analogy to language models is useful if we keep it precise.
For LLMs, the path was roughly:
- Pretraining gave models broad competence.
- Prompt engineering discovered how to elicit latent abilities.
- Context engineering managed long inputs, retrieval, tool results, memory, and state.
- Harness engineering wrapped the model in workflows, verifiers, tools, policies, and product constraints.
Robotics has not yet completed the first of these stages: pretraining.
We have robot datasets, VLA models, action experts, imitation learning, diffusion policies, flow matching, teleoperation pipelines, simulators, and increasingly good video models. But we do not yet have the robotics equivalent of a broadly pretrained GPT-3.5: a model that has absorbed enough tasks, embodiments, physics, action outcomes, and recovery patterns that simple prompting or light adaptation unlocks robust general behavior.
This explains why some current robot systems feel both exciting and premature. The field is already reaching for context engineering and harness engineering, but the base model may not yet have enough general competence to be harnessed in the same way LLMs can be. A weak model can be improved by a workflow, but a workflow cannot invent missing physical priors, missing action abstractions, or missing contact understanding from nothing.
Still, the harness idea is important. Eventually robots will need their own harnesses. A robot harness will not be just a prompt template. It will coordinate:
- perception and active viewpoint selection;
- memory retrieval and memory invalidation;
- world-model rollouts and uncertainty estimates;
- reward models and safety constraints;
- low-level controllers and action-space conversion;
- failure detection and recovery;
- task decomposition and re-planning;
- human feedback and escalation.
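A toy version of such a coordination loop, in Python, might look like the sketch below. It covers only a few of the items above (perception, rollout, a safety check, execution, and escalation), and every name in it (`run_harness`, `HarnessLog`, the callable parameters) is hypothetical. The "robot" is a 1-D position on a rail, so this shows the control flow and nothing about real perception or control.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessLog:
    events: List[str] = field(default_factory=list)


def run_harness(
    observe: Callable[[], float],
    propose: Callable[[float], List[float]],
    predict: Callable[[float, float], float],
    is_safe: Callable[[float], bool],
    execute: Callable[[float], None],
    goal: float,
    max_steps: int = 10,
    tolerance: float = 0.05,
) -> HarnessLog:
    """Toy closed loop: perceive, roll out candidates, filter, act, re-check."""
    log = HarnessLog()
    for _ in range(max_steps):
        state = observe()                              # re-observe every step
        if abs(goal - state) <= tolerance:
            log.events.append("done")
            return log
        candidates = propose(state)                    # policy proposals
        # Keep only actions whose predicted outcome passes the safety check.
        safe = [a for a in candidates if is_safe(predict(state, a))]
        if not safe:
            log.events.append("escalate")              # hand over to a human
            return log
        action = min(safe, key=lambda a: abs(goal - predict(state, a)))
        execute(action)                                # low-level control
        log.events.append(f"act:{action:+.2f}")
    log.events.append("timeout")
    return log


# A 1-D stand-in for the robot: position on a rail, with a wall at 1.2.
world = {"x": 0.0}
log = run_harness(
    observe=lambda: world["x"],
    propose=lambda s: [-0.25, 0.25, 0.5],
    predict=lambda s, a: s + a,
    is_safe=lambda s: s <= 1.2,
    execute=lambda a: world.__setitem__("x", world["x"] + a),
    goal=1.0,
)
```

With these toy inputs, `log.events` ends up as `["act:+0.50", "act:+0.50", "done"]`. The loop trusts the predictor only to rank and filter candidates; success is always decided by the next real observation, never by the prediction.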
In language agents, harnesses turn text prediction into work. In robotics, harnesses will have to turn imperfect embodied prediction into safe, recoverable physical action.
6. A Possible Research Stack
These ideas can be organized as a stack:
flowchart TD
A["Raw human and robot experience<br/>egocentric video, robot trajectories, task outcomes"] --> B["Pretraining<br/>vision, language, action, contact, temporal consequence"]
B --> C["Latent action space<br/>task-level change before embodiment-specific control"]
C --> D["World model<br/>simulate consequences, uncertainty, and reward"]
D --> E["Embodied memory<br/>text, visual state, action history, KV cache, stale-state tracking"]
E --> F["Robot harness<br/>planning, retrieval, verification, controller routing, recovery"]
F --> G["Physical execution<br/>gripper, hand, arm, mobile base, sensors"]
G --> H["Environment feedback<br/>new observations, contact, success, failure"]
H --> E
H --> B
This diagram is not a claim that the architecture must be built exactly this way. It is a way to keep the layers distinct:
- Pretraining should learn broad regularities from large-scale experience.
- Latent action should abstract over embodiments without erasing physical detail.
- The world model should expose consequences, not only videos.
- Memory should track what remains true after action.
- The harness should make the whole system reliable enough to operate.
The near-term research agenda is smaller:
- Organize the paper map around latent action, world models, reward modeling, and embodied memory.
- Design latent-action evaluations that go beyond reconstruction: action decoding, reward correlation, sim success, and cross-embodiment transfer.
- Evaluate world models by whether they improve closed-loop decisions, not only by visual fidelity.
- Treat memory validity as a first-class problem, not an afterthought.
- Prototype robot harnesses that combine retrieval, verification, rollout, safety checks, and recovery.
7. What I Want to Remember
The useful question is not about any single module. It is about how the layers line up.
If language models taught us anything, it is that intelligence often appears after scale, representation, and harnessing line up. Prompt engineering became powerful only after pretraining made the model worth prompting. Context engineering mattered because the model could use context. Harness engineering became productive because the model had enough latent skill to be organized into workflows.
Robotics is still waiting for that alignment. The field has fragments: action heads, video priors, teleoperation data, simulators, memory systems, reward models, and increasingly strong VLM backbones. But the robot GPT-3.5 moment probably requires these fragments to become one learning system.
My current bet is that the key will not be “better action prediction” alone. It will be an embodied memory loop: latent actions that can transfer, world models that can be acted inside, memory that knows when it is stale, and harnesses that force the model to check, retry, and recover.
The robot does not only need to remember the world.
It needs to let the world remind it what is true.
