[Paper Notes] MEM: Multi-Scale Embodied Memory for Vision Language Action Models
TL;DR
MEM adds explicit memory to Vision-Language-Action models by splitting memory into two parts:
- short-term video memory for recent visual context, self-occlusion handling, and quick strategy adaptation
- long-term language memory for compact semantic summaries of what has already happened in a long task
The paper’s main claim is that end-to-end robot policies need multi-scale memory, not just a longer observation window. MEM lets a VLA solve tasks that require maintaining memory over horizons of up to 15 minutes, while still respecting real-time inference constraints.
Paper Info
- Title: MEM: Multi-Scale Embodied Memory for Vision Language Action Models
- Authors: Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, Karan Dhabalia, Michael Equi, Quan Vuong, Jost Tobias Springenberg, Sergey Levine, Chelsea Finn, Danny Driess
- Affiliations: Physical Intelligence, Stanford University, UC Berkeley, MIT
- Project page: pi.website/research/memory
- Paper type: robotics/VLA systems paper
1. Motivation
The paper starts from a gap in current VLAs:
- many strong VLAs act only on the current observation
- some memory-augmented methods simply append a dense history of past observations
- that approach becomes expensive and still does not distinguish between fine-grained short-term context and abstract long-term task state
For real robot tasks, these two memory types matter for different reasons:
- short-term memory helps with self-occlusion, tracking recent failures, and adjusting manipulation online
- long-term memory helps remember semantic progress, such as which ingredients have already been collected or which subtasks are done
The authors argue that a single uniform memory mechanism is a poor fit for both.
2. Core Idea of MEM
MEM splits action generation into a low-level and high-level process.
- The low-level policy uses a short horizon of dense observations plus a subtask instruction.
- The high-level policy predicts both the next subtask instruction and an updated language memory summarizing past semantic events.
This gives the model a mixed-modal memory system:
- video memory for recent detailed visual context
- language memory for long-horizon compressed task history
That design is the real contribution of the paper. It is not just “more context”; it is a structured memory interface matched to different timescales.
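The two-level split above can be sketched as a control loop. This is a toy illustration, not the paper's API: class and field names are invented, and the placeholder policies stand in for the actual learned models.

```python
# Hypothetical sketch of MEM's two-level control loop (names are illustrative,
# not the paper's API). A high-level policy updates a textual memory and picks
# the next subtask; a low-level policy acts on recent frames plus that subtask.
from collections import deque


class HighLevelPolicy:
    def step(self, frames, task, language_memory):
        # Placeholder: the real model would emit a subtask instruction and a
        # revised semantic summary of the episode so far.
        subtask = f"next subtask for '{task}'"
        new_memory = language_memory + [f"started: {subtask}"]
        return subtask, new_memory


class LowLevelPolicy:
    def act(self, frames, subtask):
        # Placeholder: the real model would output a robot action chunk.
        return {"subtask": subtask, "n_frames": len(frames)}


def run_episode(env_frames, task, horizon=6):
    video_memory = deque(maxlen=horizon)  # short-term: dense recent frames
    language_memory = []                  # long-term: compact semantic summary
    high, low = HighLevelPolicy(), LowLevelPolicy()
    actions = []
    for frame in env_frames:
        video_memory.append(frame)
        subtask, language_memory = high.step(list(video_memory), task, language_memory)
        actions.append(low.act(list(video_memory), subtask))
    return actions, language_memory
```

The key structural point is that the video memory is bounded (a sliding window), while the language memory grows only by compact textual entries, so the context cost stays flat over long episodes.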
3. Method Breakdown
3.1 Language memory for long-horizon context
The language memory is a running semantic summary of what has already happened in the episode. Instead of passing all previous observations or all previous subtasks verbatim, the model updates a compact textual state over time.
This is important because:
- it compresses long-horizon history much better than raw images
- it keeps the model focused on semantically relevant events
- it avoids exploding context length for tasks that span many minutes
The paper also shows that a naive alternative, simply concatenating previous subtask instructions, works noticeably worse because training-time demonstrations rarely contain repeated failures, while inference-time rollouts often do.
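The contrast with naive concatenation can be made concrete with a toy example. The simple run-length collapse below is my stand-in for the learned summarizer, purely to show why repeated failed attempts blow up the naive variant at inference time.

```python
# Toy contrast between naive subtask concatenation and a compressed language
# memory. The collapse-repeats logic is an illustrative stand-in for MEM's
# learned semantic summarization, not the paper's actual mechanism.

def naive_memory(subtasks):
    # Grows linearly and repeats every failed retry verbatim.
    return "; ".join(subtasks)


def compressed_memory(subtasks):
    # Collapse consecutive repeats (e.g. retried failures) into one entry
    # with a count, keeping the summary short and semantic.
    summary = []
    for s in subtasks:
        if summary and summary[-1][0] == s:
            summary[-1][1] += 1
        else:
            summary.append([s, 1])
    return "; ".join(s if n == 1 else f"{s} (x{n})" for s, n in summary)
```

On a rollout with three failed grasp attempts, the naive memory repeats "grasp cup" three times, while the compressed form records it once with a count; demonstrations used for training rarely contain such repeats, which is exactly why the naive variant degrades at inference.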
3.2 Video encoder for short-horizon context
For short-term memory, MEM extends a ViT-style image encoder into a video encoder by interleaving:
- spatial attention within each frame
- causal-temporal attention across frames
An important engineering detail is that this encoder:
- compresses time before passing features into the VLA backbone
- preserves the token count seen by the backbone at roughly the single-frame level
- introduces no new learnable parameters compared with the standard image ViT
That makes it possible to initialize from pre-trained VLM weights while keeping inference latency under real-time constraints.
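A minimal sketch of this interleaving, under my own simplifying assumptions (single head, no layer norm or MLP, mean-pooling as the temporal compression): spatial attention runs within each frame, causal attention runs across frames at each spatial location, and both reuse the same projection matrices, so no weights are added beyond the image ViT's.

```python
# Minimal sketch of interleaved spatial / causal-temporal attention with
# temporal compression. Simplified (single head, no norms/MLP, mean pooling
# as the compression step); not the paper's exact architecture.
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v, mask=None):
    # Scaled dot-product attention over the token axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v


def video_block(x, Wq, Wk, Wv):
    # x: (T frames, N spatial tokens, D dim). The same projections serve both
    # attention patterns, so there are no parameters beyond the image ViT's.
    T, N, D = x.shape
    # 1) Spatial attention within each frame.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    x = x + attend(q, k, v)
    # 2) Causal temporal attention across frames, per spatial location.
    xt = np.swapaxes(x, 0, 1)                      # (N, T, D)
    qt, kt, vt = xt @ Wq, xt @ Wk, xt @ Wv
    causal = np.tril(np.ones((T, T), dtype=bool))  # frame t attends to <= t
    xt = xt + attend(qt, kt, vt, mask=causal)
    x = np.swapaxes(xt, 0, 1)
    # 3) Compress time before the VLA backbone: the backbone sees N tokens,
    #    i.e. roughly single-frame cost regardless of T.
    return x.mean(axis=0)
```

The output has shape `(N, D)` whatever the number of frames, which is the property that keeps backbone latency at the single-frame level.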
3.3 Integrating MEM into pi0.6
The paper instantiates MEM inside the pi0.6 VLA. During pre-training, the model sees sequences of six observations with one-second stride. During post-training, the observation horizon can be expanded significantly, up to around one minute for observation-based memory, while the language memory covers much longer horizons.
The broader point is that the memory capability is developed during large-scale pre-training, not bolted on only at the end.
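The horizons described above can be collected into one configuration sketch. Field names and the config style are my own invention; the numbers come from the paper's description.

```python
# Hypothetical configuration reflecting the described training setup
# (field names are illustrative, values from the paper's description).
from dataclasses import dataclass


@dataclass
class MemoryHorizons:
    pretrain_frames: int = 6        # observations per sequence in pre-training
    pretrain_stride_s: float = 1.0  # one-second stride between observations
    posttrain_obs_s: float = 60.0   # observation memory after post-training
    language_memory_s: float = 900.0  # language memory spans up to ~15 minutes
```

The pre-training observation window is thus only about six seconds; the much longer horizons are reached by widening the window in post-training and by the language memory, not by scaling dense frame history.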
4. Main Results
4.1 Long-horizon manipulation tasks
The headline result is that MEM enables long-horizon tasks such as:
- recipe setup
- kitchen cleanup
- grilled cheese preparation
These tasks require maintaining task-relevant memory over up to fifteen minutes. The paper shows that memoryless pi0.6 struggles badly on them, while the full MEM system makes them much more tractable.
The ablations are especially informative:
- removing video memory hurts tasks that require recent timing, occlusion handling, or adaptation
- removing language memory hurts tasks that require long-range semantic progress tracking
- using naive text memory without compression performs much worse than learned semantic summaries
This supports the paper’s central claim that both memory scales are necessary.
4.2 In-context adaptation
One of the most interesting parts of the paper is that memory is not only for remembering task progress. It also enables in-context adaptation of manipulation strategy.
The authors demonstrate this on tasks like:
- picking up a chopstick at an out-of-distribution table height
- opening a refrigerator when the door-opening direction is ambiguous
With memory, the policy can use recent failed attempts as context and switch strategy. The reported gains are substantial:
- about +11% on chopstick pickup
- about +62% on fridge opening
The memoryless baseline cannot adapt as effectively because it cannot explicitly condition on what was just tried and failed.
4.3 Core memory capability benchmarks
The paper also evaluates memory-intensive skills including:
- partial observability
- counting
- timing
- spatial memory
Example tasks include finding an object hidden in one of four drawers, unpacking groceries without forgetting items inside the bag, counting coffee scoops, cooking for the right duration, and remembering which parts of a window have already been cleaned.
Across these tasks, MEM is reported as the only approach that performs strongly across the full range of memory demands, outperforming:
- no-memory VLAs
- pooled observation-memory baselines
- proprio-only memory baselines
4.4 Memory without sacrificing standard manipulation
A useful systems result is that MEM does not just improve memory-heavy tasks. The paper also shows it remains competitive on manipulation tasks that do not require memory, suggesting the memory machinery does not degrade core dexterous control.
5. Why This Paper Matters
I think the most important idea here is conceptual: robot memory should be structured by abstraction level and timescale.
That is more convincing than the simpler “just add more frames” strategy for three reasons:
- recent visual context and long-term semantic state are fundamentally different objects
- runtime constraints matter a lot for real robot control
- long-horizon deployment requires compression, not only larger context windows
In that sense, MEM feels closer to a practical robot architecture than a pure sequence-model scaling exercise.
6. Strengths
- Clear separation between short-term and long-term memory roles.
- Strong systems focus: the method is designed around real-time latency constraints, not only benchmark accuracy.
- Good ablation story showing why both memory modalities matter.
- The in-context adaptation result is more interesting than standard memory benchmarks because it shows memory changing behavior, not just recall.
- The paper evaluates a fairly broad set of embodied tasks across different memory requirements.
7. Limitations and Open Questions
- The language memory still depends on the model learning useful semantic compression; it is not guaranteed to preserve every critical detail.
- The approach is integrated into a large proprietary-style VLA stack (pi0.6), so reproducibility and accessibility may be limited compared with fully open systems.
- It is still episodic memory. The paper explicitly frames longer-term deployment memory across days, weeks, or months as future work.
- It remains unclear how robust the memory summaries are under very long failure chains or heavily out-of-distribution task structures.
- The paper demonstrates strong system behavior, but it is harder to isolate how much comes from the memory design itself versus the scale of the underlying training setup.
8. Takeaways
My main takeaway is that MEM gives a strong recipe for long-horizon robot control:
- keep recent observations in a compact visual memory
- summarize distant history in language
- let the policy reason over both
This feels like a more scalable direction for embodied agents than trying to extend a flat observation history indefinitely. If long-horizon robotics is going to work in open-ended homes and kitchens, some version of this multi-scale memory design is likely to become standard.
