[Paper Notes] KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
Published:
This post supports English / 中文 switching via the site language toggle in the top navigation.
TL;DR
KEEP studies a very practical bottleneck in LLM-based embodied planning: memory is useful, but feeding long textual memory into the model at every planning step makes the agent slow. The planner may only need to output one action such as “find the potato” or “open the microwave”, yet before it can do that, the LLM repeatedly pays the prefill cost for a long prompt containing object states, robot state, past actions, and few-shot task examples.
The paper’s answer is to treat memory as KV cache rather than repeatedly as raw text. But embodied memory is awkward for naive caching because the world changes after actions. If the robot picks up an object, opens a drawer, places a plate in the sink, or turns on a faucet, part of the memory becomes stale and part of the cache is no longer valid. KEEP is therefore not simply a KV cache reuse paper. It is a paper about how to make KV reuse respect the structure and update pattern of embodied memory.
The system has three main pieces:
- Static-Dynamic Memory Construction: stable memory is cached in larger semantic groups, while frequently changing memory is cached at finer segment granularity.
- Multi-hop Memory Re-computation: the system recomputes KV for memories that matter to the current query, including memories that are only indirectly relevant through other memories.
- Layer-balanced Memory Loading: KV loading from CPU memory to GPU is scheduled across layers to avoid idle bubbles caused by uneven recomputation ratios.
On ALFRED, KEEP keeps success rate close to full recomputation while substantially lowering TTFT. Compared with CacheBlend, the paper reports 4.13% success-rate improvement and 1.90x TTFT reduction on Qwen-2.5-32B INT4.
Paper Info
The paper is “KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning” by Zebin Yang, Tong Xie, Baotong Lu, Shaoshan Liu, Bo Yu, and Meng Li. The arXiv version is 2602.23592. The paper is also listed by Microsoft Research, and the code is available at PKU-SEC-Lab/KEEP_Embodied_Memory.
The local codebase I read contains two implementations:
KEEP_transformers, a simpler Transformers implementation mainly for accuracy-oriented experiments.KEEP_vllm, a modified vLLM implementation used for latency experiments and system-level KV loading/recomputation.
Why This Problem Exists
A language-based embodied planner usually runs in a loop. It observes the environment, updates memory, builds a prompt, asks the LLM to choose the next action, executes that action, then repeats. At first glance this looks like ordinary prompting. But the cost profile is different from a chat application.
The generated answer is short. In ALFRED-style planning, the output might be one skill selected from a constrained set: find an object, pick it up, put it down, slice something, open a receptacle, close it, toggle an appliance, and so on. The prompt, however, can be long. It may contain task rules, demonstrations, object memories, the current state of discovered objects, and the task instruction. Over many planning steps, this creates a strange imbalance: the model spends most of its time reading memory, not producing action.
This is why TTFT matters. If every step must prefill thousands of memory tokens before generating a tiny action, the agent becomes sluggish. In a real robot loop, that is not a cosmetic issue. Slow planning changes the feel of interaction, limits how often the agent can replan, and makes long-horizon tasks more expensive.
The tempting fix is KV caching. Once the model has processed a memory segment, why not store its key-value states and reuse them later? That is exactly the right instinct, but embodied memory immediately complicates it.
Suppose the memory says:
{"object": "milk", "state": "cold", "position": "on the table"}
{"object": "table", "position": "in the kitchen"}
After the agent picks up the milk, the milk record changes. Depending on the representation, the table record may change too, because it no longer holds the milk. A prefix cache would only reuse the exact unchanged prefix, so a small edit can invalidate a long suffix. Fixed-size KV blocks are less brittle, but their boundaries are mechanical rather than semantic. A block might mix a stable task-history example with a volatile object state, or split two related object memories apart.
KEEP is built around this mismatch. Embodied memory is structured, semantically meaningful, and updated unevenly. A good cache strategy should be structured too.
The Memory Design Space
The paper places KEEP against several broad memory paradigms.
Parametric memory stores task experience in model weights through fine-tuning. It can be fast at inference time, but it is expensive to update and risks forgetting or overfitting.
Text memory keeps memory as prompt text. This is simple, transparent, and training-free, which is why it is common in embodied-agent prototypes. But the more memory you retrieve, the larger the prefill cost becomes.
Latent memory compresses history into learned or summarized states. This can be efficient, but compression may discard details that later become important. In embodied tasks, a tiny detail such as where a key was last seen may become the whole difference between success and failure.
KV-centric memory, KEEP’s choice, stores memory as the LLM’s own KV states. It is training-free like text memory, but aims to avoid repeated prefill. The challenge is scalability and correctness: KV tensors are large, may live partly in CPU RAM, and become invalid when their corresponding memory changes.
So the core contribution of KEEP is not merely “store KV”. The contribution is a management policy for deciding what granularity to cache, what to recompute, and when to load it.
Idea 1: Not All Memory Ages at the Same Speed
The first KEEP idea is Static-Dynamic Memory Construction. The intuition is almost mundane, which is part of why I like it: some memories change all the time, and some barely change at all.
Object-state memory is often dynamic. A potato can be whole, sliced, heated, moved, placed in a sink, or held by the robot. A drawer can be opened or closed. A mug can move from a countertop to a table. These records are exactly the ones that make embodied memory useful, but they are also the ones that break naive cache reuse.
Task-history examples are different. A few-shot example saying “Human: Put a cooked potato slice in the sink. Robot: find a knife, pick up the knife, …” does not change after the robot executes the current task. It is semantically rich and stable. Caching it as many tiny independent blocks would lose useful internal attention for no good reason.
KEEP therefore clusters memory using a sentence encoder, and then classifies memory groups by recent update frequency. A group is static if all its segments have stayed unchanged for the recent threshold window, set to t = 10 in the paper. Static groups are cached and retrieved as groups, preserving cross-attention inside the group. Dynamic groups are cached at individual segment granularity, so one updated object does not invalidate a large semantic neighborhood.
In other words, KEEP asks each memory group a practical question: are you old enough to be cached together, or active enough that I should isolate your changes?
This directly addresses a weakness of fixed-size block reuse. Fixed token blocks know nothing about object identity, task examples, or update frequency. KEEP uses the fact that embodied memory already has a semantic structure.
Idea 2: Importance Can Be Indirect
The second idea is Multi-hop Memory Re-computation. This part answers a subtle question: if we cached memory segments independently, how do we recover the cross-attention that the model would have used under full prefill?
One answer is to recompute a small fraction of tokens. Prior KV recomputation systems often choose positions heuristically, or choose tokens based on local discrepancy. KEEP argues that embodied planning needs a more query-aware and context-aware criterion.
Consider the instruction:
Open the door.
The directly relevant memory might be:
{"object": "door", "state": "locked"}
But the correct next action may depend on another memory:
{"object": "key", "position": "on the table"}
And that memory may itself depend on:
{"object": "table", "position": "in the kitchen"}
The table memory is not necessarily the most query-similar memory if we only compare it to “open the door”. But through the chain door to key to table, it becomes essential. This is the paper’s central reason for doing importance propagation.
At selected layers, KEEP starts from query-to-memory attention and assigns initial importance scores. It then selects a top ratio of memories, looks at attention from those selected memories to other memories, updates importance scores, and repeats until the selected set stabilizes. The final set is recomputed for the next layer.
This is a nice conceptual shift. Importance is not treated as a one-hop retrieval score. It is allowed to move through the memory graph created implicitly by attention. For embodied planning, that matters because many tasks are chains of preconditions: to open the door, find the key; to find the key, locate the table; to reach the table, go to the kitchen.
The paper also chooses memory-segment granularity rather than arbitrary token granularity. That is less fine-grained, but much friendlier to contiguous KV loading. In the code, the vLLM path tracks memory modules through module_lengths and recompute_modules, rather than maintaining a scattered set of individual token positions as the primary abstraction.
Idea 3: The Loader Has to Be Scheduled Too
The third idea, Layer-balanced Memory Loading, is the most systems-flavored part of the paper.
KEEP stores KV memory in larger-capacity memory such as CPU RAM and loads it to GPU when needed. This is necessary because long memory creates a large KV footprint. But once KV loading enters the picture, the bottleneck is no longer just “how many tokens do we recompute?” It is also “when do we move KV, and can that movement be hidden behind computation?”
A straightforward schedule loads KV for layer i + 1 while computing layer i. That works only if each layer has similar loading and computation work. KEEP observes that under recomputation, this assumption breaks. Early layers recompute more memory, so they load less old KV. Later layers recompute less memory, so they load more old KV. As a result, the loader may be idle early and the compute path may wait later.
KEEP uses a cross-layer observation: once a memory segment is not recomputed in an earlier layer, it cannot suddenly be recomputed later, because the hidden state needed to recompute it has already been discarded. Therefore, its KV for later layers is guaranteed to be needed. The system can use idle time in early layers to pre-load KV for future layers, smoothing the load.
This turns memory loading into a scheduling problem. The system is not only choosing which memories matter; it is also trying to make GPU computation and CPU-to-GPU KV movement arrive at the right time.
How the Code Maps to the Paper
The codebase makes the paper easier to understand because it exposes the planner-level and model-level pieces separately.
At the planner level, KEEP_vllm/src/task_planner.py defines message_kv, which wraps one memory item with its text content, token ids, semantic vector, KV cache, length, id, and KV placement. This is the basic unit that lets the system talk about memory semantically while still holding the concrete KV tensors needed by the model.
The same file defines overall_kv_schedule, which is the main coordination object for static KV, object memories, example memories, similarity scores, selected memory order, module ranges, and layer-wise loading buffers. Functions such as add_message_objects_by_observation and add_message_objects_by_action show how ALFRED observations and executed actions update object memory. If an object record changes, the old KV placement is removed and the object memory is re-added.
At the task level, KEEP_vllm/src/alfred/alfred_task_planner.py turns ALFRED objects and skills into planner prompts. It maintains skill sets for actions like finding objects, picking them up, putting them down, opening and closing receptacles, slicing objects, and toggling appliances. It also selects few-shot examples using sentence-transformer similarity. These examples become a relatively stable memory source, which fits the static side of KEEP.
At the serving/model level, KEEP_vllm/vllm_KEEP/vllm/model_executor/models/qwen2.py is where the custom vLLM behavior appears. The modified forward path passes cache_fuse_metadata, old_kv, layer_id, and kv_schedule through Qwen2 layers. The model tracks whether it is collecting KV, checking/recomputing memory, or using cached KV. It maintains recompute_modules, imp_indices, query_imp_indices, check_layers, and per-layer recomputation counts.
The layer-balanced loading logic appears through calls such as:
move_kv_layer_wise_vllm_v1
move_kv_layer_wise_extra_vllm_v1
These functions fill preallocated layer-wise KV buffers and use loading tables to avoid repeatedly loading the same memory for the same layer. The code is researchy, with placeholders such as PATH/TO/SENTENCE_TRANSFORMER, explicit CUDA assumptions, and Qwen-specific paths. But the artifact is still valuable because it shows how the abstract system idea lands in an actual inference loop.
Experiments and What They Show
The paper evaluates KEEP on ALFRED and WAH-NL, using Qwen-2.5 models. ALFRED reports Success Rate (SR), while WAH-NL reports Subgoal Success Rate (Sub-SR). Efficiency is measured with TTFT, because prefill dominates planning latency.
On ALFRED, the main comparison is telling:
- Full recompute with Qwen-2.5-14B reaches 44.63% SR and 0.410s TTFT.
- KEEP with Qwen-2.5-14B reaches 44.30% SR and 0.236s TTFT.
- Full recompute with Qwen-2.5-32B INT4 reaches 45.81% SR and 1.213s TTFT.
- KEEP with Qwen-2.5-32B INT4 reaches 45.50% SR and 0.635s TTFT.
So KEEP nearly preserves full-recompute accuracy while removing a large part of the prefill cost. Compared with CacheBlend, KEEP improves ALFRED SR from 39.36% to 44.30% on Qwen-14B and from 41.37% to 45.50% on Qwen-32B INT4. The TTFT improvement over CacheBlend is also substantial.
On WAH-NL, the pattern is similar. KEEP keeps Sub-SR close to full recompute while reducing TTFT. Compared with CacheBlend, it improves Sub-SR by about 3.3% and cuts TTFT by around 1.5x to 1.9x, depending on model size.
The ablation study clarifies the role of each module:
- Without static-dynamic memory construction, ALFRED SR drops from 44.30% to 37.36%, and TTFT rises from 0.230s to 0.355s. This means the grouping strategy is not just a speed trick; it also preserves useful memory interactions.
- Without multi-hop memory recomputation, SR drops to 41.78%, while TTFT stays close. This is the accuracy-protection mechanism.
- Without layer-balanced loading, SR is unchanged, but TTFT rises to 0.273s. This isolates the scheduler’s role as a latency optimization.
I read these results as a clean decomposition: static-dynamic construction decides the right cache granularity, multi-hop recomputation repairs the most important lost attention, and layer-balanced loading makes the system run efficiently.
Why I Find KEEP Interesting
Many agent-memory papers talk about what to remember: summaries, episodic records, semantic facts, retrieved examples, or compressed latent states. KEEP is more concerned with how memory is physically represented and moved through inference. That shift is important.
For an embodied agent, memory is not a decorative prompt section. It is part of the control loop. If the agent must consult memory before every action, then memory representation directly affects action latency. A memory system that is semantically elegant but too slow may fail in deployment. A memory system that is fast but breaks important relationships may plan badly. KEEP sits exactly at that tension point.
I also like that the three designs operate at different levels:
- Static-dynamic construction is a memory-organization idea.
- Multi-hop recomputation is a model-attention idea.
- Layer-balanced loading is a runtime-scheduling idea.
The paper is strongest when these levels reinforce each other. Segment-level memory makes recomputation decisions meaningful. Recomputed modules define what needs to be loaded. Loading decisions affect whether the system speedup materializes in wall-clock TTFT. It is not one isolated trick; it is a stack.
Limitations and Questions
KEEP assumes memory can be represented as structured segments. ALFRED and WAH-NL provide a friendly setting for this because object states and task histories can be written down clearly. In a less controlled real robot setting, the hard part may shift upstream: perception may be noisy, object identity may be uncertain, and memory updates may be incomplete or wrong.
There is also an implementation portability question. The released code is closely tied to Qwen, vLLM internals, attention-weight access, KV tensor layout, and explicit CUDA behavior. That is understandable for a systems prototype, but adopting KEEP in a different serving stack would require careful engineering.
Another question is how robust the importance propagation is when attention does not align with causal task relevance. The door-key-table story is compelling, but attention can be noisy or diffuse. In larger environments, it would be interesting to study whether explicit symbolic relations, retrieval graphs, or learned memory routers could complement attention-based propagation.
Finally, KEEP accelerates planning with the memory it is given. It does not guarantee memory correctness. If the agent records the wrong location for the key, faster KV reuse only helps the model consult the wrong fact faster.
Takeaway
KEEP is best understood as a systems paper about making long-memory embodied planners practical. It starts from a very real observation: LLM agents need memory, but long text memory turns every short action decision into an expensive prefill. KV cache reuse is the obvious direction, yet embodied memory changes too often and too unevenly for naive reuse.
The contribution is to make caching aware of the life cycle of memory. Stable memories can be grouped. Active memories should be isolated. Indirectly important memories deserve recomputation. KV loading should be scheduled across layers rather than treated as a passive cost.
That is the larger lesson I take from this paper: in embodied agents, memory is simultaneously a semantic object, a planning resource, and a systems artifact. Good memory design has to respect all three.
