[Paper Notes] Reward Prediction with Factorized World States
Published:
TL;DR
This paper asks a very practical question for planning agents:
Can we predict useful rewards in a zero-shot way without training a task-specific reward model, simply by building a better state representation?
The answer proposed here is StateFactory, a method that factorizes text observations into hierarchical object-attribute world states, then predicts reward as the semantic similarity between the current state and a dynamically interpreted goal state.
The paper contributes both:
- a new benchmark, RewardPrediction, with 2,454 trajectories across five domains
- a zero-shot reward prediction framework that beats strong baselines
The headline numbers are:
- 60% lower EPIC distance than VLWM-critic
- 8% lower EPIC distance than LLM-as-a-Judge
- planning gains of +21.64% on AlfWorld
- planning gains of +12.40% on ScienceWorld
My short read is that the paper’s strongest idea is simple: reward quality is largely a state-representation problem.
Paper Info
- Title: Reward Prediction with Factorized World States
- Authors: Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao, Kai Zhang, Pascale Fung
- Affiliations: East China Normal University, HKUST
- arXiv: 2603.09400
- Project page: statefactory.github.io
- Paper type: agent planning / reward prediction / world-state representation
1. Problem and Motivation
Planning agents need more than just next-state prediction. They also need to know:
- whether a predicted state is closer to the goal
- how much progress an action makes
- which branch of a plan is more promising
The usual fix is to train a supervised reward model. But the paper points out a real problem:
- supervised reward models can overfit to training domains
- reward labels can encode dataset-specific biases
- generalization to new goals and environments can collapse
So the paper explores a different route:
- do not train a domain-specific reward predictor
- instead, build a structured state representation
- estimate reward by measuring distance between current state and goal state
This is a good framing because it shifts the problem from “how do I fit a scalar critic?” to “how do I represent world state so that semantic distance actually reflects task progress?”
2. The RewardPrediction Benchmark
One of the paper’s strongest contributions is the benchmark itself.
RewardPrediction contains:
- 2,454 unique trajectories
- step-wise actions, observations, and ground-truth rewards
- five interactive domains
The five domains are:
- AlfWorld
- ScienceWorld
- TextWorld
- WebShop
- BlocksWorld
This mix is useful because it spans:
- embodied planning
- scientific reasoning
- text-adventure procedural tasks
- web navigation
- classical symbolic planning
2.1 Why this benchmark matters
Most prior setups evaluate sparse end success or domain-specific reward quality. This benchmark instead focuses on step-wise progress estimation, which is much closer to what a planner actually needs.
The evaluation metric is EPIC distance, which compares predicted reward sequences against ground-truth reward sequences. Lower is better.
I think this is the right choice for the paper’s claim, because the goal is not only to know whether a trajectory ultimately succeeds, but whether the model’s reward landscape is actually aligned with task progress.
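For intuition, here is a minimal sketch of the core of the EPIC comparison. The full metric first canonicalizes rewards to remove potential-based shaping; once rewards are compared on fixed trajectories, the comparison reduces to the Pearson distance sqrt((1 - rho) / 2), which is 0 for perfectly correlated sequences and 1 for perfectly anti-correlated ones. This is an illustrative simplification, not the paper's exact evaluation code.

```python
import math

def pearson_distance(pred, true):
    """Pearson-distance core of EPIC between two reward sequences.

    Returns 0.0 when the sequences are perfectly correlated and 1.0
    when they are perfectly anti-correlated.
    """
    n = len(pred)
    mp = sum(pred) / n
    mt = sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in true))
    # Clamp for floating-point safety before taking the square root.
    rho = max(-1.0, min(1.0, cov / (sp * st)))
    return math.sqrt((1 - rho) / 2)
```

Note that this distance is invariant to affine rescaling of either sequence, which is exactly why it rewards a correctly *shaped* reward landscape rather than correctly scaled absolute values.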
2.2 Benchmark construction
The data construction is also thoughtful.
Each task instance includes:
- a positive trajectory
- a negative trajectory
Positive trajectories come from expert demonstrations and are densified with interpolated progress rewards. Negative trajectories come from random policies and are filtered to avoid accidental overlap with expert behavior.
That design makes it harder for a method to game the benchmark with shallow heuristics such as trajectory length or generic optimism.
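The densification step can be pictured as assigning each expert step a monotonically increasing progress reward that reaches 1.0 at the goal-achieving final step. The linear scheme below is a hypothetical stand-in; the paper's exact interpolation may differ.

```python
def densify_progress_rewards(num_steps):
    """Assign each step of an expert trajectory a linearly interpolated
    progress reward in (0, 1], reaching 1.0 at the final step.

    Hypothetical sketch of reward densification; the benchmark's
    actual interpolation scheme may differ.
    """
    return [(i + 1) / num_steps for i in range(num_steps)]
```

A four-step expert trajectory would then receive step rewards 0.25, 0.5, 0.75, 1.0, giving a planner a dense progress signal instead of a single terminal success bit.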
3. StateFactory
The core method is StateFactory, a representation-based reward predictor.
Instead of regressing reward directly from raw text history, it decomposes the process into:
- state extraction
- goal interpretation
- hierarchical routing
The final reward is computed as semantic similarity between the extracted current state and the interpreted goal state.
4. Method Breakdown
4.1 State extraction
StateFactory converts each observation into a structured state:
- a set of objects
- each object paired with attributes and values
The paper writes each object as an identity plus attribute-value pairs, such as:
- object identity: Mug
- attribute: location
- value: on the table
This is a stronger abstraction than:
- raw observation text, which is noisy
- plain object-centric state, which often entangles multiple properties together
The state extraction is also recurrent and goal-conditioned. It uses:
- current observation
- previous state
- previous action
- previous goal interpretation
- original goal text
That matters because state tracking in long-horizon tasks is not just parsing a snapshot. It is updating a belief over evolving world state.
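In code, the factorized representation is just objects with attribute-value maps. The sketch below shows the "Mug on the table" example in such a structure; the class names are my own illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    # One factorized object: an identity plus attribute-value pairs.
    identity: str
    attributes: dict = field(default_factory=dict)

@dataclass
class WorldState:
    # The full world state is a collection of factorized objects.
    objects: list = field(default_factory=list)

# The paper's running example in this representation:
state = WorldState(objects=[
    ObjectState(identity="Mug", attributes={"location": "on the table"}),
])
```

The recurrent extraction step would then consume the previous `WorldState` together with the new observation, last action, and current goal interpretation, and emit an updated `WorldState`.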
4.2 Goal interpretation
The paper argues that static goal representations can create an illusion of progress because they do not adapt to changes during execution.
So StateFactory treats goal interpretation as a dynamic process:
- start from the language goal
- repeatedly refine its grounded meaning using current state and trajectory context
This is an important design choice. In practical planning, the operational meaning of a goal often becomes clearer as the agent interacts with the environment.
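The refinement loop can be sketched as one re-grounding call per step. Everything here is an assumption for illustration: `llm` is any callable mapping a prompt string to text, and the function names are hypothetical, not the paper's API.

```python
def interpret_goal(llm, goal_text, state, history):
    """One hypothetical refinement step: re-ground the language goal
    against the current factorized state and recent actions."""
    prompt = (
        f"Goal: {goal_text}\n"
        f"Current state: {state}\n"
        f"Recent actions: {history}\n"
        "Rewrite the goal as a target object-attribute state."
    )
    return llm(prompt)

def rollout_goal_interpretations(llm, goal_text, states, actions):
    """Refresh the goal interpretation at every step of a trajectory."""
    interpretations = []
    for t, state in enumerate(states):
        interpretations.append(interpret_goal(llm, goal_text, state, actions[:t]))
    return interpretations
```

The key design point is that the goal state is recomputed per step, so the reward target can sharpen as the environment reveals, for example, which mug and which table the goal actually refers to.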
4.3 Hierarchical routing
This is the mechanism that turns structured states into rewards.
For each goal object, the method:
- finds the best matching object in the current state
- checks identity similarity
- checks attribute-value similarity
- aggregates these scores into local progress
- averages over all goal objects to get global reward
The matching is hierarchical:
- first align the right object
- then align the right attributes
- then measure how close the values are
This is much more interpretable than asking an LLM to emit a reward directly.
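The routing steps above can be sketched as follows. This is a minimal reading of the mechanism under stated assumptions: `sim` is any semantic similarity in [0, 1] (embedding cosine in the paper; a string-equality stub works for illustration), objects are plain dicts, and the exact aggregation may differ from the paper's.

```python
def hierarchical_reward(goal_objects, state_objects, sim):
    """Hierarchical routing: match each goal object to the current
    state, score identity then attribute values, average over goals."""
    if not goal_objects:
        return 0.0
    total = 0.0
    for g in goal_objects:
        # 1) route to the best-matching state object by identity
        best = max(state_objects, key=lambda s: sim(g["identity"], s["identity"]))
        id_score = sim(g["identity"], best["identity"])
        # 2) within that object, score each goal attribute's value
        attr_scores = [
            sim(v, best["attributes"].get(k, ""))
            for k, v in g["attributes"].items()
        ]
        attr_score = sum(attr_scores) / len(attr_scores) if attr_scores else 1.0
        # 3) local progress for this goal object
        total += id_score * attr_score
    # 4) global reward averages over all goal objects
    return total / len(goal_objects)
```

Because each score is attached to a specific object and attribute, a low reward can be traced back to the exact mismatched fact, which is the interpretability advantage over a direct scalar judgment.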
5. Main Results on RewardPrediction
Table 1 is the key result.
5.1 Zero-shot comparison
Overall EPIC distance:
- VLWM-critic: 0.738
- LLM-as-a-Judge (best listed): 0.322
- StateFactory: 0.297
So StateFactory improves substantially over the zero-shot baselines, especially against VLWM-critic.
The abstract summarizes this as:
- 60% lower EPIC distance than VLWM-critic
- 8% lower EPIC distance than LLM-as-a-Judge
Those numbers match the main takeaway: explicit structured state helps reward generalization.
5.2 Comparison to supervised reward models
The supervised models are especially interesting because they expose the generalization gap.
When trained on a single domain, they do very well in-domain, but their error rises sharply on unseen domains. The paper reports an average 138% increase in error when transferring out of domain.
That is probably the most important result in the paper besides StateFactory itself:
supervised reward modeling is powerful in-domain, but brittle for cross-domain zero-shot planning.
5.3 Nearing the supervised upper bound
A particularly strong result is that StateFactory’s zero-shot performance gets close to the supervised model trained on all domains.
That does not mean supervision is useless. It means that for this task, a strong structured representation can recover a large fraction of what supervised critics were learning.
6. Ablations
The ablations are well aligned with the paper’s claim.
6.1 Representation granularity matters
Figure 5(a) shows a clear progression:
- raw observations are worst
- plain textual states are better
- object-centric states are better still
- full object-attribute factorization is best
The argument is convincing:
- raw observations contain too much distractor text
- object-only states still entangle attributes
- object-attribute factorization separates the parts of state that actually change during task progress
6.2 Dynamic goal interpretation is close to oracle
The online goal interpretation performs only slightly worse than an offline oracle goal-state setup, with about a 0.02 EPIC gap according to the paper.
That is a strong sign that the dynamic goal grounding mechanism is not the main bottleneck.
6.3 Better reasoning models help
The paper also shows that stronger LLM backbones and “thinking” modes improve factorization quality. This is a good sign for scalability: the method should benefit as reasoning models improve.
6.4 Embedding quality matters
StateFactory depends on semantic embeddings to measure similarity. The paper shows a strong correlation between triplet-based embedding accuracy and final reward performance.
That is useful because it clarifies where future gains may come from:
- better factorization
- better semantic alignment models
7. Utility for Planning
The reward model is only interesting if it actually helps planning.
7.1 ReAct + StateFactory
The paper augments ReAct with StateFactory-based reward scoring and reports:
- AlfWorld: 34.33 -> 55.97
- BlocksWorld: 85.00 -> 93.00
- ScienceWorld: 22.63 -> 35.03
These are large gains, especially on AlfWorld and ScienceWorld.
This is important because it shows the benchmark result is not merely cosmetic. The reward signal is actually useful for action selection.
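A minimal way to use such a reward signal for action selection is greedy reranking: imagine each candidate's next state, score it against the goal state, and pick the best. The names below (`world_model`, `reward_fn`) are assumptions, not the paper's API, and the paper's actual integration with ReAct may be richer than this.

```python
def select_action(candidates, world_model, reward_fn, state, goal_state):
    """Greedy action reranking with a predicted reward signal.

    world_model: (state, action) -> imagined next state
    reward_fn:   (state, goal_state) -> StateFactory-style score
    """
    best_action, best_reward = None, float("-inf")
    for action in candidates:
        next_state = world_model(state, action)   # imagined next state
        r = reward_fn(next_state, goal_state)     # semantic progress score
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action
```

The same scoring function generalizes beyond greedy selection: in the system-2 setup discussed next, it acts as the value estimate at tree-search nodes.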
7.2 System-2 planning
The paper also integrates StateFactory into a system-2 planning setup with:
- LLM action proposals
- a world model
- Monte Carlo Tree Search
The qualitative analysis suggests that StateFactory serves as a structured heuristic, helping search avoid dead ends and making reward increases more grounded in state evidence.
8. Why This Paper Is Interesting
I think the paper is valuable for three reasons.
8.1 It reframes reward prediction
Rather than treating reward modeling as a scalar regression problem, it treats it as a world-state representation problem. That is a cleaner and more general framing.
8.2 It contributes a useful benchmark
RewardPrediction is not only another dataset. It provides a way to evaluate step-wise reward quality across domains, which is still missing in many agent papers.
8.3 It connects representation quality to planning quality
The paper does not stop at EPIC scores. It closes the loop by showing that better reward estimates actually improve agent success rates.
9. Limitations
A few limitations are worth keeping in mind.
- The method relies on strong LLM-based state extraction and goal interpretation, so quality depends on those components.
- The environments are text-based or text-grounded; extension to richer visual real-world settings is plausible but not directly demonstrated here.
- Semantic similarity is powerful, but it may still miss deeper causal or irreversible task structure in some domains.
- The benchmark is diverse, but still limited to five domains and offline trajectories.
10. Takeaways
My main takeaway is:
StateFactory shows that if you can factorize world state into the right semantic units, reward prediction becomes much more generalizable.
More broadly, the paper suggests a practical recipe for planning agents:
- predict or track explicit world state
- represent state as objects plus attributes
- interpret goals dynamically rather than once
- derive reward from semantic alignment instead of only learned scalar heads
For LLM agents, that feels like a productive direction. Instead of asking a model to “judge progress” directly, we can first make the state legible and then let reward emerge from structure.
