[Paper Notes] Reward Prediction with Factorized World States

Published: March 14, 2026

TL;DR

This paper asks a very practical question for planning agents:

Can we predict useful rewards in a zero-shot way without training a task-specific reward model, simply by building a better state representation?

The answer proposed here is StateFactory, a method that factorizes text observations into hierarchical object-attribute world states, then predicts reward as the semantic similarity between the current state and a dynamically interpreted goal state.

The paper contributes both:

a new benchmark, RewardPrediction, with 2,454 trajectories across five domains
a zero-shot reward prediction framework that beats strong baselines

The headline numbers are:

60% lower EPIC distance than VLWM-critic
8% lower EPIC distance than LLM-as-a-Judge
planning success-rate gains of +21.64 percentage points on AlfWorld (34.33 -> 55.97)
planning success-rate gains of +12.40 percentage points on ScienceWorld (22.63 -> 35.03)

My short read is that the paper’s strongest idea is simple: reward quality is largely a state-representation problem.

Paper Info

Title: Reward Prediction with Factorized World States
Authors: Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao, Kai Zhang, Pascale Fung
Affiliations: East China Normal University, HKUST
arXiv: 2603.09400
Project page: statefactory.github.io
Paper type: agent planning / reward prediction / world-state representation

1. Problem and Motivation

Planning agents need more than just next-state prediction. They also need to know:

whether a predicted state is closer to the goal
how much progress an action makes
which branch of a plan is more promising

The usual fix is to train a supervised reward model. But the paper points out several real problems:

supervised reward models can overfit to training domains
reward labels can encode dataset-specific biases
generalization to new goals and environments can collapse

So the paper explores a different route:

do not train a domain-specific reward predictor
instead, build a structured state representation
estimate reward by measuring distance between current state and goal state

This is a good framing because it shifts the problem from “how do I fit a scalar critic?” to “how do I represent world state so that semantic distance actually reflects task progress?”

2. The RewardPrediction Benchmark

One of the paper’s strongest contributions is the benchmark itself.

RewardPrediction contains:

2,454 unique trajectories
step-wise actions, observations, and ground-truth rewards
five interactive domains

The five domains are:

AlfWorld
ScienceWorld
TextWorld
WebShop
BlocksWorld

This mix is useful because it spans:

embodied planning
scientific reasoning
text-adventure procedural tasks
web navigation
classical symbolic planning

2.1 Why this benchmark matters

Most prior setups evaluate sparse end-of-episode success or domain-specific reward quality. This benchmark instead focuses on step-wise progress estimation, which is much closer to what a planner actually needs.

The evaluation metric is EPIC (Equivalent-Policy Invariant Comparison) distance, which compares predicted reward sequences against ground-truth reward sequences. Lower is better.

I think this is the right choice for the paper’s claim, because the goal is not only to know whether the agent succeeds at the end, but whether the model’s reward landscape is actually aligned with task progress.
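To make the metric concrete, here is a minimal sketch of the Pearson-distance form that EPIC is built on, applied directly to two reward sequences. It assumes no canonicalization step and a single trajectory at a time; these notes do not specify how the paper aggregates over trajectories, so treat the function names and that simplification as my own.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(predicted: Sequence[float],
                     ground_truth: Sequence[float]) -> float:
    """sqrt((1 - rho) / 2): 0 for perfectly aligned rewards, 1 for inverted."""
    rho = pearson(predicted, ground_truth)
    # Clamp to guard against tiny floating-point overshoot.
    rho = max(-1.0, min(1.0, rho))
    return math.sqrt((1.0 - rho) / 2.0)


ground_truth = [0.0, 0.5, 1.0]
aligned = pearson_distance([1.0, 2.0, 3.0], ground_truth)   # same shape, different scale
inverted = pearson_distance([3.0, 2.0, 1.0], ground_truth)  # reversed ordering
```

Because the distance depends only on correlation, a predictor is not penalized for using a different reward scale or offset, only for ranking progress differently from the ground truth.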

2.2 Benchmark construction

The data construction is also thoughtful.

Each task instance includes:

a positive trajectory
a negative trajectory

Positive trajectories come from expert demonstrations and are densified with interpolated progress rewards. Negative trajectories come from random policies and are filtered to avoid accidental overlap with expert behavior.

That design makes it harder for a method to game the benchmark with shallow heuristics such as trajectory length or generic optimism.
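As an illustration of what "densified with interpolated progress rewards" could look like, the sketch below assigns linearly increasing rewards along an expert trajectory. The linear schedule, the function name, and the final reward of 1.0 are assumptions for illustration; the notes do not state the exact interpolation the paper uses.

```python
from typing import List


def interpolate_progress(num_steps: int, final_reward: float = 1.0) -> List[float]:
    """Linearly spaced progress rewards; the last step receives final_reward."""
    if num_steps < 1:
        raise ValueError("a trajectory needs at least one step")
    return [final_reward * (t + 1) / num_steps for t in range(num_steps)]


dense_rewards = interpolate_progress(4)
print(dense_rewards)  # [0.25, 0.5, 0.75, 1.0]
```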

3. StateFactory

The core method is StateFactory, a representation-based reward predictor.

Instead of regressing reward directly from raw text history, it decomposes the process into:

state extraction
goal interpretation
hierarchical routing

The final reward is computed as semantic similarity between the extracted current state and the interpreted goal state.

4. Method Breakdown

4.1 State extraction

StateFactory converts each observation into a structured state:

a set of objects
each object paired with attributes and values

The paper writes each object as an identity plus attribute-value pairs, such as:

object identity: Mug
attribute: location
value: on the table

This is a stronger abstraction than:

raw observation text, which is noisy
plain object-centric state, which often entangles multiple properties together
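A small data-structure sketch may help here. The class and field names below are hypothetical, chosen only to mirror the "identity plus attribute-value pairs" description; they are not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ObjectState:
    """One object: an identity plus its attribute-value pairs."""
    identity: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class WorldState:
    """A world state is a collection of factorized objects."""
    objects: List[ObjectState] = field(default_factory=list)

    def get(self, identity: str) -> Optional[ObjectState]:
        """Return the object with this identity, or None if absent."""
        for obj in self.objects:
            if obj.identity == identity:
                return obj
        return None


state = WorldState([ObjectState("Mug", {"location": "on the table"})])
mug = state.get("Mug")
```

Keeping each attribute as its own key is what lets a later comparison say "the location changed but nothing else did", which a single free-text description per object cannot express cleanly.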

The state extraction is also recurrent and goal-conditioned. It uses:

current observation
previous state
previous action
previous goal interpretation
original goal text

That matters because state tracking in long-horizon tasks is not just parsing a snapshot. It is updating a belief over evolving world state.

4.2 Goal interpretation

The paper argues that static goal representations can create an illusion of progress because they do not adapt to changes during execution.

So StateFactory treats goal interpretation as a dynamic process:

start from the language goal
repeatedly refine its grounded meaning using current state and trajectory context

This is an important design choice. In practical planning, the operational meaning of a goal often becomes clearer as the agent interacts with the environment.
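The interplay between recurrent state extraction (4.1) and dynamic goal interpretation (4.2) can be sketched as a simple loop. Everything below is an assumption made for illustration: the two callables stand in for LLM calls, the toy versions just parse `object|attribute|value` strings, and the order "extract, then re-interpret the goal" within a step is my reading rather than something the notes confirm.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, Dict[str, str]]  # object identity -> {attribute: value}
Step = Tuple[Optional[str], str]   # (action that led here, observation)

# extract(observation, prev_state, prev_action, prev_goal_state, goal_text)
Extractor = Callable[[str, State, Optional[str], State, str], State]
# interpret(goal_text, prev_goal_state, current_state)
Interpreter = Callable[[str, State, State], State]


def run_tracking(steps: List[Step], goal_text: str,
                 extract: Extractor, interpret: Interpreter
                 ) -> List[Tuple[State, State]]:
    """Return the (state, goal interpretation) pair after every step."""
    state: State = {}
    goal_state: State = {}
    history: List[Tuple[State, State]] = []
    for action, observation in steps:
        # State extraction sees the previous state, action and goal reading.
        state = extract(observation, state, action, goal_state, goal_text)
        # Goal interpretation is then refined against the updated state.
        goal_state = interpret(goal_text, goal_state, state)
        history.append((state, goal_state))
    return history


def toy_extract(observation, prev_state, prev_action, prev_goal, goal_text):
    """Toy stand-in for an LLM call: parses 'object|attribute|value'."""
    obj, attr, value = observation.split("|")
    new_state = {name: dict(attrs) for name, attrs in prev_state.items()}
    new_state.setdefault(obj, {})[attr] = value
    return new_state


def toy_interpret(goal_text, prev_goal, state):
    """Toy stand-in: the goal is already given as 'object|attribute|value'."""
    obj, attr, value = goal_text.split("|")
    return {obj: {attr: value}}


history = run_tracking(
    [(None, "mug|location|on the table"),
     ("take mug", "mug|location|in hand"),
     ("open fridge", "fridge|door|open")],
    "mug|location|in fridge", toy_extract, toy_interpret)
```

The point of the sketch is the data flow: each step's state depends on the previous state and goal reading, so the tracker accumulates facts (the fridge door) instead of re-parsing every observation in isolation.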

4.3 Hierarchical routing

This is the mechanism that turns structured states into rewards.

For each goal object, the method:

finds the best matching object in the current state
checks identity similarity
checks attribute-value similarity
aggregates these scores into local progress
averages over all goal objects to get global reward

The matching is hierarchical:

first align the right object
then align the right attributes
then measure how close the values are

This is much more interpretable than asking an LLM to emit a reward directly.
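The routing steps above can be sketched as follows. This is a toy version under stated assumptions: `similarity` is a token-overlap stand-in for the embedding similarity the method relies on, and multiplying identity, attribute, and value scores before averaging is my own choice of aggregation, not necessarily the paper's.

```python
from typing import Dict

State = Dict[str, Dict[str, str]]  # object identity -> {attribute: value}


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def object_progress(goal_attrs: Dict[str, str],
                    current_attrs: Dict[str, str]) -> float:
    """Mean over goal attributes of (attribute match * value match)."""
    if not goal_attrs:
        return 1.0
    if not current_attrs:
        return 0.0
    scores = []
    for goal_attr, goal_value in goal_attrs.items():
        # Align the attribute first, then compare the values.
        best_attr = max(current_attrs, key=lambda a: similarity(goal_attr, a))
        attr_score = similarity(goal_attr, best_attr)
        value_score = similarity(goal_value, current_attrs[best_attr])
        scores.append(attr_score * value_score)
    return sum(scores) / len(scores)


def reward(current: State, goal: State) -> float:
    """Average local progress over all goal objects."""
    if not goal:
        raise ValueError("goal state must contain at least one object")
    if not current:
        return 0.0
    total = 0.0
    for goal_name, goal_attrs in goal.items():
        # Route each goal object to its best-matching current object.
        best = max(current, key=lambda name: similarity(goal_name, name))
        identity_score = similarity(goal_name, best)
        total += identity_score * object_progress(goal_attrs, current[best])
    return total / len(goal)


goal = {"mug": {"location": "in fridge"}}
done = reward({"mug": {"location": "in fridge"}}, goal)
not_started = reward({"mug": {"location": "on table"}}, goal)
```

Every intermediate score here can be inspected, which is the interpretability advantage: a low reward can be traced to a missing object, a missing attribute, or a wrong value.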

5. Main Results on RewardPrediction

Table 1 is the key result.

5.1 Zero-shot comparison

Overall EPIC distance:

VLWM-critic: 0.738
LLM-as-a-Judge (best listed): 0.322
StateFactory: 0.297

So StateFactory improves on both zero-shot baselines: by a wide margin against VLWM-critic (0.738 -> 0.297) and by a smaller one against the best listed LLM-as-a-Judge (0.322 -> 0.297).

The abstract summarizes this as:

60% lower EPIC distance than VLWM-critic
8% lower EPIC distance than LLM-as-a-Judge

Those numbers match the main takeaway: explicit structured state helps reward generalization.
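As a quick sanity check, the relative reductions can be recomputed from the overall EPIC distances listed in 5.1. The helper name is mine; the inputs are the three numbers quoted above.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


vs_vlwm = relative_reduction(0.738, 0.297)
vs_judge = relative_reduction(0.322, 0.297)
print(f"{vs_vlwm:.1%}")   # 59.8%
print(f"{vs_judge:.1%}")  # 7.8%
```

Both values round to the 60% and 8% quoted in the abstract.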

5.2 Comparison to supervised reward models

The supervised models are especially interesting because they expose the generalization gap.

When trained on a single domain, they do very well in-domain, but their error rises sharply on unseen domains. The paper reports an average 138% increase in error when transferring out of domain.

That is probably the most important result in the paper besides StateFactory itself:

supervised reward modeling is powerful in-domain, but brittle for cross-domain zero-shot planning.

5.3 Nearing the supervised upper bound

A particularly strong result is that StateFactory’s zero-shot performance gets close to the supervised model trained on all domains.

That does not mean supervision is useless. It means that for this task, a strong structured representation can recover a large fraction of what supervised critics were learning.

6. Ablations

The ablations are well aligned with the paper’s claim.

6.1 Representation granularity matters

Figure 5(a) shows a clear progression:

raw observations are worst
plain textual states are better
object-centric states are better still
full object-attribute factorization is best

The argument is convincing:

raw observations contain too much distractor text
object-only states still entangle attributes
object-attribute factorization separates the parts of state that actually change during task progress

6.2 Dynamic goal interpretation is close to oracle

The online goal interpretation performs only slightly worse than an offline oracle goal-state setup, with about a 0.02 EPIC gap according to the paper.

That is a strong sign that the dynamic goal grounding mechanism is not the main bottleneck.

6.3 Better reasoning models help

The paper also shows that stronger LLM backbones and “thinking” modes improve factorization quality. This is a good sign for scalability: the method should benefit as reasoning models improve.

6.4 Embedding quality matters

StateFactory depends on semantic embeddings to measure similarity. The paper shows a strong correlation between triplet-based embedding accuracy and final reward performance.

That is useful because it clarifies where future gains may come from:

better factorization
better semantic alignment models
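These notes do not define "triplet-based embedding accuracy", so the sketch below uses one common reading: for each (anchor, positive, negative) triplet, the similarity function is correct if it scores the positive above the negative. The token-overlap similarity and the example triplets are hypothetical stand-ins for a real embedding model and a real evaluation set.

```python
from typing import Callable, Iterable, Tuple

Triplet = Tuple[str, str, str]  # (anchor, positive, negative)


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def triplet_accuracy(triplets: Iterable[Triplet],
                     sim: Callable[[str, str], float]) -> float:
    """Fraction of triplets where the positive outscores the negative."""
    items = list(triplets)
    if not items:
        raise ValueError("need at least one triplet")
    correct = sum(1 for anchor, pos, neg in items
                  if sim(anchor, pos) > sim(anchor, neg))
    return correct / len(items)


triplets = [
    ("mug in fridge", "mug inside fridge", "mug on table"),
    ("door open", "door is open", "door closed"),
    ("light on", "lamp lit", "light off"),  # lexical overlap misleads here
]
accuracy = triplet_accuracy(triplets, token_jaccard)
```

The third triplet is the kind of case where a purely lexical similarity fails and a better semantic model would be expected to help, which is the dependency this ablation points at.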

7. Utility for Planning

The reward model is only interesting if it actually helps planning.

7.1 ReAct + StateFactory

The paper augments ReAct with StateFactory-based reward scoring and reports:

AlfWorld: 34.33 -> 55.97
BlocksWorld: 85.00 -> 93.00
ScienceWorld: 22.63 -> 35.03

These are gains of +21.64, +8.00, and +12.40 points respectively, with the largest on AlfWorld and ScienceWorld.

This is important because it shows the benchmark result is not merely cosmetic. The reward signal is actually useful for action selection.

7.2 System-2 planning

The paper also integrates StateFactory into a system-2 planning setup with:

LLM action proposals
a world model
Monte Carlo Tree Search

The qualitative analysis suggests that StateFactory serves as a structured heuristic, helping search avoid dead ends and making reward increases more grounded in state evidence.
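To show how a reward predictor plugs into search, here is a deliberately simplified sketch: a greedy one-step lookahead, not Monte Carlo Tree Search. The `propose`, `world_model`, and `reward_fn` callables are hypothetical placeholders for the LLM proposer, the world model, and a StateFactory-style scorer; the toy versions exist only so the example runs.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, Dict[str, str]]  # object identity -> {attribute: value}


def select_action(state: State, goal: State,
                  propose: Callable[[State], List[str]],
                  world_model: Callable[[State, str], State],
                  reward_fn: Callable[[State, State], float]
                  ) -> Tuple[str, float]:
    """Score each proposed action by the reward of its predicted next state."""
    candidates = propose(state)
    if not candidates:
        raise ValueError("the proposer returned no actions")
    scored = [(reward_fn(world_model(state, action), goal), action)
              for action in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


def toy_propose(state: State) -> List[str]:
    return ["put mug on table", "put mug in fridge", "wait"]


def toy_world_model(state: State, action: str) -> State:
    """Toy transition model: only the two 'put' actions change anything."""
    nxt = {name: dict(attrs) for name, attrs in state.items()}
    if action == "put mug in fridge":
        nxt["mug"]["location"] = "in fridge"
    elif action == "put mug on table":
        nxt["mug"]["location"] = "on table"
    return nxt


def toy_reward(state: State, goal: State) -> float:
    """Toy scorer: fraction of goal attribute values matched exactly."""
    pairs = [(obj, attr, value)
             for obj, attrs in goal.items() for attr, value in attrs.items()]
    hits = sum(1 for obj, attr, value in pairs
               if state.get(obj, {}).get(attr) == value)
    return hits / len(pairs)


action, score = select_action(
    {"mug": {"location": "in hand"}},
    {"mug": {"location": "in fridge"}},
    toy_propose, toy_world_model, toy_reward)
print(action, score)  # put mug in fridge 1.0
```

A tree search would apply the same scoring at deeper nodes and back up the values, but the role of the reward function as a state-grounded heuristic is the same.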

8. Why This Paper Is Interesting

I think the paper is valuable for three reasons.

8.1 It reframes reward prediction

Rather than treating reward modeling as a scalar regression problem, it treats it as a world-state representation problem. That is a cleaner and more general framing.

8.2 It contributes a useful benchmark

RewardPrediction is not just another dataset. It provides a way to evaluate step-wise reward quality across domains, which is still missing in many agent papers.

8.3 It connects representation quality to planning quality

The paper does not stop at EPIC scores. It closes the loop by showing that better reward estimates actually improve agent success rates.

9. Limitations

A few limitations are worth keeping in mind.

The method relies on strong LLM-based state extraction and goal interpretation, so quality depends on those components.
The environments are text-based or text-grounded; extension to richer visual real-world settings is plausible but not directly demonstrated here.
Semantic similarity is powerful, but it may still miss deeper causal or irreversible task structure in some domains.
The benchmark is diverse, but still limited to five domains and offline trajectories.

10. Takeaways

My main takeaway is:

StateFactory shows that if you can factorize world state into the right semantic units, reward prediction becomes much more generalizable.

More broadly, the paper suggests a practical recipe for planning agents:

predict or track explicit world state
represent state as objects plus attributes
interpret goals dynamically rather than once
derive reward from semantic alignment instead of only learned scalar heads

For LLM agents, that feels like a productive direction. Instead of asking a model to “judge progress” directly, we can first make the state legible and then let reward emerge from structure.

Lixin Xu
