[Project Notes] RoboInF: Scaling Robot Manipulation Data in Simulation for General Instruction Following
TL;DR
RoboInF is XLANG Lab’s preview release for scaling instruction-following robot manipulation data in simulation. It is best read as a data-engine preview: the release describes the generation pipeline and provides qualitative examples and scale numbers, while quantitative model results, training recipes, benchmarks, data, and code are still pending.
The key idea is to make synthetic trajectories usable for VLA training by pairing them with natural instructions, executable reward checks, motion-planning programs, and reward-verified rollouts. The resulting records can carry instruction, observation, action, end-effector pose, phase labels, success metadata, predicate-level results, and optional videos.
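To make the record contents concrete, here is a minimal sketch of what one trajectory record could look like. The class name, field names, and types are my own assumptions for illustration; the release has not published a data format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrajectoryRecord:
    """Hypothetical schema; names are illustrative, not RoboInF's format."""
    instruction: str
    observations: list    # per-step observations (e.g. frame ids or arrays)
    actions: list         # per-step action vectors
    ee_poses: list        # per-step end-effector poses
    phase_labels: list    # per-step phase labels such as "reach" or "grasp"
    success: bool         # overall reward outcome
    predicate_results: dict = field(default_factory=dict)  # per-predicate outcomes
    video_path: Optional[str] = None                       # optional rendered video

    def is_consistent(self) -> bool:
        """Check that all per-step lists have the same length."""
        n = len(self.actions)
        return all(
            len(seq) == n
            for seq in (self.observations, self.ee_poses, self.phase_labels)
        )


record = TrajectoryRecord(
    instruction="put the can in the drawer",
    observations=["frame_0", "frame_1"],
    actions=[[0.0] * 7, [0.1] * 7],
    ee_poses=[[0.0] * 7, [0.0] * 7],
    phase_labels=["reach", "grasp"],
    success=True,
    predicate_results={"IsInside(can, drawer)": True},
)
```

Keeping predicate-level results next to the overall success flag is what would let a downstream user filter or audit rollouts by individual conditions rather than by a single pass/fail bit.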
Core Story
The project page, published on May 27, 2026 at xlang.ai/blog/roboinf, reports 1M+ successful trajectories, 5K+ tasks, 300+ scenes, and 50+ reward primitives. The suggested citation is an @article with journal = {xlang.ai}, which matches the current status: a public project/blog preview whose full experimental tables are still forthcoming.
RoboInF addresses a data bottleneck that keeps appearing in generalist robot learning. Real teleoperation gives robot-native actions but scales slowly; internet videos scale broadly but lack robot actions and embodiment alignment; narrow simulation benchmarks give clean labels but often miss linguistic, visual, and task diversity. RoboInF tries to occupy the middle ground by producing simulated manipulation experience that is scene-diverse, language-rich, programmatically checkable, and easy to filter.
The release goes beyond simple pick-and-place. Its examples cover object orientation, precise spatial arrangement, semantic grouping, articulated-object interaction, drawer opening and closing, tool use, edge placement, multi-object layout, and longer-horizon tasks. This breadth matters because instruction following fails when language variation and control precision are generated separately. RoboInF’s bet is that scene, task, reward, program, and rollout should be generated as one coupled artifact.
Pipeline
| Stage | Role in the data engine |
|---|---|
| Scene generation | Builds randomized tabletop worlds through random synthesis and image-conditioned organized-scene reconstruction. |
| Task generation | Produces scene-grounded natural-language instructions from object identities and spatial relations. |
| Reward generation | Synthesizes executable evaluate() checks from the task, scene objects, and predicate library. |
| Motion-code generation | Uses simulator feedback to write and refine manipulation programs through an Agent-Simulation Interface. |
| Trajectory generation | Replays successful programs under randomized variants and keeps reward-verified rollouts. |
The two scene modes play complementary roles. Random synthesis samples everyday objects, turns them into simulation-ready assets, rescales them to plausible physical sizes, and places them in physics-valid layouts. Image-conditioned generation reconstructs more natural household or kitchen-style arrangements from reference images. Across both modes, RoboInF randomizes object poses, camera views, robot initial states, textures, backgrounds, lighting, physics parameters, and controller dynamics such as stiffness and damping.
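The randomization step can be sketched as sampling one variant configuration per rollout. Everything below is an assumption for illustration: the parameter names, the ranges, and the choice of uniform sampling are mine, not values from the release.

```python
import random
from dataclasses import dataclass


@dataclass
class VariantConfig:
    """Hypothetical per-rollout randomization parameters."""
    object_xy_offset: tuple   # metres, applied to the nominal object pose
    camera_yaw_deg: float     # camera view perturbation
    light_intensity: float    # relative lighting scale
    friction: float           # physics parameter
    stiffness: float          # controller gain
    damping: float            # controller gain


def sample_variant(rng: random.Random) -> VariantConfig:
    """Draw one variant; all ranges are illustrative placeholders."""
    return VariantConfig(
        object_xy_offset=(rng.uniform(-0.05, 0.05), rng.uniform(-0.05, 0.05)),
        camera_yaw_deg=rng.uniform(-15.0, 15.0),
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.4, 1.0),
        stiffness=rng.uniform(200.0, 600.0),
        damping=rng.uniform(20.0, 60.0),
    )


# A seeded generator makes each variant reproducible for later debugging.
variants = [sample_variant(random.Random(seed)) for seed in range(3)]
```

Seeding each variant separately is a design choice of this sketch: it would allow a failed or suspicious rollout to be regenerated exactly from its seed.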
Reward synthesis is the strongest technical hook. For each task, RoboInF generates an executable evaluate() function using predicates such as On, LeftTo, RightTo, IsInside, IsStatic, Upright, IsOpen, and ConstraintAlways. These checks combine spatial relations, object states, contact and orientation conditions, articulation state, temporal constraints, and neural/image predicates. If the reward code is good, it turns synthetic data generation from “render many attempts” into “retain attempts that satisfy task-specific intent.”
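As a rough illustration of what such a check might look like, the sketch below composes three of the named predicates into an evaluate() function for a task like "place the cup upright on the plate". The state layout, tolerances, and predicate implementations are my own assumptions; the release names the predicates but does not publish their code.

```python
def on(state, top, bottom, xy_tol=0.05, z_tol=0.02):
    """True if `top` rests on `bottom`: aligned in xy and touching in z."""
    t, b = state[top], state[bottom]
    dx = abs(t["pos"][0] - b["pos"][0])
    dy = abs(t["pos"][1] - b["pos"][1])
    # Gap between the bottom face of `top` and the top face of `bottom`.
    gap = (t["pos"][2] - t["height"] / 2) - (b["pos"][2] + b["height"] / 2)
    return dx <= xy_tol and dy <= xy_tol and abs(gap) <= z_tol


def is_static(state, name, speed_tol=1e-3):
    """True if the object's linear speed is below a small threshold."""
    vx, vy, vz = state[name]["vel"]
    return (vx * vx + vy * vy + vz * vz) ** 0.5 <= speed_tol


def upright(state, name, min_up_z=0.95):
    """True if the object's local up axis points close to world up."""
    return state[name]["up_z"] >= min_up_z


def evaluate(state):
    """Return overall success plus predicate-level results."""
    results = {
        "On(cup, plate)": on(state, "cup", "plate"),
        "IsStatic(cup)": is_static(state, "cup"),
        "Upright(cup)": upright(state, "cup"),
    }
    return all(results.values()), results


# Hypothetical final state: positions in metres, up_z is the z component
# of the object's up axis in the world frame.
final_state = {
    "plate": {"pos": (0.50, 0.00, 0.01), "height": 0.02,
              "vel": (0.0, 0.0, 0.0), "up_z": 1.0},
    "cup": {"pos": (0.51, 0.00, 0.07), "height": 0.10,
            "vel": (0.0, 0.0, 0.0), "up_z": 1.0},
}
success, details = evaluate(final_state)
```

Returning the per-predicate dictionary alongside the boolean mirrors the predicate-level feedback described in these notes, and makes it easier to see which condition caused a rejection.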
Motion-code generation closes the loop. An agent writes robot programs with low-level calls such as move_to(...), open_gripper(), close_gripper(), move_linear(...), and move_planar(...); the simulator returns planning failures, collisions, joint-limit issues, predicate-level reward results, object and robot states, multi-view observations, local object frames, and visualized target poses. The program is revised until the reward succeeds or the refinement budget is exhausted. For a drawer task, this can mean opening the drawer, grasping a can, moving it into the cavity, releasing it, re-grasping the handle, closing the drawer, and checking the final state.
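The revise-until-success loop can be sketched as follows. This is a toy under stated assumptions: refine_program, Feedback, the stub simulator, and the stub agent are all hypothetical stand-ins, and the IsGrasped label is invented for the example. The real Agent-Simulation Interface is not public.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Hypothetical simulator feedback returned after one program run."""
    success: bool
    errors: list = field(default_factory=list)
    predicate_results: dict = field(default_factory=dict)


def refine_program(propose, simulate, max_rounds=5):
    """Propose, execute, and revise until success or the budget runs out."""
    feedback = None
    history = []
    for _ in range(max_rounds):
        program = propose(feedback)      # agent writes or revises the program
        feedback = simulate(program)     # simulator executes and reports back
        history.append((program, feedback))
        if feedback.success:
            return program, history
    return None, history                 # refinement budget exhausted


def toy_simulate(program):
    """Stub simulator: the grasp fails unless the gripper was opened first."""
    errors = []
    if "open_gripper()" not in program:
        errors.append("gripper closed before grasp")
    grasped = not errors
    # "IsGrasped" is a made-up predicate label for this toy example.
    return Feedback(grasped, errors, {"IsGrasped(can)": grasped})


def toy_propose(feedback):
    """Stub agent: patches the program based on the previous error message."""
    steps = ["move_to('can')", "close_gripper()"]
    if feedback is not None and "gripper closed before grasp" in feedback.errors:
        steps.insert(0, "open_gripper()")
    return steps


program, history = refine_program(toy_propose, toy_simulate)
```

In this toy run the first attempt fails, the error message drives a one-line patch, and the second attempt succeeds, so the history holds two entries.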
Relation to Qwen-VLA
RoboInF also clarifies part of Qwen-VLA’s synthetic-data story. The Qwen-VLA paper describes an internal early ROBOINF pipeline for vision-language-action simulation data: 20 tabletop scenes, 10 initial configurations per scene, 450 manipulation tasks, and 359,848 successful full trajectories including subtask segments. The public RoboInF preview expands that direction to a larger stated scale: 1M+ successful trajectories, 5K+ tasks, 300+ scenes, and 50+ reward primitives.
I read the relationship as a split between model recipe and data factory. Qwen-VLA shows one downstream use of RoboInF-style data; RoboInF is the generator being expanded into a broader, inspectable, reusable source of VLA supervision.
Evidence, Scope, and Limits
The current release does not publish benchmark tables, ablations, model training recipes, or released code/data. The qualitative claims are that RoboInF-trained models handle distractors, lighting changes, and shifted camera poses more reliably than internal baselines; follow compositional instructions more consistently; and show early signs of zero-shot sim-to-real transfer. Those are promising signals, but they remain directional until the benchmark and training details are public.
The current scope is also deliberately bounded: single-arm manipulation, rigid objects, reward-filtered SFT data, and relatively coarse-grained manipulation tasks. The roadmap points to dual-arm systems, broader embodiments, soft objects, liquids, deformables, RL fine-tuning, multi-task co-training, and more intricate fine-grained programs. These are exactly the cases that will stress-test reward generation and motion-code robustness.
The main risk is reward misspecification. If evaluate() omits a key constraint or encodes a shortcut, the system can produce many trajectories that pass the reward check while drifting from the intended instruction. Simulation still carries the usual reality gap as well: contact-rich manipulation, deformable objects, liquids, tool use, and fine force control remain hard to model faithfully, even with domain randomization.
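A small toy example shows how an omitted constraint leaks bad rollouts into the dataset. Assume the instruction is "put the can in the drawer and close the drawer". The state keys and both check functions below are hypothetical, written only to illustrate the failure mode.

```python
def evaluate_loose(state):
    """Misspecified check: forgets that the drawer must end up closed."""
    return state["can_inside_drawer"]


def evaluate_strict(state):
    """Check that covers both parts of the instruction."""
    return state["can_inside_drawer"] and not state["drawer_open"]


# Hypothetical final states of three rollouts.
rollouts = [
    {"id": "a", "can_inside_drawer": True, "drawer_open": False},
    {"id": "b", "can_inside_drawer": True, "drawer_open": True},
    {"id": "c", "can_inside_drawer": False, "drawer_open": False},
]

kept_loose = [r["id"] for r in rollouts if evaluate_loose(r)]
kept_strict = [r["id"] for r in rollouts if evaluate_strict(r)]

# Rollouts that pass the loose check but violate the instruction.
false_positives = sorted(set(kept_loose) - set(kept_strict))
```

Rollout "b" passes the loose check even though the drawer is left open, so it would be stored as a successful demonstration of an instruction it did not complete.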
Takeaway
RoboInF is worth watching because it treats synthetic robot data as a full verification pipeline. The reusable lesson is simple: generate tasks together with executable success checks, debug motion programs through simulator feedback, and keep only reward-verified rollouts for VLA supervision. If the reward layer proves reliable, the same infrastructure could support both filtered imitation data and RL-style optimization from generated success checks.
For my taxonomy, I would label RoboInF as:
Synthetic Robot Data Engine / Simulation-to-VLA Data Generation / Reward-Verified Instruction Following
