[Paper Notes] TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image


Published: March 11, 2026


TL;DR

TabletopGen is a training-free pipeline for generating instance-level, simulation-ready 3D tabletop scenes from either text or a single image.

Its core idea is to avoid generating a whole tabletop scene monolithically. Instead, it:

turns text into a reference image when needed
segments and completes each object instance separately
reconstructs each instance into a 3D asset
solves layout recovery with a two-stage alignment module:
- DRO for object rotation
- TSA for translation and metric scale

This decomposition is what makes the method strong: it preserves object count and style better than retrieval-heavy or whole-scene reconstruction baselines, and it sharply reduces object collisions in dense tabletop layouts.

Paper Info

Title: TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
Authors: Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, Hu Su
Affiliations: University of Chinese Academy of Sciences, D-Robotics, Institute of Automation CAS, Horizon Robotics
arXiv: 2512.01204
Project page: d-robotics-ai-lab.github.io/TabletopGen.project
Paper type: 3D scene generation / embodied AI / simulation pipeline

1. Problem Setting and Motivation

The paper focuses on a very practical gap in embodied AI: most existing 3D scene generation systems are not well suited for dense tabletop manipulation scenes.

The authors argue that a useful tabletop scene for robotic simulation should satisfy three conditions:

each object should be an independent, geometrically complete 3D instance
the arrangement should be functionally meaningful, not random
the final scene should be physically plausible, especially collision-free

This is harder than generic room-scale scene generation because tabletops contain:

many small objects packed into a limited area
frequent occlusion in a single image
fine-grained spatial relations that matter for manipulation

The paper’s critique of prior work is straightforward:

retrieval-based methods are limited by fixed asset libraries
text-to-3D scene planning methods are better at coarse layouts than dense small-object tabletop reasoning
single-image 3D scene reconstruction methods struggle with occlusion, incomplete instances, pose errors, and interpenetration

2. Core Idea

TabletopGen uses a unified pipeline for both text input and single-image input.

If the input is text, an LLM first expands it into a detailed prompt, then a text-to-image model generates a realistic reference image.
If the input is already an image, that image is used directly.

From there, the framework runs four stages:

Instance Extraction
Canonical 3D Model Generation
Pose and Scale Alignment
3D Scene Assembly

The most important design choice is the instance-first decomposition: rather than reconstructing the whole scene jointly, the method reconstructs each object separately and only then solves the layout.
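
To make the ordering concrete, here is a minimal Python sketch of the control flow only. Every function name, field, and return value below is a hypothetical placeholder of mine; the stubs stand in for the external models the paper relies on and do no real work. The point is just that meshes are produced per instance before any layout is solved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instance:
    """One tabletop object as it moves through the four stages (hypothetical)."""
    name: str
    mesh: Optional[str] = None  # stand-in for a real mesh handle
    yaw_deg: float = 0.0
    translation: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    scale: float = 1.0


def extract_instances(image: str) -> List[Instance]:
    # Stage 1 stub: a real system would detect, segment, and complete objects.
    return [Instance("mug"), Instance("laptop")]


def generate_canonical_mesh(inst: Instance) -> Instance:
    # Stage 2 stub: one mesh per instance, independent of the other objects.
    inst.mesh = f"{inst.name}.mesh"
    return inst


def align_pose_and_scale(instances: List[Instance], image: str) -> List[Instance]:
    # Stage 3 stub: rotation first, then translation and scale.
    for i, inst in enumerate(instances):
        inst.translation = (0.1 * i, 0.0, 0.0)
    return instances


def assemble_scene(instances: List[Instance]) -> dict:
    # Stage 4 stub: collect everything a simulator importer would need.
    return {"objects": [(i.name, i.mesh, i.translation, i.scale) for i in instances]}


def run_pipeline(text=None, image=None,
                 text_to_image=lambda prompt: f"image_for({prompt})"):
    if image is None:
        if text is None:
            raise ValueError("need either text or an image")
        image = text_to_image(text)  # text branch: synthesize a reference image
    instances = extract_instances(image)
    instances = [generate_canonical_mesh(i) for i in instances]
    instances = align_pose_and_scale(instances, image)
    return assemble_scene(instances)
```

Calling `run_pipeline(text="a tidy office desk")` and `run_pipeline(image="desk.png")` goes through the same four stages; only the first step differs.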

3. Method Breakdown

3.1 Instance extraction from the reference image

The pipeline first uses an MLLM to infer open-vocabulary object categories in the scene. It then applies GroundedSAM-v2 to get per-instance masks.

Because tabletop scenes are heavily occluded and object boundaries are often incomplete, the paper does not rely on ordinary inpainting. Instead, it uses a multimodal generative completion model that redraws each segmented object into a clearer, high-resolution instance image.

That step matters because downstream 3D generation quality depends heavily on whether each instance is visually complete.
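
As a small illustration of the bookkeeping between segmentation and completion, the sketch below crops one instance out of the reference image using its binary mask. This is my own assumption about a reasonable preprocessing step, not something the paper specifies; `mask_bbox`, `crop`, and the padding parameter are hypothetical names, and images are plain nested lists to keep the example dependency-free.

```python
def mask_bbox(mask, pad=0):
    """Tight (row0, col0, row1, col1) box around nonzero mask pixels.

    The box is half-open (row1 and col1 are exclusive) and is clamped
    to the image bounds after padding. Returns None for an empty mask.
    """
    if not mask or not mask[0]:
        return None
    h, w = len(mask), len(mask[0])
    rows = [r for r in range(h) if any(mask[r])]
    cols = [c for c in range(w) if any(mask[r][c] for r in range(h))]
    if not rows:
        return None
    return (max(rows[0] - pad, 0), max(cols[0] - pad, 0),
            min(rows[-1] + pad + 1, h), min(cols[-1] + pad + 1, w))


def crop(image, bbox):
    """Cut the boxed region out of a nested-list image."""
    r0, c0, r1, c1 = bbox
    return [row[c0:c1] for row in image[r0:r1]]
```

A padded crop gives the completion model some context around the object, which seems useful when the mask boundary is cut off by an occluder.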

3.2 Per-instance canonical 3D reconstruction

Each completed object image is passed to an image-to-3D diffusion model to produce a 3D mesh.

These per-instance meshes initially live in arbitrary local coordinates, so the method performs canonical coordinate alignment. An MLLM reasons about upright orientation using both visual evidence and semantic priors, then rotates each model so it is properly aligned with the tabletop world frame.
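
Once an up direction has been chosen for a mesh, turning it into a rotation is standard geometry. The sketch below uses the Rodrigues formula to build the matrix that maps a source axis onto a target axis. It assumes the MLLM's answer has already been reduced to a 3D "up" vector in the mesh's local frame, which is my simplification; the function names are hypothetical.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotation_between(src, dst):
    """3x3 rotation matrix R such that R @ src is parallel to dst."""
    a, b = _normalize(src), _normalize(dst)
    c = sum(x * y for x, y in zip(a, b))  # cosine of the angle between them
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if c > 1.0 - 1e-12:          # already aligned
        return eye
    if c < -1.0 + 1e-12:         # opposite: rotate 180 degrees about any perpendicular axis
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = _normalize(_cross(a, helper))
        return [[2.0 * u[i] * u[j] - eye[i][j] for j in range(3)] for i in range(3)]
    v = _cross(a, b)
    k = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    # Rodrigues: R = I + K + K^2 / (1 + c)
    return [[eye[i][j] + k[i][j] + k2[i][j] / (1.0 + c) for j in range(3)] for i in range(3)]


def apply_rotation(r, p):
    return [sum(r[i][j] * p[j] for j in range(3)) for i in range(3)]
```

For example, a mesh generated Y-up can be brought into a Z-up tabletop frame with `rotation_between((0, 1, 0), (0, 0, 1))`. The Z-up convention here is an assumption for illustration.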

This is an important practical benefit over retrieval pipelines:

the system is not constrained to a fixed asset database
appearance and geometry can better match the reference
object-level editing becomes easier later

3.3 DRO: Differentiable Rotation Optimizer

The first half of the layout solver is DRO, which estimates each object’s rotation.

The paper renders each candidate object with a differentiable renderer and minimizes a tri-modal matching loss:

L_rot = λ_s L_sil + λ_e L_edge + λ_a L_app

where:

L_sil is a soft-IoU silhouette loss
L_edge is a contour-matching loss based on distance transforms
L_app is a DINOv2 perceptual feature loss
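
A minimal sketch of how the terms combine, assuming masks are given as nested lists with values in [0, 1]. Only the silhouette term is implemented; the edge and appearance terms are passed in as plain numbers because they depend on a distance transform and a DINOv2 backbone. The default weights are placeholders of mine, not values from the paper.

```python
def soft_iou_loss(pred, target, eps=1e-6):
    """1 - soft IoU between two same-shaped masks with values in [0, 1]."""
    inter = 0.0
    union = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p * t
            union += p + t - p * t  # soft version of set union
    return 1.0 - inter / (union + eps)


def rotation_loss(l_sil, l_edge, l_app, w_sil=1.0, w_edge=1.0, w_app=1.0):
    """Weighted sum L_rot = w_sil * L_sil + w_edge * L_edge + w_app * L_app."""
    return w_sil * l_sil + w_edge * l_edge + w_app * l_app
```

Because the soft IoU is built from products and sums, it stays differentiable with respect to a soft rendered mask, which is what lets gradients flow back to the rotation parameters.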

This is a strong part of the paper. Instead of asking a VLM to guess object orientation in one shot, the method uses a render-and-optimize loop that directly compares projected 3D geometry against the target instance.
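
The optimize-instead-of-guess idea can be shown on a toy problem. The sketch below is not DRO: it replaces the differentiable renderer and tri-modal loss with a squared distance between 2D keypoints with known correspondences, and it only recovers a single yaw angle by plain gradient descent. All names, the learning rate, and the step count are my own choices for illustration.

```python
import math


def rotate(points, theta):
    """Rotate 2D points counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def loss_and_grad(points, target, theta):
    """Sum of squared keypoint distances and its derivative w.r.t. theta."""
    c, s = math.cos(theta), math.sin(theta)
    loss = 0.0
    grad = 0.0
    for (x, y), (tx, ty) in zip(points, target):
        rx, ry = c * x - s * y, s * x + c * y
        dx, dy = rx - tx, ry - ty
        loss += dx * dx + dy * dy
        # d(rx)/d(theta) = -ry and d(ry)/d(theta) = rx
        grad += 2.0 * (dx * (-ry) + dy * rx)
    return loss, grad


def fit_yaw(points, target, theta0=0.0, lr=0.05, steps=200):
    """Gradient descent on the yaw angle, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        _, grad = loss_and_grad(points, target, theta)
        theta -= lr * grad
    return theta


# Toy "object": corners of a 2 x 1 rectangle, observed at a 40 degree yaw.
template = [(1.0, 0.5), (-1.0, 0.5), (-1.0, -0.5), (1.0, -0.5)]
observed = rotate(template, math.radians(40.0))
estimated = fit_yaw(template, observed)
```

Even in this stripped-down form, the estimate is refined by comparing a projection against the observation at every step, rather than committing to a single one-shot prediction.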

3.4 TSA: Top-view Spatial Alignment

After recovering rotation, TabletopGen estimates translation and scale with TSA.

This addresses the classic single-view ambiguity problem: from one image, absolute placement and metric size are hard to infer reliably.

The pipeline:

synthesizes a top-view image from the front-view reference
detects top-view 2D boxes for each instance
queries an MLLM for commonsense physical size priors
selects a reliable anchor object using the proposed RMA-Score

The score is:

RMA(i) = A_px(i) / (1 + (ε_ratio / τ)^2)

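where, as I read the formula, A_px(i) is the pixel area of object i in the top view, ε_ratio is its aspect-ratio error, and τ is a tolerance that controls how quickly that error is penalized. Larger visible objects with better aspect-ratio consistency are therefore preferred as anchors.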
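
The score is simple enough to compute directly. In the sketch below, the candidate data, the default `tau`, and the interpretation of the ratio error as a mismatch between the detected box's aspect ratio and the one implied by the size prior are all assumptions of mine for illustration.

```python
def rma_score(area_px, ratio_error, tau=0.2):
    """RMA(i) = A_px(i) / (1 + (eps_ratio / tau)^2)."""
    return area_px / (1.0 + (ratio_error / tau) ** 2)


def select_anchor(candidates, tau=0.2):
    """Pick the candidate name with the highest RMA score.

    candidates maps an object name to (pixel_area, aspect_ratio_error).
    """
    return max(candidates, key=lambda name: rma_score(*candidates[name], tau=tau))


# Hypothetical detections: the plate is largest but its box shape is unreliable.
candidates = {
    "laptop": (12000.0, 0.05),
    "plate": (15000.0, 0.60),
    "mug": (3000.0, 0.02),
}
anchor = select_anchor(candidates)
```

With these made-up numbers the laptop wins: the plate has more pixels, but its ratio error is three times `tau`, so its score is divided by ten.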

Conceptually, TSA strikes a useful compromise:

use generative and language models for semantic spatial reasoning
constrain the final estimate with explicit geometric heuristics
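
One plausible way an anchor can pin down metric scale is sketched below. This is my own guess at the geometry, not the paper's stated procedure: it assumes the synthesized top view is close to orthographic, so a single metres-per-pixel factor derived from the anchor applies to every box. Function names, the box format `(x0, y0, x1, y1)`, and the table-centred frame are hypothetical.

```python
def metres_per_pixel(anchor_box, anchor_width_m):
    """Scale factor implied by the anchor's box width and its size prior."""
    x0, _, x1, _ = anchor_box
    return anchor_width_m / (x1 - x0)


def place_from_top_view(box, m_per_px, image_size, mesh_width):
    """Return ((tx, ty) in metres, uniform scale) for one top-view box.

    The origin is the image centre; image y points down, so it is flipped.
    mesh_width is the width of the unscaled mesh in its own units.
    """
    x0, y0, x1, y1 = box
    w_img, h_img = image_size
    cx = (x0 + x1) / 2.0 - w_img / 2.0
    cy = (y0 + y1) / 2.0 - h_img / 2.0
    tx = cx * m_per_px
    ty = -cy * m_per_px
    scale = (x1 - x0) * m_per_px / mesh_width
    return (tx, ty), scale


# Hypothetical numbers: a 0.4 m wide anchor spans 400 px in a 1000 x 800 top view.
factor = metres_per_pixel((100, 100, 500, 380), 0.4)
(tx, ty), scale = place_from_top_view((600, 200, 680, 280), factor, (1000, 800), 1.0)
```

This also shows why anchor choice matters: every other object's position and size inherits the anchor's error through the single scale factor.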

3.5 Scene assembly

Once rotation, translation, and scale are estimated, the pipeline imports all instances into Isaac Sim, applies transforms, and assigns collision properties through convex decomposition. This converts visually reconstructed objects into a simulation-ready interactive tabletop scene.
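
Applying the recovered parameters amounts to composing one transform per instance. The sketch below builds a 4x4 matrix from a yaw angle, a uniform scale, and a translation. Restricting rotation to yaw about the table normal is a simplification of mine for upright objects, and nothing here is Isaac Sim API; the function names are hypothetical.

```python
import math


def trs_matrix(yaw_rad, translation, scale):
    """4x4 transform: uniform scale, then yaw about Z, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [scale * c, -scale * s, 0.0, tx],
        [scale * s, scale * c, 0.0, ty],
        [0.0, 0.0, scale, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(m, p):
    """Apply a 4x4 transform to a 3D point."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[i][j] * v[j] for j in range(4)) for i in range(3))


# A unit-length mesh vertex, doubled in size, turned 90 degrees, and moved
# to (1, 2) on a table whose surface is assumed to sit at z = 0.75 m.
m = trs_matrix(math.pi / 2.0, (1.0, 2.0, 0.75), 2.0)
placed = transform_point(m, (1.0, 0.0, 0.0))
```

The order matters: scaling and rotating happen in the object's local frame, and only then is the object moved to its place on the table.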

4. Experimental Results

The evaluation covers 78 test samples with different table shapes and tabletop categories, including office, dining, workbench, and more stylized scenes.

The baselines are:

ACDC (retrieval-based)
Gen3DSR
MIDI

4.1 Quantitative gains

TabletopGen reports the best numbers across perceptual, semantic, and physical metrics.

Some headline results from Table 1:

LPIPS (lower is better): 0.4483 vs 0.4559 for MIDI
DINOv2 similarity (higher is better): 0.8383 vs 0.7070 for MIDI
CLIP similarity (higher is better): 0.9077 vs 0.8867 for MIDI
object collision rate: 0.42% vs 17.39% for MIDI
scene collision rate: 7.69% vs 98.72% for MIDI

The collision numbers are the most convincing part of the paper. Many prior methods can produce something visually plausible from the reference view, but collision-free assembly is what actually determines whether a scene is usable for embodied simulation.
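
These notes do not spell out how the two rates are defined. A natural reading, which I am assuming here, is that the object rate is the share of objects involved in a collision across all scenes, while the scene rate is the share of scenes with at least one collision. The sketch below computes both from per-object flags; the data layout and function names are hypothetical.

```python
def object_collision_rate(scenes):
    """Percentage of objects flagged as colliding, pooled over all scenes."""
    total = sum(len(flags) for flags in scenes)
    hits = sum(1 for flags in scenes for flag in flags if flag)
    return 100.0 * hits / total


def scene_collision_rate(scenes):
    """Percentage of scenes that contain at least one colliding object."""
    hits = sum(1 for flags in scenes if any(flags))
    return 100.0 * hits / len(scenes)


# Each inner list holds one "is colliding" flag per object in a scene.
toy_scenes = [[False, False], [True, False, False], [False]]
toy_object_rate = object_collision_rate(toy_scenes)
toy_scene_rate = scene_collision_rate(toy_scenes)
```

This explains why the scene-level number is so much harsher: a single bad object marks the whole scene as failed. With 78 test scenes, 7.69% matches 6 of 78 and 98.72% matches 77 of 78, which is consistent with this reading, though the counts themselves are my back-calculation.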

4.2 GPT and human evaluation

On GPT-4o-based scoring, the method gets the best average score (6.19) across visual fidelity, image alignment, and physical plausibility.

The user study with 128 participants is also strong:

average human score: 5.56
second-best baseline: 3.57
overall preference for TabletopGen: 83.13%

That large margin suggests the gains are not limited to one metric choice.

4.3 Ablations

The ablation study is clean and useful because it isolates the two key geometric modules:

removing DRO increases object collision rate from 0.42% to 1.27%
removing TSA raises it much more sharply to 5.50%
removing both yields severe placement failures and 62.82% scene-level collision rate

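My reading is that TSA contributes most of the physical plausibility gain: removing it multiplies the object collision rate roughly 13-fold (0.42% to 5.50%), versus about 3-fold when DRO is removed (0.42% to 1.27%). DRO, in turn, stabilizes orientation quality and improves visual consistency.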

5. What I Find Most Interesting

The paper’s strongest contribution is not merely a better benchmark score. It is the claim that tabletop generation should be treated as a compositional geometry-and-reasoning problem rather than purely as a large-scale generative modeling problem.

Three aspects stand out:

instance-first reconstruction is a better fit for manipulation scenes than whole-scene synthesis
explicit geometric optimization is still necessary, even in an era of strong multimodal models
physical plausibility is treated as a first-class target instead of an afterthought

This makes the work particularly relevant for simulation data generation, robot benchmarking, and sim-to-real pipelines.

6. Strengths

Clear focus on tabletop scenes, which are genuinely important for embodied manipulation.
Good systems design: each stage solves a specific bottleneck rather than forcing one model to do everything.
Strong physical-plausibility results, especially collision reduction.
Useful ablation story showing why the alignment modules matter.
Supports both text-to-scene and image-to-scene inputs in one framework.
Enables modular scene editing because assets are reconstructed per instance.

7. Limitations and Open Questions

The pipeline is training-free, but it depends on several powerful external components: text-to-image, multimodal completion, MLLM reasoning, and image-to-3D generation. In practice, this is still a fairly heavyweight system stack.
The top-view synthesis and commonsense-size reasoning are helpful, but they introduce additional model assumptions that may break on unusual objects or highly nonstandard camera perspectives.
The evaluation is strong for tabletop scenes, but it is less clear how far the method would extend to cluttered shelves, cabinets, or multi-surface manipulation settings.
The paper emphasizes collision-free layouts, but long-term usefulness for robotics also depends on object articulation, contact fidelity, and material realism, which are only partially addressed here.
Because the method uses many proprietary or rapidly evolving foundation models, reproducibility may depend heavily on implementation choices that are not fully visible in the paper.

8. Takeaways

My main takeaway is simple: if you want simulation-ready tabletop scenes, recovering layout explicitly matters at least as much as generating pretty images.

TabletopGen works because it separates the problem into:

object completion
per-instance 3D generation
explicit pose recovery
explicit scale and translation recovery
simulator-based scene assembly

For embodied AI, that decomposition feels more practical than end-to-end scene generation alone. I would view this paper as a strong systems recipe for turning modern multimodal generative models into usable 3D tabletop environments rather than just attractive renderings.


Lixin Xu
