[Paper Notes] TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
TL;DR
TabletopGen is a training-free pipeline for generating instance-level, simulation-ready 3D tabletop scenes from either text or a single image.
Its core idea is to avoid generating a whole tabletop scene monolithically. Instead, it:
- turns text into a reference image when needed
- segments and completes each object instance separately
- reconstructs each instance into a 3D asset
- solves layout recovery with a two-stage alignment module:
  - DRO for object rotation
  - TSA for translation and metric scale
This decomposition is what makes the method strong: it preserves object count and style better than retrieval-heavy or whole-scene reconstruction baselines, and it sharply reduces object collisions in dense tabletop layouts.
Paper Info
- Title: TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
- Authors: Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, Hu Su
- Affiliations: University of Chinese Academy of Sciences, D-Robotics, Institute of Automation CAS, Horizon Robotics
- arXiv: 2512.01204
- Project page: d-robotics-ai-lab.github.io/TabletopGen.project
- Paper type: 3D scene generation / embodied AI / simulation pipeline
1. Problem Setting and Motivation
The paper focuses on a very practical gap in embodied AI: most existing 3D scene generation systems are not well suited for dense tabletop manipulation scenes.
The authors argue that a useful tabletop scene for robotic simulation should satisfy three conditions:
- each object should be an independent, geometrically complete 3D instance
- the arrangement should be functionally meaningful, not random
- the final scene should be physically plausible, especially collision-free
This is harder than generic room-scale scene generation because tabletops contain:
- many small objects packed into a limited area
- frequent occlusion in a single image
- fine-grained spatial relations that matter for manipulation
The paper’s critique of prior work is straightforward:
- retrieval-based methods are limited by fixed asset libraries
- text-to-3D scene planning methods are better at coarse layouts than dense small-object tabletop reasoning
- single-image 3D scene reconstruction methods struggle with occlusion, incomplete instances, pose errors, and interpenetration
2. Core Idea
TabletopGen uses a unified pipeline for both text input and single-image input.
- If the input is text, an LLM first expands it into a detailed prompt, then a text-to-image model generates a realistic reference image.
- If the input is already an image, that image is used directly.
From there, the framework runs four stages:
- Instance Extraction
- Canonical 3D Model Generation
- Pose and Scale Alignment
- 3D Scene Assembly
The most important design choice is the instance-first decomposition: rather than reconstructing the whole scene jointly, the method reconstructs each object separately and only then solves the layout.
3. Method Breakdown
3.1 Instance extraction from the reference image
The pipeline first uses an MLLM to infer open-vocabulary object categories in the scene. It then applies GroundedSAM-v2 to get per-instance masks.
Because tabletop scenes are heavily occluded and object boundaries are often incomplete, the paper does not rely on ordinary inpainting. Instead, it uses a multimodal generative completion model that redraws each segmented object into a clearer, high-resolution instance image.
That step matters because downstream 3D generation quality depends heavily on whether each instance is visually complete.
3.2 Per-instance canonical 3D reconstruction
Each completed object image is passed to an image-to-3D diffusion model to produce a 3D mesh.
These per-instance meshes initially live in arbitrary local coordinates, so the method performs canonical coordinate alignment. An MLLM reasons about upright orientation using both visual evidence and semantic priors, then rotates each model so it is properly aligned with the tabletop world frame.
This is an important practical benefit over retrieval pipelines:
- the system is not constrained to a fixed asset database
- appearance and geometry can better match the reference
- object-level editing becomes easier later
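The upright-alignment step can be sketched as follows, assuming the MLLM has already predicted each mesh's up axis. The Rodrigues-style rotation construction and the function names here are illustrative, not the paper's implementation:

```python
import numpy as np

def rotation_between(src, dst):
    # Rotation matrix taking unit vector src onto unit vector dst (Rodrigues form)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    src /= np.linalg.norm(src)
    dst /= np.linalg.norm(dst)
    v = np.cross(src, dst)
    c = float(np.dot(src, dst))
    if np.isclose(c, 1.0):
        return np.eye(3)
    if np.isclose(c, -1.0):
        # 180-degree flip around an axis orthogonal to src
        axis = np.eye(3)[np.argmin(np.abs(src))]
        v = np.cross(src, axis)
        v /= np.linalg.norm(v)
        return 2.0 * np.outer(v, v) - np.eye(3)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))

def canonicalize(vertices, up_pred, up_world=(0, 0, 1)):
    # Rotate mesh vertices so the predicted up axis matches the world up axis
    return vertices @ rotation_between(up_pred, up_world).T
```

In the actual pipeline this decision is made by the MLLM from visual and semantic cues; the matrix math above is only the mechanical part once the up axis is known.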
3.3 DRO: Differentiable Rotation Optimizer
The first half of the layout solver is DRO, which estimates each object’s rotation.
The paper renders each candidate object with a differentiable renderer and minimizes a tri-modal matching loss:
L_rot = λ_s L_sil + λ_e L_edge + λ_a L_app
where:
- L_sil is a soft-IoU silhouette loss
- L_edge is a contour-matching loss based on distance transforms
- L_app is a DINOv2 perceptual feature loss
This is a strong part of the paper. Instead of asking a VLM to guess object orientation in one shot, the method uses a render-and-optimize loop that directly compares projected 3D geometry against the target instance.
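The tri-modal objective can be sketched in plain NumPy, assuming the rendered silhouette, edge map, target distance transform, and perceptual features are produced elsewhere in the render-and-optimize loop. The λ weights below are placeholders, not the paper's values:

```python
import numpy as np

def soft_iou_loss(pred_sil, gt_sil):
    # pred_sil, gt_sil: soft silhouettes in [0, 1], same shape
    inter = (pred_sil * gt_sil).sum()
    union = (pred_sil + gt_sil - pred_sil * gt_sil).sum()
    return 1.0 - inter / (union + 1e-8)

def edge_dt_loss(pred_edges, gt_dist):
    # pred_edges: binary edge map of the render;
    # gt_dist: distance transform of the target contour.
    # Penalizes rendered edge pixels that fall far from the target contour.
    n = pred_edges.sum()
    return (pred_edges * gt_dist).sum() / (n + 1e-8)

def appearance_loss(feat_pred, feat_gt):
    # e.g. DINOv2 patch features; here a plain L2 distance stands in
    return float(np.mean((feat_pred - feat_gt) ** 2))

def rotation_loss(pred_sil, gt_sil, pred_edges, gt_dist, f_pred, f_gt,
                  lam_s=1.0, lam_e=0.5, lam_a=0.1):
    # L_rot = lam_s * L_sil + lam_e * L_edge + lam_a * L_app
    return (lam_s * soft_iou_loss(pred_sil, gt_sil)
            + lam_e * edge_dt_loss(pred_edges, gt_dist)
            + lam_a * appearance_loss(f_pred, f_gt))
```

In DRO this loss is backpropagated through a differentiable renderer to update the rotation parameters; the NumPy version above only shows how the three terms combine.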
3.4 TSA: Top-view Spatial Alignment
After recovering rotation, TabletopGen estimates translation and scale with TSA.
This addresses the classic single-view ambiguity problem: from one image, absolute placement and metric size are hard to infer reliably.
The pipeline:
- synthesizes a top-view image from the front-view reference
- detects top-view 2D boxes for each instance
- queries an MLLM for commonsense physical size priors
- selects a reliable anchor object using the proposed RMA-Score
The score is:
RMA(i) = A_px(i) / (1 + (ε_ratio / τ)^2)
where A_px(i) is the instance's visible pixel area, ε_ratio its aspect-ratio discrepancy against the size prior, and τ a tolerance scale, so larger visible objects with better aspect-ratio consistency are preferred as anchors.
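Anchor selection with this score can be sketched in a few lines of Python; the field names and the default τ are illustrative assumptions, not values from the paper:

```python
def rma_score(area_px, ratio_err, tau=0.1):
    # RMA(i) = A_px(i) / (1 + (eps_ratio / tau)^2)
    return area_px / (1.0 + (ratio_err / tau) ** 2)

def pick_anchor(instances, tau=0.1):
    # instances: dicts holding each object's visible pixel area and the
    # discrepancy between its detected aspect ratio and the size prior
    return max(range(len(instances)),
               key=lambda i: rma_score(instances[i]["area_px"],
                                       instances[i]["ratio_err"], tau))
```

Note how the score trades off the two cues: a large but badly distorted detection can lose to a smaller, cleaner one, which is exactly the behavior an anchor heuristic wants.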
Conceptually, TSA is doing a useful compromise:
- use generative and language models for semantic spatial reasoning
- constrain the final estimate with explicit geometric heuristics
3.5 Scene assembly
Once rotation, translation, and scale are estimated, the pipeline imports all instances into Isaac Sim, applies transforms, and assigns collision properties through convex decomposition. This converts visually reconstructed objects into a simulation-ready interactive tabletop scene.
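The paper handles collisions via convex decomposition inside Isaac Sim. As a crude, illustrative stand-in, a pairwise axis-aligned bounding-box check is enough to show the kind of interpenetration test that the collision-rate metrics rest on (all names here are hypothetical):

```python
def aabb_overlap(a, b, eps=0.0):
    # a, b: (min_xyz, max_xyz) axis-aligned bounding boxes in world frame
    return all(a[0][k] < b[1][k] - eps and b[0][k] < a[1][k] - eps
               for k in range(3))

def collision_rate(boxes):
    # Fraction of object pairs whose AABBs interpenetrate
    n = len(boxes)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    hits = sum(aabb_overlap(boxes[i], boxes[j]) for i, j in pairs)
    return hits / max(len(pairs), 1)
```

Real convex-decomposition checks are much tighter than AABBs (a mug handle's box overlaps a neighboring plate long before the meshes touch), which is why the simulator-side collision setup matters.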
4. Experimental Results
The evaluation covers 78 test samples with different table shapes and tabletop categories, including office, dining, workbench, and more stylized scenes.
The baselines are:
- ACDC (retrieval-based)
- Gen3DSR
- MIDI
4.1 Quantitative gains
TabletopGen reports the best numbers across perceptual, semantic, and physical metrics.
Some headline results from Table 1:
- LPIPS: 0.4483 vs 0.4559 for MIDI
- DINOv2: 0.8383 vs 0.7070 for MIDI
- CLIP: 0.9077 vs 0.8867 for MIDI
- object collision rate: 0.42% vs 17.39% for MIDI
- scene collision rate: 7.69% vs 98.72% for MIDI
The collision numbers are the most convincing part of the paper. Many prior methods can produce something visually plausible from the reference view, but collision-free assembly is what actually determines whether a scene is usable for embodied simulation.
4.2 GPT and human evaluation
On GPT-4o-based scoring, the method gets the best average score (6.19) across visual fidelity, image alignment, and physical plausibility.
The user study with 128 participants is also strong:
- average human score: 5.56
- second-best baseline: 3.57
- overall preference for TabletopGen: 83.13%
That large margin suggests the gains are not limited to one metric choice.
4.3 Ablations
The ablation study is clean and useful because it isolates the two key geometric modules:
- removing DRO increases the object collision rate from 0.42% to 1.27%
- removing TSA raises it much more sharply, to 5.50%
- removing both yields severe placement failures and a 62.82% scene-level collision rate
My reading is that TSA contributes most of the physical plausibility gain, while DRO stabilizes orientation quality and improves visual consistency.
5. What I Find Most Interesting
The paper’s strongest contribution is not just another better benchmark score. It is the claim that tabletop generation should be treated as a compositional geometry-and-reasoning problem, not only as a big generative modeling problem.
Three aspects stand out:
- instance-first reconstruction is a better fit for manipulation scenes than whole-scene synthesis
- explicit geometric optimization is still necessary, even in an era of strong multimodal models
- physical plausibility is treated as a first-class target instead of an afterthought
This makes the work particularly relevant for simulation data generation, robot benchmarking, and sim-to-real pipelines.
6. Strengths
- Clear focus on tabletop scenes, which are genuinely important for embodied manipulation.
- Good systems design: each stage solves a specific bottleneck rather than forcing one model to do everything.
- Strong physical-plausibility results, especially collision reduction.
- Useful ablation story showing why the alignment modules matter.
- Supports both text-to-scene and image-to-scene inputs in one framework.
- Enables modular scene editing because assets are reconstructed per instance.
7. Limitations and Open Questions
- The pipeline is training-free, but it depends on several powerful external components: text-to-image, multimodal completion, MLLM reasoning, and image-to-3D generation. In practice, this is still a fairly heavyweight system stack.
- The top-view synthesis and commonsense-size reasoning are helpful, but they introduce additional model assumptions that may break on unusual objects or highly nonstandard camera perspectives.
- The evaluation is strong for tabletop scenes, but it is less clear how far the method would extend to cluttered shelves, cabinets, or multi-surface manipulation settings.
- The paper emphasizes collision-free layouts, but long-term usefulness for robotics also depends on object articulation, contact fidelity, and material realism, which are only partially addressed here.
- Because the method uses many proprietary or rapidly evolving foundation models, reproducibility may depend heavily on implementation choices that are not fully visible in the paper.
8. Takeaways
My main takeaway is simple: if you want simulation-ready tabletop scenes, recovering layout explicitly matters at least as much as generating pretty images.
TabletopGen works because it separates the problem into:
- object completion
- per-instance 3D generation
- explicit pose recovery
- explicit scale and translation recovery
- simulator-based scene assembly
For embodied AI, that decomposition feels more practical than end-to-end scene generation alone. I would view this paper as a strong systems recipe for turning modern multimodal generative models into usable 3D tabletop environments rather than just attractive renderings.
