Steerable Scene Generation with
Post Training and Inference-Time Search

Nicholas Pfaff, Hongkai Dai, Sergey Zakharov, Shun Iwase, Russ Tedrake

Massachusetts Institute of Technology, Toyota Research Institute, Carnegie Mellon University

System Overview

We train a diffusion-based generative model on SE(3) scenes generated by procedural models, then adapt it to downstream objectives via reinforcement learning-based post training, conditional generation, or inference-time search. The resulting scenes are physically feasible and fully interactable. We demonstrate teleoperated interaction in a subset of generated scenes using a mobile KUKA iiwa robot.
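Concretely, each generated scene can be thought of as a variable-length set of object slots, each pairing an index into the fixed asset library with an SE(3) pose. The snippet below is a minimal sketch of such a representation; the class and field names are illustrative only and are not taken from our codebase.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ObjectPlacement:
    """One object slot: which asset to place and where (illustrative names)."""
    asset_id: int            # index into the fixed asset library
    rotation: np.ndarray     # 3x3 rotation matrix (orientation part of the SE(3) pose)
    translation: np.ndarray  # 3-vector (position part of the SE(3) pose)

@dataclass
class Scene:
    """A generated scene: a variable-length collection of object placements."""
    placements: List[ObjectPlacement]

# Toy example: two objects from a hypothetical asset library placed on a tabletop.
scene = Scene(placements=[
    ObjectPlacement(asset_id=3, rotation=np.eye(3), translation=np.array([0.40, 0.00, 0.76])),
    ObjectPlacement(asset_id=17, rotation=np.eye(3), translation=np.array([0.55, 0.10, 0.76])),
])
print(len(scene.placements), "objects placed")
```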

Abstract

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
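To give a rough sense of how inference-time search over a diffusion model can work, the sketch below runs a generic UCT-style Monte Carlo tree search in which nodes are partially denoised scene latents, expansion branches on stochastic continuations of the next reverse-diffusion step, and completed samples are scored by a downstream reward. It illustrates the general pattern only, not our exact algorithm; `denoise_step`, `reward`, and all constants are toy placeholders.

```python
import math
import numpy as np

NUM_STEPS = 10   # total reverse-diffusion steps (toy setting)
BRANCHING = 3    # stochastic continuations sampled per expansion
UCT_C = 1.4      # exploration constant


def denoise_step(x, t, rng):
    """Toy stand-in for one stochastic reverse-diffusion step on a scene latent x."""
    return 0.9 * x + 0.1 * rng.standard_normal(x.shape)


def reward(x):
    """Toy downstream objective, e.g. a clutter or feasibility score."""
    return -float(np.mean(np.square(x)))


def rollout(x, t, rng):
    """Denoise to completion from step t, then score the final sample."""
    for s in range(t, NUM_STEPS):
        x = denoise_step(x, s, rng)
    return reward(x)


class Node:
    def __init__(self, x, t, parent=None):
        self.x, self.t, self.parent = x, t, parent
        self.children, self.visits, self.value_sum = [], 0, 0.0

    def uct(self, child):
        if child.visits == 0:
            return float("inf")
        exploit = child.value_sum / child.visits
        explore = UCT_C * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore


def mcts(x0, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    root = Node(x0, t=0)
    for _ in range(iterations):
        # 1) Selection: descend via UCT until reaching a leaf or a terminal node.
        node = root
        while node.children and node.t < NUM_STEPS:
            node = max(node.children, key=lambda c: node.uct(c))
        # 2) Expansion: branch on stochastic continuations of the next step.
        if node.t < NUM_STEPS:
            for _ in range(BRANCHING):
                child_x = denoise_step(node.x, node.t, rng)
                node.children.append(Node(child_x, node.t + 1, parent=node))
            node = node.children[0]
        # 3) Simulation: complete the denoising trajectory and score it.
        value = rollout(node.x, node.t, rng)
        # 4) Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    return max(root.children, key=lambda c: c.value_sum / max(c.visits, 1))


if __name__ == "__main__":
    best_child = mcts(x0=np.zeros(8))
    print("best child value:", best_child.value_sum / best_child.visits)
```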

Overview Video


Proof-of-Concept Teleoperation Videos

The demonstrations below were collected in scenes generated by our pipeline, without any manual scene modifications.
The environments are immediately simulation-ready and support realistic interaction out of the box!

All demonstrations were performed using a mobile KUKA iiwa robot, controlled via a single SpaceMouse device to command end-effector pose deltas, which were executed using differential inverse kinematics. Some demonstrations show task-like behaviors, such as moving objects, while others emphasize the rich interactivity of our scenes, for example by knocking over stacked items.
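Differential inverse kinematics converts a commanded end-effector pose delta into joint velocities, commonly via a damped pseudoinverse of the manipulator Jacobian. The snippet below sketches that core computation with a placeholder Jacobian; the function and variable names are illustrative and are not our teleoperation stack, which additionally handles details such as joint limits.

```python
import numpy as np

def differential_ik(jacobian, ee_delta, dt, damping=1e-2):
    """Map a desired end-effector pose delta (6-vector) to joint velocities
    using a damped least-squares pseudoinverse of the manipulator Jacobian."""
    v_desired = ee_delta / dt                      # spatial velocity command
    J = jacobian                                   # 6 x n manipulator Jacobian
    JJt = J @ J.T + (damping ** 2) * np.eye(6)     # damped normal equations
    q_dot = J.T @ np.linalg.solve(JJt, v_desired)  # n-vector of joint velocities
    return q_dot

# Toy example: a random 6x7 Jacobian standing in for the KUKA iiwa's 7 joints.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 7))
spacemouse_delta = np.array([0.002, 0.0, 0.001, 0.0, 0.0, 0.01])  # [dx, dy, dz, drx, dry, drz]
q_dot = differential_ik(J, spacemouse_delta, dt=0.01)
print(q_dot)
```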


Diffusion Trajectory Videos

These videos show the generation process of our scene diffusion model.
The generation starts from pure noise in the scene representation and ends with the final scene before post-processing. Note that the number of objects, the object types, and the object poses all change throughout the generation process.

We subsampled the initial generation steps, as they are mostly noise due to our linear noise schedule.
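For intuition, the toy calculation below assumes a DDPM-style linear beta schedule (one common reading of a linear noise schedule, not necessarily the exact schedule we use) and shows that the high-noise timesteps at the start of the reverse process retain almost no signal, which is why those frames look like pure noise.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)      # linear beta schedule (assumed, illustrative)
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal retention per timestep
snr = alphas_bar / (1.0 - alphas_bar)   # signal-to-noise ratio per timestep

# Reverse generation runs from t = T-1 down to t = 0; compare early reverse
# steps (large t, mostly noise) with late ones (small t, mostly signal).
for t in [999, 900, 750, 500, 250, 100, 0]:
    print(f"t={t:4d}  alpha_bar={alphas_bar[t]:.4f}  SNR={snr[t]:.4f}")
```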

We find the utensil crock in "Breakfast Table - 1" to be particularly interesting.


Interactive Generated Scenes

The following scenes were generated by our models. Use mouse controls to interact with the scenes.
Using the carousel, you can explore a wide variety of generated scenes, including those sampled from models after reinforcement learning post training.


BibTeX

@misc{pfaff2025_steerable_scene_generation,
    author        = {Pfaff, Nicholas and Dai, Hongkai and Zakharov, Sergey and Iwase, Shun and Tedrake, Russ},
    title         = {Steerable Scene Generation with Post Training and Inference-Time Search},
    year          = {2025},
    eprint        = {2505.04831},
    archivePrefix = {arXiv},
    primaryClass  = {cs.RO},
    url           = {https://arxiv.org/abs/2505.04831}, 
}