3D Tokenization Treats Scenes as Objects, Not Primitives
TL;DR
- A new arXiv paper proposes a feed-forward 3D framework that outputs object instances natively rather than recovering them from unstructured Gaussians.
- Each token group pairs an instance token with anchor tokens that decode into 3D Gaussians, trained via differentiable rendering with no 3D annotations.
- The authors report surpassing per-scene optimization baselines on class-agnostic instance segmentation while staying competitive on novel view synthesis.
Most feed-forward 3D reconstruction pipelines today produce an undifferentiated cloud of Gaussians. If you want to know which of those primitives belongs to which chair, table, or lamp, you run a second stage that tries to recover the object structure the model discarded. A new preprint on arXiv proposes the alternative: build the object structure in from the start.
The framework, from Mijin Yoo, In Cho, Subin Jeon, Jiwoo Lee, Eunbyung Park, and Seon Joo Kim, generates what the authors call instance-structured 3D token groups directly from unposed multi-view images. Each group pairs an instance token carrying object identity with anchor tokens that encode local geometry and appearance, and those tokens decode into sets of 3D Gaussians. Training is end to end via differentiable rendering with joint reconstruction and segmentation supervision, and the paper says the method requires no 3D annotations.
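To make the layout concrete, here is a minimal sketch of what an instance-structured scene could look like as a data structure. It is not the authors' code: the class names, field names, and the `flatten_with_labels` helper are hypothetical, and the learned decoder that turns anchor tokens into Gaussians is left out entirely (the `gaussians` field simply holds whatever that decoder would produce). The point it illustrates is that when Gaussians live inside groups, every primitive already carries an instance label.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    """One 3D Gaussian primitive (simplified; rotation omitted)."""
    mean: Vec3
    scale: Vec3
    color: Vec3
    opacity: float


@dataclass
class TokenGroup:
    """Hypothetical container for one object instance."""
    instance_token: List[float]       # embedding carrying object identity
    anchor_tokens: List[List[float]]  # embeddings for local geometry and appearance
    # Output of the (omitted) learned decoder applied to the anchor tokens.
    gaussians: List[Gaussian] = field(default_factory=list)


def flatten_with_labels(groups: List[TokenGroup]) -> List[Tuple[Gaussian, int]]:
    """Return (gaussian, instance_id) pairs; the id is just the group index."""
    return [(g, idx) for idx, group in enumerate(groups) for g in group.gaussians]
```

Under this layout there is no clustering step: the instance id of a Gaussian is the index of the group that produced it.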
For anyone building on top of 3D reconstruction, this matters because scene editing gets conceptually cheaper. If a scene is a bag of Gaussians, moving a chair means clustering, masking, and hoping the boundary is clean. If a scene is a set of instance token groups, moving a chair means editing one token group. The paper claims the framework surpasses per-scene optimization baselines on class-agnostic instance segmentation and stays competitive on novel view synthesis. It also claims to enable open-vocabulary 3D instance retrieval whose complexity scales with the number of objects rather than the number of primitives.
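A rough sketch of why object-level structure makes both operations cheap is below. The assumptions are mine, not the paper's: retrieval is modeled as cosine similarity between a query embedding and one embedding per instance (the abstract does not say how queries are embedded or scored), and an edit is modeled as a rigid translation of one group's Gaussian means. The function names `retrieve` and `translate_group` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], instance_embeddings: List[Sequence[float]]) -> int:
    """Index of the best-matching instance: one comparison per object,
    regardless of how many Gaussians each object decodes into."""
    scores = [cosine(query, emb) for emb in instance_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def translate_group(means: List[Vec3], offset: Vec3) -> List[Vec3]:
    """Move one object by shifting only the Gaussian means in its group."""
    dx, dy, dz = offset
    return [(x + dx, y + dy, z + dz) for (x, y, z) in means]
```

With three objects in a scene, `retrieve` does three similarity computations whether those objects decode into hundreds or millions of Gaussians, and `translate_group` touches only the primitives that belong to the selected object.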
The honest caveats: the numbers are the authors' own, on the benchmarks they picked, and 'competitive' on novel view synthesis is not the same as winning. The abstract does not say how the method handles heavily occluded objects or out-of-distribution categories, or how the representation behaves as instance counts climb. But if the claims hold up under external replication, they point at a cleaner interface for downstream 3D work in robotics, simulation, and AR scene graphs, where the useful primitive was never the Gaussian in the first place.
Original headline: Scenes as Objects: Feed-Forward 3D Model Makes Object Instances Native Outputs, Not Post-Processing Artifacts