
Google Research FLAT Converts Single Photos to 3D Meshes

By Alexis Dufresne | Published June 24, 2026 at 09:38 UTC | Updated June 24, 2026 at 09:40 UTC


TL;DR

FLAT produces game-engine-ready 3D triangle meshes from a single image in one forward pass, with no per-scene optimization.
FLAT scores 29.45 PSNR on mesh extraction versus 22.32 for 3DGS, a gap of more than 7 dB on RealEstate10K.
Surface normal accuracy reaches 0.853 cosine similarity, versus 0.587 for 2DGS and 0.125 for 3DGS.

Turning a single photograph into a navigable 3D scene without any per-scene optimization loop has been a longstanding goal. Researchers at Google Research and the University of Oxford's Visual Geometry Group published a paper describing FLAT (Feedforward Latent Triangle Splatting), a model that does this in one forward pass by decoding surface-aligned triangle primitives directly from the latents of a frozen video diffusion model.

The key design choice is using triangles rather than the volumetric Gaussian primitives that dominate current real-time rendering approaches. The tradeoff is explicit: in novel view synthesis, FLAT's reported PSNR of 22.89 dB sits about 0.5 dB below the 3DGS variant the team tested against (23.41 dB). But the picture reverses sharply when an actual mesh is needed. Converting FLAT's triangles to an opaque mesh yielded 29.45 dB PSNR on RealEstate10K; extracting a mesh from 3DGS with standard marching cubes yielded 22.32 dB, a gap of 7.13 dB. The team also measured cosine similarity to ground-truth surface normals, where 1.0 means perfect alignment: FLAT scored 0.853 versus 0.587 for 2DGS and 0.125 for 3DGS.
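For readers less familiar with the two metrics quoted above, here is a minimal Python sketch of how PSNR and mean normal cosine similarity are conventionally computed. It is not the paper's evaluation code: the function names, the flat-list image representation, and the assumption that pixel values lie in [0, 1] are all illustrative choices made here.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities.

    Assumes intensities lie in [0, max_value] (an assumption of this sketch).
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_normal_cosine(predicted, ground_truth):
    """Mean cosine similarity between paired 3D normal vectors."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for p, g in zip(predicted, ground_truth):
        dot = sum(a * b for a, b in zip(p, g))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_g = math.sqrt(sum(b * b for b in g))
        total += dot / (norm_p * norm_g)
    return total / len(predicted)


def mse_ratio_from_db_gap(gap_db):
    """Factor by which MSE differs for a given PSNR gap (same peak value)."""
    return 10.0 ** (gap_db / 10.0)


if __name__ == "__main__":
    gap = 29.45 - 22.32  # mesh-extraction PSNR figures quoted in the article
    print(f"PSNR gap: {gap:.2f} dB, MSE ratio: {mse_ratio_from_db_gap(gap):.2f}x")
```

Because PSNR is logarithmic, a gap of about 7 dB corresponds to roughly five times lower mean squared error, assuming both methods were scored against the same peak value.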

The underlying diffusion backbone is Uni3C, built on Wan-2.1, which the team left frozen. A lightweight scene decoder was trained on top, progressively from lower to higher resolution, over 200,000 iterations on 8 H100 GPUs. Training data combined real indoor footage from RealEstate10K and DL3DV with 25,000 synthetic images from S3OD. The decoder is described as modular, compatible with other Wan-2.1 variants including text-to-video pipelines, without retraining the diffusion model itself.

The paper's own limitations section concedes that triangles are suboptimal for thin structures, reflections, and semi-transparent regions, and that the output geometry is sparse rather than watertight. The method currently targets single scenes along short camera paths, not large explorable worlds. The paper also gives no code release date and no benchmark results on scene types beyond the indoor datasets used for training.
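To make "watertight" concrete: a common edge-based criterion says a triangle mesh is closed when every edge is shared by exactly two triangles. The Python sketch below applies that check to an indexed triangle list. It is an illustration only, not anything from the paper; the function names and the vertex-index representation are assumptions, and the check ignores face orientation and self-intersections.

```python
from collections import Counter


def edge_counts(triangles):
    """Count how many triangles use each undirected edge.

    `triangles` is an iterable of (i, j, k) vertex-index triples.
    """
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


def is_watertight(triangles):
    """True if every edge is shared by exactly two triangles."""
    counts = edge_counts(triangles)
    return bool(counts) and all(n == 2 for n in counts.values())


def boundary_edges(triangles):
    """Edges used by only one triangle, i.e. the holes in the surface."""
    return sorted(e for e, n in edge_counts(triangles).items() if n == 1)


if __name__ == "__main__":
    tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
    open_patch = [(0, 1, 2), (0, 2, 3)]  # two triangles forming a flat quad
    print(is_watertight(tetrahedron))  # closed surface
    print(is_watertight(open_patch), boundary_edges(open_patch))
```

A set of independently placed triangles will usually leave many such boundary edges, so a pipeline that needs closed volumes (for collision or physics, say) would likely require a repair step after export.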

For practitioners building robotics simulation pipelines or prototyping assets for game engines, the direct output of game-engine-compatible geometry from a single photo, without per-scene fitting, is the part worth watching.

Originally reported by huggingface.co


Original headline: FLAT: Google Research and Oxford Introduce Feedforward Latent Triangle Splatting — 7+ dB PSNR Over 3DGS, Real-Time Renderable 3D Scenes From Single Image