arxiv.org web signal June 29th 2026

Surflo fuses unposed views into one 3D surface latent

TL;DR

Surflo encodes a variable number of unposed RGB views into a fixed global state of K=128 tokens, then decodes 3D surface points via flow-matching ODEs.
From a single encoder pass the model can sample any number of oriented surface points, up to roughly one million, without committing to a fixed grid.
The authors report state-of-the-art results across eight benchmarks, including a Tanks and Temples Chamfer Distance of 0.0056 and F1 of 86.40 at 8 views.

A new paper from a Berkeley, Kyoto, École Polytechnique and Kyutai team, posted to arXiv, proposes a 3D reconstructor that tries to have it both ways: a single fixed-size latent for the scene, and per-point decoding at whatever resolution you want. The system, Surflo, takes a variable number of unposed RGB views and compresses them into K=128 tokens of "global state," then decodes oriented 3D surface points independently of one another via flow-matching ODEs. According to the project page, one encoder pass can produce anywhere from a few thousand to about one million points.
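How Surflo actually collapses the views is not spelled out in this summary. As a rough illustration of the general idea only, the sketch below uses Perceiver-style cross-attention, in which K learned queries attend over however many view tokens arrive. That mechanism is an assumption, not the authors' confirmed design; `pool_views_to_state`, the token width D=64 and the 196 tokens per view are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_views_to_state(view_tokens, queries):
    """Cross-attend K query vectors over all view tokens (hypothetical helper).

    view_tokens: (N, D) tokens from every view, concatenated; N grows with view count.
    queries:     (K, D) stand-in for learned query vectors.
    Returns a (K, D) state whose shape does not depend on N.
    """
    d = queries.shape[1]
    attn = softmax(queries @ view_tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ view_tokens                                      # (K, D)

rng = np.random.default_rng(0)
K, D, TOKENS_PER_VIEW = 128, 64, 196       # D and TOKENS_PER_VIEW are assumed values
queries = rng.normal(size=(K, D))          # random here; learned in a real model

state_shapes = {}
for num_views in (2, 8, 32):
    tokens = rng.normal(size=(num_views * TOKENS_PER_VIEW, D))
    state_shapes[num_views] = pool_views_to_state(tokens, queries).shape
```

Whether 2 or 32 views go in, the state stays at 128 tokens, and because attention sums over the input tokens the result does not depend on the order in which views are presented.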

The motivation is a real pain point in feed-forward 3D. Per-view methods emit pointmaps that pile up and disagree as you add views; global-latent methods lock you to a fixed, low-resolution output. The authors' framing is that geometry is invariant to viewpoint, so the redundancy across views should collapse into a single state rather than scale linearly with input count. Independent per-point decoding then handles the resolution question separately from the encoding question.
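To make the split between encoding and decoding concrete, here is a minimal sketch of per-point flow-matching sampling: each point starts as Gaussian noise and is pushed along an ODE by a velocity field conditioned on the scene state. The velocity field below is a toy stand-in, not Surflo's network. It assumes the "scene" is a sphere summarized by a hypothetical `(center, radius)` state and uses plain Euler steps; the real model conditions on the K=128 tokens and also predicts orientations, which this sketch omits.

```python
import numpy as np

def toy_velocity(x, t, state):
    """Stand-in for a learned velocity network v(x, t | state).

    The toy scene is a sphere, state = (center, radius). Each sample is sent
    toward its closest point on the sphere along a straight-line path,
    x_t = (1 - t) * x_0 + t * x_1, whose velocity is (x_1 - x_t) / (1 - t).
    """
    center, radius = state
    offset = x - center
    target = center + radius * offset / np.linalg.norm(offset, axis=1, keepdims=True)
    return (target - x) / (1.0 - t)

def decode_points(noise, state, num_steps=16):
    """Integrate dx/dt = v(x, t | state) from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * toy_velocity(x, k * dt, state)
    return x

rng = np.random.default_rng(0)
state = (np.array([0.0, 0.0, 2.0]), 0.5)   # hypothetical scene: sphere centered at z=2

# The same state serves any sample count; only the amount of noise drawn changes.
coarse = decode_points(rng.normal(size=(2_000, 3)), state)
dense = decode_points(rng.normal(size=(200_000, 3)), state)
```

Because no point looks at any other point during integration, the sample count is a free parameter at inference time. It is also why nothing in this loop forces neighboring points to agree with each other.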

Independent decoding has an obvious failure mode, which is that nearby points stop agreeing with each other. The fix here is an inference-time guidance term: at each ODE step the in-flight points are turned into 3D Gaussians and rendered with Gaussian Splatting, and the photometric gradient nudges neighbors back into local consistency. The reported numbers are state of the art across eight benchmarks, with a Chamfer Distance of 0.0056 and F1 of 86.40 on Tanks and Temples at 8 views, and the authors say the model runs an order of magnitude faster than optimization-based methods.
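For readers unfamiliar with the two headline metrics, the sketch below computes them under one common convention: Chamfer Distance as the average of the two mean nearest-neighbor distances, and F1 as the harmonic mean of precision and recall at a distance threshold, reported as a percentage. Conventions vary between papers (squared versus unsquared distances, sums versus means), and neither the paper's exact definition nor its threshold is stated in this summary, so `chamfer_and_f1` and the threshold value are assumptions.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest neighbor in dst (M, 3).

    Brute force, O(N * M) memory; fine for small clouds, not for a million points.
    """
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_and_f1(pred, gt, threshold):
    """Return (Chamfer Distance, F1 in percent) under the convention described above."""
    d_pred_to_gt = nearest_distances(pred, gt)   # how accurate the prediction is
    d_gt_to_pred = nearest_distances(gt, pred)   # how complete the prediction is
    chamfer = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    precision = (d_pred_to_gt < threshold).mean()
    recall = (d_gt_to_pred < threshold).mean()
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 100.0 * 2 * precision * recall / (precision + recall)

# Four ground-truth points on a unit square; three predictions are close, one is far off.
gt = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
pred = gt + np.array([[0, 0, 0.01], [0, 0, 0.01], [0, 0, 0.01], [0, 0, 1.0]])

chamfer, f1 = chamfer_and_f1(pred, gt, threshold=0.05)
```

Lower Chamfer Distance and higher F1 are better. The single outlier in this toy example hurts both, which shows why point clouds with locally inconsistent samples score poorly on these metrics.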

The honest caveat is that the evaluation lives close to home: training on roughly 10.5K DL3DV scenes and testing on DL3DV and Tanks and Temples is a fairly curated indoor-and-outdoor distribution, and the material posted so far does not show how the method holds up on dynamic scenes, sparse captures, or much larger environments. The rendering-in-the-loop guidance is also doing real work in the headline numbers, and what that costs in latency at the million-point setting is not laid out in what is publicly posted.

If the pattern holds, the bigger story is architectural. A fixed-size scene latent that downstream models can consume, paired with a decoder that emits as many surface samples as the task needs, is a more useful interface for robotics and capture pipelines than a pointmap whose shape depends on how many photos you happened to take.

Shared on Bluesky by 2 AI experts

David Picard @davidpicard.eurosky.social amplified

@si-cv-graphics.bsky.social

Surflo: Consistent 3D Surface Flow Model with Global State. Antoine Guédon, Shu Nakamura, Nicolas Dufour ... Angjoo Kanazawa. arxiv.org/abs/2606.13644. Trending on scholar-inbox.com
Christian Laforte @chrlaf.bsky.social amplified

@ericzzj.bsky.social

Surflo: Consistent 3D Surface Flow Model with Global State @antoine-guedon.bsky.social, Shu Nakamura, @nicolasdufour.bsky.social, Jiahui Lei, Ko Nishino, @akanazawa.bsky.social arxiv.org/abs/2606.13644

Originally reported by arxiv.org

Original headline: Surflo: Consistent 3D Surface Flow Model with Global State