arxiv.org web signal

Surflo fuses unposed views into one 3D surface latent

TL;DR

  • Surflo encodes a variable number of unposed RGB views into a fixed global state of K=128 tokens, then decodes 3D surface points via flow-matching ODEs.
  • From a single encoder pass the model can sample any number of oriented surface points, up to roughly one million, without committing to a fixed grid.
  • The authors report state-of-the-art results across eight benchmarks, including a Tanks and Temples Chamfer Distance of 0.0056 and F1 of 86.40 on 8 views.

A new paper from a Berkeley, Kyoto, École Polytechnique and Kyutai team, posted to arXiv, proposes a 3D reconstructor that tries to have it both ways: a single fixed-size latent for the scene, and per-point decoding at whatever resolution you want. The system, Surflo, takes a variable number of unposed RGB views and compresses them into K=128 tokens of "global state," then decodes oriented 3D surface points one at a time via flow-matching ODEs. According to the project page, one encoder pass can produce anywhere from a few thousand to about one million points.
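To make "decoding points via flow-matching ODEs" concrete, here is a minimal sketch of the sampling loop, not the authors' implementation. It assumes the straight-line flow-matching path and plain Euler integration. The names `toy_velocity` and `sample_points` are hypothetical, and the learned network and the K=128-token state are replaced by a closed-form field that pulls noise onto a sphere. What it does illustrate is that each point is integrated independently, so the point count is a sampling-time choice.

```python
import numpy as np

def toy_velocity(x, t, state):
    """Hypothetical stand-in for a learned velocity network.

    `state` plays the role of the global scene latent; here it is just a
    (center, radius) pair describing a sphere. For the straight-line path
    x_t = (1 - t) * noise + t * target, the velocity that lands on `target`
    at t = 1 is (target - x_t) / (1 - t).
    """
    center, radius = state
    offset = x - center
    # Project each point radially onto the sphere to get its target.
    target = center + radius * offset / np.linalg.norm(offset, axis=1, keepdims=True)
    return (target - x) / (1.0 - t)

def sample_points(n, state, steps=32, seed=0):
    """Integrate dx/dt = v(x, t, state) from t = 0 to t = 1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 3))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * toy_velocity(x, t, state)  # each row evolves independently
    return x

# One "scene state", two different output resolutions.
state = (np.array([0.0, 0.0, 1.0]), 0.5)
coarse = sample_points(1_000, state)
dense = sample_points(5_000, state, seed=1)
```

In this toy setup both calls reuse the same `state`; only the number of noise samples changes, which mirrors the one-encoder-pass, any-resolution claim above.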

The motivation is a real pain point in feed-forward 3D. Per-view methods emit pointmaps that pile up and disagree as you add views; global-latent methods lock you to a fixed, low-resolution output. The authors' framing is that geometry is invariant to viewpoint, so the redundancy across views should collapse into a single state rather than scale linearly with input count. Independent per-point decoding then handles the resolution question separately from the encoding question.
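One common way to get a fixed-size state from a variable number of inputs is cross-attention from a fixed set of learned queries, as in Perceiver-style encoders. The material summarized here does not say that Surflo's encoder works this way, so treat the following as a generic illustration under that assumption. The function `pool_views`, the token dimension, and the random "queries" are all hypothetical.

```python
import numpy as np

def pool_views(view_tokens, queries):
    """Single-head cross-attention pooling (illustrative only).

    view_tokens: (N, D) tokens concatenated across all input views; N grows
                 with the number of views.
    queries:     (K, D) fixed query vectors (learned in a real model).
    Returns a (K, D) state whose shape does not depend on N.
    """
    d = queries.shape[1]
    scores = queries @ view_tokens.T / np.sqrt(d)            # (K, N)
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax over tokens
    return weights @ view_tokens                             # (K, D)

rng = np.random.default_rng(0)
K, D, tokens_per_view = 128, 64, 196   # D and tokens_per_view are made-up sizes
queries = rng.standard_normal((K, D))

state_2_views = pool_views(rng.standard_normal((2 * tokens_per_view, D)), queries)
state_8_views = pool_views(rng.standard_normal((8 * tokens_per_view, D)), queries)
```

Going from 2 to 8 views quadruples the number of input tokens but leaves the pooled state at K rows, which is the property the authors' framing relies on.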

Independent decoding has an obvious failure mode, which is that nearby points stop agreeing with each other. The fix here is an inference-time guidance term: at each ODE step the in-flight points are turned into 3D Gaussians and rendered with Gaussian Splatting, and the photometric gradient nudges neighbors back into local consistency. The reported numbers are state of the art across eight benchmarks, with a Chamfer Distance of 0.0056 and F1 of 86.40 on Tanks and Temples at 8 views, and the authors say the model runs an order of magnitude faster than optimization-based methods.
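For readers unfamiliar with the two reported metrics, the sketch below computes a symmetric Chamfer Distance and a thresholded F1 between two point clouds. Conventions differ between papers (squared versus unsquared distances, sum versus mean, choice of threshold), and the exact definition behind the 0.0056 and 86.40 figures is not given in the material summarized here. The function names and the threshold `tau` are illustrative assumptions.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in `a`, Euclidean distance to its nearest neighbor in `b`."""
    diff = a[:, None, :] - b[None, :, :]            # (len(a), len(b), 3)
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_and_f1(pred, gt, tau):
    """Symmetric Chamfer Distance (mean of both directions) and F1 in percent.

    This is one common convention, assumed here for illustration.
    """
    d_pred = nearest_distances(pred, gt)   # how far predictions are from the truth
    d_gt = nearest_distances(gt, pred)     # how well the truth is covered
    chamfer = 0.5 * (d_pred.mean() + d_gt.mean())
    precision = (d_pred < tau).mean()
    recall = (d_gt < tau).mean()
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return chamfer, 100.0 * f1

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5]])
cd, f1 = chamfer_and_f1(pred, gt, tau=0.1)   # cd = 0.25, f1 = 50.0
```

The brute-force pairwise distance matrix is fine for small clouds; at the million-point scale mentioned above, a spatial index such as a k-d tree would be the usual substitute.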

The honest caveat is that the evaluation lives close to home: the model is trained on roughly 10.5K DL3DV scenes and tested on DL3DV and Tanks and Temples, a fairly curated indoor-and-outdoor distribution, and the publicly available material does not show how the method holds up on dynamic scenes, sparse captures, or much larger environments. The rendering-in-the-loop guidance is also doing real work in the headline numbers, and what that costs in latency at the million-point setting is not laid out in what is publicly posted.

If the pattern holds, the bigger story is architectural. A fixed-size scene latent that downstream models can consume, paired with a decoder that emits as many surface samples as the task needs, is a more useful interface for robotics and capture pipelines than a pointmap whose shape depends on how many photos you happened to take.

Shared on Bluesky by 2 AI experts

  • David Picard @davidpicard.eurosky.social amplified

    @si-cv-graphics.bsky.social

    Surflo: Consistent 3D Surface Flow Model with Global State Antoine Guédon, Shu Nakamura, Nicolas Dufour ... Angjoo Kanazawa arxiv.org/abs/2606.13644 Trending on scholar-inbox.com

  • Christian Laforte @chrlaf.bsky.social amplified

    @ericzzj.bsky.social

    Surflo: Consistent 3D Surface Flow Model with Global State @antoine-guedon.bsky.social, Shu Nakamura, @nicolasdufour.bsky.social, Jiahui Lei, Ko Nishino, @akanazawa.bsky.social arxiv.org/abs/2606.13644
