VideoSearch-R1 loops video retrieval with latent query refinement
TL;DR
- VideoSearch-R1 unifies large-scale video retrieval and temporal grounding into one multi-turn loop instead of a fixed retrieve-then-reason pipeline.
- Its Soft Query Refinement edits query tokens in a continuous latent space, reportedly using significantly fewer generated tokens than text-level rewrites.
- The system is trained with Group Relative Policy Optimization and claims state-of-the-art results across three Video Corpus Moment Retrieval datasets.
Most video search systems today are two pipelines glued together. A retriever pulls candidate clips, and a separate reasoning model does the fine-grained work of finding the exact moment inside them. When the retriever misses, the reasoning stage has nothing to work with and there is no way back. A new arXiv paper from Seohyun Lee and colleagues, accepted to ECCV 2026, proposes to close that loop.
Their system, VideoSearch-R1, treats retrieval and temporal grounding as a single multi-turn interaction with a video search engine. The interesting move is how it revises a failing query. Rather than having the model rewrite the search in plain text, which is the default in most agentic frameworks, the authors introduce what they call Soft Query Refinement, which nudges the query tokens in a continuous latent space. Their claim is that this needs significantly fewer generated tokens than an explicit text-level rewrite, which matters when the loop runs several times. The refinement policy is trained with Group Relative Policy Optimization, using rewards from both the retrieval step and the downstream task, so the model learns to refine in ways that help the eventual answer rather than just the next lookup.
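To make the training signal concrete, here is a minimal sketch of the group-relative advantage at the core of Group Relative Policy Optimization: several rollouts are sampled for the same query, and each rollout's reward is standardised against the mean and standard deviation of its own group. The `combined_reward` function and its equal weighting are assumptions for illustration only; the summary above says that retrieval and downstream-task rewards are both used, not how they are mixed.

```python
from statistics import mean, pstdev


def combined_reward(retrieval_reward, task_reward, weight=0.5):
    """Hypothetical mix of the two reward sources (the weighting is an assumption)."""
    return weight * retrieval_reward + (1.0 - weight) * task_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise each reward against its own group of rollouts.

    Rollouts that beat the group average get a positive advantage,
    the rest get a negative one. eps guards against a zero spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up rollouts for one query: (retrieval reward, task reward).
rollouts = [(1.0, 0.8), (1.0, 0.2), (0.0, 0.0), (0.0, 0.4)]
rewards = [combined_reward(r, t) for r, t in rollouts]
advantages = group_relative_advantages(rewards)

for (r, t), adv in zip(rollouts, advantages):
    print(f"retrieval={r:.1f} task={t:.1f} advantage={adv:+.2f}")
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason this style of training is popular for multi-turn agents.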
Why this matters if you build video search or video RAG for a company: the retrieve-then-reason pattern people copied from text RAG breaks harder on video, because a wrong clip is not a subtly wrong passage but completely irrelevant footage. A trained refine-and-retry loop is a much more useful primitive than a fixed top-k retriever, and the fact that the action is a latent nudge instead of a text rewrite hints at a template other agentic search systems could borrow.
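The difference between a fixed top-k lookup and a refine-and-retry loop can be sketched with a toy example. Everything here is hypothetical: the two-dimensional embeddings, the cosine-similarity stopping threshold, and above all `propose_delta`, which stands in for the trained refinement policy and is replaced by a hand-written constant direction so the loop can run. It is not the paper's implementation, only an illustration of nudging a query vector instead of rewriting text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, corpus, k):
    """Fixed retriever: score every clip once and return the k best."""
    scored = [(name, cosine(query, vec)) for name, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def refine(query, delta, step=0.5):
    """Latent nudge: move the query embedding along delta, no text generated."""
    return [q + step * d for q, d in zip(query, delta)]


def search_with_refinement(query, corpus, propose_delta,
                           k=1, threshold=0.9, max_turns=4):
    """Retry retrieval with a nudged query until the top hit is confident enough."""
    turns = 1
    hits = top_k(query, corpus, k)
    while hits[0][1] < threshold and turns < max_turns:
        query = refine(query, propose_delta(query, hits))
        hits = top_k(query, corpus, k)
        turns += 1
    return hits, turns


# Toy corpus of clip embeddings (made up for illustration).
corpus = {"cooking": [1.0, 0.0], "soccer": [0.0, 1.0]}
query = [0.8, 0.6]  # ambiguous query that initially leans toward "cooking"


def toy_policy(query, hits):
    """Placeholder for the learned policy: a constant push toward the second axis."""
    return [-1.0, 1.0]


one_shot = top_k(query, corpus, 1)
looped, turns = search_with_refinement(query, corpus, toy_policy)
print("fixed top-1:", one_shot[0][0])
print("with refinement:", looped[0][0], "after", turns, "turns")
```

In this toy setup the one-shot retriever commits to its first guess, while the loop gets a second lookup after the nudge. The `max_turns` cap is also where per-query latency would be bounded in a real deployment.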
The honest caveat is that, although the paper reports state-of-the-art performance across three Video Corpus Moment Retrieval datasets, the abstract does not name those datasets, publish the score deltas, or give the per-query latency of the iterative loop. Latent-space refinements are also harder for downstream teams to inspect than a rewritten text query, which is a real operational cost worth pricing in. What is worth watching is whether this loop-based framing shows up in the next wave of enterprise video products, and whether the same latent-action trick generalises to document search, where the pain of a bad first retrieval is equally real.
Originally reported by paper
Original headline: VideoSearch-R1 Unifies Video Corpus Retrieval and Temporal Reasoning via Latent-Space Query Refinement