AnyGroundBench: 15 top VLMs fail specialized video grounding
TL;DR
- AnyGroundBench evaluates 15 state-of-the-art vision-language models on spatio-temporal video grounding across five specialized domains: animal, industry, sports, surgery, and public security.
- The authors report models fail in both zero-shot generalization and in-context learning when moved from general daily-life video into specialized domains.
- The benchmark ships dedicated training subsets, letting researchers measure domain adaptation techniques rather than only comparing base model choice.
A new arXiv paper called AnyGroundBench puts fifteen state-of-the-art vision-language models through a spatio-temporal video grounding task across five specialized domains, and reports that they fail in both zero-shot and in-context learning setups. The authors' framing is that today's evaluation is largely confined to "general, daily-life benchmarks," which flatters the models and hides how brittle their spatio-temporal reasoning becomes once the visual concepts get rare or the dynamics get complex.
Video grounding matters here because it is the tighter, less forgiving cousin of tasks like captioning. The model has to localize an object in both space and time (where and when), not just describe what happened. If you are pitching a VLM as a drop-in tool for surgical video review, plant-floor safety monitoring, or public-space incident retrieval, "adapts to your domain" is essentially the whole product. The paper's claim, drawn from the arXiv abstract, is that neither zero-shot use nor in-context learning under practical computational constraints closes that gap.
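To make "where and when" concrete, here is a minimal sketch of how a spatio-temporal grounding prediction is often scored. It assumes a prediction and a ground truth are each a "tube" (a mapping from frame index to a bounding box) and uses a video-IoU formulation that is common in the spatio-temporal grounding literature. The abstract does not say which metric AnyGroundBench uses, so the metric choice and the names `Tube`, `box_iou`, `temporal_iou`, and `v_iou` are illustrative assumptions, not the paper's protocol.

```python
from typing import Dict, Tuple

# Assumed representation (not from the paper): a box is (x1, y1, x2, y2),
# and a tube maps each frame index in the predicted/annotated span to a box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(pred: Tube, gt: Tube) -> float:
    """'When' only: overlap of the two frame sets, ignoring box positions."""
    pred_frames, gt_frames = set(pred), set(gt)
    union = pred_frames | gt_frames
    return len(pred_frames & gt_frames) / len(union) if union else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """'Where and when': sum of per-frame box IoU over shared frames,
    divided by the number of frames in the union of both spans."""
    pred_frames, gt_frames = set(pred), set(gt)
    union = pred_frames | gt_frames
    if not union:
        return 0.0
    shared = pred_frames & gt_frames
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(union)


if __name__ == "__main__":
    # Toy example: the prediction starts 5 frames late and covers
    # only the top half of the annotated box.
    gt = {t: (0.0, 0.0, 10.0, 10.0) for t in range(10, 20)}
    pred = {t: (0.0, 0.0, 10.0, 5.0) for t in range(15, 25)}
    print(f"temporal IoU: {temporal_iou(pred, gt):.3f}")
    print(f"video IoU:    {v_iou(pred, gt):.3f}")
```

Under a metric like this, errors compound: a prediction that is half right in time and half right in space scores well below one half, which is part of why grounding is less forgiving than captioning.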
Why practitioners should care: the benchmark ships dedicated training subsets, so it is set up not just to embarrass a model but to measure whether any given adaptation technique actually helps. That reframes the leaderboard question from "which VLM is best" to "which adaptation strategy transfers." Vendors selling generalist models into specialist buyers now have a public target they can be measured against, and specialist buyers have a reason to demand fine-tuning or retrieval-augmented approaches rather than accepting zero-shot demos.
The honest caveat is that the abstract does not disclose per-model scores or the exact ICL protocol, so we cannot tell yet which of the fifteen VLMs came closest, whether the failure gap widens more in surgery than in industry or sports, or how much of the shortfall is a data-scale problem versus an architectural one. What the paper does give you is the direction of travel: the interesting research question is no longer "how well does the model generalize out of the box" but "how well does it adapt when you feed it the domain it will actually run in." Teams building for medical, industrial or security video should read the release of training subsets as an invitation to stop benchmarking on daily-life clips.
Originally reported by paper
Original headline: AnyGroundBench: 15 State-of-the-Art VLMs Fail Specialized-Domain Video Grounding in Both Zero-Shot and In-Context Adaptation