Robust-TO Beats Gemini 2.5 Pro on Corrupted Video Benchmarks
TL;DR
- Frontier video models suffer 15-30 percentage point accuracy drops when input frames are corrupted by motion blur, glare, or occlusion.
- Robust-TO achieves 56.4% average accuracy on clean video benchmarks, outperforming Gemini-2.5-Pro's 46.2% and the best open-source baseline by 10.6 points.
- Under five realistic corruption types, Robust-TO holds 54.3% accuracy, the smallest clean-to-corrupted drop of any compared method.
Most video AI systems treat every frame they receive as equally trustworthy. A researcher at a desk, a camera mounted on a robot navigating a dusty corridor -- the model weights both the same way. Researchers at Nanyang Technological University Singapore call this the "Blind Trust Problem," and according to their paper on Hugging Face, it causes frontier video reasoning models to suffer 15 to 30 percentage point accuracy drops when frames are corrupted by motion blur, glare, or occlusion.
Their proposed fix, Robust-TO, integrates per-frame trustworthiness scores into every stage of an agentic video reasoning pipeline. Rather than treating all visual evidence equally, the system assigns calibrated reliability scores to each frame and uses a three-tier synthesis process to weight evidence as high, medium, or low before reasoning. A Confidence-Cost GRPO reward jointly optimizes correctness, evidence reliability, and efficiency during training.
The reported results are notable for an open-source system. On clean inputs across two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy -- 10.6 percentage points above the strongest open-source baseline and above Gemini-2.5-Pro's 46.2%. Under five realistic corruption types, it holds 54.3% average accuracy, 5.8 percentage points above the strongest open-source baseline under corruptions, and records the smallest clean-to-corrupted accuracy drop of any compared method.
The honest caveat is that benchmark wins on specific embodied video tasks don't automatically translate to other domains -- long-form video, medical imaging, or broadcast footage each present different corruption profiles. The paper also doesn't detail the inference overhead the trustworthiness scoring adds, so teams working with latency-sensitive deployments should treat these numbers as a research signal, not a deployment guarantee.
If the approach holds up in broader testing, the clearest beneficiaries are robotics navigation teams, autonomous vehicle developers, and surveillance systems that routinely encounter degraded visual input -- all settings where a model that degrades gracefully rather than fails silently is substantially more valuable.
Originally reported by huggingface.co
Read the original article →Original headline: Robust-TO: Confidence-Aware Tool Orchestration Framework Solves Video AI's 'Blind Trust Problem' — 56.4% Accuracy on Clean Inputs, Outperforms Gemini 2.5 Pro and Holds 54.3% Under Real-World Frame Corruptions