InnerZoom paper: one-pass GUI grounding rivals two-pass methods
TL;DR
- InnerZoom extracts target-region evidence from intermediate decoder layers and refines it within a single forward pass, avoiding crop-and-rerun.
- On a Qwen3-VL 4B backbone it reports 73.1 on OSWorld-G-Refine, 64.7 on OSWorld-G, 95.2 on ScreenSpot-V2 and 40.2 on UI-Vision.
- The authors claim 23.8%-35.7% latency and 26.0%-32.0% TFLOPs reductions versus external Zoom-In pipelines on the same backbone.
A new paper on Hugging Face, titled "One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding," goes after the part of the computer-use stack that actually slows agents down: the screen-grounding step where a model has to turn a natural-language instruction into a pixel coordinate. The standard trick today is crop-and-rerun: locate a rough region, zoom in, and run the vision-language model a second time. InnerZoom's claim is that a single forward pass can do the same job.
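For readers who have not built one, the crop-and-rerun loop described above can be sketched roughly as follows. This is an illustration, not the paper's baseline: the `ground` callable, the `Box` layout and the 512-pixel crop size are all assumptions made here for the example.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in screen pixels
Point = Tuple[float, float]

# Hypothetical grounding call (an assumption, not a real API): given a screen
# region and an instruction, return a click point in that region's local coordinates.
GroundFn = Callable[[Box, str], Point]


def crop_around(point: Point, screen: Box, crop_w: int, crop_h: int) -> Box:
    """Return a crop box centred on `point`, clamped so it stays on the screen."""
    left, top, right, bottom = screen
    # Never ask for a crop larger than the screen itself.
    crop_w = min(crop_w, right - left)
    crop_h = min(crop_h, bottom - top)
    x, y = point
    crop_left = int(min(max(x - crop_w / 2, left), right - crop_w))
    crop_top = int(min(max(y - crop_h / 2, top), bottom - crop_h))
    return (crop_left, crop_top, crop_left + crop_w, crop_top + crop_h)


def two_pass_ground(ground: GroundFn, screen: Box, instruction: str,
                    crop_w: int = 512, crop_h: int = 512) -> Point:
    """Crop-and-rerun: coarse guess on the full screen, then refine on a crop."""
    # Pass 1: coarse prediction over the whole screenshot.
    coarse_local = ground(screen, instruction)
    coarse = (screen[0] + coarse_local[0], screen[1] + coarse_local[1])

    # Zoom in around the coarse guess.
    crop = crop_around(coarse, screen, crop_w, crop_h)

    # Pass 2: a second full model call, this time on the crop only.
    fine_local = ground(crop, instruction)

    # Map the refined point back from crop coordinates to screen coordinates.
    return (crop[0] + fine_local[0], crop[1] + fine_local[1])
```

The point of the sketch is the second `ground` call: every click pays for two model invocations plus the coordinate bookkeeping, which is the overhead a single-pass method tries to remove.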
The diagnosis the authors lead with is more interesting than the method itself. They report a "Region-to-Point Gap": intermediate decoder layers around 19-23 already attend to the right target region with about 69% ROI Recall, but by the time the model is ready to emit coordinates at the final layer that recall has collapsed to 14%. So instead of running the whole model twice, they extract that mid-layer evidence, carry it through a small adapter inserted at layers 20, 23, 26 and 29, and inject it into the key/value projections for coordinate decoding. On a Qwen3-VL 4B backbone with about 180.4M added trainable parameters, they report 73.1 on OSWorld-G-Refine, 64.7 on OSWorld-G, 95.2 on ScreenSpot-V2, 66.2 on ScreenSpot-Pro and 40.2 on UI-Vision, in each case a modest gain over their reported previous best.
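To make the diagnosis concrete, here is a toy way to measure how much of a layer's attention lands on the target. The article does not give the paper's exact definition of ROI Recall, so the metric below (share of attention mass over image patches that falls inside the target box) is an assumption for illustration and may differ from what the authors compute. The function name and the grid layout are likewise hypothetical.

```python
from typing import List, Tuple

PatchBox = Tuple[int, int, int, int]  # (row0, col0, row1, col1), half-open, in patch units


def attention_mass_in_box(attn: List[List[float]], box: PatchBox) -> float:
    """Fraction of total attention weight that falls inside `box`.

    `attn` is a 2D grid of non-negative attention weights over image patches,
    indexed as attn[row][col]. This is an assumed stand-in for ROI Recall,
    not the paper's stated definition.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    r0, c0, r1, c1 = box
    inside = sum(sum(row[c0:c1]) for row in attn[r0:r1])
    return inside / total


# Toy 3x3 patch grids; the target occupies the bottom-right 2x2 patches.
target = (1, 1, 3, 3)
mid_layer = [[1.0, 1.0, 1.0],
             [1.0, 4.0, 4.0],
             [1.0, 4.0, 4.0]]   # attention concentrated on the target
final_layer = [[4.0, 4.0, 4.0],
               [4.0, 1.0, 1.0],
               [4.0, 1.0, 1.0]]  # attention has drifted away from it

mid_score = attention_mass_in_box(mid_layer, target)
final_score = attention_mass_in_box(final_layer, target)
```

On these made-up grids the mid-layer score is far higher than the final-layer score, which is the shape of the gap the authors describe; the numbers themselves carry no meaning beyond the illustration.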
The efficiency story is the part practitioners will read first. Against the paper's own two-pass Zoom-In baseline on the same backbone, InnerZoom reports 23.8% to 35.7% lower latency and 26.0% to 32.0% fewer TFLOPs. If you are building a browser or desktop agent, that is the budget you spend on every single click, so the compounding matters.
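A back-of-the-envelope calculation shows how that per-click saving adds up over an episode. Only the 23.8% and 35.7% reductions come from the paper as reported; the 2.0-second baseline and the 50-click episode are made-up figures for illustration, and the sketch counts grounding latency only, ignoring planning, screenshots and network time.

```python
def episode_grounding_latency(per_click_s: float, clicks: int, reduction: float) -> float:
    """Total grounding time for an episode after applying a fractional latency reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be a fraction in [0, 1)")
    return per_click_s * clicks * (1.0 - reduction)


# Hypothetical workload: 50 clicks at 2.0 s each with the two-pass baseline.
BASELINE_PER_CLICK_S = 2.0   # assumption, not from the paper
CLICKS = 50                  # assumption, not from the paper

baseline_total = episode_grounding_latency(BASELINE_PER_CLICK_S, CLICKS, 0.0)
low_end_total = episode_grounding_latency(BASELINE_PER_CLICK_S, CLICKS, 0.238)   # reported lower bound
high_end_total = episode_grounding_latency(BASELINE_PER_CLICK_S, CLICKS, 0.357)  # reported upper bound

print(f"baseline: {baseline_total:.1f} s")
print(f"with 23.8% reduction: {low_end_total:.1f} s")
print(f"with 35.7% reduction: {high_end_total:.1f} s")
```

Under those assumed numbers, a 100-second grounding budget drops to roughly 76 seconds at the low end of the reported range and roughly 64 seconds at the high end.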
The honest caveat is that all the comparisons are inside the paper's own family of baselines on a single backbone, not against shipped computer-use products. Take the numbers as reported, not settled. The failure analysis is also worth reading before getting excited: the authors attribute roughly 40% of errors to ambiguous text-to-image grounding in dense UIs, which is exactly the regime real browser agents live in. Code and models are stated to be coming.
The useful direction here is the diagnosis. If mid-layer attention already knows where to click and the gap is just that the final layer forgets, a lot of the crop-and-rerun overhead in current computer-use stacks may be solving the wrong problem.
Originally reported by huggingface.co
Original headline: Paper 'One Forward Beats Two': InnerZoom Achieves Accurate and Efficient GUI Grounding for Computer-Use Agents in a Single Forward Pass