huggingface.co web signal

Xiaomi Trains Mobile GUI Agent on Real Phones, Not Simulators

agents multimodal ai-research

TL;DR

  • Xiaomi-GUI-0 reports 72.0% success on the group's in-house RealMobile benchmark and 78.9% on AndroidWorld.
  • Training runs inside a real-device-dominant hybrid infrastructure, with physical phones as the primary execution environment and sandboxes as auxiliary support.
  • A three-stage pipeline combines supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning, fed by an error-driven data flywheel.

The interesting thing about Xiaomi's newly posted Xiaomi-GUI-0 technical report is not the headline benchmark but where the training actually happens. Most GUI agents are trained and evaluated on offline trajectories and simulated Android environments, and the paper argues that this is exactly what makes them brittle in the wild. Account states, permission dialogs, payment authentication, and risk control keep reshaping the state distribution a real phone sees, so a leaderboard win does not translate into a working assistant.

The claim is that a native multimodal agent gets substantially closer to that real distribution when it is trained inside what the authors call a real-device-dominant hybrid infrastructure, with physical phones doing the primary execution and sandboxes providing only auxiliary support. Training runs through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning, with an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. On the group's in-house RealMobile benchmark the model reportedly hits 72.0% success, and on AndroidWorld 78.9%.
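To make the flywheel idea concrete, here is a minimal Python sketch of how one failed step could be expanded into the three kinds of training data the report names. This is not Xiaomi's implementation. The `Failure` and `Sample` types, their field names, the `expand_failure` helper, and the example screens and actions are all hypothetical, and the reflection text is a simple template standing in for whatever the authors actually generate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    """One failed step from a trajectory (hypothetical schema)."""
    screen: str              # text description of the screen where things went wrong
    wrong_action: str        # what the agent did
    corrected_action: str    # what it should have done, per a reviewer or verifier
    screen_after_error: str  # the state the wrong action led to
    recovery_action: str     # an action that gets back on track from that state


@dataclass
class Sample:
    """One training example derived from a failure (hypothetical schema)."""
    kind: str    # "corrected_action", "reflection", or "recovery"
    prompt: str
    target: str


def expand_failure(f: Failure) -> List[Sample]:
    """Turn a single failure into three training samples (illustrative only)."""
    return [
        # Corrected action: same screen, the right action as the target.
        Sample("corrected_action", f.screen, f.corrected_action),
        # Reflective explanation: templated here, purely as a placeholder.
        Sample(
            "reflection",
            f"{f.screen} | tried: {f.wrong_action}",
            f"'{f.wrong_action}' was wrong here; '{f.corrected_action}' was needed.",
        ),
        # Recovery demonstration: start from the post-error state.
        Sample("recovery", f.screen_after_error, f.recovery_action),
    ]


failure = Failure(
    screen="permission dialog: allow location?",
    wrong_action="tap(search_box)",
    corrected_action="tap(allow_button)",
    screen_after_error="permission dialog still open",
    recovery_action="tap(allow_button)",
)
samples = expand_failure(failure)
for s in samples:
    print(s.kind, "->", s.target)
```

The point of the sketch is only the shape of the data: one failure yields several differently framed examples, which is how a stream of real-device errors could keep feeding the later training stages.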

Why this matters if you are not building GUI agents yourself: mobile automation has been stuck between demos that look magical and deployments that quietly fall over on the second permission popup. If Xiaomi's read is right, the bottleneck is not model size but the training distribution, and closing the gap required literally running a fleet of real phones in the loop. That is a moat a handset maker has and an API-only lab does not.

The honest caveat is that RealMobile is Xiaomi's own benchmark, so the 72.0% headline should be read as a self-report until someone independent reproduces it. What the paper does not give you is the underlying model, its size, the compute budget, or how the agent is guarded when it is about to tap the confirm button on a payment sheet. And there is no word on whether any of this ships to a Xiaomi handset or stays as a research artefact. The direction, training on the environment you actually deploy into rather than the one that is easy to simulate, is the part I would watch.