Tencent's PhoneBuddy 4B Beats GPT-5.4 on Phone Agent Tasks

TL;DR

  • PhoneBuddy-4B trained with mixed real-app and mock-app RL reaches 45.33% task success on a 150-task real-phone evaluation, up from 36.67% after SFT alone.
  • The 4B model scores 83.2% on AndroidWorld, the highest result across all tested systems including closed-source Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.7.
  • Cross-app tasks remain a significant gap, with success rates actually declining slightly (22%→18%) across the three training stages.

A 4-billion-parameter open model trained by Tencent Hunyuan researchers has beaten GPT-5.4 on an overall phone-agent benchmark average. The system is called PhoneBuddy, and it shows how much training-environment design matters for this problem class, independent of raw model scale.

The training recipe has three stages, all starting from the same Qwen3.5-4B backbone. The first is supervised fine-tuning on 950,758 action steps collected both from real Android apps running on physical devices and from PhoneWorld, a mock-app environment that reconstructs runnable Android apps from real GUI traces. Reinforcement learning follows in two variants: real-app RL only, or mixed RL that splits rollouts evenly between real apps and PhoneWorld's resettable mock environments.
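The even split between real and mock rollouts can be pictured with a minimal sketch. This is an illustration only, not the paper's implementation: the function name `assign_rollout_envs`, the labels "real" and "mock", and the batch size are all hypothetical, and the source does not say whether the split is enforced per batch or only on average.

```python
import random
from collections import Counter


def assign_rollout_envs(num_rollouts, seed=0):
    """Label each rollout 'real' or 'mock' so the batch splits evenly.

    Hypothetical helper: with an odd batch size, the extra rollout
    goes to the mock environment.
    """
    half = num_rollouts // 2
    envs = ["real"] * half + ["mock"] * (num_rollouts - half)
    # Shuffle so the two environment types are interleaved in the batch.
    random.Random(seed).shuffle(envs)
    return envs


batch = assign_rollout_envs(8)
counts = Counter(batch)
print(counts["real"], counts["mock"])  # 4 4
```

Fixing the split per batch, rather than sampling each rollout's environment at random, keeps the real-to-mock ratio exact even for small batches. That is a design assumption of this sketch, not a detail reported in the source.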

The headline numbers improve at each stage. On a 150-task human evaluation covering single apps, WeChat mini-apps, and cross-app workflows on real phones, task success goes from 36.67% (SFT only) to 40.67% (real-app RL alone) to 45.33% (mixed RL). On AndroidWorld the progression is 60.3%, 77.2%, and 83.2%, with the final PhoneBuddy-4B-Real+Mock model scoring best across all tested systems in that column. On the overall four-setting average, PhoneBuddy-4B-Real+Mock reaches 54.8 versus 48.2 for GPT-5.4 and 51.4 for Seed 2.0 Pro, while Gemini 3.1 Pro still leads at 59.1.
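To make the percentages concrete, the short calculation below converts the reported real-phone success rates into implied task counts and computes the gaps on the overall average. The task counts are derived here, assuming each percentage is a plain fraction of the 150 tasks; the source reports only the percentages.

```python
TOTAL_TASKS = 150

# Reported real-phone success rates (percent) at each checkpoint.
success_rate = {"SFT": 36.67, "real-app RL": 40.67, "mixed RL": 45.33}


def implied_task_count(rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of solved tasks consistent with the rate."""
    return round(rate_percent / 100 * total)


task_counts = {stage: implied_task_count(r) for stage, r in success_rate.items()}
print(task_counts)  # {'SFT': 55, 'real-app RL': 61, 'mixed RL': 68}

# Reported overall four-setting averages.
overall = {"PhoneBuddy-4B-Real+Mock": 54.8, "GPT-5.4": 48.2,
           "Seed 2.0 Pro": 51.4, "Gemini 3.1 Pro": 59.1}
ours = overall["PhoneBuddy-4B-Real+Mock"]
# Positive gap: PhoneBuddy is ahead; negative gap: it trails.
gaps = {name: round(ours - score, 1)
        for name, score in overall.items()
        if name != "PhoneBuddy-4B-Real+Mock"}
print(gaps)  # {'GPT-5.4': 6.6, 'Seed 2.0 Pro': 3.4, 'Gemini 3.1 Pro': -4.3}
```

Under that assumption, the move from SFT to mixed RL corresponds to 13 additional solved tasks out of 150.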

The caveat the paper is upfront about: cross-app tasks are the main failure mode, and they get marginally worse as training progresses, at 22.0%, 20.0%, and 18.0% across the three checkpoints. PhoneWorld's task pool is primarily single-app, and multi-app information handoff is not yet modeled there. The paper also deliberately sets aside safety, privacy, and runtime deployment questions, treating those as out of scope for this study.

For open-model builders, the takeaway is that the training environment is a first-class design choice. PhoneWorld's resettable, automatically verifiable mock-app infrastructure is the enabling piece here, and the paper's own conclusion identifies extending it to cross-app workflows as the natural next step.