paper web signal July 1st 2026

Alibaba's ABot-M0.5 unifies robot navigation and arm control

TL;DR

ABot-M0.5 from AMAP CV Lab reports 46.6% average on RoboCasa365, versus 35.9% for Qwen-RobotManip and 16.9% for π0.5.
The model uses a dual-level Mixture-of-Transformers to disentangle base movement from arm manipulation inside one policy.
On composite tasks it has not seen before, success drops to 2.7%, marking generalization as the remaining hard gap.

A new paper from Alibaba's AMAP CV Lab reports 46.6% average success on RoboCasa365, a mobile-manipulation benchmark where the strongest prior baseline sits at 35.9%. The system, described in the arXiv preprint, is pitched as a unified world action model that handles moving the robot base and controlling the arm inside a single architecture rather than as two separate stacks.

The interesting technical bet is not scale but structural separation. The authors argue that prior VLA policies treated a robot rollout as one action stream and ended up with what they call action-distribution conflicts between navigation and manipulation, plus errors that accumulate over long-horizon rollouts. ABot-M0.5 routes navigation and manipulation into separate subspaces via a dual-level Mixture-of-Transformers, adds an intermediate latent action layer to bridge video prediction and low-level control, and trains its inverse dynamics on the model's own predicted videos, a curriculum the authors call dream-forcing.
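To make the idea of separate subspaces concrete, here is a minimal sketch of one generic way to do it: every token attends to every other token through shared attention, but each token's feed-forward weights depend on whether it belongs to the observation, base, or arm group. This is an illustration of the general routing pattern only, not the paper's architecture. The class name MoTLayer, the group labels, the hidden width, and the single-level routing are all assumptions made for the example, and the paper's dual-level design, latent action layer, and dream-forcing curriculum are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden width, chosen arbitrarily for the illustration


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MoTLayer:
    """Hypothetical layer: shared self-attention, per-group feed-forward weights."""

    def __init__(self, groups, d=D):
        scale = d ** -0.5
        # Attention projections are shared by all tokens.
        self.wq, self.wk, self.wv = (rng.normal(0, scale, (d, d)) for _ in range(3))
        # One feed-forward matrix per token group (e.g. "base" vs "arm").
        self.ffn = {g: rng.normal(0, scale, (d, d)) for g in groups}

    def __call__(self, x, group_ids):
        # Global attention: base and arm tokens can still read each other.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        h = x + attn @ v
        # Group-specific feed-forward: each group has its own parameters.
        out = np.empty_like(h)
        for g, w in self.ffn.items():
            mask = group_ids == g
            out[mask] = h[mask] + np.maximum(h[mask] @ w, 0.0)
        return out


layer = MoTLayer(["obs", "base", "arm"])
tokens = rng.normal(size=(5, D))
groups = np.array(["obs", "obs", "base", "arm", "arm"])
outputs = layer(tokens, groups)  # same shape as tokens: (5, D)
```

In this toy layer, changing the arm group's feed-forward weights leaves the base and observation outputs of that layer untouched, which is the kind of isolation that could reduce conflicts between two differently distributed action streams.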

Why this matters if you are not building embodied models yourself: mobile manipulation is where warehouse, home, and last-mile robots have gotten stuck. Most labs have separate stacks for 'drive there' and 'pick that up', and the seams between them are where long rollouts fail. A single policy that jointly plans both, and that beats the best previously reported number on RoboCasa365 by 10.7 percentage points (46.6% versus 35.9%), suggests that architecture, not just more data, is doing real work here.
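For readers who want to check the size of that margin, the arithmetic can be reproduced from the three RoboCasa365 averages quoted in this article. The helper name margin is made up for the example, and the relative gain is a derived figure, not one reported in the paper.

```python
# RoboCasa365 average success rates (%) as quoted in this article.
scores = {"ABot-M0.5": 46.6, "Qwen-RobotManip": 35.9, "π0.5": 16.9}


def margin(ours, baseline):
    """Return (absolute gap in percentage points, relative gain in %)."""
    abs_gap = round(ours - baseline, 1)
    rel_gain = round(100 * (ours - baseline) / baseline, 1)
    return abs_gap, rel_gain


vs_qwen = margin(scores["ABot-M0.5"], scores["Qwen-RobotManip"])  # (10.7, 29.8)
vs_pi = margin(scores["ABot-M0.5"], scores["π0.5"])               # (29.7, 175.7)
```

The gap over the strongest prior baseline is 10.7 points in absolute terms, or about 30% in relative terms.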

The honest caveat sits in the paper's own numbers. On composite tasks the model has not seen before, success drops to 2.7%. That is exactly the case a general-purpose home robot needs to handle, and it is still nearly zero. The reported wins on LIBERO (99.4%) and RoboTwin 2.0 (94.10%) are on standard benchmarks, and the abstract gives neither physical-robot deployment data nor a parameter count.

Still, if the disentangled base-plus-arm design holds up when other groups try it, expect that split to show up quickly in the next round of embodied models, and expect Alibaba's own AMAP mapping and delivery teams to have somewhere obvious to test it.

Originally reported by paper

Original headline: Alibaba's ABot-M0.5 Is the First World Action Model to Jointly Handle Mobile Locomotion and Arm Manipulation