Study Finds 7 Leading VLA Models Lose Knowledge After Robot Tuning
TL;DR
- A new arXiv paper evaluates seven VLA models (π₀, OpenVLA, Magma, SmolVLA, SpatialVLA, Xiaomi-Robotics-R0, and InternVLA-M1) against nine VLM baselines.
- The authors introduce Act2Answer, a protocol that turns VLM knowledge questions into short tabletop episodes where the robot answers via a single object-placement action.
- VLAs handle simple concepts well but show larger gaps on richer semantic categories versus their source VLMs, and VQA co-training is linked to better retention.
A new arXiv preprint by Nikita Kachaev and collaborators, "Does VLA Even Know the Basics?", takes a question that has been nagging at people building on top of vision-language-action models and turns it into an actual measurement. If you start with a strong VLM and fine-tune it on robotics data to get a VLA, how much of the original commonsense and factual knowledge is still in there when the model is done learning to move?
The answer, at least in this study, is: some of it, but noticeably less on anything semantically rich. The authors run seven VLA models (π₀, OpenVLA, Magma, SmolVLA, SpatialVLA, Xiaomi-Robotics-R0, and InternVLA-M1) against nine VLM baselines. Their protocol, Act2Answer, adapts standard VLM knowledge benchmarks by turning each question into a short tabletop episode where the agent answers by placing an object on the candidate it thinks is correct. The framing matters because failures on knowledge questions in a robot are usually ambiguous: you can't tell whether the model didn't know the answer or just couldn't execute the motion. Grounding the answer in one placement action strips out most of the control confound.
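To make the scoring idea concrete, here is a minimal Python sketch of how a single placement could be mapped back to an answer. It is not the authors' code. The `Candidate` and `Episode` types, the function names, the 2D table coordinates, and the 5 cm distance threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) position on the table, in metres


@dataclass(frozen=True)
class Candidate:
    label: str   # the answer option this table location stands for
    xy: Point    # centre of the candidate on the table


@dataclass(frozen=True)
class Episode:
    question: str
    candidates: Tuple[Candidate, ...]
    correct_label: str


def chosen_label(episode: Episode, placement_xy: Point,
                 max_dist: float = 0.05) -> Optional[str]:
    """Return the label of the candidate nearest to the placement.

    Returns None when the object lands farther than max_dist from every
    candidate (assumed threshold), i.e. no answer can be read off.
    """
    nearest = min(episode.candidates, key=lambda c: dist(c.xy, placement_xy))
    return nearest.label if dist(nearest.xy, placement_xy) <= max_dist else None


def score_placement(episode: Episode, placement_xy: Point) -> str:
    """Classify one episode as 'correct', 'wrong', or 'no_answer'."""
    label = chosen_label(episode, placement_xy)
    if label is None:
        return "no_answer"
    return "correct" if label == episode.correct_label else "wrong"


def accuracy(episodes: Sequence[Episode], placements: Sequence[Point]) -> float:
    """Fraction of episodes whose placement selects the correct candidate."""
    outcomes = [score_placement(e, p) for e, p in zip(episodes, placements)]
    return outcomes.count("correct") / len(outcomes)
```

Keeping "no_answer" separate from "wrong" is one way to keep a botched motion from being counted as a knowledge failure. Whether the paper scores episodes this way is not stated in the summary above.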
The headline finding, in the authors' own words, is that "VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs." Two secondary results are the more useful ones for practitioners. First, VQA co-training during robotics fine-tuning correlates with better knowledge retention, which is a design lever people can actually pull. Second, layerwise probing shows answer-relevant signals peaking in the middle layers of the VLM backbone and attenuating toward the upper layers used for action prediction, so the knowledge isn't gone from the network so much as it stops surviving the trip to the action head.
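For readers unfamiliar with layerwise probing, the general recipe is to take hidden features from each layer, fit a small classifier on them, and see how well it recovers the answer. The sketch below illustrates that recipe in plain Python. It is an assumption-heavy stand-in, not the paper's method: a nearest-centroid classifier is swapped in for whatever probe the authors trained, and the function names and the per-layer list-of-vectors input format are hypothetical.

```python
from math import dist
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_probe(features: Sequence[Vector],
              labels: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Fit a nearest-centroid probe: one mean feature vector per answer label."""
    by_label: Dict[Hashable, List[Vector]] = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}


def probe_accuracy(probe: Dict[Hashable, List[float]],
                   features: Sequence[Vector],
                   labels: Sequence[Hashable]) -> float:
    """Share of held-out examples whose nearest centroid is the true label."""
    hits = 0
    for x, y in zip(features, labels):
        predicted = min(probe, key=lambda lbl: dist(probe[lbl], x))
        hits += predicted == y
    return hits / len(labels)


def layerwise_accuracy(train_by_layer: Sequence[Sequence[Vector]],
                       train_labels: Sequence[Hashable],
                       test_by_layer: Sequence[Sequence[Vector]],
                       test_labels: Sequence[Hashable]) -> List[float]:
    """Fit one probe per layer and return held-out accuracy for each layer."""
    return [
        probe_accuracy(fit_probe(train, train_labels), test, test_labels)
        for train, test in zip(train_by_layer, test_by_layer)
    ]
```

Plotting the returned list against layer index gives the kind of curve the finding describes: if answer-relevant signal peaks mid-network and fades toward the top, probe accuracy would rise and then fall.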
The honest caveat is that Act2Answer is a curated tabletop suite, not a field deployment, and the paper is a preprint. What the reporting doesn't give you is a real-world manipulation-task correlation or a clean answer on which specific fine-tuning recipes are worst. Take the ranking as a starting point rather than a leaderboard.
Still, the direction is the useful part. If you are trusting a VLA to do anything that depends on the base VLM's world model, this is a paper worth reading before you assume that trust is well-founded, and it hands the VLA training community a public benchmark to compete against on retention rather than just on task success.
Original headline: π₀, OpenVLA, Magma, and 4 More Leading VLA Models Lose Commonsense Knowledge After Robotics Fine-Tuning