
Halving OpenVLA-OFT's LLM blocks lifts LIBERO from 95.0% to 98.3%

TL;DR

  • Removing half the LLM blocks from OpenVLA-OFT lifted LIBERO success from 95.0% to 98.3% under the same fine-tuning budget.
  • Keeping just two language blocks still recovered baseline-level performance, pointing to deep redundancy in VLA language backbones.
  • The authors introduce GateProbe, a one-shot sensitivity metric, and a Drop-Then-Recovery protocol to identify removable blocks.

A new arXiv paper on Vision-Language-Action (VLA) robot models lands with an awkward result, and it is interesting precisely because it is awkward. You can throw away half the language model inside OpenVLA-OFT and, on the standard manipulation benchmark, the robot does slightly better. According to the paper by Sun, Feng and colleagues, removing half of the LLM blocks in OpenVLA-OFT lifts LIBERO success from 95.0% to 98.3% under the same fine-tuning budget. Keep only two language blocks and you still recover baseline-level performance.

The method is straightforward. The authors call it Drop-Then-Recovery: a protocol that strips selected transformer blocks from a pretrained VLA and then fine-tunes what is left, to see whether the removed capacity actually mattered for downstream control. To pick which blocks to cut, they introduce GateProbe, a one-shot virtual-gate sensitivity metric that ranks blocks by their contribution to the action loss. Across multiple VLA architectures, manipulation benchmarks, and what the authors describe as real-robot industrial scenarios, the pattern is asymmetric. Language backbones absorb cuts easily. Vision and action pathways do not.
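To make the idea of a gate-based sensitivity score concrete, here is a minimal toy sketch. It is not the paper's GateProbe implementation, whose exact formula is not given here. It assumes one plausible reading: each residual block gets a virtual gate fixed at 1, and a block's score is the magnitude of the loss gradient with respect to that gate. The scalar "blocks", the squared-error loss, and the function names (`forward`, `gate_sensitivity`, `blocks_to_drop`) are all hypothetical. Central finite differences stand in for the single backward pass an autograd framework would use, so this version is illustrative rather than truly one-shot.

```python
import math


def forward(x, weights, gates):
    """Toy residual stack (assumed form): x <- x + g * w * tanh(x) per block."""
    for w, g in zip(weights, gates):
        x = x + g * w * math.tanh(x)
    return x


def loss(x0, target, weights, gates):
    """Squared error between the stack output and a target 'action' (toy stand-in)."""
    return (forward(x0, weights, gates) - target) ** 2


def gate_sensitivity(x0, target, weights, eps=1e-4):
    """Score each block by |dL/dg| at g = 1, estimated with central differences.

    Assumption: this gradient-magnitude score is one plausible virtual-gate
    metric, not necessarily the one used in the paper.
    """
    n = len(weights)
    scores = []
    for i in range(n):
        up = [1.0] * n
        down = [1.0] * n
        up[i] += eps
        down[i] -= eps
        grad = (loss(x0, target, weights, up) - loss(x0, target, weights, down)) / (2 * eps)
        scores.append(abs(grad))
    return scores


def blocks_to_drop(scores, k):
    """Return the indices of the k least sensitive blocks, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])
```

Calling `gate_sensitivity(1.0, 0.0, [0.5, 0.0, 0.3, 0.0])` and passing the result to `blocks_to_drop(scores, 2)` selects the two blocks whose weights are zero, since gating them cannot change the loss. In a Drop-Then-Recovery style workflow, the selected blocks would be removed and the remaining model fine-tuned; that recovery step is not shown here.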

The implication the authors themselves draw is the uncomfortable one. If you can rip out half the language stack and the robot gets a little better on standard manipulation benchmarks, then those benchmarks may not be probing language understanding in the way the field has assumed. The pretrained VLM capacity, they write, "far exceeds what is needed for short robotic instructions" in this setting.

The honest caveat is that this is one paper: the headline bump from 95.0% to 98.3% is on a single model under a single fine-tuning budget, and LIBERO is one benchmark family. What the reporting does not tell you is whether harder, more compositional language tasks, the ones a robot deployed in a kitchen or a warehouse would actually face, would expose the missing capacity. Nor does it say how the picture changes on long-horizon, ambiguous instructions outside today's benchmark suites.

The forward-looking part is the design freedom this hints at. If language redundancy is real at this scale, smaller and cheaper VLAs are on the table for teams that assumed they had to ship a full VLM on-robot. The bigger prize, though, is for whoever builds the next benchmark that actually punishes a model for losing half its language reasoning, because the current ones evidently do not.