arxiv.org web signal

Bowtie transformer reportedly trims FLOPs 22%, KV cache 15%

TL;DR

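  • Zhaofeng Wu and colleagues propose ><former (the bowtie-former), a decoder-only transformer that is wider at the top and bottom and narrower in the middle.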
  • The paper reports up to 22% lower pre-training FLOPs and 15% lower KV cache memory and I/O, measured at matched loss on fitted scaling curves.
  • The architecture is tested at 200M to 2B dense parameters and in a 3B MoE configuration, using a parameter-free carry-forward residual to bridge width changes.

There's a quiet paper on arXiv from Zhaofeng Wu and collaborators proposing something unusually simple: make a transformer wider at the top and bottom and narrower in the middle. They call it a bowtie-former, written ><former. It is a decoder-only model with a parameter-free residual carry-forward that copies inactive coordinates forward unchanged to later layers, so you do not need trained projection layers between width changes.
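To make the carry-forward idea concrete, here is a minimal sketch in plain Python. It is one possible reading of the description above, not the paper's implementation. The linear width schedule, the choice that the leading coordinates are the active ones, and the names `bowtie_widths`, `forward`, and `layer_fn` are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def bowtie_widths(num_layers: int, d_max: int, d_min: int) -> List[int]:
    """Hypothetical symmetric schedule: widest at both ends, narrowest in the middle."""
    if num_layers < 2:
        return [d_max] * num_layers
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        frac = abs(i - mid) / mid  # 1.0 at the first/last layer, 0.0 at the middle
        widths.append(round(d_min + frac * (d_max - d_min)))
    return widths


def forward(x: Vector, widths: List[int],
            layer_fn: Callable[[int, Vector], Vector]) -> Vector:
    """Run a residual stack whose layers see only the first `d` coordinates.

    Assumption: the residual stream keeps the maximum width throughout.
    A layer of width d reads and updates coordinates [0, d); coordinates
    [d, len(x)) are inactive and are copied forward unchanged, so no
    learned projection is needed when the width changes.
    """
    h = list(x)
    for i, d in enumerate(widths):
        active = h[:d]
        update = layer_fn(i, active)  # stand-in for attention + MLP; returns d values
        h[:d] = [a + u for a, u in zip(active, update)]
        # h[d:] is left untouched: this is the parameter-free carry-forward.
    return h
```

With five layers, `bowtie_widths(5, 8, 4)` gives `[8, 6, 4, 6, 8]`. The middle layer touches only the first four coordinates, and the other four reach the next layer exactly as they were.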

The headline numbers, as reported in the paper, are 'up to a 22% reduction in overall pre-training FLOPs' and a '15% reduction in KV cache memory and I/O costs' under fitted scaling curves. They test dense models from 200M to 2B parameters and a 3B MoE configuration, and claim the bowtie consistently outperforms parameter-matched constant-width baselines on downstream language tasks.

The reason this is worth watching is straightforward. KV cache is the constraint that bites at inference time, especially for long-context serving. A 15% KV cache reduction at no quality cost, if it replicates at the scales practitioners actually deploy, is the kind of architectural change that gets adopted across training stacks within a release cycle. The FLOPs savings help training budgets, but the KV cache savings help everyone running inference.
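A rough way to see why a narrower middle shrinks the KV cache is to add up the cache layer by layer. The sketch below assumes that each layer's key and value dimension scales with that layer's width, which this summary does not confirm. The function names and the toy widths are hypothetical and are not the paper's configurations.

```python
from typing import Sequence


def kv_cache_bytes(kv_dims: Sequence[int], seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Total KV cache size for a stack with the given per-layer K/V dimension.

    The factor 2 counts one key vector and one value vector per token per layer.
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return sum(2 * d * seq_len * batch_size * bytes_per_value for d in kv_dims)


def relative_saving(baseline_dims: Sequence[int], variant_dims: Sequence[int],
                    seq_len: int) -> float:
    """Fraction of KV cache memory saved by the variant relative to the baseline."""
    base = kv_cache_bytes(baseline_dims, seq_len)
    variant = kv_cache_bytes(variant_dims, seq_len)
    return 1.0 - variant / base


# Toy numbers for illustration only.
constant = [1024, 1024, 1024, 1024]
bowtie = [1024, 768, 768, 1024]
saving = relative_saving(constant, bowtie, seq_len=4096)  # 0.125 for these toy widths
```

The saving is independent of sequence length in this model, since every layer's cache grows linearly with the number of tokens. A parameter-matched comparison would use a different constant width than the toy one here.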

The honest caveat is that the 22% figure comes from fitted loss-matched scaling curves rather than a single head-to-head matched run, so it is a curve-fit claim. The largest model in the paper is a 3B MoE, and the paper does not measure the bowtie shape's effect on instruction tuning, RLHF behavior, or long-context reasoning. What the reporting does not give you is whether the symmetric bowtie shape is actually optimal or whether the win survives at the larger scales where production models live.
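For readers unfamiliar with what a loss-matched, curve-fit saving means, the sketch below shows the general recipe under a simple assumption: loss follows a power law in compute, L = a * C^(-b). The functional form the paper actually fits is not stated in this summary, and the function names here are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(flops: Sequence[float], losses: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log L = log a - b * log C. Returns (a, b)."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope


def flops_at_loss(a: float, b: float, target_loss: float) -> float:
    """Invert L = a * C^(-b) to get the compute needed to reach target_loss."""
    return (a / target_loss) ** (1.0 / b)


def loss_matched_saving(base_fit: Tuple[float, float],
                        variant_fit: Tuple[float, float],
                        target_loss: float) -> float:
    """Fraction of compute the variant saves when both reach the same loss."""
    base = flops_at_loss(*base_fit, target_loss)
    variant = flops_at_loss(*variant_fit, target_loss)
    return 1.0 - variant / base
```

The number that comes out is a property of the two fitted curves, not of any single pair of training runs. When the fitted exponents differ, the saving also changes with the target loss, so the figure depends on where along the curve you read it.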

Still, the code is public, the idea is straightforward to graft into an existing training run, and the cost of trying it is small. That is how architectural changes actually propagate.
