
Program-as-Weights compiles LLM tasks into adapters for a 0.6B model

TL;DR

  • A 4B compiler emits parameter-efficient adapters for a frozen 0.6B Qwen3 interpreter, reportedly matching direct Qwen3-32B prompting on the paper's benchmarks.
  • The compiled function runs offline at about 30 tokens per second on a MacBook M3, using roughly one-fiftieth of the inference memory of the 32B baseline.
  • The compiler was trained on FuzzyBench, a newly released 10-million-example dataset built for fuzzy-function programming.

A group out of Harvard and Waterloo put a paper on arXiv this week that reframes what a small model is for. Instead of sending a natural-language prompt to a big hosted model every time a user hits your app, the authors run a compiler model once per function. It emits a small neural artifact that you can then run locally, over and over, for that specific function.

The paper, titled Program-as-Weights, reports that a 4B compiler emits a parameter-efficient adapter for a frozen 0.6B Qwen3 interpreter, and that the compiled function matches direct Qwen3-32B prompting on the authors' benchmarks while using roughly one-fiftieth of the inference memory. They also report about 30 tokens per second on a MacBook M3, which is the number that makes the paper feel less like a lab curiosity. The compiler itself was trained on FuzzyBench, a new 10-million-example dataset the authors are releasing alongside the method.
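As a rough sanity check on the one-fiftieth figure, the sketch below compares the nominal parameter counts taken from the model names (32B and 0.6B). It counts weights only and assumes 16-bit storage; the adapter, KV cache, activations, and any quantization the authors may use are ignored, so treat it as back-of-envelope arithmetic rather than the paper's measurement. The function names are illustrative.

```python
# Back-of-envelope check on the memory claim, using nominal sizes from the
# model names. Assumption: weights only, 16-bit storage; the adapter, KV cache,
# and activations are ignored, so this is not the paper's measurement.
BASELINE_PARAMS = 32e9       # Qwen3-32B, nominal
INTERPRETER_PARAMS = 0.6e9   # frozen Qwen3 interpreter, nominal


def weight_ratio(large: float, small: float) -> float:
    """How many times more parameters the large model has."""
    return large / small


def weight_memory_gb(params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (2 bytes per parameter by default)."""
    return params * bytes_per_param / 1e9


ratio = weight_ratio(BASELINE_PARAMS, INTERPRETER_PARAMS)
print(f"parameter ratio: {ratio:.1f}x")
print(f"32B weights:  {weight_memory_gb(BASELINE_PARAMS):.1f} GB")
print(f"0.6B weights: {weight_memory_gb(INTERPRETER_PARAMS):.1f} GB")
```

The ratio comes out a little above 50, which is at least consistent with the reported "roughly one-fiftieth" if weights dominate inference memory.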

Why this matters if you are not writing model papers: a lot of production LLM cost is paying a frontier API to redo the same fuzzy function on new inputs, thousands of times a day. If a small local adapter can carry a specific function at 32B-class quality, the shape of that bill changes, and so does the story for on-device and offline use.
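To make the cost argument concrete, here is a minimal break-even sketch. Every number in it is a placeholder chosen for illustration: the write-up does not report a compile cost, an API price, or a local running cost, and `break_even_calls` is a hypothetical helper, not something from the paper.

```python
import math


def break_even_calls(compile_cost: float,
                     api_cost_per_call: float,
                     local_cost_per_call: float = 0.0) -> float:
    """Number of calls after which compiling once beats calling the API each time.

    Returns math.inf if the local path is not cheaper per call.
    """
    saving_per_call = api_cost_per_call - local_cost_per_call
    if saving_per_call <= 0:
        return math.inf
    return compile_cost / saving_per_call


# Placeholder figures (assumptions, not reported anywhere in the source):
# a one-time compile cost, a per-call API price, and a per-call local cost.
calls = break_even_calls(compile_cost=10.0,
                         api_cost_per_call=0.003,
                         local_cost_per_call=0.001)
print(f"break-even after about {math.ceil(calls)} calls")
```

The point of the sketch is the shape of the trade: a fixed up-front cost amortized over repeated calls, which only pays off for functions that run often enough.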

The honest caveat is that the parity claim is measured on the authors' own FuzzyBench distribution, and the paper is a single-lab arXiv preprint from July 2026 rather than a reproduced result. What the abstract does not give you is how much the compile step costs per function, which task categories the 0.6B interpreter starts to fall behind on, or how this actually compares against just LoRA fine-tuning the same small model. Take the specifics as reported, not settled. Still, the direction, treating foundation models as tool builders rather than per-input solvers, is the part worth watching.