latent.space web signal

Ahmad Osman: local AI lags the frontier by four to eight months

TL;DR

  • Osman estimates open-source local models now trail closed frontier models by about four to eight months, a gap he says keeps shrinking.
  • A four-bit quantized Qwen reportedly runs on a MacBook and, per Osman, a model that fits on a modern phone now outperforms cloud systems people used a couple of years ago.
  • Osman runs his own inference on 22 RTX 3090s and sketches several RTX Pro 6000 GPUs as the enterprise cost of self-hosting a frontier-class open model.

Ahmad Osman has been making the case for local AI since before it turned into a headline theme, and after an oversubscribed two-part workshop at the AI Engineer World's Fair this year, the case is starting to look less like advocacy and more like an observable trend. In an interview with Latent Space, the founder of Osmantic argued that "the gap between open-source models and closed-frontier models keeps shrinking," and put a rough number on it: a lag of about four to eight months behind the frontier.

The useful part of the interview is not the "open source is catching up" line, which people have been saying for a while. It is what Osman means by local. He built an interactive demo comparing hardware like the DGX Spark and AMD Strix Halo boxes against cloud models on quality, speed and latency, and he keeps returning to a point that gets lost in the leaderboard chatter: "a model is only one part of the system." When he plugged Claude Code into a local model and asked it to change the RGB lighting on a GPU, it failed, because it lacked the web search and documentation tooling that the cloud version relies on. The model was fine; the surrounding scaffolding was missing.

Why this matters if you are not personally stacking GPUs: the practical envelope has genuinely moved. A four-bit Qwen reportedly runs on a MacBook, and Osman claims that "on a modern phone, you can now run a model that outperforms systems people were using in the cloud only a couple of years ago." For enterprises willing to spend, he sketches several RTX Pro 6000 GPUs as the price of running a frontier-class open model in-house, and he expects companies to collect traces, messages and feedback from general models to train specialized versions on their own work.
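A rough back-of-the-envelope sketch of why four-bit quantization is what makes laptop-class hardware plausible: weight memory scales with parameter count times bits per weight. The interview does not say which Qwen size is meant, so the parameter counts below are hypothetical, and the function name is illustrative. The estimate covers weights only and ignores quantization scales, the KV cache and activations, all of which add to the real footprint.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Ignores quantization metadata, KV cache and activation memory.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Hypothetical parameter counts, for illustration only; the source
# does not state which model size runs on the MacBook.
for params_billion in (7, 32, 70):
    num_params = params_billion * 1e9
    fp16 = weight_memory_gib(num_params, 16)
    q4 = weight_memory_gib(num_params, 4)
    print(f"{params_billion}B params: 16-bit ~{fp16:.1f} GiB, 4-bit ~{q4:.1f} GiB")
```

Whatever the model size, going from 16-bit to four-bit weights cuts the weight footprint to a quarter, which is the difference between needing a multi-GPU server and fitting in a laptop's unified memory for mid-sized models.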

The honest caveat is that this is one operator's framing of the state of play, and that operator runs his own inference on 22 RTX 3090s, which is not exactly a hobbyist rig. The four-to-eight-month gap is his estimate, not a benchmarked figure. The reporting does not publish head-to-head numbers from his comparison site, nor does it say which of the open models he name-checks (Llama, Mistral, Qwen, DeepSeek, GLM, Kimi) won which specific tasks. It also leaves out the enterprise cost math against paid APIs.

Still, the direction is the interesting part. If small teams and even phone users can run something serviceable, and if enterprises can plausibly justify their own hardware, the question stops being when local catches up and starts being which workloads still justify the cloud round trip.

Shared on Bluesky by 2 AI experts