technologyreview.com via Reddit June 19th 2026

Subquadratic Claims SubQ Breaks LLM Long-Context Cost Barrier

anthropic openai google inference generative ai model-architecture ai-startups

TL;DR

Subquadratic claims SubQ ran a RULER 128 benchmark for $8, versus $2,600 for Anthropic's Opus 4.6.
SubQ reportedly scored 89.7% on LiveCodeBench and supports a context window of up to 12 million tokens, twelve times the roughly 1 million tokens offered by most top models.
The model reportedly reused weights from the open-source Qwen model rather than being trained from scratch, raising questions about its architectural novelty.

The transformer architecture that powers modern LLMs has carried a quiet tax since its introduction: as context grows, the compute cost of attention scales quadratically with input length, because every token is scored against every other token. Double the input, and the attention computation quadruples. This is why context windows have been expensive to extend and why long-document analysis has remained cost-prohibitive for most teams. A Miami startup called Subquadratic emerged from stealth in May 2026 claiming to have broken through that ceiling, and MIT Technology Review covered the claims this week.
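The quadratic term is easy to see by counting the query-key scores that dense self-attention computes: one per ordered pair of tokens. The sketch below is a rough illustration that counts only those scores (per head, per layer) and ignores constant factors and the feed-forward layers, which grow linearly with length. `attention_pairs` is a hypothetical helper, and the 1M and 12M sizes are the context figures cited in this piece.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key scores computed by dense self-attention: every token scores every token."""
    return n_tokens * n_tokens


# Doubling the input quadruples the attention work.
print(attention_pairs(2_000) / attention_pairs(1_000))  # 4.0

# Going from a ~1M-token ceiling to a 12M-token context under dense attention.
print(attention_pairs(12_000_000) / attention_pairs(1_000_000))  # 144.0
```

Under dense attention, a 12x longer context means 144x more attention scores, which is why a 12-million-token window is only plausible with some form of subquadratic attention.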

The company's model, SubQ, reportedly uses a sparse attention mechanism that dynamically selects which word relationships to calculate rather than computing all pairwise interactions across the input. The company claims this reduces scaling from quadratic to subquadratic. The headline numbers are striking: 56 times faster than FlashAttention-based models in speed tests, a context window of up to 12 million tokens against a typical ceiling of 1 million for most top models, and a benchmark cost of $8 to run the RULER 128 test versus $2,600 for Anthropic's Opus 4.6. On LiveCodeBench, SubQ reportedly scored 89.7%, placing it alongside models from Google DeepMind, OpenAI, and Anthropic on coding tasks. Appen, brought in as an independent evaluator, confirmed the benchmarks, with Jeanine Sinanan-Singh, Appen's director of generative AI research, calling the results potentially "a game changer."
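SubQ's selection mechanism is proprietary, so the sketch below does not reproduce it. It illustrates one well-known family of dynamic sparse attention instead: locality-sensitive hashing (LSH) bucketing in the style of Reformer, simplified here to random-hyperplane hashing. Each token is hashed into a bucket by the signs of its projections onto random hyperplanes, and attention scores are computed only between tokens that share a bucket, so the set of scored pairs is chosen per input. The function names, the shared query/key simplification, and the `n_planes` parameter are illustrative assumptions. With roughly balanced buckets, each token scores about n / 2**n_planes others instead of all n.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lsh_bucket(vec, planes):
    # Bucket id = sign pattern of the vector's projections onto random hyperplanes.
    # Vectors pointing in similar directions tend to share a bucket.
    bucket = 0
    for plane in planes:
        bucket = (bucket << 1) | (1 if dot(vec, plane) >= 0 else 0)
    return bucket


def lsh_bucketed_attention(x, values, n_planes=3, seed=0):
    """Illustrative sparse self-attention (hypothetical; not SubQ's method).

    Queries and keys are shared (both are `x`), as in Reformer, so every token's
    bucket contains at least itself. Returns the attention outputs and the number
    of query-key scores actually computed.
    """
    dim = len(x[0])
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

    # Dynamic, per-input selection step: group token indices by hash bucket.
    ids = [lsh_bucket(vec, planes) for vec in x]
    buckets = {}
    for i, b in enumerate(ids):
        buckets.setdefault(b, []).append(i)

    outputs, scored = [], 0
    for i, vec in enumerate(x):
        candidates = buckets[ids[i]]  # only keys in the same bucket get scored
        scores = [dot(vec, x[j]) / math.sqrt(dim) for j in candidates]
        scored += len(candidates)
        weights = softmax(scores)
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, candidates))
            for c in range(len(values[0]))
        ])
    return outputs, scored


rng = random.Random(1)
x = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
_, dense = lsh_bucketed_attention(x, x, n_planes=0)   # one bucket: full attention
_, sparse = lsh_bucketed_attention(x, x, n_planes=3)  # up to 8 buckets
print(dense, sparse)
```

Setting `n_planes=0` puts every token in one bucket and recovers ordinary dense attention, which makes a handy sanity check. Reformer-style systems add refinements this sketch omits, such as multiple hash rounds, sorting and chunking to keep buckets balanced, and causal masking; none of it reveals how SubQ actually chooses which pairs to score.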

Take the specifics as reported, not settled. The mechanism is proprietary, and the company has disclosed only that word relationships are selected on the fly for each input. More importantly, SubQ reportedly reused weights from the Chinese open-source Qwen model rather than training a new model from scratch, which makes it genuinely hard to separate the architectural contribution from what the base model was already capable of. Will Depue, an independent researcher, noted that sparse attention has been attempted extensively in the field, comparing a genuine breakthrough here to "running a four-minute mile." Broad independent verification has not happened yet: access remains limited to a beta of more than 500 enterprise customers, with tens of thousands more on the waitlist.

What the reporting does not give you is an ablation that isolates the architecture, for example the same attention mechanism trained without the Qwen weights, or the unmodified Qwen base model run on the same benchmarks. That is the number that would distinguish a real architectural advance from a very efficiently adapted baseline.

Still, the economics are the part worth watching regardless of how the specific claims settle. Long-context inference at sub-hundred-dollar cost is a different product category from long-context inference at four-figure cost. Legal, financial, and scientific teams that currently treat document-scale analysis as out of budget would see their options change materially if the cost differential holds once the technology is tested beyond the company's own benchmark environment.
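To put the reported gap in numbers: taking the $8 and $2,600 RULER 128 figures at face value, the cost ratio is 325 to 1. Whether that ratio carries over to production workloads is exactly what remains unverified. The `runs_within_budget` helper and the $1,000 budget below are hypothetical, for illustration only.

```python
# Reported figures from the article; not independently verified.
SUBQ_RULER_COST_USD = 8
OPUS_RULER_COST_USD = 2_600


def runs_within_budget(budget_usd: float, cost_per_run_usd: float) -> int:
    """Whole benchmark-scale runs a fixed budget covers (hypothetical helper)."""
    return int(budget_usd // cost_per_run_usd)


print(OPUS_RULER_COST_USD / SUBQ_RULER_COST_USD)       # 325.0
print(runs_within_budget(1_000, SUBQ_RULER_COST_USD))  # 125
print(runs_within_budget(1_000, OPUS_RULER_COST_USD))  # 0
```

A budget that cannot cover a single run at the higher price covers over a hundred at the lower one, which is the sense in which the cost gap changes the product category rather than just the price.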

Originally reported by technologyreview.com


Original headline: Miami Startup Subquadratic Claims SubQ Architecture Is 56× Faster Than FlashAttention With 12M-Token Context — Benchmark Cost $8 vs. $2,600 for Claude Opus