A Miami Startup Says It Slashed LLM Costs. Should India Care?
Subquadratic claims its sparse-attention model runs 56x faster and cuts long-context inference from $2,600 to $8. The receipts are now trickling out, and the stakes are highest where compute is dearest.
The News
Subquadratic, a Miami-based startup that emerged from stealth in May 2026, says it has cracked one of the oldest cost problems in modern AI: the way transformer models slow to a crawl as the text they process grows longer. After weeks of thin detail and loud scepticism, the company has begun publishing benchmarks to back its claim.
The firm, led by chief executive Justin Dangel and chief technology officer Alex Whedon, is targeting what engineers call quadratic attention scaling. In a standard model, doubling the input roughly quadruples the computation. A 10,000-word passage already demands around 50 million multiplications. Subquadratic's answer, which it calls dynamic sparse attention, multiplies only the token pairs it judges relevant rather than every possible combination.
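The scaling argument can be checked with simple counting. The sketch below is illustrative only and is not Subquadratic's method: it assumes one token per word, counts token pairs rather than individual arithmetic operations, and uses a hypothetical fixed `budget` of 512 attended tokens per position as a stand-in for "only the relevant pairs". The function names are invented for this example.

```python
def dense_pairs(n: int) -> int:
    """Full attention: every token is compared with every token (n * n pairs)."""
    return n * n


def causal_pairs(n: int) -> int:
    """Causal attention: each token sees itself and earlier tokens, n(n+1)/2 pairs."""
    return n * (n + 1) // 2


def sparse_pairs(n: int, budget: int) -> int:
    """Hypothetical sparse scheme: each token attends to at most `budget` tokens."""
    return n * min(n, budget)


# Assumed, illustrative budget; the article does not say how many pairs are kept.
BUDGET = 512

for n in (10_000, 20_000, 40_000):
    print(
        f"n={n:>6,}  dense={dense_pairs(n):>13,}  "
        f"causal={causal_pairs(n):>13,}  sparse={sparse_pairs(n, BUDGET):>11,}"
    )
```

Under these assumptions, a 10,000-token input gives about 50 million causal pairs, and each doubling of length quadruples the dense count while only doubling the budgeted sparse count. The sketch does not account for the cost of deciding which pairs are relevant, which is where any real sparse method has to earn its savings.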
The headline figures are striking. The company reports that its model is 56 times faster than systems built on FlashAttention, scores 89.7% on the LiveCodeBench coding test, and handles context windows of up to 12 million tokens, against roughly one million for leading rivals. On a long-context retrieval task, it cites 98% accuracy in the six-to-twelve-million-token range. Most arresting of all is the price: $8 to run the RULER 128 test, versus $2,600 for Anthropic's Opus 4.6, a gap of 325-fold.
Why It Matters
If the numbers hold, the implication is not a smarter model but a far cheaper one. Inference cost, the price of actually running a model after training, has quietly become the binding constraint on whether AI products make money.
The caveats are real. Independent researcher Will Depue likened the claim to "running a four-minute mile": not impossible, but extraordinary. Jeanine Sinanan-Singh of the data firm Appen, which validated some results, noted the awkward truth that shocking numbers are hard to credit when the company reports them itself. And Subquadratic reused weights from Qwen, an open-source Chinese model, rather than training from scratch, which dents its story of a clean architectural reinvention.
The pattern is familiar. When efficiency breakthroughs land, the field tends to absorb the idea rather than the company. FlashAttention itself began as an academic technique before becoming an industry default.
Indian Angle
Nowhere does cheaper inference matter more than in markets that price in rupees. Indian AI startups serve users who will not pay developed-world subscription fees, which makes the gap between a $2,600 run and an $8 run less a curiosity than a survival question. Long context windows also suit Indian workloads directly: parsing lengthy legal contracts, multilingual government documents and dense regional-language corpora is exactly where token counts balloon.
Domestic model builders such as Sarvam and Krutrim have leaned on efficiency rather than raw scale, partly out of necessity given limited access to top-tier GPUs. A proven sparse-attention method would be a gift to that strategy, lowering the compute bill for firms that cannot match American capital. It would also strengthen the case behind the IndiaAI Mission, the government's roughly Rs 10,372 crore programme, which is betting that India can compete on cleverness and cost rather than brute spend.
There is a cautionary note too. Subquadratic's reuse of Qwen weights mirrors a live Indian debate: how much of a "sovereign" model is genuinely homegrown, and how much is fine-tuning atop someone else's open weights. For regulators at MeitY weighing what counts as indigenous AI, the distinction is about to get harder to police, not easier.
FAQ
What exactly did Subquadratic claim to solve?
The quadratic attention bottleneck, where computation grows roughly fourfold each time input length doubles. Its dynamic sparse attention processes only token pairs it deems relevant, which the company says cuts cost and lets context windows stretch to 12 million tokens.
Are the benchmark numbers independently confirmed?
Partly. The data firm Appen validated some results, and the company has published figures such as 89.7% on LiveCodeBench. But researchers remain cautious, and the model is not yet broadly available for outside testing despite long waitlists.
Why does this matter for Indian startups?
Inference cost is the main barrier to profitable AI products in price-sensitive markets. If a method delivers comparable quality far more cheaply, it directly improves the unit economics for firms like Sarvam and Krutrim and for the wider IndiaAI Mission.
Where can I read the original reporting?
MIT Technology Review published the detailed account of Subquadratic's claims, benchmarks and the scepticism around them. The link to the full piece is below.
This story was reported by MIT Technology Review. Read the full original coverage at MIT Technology Review.
Sources & Citations
- A startup claims it broke through a bottleneck that's holding back LLMs — MIT Technology Review