Probably Raises $9M From a16z to Validate AI Outputs
Key insights
- Probably raised $9M from Andreessen Horowitz to build a validation harness targeting 99.99% accuracy in AI outputs.
- The harness runs on models four classes weaker than frontier models, enabling local deployment and cutting token costs.
- Founder Peter Elias says major AI labs avoid this approach because they profit from repeated model corrections.
Why this matters
The 'stronger harness, weaker model' architecture challenges the prevailing assumption that frontier-scale parameters are necessary to achieve accuracy in high-stakes applications. If the approach holds commercially, it creates a viable procurement path for precision-sensitive enterprises to sidestep cloud API dependence and the token costs that come with it. For founders and practitioners, it positions harness engineering as a distinct and fundable product category separate from model development.
Summary
Probably raised $9 million from Andreessen Horowitz to build what founder Peter Elias calls a "data science mech suit," a validation harness that catches LLM hallucinations before they reach users.
The company's premise is that better harness engineering lets weaker models still reach 99.99% accuracy. The system runs on models four classes below the frontier, which enables deployment on local hardware and lowers token costs.
Essentially: Probably (backed by a16z) bets harness engineering beats model scale for precision-sensitive tasks.
- First product: dataset summaries with citations and audit trails for data science teams.
- Roadmap targets accounting and medical verticals.
- Elias says major labs avoid this because "they make money the more times you have to correct the model."
That incentive misalignment is the market Probably is moving into.
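Probably has not published how its harness works, so the following is only a minimal sketch of the general idea behind a deterministic validator for cited dataset summaries, not the company's implementation. It assumes a hypothetical `Claim` record in which a model attaches a column and a statistic to every number it states; the harness recomputes each number from the data, drops claims that do not match, and records every decision in an audit trail. The names `Claim`, `STATS`, and `validate` are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical set of statistics a claim is allowed to cite (an assumption,
# not taken from Probably's product).
STATS: dict[str, Callable[[list[float]], float]] = {
    "mean": mean,
    "min": min,
    "max": max,
    "count": len,
}


@dataclass
class Claim:
    text: str      # sentence the model wants to show the user
    column: str    # dataset column the sentence cites
    stat: str      # statistic the sentence cites, a key of STATS
    value: float   # number the model asserted


def validate(claims: list[Claim], dataset: dict[str, list[float]], tol: float = 1e-9):
    """Recompute every cited number and return (accepted_claims, audit_trail)."""
    accepted, audit = [], []
    for claim in claims:
        values = dataset.get(claim.column)
        if not values or claim.stat not in STATS:
            # The citation points at nothing we can check, so reject it.
            audit.append({"claim": claim.text, "status": "rejected",
                          "reason": "unverifiable citation"})
            continue
        actual = STATS[claim.stat](values)
        ok = abs(actual - claim.value) <= tol
        audit.append({"claim": claim.text,
                      "status": "accepted" if ok else "rejected",
                      "claimed": claim.value, "recomputed": actual})
        if ok:
            accepted.append(claim)
    return accepted, audit


# Toy data: one correct claim, one wrong number, one citation to a missing column.
dataset = {"revenue": [100.0, 200.0, 300.0]}
claims = [
    Claim("Mean revenue is 200.", "revenue", "mean", 200.0),
    Claim("Peak revenue is 350.", "revenue", "max", 350.0),
    Claim("Mean cost is 5.", "cost", "mean", 5.0),
]
accepted, audit = validate(claims, dataset)
for entry in audit:
    print(entry["status"], "-", entry["claim"])
```

In this sketch the check itself involves no model call, so its verdict is the same on every run regardless of which model produced the claims. Whether Probably's validator works this way is not stated in the reporting.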
Potential risks and opportunities
Risks
- If frontier labs respond by shipping their own validation layers, Probably loses its primary differentiator before reaching accounting and medical markets.
- Because the harness depends on local hardware, accuracy could degrade on customer infrastructure that falls below minimum specs, which have not been disclosed.
- If the 99.99% accuracy target is missed in production medical or accounting deployments, Probably and its early adopters face regulatory and liability risk.
Opportunities
- Regulated industries with strict data-residency requirements benefit directly from Probably's local-deployment model, opening procurement channels typically closed to cloud AI vendors.
- Enterprise data science teams spending heavily on frontier API token costs have an immediate cost-reduction incentive to pilot the harness approach.
- Audit and compliance tooling vendors could integrate with Probably's citation and audit trail outputs to address emerging AI governance mandates.
What we don't know yet
- The benchmark methodology behind the 99.99% accuracy figure is not disclosed in public reporting.
- It is unclear whether Probably's local-deployment performance claims have been independently validated outside the initial data science use case.
- No enterprise customers or pilots are named in the article; how commercial adoption is tracking since the seed close is unknown.
Originally reported by techcrunch.com
Original headline: Probably Raises $9M Seed From a16z to Build Deterministic Validator That Prevents Hallucinations Before They Reach Users