mistral.ai via Hacker News

Mistral Launches OCR 4 With Structured Output and Self-Hosting


TL;DR

  • OCR 4 returns bounding boxes, typed-block labels, and per-word confidence scores, making it directly usable in RAG and agentic pipelines.
  • Mistral says independent annotators preferred OCR 4 over competing systems at an average 72% win rate; reported benchmark scores are 85.20 on OlmOCRBench and 93.07 on OmniDocBench.
  • API pricing is $4 per 1,000 pages ($2 for batch), with a single-container self-hosted option for compliance-driven enterprise deployments.

Document parsing has always been the unglamorous part of enterprise AI, the place where scanned invoices, dense PDFs, and multi-column reports pile up and the extraction layer gets treated as plumbing. Mistral's launch of OCR 4 takes that problem seriously, and the more interesting story is not the text recognition itself.

The useful shift is in what comes back alongside the text. OCR 4 returns bounding boxes, typed-block classifications covering titles, tables, equations, and signatures, plus per-word confidence scores. That combination matters because downstream RAG pipelines and agentic workflows need structured metadata to operate reliably, not just raw text dumps. The model covers 170 languages across 10 language groups, with the company noting particular attention to rare and low-resource languages. On public benchmarks, Mistral reports scores of 85.20 on OlmOCRBench and 93.07 on OmniDocBench, and says independent annotators preferred OCR 4 over competing systems at an average 72% win rate. Users cited in the release reported '4x faster per page' processing versus prior providers and '8x lower cost and 17x lower latency' compared to agentic document parsers on financial QA workloads.
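A rough sketch of how a RAG ingestion step might use that metadata follows. The schema here is an assumption, not Mistral's documented response format: the `Word` and `Block` classes, their field names, the 0-to-1 confidence scale, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Word:
    text: str
    confidence: float  # assumed to be on a 0.0-1.0 scale


@dataclass
class Block:
    kind: str  # e.g. "title", "table", "equation", "signature"
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    words: List[Word]


def to_chunks(blocks: List[Block], min_confidence: float = 0.8) -> List[Dict]:
    """Turn typed blocks into retrieval chunks that carry their metadata.

    Each chunk keeps the block type and bounding box, and is flagged for
    review when any word falls below the (hypothetical) confidence threshold.
    """
    chunks = []
    for block in blocks:
        low = [w.text for w in block.words if w.confidence < min_confidence]
        chunks.append(
            {
                "text": " ".join(w.text for w in block.words),
                "kind": block.kind,
                "bbox": block.bbox,
                "needs_review": bool(low),
                "low_confidence_words": low,
            }
        )
    return chunks


# Hand-written sample data standing in for a parsed page.
sample_blocks = [
    Block("title", (40.0, 30.0, 400.0, 60.0),
          [Word("Invoice", 0.99), Word("2024-117", 0.97)]),
    Block("table", (40.0, 90.0, 560.0, 300.0),
          [Word("Total", 0.98), Word("1,2O0.00", 0.41)]),
]
sample_chunks = to_chunks(sample_blocks)
```

Keeping the block type and bounding box on each chunk lets a retriever filter by region or element type, and the review flag gives a place to route doubtful extractions to a human instead of indexing them silently.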

Pricing is $4 per 1,000 pages via the API, or $2 with the Batch API. A higher-level Document AI product powered by OCR 4 runs $5 per 1,000 pages. The model is available through Mistral Studio, Amazon SageMaker, and Microsoft Foundry, with Snowflake Parse Document integration listed as coming.
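To put those list prices in budget terms, here is a small estimator. It assumes cost scales linearly with page count, which the announcement does not confirm (billing may round or apply minimums), and the tier names `api`, `batch`, and `document_ai` are labels invented for this sketch.

```python
from decimal import Decimal

# List prices in USD per 1,000 pages, as quoted in the announcement.
# The tier keys are hypothetical labels, not official product identifiers.
PRICE_PER_1000_PAGES = {
    "api": Decimal("4"),
    "batch": Decimal("2"),
    "document_ai": Decimal("5"),
}


def estimate_cost(pages: int, tier: str = "api") -> Decimal:
    """Estimate cost in USD, assuming simple linear proration by page."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in PRICE_PER_1000_PAGES:
        raise ValueError(f"unknown tier: {tier!r}")
    return PRICE_PER_1000_PAGES[tier] * pages / 1000


# Example: a 250,000-page backlog under each tier.
backlog_pages = 250_000
backlog_costs = {tier: estimate_cost(backlog_pages, tier)
                 for tier in PRICE_PER_1000_PAGES}
```

Under these assumptions a 250,000-page backlog comes to $1,000 on the standard API, $500 on the Batch API, and $1,250 through Document AI.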

The caveats are real. Mistral itself flags benchmark limitations, citing ground-truth errors, equation formatting variations, and reading-order assumptions that can penalize correct outputs. The company also explicitly states OCR 4 is not intended for medical diagnosis, legal judgment, or real-time critical applications, drawing a clear line around where it considers the model reliable enough to use. What the announcement does not provide is the methodology behind the 'independent annotator' win rate, which makes that figure hard to verify.

For teams in regulated sectors, the on-premises single-container deployment option may be the most consequential feature: healthcare, finance, and legal operations that cannot route documents through third-party cloud APIs now have a credible structured-extraction path that stays inside their own infrastructure. Still missing are concrete throughput and latency figures for the self-hosted setup and any timeline for the Snowflake integration, both of which matter to anyone planning a production deployment.