French developer Mistral AI has released OCR 4, a document-recognition model that returns bounding boxes, block classification and confidence scores alongside extracted text.
The model moves beyond clean text and tables to a structured view of each page. It labels blocks such as titles, tables, equations and signatures, and scores its confidence per page and per word.
Mistral says that structure suits retrieval-augmented generation, agent workflows and compliance tasks such as redaction and human-in-the-loop verification. The output gives search and ingestion pipelines citation-ready inputs.
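As a rough illustration of how per-word confidence and block labels could support human-in-the-loop verification, the sketch below routes low-confidence words to a reviewer. The `Word` and `Block` shapes, the 0-to-1 confidence range, the bounding-box layout and the 0.8 threshold are all assumptions made for this example; the article does not describe OCR 4's actual response schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical shapes for illustration only: not OCR 4's real response schema.
@dataclass
class Word:
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) on the page

@dataclass
class Block:
    label: str  # e.g. "title", "table", "equation", "signature"
    page: int
    words: List[Word]

def words_needing_review(blocks: List[Block], threshold: float = 0.8):
    """Return (page, label, word) tuples for words scored below the threshold."""
    return [
        (block.page, block.label, word)
        for block in blocks
        for word in block.words
        if word.confidence < threshold
    ]

# Made-up sample data standing in for one parsed page.
blocks = [
    Block("title", 1, [Word("Invoice", 0.99, (10, 10, 80, 30))]),
    Block("signature", 1, [Word("J.Smith", 0.41, (300, 700, 380, 730))]),
]

flagged = words_needing_review(blocks)
for page, label, word in flagged:
    # The bounding box lets a reviewer jump straight to the region in question.
    print(page, label, word.text, word.bbox)
```

Keeping the page number and bounding box with each flagged word is what would make the result usable as a citation or a redaction target, rather than just a list of doubtful strings.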
OCR 4 is an ingestion component of Search Toolkit, Mistral's open-source, composable search framework.
OCR 4 accepts common enterprise formats, including PDF, DOC, PPT and OpenDocument, and supports 170 languages across 10 language groups. It runs in a single container for fully self-hosted deployment. That lets organisations keep document data inside their own infrastructure for residency and compliance.
Mistral claims independent annotators preferred OCR 4 over every rival system tested, with an average win rate of 72 per cent. The company says OCR 4 also leads the public OlmOCRBench.
Mistral says rival scores reflect its own internal reproductions and recommends testing on your own documents. It also flags limits in common benchmarks, noting that errors in reference data and equivalent maths notation can mark correct output as wrong.
Through the API, OCR 4 is priced at US$4 per 1,000 pages, halving to US$2 with a batch discount. The model is available via Mistral Studio, Amazon SageMaker and Microsoft Foundry, with Snowflake support listed as coming soon.
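For budgeting, the quoted rates translate into a simple per-page calculation. The sketch below assumes billing is strictly pro rata per page, with no minimum charge, volume tiers or taxes; none of those details are covered in the article, and the function name is hypothetical.

```python
from decimal import Decimal

# Rates quoted in the article, in US dollars per 1,000 pages.
STANDARD_USD_PER_1000_PAGES = Decimal("4")
BATCH_USD_PER_1000_PAGES = Decimal("2")

def estimate_cost_usd(pages: int, batch: bool = False) -> Decimal:
    """Estimate API cost, assuming linear per-page billing (an assumption)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    rate = BATCH_USD_PER_1000_PAGES if batch else STANDARD_USD_PER_1000_PAGES
    return rate * pages / 1000

# Compare the two rates for a hypothetical 250,000-page archive.
standard_cost = estimate_cost_usd(250_000)
batch_cost = estimate_cost_usd(250_000, batch=True)
print(f"standard: US${standard_cost}, batch: US${batch_cost}")
```

`Decimal` is used instead of floats so that small page counts do not pick up binary rounding noise in the cents.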
Mistral notes OCR 4 is a document-understanding model, not a decision-maker. It is not intended for medical, legal or safety-critical use.