Document Simulator to help train your AI

AYR, a company specializing in Intelligent Document Processing (IDP) and Intelligent Automation, has announced the launch of version 3.0 of its patent-pending Intelligent Document Simulator (IDS). According to AYR, IDS addresses two of the most significant challenges in the industry: the scarcity of training data for customer use cases and the ever-evolving formats and layouts of business documents.

Businesses struggle to provide training data for intelligent automation because their documents are sensitive and often contain confidential or personally identifiable information. Additionally, technology teams may have limited access to business documents, making it challenging to gather the data needed to train Intelligent Document Processing systems.

IDS promises to overcome these challenges by generating synthetic data that mimics the appearance and content of real-world business documents.

The first version of IDS, released in early 2022, let users replace the business fields of a document with values drawn at random from user-provided dictionaries, which could be generated automatically or assembled manually. This allowed users to generate as many sample documents as needed to train their IDP models.
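The dictionary-based replacement described above can be illustrated with a short Python sketch. This is not AYR's implementation: the function name generate_samples, the field names, and the sample values are all hypothetical, and a document is modeled simply as a mapping from field name to value.

```python
import random


def generate_samples(template, dictionaries, count, seed=None):
    """Return `count` copies of `template` in which every field that has a
    dictionary is replaced by a randomly chosen entry from that dictionary.

    Hypothetical helper for illustration only; not part of IDS.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        doc = dict(template)  # copy so the template itself is never modified
        for field, choices in dictionaries.items():
            if field in doc:
                doc[field] = rng.choice(choices)
        samples.append(doc)
    return samples


# Hypothetical invoice with three business fields and one untouched field.
template = {
    "vendor": "Acme Corp",
    "invoice_date": "2022-01-15",
    "total": "100.00",
    "notes": "Net 30",
}
dictionaries = {
    "vendor": ["Acme Corp", "Globex", "Initech"],
    "invoice_date": ["2022-01-15", "2022-02-01", "2022-03-10"],
    "total": ["100.00", "250.50", "75.25"],
}

samples = generate_samples(template, dictionaries, count=5, seed=0)
```

Fields without a dictionary ("notes" here) pass through unchanged, so only the business fields vary from sample to sample.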

“While this proved helpful, the latest breakthroughs in IDS take this capability to new heights by enabling users to input a sample document and generate various layouts, such as swapping columns in a table containing line items or shuffling sections of the document horizontally or vertically and therefore creating new training documents,” said Dr. Tianhao Wu, CTO and co-founder of AYR.

“This makes it possible to train AI and machine learning models to recognize and process a wide range of document layouts, which is crucial for handling the diverse and ever-changing documents that businesses deal with daily.”
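To make the layout variations Wu describes more concrete, the following Python sketch reorders the columns of a line-item table and shuffles the sections of a document. It is a simplified illustration under stated assumptions: a table is a header list plus row lists, a document is an ordered list of section names, and the helpers permute_columns, column_variants, and shuffle_sections are hypothetical names, not IDS functions. Real layout generation would also have to re-render the page image and its annotations.

```python
import itertools
import random


def permute_columns(header, rows, order):
    """Return the table with its columns rearranged according to `order`,
    a list of original column indices in their new left-to-right position."""
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


def column_variants(header, rows):
    """Yield every column ordering of the table, including the original."""
    for order in itertools.permutations(range(len(header))):
        yield permute_columns(header, rows, order)


def shuffle_sections(sections, seed=None):
    """Return a shuffled copy of the document's section order."""
    rng = random.Random(seed)
    shuffled = list(sections)  # copy so the input order is preserved
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical line-item table and section list.
header = ["item", "qty", "price"]
rows = [["widget", "2", "9.99"], ["gadget", "1", "24.50"]]

variants = list(column_variants(header, rows))
swapped_header, swapped_rows = permute_columns(header, rows, [2, 0, 1])

sections = ["header", "bill_to", "line_items", "totals"]
reordered = shuffle_sections(sections, seed=1)
```

Because each variant keeps every cell attached to its column label, the field-level ground truth carries over to the new layout without relabeling.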

The AYR team’s latest innovation goes further by allowing users to replace every word in a document with a similar field name or value, bringing the synthetic data even closer to real-world documents.

Instead of relying on dictionaries as the first iteration did, AYR now supports two mechanisms for producing synthetic content: AYR’s own language model, which produces similar phrases, words, or lines of text, and the widely used GPT-3 engine, which generates similar content dynamically.

As with previous IDS versions, users can further augment their data and document samples, for example by blurring, rotating, or otherwise making source documents harder to read, mirroring the challenges companies face in the real world. According to AYR, the augmented data is used to push the boundaries of machine learning models and increase speed to market.
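The kind of augmentation mentioned above can be sketched in a few lines of Python. This is only an assumption-laden illustration, not how IDS works: a scanned page is modeled as a small grid of grayscale pixel values (0 to 255), rotation is limited to a 90-degree turn, and blurring is a simple 3x3 box blur. The function names rotate_90_clockwise and box_blur are hypothetical, and a production pipeline would typically use an imaging library and arbitrary small rotation angles.

```python
def rotate_90_clockwise(image):
    """Rotate a grayscale image (list of pixel rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def box_blur(image):
    """Replace each pixel with the integer mean of its 3x3 neighbourhood,
    using only the neighbours that fall inside the image at the edges."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) // len(neighbours))
        blurred.append(row)
    return blurred


# Hypothetical 4x3 "page" with a single bright pixel.
page = [
    [0, 0, 0],
    [0, 255, 0],
    [0, 0, 0],
    [0, 0, 0],
]

rotated = rotate_90_clockwise(page)
blurred = box_blur(page)
```

The blur spreads the single bright pixel across its neighbours, which is the degradation effect that makes a clean sample resemble a poor-quality scan.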

“As evidenced by the market growth in IDP, there is rising demand for automated processing of documents, at greater speed and accuracy, despite limited availability of training data,” said Anil Vijayan, Partner at Everest Group.

“Innovations in synthetic data generation and data augmentation will help enterprises overcome training challenges and facilitate even greater adoption of IDP solutions across a wide variety of use cases.”