Optical Character Recognition Can Mean Bad Analytics

By Greg Council

Even in today’s digital world, documents proliferate in every part of an organization – from accounts management (invoices, checks, and remittances) and human resources (employment applications and benefits forms) to engineering and manufacturing (design documents) and sales and marketing (sales plans and marketing collateral). Documents remain a key vehicle for making transactions and business processes work, and so document management continues to be a challenge. Optical character recognition (OCR) was the initial answer to many document-management woes, but in the age of big data analytics, OCR shortcomings can result in bad analytics.

OCR and Forms Capture

There is a steady increase in documents that are born and managed in digital form, but the volume of paper documents continues to increase as well. Dealing with these documents typically has been a labor-intensive and, therefore, costly process.

OCR made it possible to convert the words on a page to computer-readable text. From there, staff could work with digital documents and even copy and paste the text into their word processing or spreadsheet software. However, OCR alone did nothing to reduce the data entry requirements common to modern applications. Enter forms capture. Because OCR can convert images of documents into text and determine the position of that text, it was possible to design software that “mapped” a form and could extract and use only certain information. A claims form, for example, could be converted to text, and the data could be broken into patient and service data that would be entered into a claims system automatically, without manual data entry.
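The forms-capture idea can be sketched in a few lines. This is a minimal illustration, not any vendor’s implementation: OCR output is assumed to be a list of words with page coordinates, and a hand-built “form map” assigns each field a rectangular zone. The field names and coordinates are invented for the example.

```python
# Illustrative form map: each field is a rectangular zone (x1, y1, x2, y2)
# on a known, fixed form layout.
FORM_MAP = {
    "patient_name": (50, 100, 300, 130),
    "service_code": (50, 200, 200, 230),
}

def in_zone(word, zone):
    """True if the word's top-left corner falls inside the zone."""
    x1, y1, x2, y2 = zone
    return x1 <= word["x"] <= x2 and y1 <= word["y"] <= y2

def extract_fields(ocr_words, form_map):
    """Group OCR words into named fields using the form map."""
    fields = {name: [] for name in form_map}
    for word in ocr_words:
        for name, zone in form_map.items():
            if in_zone(word, zone):
                fields[name].append(word["text"])
    # Join words in reading order (assumes OCR emits them left-to-right).
    return {name: " ".join(words) for name, words in fields.items()}

ocr_words = [
    {"text": "Jane", "x": 60, "y": 105},
    {"text": "Doe", "x": 120, "y": 105},
    {"text": "A123", "x": 60, "y": 210},
]
print(extract_fields(ocr_words, FORM_MAP))
# {'patient_name': 'Jane Doe', 'service_code': 'A123'}
```

The approach works only because the layout is fixed: every field sits in the same place on every instance of the form, which is exactly the assumption the article goes on to challenge.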

Unfortunately, most documents present more complex data problems that OCR and forms capture alone cannot solve.

OCR Is Not Information

Modern OCR output is like the data contained within a Word document. A Word document contains text and the attributes of that text, including fonts and positional data. The contents of a Word document do not describe the data within the document – there’s no automatic metadata about that document or analysis of which data points are most important. Both of these are key to document organization and management.

OCR has the same data problem. If you use OCR on a form or a document, it does not become information that is directly usable within a business process or transaction. A lot of work must still be done by humans, and automation is limited.

Take, for instance, a business check. When the payee receives a check, an employee has to enter its information into an accounting system. Doing this requires staff to view the check and locate the payer, amount, and date. If the employee simply scans the check and runs OCR, the result is the text of the check without any indication of which data corresponds to the required information (that is, the date, amount, or business issuing the check). For that, you need additional capabilities beyond OCR.

In the example of the claims form, the processor determines the type of data based on its position on the document, using a “document map.” This technique does not work for business checks because the required data can appear in different places on a check; position alone will not solve the problem.

Further, OCR does not inform the user that the document is a check. If a series of documents needs to be processed, including remittance advice, a staff member first must identify the type of document and then use her knowledge to parse it and enter the required information into the system. Without that human intervention, the result can be bad analytics.

Turning OCR into Information

Organizations that need to process all types of documents have an even more difficult challenge. The complexity of turning OCR output into meaningful and useful information is the primary reason its overall adoption rate is relatively low, hovering around 30 percent for business processes and 40 percent for classification, based upon research we conducted with AIIM last year.

The ability of “brittle” document maps or raw OCR output to identify document types and provide transactional or business process information is limited at best. A decade ago, the hope was that search technologies would add more meaning to OCR output; the result was more time spent sifting through mostly irrelevant search results. In recent years, considerable work has gone into deriving useful information from the text output of OCR.

Initially, the state of the art was keyword and numeric pattern matching to identify target document types and then locate the required information. More recent advances employ a number of sophisticated algorithms and machine-learning techniques, which reduce the up-front work required as well as some of the production maintenance.
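That earlier generation of keyword and pattern matching is easy to picture. The sketch below is illustrative only – the keywords and regular expressions are invented, not drawn from any real product – but it shows the two steps the text describes: scoring keywords to guess the document type, then using numeric patterns to pull out candidate fields.

```python
import re

# Invented keyword lists per document type, for illustration.
DOC_KEYWORDS = {
    "invoice": ["invoice", "bill to", "amount due"],
    "remittance": ["remittance advice", "payment reference"],
}

# Numeric patterns for currency amounts and US-style dates.
AMOUNT_PATTERN = re.compile(r"\$\s?\d{1,3}(?:,\d{3})*\.\d{2}")
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def classify(text):
    """Pick the document type whose keywords appear most often."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(kw) for kw in keywords)
        for doc_type, keywords in DOC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def locate_fields(text):
    """Pull amount and date candidates out of raw OCR text."""
    return {
        "amounts": AMOUNT_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }

ocr_text = "INVOICE\nBill To: Acme Corp\nAmount Due: $1,240.50\nDate: 03/15/2024"
print(classify(ocr_text))       # invoice
print(locate_fields(ocr_text))  # {'amounts': ['$1,240.50'], 'dates': ['03/15/2024']}
```

The brittleness is apparent: every new document type means more hand-written keywords and patterns, which is exactly the up-front work that machine learning reduces.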

For example, document classification now can use both the text output of OCR and visual elements beyond OCR’s capabilities to determine a document class, with machine learning automating the document-class rules. An invoice has a fairly common “look” that visual classification can identify, and invoices share common terms that text classification can exploit.

These new technologies also can improve automation of data extraction by analyzing the proximity of groups of data to improve the accuracy of field location, and by learning from user input, all without manually created special rules.
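Proximity-based field location can be sketched simply. In this minimal, invented example, instead of fixing a field to a zone, we find the value token nearest to a label token (“Total:”), which tolerates layouts that shift from document to document – the check problem described earlier.

```python
import math
import re

def distance(a, b):
    """Euclidean distance between two OCR words' positions."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])

def find_field(ocr_words, label, value_pattern):
    """Return the value matching value_pattern that sits nearest the label."""
    labels = [w for w in ocr_words if w["text"].lower() == label.lower()]
    values = [w for w in ocr_words if re.fullmatch(value_pattern, w["text"])]
    if not labels or not values:
        return None
    anchor = labels[0]
    return min(values, key=lambda v: distance(anchor, v))["text"]

ocr_words = [
    {"text": "Total:", "x": 400, "y": 700},
    {"text": "$88.20", "x": 470, "y": 700},  # adjacent to the label
    {"text": "$12.00", "x": 100, "y": 300},  # a line item far away
]
print(find_field(ocr_words, "Total:", r"\$\d+\.\d{2}"))  # $88.20
```

Because the rule is “nearest amount to the label,” the same logic keeps working when the total moves to a different corner of the page, where a fixed zone would fail.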

When it comes to documents that involve less form-based information and more prose, such as contracts and marketing materials, these new, advanced technologies can take something that is difficult to describe and provide actionable metadata automatically, efficiently, and accurately.

Big Data and Documents

For document-based information, it all comes down to locating and extracting the most relevant information. Often, this is the metadata. Document metadata can be added to the wealth of database-oriented data, which allows organizations to perform the following functions:

  • Information governance. With accurate document classification and metadata, organizations can understand what information assets they have and control access both internally and externally. Organizations can use this information to manage storage and retention.
  • Data analysis. Organizations can review key data, such as contract dates and terms, to provide summaries of all in-force agreements against actual revenue the contracts are meant to drive. Spend data located in invoices, or even receipts, can be aggregated, categorized, and reported.
  • Straight-through processing. With transactional data automatically located and extracted, traditional document-based workflows can be transitioned to more automated processes that do not rely on human intervention. And the data within these documents can be added to other databases to compare planned revenue and expenses to their actual counterparts.

Accurate metadata makes it possible to fully leverage the value of an organization’s documents.

All of these capabilities would be impossible without OCR as a base technology. Scanned documents are just images. Understanding how to put OCR output to use is a fundamental capability that is a critical step in an organization’s progress toward further automation and gaining more intelligence from its document-based information. It is not the end, but rather the start.

Ultimately, adoption of OCR is a function of understanding the benefits of automation for document-based processes and transactions: reduced processing costs, faster workflows, improved data visibility, and enhanced data governance. For good analytics, it’s also critical to understand OCR’s limitations. OCR is simply an enabling technology. It gets document-based data into a condition that can then be refined into meaningful information through classification and data extraction. It’s important to identify organizational areas that rely significantly on documents and to understand what document types and corresponding data can be identified and extracted to support automation efforts that address real business needs.

Greg Council is the Vice President of Marketing and Product Management at Parascript. Formerly, he led product management at Evolving Systems and Captaris, now OpenText. He can be reached at greg.council@parascript.com.