Solving “Bad Data” -- A $3 Trillion-Per-Year Problem

Data quality is still a struggle for many enterprises, and new types of data and input techniques are adding to the burden. Jonathan Grandperrin, CEO of Mindee, explains the impact of bad data and how enterprises can use deep learning and APIs as part of their “good data” strategy.

What are some examples of “bad data” and how has bad data changed over the last five years or so?

Jonathan Grandperrin: Bad data is incorrect or even inaccessible information that exists within the enterprise. Although it has always existed, what has changed in the past five years is the effect it can have on an organization's well-being. As the world has become more digitized, the reliance on accurate and available data has increased. Today, more than ever before, leaders need relevant information at their fingertips, and failure to achieve such agility results in leaders drawing inaccurate conclusions that can have costly short- and long-term repercussions.

What are the causes of bad data and what does bad data mean for businesses?

Bad data stems from information being entered into systems incorrectly. Take, for example, the struggle that enterprises face in extracting data from paper-based and digital documents. Many companies manually enter key information from important documents (usually scanned docs, PDFs, images, or even pictures of said documents). Because manual extraction is so time-consuming, it results in poor or unreadable data and an increased chance of human error. Although it's 2022 and a lot of what we do is virtual, you might be surprised to learn that there are 2.5 trillion PDF documents in the world. It only makes sense to focus efforts on perfecting document processing.

Businesses running on bad data experience serious monetary loss, among other things. A few years ago, IBM reported that businesses lost $3 trillion per year due to bad data. Today, Gartner estimates that poor-quality data costs organizations an average of $12.9 million per year. Apart from the major impact on revenue, bad data (or the lack of data) also leads to poor decision-making and business assessments in the long run. The truth is, data can’t help business leaders if it’s not accurate and accessible. To maximize efficiency, processes need to run with real-time data.

Luckily, businesses can act fast by implementing advanced technology. Application programming interfaces (APIs), for instance, help organizations build fast and efficient workflows that operate smoothly, reducing errors and wasted effort. The problem is that many industries and companies have yet to build such technology into their processes, which is why they lack proper access to their data and real-time agility.

Why is the ability to access data in real time so critical for today’s organizations?

Data powers the modern world. Organizations of all types depend on digital efficiency to deliver more intelligent services and achieve business growth. Given its importance and ubiquity, data must be easily accessible across the enterprise. True decision-making power lies in being able to pull together company data quickly and with the peace of mind that it is accurate. Controlling data holds enormous value because it ensures the quality of the information used to build your business, make decisions, and acquire customers. In our fast-paced world, making the right decision early in the cycle can make or break a company or a new product launch. Agility is necessary to survive globalization because companies are no longer competing only locally but with everyone, even those on other continents.

In today’s digital landscape, data serves as a main source of efficiency. How can companies put together “good data” strategies that will power their frictionless digital environments?

The first step in putting together a “good data” strategy is establishing a strong information base. By adopting advanced technologies that help with robust data management, leaders can begin to power frictionless environments. When it comes to setting up the base for success, data extraction APIs, as I mentioned, are a game-changer because they can make data more structured, accessible, and accurate, increasing digital competitiveness. A few other things are important: ensuring data portability, adopting proper data extraction algorithms, being security conscious, and establishing strong learning models.

What key organizational challenges do APIs solve? What are the benefits for data-driven organizations? Are there any drawbacks?

Take the example of extracting text from a document. This is now commonplace in the tech industry thanks to optical character recognition (OCR) technology. Being able to turn the contents of a document into malleable text is the first step, but more often than not, only some of the information in a document is needed -- the rest creates noise. Too much data, or too much non-useful data, can prevent proper data analysis. Selecting just the necessary information is the key to making the right decisions.
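As a rough illustration of selecting only the needed information from OCR output, here is a minimal Python sketch. The sample text, the field names, and the regular expressions are all hypothetical; fixed patterns like these only work for one known layout and merely stand in for the trained models a real extraction service would use.

```python
import re

# Hypothetical OCR output for a single receipt (illustrative only).
OCR_TEXT = """\
ACME CLOTHING
Thank you for shopping with us!
Date: 05/26/2022
T-shirt 19.99
Jeans 49.50
Tax 5.56
TOTAL 75.05
Returns accepted within 30 days
"""

# Only the fields we care about; every other line is treated as noise.
FIELD_PATTERNS = {
    "date": re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})"),
    "tax": re.compile(r"^Tax\s+(\d+\.\d{2})$", re.MULTILINE),
    "total": re.compile(r"^TOTAL\s+(\d+\.\d{2})$", re.MULTILINE),
}


def select_fields(ocr_text: str, patterns: dict) -> dict:
    """Return only the fields whose pattern matches, keyed by field name."""
    selected = {}
    for name, pattern in patterns.items():
        match = pattern.search(ocr_text)
        if match:
            selected[name] = match.group(1)
    return selected


if __name__ == "__main__":
    print(select_fields(OCR_TEXT, FIELD_PATTERNS))
```

The greeting, the line items, and the returns notice are dropped, leaving only the three fields an analyst asked for.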

However, extracting the right information from documents is not an easy task. Documents can come from different sources, and even if they contain the same type of information, they may not display it in the same fashion.

Think about the receipts you get when buying clothes at different stores. They all contain pricing, taxes, items bought, date, store name, and so on, but how this information is displayed differs from store to store. For example, the date can be written in U.S. format as 05/26/2022 (MM/DD/YYYY), in day-first format as 26/05/2022 (DD/MM/YYYY), or even textually, as May Twenty-Sixth 2022. Not to mention that different vendors use different fonts and that many documents include handwritten information.
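To make the date problem concrete, the following Python sketch normalizes a few of these layouts into a single date value. The function name, the `day_first` parameter, and the list of formats are assumptions made for illustration. Dates with spelled-out day names are not handled, and month names are parsed using the default English locale.

```python
from datetime import date, datetime

# Candidate formats, tried in order. For an ambiguous numeric date such as
# 05/06/2022, the first matching format wins, so the order matters.
US_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%B %d %Y", "%b %d %Y"]
DAY_FIRST_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%B %d %Y", "%b %d %Y"]


def parse_receipt_date(text: str, day_first: bool = False) -> date:
    """Parse a receipt date written in one of several common layouts."""
    formats = DAY_FIRST_FORMATS if day_first else US_FORMATS
    cleaned = text.strip().replace(",", "")
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # Not this layout; try the next one.
    raise ValueError(f"Unrecognized date format: {text!r}")


if __name__ == "__main__":
    for raw in ["05/26/2022", "26/05/2022", "May 26, 2022"]:
        print(raw, "->", parse_receipt_date(raw).isoformat())
```

Note that a value like 05/06/2022 cannot be resolved from the text alone. The sketch relies on the caller knowing the vendor's convention, which is exactly the kind of context a hand-written rule tends to lack.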

With the intake of so many different document types, leaders looking to have efficient, effective, fast, and reliable processes need to turn toward machine learning and computer vision. Services built on these technologies often provide APIs in two categories. The first category consists of APIs with predefined data models -- meaning the information and type of information we want to extract from a document are preset -- and algorithms already trained on massive amounts of documents for common use cases such as receipts and invoices.
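A predefined data model can be pictured as a fixed, typed schema that every extracted document must fit. The Python sketch below is a hypothetical illustration of that idea; the `Receipt` class, its field names, and the raw input shape are assumptions and do not reflect any particular vendor's API.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Receipt:
    """Hypothetical preset schema: fixed field names and fixed types."""
    store_name: str
    purchase_date: date
    total: Decimal
    tax: Decimal


def receipt_from_raw(raw: dict) -> Receipt:
    """Convert raw extracted strings into the typed, predefined model.

    Raises KeyError if a field is missing and ValueError (or
    decimal.InvalidOperation) if a value cannot be converted.
    """
    return Receipt(
        store_name=raw["store_name"].strip(),
        purchase_date=date.fromisoformat(raw["purchase_date"]),
        total=Decimal(raw["total"]),
        tax=Decimal(raw["tax"]),
    )


if __name__ == "__main__":
    raw = {
        "store_name": " ACME CLOTHING ",
        "purchase_date": "2022-05-26",
        "total": "75.05",
        "tax": "5.56",
    }
    print(receipt_from_raw(raw))
```

Because the schema is fixed in advance, downstream systems can rely on the same fields and types regardless of which store issued the receipt.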

The second category consists of APIs from data extraction providers that make it easier to extract special cases of information. Without deep learning knowledge, users can define specific data models for their API and train the API by uploading their documents and selecting, on each document, the information that corresponds to each field to be extracted. In general, the more documents used to train the model, the better the results will be.
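The workflow of defining custom fields and labeling uploaded documents can be sketched as simple bookkeeping. The Python example below is hypothetical: the class names, field names, and filenames are invented for illustration, and it only tracks labeled examples rather than performing any training.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledDocument:
    """One uploaded document plus the text the user selected per field."""
    filename: str
    labels: dict[str, str]  # field name -> text selected on the document


@dataclass
class CustomModel:
    """A user-defined data model and its growing set of labeled examples."""
    fields: list[str]
    training_set: list[LabeledDocument] = field(default_factory=list)

    def add_example(self, doc: LabeledDocument) -> None:
        # Reject labels for fields that are not part of the data model.
        unknown = set(doc.labels) - set(self.fields)
        if unknown:
            raise ValueError(f"Labels for undefined fields: {sorted(unknown)}")
        self.training_set.append(doc)

    def examples_per_field(self) -> dict[str, int]:
        # Count how many labeled examples exist for each field.
        return {
            name: sum(1 for doc in self.training_set if name in doc.labels)
            for name in self.fields
        }


if __name__ == "__main__":
    model = CustomModel(fields=["policy_number", "expiry_date"])
    model.add_example(LabeledDocument(
        "policy_001.pdf",
        {"policy_number": "PN-1042", "expiry_date": "2023-01-31"},
    ))
    model.add_example(LabeledDocument(
        "policy_002.pdf", {"policy_number": "PN-2077"}
    ))
    print(model.examples_per_field())
```

Counting examples per field is a simple way to spot fields that have too few labeled documents before any training is attempted.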

This second option requires a bit more work than using a predefined API, but it is still easier than building all the logic yourself. However, keep in mind that properly implementing this type of API requires a certain level of expertise to choose and define the right algorithm for each occasion and to best manage the data at hand. Going from the idea to a data-driven reality takes a team of data scientists and machine learning software engineers, which is why adopting tools already in existence may be the easier path for busy enterprise leaders looking for a faster and more robust implementation.

Jonathan Grandperrin is co-founder and CEO at Mindee. You can reach the author via email or through Twitter (@JGrandperrin) or LinkedIn.