Five Questions To Ask About Your Machine Learning Data

Ciarán Daly

November 15, 2018

by Mark Brayan

Across industries, there has been an explosion of AI applications and a corresponding demand for machine learning-based solutions. As we ask computers to solve increasingly complex problems that mimic human functions like speech and sight, those problems have outgrown the explicit rules and mathematical models that traditional programming can handle. AI requires machine learning, and machine learning requires data: a lot of the right kind of data. Without high-quality training data, teams may not build the right kind of solution, and models may not make the best decisions for the customer.

In a recent study from Oxford Economics and ServiceNow, 87% of CIOs reported that machine learning provided 'substantial value' or 'transformative value' to the accuracy of decisions, but 51% of respondents cited data quality as a barrier to their company's adoption of machine learning.

To build a successful machine learning program, it’s critical to have a data strategy in place. These are the five questions innovation leaders should be asking to establish an initial data set and to continuously improve training data quantity and quality.

1. Does the data even exist for your model?

Historically, businesses haven’t always collected all the attributes and behaviors needed to create an accurate model. As more companies recognize the need for data-driven automation, they are dedicating effort to collecting the right kind of data to build machine learning solutions. For example, an eCommerce retailer might want to connect data around customers’ past purchases, browsing behavior, and communication preferences to build a model for recommending a next best action to a customer. Doing this requires a wealth of data, and typically only the largest retailers have a significant enough volume to maintain it.

For companies that lack large volumes of internal data, there are alternatives. For some use cases, there are publicly available or for-purchase data sets that companies can use. For companies that have some data, but not enough for it to be statistically significant, it’s critical to collect a higher volume, which can generally be done with a crowdsourced data collection and annotation service.

2. Do you have enough data?

Think of machine learning data like survey data: the larger and more complete your sample size, the more reliable the conclusions will be. If the sample data set isn’t big enough, it won’t take all the variations into account, and the machine may reach inaccurate conclusions, learn patterns that don’t actually exist, or fail to recognize patterns that do exist.

Take a speech recognition system, for example. Some experts recommend at least 10,000 hours of audio speech data to get a recognizer to begin working at modest levels of accuracy. Spoken languages and human voices are extremely complex, with infinite variations among speakers of different genders, ages, and dialects.

You could train a model on textbook English, but the resulting system would likely struggle to understand anything that strays from the textbook: loose grammar, people with foreign accents or speech disorders, and those who use slang, jargon, and filler words or sounds like “ah” and “um.” The more the machine learning data accounts for real-world variation, the better the AI system will be.
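To make the sample-size point concrete, here is a minimal, hypothetical sketch using scikit-learn's synthetic data generator (none of it comes from the article): the same simple classifier is trained on progressively larger samples and scored on held-out data, and its accuracy generally climbs as the sample covers more of the real variation.

```python
# Illustrative sketch: how sample size affects generalization.
# Uses scikit-learn's synthetic data; all numbers here are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=40,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for n in (100, 1_000, 10_000):               # increasing training-set sizes
    model = LogisticRegression(max_iter=1_000)
    model.fit(X_train[:n], y_train[:n])      # train on the first n examples
    print(n, round(model.score(X_test, y_test), 3))
# Held-out accuracy typically improves as the training sample grows,
# because a larger sample captures more of the real-world variation.
```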

3. Is the data structured?

One of the primary reasons companies struggle to build machine learning and AI-powered products is a lack of access to data. Even after companies go to the effort of aggregating the disparate information their data science team needs, the data often isn’t structured in a way that models can be built around it.

For example, consider an eCommerce retailer that wants to build a model for extracting attributes like brand name and size from product titles. If the retailer has thousands of SKUs of user-generated content, there are bound to be differences in syntax and labeling.

A human touch is usually required to annotate and label the training data before feeding the SKUs into the model. When humans annotate and categorize the unstructured data, they provide additional context and nuance that an algorithm could not otherwise parse.
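As an illustration of what that human annotation might produce, here is a hypothetical Python sketch: two user-generated product titles with inconsistent syntax, each labeled with brand and size spans that a downstream attribute-extraction model could train on. The field names and span format are assumptions made for the example, not a standard schema.

```python
# Hypothetical example of human-annotated product titles, structured for an
# attribute-extraction model. Field names and the span format are illustrative.
annotated_titles = [
    {
        "title": "Acme TrailRunner Shoes Men's Size 10 Blue",
        "labels": [
            {"attribute": "brand", "text": "Acme",    "start": 0,  "end": 4},
            {"attribute": "size",  "text": "Size 10", "start": 29, "end": 36},
        ],
    },
    {
        "title": "trailrunner acme sneakers blue sz10",   # inconsistent syntax
        "labels": [
            {"attribute": "brand", "text": "acme", "start": 12, "end": 16},
            {"attribute": "size",  "text": "sz10", "start": 31, "end": 35},
        ],
    },
]

# Human annotators resolve inconsistencies ("Size 10" vs. "sz10") that an
# algorithm could not reliably parse from raw, user-generated titles.
for example in annotated_titles:
    for label in example["labels"]:
        assert example["title"][label["start"]:label["end"]] == label["text"]
```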

4. Is the data clean?

It’s generally accepted that incorrect or poor-quality input will produce faulty output. Or, more simply put: garbage in, garbage out.

With even the most appropriate model, a machine trained on bad data will learn the wrong lessons, come to the wrong conclusions, and fail to work as you or your customers expect. On the flip side, even a basic algorithm can provide value if you have good data at sufficient volume.

What defines “bad” data? The data may be irrelevant to your problem, inaccurately annotated, misleading, incomplete, or biased. In a given training data set, it’s imperative that every column is labeled correctly, and that any human annotators who are categorizing or tagging the data are free from biases.
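A few basic checks can surface much of this before training. The sketch below assumes a hypothetical annotated file, labels.csv, with a "label" column; it simply counts missing labels, duplicate rows, and the class balance, any of which can flag incomplete, repeated, or skewed data.

```python
# A minimal sketch of pre-training data checks.
# Assumes a hypothetical labels.csv with a "label" column; adapt to your data.
import pandas as pd

df = pd.read_csv("labels.csv")                           # annotated data set

print("rows:", len(df))
print("missing labels:", df["label"].isna().sum())       # incomplete data
print("duplicate rows:", df.duplicated().sum())          # repeated examples
print(df["label"].value_counts(normalize=True))          # class balance

# Many missing labels, large numbers of duplicates, or a heavily skewed label
# distribution are all signs the data needs cleaning or re-annotation before
# it is fed to a model.
```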

5. Is the data easy to refresh?

An initial data dump is only the first step to training your model. If your model is part of a long-term solution, it’s vital to continuously train the model on updated data. If you don’t have the requisite number of users or data pipeline, crowdsourcing the necessary behaviors is an easy way to make sure you can build solutions for real human behavior, without diverting your engineering team’s vital resources.
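One way to operationalize that refresh is sketched below, under the assumption of a simple tabular data set and a scikit-learn model; the file names and model choice are placeholders rather than a prescribed pipeline.

```python
# A rough sketch of the "refresh" step: retrain on the latest annotated data
# on a schedule instead of relying on a one-time data dump.
# File names and the model class are illustrative placeholders.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

def refresh_model(data_path="latest_labeled_data.csv",
                  model_path="model.joblib"):
    df = pd.read_csv(data_path)                      # newest annotated examples
    X, y = df.drop(columns=["label"]), df["label"]
    model = LogisticRegression(max_iter=1_000).fit(X, y)
    joblib.dump(model, model_path)                   # swap in the refreshed artifact
    return model

# Run nightly or weekly from a scheduler so the model keeps tracking real
# human behavior instead of drifting out of date.
```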

AI is only as good as the data that trained it. With high volumes of significant, correctly annotated, consistently refreshed data — made accessible to the teams who are solving business problems — companies can reliably build better models and better machine learning solutions.

Mark is the Chief Executive Officer at Appen, a global leader in the development of high-quality, human-annotated datasets for machine learning and artificial intelligence.

Appen brings over 20 years of experience capturing and enriching a wide variety of data types including speech, text, image and video. Appen has deep expertise in more than 180 languages and access to a global crowd of over 1,000,000 skilled contractors.
