The Crux of Supervised Learning: Annotated Training Data


June 26, 2019


by Jelani Harper

SAN FRANCISCO - Machine learning is one of the best examples of just how resource-intensive applications of artificial intelligence can be. This is particularly true of supervised learning which, although responsible for some of the core facets of automation AI is acclaimed for, requires sizable quantities of labeled training data, and human effort to produce those labels.

“People hear a lot about machine learning and they think that everything goes automatic but, in reality, supervised learning is what works the best,” commented Franz CEO Jans Aasman. “Supervised learning takes a lot of human effort to label things.”

This effort is necessary for the vast majority of supervised learning implementations, whether applied to speech recognition, image recognition, aspects of computer vision, or most other tasks. Although there are various techniques to reduce the amount of training data required, at some point humans are needed to label training data so supervised learning models have examples of what they're expected to predict.

Coming up with that training data and getting it appropriately labeled by humans is one of the longest-standing challenges associated with machine learning, especially supervised learning. “Most machine learning is supervised learning, and most supervised learning is incredibly expensive,” Aasman indicated. Nevertheless, by utilizing an artful combination of human intelligence and IT resources, organizations can overcome the challenge of training supervised learning models with annotated training data.


Human-managed AI labelers

When building supervised learning models to train image recognition systems to distinguish cats from dogs, for example, one of the initial steps is to gather a sufficient number of cat images to serve as training data. Enormous quantities of training data are required for this task and other advanced applications of supervised learning to account for their implicit complexity. If the training set contained only a limited number of cat pictures showing faces, the models wouldn't be able to recognize pictures focused on other parts of cats' bodies. Thus, immense, diverse training datasets are a vital part of teaching models all the dimensions or characteristics for recognizing cats.

“When you train a machine learning model on cats for pictures (this is what I mean by an AI labeler), you go to Google and ask for Google Images of cats,” Aasman revealed. “You get 10,000 pictures and then a human being takes out everything that’s not a cat.” This process isn’t necessarily a manual one, which would be inordinately time consuming. Aasman mentioned, “You don’t have to look for all the cats; all you do is use some other computer stuff to make it easy to find cats. Now you have a guaranteed set of cats and then you let loose with your machine learning.”
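In practice, that "other computer stuff" often amounts to a pretrained model that shortlists likely cats so humans only verify the candidates. A minimal sketch of such machine-assisted pre-labeling might look like the following, assuming PyTorch and torchvision are available; the file names, threshold, and choice of ResNet-50 are illustrative assumptions, not details from Aasman's description.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# ImageNet classes 281-285 are the domestic cat labels
# (tabby, tiger cat, Persian cat, Siamese cat, Egyptian cat).
CAT_CLASS_IDS = range(281, 286)

# Load a classifier pretrained on ImageNet (no training of our own yet).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Standard ImageNet preprocessing for the pretrained weights.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def probably_cat(path, threshold=0.5):
    """Flag an image for human review if the pretrained model sees a cat."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)[0]
    return sum(probs[i].item() for i in CAT_CLASS_IDS) >= threshold

# Humans then verify only the shortlist, not all 10,000 raw images.
shortlist = [p for p in ["img_0001.jpg", "img_0002.jpg"] if probably_cat(p)]
```

A quick human pass over the shortlist then yields the "guaranteed set of cats" Aasman describes, at a fraction of the fully manual effort.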


Alternative labeling approaches

The importance of labeling training data based on human authority is beyond dispute. It's also a foundational reason why supervised learning is so resource-intensive. “If you train on cats then you have a little bit of help from other systems to give you the cats,” Aasman noted. “But if you do not train on cats, how is the system ever going to know what a cat is?” Similarly, in speech recognition systems or certain aspects of text analytics, humans are required to label different entities used in context as positive or negative, so similar data is appropriately classified when models encounter it in the future. In these situations and others, such example data for training purposes (or the mechanisms for expediting the labeling process) may not always be available.
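As a concrete illustration of that dependence on labels, the sketch below trains a toy sentiment classifier with scikit-learn; the four hand-labeled examples are invented for illustration, and a production system would need vastly more.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Human annotators supply the labels the model learns from.
texts = [
    "The service was quick and friendly",
    "I waited an hour and nobody helped me",
    "Great value, would recommend",
    "The product broke after one day",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features feed a logistic regression classifier; the model
# can only recognize classes it has seen labeled examples of.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

# Classify unseen text using what the human labels taught the model.
print(classifier.predict(["Fast delivery and helpful staff"]))
```

Without the `labels` list, which only humans can reliably provide, there is nothing for the model to fit, which is exactly the expense Aasman describes.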

Consequently, a considerable market is growing to accommodate annotated training data for supervised learning applications of machine learning. “There’s a lot of people in low labor countries that can do the labeling for you,” Aasman said. “There’s whole companies around labeling.” In other instances, consumers are unwittingly used to provide training data for machine learning systems. CAPTCHA systems are in place to verify that users are human when signing up for services, and require people to identify images of crosswalks, storefronts, and the like. “It’s using people to train the AI model,” Franz VP of Global Sales and Marketing Craig Norvell observed. “When you sign up for something, it gives you images and they don’t know [what they are]; it’s just feeding back into the machine learning.”


Counteracting bias, increasing trust

Annotated training data is the basis for teaching supervised learning models to accomplish the specific tasks for which they're deployed. Organizations are responsible both for procuring adequate amounts of training data and for labeling it for this technology to work, which places the initial burden of this approach on humans, not the machines deploying the technology. Large, diverse, annotated training datasets are essential to fulfill this aspect of the model building process, diminishing the incidence and degree of bias while bolstering transparency, accountability, and trust.

Jelani Harper is an editorial consultant serving the information technology market, specializing in data-driven applications focused on semantic technologies, data governance, and analytics.
