The quest for better training data

Aristotelis Kostopoulos from Lionbridge answers questions about data preparation, and the challenges of data labeling for machine learning models

by Max Smolaks 10 December 2019

Aristotelis Kostopoulos

American localization specialist Lionbridge Technologies has been employing machine translation tools for many years. Eventually, its customers started asking for multilingual training data. Today, Lionbridge has a separate division entirely dedicated to AI, doing everything from collection of chatbot training data to image annotation, audio transcription and even multilingual content moderation services.

To find out more about the work of the division, AI Business talked to Aristotelis Kostopoulos, vice president of product solutions, artificial intelligence at Lionbridge.

Q: The AI division at Lionbridge grew out of the machine translation business, but today it does so much more. What are some surprising examples of the third-party machine learning projects that Lionbridge has been involved in?

AK: Lionbridge has been involved in artificial intelligence for many years now. Our main focus is to provide data to our customers to train and test their models, so customer requests are diverse and cross many disciplines, including natural language processing, computer vision, and data collection and labeling.

One collection task we recently completed, for example, involved the video recording of 3,000 participants across different demographics. Labeling examples include text annotation, speech transcription, and image labeling. While we are focusing heavily on providing our customers with data, we have also been involved in projects where we did the end-to-end AI development. These projects involve the creation of part-of-speech (POS) taggers, where we employ active learning strategies to efficiently train models with less labeled data, as well as other NLP tasks like named entity recognition (NER) and sentiment analysis.
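As a rough illustration of the active learning idea mentioned above, the sketch below uses uncertainty sampling, one common strategy: the model's least confident predictions are sent to human labelers first. It is not Lionbridge's method. The function names, the tokens, and the probabilities are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, k):
    """Return the ids of the k unlabeled items the model is least sure about.

    predictions: dict mapping item id -> list of class probabilities.
    """
    # Higher entropy means a flatter distribution, i.e. more uncertainty.
    ranked = sorted(
        predictions,
        key=lambda item: entropy(predictions[item]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical POS-tag probabilities (noun, verb, adjective) for three tokens.
predictions = {
    "run": [0.50, 0.45, 0.05],
    "table": [0.97, 0.02, 0.01],
    "light": [0.40, 0.25, 0.35],
}

# Only the two most ambiguous tokens go to human annotators.
to_label = select_for_labeling(predictions, k=2)
print(to_label)
```

The confidently tagged token ("table") is skipped, so labeling effort goes where the model is most likely to learn from it.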

Q: In surveys of data scientists, data preparation constantly comes up as one of the most labor- and time-intensive parts of their job. Many also say it’s the least enjoyable. Why, and what is the solution?

AK: Data preparation is important, mission-critical work that ensures AI projects are set up for success. Before feeding a machine learning algorithm with data, you need to start with data collection, then move to data pre-processing and finally data transformation. These tasks can make up more than 80% of any given AI project. The usual challenges are a lack of necessary data, messy data (duplicates, missing data points, et cetera), unbalanced data, and data that is not ready for ingestion. Such data is a machine learning scientist’s nightmare. Two ways to ease the pain and reduce the time spent are to use self-service data preparation tools or to outsource complete parts of the data preparation process. Data collection and processing are very good examples of work that can be outsourced to specialized companies. These companies can collect high-quality data according to provided specifications, then augment, clean, and label it. Companies specializing in this area can also scale more quickly and produce the desired data in less time, reducing engineers’ data preparation work and enabling them to spend time where it matters most.
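The "messy data" problems named above (duplicates, missing data points, unbalanced classes) can be made concrete with a minimal sketch. It assumes records are plain dictionaries with hypothetical `text` and `label` fields. The `prepare` helper and the sample rows are invented for illustration and do not describe any particular tool.

```python
from collections import Counter

def prepare(records, required=("text", "label")):
    """Drop incomplete and duplicate records, then report label counts."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip records with a missing or empty required field.
        if any(not record.get(field) for field in required):
            continue
        # Treat records with identical required fields as duplicates.
        key = tuple(record[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    # Label counts reveal class imbalance before any training starts.
    label_counts = Counter(record["label"] for record in cleaned)
    return cleaned, label_counts

# Hypothetical raw sentiment data with typical defects.
raw = [
    {"text": "great product", "label": "positive"},
    {"text": "great product", "label": "positive"},  # duplicate
    {"text": "arrived late", "label": None},         # missing label
    {"text": "", "label": "negative"},               # missing text
    {"text": "works fine", "label": "positive"},
    {"text": "broke in a week", "label": "negative"},
]

cleaned, label_counts = prepare(raw)
print(len(cleaned), dict(label_counts))
```

Half of the raw rows are discarded here, and the surviving set is skewed toward one label. Both facts are cheap to find before training and expensive to find after.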

Q: Why is unstructured data such a hot topic in machine learning?

AK: With advances in machine learning and the introduction of deep learning, we are now able to process and derive insights from unstructured data, which was impossible a few years ago. Today, text, images, audio, and video, all different manifestations of unstructured data, are produced in volumes that can grow exponentially. Imagine the number of emails or tweets generated every single day, of images posted on social media, and of sensor data produced by IoT devices. By combining this data with structured data in machine learning, we can get better insights. For example, Lionbridge is part of a consortium of six companies and two universities that leverages cross-lingual economic signals from social media, online newspapers, trade publications, and blogs to extract sentiment data. This European Union-funded innovation project is called SSiX (Social Sentiment analysis financial IndeXes), and the data was used to create indices that support investment decision making, enhancing traditional prediction methods based on structured data (more precisely, stock prices).

Q: As part of its data preparation services, Lionbridge employs humans to annotate training data for machine learning projects. Why is this important? Do you see a time when this work will be done entirely by algorithms?

AK: In machine learning, there are three main learning paradigms: supervised, unsupervised, and reinforcement learning. The state-of-the-art algorithms and most AI-enabled products on the market today are based on the supervised learning paradigm, which requires high-quality, labeled data. While data is abundant today, getting high-quality, labeled data is mainly manual work conducted by humans. Humans, with their knowledge and intuition, prepare the data for machine learning experts to train and test AI products. Of course, there are also some challenges to human-performed data labeling. The process is time-consuming, requires workforce skills that are task-dependent, and can be very complex. Quality is always a major concern, and ensuring it can be expensive.

There’s a lot of work being done to alleviate these challenges, both by data service providers and by scientific communities, employing strategies like machine learning-assisted labeling, transfer learning, domain adaptation, active learning, et cetera. While these techniques are successful in many areas, the need for human data labeling is still increasing, and will be for years to come. The main reason is that machine learning has only just begun to expand into certain industries, and everything built is domain-specific. As you change business areas, you need new, domain-specific labeled data, even when you can leverage knowledge from previously built models. This expansion outpaces any gains provided by these strategies, leading to increased demand for labeling.

Lionbridge saw that need a few years back and invested in quality workflows that address the most important factor in the learning process: high labeling quality. We’re also investing heavily in data services and crowd management technologies. Our acquisition of Gengo, for example, brought in a sophisticated data annotation platform, which we’ve continuously developed and enhanced with new features.


Meet Aristotelis and the Lionbridge AI team at the AI Summit New York, December 11-12