by Aristotelis Kostopoulos, Lionbridge
20 February 2020
It’s no secret that we’re living in the era of big data, with new data points being generated at extraordinary rates. Today as a society, we generate roughly double the amount of recorded information we did two years ago.
And by 2025, we’ll triple that amount, with overall volumes projected to reach an overwhelming 175 zettabytes of information. How much data is that, really?
Well, a single zettabyte is equal to one trillion gigabytes. To understand the size of a gigabyte, a standard smartphone stores only 32 of them. So to house 175 zettabytes, every person on earth would have to use roughly 700 phones – about 5.5 trillion phones in total. In other words, 175 zettabytes is a lot. As a result, the industry’s traditional data storage and processing methods are no longer adequate, presenting computer scientists and engineers with a myriad of challenges. New solutions are essential. And once we put AI on the case, this data explosion will give us an equally large number of new opportunities for analysis that smaller data sets simply couldn’t provide.
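The arithmetic behind that comparison can be checked in a few lines. This is a back-of-the-envelope sketch that assumes 1 zettabyte = 10^21 bytes, a 32 GB phone, and a world population of about 7.7 billion (a 2020 estimate; none of these constants come from the article itself):

```python
# Back-of-the-envelope check of the 175-zettabyte figure.
ZETTABYTE = 10**21            # bytes in one zettabyte
total_bytes = 175 * ZETTABYTE # projected 2025 data volume
phone_bytes = 32 * 10**9      # a standard 32 GB smartphone
population = 7.7 * 10**9      # rough 2020 world population

phones_needed = total_bytes / phone_bytes        # ~5.5 trillion phones
phones_per_person = phones_needed / population   # ~700 phones each

print(f"{phones_needed:.2e} phones, ~{phones_per_person:.0f} per person")
```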
Today’s engineers use three primary machine learning paradigms: supervised learning, unsupervised learning, and semi-supervised learning. The first, supervised learning, relies on labeled data. Engineers have clear outputs and inputs they use to train a model. Once trained, new instances from the same distribution tend to be categorized correctly.
Then, on the other end of the spectrum, there’s unsupervised learning, which works with an unlabeled data set. Developers don't know the instance labels, so algorithms have to use underlying information about the actual data sets in order to perform clustering and other AI tasks. Finally, there’s semi-supervised learning – a hybrid between the two. It uses smaller sets of labeled data and larger amounts of unlabeled data in order to create a learning framework where the model improves by iteration.
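The difference between the two ends of the spectrum can be sketched in a toy example. Below, the same six 1-D data points are handled both ways: the supervised pass uses the labels to build a nearest-centroid classifier, while the unsupervised pass ignores the labels and lets a tiny k-means loop discover the two groups on its own. (The data, the nearest-centroid classifier, and the 1-D k-means are all illustrative choices, not anything from the article.)

```python
# Toy illustration of supervised vs. unsupervised learning in pure Python.
data   = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = [0,   0,   0,   1,   1,   1]  # available only to the supervised pass

# Supervised: learn one centroid per labeled class.
def class_centroids(xs, ys):
    out = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        out[label] = sum(pts) / len(pts)
    return out

def predict(x, centroids):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

centroids = class_centroids(data, labels)
print(predict(1.1, centroids))  # falls in the low cluster -> class 0
print(predict(4.7, centroids))  # falls in the high cluster -> class 1

# Unsupervised: no labels, so cluster with a minimal 1-D k-means (k=2).
def kmeans_1d(xs, k=2, iters=10):
    centers = xs[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

print(sorted(kmeans_1d(data)))  # two centers, near the two natural groups
```

The point of the sketch: with labels, the model is told what the groups mean; without them, it can only recover the structure that is already in the data.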
As the amount of data grows, supervised learning faces five data labeling challenges: budget, quality, complexity, time, and workforce skills/availability.
- Budget is exactly what it sounds like: in a world where Google processes 3.5 billion search queries per day, where Instagram users post 54,000 photos in a single minute, and where 3,000 tweets are sent every second, there’s no shortage of data – but there’s almost always a limitation on the amount of money clients have to process it.
- Quality is the number one predictor of project success; the adage “garbage in, garbage out” holds true. If a model isn’t trained with high-quality data, there’s nothing that even the best AI engineers can do to keep the end product from failing.
- Complexity can come from high volumes of data, of course, but it’s also found in project planning. Does a project have too many guidelines, or just enough? The longer the specs, the more time it takes for labelers to read through and fully understand them before beginning their work – and the more room for possible human error.
- Time is always a challenge. As users, we may be creating more data every day, but the majority of businesses have a fixed date for product release.
- Workforce skills and availability include engineers and project managers, of course, but for multinational or multicultural products, a linguistically skilled workforce is required as well.
Fortunately, the five challenges have two solutions: quality setup/workflows, and smart labeling, which generally happens in one of three ways – through machine learning + human-in-the-loop, transfer learning, or active learning.
Until recently, data labeling was a long and tedious process where people meticulously validated each and every piece of new data. But today, by combining the power of technology and people, AI-powered labeling tools tremendously speed up the process. This is called 'machine learning + human-in-the-loop.' Using this technique, labeling tools can rapidly identify named entities and parts of speech in text, and pre-annotate transcriptions and images. Before, people had to manually annotate individual words for natural language processing (NLP). Today’s labelers run object detection algorithms for the early work, then just go in and fix any incorrect labels.
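The workflow above can be sketched as a confidence-gated loop: the model pre-labels everything, high-confidence predictions are accepted automatically, and only the uncertain ones reach a human. The mock model, its scores, and the 0.8 threshold below are all invented for illustration; in practice the model would be a real NER or object-detection system and the threshold would be tuned per project.

```python
# Sketch of a machine learning + human-in-the-loop labeling pass.
def mock_model(token):
    """Stand-in for a trained model: returns (label, confidence)."""
    scores = {"Paris": ("LOCATION", 0.95),
              "Lionbridge": ("ORG", 0.60),
              "runs": ("O", 0.99)}
    return scores.get(token, ("O", 0.30))

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; tune per project

def prelabel(tokens):
    """Auto-accept confident predictions; queue the rest for human review."""
    accepted, review_queue = [], []
    for tok in tokens:
        label, conf = mock_model(tok)
        if conf >= CONFIDENCE_THRESHOLD:
            accepted.append((tok, label))
        else:
            review_queue.append((tok, label))  # a human fixes these
    return accepted, review_queue

auto, manual = prelabel(["Paris", "Lionbridge", "runs", "quickly"])
print(auto)    # high-confidence labels kept as-is
print(manual)  # only these go to a human annotator
```

The human effort now scales with the model's uncertainty rather than with the size of the dataset, which is where the speed-up comes from.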
Transfer learning takes a different approach. Traditional learning is isolated: it occurs purely on a specific task and dataset, and no knowledge is retained or transferred from one model to another. Transfer learning breaks that isolation, reusing features, weights, and similar data from previously trained models to train newer ones. In NLP, for example, AI engineers can leverage previously trained embeddings like Word2Vec or GloVe for tasks like sentiment analysis.
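The embedding-reuse idea can be sketched without downloading anything. In the toy below, the "pretrained" 2-D word vectors are made up for illustration (a real system would load GloVe or Word2Vec vectors); the downstream sentiment task then needs only a handful of labeled examples, because the transferred vectors already encode the useful structure.

```python
# Sketch of transfer learning via frozen, reused word embeddings.
# These 2-D vectors are invented stand-ins for real pretrained embeddings.
PRETRAINED = {
    "great": (0.9, 0.1), "love": (0.8, 0.2), "good": (0.7, 0.1),
    "awful": (0.1, 0.9), "hate": (0.2, 0.8), "bad":  (0.1, 0.7),
}

def sentence_vector(text):
    """Average the pretrained vectors of the known words in a sentence."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vecs:
        return (0.0, 0.0)
    return (sum(v[0] for v in vecs) / len(vecs),
            sum(v[1] for v in vecs) / len(vecs))

# Downstream "classifier": class prototypes built from a few labeled words.
POSITIVE = sentence_vector("great good love")
NEGATIVE = sentence_vector("awful bad hate")

def sentiment(text):
    """Label by squared distance to the nearest class prototype."""
    v = sentence_vector(text)
    d_pos = (v[0] - POSITIVE[0])**2 + (v[1] - POSITIVE[1])**2
    d_neg = (v[0] - NEGATIVE[0])**2 + (v[1] - NEGATIVE[1])**2
    return "positive" if d_pos < d_neg else "negative"

print(sentiment("I love this great product"))
print(sentiment("what an awful bad experience"))
```

Only the tiny prototype step is task-specific; all the representational work was done by the (here simulated) pretraining, which is exactly the leverage the paragraph describes.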
With active learning, the algorithm is allowed to choose the data it learns from, which not only makes it perform better, but requires substantially less data for training. This can really reduce the amount of data a company needs to label in order to get started. The main active learning scenarios are membership query synthesis, stream-based selective sampling, and pool-based sampling.
With the first, membership query synthesis, the learner can request a label for any unlabeled instance in the input space, including queries it generates from scratch rather than samples drawn from the underlying natural distribution. This doesn’t tend to work well with human-in-the-loop, though: a synthesized data point may be too ambiguous or artificial for a human annotator to label reliably.
But with stream-based selective sampling, the key assumption is that unlabeled instance acquisition is either free or inexpensive, so instances can be sampled from the actual distribution. Then the learner can decide whether to request the corresponding label, making the decision individually for each instance.
Pool-based sampling is the most common. With it, a small set of labeled data and a large pool of unlabeled data are both available. The learner is supplied with a set of unlabeled examples from which it can select queries. More than one instance can be created at a time.
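Pool-based sampling with uncertainty selection can be sketched in a few lines. The scorer below is a stand-in (a logistic curve over a single made-up feature, not a trained model), and the pool values and budget are invented for illustration; the idea is only that the learner asks a human to label the items whose predicted probability sits closest to the 0.5 decision boundary.

```python
# Sketch of pool-based active learning with uncertainty sampling.
def model_prob_positive(x):
    """Stand-in scorer: probability that item x is the positive class."""
    return 1 / (1 + 2.718281828459045 ** (-x))  # logistic over a 1-D feature

pool = [-4.0, -0.3, 0.1, 3.5, 0.05, -2.0]  # unlabeled examples

def select_queries(pool, budget):
    """Pick the `budget` items nearest the decision boundary (p = 0.5)."""
    return sorted(pool, key=lambda x: abs(model_prob_positive(x) - 0.5))[:budget]

queries = select_queries(pool, budget=3)
print(queries)  # the most ambiguous items -> sent for human labeling
```

Items the model already classifies confidently (like -4.0 and 3.5 here) are never sent out, which is how active learning shrinks the labeling budget.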
While this may sound like a lot of vocabulary – a sort of introductory primer for artificial intelligence 101 – laying out all these techniques and methodologies does have a larger point: data labeling is complicated. We face an abundance of data today that is largely unstructured and unlabeled. Today’s dominant machine learning paradigm is supervised learning, which requires a lot of labeling. But in order to handle tomorrow’s data, the AI industry has to look to these other methods – and new techniques that are yet to be developed – in order to solve the issues of budget, quality, complexity, time, and workforce skills/availability.
Aristotelis Kostopoulos is vice president for product solutions involving artificial intelligence at Lionbridge - a company specializing in language translation, localization, software development and testing, interpretation, and content development services.