by Jelani Harper


The application of the 80/20 rule to the time required to engineer data for analytics versus that spent leveraging analytics results has been well documented—most notably for data scientists. 2018 witnessed the applicability of this tedious latency period to machine learning in particular, as organizations struggled with the data management fundamentals to leverage this technology at scale across their core business processes.

The problem was decidedly paradoxical. Every vendor was hyping some variation of machine learning as part of its purported Artificial Intelligence offerings; libraries of models and Machine Learning as a Service have become some of the most in-demand cloud options. However, these factors simply reinforced the real problem in which, as Paxata EVP and Global Head of Marketing Piet Loubser so artfully observed about organizations, “[They] can buy a model online. If [they] can just get their data then [they] can put it into this format and put it through the model. But they don’t know how.”

The days in which enterprise use of machine learning is constrained by a lack of knowledge about effective data preparation, and by the latency that results, are set to end in 2019. There is now a plethora of techniques for amassing and contextualizing the proper quantities of training data, mitigating bias, and implementing feature engineering to access some of the more advanced applications of this technology.

“If everybody is perceiving the preparation of data as the bottleneck for machine learning, then technologies like graph, automated query generation [and others] can unblock the bottleneck,” offered Cambridge Semantics Chief Technology Officer Sean Martin.

Training data

The immense amount of training data necessary to build accurate machine learning models has long been one of the most consistent impediments to leveraging this technology. According to TigerGraph COO Todd Blaschka, “Every customer wants to bring in more training data because they’re not happy with their current progress with machine learning.” There are certainly progressive data science techniques to circumvent training data requirements for machine learning. In general, however, the more training data organizations have, the more effectively they can “mine through that looking for patterns to create these features,” Blaschka added.

There are numerous approaches for organizations to scale to utilize as much training data as necessary for machine learning models, including graph technology. “I think people universally agree that graph is a good way of capturing unstructured data,” Martin mentioned. Although there are other approaches such as data lakes and cloud stores providing the same scalability for machine learning training data needs, Blaschka noted “with graphs you’re looking at patterns” to detect relationships between data that are ideal for informing machine learning models.
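
As a rough illustration of what "looking at patterns" in a graph can mean for feature creation, the sketch below derives two simple per-node features from an edge list: how many connections a node has, and how many of those connections it shares with a node that has been flagged for review. The edge list, the flagged node, and the choice of features are all hypothetical; this is plain Python, not the API of any graph product mentioned here.

```python
from collections import defaultdict

# Hypothetical edge list: pairs of accounts that transacted with each other.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]


def build_adjacency(edge_list):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adjacency = defaultdict(set)
    for left, right in edge_list:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return dict(adjacency)


def graph_features(adjacency, flagged):
    """Compute two illustrative relationship features for every node."""
    flagged_neighbors = adjacency.get(flagged, set())
    features = {}
    for node, neighbors in adjacency.items():
        features[node] = {
            # Number of direct connections.
            "degree": len(neighbors),
            # Connections this node has in common with the flagged node.
            "shared_with_flagged": len(neighbors & flagged_neighbors),
        }
    return features


adjacency = build_adjacency(edges)
features = graph_features(adjacency, flagged="c")
print(features["a"])
```

Each entry in `features` could become one row of model input, with the relationship counts serving as columns alongside whatever attributes the organization already holds.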

Overcoming bias

The notion of bias is central to contemporary discussions of machine learning’s overarching value, particularly in the wider context of its use as part of AI. According to ASG CPO Swamy Viswanathan, ideal training datasets “have to be of a minimum volume, and it cannot be the same data. It cannot be repetitive, because then there is no new learning. There are various attributes that you need to have.” Without these minimal quantities and variation in training data, machine learning models will inevitably produce biased results.
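
The two conditions Viswanathan names, a minimum volume and an absence of repetition, can be checked before any training begins. The following is a minimal sketch under stated assumptions: the records, the `min_rows` threshold, and the function name `training_data_report` are invented for illustration, and an exact-duplicate count is only one simple proxy for "repetitive" data.

```python
def training_data_report(rows, min_rows):
    """Summarize volume and exact repetition for a list of tuple records."""
    total = len(rows)
    distinct = len(set(rows))
    return {
        "total": total,
        "distinct": distinct,
        # Share of rows that repeat an earlier row exactly.
        "duplicate_ratio": (total - distinct) / total if total else 0.0,
        # Only distinct rows count toward the volume requirement.
        "enough_volume": distinct >= min_rows,
    }


# Hypothetical records: (region, age, responded)
rows = [
    ("north", 34, "yes"),
    ("north", 34, "yes"),  # exact repeat, adds no new learning
    ("south", 51, "no"),
    ("east", 29, "yes"),
]

report = training_data_report(rows, min_rows=4)
print(report)
```

Here four rows are present but only three are distinct, so the dataset falls short of the hypothetical threshold even though its raw row count appears to meet it.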

These models will also engender biased outputs if the selection process of the input data is flawed. Even if organizations have copious datasets with abundant attributes, relying on traditional ETL processes in which they’re merely sampling the data for an extraction “usually gets you very quickly to a bias that’s not your bias, but a bias in the sample of the dataset, and you’re going to be a victim of that process,” Loubser said. A preferable alternative for reducing sampling bias is to leverage visual options for looking at entire datasets so “the data scientist can get a snapshot, a view into all your data, skews and everything,” Loubser remarked.

This way, data scientists can see potential biases in datasets and rectify them before inputting training data into machine learning models. Intelligent algorithms can also redress issues of data quality which may contribute to poor model quality. Still, visual mechanisms enable organizations to “look at all the data, because you don’t have to sample just certain portions of it,” Looker Pre-Sales Data Analyst Jonathon Miller-Girvetz affirmed.
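
The sampling problem Loubser describes is easy to reproduce. The sketch below, which uses an invented dataset and hypothetical field names, compares the mix of customer segments in a naive extract (the first 20 rows) against the mix in the full dataset. Because the made-up data happens to be stored in segment order, the extract sees only one segment.

```python
from collections import Counter


def label_shares(records, key):
    """Return each value's share of the records for the given field."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical dataset that happens to be stored in segment order.
records = (
    [{"segment": "retail"} for _ in range(60)]
    + [{"segment": "business"} for _ in range(30)]
    + [{"segment": "government"} for _ in range(10)]
)

sample = records[:20]  # naive extract: just the first 20 rows

full_shares = label_shares(records, "segment")
sample_shares = label_shares(sample, "segment")

# Segments that exist in the full data but never appear in the sample.
missing = sorted(set(full_shares) - set(sample_shares))
print(sample_shares, missing)
```

A model trained on that extract would never encounter the business or government segments, which is the kind of skew that a view of the entire dataset is meant to expose.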

Feature detection

Feature engineering is a substantial part of the typical machine learning bottleneck; features are training data characteristics that impact machine learning models’ abilities to produce desired outputs from inputs. “A feature is basically describing one column in a table that’s used to train data,” Martin said. “A great deal of time is spent by data scientists cleaning and preparing data in the art of feature engineering.” Nevertheless, there are myriad techniques for significantly accelerating the feature engineering process. In graph settings, for example, “technologies like automated query generation can take the art of feature selection from days and weeks to hours,” Martin maintained. TigerGraph VP of Marketing Gaurav Deshpande described a fraud detection use case in which streaming data from a large telecommunications provider’s call detail records “are integrated with what comes from a customer master in real time to calculate all of the features that are computed. We compute 118 of these in real time as calls come.”
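
To make the call detail record example more concrete, here is a minimal sketch of computing per-caller features incrementally as records arrive and enriching them from a customer master. It is not TigerGraph's implementation: the phone numbers, field names, and the three features shown are hypothetical stand-ins for the 118 features Deshpande mentions, and a plain loop stands in for a real streaming pipeline.

```python
from collections import defaultdict

# Hypothetical customer master keyed by phone number.
customer_master = {
    "555-0101": {"plan": "prepaid"},
    "555-0102": {"plan": "postpaid"},
}

# Hypothetical call detail records, in arrival order.
call_records = [
    {"caller": "555-0101", "callee": "555-0199", "seconds": 12},
    {"caller": "555-0101", "callee": "555-0198", "seconds": 8},
    {"caller": "555-0102", "callee": "555-0199", "seconds": 300},
    {"caller": "555-0101", "callee": "555-0199", "seconds": 10},
]


def new_entry():
    """Empty running totals for a caller seen for the first time."""
    return {"calls": 0, "total_seconds": 0, "callees": set(), "plan": "unknown"}


def update_features(state, record, master):
    """Fold one call record into the running per-caller totals."""
    caller = record["caller"]
    entry = state[caller]
    entry["calls"] += 1
    entry["total_seconds"] += record["seconds"]
    entry["callees"].add(record["callee"])
    # Enrich with the customer master; unknown callers keep a default.
    entry["plan"] = master.get(caller, {}).get("plan", "unknown")


def feature_row(entry):
    """Turn running totals into model-ready feature columns."""
    return {
        "calls": entry["calls"],
        "avg_seconds": entry["total_seconds"] / entry["calls"],
        "distinct_callees": len(entry["callees"]),
        "plan": entry["plan"],
    }


state = defaultdict(new_entry)
for record in call_records:  # stands in for a live stream
    update_features(state, record, customer_master)

row = feature_row(state["555-0101"])
print(row)
```

Keeping running totals per caller means each new record updates the features with a constant amount of work, which is one plausible way to keep feature values current as calls come in.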

As Martin alluded to, part of the challenge of feature engineering is simply standardizing data in a presentable form so that features can be readily extracted from it. Data preparation solutions have multiple means of standardizing aspects of data quality such as “standardizing certain values, filling in blanks of missing values,” Loubser said, in addition to inferring schema on read. Such standardization is key for feature engineering because “in terms of preparing the columns representing possible features that you think might be valuable, there are tons and tons of different techniques,” Loubser continued. “You click on something, and there’s a function available for you to apply.” Miller-Girvetz commented that most aspects of preparing data for machine learning models, including feature engineering, are enhanced by visual capabilities of “being able to scope how much data we have access to and quickly choose what we want.”
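
The two preparation steps Loubser mentions, standardizing values and filling in blanks, can be sketched in a few lines. Everything here is an assumption made for illustration: the rows, the `STATE_ALIASES` lookup, and the choice to fill missing incomes with the median of the known values. Data preparation products expose such steps as point-and-click functions rather than hand-written code.

```python
from statistics import median

# Hypothetical raw rows with inconsistent spellings and a blank value.
raw_rows = [
    {"state": " ca ", "income": "52000"},
    {"state": "California", "income": ""},
    {"state": "NY", "income": "61000"},
    {"state": "ny", "income": "48000"},
]

# Hypothetical lookup mapping known spellings to one standard code.
STATE_ALIASES = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}


def standardize_state(value):
    """Trim, lowercase, and map a state spelling to its standard code."""
    key = value.strip().lower()
    # Unrecognized values are passed through in upper case.
    return STATE_ALIASES.get(key, key.upper())


def prepare(rows):
    """Standardize the state column and fill blank incomes with the median."""
    known = [float(r["income"]) for r in rows if r["income"].strip()]
    fill_value = median(known)
    cleaned = []
    for r in rows:
        income = r["income"].strip()
        cleaned.append({
            "state": standardize_state(r["state"]),
            "income": float(income) if income else fill_value,
        })
    return cleaned


cleaned = prepare(raw_rows)
print(cleaned)
```

After this pass the four rows carry only two state codes and no blanks, so both columns are in a form from which features can be extracted directly.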

Expanding machine learning outputs

Equipping machine learning models with dependable, unbiased training datasets for reliable outcomes is unambiguously the most difficult aspect of deploying this transformative technology. Graph mechanisms and visual approaches to managing data can empower organizations to access data at the scale required for credible machine learning inputs, facilitate feature selection, and assist with overall data quality measures important for this statistical branch of AI. The greater accessibility and overall ease of implementing machine learning attributable to the methods described above will certainly broaden the array of machine learning inputs and outputs in the coming year. Viswanathan described a future in which even coding software is impacted by “using the coded software as a platform on which you can actually bring in a learning model” to reduce the need for manually coding future applications for professional and personal use.

Other fascinating machine learning deployments yielding tangible business value involve image and video recognition use cases in which certain information, including that which is personally identifiable, is redacted. Whether in financial services deployments in which transactions are viewed yet the customer is kept anonymous, health care use cases in which there are training videos for medical procedures but patient identity is confidential, or insurance cases for vehicular accidents, such redactions at scale are only possible “with machine learning, because the actual thing to be redacted could be anywhere,” Viswanathan said. “It will not be in a fixed place. So you have to train [machine learning models] to find it, no matter where it is.”


Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.