Modern Data Preparation: Reinforcement Learning, NLP, Search and Graph Algorithms

Ciarán Daly

July 4, 2019

5 Min Read

by Jelani Harper

SAN FRANCISCO - Just as cognitive computing for practical tasks like data preparation involves more than machine learning, the algorithms driving Artificial Intelligence’s automation extend well beyond supervised and unsupervised learning.

Some of the most foundational work in AI was predicated on static, rules-based algorithms powering expert systems. Today, those algorithms and others supplement supervised and unsupervised learning, helping to push machine learning toward true machine intelligence.

Some of the less discussed, yet equally meritorious algorithms responsible for the automation for which cognitive computing is revered include:

  • Reinforcement Learning: Although technically part of machine learning, reinforcement learning models learn in a considerably different way than models trained with supervised or unsupervised learning.

  • Natural Language Processing: Distributed NLP algorithms are useful for identifying and rectifying differences in spelling for data management mainstays like data quality.

  • Graph Theory: In almost all cases, graph algorithms are highly influential for pinpointing relationships in data, enabling much more efficient, effective completion of jobs for everything from analytics to transformation.

  • Search Algorithms: Search algorithms tokenize data to make it easier—and quicker—to parse for a number of different business functions.

When deployed in conjunction with machine learning staples like supervised learning and unsupervised learning, these algorithms can solve some of the more intractable business problems for data-driven processes including data preparation, data integration, and transformation.


Reinforcement learning

The basic precept of reinforcement learning is that algorithms are trained to produce a result and, when they do so, they learn to get better at it. However, they do so largely without the hefty quantities of labeled training data that supervised learning requires.

Ilknur Kabul, SAS Senior Manager of AI and Machine Learning Research and Development, noted that “in reinforcement learning there’s a sequential decision-making process. We learn through sequentially interacting through the agents.” Practical applications of reinforcement learning are highly useful for integrating datasets during the data preparation phase.

According to Paxata CPO Nenshad Bardoliwalla, “When we’re comparing values in two different columns in two different datasets, we are boosting the signal every time we find a match, and we are de-boosting the signal every time we do not find a match.” This approach enables data prep platforms to “build up a model of which columns are likely to be related to each other,” Bardoliwalla maintained. That model is influential in creating tailored recommendations for integrating datasets prior to analytics or application use.
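To make that scoring idea concrete, here is a minimal sketch, not Paxata’s actual implementation, that boosts a score for every matching value between two columns, de-boosts it for every mismatch, and then ranks column pairs as join candidates:

```python
from itertools import product

def column_match_scores(dataset_a, dataset_b, boost=1.0, de_boost=0.5):
    """Accumulate a match signal for every column pair across two datasets.

    dataset_a / dataset_b: dicts mapping column name -> list of values.
    The score is boosted for every value found in both columns and
    de-boosted for every value that fails to match -- a toy version of
    the reward-style signal described above.
    """
    scores = {}
    for (col_a, vals_a), (col_b, vals_b) in product(dataset_a.items(), dataset_b.items()):
        lookup = set(vals_b)
        score = 0.0
        for value in vals_a:
            score += boost if value in lookup else -de_boost
        scores[(col_a, col_b)] = score
    return scores

# Hypothetical example data; column names are illustrative only.
customers = {"city": ["Austin", "Boston", "Chicago"], "id": [1, 2, 3]}
orders = {"ship_city": ["Austin", "Chicago", "Denver"], "order_id": [101, 102, 103]}

# The highest-scoring pair is the most likely join key to recommend.
ranked = sorted(column_match_scores(customers, orders).items(), key=lambda kv: -kv[1])
print(ranked[0])  # (('city', 'ship_city'), ...)
```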


Search techniques are also valuable for recommending how to create intelligent joins and data integrations. Search methods are responsible for transforming “each of the values inside one of the datasets into a series of tokens which, by the way, is exactly what a search engine does,” Bardoliwalla revealed. Tokenizing the values in datasets makes it much faster to search through them for potential ways of joining them during processes like integration. This is the first step in the ability to “create this very interesting data structure called a trie where we are able to lay out the values in a way that they can be very quickly probed,” Bardoliwalla commented. Reinforcement learning algorithms then scan the data to give end users recommendations on how likely specific joins are to succeed.
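A rough illustration of that tokenize-and-probe step, assuming nothing about the vendor’s internals, might index one column’s tokens in a trie and probe it with tokens from another column:

```python
def tokenize(value):
    """Split a cell value into lowercase tokens, as a search engine would."""
    return value.lower().split()

class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False

class Trie:
    """Lays out tokens so candidate join values can be probed quickly."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token):
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, token):
        node = self.root
        for ch in token:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

# Index one dataset's column, then probe it with tokens from the other.
trie = Trie()
for value in ["Los Angeles", "San Francisco", "New York"]:
    for token in tokenize(value):
        trie.insert(token)

print(trie.contains("angeles"))   # True -- a likely join candidate
print(trie.contains("chicago"))   # False
```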

NLP algorithms

Various NLP methods are essential to disambiguating data from a data quality perspective so organizations have complete, de-duplicated data for applications or analytics. Algorithmic techniques like Metaphone and n-grams are useful for the ability to “automatically determine like values in a column,” Bardoliwalla mentioned. Regardless of the unending variation in how different users refer to a common term (such as the many different spellings of ‘Los Angeles’), distributed NLP algorithms can work together in functions like cluster and edit in progressive data preparation platforms to “automatically figure out that all these values are talking about the same thing,” Bardoliwalla said.

Not only is this approach applicable to hundreds of millions of rows, but also “based on the statistical model that we build [predicated on NLP empowered cluster and edit functionality], we can then make a recommendation to the user and say actually, what you should be calling all four of these different cities that are spelled differently is Los Angeles, spelled this way,” Bardoliwalla added.
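As a simplified stand-in for the distributed cluster-and-edit functionality described above, character n-gram overlap alone can group variant spellings of the same city; thresholds and data here are purely illustrative:

```python
def char_ngrams(text, n=3):
    """Character n-grams of a normalized (lowercased, space-stripped) string."""
    text = text.lower().replace(" ", "")
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def similarity(a, b, n=3):
    """Jaccard overlap between the two strings' n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def cluster(values, threshold=0.4):
    """Greedy clustering: each value joins the first cluster it resembles."""
    clusters = []
    for value in values:
        for group in clusters:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters

variants = ["Los Angeles", "los angeles", "Los Angles", "LosAngeles", "Boston"]
print(cluster(variants))
# [['Los Angeles', 'los angeles', 'Los Angles', 'LosAngeles'], ['Boston']]
```

A statistical model built over clusters like these is what lets the platform recommend a single canonical spelling back to the user, as described above.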


Graph techniques

The incorporation of graph theory (involving both graph data structures and graph algorithms) into democratized, self-service data preparation processes is instrumental for automating workflows. In particular, graph approaches assist with converting ETL (Extract, Transform and Load) from manual, code-intensive projects to ones enabled by a click of a mouse. Intelligent data preparation platforms create knowledge graphs of the metadata from information management projects like ETL to offer detailed data lineage.

By tracing these processes backward (via graph algorithms) to understand where the data came from for certain jobs, “we will automatically generate a graph that describes a workflow that the end user very literally has to just click a button, they don’t have to do anything else, they click on a button and the system says aha, you’re trying to automate these six projects on this frequency, and we have basically generated ETL automatically,” Bardoliwalla proclaimed.
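A toy version of that lineage-driven automation, with hypothetical project and dataset names, stores each project’s inputs and output as a directed graph and walks it backward from a target dataset to recover a runnable workflow order:

```python
# Each ETL project maps its source datasets to the dataset it produces.
# Project and dataset names are made up for illustration.
projects = {
    "load_orders":    {"inputs": ["raw_orders"], "output": "orders_clean"},
    "load_customers": {"inputs": ["raw_customers"], "output": "customers_clean"},
    "join_sales":     {"inputs": ["orders_clean", "customers_clean"], "output": "sales_mart"},
}

produced_by = {meta["output"]: name for name, meta in projects.items()}

def workflow_for(target):
    """Walk the lineage graph backward from a target dataset and return
    the projects in the order they must run to rebuild it."""
    ordered, seen = [], set()

    def visit(dataset):
        project = produced_by.get(dataset)
        if project is None or project in seen:
            return
        seen.add(project)
        for upstream in projects[project]["inputs"]:
            visit(upstream)        # resolve upstream dependencies first
        ordered.append(project)    # post-order gives a valid execution order

    visit(target)
    return ordered

print(workflow_for("sales_mart"))
# ['load_orders', 'load_customers', 'join_sales']
```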

The many shades of automation

The fundamentals of preparing data for consumption—ensuring data quality, transforming data, joining tables and integrating data—will almost always exist so long as data-driven processes are required. The ability to automate these necessities is attributed to various manifestations of cognitive computing and algorithms beyond the typical realms of supervised learning and unsupervised learning. Techniques from reinforcement learning, search, NLP, and graph theory are just as formidable for effecting such automation, and will likely continue to impact AI throughout the enterprise.

Jelani Harper is an editorial consultant serving the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.

