Automating data quality in the Internet of Things era

How Master Data Management can help create order out of chaos

by Jelani Harper 21 August 2019

As enticing as the massive amount of streaming data in the Internet of Things is, it presents a number of pronounced data management challenges for ensuring that information adheres to enterprise conventions of data quality. Consequently, many organizations aren’t getting the full value out of their IoT initiatives.

“With the IoT producing massive amounts of data, a lot of enterprises are leaving it out of analytics because it’s just too cost prohibitive with time and resources to try and map all of this into the analytics models,” said Katie Horvath, CEO of data accuracy specialist Naveego.

The timely incorporation of machine learning technologies can help improve data quality, data profiling, and aspects of Master Data Management (MDM) to consistently automate these operations for reliable analytics models.

This process not only reinforces the transformation of the IoT into the Intelligent Internet of Things (IIoT), but also delivers “insight into that [IoT] data such that it becomes part of data quality and part of the golden record,” Horvath said. “Really, what it leads to down the road is auto mapping of IoT devices.”

The golden record in MDM is a single, all-encompassing dataset that captures all the necessary information from enterprise systems of record and is assumed to be 100 percent accurate.

Intelligent data profiling

Machine learning is directly responsible for the automation of data profiling capabilities that are essential for ensuring good data quality. Data profiling is the rapid statistical assessment of core data attributes to determine structure, format, and other key characteristics. When it’s coupled with machine learning and its advanced pattern identification, it becomes a perfect fit for data quality operations.

This facet of artificial intelligence allows organizations to “use profiling to learn about what the data is inside of a data source, as well as build automated data quality checks,” Naveego CTO Derek Smith explained. “We use the profiling and pair it with machine learning to build quality checks off of the profiled data.”

Machine learning techniques are effective for pinpointing patterns in sensor data related to sensitive or personally identifiable information, and other characteristics of interest in specific use cases. They enable organizations to “make sure that the data is what they expected,” Smith said. Competitive options in this space can profile data at the network edge to support edge computing and additional IoT deployments.
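
Smith’s description of pairing profiling with derived quality checks can be sketched in a few lines of Python. This is a hypothetical illustration, not Naveego’s implementation: `profile_column`, `suggest_checks`, and the tolerance parameter are invented names, and a real profiler would compute far more statistics (cardinality, pattern drift, and so on).

```python
from collections import Counter
import re
import statistics

def profile_column(values):
    """Collect basic statistics for one attribute: nulls, range, value shapes."""
    numeric = [float(v) for v in values if _is_number(v)]
    return {
        "count": len(values),
        "null_rate": sum(v in (None, "") for v in values) / len(values),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": statistics.mean(numeric) if numeric else None,
        # Most common "shapes", e.g. "72.1" -> "99.9", helps spot format drift.
        "top_patterns": Counter(_pattern(v) for v in values).most_common(3),
    }

def _is_number(v):
    try:
        float(v)
        return True
    except (TypeError, ValueError):
        return False

def _pattern(v):
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))

def suggest_checks(profile, tolerance=0.2):
    """Turn a profile into a candidate quality check: a padded range guard."""
    checks = []
    if profile["min"] is not None:
        span = (profile["max"] - profile["min"]) * tolerance
        lo, hi = profile["min"] - span, profile["max"] + span
        checks.append(lambda v, lo=lo, hi=hi: lo <= float(v) <= hi)
    return checks

# Profile a baseline sample of sensor readings, then screen new arrivals.
baseline = ["72.1", "73.4", "71.9", "72.8", "70.5"]
prof = profile_column(baseline)
checks = suggest_checks(prof)

new_readings = ["72.3", "250.0"]
flagged = [r for r in new_readings if not all(c(r) for c in checks)]
# flagged -> ["250.0"]
```

The key idea from the article is the direction of flow: statistics learned from the profiled data become the quality checks, rather than a human writing every rule by hand.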

Data quality

Machine learning is instrumental in automating data quality measures based on data profiling, which is essential for rapid deployment of IoT. “There’s a human workflow side of it,” Smith said. “We are making data quality suggestions and then allowing the user to see those and put those in place.”

Data quality suggestions might include advice to protect sensitive information (with mechanisms like masking) or simply to issue notifications that pressure readings in oil and gas equipment, for example, fall outside specified ranges, possibly warranting user action. This approach automates four dimensions of data quality:

  • Accuracy: Automated data quality checks confirm that data conforms to ranges and characteristics outlined by users, informing them of any variation.
  • Consistency: Implicit in the accuracy of these data quality measures is the consistency of the data profiled.
  • Recentness: Connecting data profiling capabilities with those typical of MDM (such as the golden record) provides visibility into whether users have the most recent data—which is an ongoing issue in the IoT. Moreover, golden records allow users to “look across all the values you have in your system and choose the most common one, for example, if several sources agree upon what that value is,” Smith said.
  • Completeness: The timely usage of golden records also provides insight into “how complete the information you have really is, because you can see whether the systems that should have all this information have it, and whether different systems that make up the whole have the information that they should as well,” Smith added.
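
The most-common-value rule Smith describes, together with the completeness view across systems, can be illustrated with a minimal survivorship sketch. The field names, source names, and `build_golden_record` helper are all hypothetical; a production MDM engine would also weigh source trust and recency rather than relying on majority vote alone.

```python
from collections import Counter

# Each source system reports its own view of the same entity (a sensor asset).
sources = {
    "erp":   {"asset_id": "P-1138", "location": "Line 2", "max_psi": 300},
    "scada": {"asset_id": "P-1138", "location": "Line 2", "max_psi": 300},
    "mes":   {"asset_id": "P-1138", "location": "Line 3", "max_psi": None},
}

def build_golden_record(records):
    """Survivorship by majority vote: keep the most common non-null value
    per field, and report which sources are missing values they should hold."""
    fields = {f for r in records.values() for f in r}
    golden, gaps = {}, {}
    for f in fields:
        values = [r[f] for r in records.values() if r.get(f) is not None]
        golden[f] = Counter(values).most_common(1)[0][0] if values else None
        gaps[f] = [s for s, r in records.items() if r.get(f) is None]
    return golden, gaps

golden, gaps = build_golden_record(sources)
# golden["location"] -> "Line 2"  (two of three sources agree)
# gaps["max_psi"]    -> ["mes"]   (completeness: one system lacks the value)
```

The `golden` dict corresponds to Smith’s “choose the most common one if several sources agree”; the `gaps` dict makes the completeness dimension concrete by showing which contributing systems are missing data.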

Master Data Management

Producing the golden record of data collected in the IIoT is useful in a variety of ways. A golden record helps form the basis of the data quality measures for comparing current results of machine-generated data to the desired ones. In healthcare, for example, “hospital systems have a whole bunch of different devices plugging into their network and they want to make sure that for security purposes they have a golden record of allowed devices,” Horvath said.
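
The hospital scenario Horvath describes amounts to a set-membership check against the golden record of allowed devices. A toy sketch, with invented device identifiers:

```python
# Golden record of devices permitted on the hospital network (hypothetical IDs).
allowed_devices = {"infusion-pump-0042", "mri-suite-01", "telemetry-unit-17"}

def audit_network(observed):
    """Flag any device seen on the network that is absent from the golden record."""
    return sorted(set(observed) - allowed_devices)

rogue = audit_network(["mri-suite-01", "smart-tv-breakroom", "telemetry-unit-17"])
# rogue -> ["smart-tv-breakroom"]
```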

In manufacturing, this approach can create even more tangible business value. “You think about all of the different assembly line machines, and even two different lines having the same machine or having multiple different sensors in the machine,” Horvath said. “That becomes an exponential headache for IoT devices.”

Advanced analytics

Most importantly, the golden record of MDM serves as an optimal source for training datasets for machine learning models. Such a golden record offers an “analytics-ready stream,” Horvath said. The advantages of training machine learning models on streaming data are well documented. “When we think about a training dataset being a snapshot of static [data], well, a leap forward is to make training data in motion,” she added.

These are just some of the benefits a golden record can provide for the IIoT. Beyond ensuring data quality at scale, it helps operationalize enterprise information with cognitive computing technologies.


Jelani Harper is an editorial consultant serving the information technology market, specializing in data-driven applications focused on semantic technologies, data governance, and analytics.