A small world: consolidating data management with supervised and unsupervised learning

by Jelani Harper 26 September 2019

Most organizations are aware that diverse Artificial Intelligence technologies are impacting almost every area of data management, from initial ingestion of data via intelligent integrations to consumer-facing Business Intelligence tools that are democratizing analytics.

Widespread vendor adoption of AI technologies like supervised and unsupervised learning is producing an even greater effect on the enterprise, one that’s unequivocally advancing the self-service movement and the empowerment of business users.

These facets of AI are consolidating the various data management domains, so that tools designed to solve one problem now carry functionality for solving additional business problems. This is evident in the abundance of BI tools that handle data preparation, data quality platforms that rely on analytics, and data discovery instruments that underpin aspects of search, and smart search in particular.

The pairing of machine learning and natural language processing is particularly effective for enabling users to quickly search through data that’s been unearthed with data preparation solutions. “It lets end users search the data assets that have been discovered in a pretty easy manner,” said Rohit Mahajan, CTO of Io-Tahoe. “So, the techie users, the non-techies, the business users and so on, will be able to go on a ‘Google-like’ search path, and be able to search for the meta[data] and for the content.”

Intelligent data discovery

Both unsupervised and supervised learning techniques are critical to the data discovery process that’s enriched by the capacity for smart search. These dimensions of machine learning “take care of the foundation like nobody else does,” Mahajan said. “[Users can] look for the data and get to the right foundation by discovering the relationships and so on.”

The granular understanding of data and metadata supplied by these intelligent algorithms ensures better search efficiency. This capacity is especially notable when it’s leveraged at scale for some of the more pressing enterprise concerns such as regulatory compliance, data governance, and enterprise content management. “If we’re talking about content, it’s become pretty much a needle in a haystack because with terabytes of data, hundreds of terabytes, figuring out a policy number requires really solid work behind the scenes in terms of various indexing and hashing techniques to go and look for that policy,” Mahajan said.
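As a rough illustration of the indexing and hashing Mahajan mentions, the sketch below builds a small inverted index on top of a Python dictionary (a hash table), so that looking up a policy number does not require scanning every document. The file names, the text, and the `PN-` policy number format are all hypothetical, and this is not a description of how any vendor's product works.

```python
import re
from collections import defaultdict

# Hypothetical documents; names, text, and policy number format are invented.
documents = {
    "claims_2019.pdf": "Claim filed under policy PN-48213 for water damage.",
    "renewals.json": "Renewal notice sent for policy PN-10077.",
    "notes.txt": "Customer called about PN-48213 and a billing question.",
}

def build_index(documents):
    """Map each lowercased token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Tokens are runs of letters, digits, and hyphens.
        for token in re.findall(r"[A-Za-z0-9-]+", text.lower()):
            index[token].add(doc_id)
    return index

index = build_index(documents)

# A dictionary lookup replaces a scan over the full text of every document.
hits = sorted(index.get("pn-48213", set()))
print(hits)
```

The index is built once and then reused for every lookup, which is the trade that makes search practical as the volume of content grows.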

Smart search: beyond metadata

Architecturally, the smart search capabilities sit atop the data discovery stack fortified by machine learning. Users can search an array of sources including PDF, Oracle, JSON, or Mongo, Mahajan said. A critical point of distinction for competitive solutions in this space is the capability to search not only the metadata but also the actual data or content of information assets. For instance, if users want to search for accounts, all they’d have to do is type in ‘acc’ and they’d immediately access “all the meta[data] that contains ‘acc’: it could be a rules table; it could be an accounts table,” Mahajan explained. “And, it’s going to show you the content where the name contains ‘acc’.”
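The distinction between searching metadata and searching content can be sketched in a few lines. The example below is a minimal illustration, assuming a hypothetical in-memory catalog of tables; the table names, column names, and the `smart_search` function are invented for the example and do not reflect any product's API.

```python
# Hypothetical catalog: table name -> list of rows (column -> value).
catalog = {
    "accounts": [
        {"account_id": 1, "name": "Acme Corp"},
        {"account_id": 2, "name": "Zenith"},
    ],
    "rules": [{"rule_id": 10, "name": "access_check"}],
    "orders": [{"order_id": 7, "name": "Widget"}],
}

def smart_search(catalog, term):
    """Return (metadata hits, content hits) for a case-insensitive substring."""
    needle = term.lower()
    metadata_hits = set()
    content_hits = []
    for table, rows in catalog.items():
        # Metadata: table names and column names.
        if needle in table.lower():
            metadata_hits.add(table)
        for row in rows:
            for column, value in row.items():
                if needle in column.lower():
                    metadata_hits.add(f"{table}.{column}")
                # Content: the values stored in the rows.
                if needle in str(value).lower():
                    content_hits.append((table, column, value))
    return sorted(metadata_hits), content_hits

meta, content = smart_search(catalog, "acc")
print(meta)     # table and column names containing 'acc'
print(content)  # row values containing 'acc'
```

Here a search for 'acc' surfaces the accounts table and its `account_id` column as metadata matches, and separately surfaces a row in the rules table whose name contains 'acc'.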

Search is rapidly becoming one of the more utilitarian tools throughout the enterprise because it’s a self-service mechanism enabling business users to work the way they do in their personal lives with tools like Google search. In addition to the indexing and hashing Mahajan referenced, search capabilities are partially facilitated by intelligent data tagging. NLP techniques such as n-grams can tag data elements to support search functions, and are useful when combined with unsupervised and supervised learning methods.
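One way n-grams can support this kind of tagging is to break each name into overlapping character sequences and index the names by those sequences. The sketch below assumes character trigrams and a hypothetical list of table names; it is an illustration of the general technique rather than the approach used by any specific tool.

```python
from collections import defaultdict

def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams in the text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_ngram_index(names, n=3):
    """Tag each name with its n-grams: n-gram -> names containing it."""
    index = defaultdict(set)
    for name in names:
        for gram in char_ngrams(name, n):
            index[gram].add(name)
    return index

# Hypothetical data element names.
names = ["accounts", "access_rules", "customers"]
index = build_ngram_index(names)

# Typing 'acc' becomes a single lookup against the tags.
matches = sorted(index["acc"])
print(matches)
```

Because every name is tagged ahead of time, a partial query such as 'acc' resolves to the matching names without comparing the query against each name in turn.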

Data quality

Another recurring business problem that supervised and unsupervised learning can address is duplication, a data quality issue in which the nomenclature for data elements (such as names and addresses) is inconsistent. Variations in spelling can contribute to situations in which users “have five million customers and send them brochures every month, and end up sending them duplicates because Elizabeth is spelled as Liz, for example,” Mahajan said. Organizations can initially rectify this situation with supervised learning approaches involving regression and correlation techniques.

The former is crucial because “you want to start [with] the most slowly regressed data points and you want to start eliminating the outliers,” Mahajan explained. Correlation methods such as fuzzy matching help with the de-duplication process by “bringing Elizabeth and Liz, and Mike and Michael, together based on various attributes,” Mahajan said. Finally, unsupervised approaches such as clustering are useful for presenting users with possible duplications based on user-defined conventions such as names or cities.
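The idea of bringing records together "based on various attributes" and then presenting possible duplicates can be sketched as follows. This is a minimal illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for fuzzy matching, averages the similarity across three attributes, and groups records that clear a threshold using a simple union-find. The customer records, the 0.8 threshold, and the function names are hypothetical, and the grouping step is a simplification of what a clustering algorithm would do.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records.
records = [
    {"id": 1, "name": "Elizabeth Smith", "address": "12 Oak St", "city": "Boston"},
    {"id": 2, "name": "Liz Smith", "address": "12 Oak St", "city": "Boston"},
    {"id": 3, "name": "Michael Jones", "address": "4 Elm Rd", "city": "Denver"},
    {"id": 4, "name": "Mike Jones", "address": "4 Elm Rd", "city": "Denver"},
]

FIELDS = ("name", "address", "city")

def similarity(a, b):
    """Average string similarity (0 to 1) across the compared attributes."""
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in FIELDS
    ]
    return sum(scores) / len(scores)

def candidate_duplicates(records, threshold=0.8):
    """Group record ids whose pairwise similarity meets the threshold."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(a, b) >= threshold:
            parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    # Only groups with more than one record are possible duplicates.
    return sorted(g for g in groups.values() if len(g) > 1)

groups = candidate_duplicates(records)
print(groups)
```

Matching on the name alone would score Elizabeth and Liz fairly low; it is the shared address and city that lift the pair over the threshold, which is why comparing several attributes matters. The output is a list of candidate groups for a person to review, not an automatic merge.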

Reinforcing self-service

Applications of AI are widely revered for their automation capabilities. However, they’re also responsible for breaking down the barriers between traditional data management domains, enabling users to achieve more with the tools they have. Thus, organizations can leverage solutions for data preparation with features traditionally ascribed to other domains, such as search. This consolidation of the capabilities required for data management reinforces the ability to provide self-service for business users and demonstrates AI’s transformative effect on how people work.