by Jelani Harper
SAN FRANCISCO - The essential paradox of Artificial Intelligence is that its most prominent technology, machine learning, is unequivocally dependent on humans. Machine learning’s dynamic algorithms won’t replace static rules-based ones, much as machine intelligence won’t outright supplant human intelligence.
Instead, human input in the form of labeled training data, rules, and even rules-based algorithms is indispensable to the creation and accuracy of machine learning’s predictions. “Rules based algorithms are always good because they start with a human baseline, and then you implement machine learning on top to actually learn from that rules-based algorithm and then do the right prediction,” reflected Io-Tahoe CTO Rohit Mahajan.
It’s a synthesis of these approaches, both rules-based algorithms and evolving machine learning ones, that accomplishes some of the most pressing data-driven tasks today, including smart data discovery, regulatory adherence, and data governance.
Human training, intelligent models
The basis of almost every machine learning model is a training process predicated on human expertise. Some machine learning approaches such as Bayesian models (which Mahajan characterized as “one of the oldest: it’s Bayesian theory, if you will; it’s a statistical inference”) are founded upon knowledge drawn from human subject matter experts. Almost every supervised learning technique requires humans to provide machine learning models with sets of training data, typically labeled, as examples of the tasks their algorithms are to perform.
One of the most difficult tasks, due to the surfeit of regulations involving personally identifiable information (PII) and the speed of the data involved, is “sensitive or restricted data discovery, which could be for GDPR, CCPA, HIPAA or other regulations,” on streaming datasets, Mahajan commented. In most instances, such machine learning tasks require not only human input for training, but also algorithms built from human-devised rules for successful predictions or classifications.
Mahajan mentioned a use case in which accurately tagging streaming data according to PII regulations for different mandates necessitates “machine learning involvement, absolutely. It’s also rules-based. It’s a hybrid approach.” Even unsupervised learning deployments require training periods in which humans teach models.
In certain instances, human-developed policies (such as those for the California Consumer Privacy Act, the General Data Protection Regulation and the Health Insurance Portability and Accountability Act) provide the basis for training models to recognize data types under the jurisdiction of these respective regulations. Again, the time-sensitive nature of streaming data from event-based, satellite, and Internet of Things use cases makes adhering to these regulations particularly challenging.
Once machine learning models have been trained on the various factors to consider for PII, they’re useful in smart data discovery approaches that leverage “sensitive data discovery policies and so on that we have out of the box, from various regulators,” Mahajan said. Particularly effective approaches incorporate rules-based algorithms alongside dynamic machine learning ones.
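To make the hybrid idea concrete, here is a minimal Python sketch of how a rules-based check and a learned scorer might sit side by side when tagging a stream. It is an illustration only, not Io-Tahoe's implementation: the patterns in RULES are simplified and do not represent a complete policy for any regulation, and tag_stream, model_score, and toy_score are hypothetical names invented for this example.

```python
import re
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical, simplified rules. Real policies for GDPR, CCPA, or HIPAA
# cover far more data types than these two patterns.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def rule_tag(value: str) -> Optional[str]:
    """Return the name of the first rule that matches, or None."""
    for name, pattern in RULES.items():
        if pattern.search(value):
            return name
    return None


def tag_stream(
    records: Iterable[str],
    model_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[Tuple[str, bool, str]]:
    """Tag each record as sensitive or not before it lands anywhere.

    The human-written rules decide first; the learned scorer only
    handles records that no rule recognizes.
    """
    for value in records:
        rule = rule_tag(value)
        if rule is not None:
            yield value, True, f"rule:{rule}"
        else:
            yield value, model_score(value) >= threshold, "model"


def toy_score(value: str) -> float:
    """Stand-in for a trained model (assumption): fraction of digit characters."""
    return sum(ch.isdigit() for ch in value) / max(len(value), 1)


results = list(
    tag_stream(
        ["contact: ana@example.com", "4111111111111111", "hello world"],
        toy_score,
    )
)
for value, sensitive, source in results:
    print(f"{value!r}: sensitive={sensitive} (decided by {source})")
```

In this sketch the rules act as the human baseline and the scorer covers what they miss; a production system would replace toy_score with a model trained on policy-labeled data.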
The merit of the machine learning models in place in this use case and others is based on their ability to assimilate the information they learned from humans (and rules-based algorithms) and correctly apply it to future datasets. Thus, they can categorize information as either sensitive or not, as defined by the appropriate policy.
Although machine learning algorithms can effectively tag data at the scale and speed of streaming datasets before that data lands (or even persists) in target systems, it’s human data modelers who “train our models based on existing policies that we have developed,” Mahajan explained. “And, based on those trained models, the machine automatically predicts if data is sensitive or not. That’s where machine learning comes in and gets trained based on the policy.”
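The training flow Mahajan describes can be sketched in a few lines of Python: a human-written policy labels example values, a model learns from those labels, and the model then predicts on values the policy itself would miss. This is a toy illustration under stated assumptions, not the vendor's method: the Bernoulli naive Bayes classifier is chosen here only because the article mentions Bayesian models, and policy_label, features, and TinyNaiveBayes are hypothetical names.

```python
import math
import re

# Hypothetical policy (assumption): two simplified human-written rules.
SSN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def policy_label(value):
    """Human baseline: True if a policy rule marks the value as sensitive."""
    return bool(SSN.match(value) or EMAIL.match(value))


def features(value):
    """Hand-picked binary features (assumption, for illustration only)."""
    digits = sum(c.isdigit() for c in value)
    return {
        "has_digit": digits > 0,
        "has_at": "@" in value,
        "has_dash": "-" in value,
        "mostly_digits": digits > len(value) / 2,
    }


class TinyNaiveBayes:
    """Bernoulli naive Bayes with Laplace smoothing, standard library only."""

    def fit(self, values, labels):
        self.classes = sorted(set(labels))
        self.prior, self.prob = {}, {}
        for c in self.classes:
            rows = [features(v) for v, y in zip(values, labels) if y == c]
            self.prior[c] = len(rows) / len(labels)
            # Smoothed estimate of P(feature is on | class c).
            self.prob[c] = {
                f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2)
                for f in rows[0]
            }
        return self

    def predict(self, value):
        x = features(value)

        def log_posterior(c):
            total = math.log(self.prior[c])
            for f, on in x.items():
                p = self.prob[c][f]
                total += math.log(p if on else 1 - p)
            return total

        return max(self.classes, key=log_posterior)


training_values = [
    "123-45-6789", "987-65-4321", "ana@example.com", "bob@example.org",
    "hello world", "order shipped", "blue widget", "see you soon",
]
# Labels come from the human-written policy, not from hand annotation.
training_labels = [policy_label(v) for v in training_values]
model = TinyNaiveBayes().fit(training_values, training_labels)

unseen = "123 45 6789"  # spaces instead of dashes, so the SSN rule misses it
print(policy_label(unseen), model.predict(unseen))
```

With this small training set, the policy rule rejects the space-separated number while the trained model still flags it, which is the kind of generalization the hybrid approach relies on. The model remains only as good as the policy labels and features humans supply.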
Reinforcing data governance
Much of the current media focus on cognitive computing and machine learning pertains to the abundance of ways in which these technologies aid consumers, enterprises, and society as a whole. However, these gains would not be possible if humans did not facilitate some of the integral aspects of machine learning, such as the model training period and the refinement and re-training often necessary for improving model accuracy. The capability to determine on the fly whether or not streaming datasets are sensitive according to specific regulations exemplifies the effectiveness of combining human intelligence and machine intelligence, cognitive algorithms and traditional rules-based ones.
By building this capability into the data discovery process prior to the actual ingestion of data, organizations increase their ability to enforce data governance policies while reducing the risk of regulatory non-compliance. Furthermore, they’re able to prove that no matter how advantageous machine learning may be for automation, data governance, and enterprise risk reduction, these benefits are ultimately rooted in human intelligence, which speaks volumes about the nature of Artificial Intelligence, and general Artificial Intelligence in particular. “I do not see a day and age where all the humans have been replaced by AI and ML and now we have unemployment at 90 percent,” Mahajan remarked.
There may be a dependency between humans and machine learning. If so, it’s the latter dependent on the former, not the other way around.
Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.