November 5, 2018
by Jelani Harper
The propagation of the enterprise’s ability to capitalize on data-driven processes—to effectively reap data’s yield as an organizational asset, much like any other—hinges on data governance, which arguably underpins the foundation of data management itself.
There are numerous trends impacting that foundation, many of which have always had, and will continue to have, relevance as 2019 looms. Questions of regulatory compliance, data lineage, metadata management, and even data quality will all play crucial roles in how data governance is implemented next year.
Nevertheless, the most profound question facing data governance in the coming months is somewhat less time-honored, but perhaps even more important than the others. TopQuadrant CEO Irene Polikoff best summarized it as: “Machine learning: who is responsible for its results, and how data governance teams have to be responsible, and what has to be put in place, and what do they need to do to understand and govern it, and ensure its quality.”
The short answer is model governance, in which machine learning models adhere to the very rules, roles, and responsibilities that other data governance dimensions do. Perhaps the larger question is how to implement it, and which method is most profitable for organizations looking to reduce risk and enlarge data’s benefits.
What there’s no disagreement about, however, is the salience of the question.
“Ultimately, what every company wants with machine learning is a few things,” ASG Chief Product Officer Swamy Viswanathan confirmed. “They want to make sure that when you change either the [machine learning] model or the training, will the output be repeatable? Meaning, is what it did the last time going to be the same that it’s going to do going forward. That has always been an extremely difficult thing to replicate.”
Governing Machine Learning Models
Machine learning is at the forefront of data governance concerns for two reasons. Firstly, it’s an artificial intelligence component manifesting in some of the most avant-garde, lucrative applications of analytics today. Its potential in this regard is almost matchless among other technologies.
Secondly, it’s been entrenched in a number of data governance processes fundamental to some of the core objectives of this discipline—so it must be reliable and accountable. One of the problems in both of these instances is “models and algorithms have biases in them, by their very nature,” ASG EVP of Product Management and Product Marketing David Downing said. In most cases, those biases are directly attributable to the training datasets for those models.
“Machine learning just learns what you give it,” Franz CEO Jans Aasman explained. “If there’s a bias in your data, the bias will come out.”
Proactive Model Governance
To that end, organizations must proactively govern their machine learning models in a holistic fashion. It’s simply not sustainable to solely expect data scientists to account for what’s actually an enterprise enabling technology.
One way to meet this obligation is to govern at both the macro and micro levels. At the macro level, governance councils and even upper-level management ensure there’s a “governance layer and policy management authorization at the model layer, as well as at the dataset layer,” Downing recommended.
For more granular model governance, Viswanathan mentioned open source options that “give you the ability to define a model, version and manage the model, define the data, associate the data through a version of the model, and keep everything together much the same way as you would keep source code.” Previously, organizations had to build this process themselves with little outside assistance. Today, Viswanathan assured, “those frameworks exist.”
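The idea of keeping a model version and its training data version together, much like source code, can be illustrated with a minimal sketch. It is not any particular framework's API: `ModelRegistry`, `fingerprint`, the model name, and the sample data are all hypothetical, and a real tool would persist these records rather than hold them in memory.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(payload: bytes) -> str:
    """Return a short content hash, used here as a version identifier."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry tying model versions to data versions."""
    entries: list = field(default_factory=list)

    def register(self, name: str, params: dict, training_data: bytes) -> dict:
        # Hash parameters and data separately so a change in either one is visible.
        entry = {
            "model": name,
            "model_version": fingerprint(json.dumps(params, sort_keys=True).encode()),
            "data_version": fingerprint(training_data),
        }
        self.entries.append(entry)
        return entry


registry = ModelRegistry()
first = registry.register("churn", {"depth": 3}, b"id,label\n1,0\n2,1\n")
second = registry.register("churn", {"depth": 3}, b"id,label\n1,0\n2,1\n3,1\n")

# Same model definition, different training data.
same_model = first["model_version"] == second["model_version"]
same_data = first["data_version"] == second["data_version"]
print(same_model, same_data)
```

Because each entry records both identifiers, a change in output can be traced to either a change in the model or a change in the training data, which is the repeatability concern raised earlier.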
Machine Learning Governance
It’s also essential to sufficiently govern the various forms of machine learning and targeted automation functioning as enablers of critical governance requisites.
The deployments for machine learning are extremely diverse, including everything from automating the best way to join multiple datasets to identifying which data is apposite for a particular use case during the data discovery process, or even expediting facets of transformation. But as TopQuadrant CMO and VP of Professional Services Robert Coyne noted, “It’s going to take a lot of effort, a lot of expertise, and new roles people will have to play, with a lot of training, to monitor these things and configure them. That’s all got to be a part of this automation.”
Thus, governance products that embed machine learning, as well as whatever tailored solutions organizations devise with this technology, require continuous monitoring and validation of their outputs against the aforementioned data governance fundamentals.
Incorporating Business Rules
Machine learning is commonly deployed to assist efforts for tagging or classifying data for regulatory requirements. Automation is critical to delivering these benefits at scale across documents or content repositories, but it still necessitates human oversight.
Of equal value, however, is the deployment of business rules for governance policies, which are the basis for machine learning’s identification of data pertinent to them. Those rules and the policies they represent are made by humans, and are critical to ensuring machine learning is performing as needed for automating governance processes, so long as there’s “versioning on rules, and you have rules traceability, and you know what rules fired when, and who changed it,” Viswanathan said.
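Rule versioning and traceability of this kind (which rules fired, when, and who changed them) might look like the sketch below. It rests on stated assumptions: `Rule`, `RuleLog`, the rule name, and the authors "alice" and "bob" are hypothetical and do not come from any vendor's product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    version: int
    changed_by: str
    predicate: Callable[[dict], bool]


class RuleLog:
    """Hypothetical store that keeps every rule version and every firing."""

    def __init__(self):
        self.history = {}   # rule name -> list of Rule versions, oldest first
        self.firings = []   # (rule name, rule version, UTC timestamp)

    def publish(self, name, changed_by, predicate):
        # Never overwrite: each change appends a new version with its author.
        versions = self.history.setdefault(name, [])
        rule = Rule(name, len(versions) + 1, changed_by, predicate)
        versions.append(rule)
        return rule

    def evaluate(self, record):
        fired = []
        for versions in self.history.values():
            rule = versions[-1]  # only the latest version is active
            if rule.predicate(record):
                self.firings.append((rule.name, rule.version, datetime.now(timezone.utc)))
                fired.append(rule.name)
        return fired


log = RuleLog()
log.publish("contains_email", "alice", lambda r: "@" in r.get("text", ""))
log.publish("contains_email", "bob",
            lambda r: "@" in r.get("text", "") and "." in r.get("text", ""))

fired = log.evaluate({"text": "contact: jane@example.com"})
latest = log.history["contains_email"][-1]
print(fired, latest.version, latest.changed_by)
```

Keeping superseded versions rather than replacing them is what lets an auditor answer who changed a rule and which version was in force when a given record was tagged.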
Still, as Aasman noted, “It’s extremely complicated to make fair [machine learning] models with all the context around them.” Both rules and human supervision of models can furnish a fair amount of context for them, serving as starting points for their consistent governance.
The ongoing worth of rules in aiding the governance of machine learning directly contributes to the importance of data provenance in the coming year. Traceability has long been one of the most capable means of demonstrating and effecting regulatory compliance, both with machine learning systems and otherwise.
It’s unsurprising that some of the most remarkable innovations in governance for the coming year relate to provenance. Accenture’s Global Data Governance Practice Lead Bob Doyle referenced Data Lineage-as-a-Service options in which “you can actually go into your source code, understand these [numerous] languages, and follow lineage across file systems, databases, [or] any technology.”
Polikoff described data provenance measures for “the interactive visualization of lineage so you can drill up or drilldown” with intuitive drag-and-drop mechanisms. Aasman mentioned the use of semantic standards in directed acyclic graphs, which function similarly to blockchain, in which Uniform Resource Identifiers (URIs) are examined “to lookup for a URI the type of product that was interchanged, then you can look up what the description of the product was.” Franz VP of Global Sales and Marketing Craig Norvell noted that this includes “the data lineage.”
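A lineage lookup over URIs arranged in a graph can be sketched in a few lines. This is an illustration under assumptions, not the approach of any vendor quoted here: the `urn:example:` identifiers, the `DERIVED_FROM` mapping, and the `lineage` function are all hypothetical.

```python
# Each key is a hypothetical URI; each value lists the URIs it was derived from.
DERIVED_FROM = {
    "urn:example:report": ["urn:example:clean_sales", "urn:example:regions"],
    "urn:example:clean_sales": ["urn:example:raw_sales"],
    "urn:example:regions": [],
    "urn:example:raw_sales": [],
}


def lineage(uri, graph):
    """Return every upstream URI of `uri`, depth-first, each listed once."""
    seen = []
    stack = list(graph.get(uri, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(graph.get(current, []))
    return seen


upstream = lineage("urn:example:report", DERIVED_FROM)
print(upstream)
```

Because every dataset is addressed by a URI, the same traversal works whether the upstream source lives in a file system, a database, or an external feed.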
Such innovations for data provenance are particularly valuable because of the large amount of semi-structured and unstructured data organizations are utilizing outside of their traditional firewalls. According to ASG Senior Vice President of Product Management Marcus MacNeill, these trends simply reinforce how critical provenance is to trusting one’s “information supply chain… in terms of understanding where data comes from, ideally how it originates, and how it moves and flows across the organization.”
Metadata management operates at the nucleus of many data governance processes. Although its particularities are influenced by visual techniques and source code comprehension, metadata management still functions as an integral means of facilitating data provenance. EWSolutions President and CEO David Marco observed metadata management revolves around “the who, what, when, where, how and why of our data. Are you going to be able to do predictive analytics if you don’t understand your data?”
Organizations certainly won’t be able to do so in a governed, sustainable fashion. Metadata management is likely the single governance element that connects all of the most pressing trends for data governance, from governing machine learning models and automation to understanding the provenance needed to demonstrate regulatory adherence.
Effective metadata management not only provides a blueprint for lineage, but also is used “to identify datasets that you’re more interested in,” Downing commented. “The whole world of data preparation, getting ready to do analytics, is helped now by classification: categorization [you] can do based off metadata and…machine learning. Metadata is the core that lets you do both.”
Proving Data Quality
Readily identifying metadata and cataloging it in a consistent fashion will remain critical to implementing trustworthy data governance, resulting in maximum data quality. The assurance of data quality is the lynchpin for trustworthy data, which contributes substantially to determining a uniform truth for specific use cases.
Although extending that governance to machine learning would ideally include facilitating explainability, successful model governance is rooted in determining, monitoring and validating the policies affecting this technology, actuating business rules, and ensuring data provenance for dependable data quality.
Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.