By Jelani Harper

Despite the gradual progress of self-service options, data science is still inordinately slowed by tedious data preparation processes. The creation of machine learning models in particular is plagued by time-consuming data engineering procedures, partly attributable to the large quantities of data required to train them sufficiently.

The reality is that all too often, such efforts go to waste or must be constantly repeated once models are put into production. According to Cambridge Semantics CTO Sean Martin, “When you want to do this in an operational context and actually predict something, those [machine learning] models are very brittle. They have to get the exact sort of type of data that they were getting when you were doing the training; otherwise, you have to retrain them and make separate models.”
Timely, accurate, and consistent data provenance, however, can greatly decrease the enterprise resources and time dedicated to rebuilding or retraining machine learning models once they’ve been operationalized. By understanding where the training data came from, what was done to it, and how it specifically affected a machine learning model, organizations can retrace those steps for models in production and get the most out of the valuable efforts of data scientists.
Thus, provenance is integral to both supervised and unsupervised learning. It also supports lucid machine learning explainability, particularly with a white box, rules-based approach that overcomes the black box questions surrounding several Artificial Intelligence predictive models.
“Having the provenance where you can understand the journey of the data before it got into the model, that provenance helps you understand that journey so it’s much easier for somebody to take the model and recreate that journey in whatever system they’re going to use in an operational context,” Martin remarked.

Operational Data Provenance

Graph database technologies are particularly adept at facilitating data provenance, especially when imbued with semantic understanding of the business meaning of data. The linked data approach of these tools offers complete visibility into exactly what was done to data, and how, since it was ingested by the enterprise. “Another thing that’s important about graphs is it’s easy to create a chain of provenance that’s important when you operationalize models,” Martin said. That traceability is instrumental in allowing organizations to operationalize machine learning models with data as similar as possible to that used in training.
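To make the idea of a chain of provenance concrete, here is a minimal sketch in plain Python rather than in any particular graph database. The class name, the artifact names (such as crm_export.csv and churn_model), and the step descriptions are all hypothetical and chosen only for illustration. Each edge records that one artifact was derived from another by a named step, and walking the edges backward recovers the lineage of a trained model.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    """Hypothetical toy graph: each edge means 'target was derived from source via step'."""

    # Maps an artifact name to the (source artifact, step description) pairs it came from.
    derived_from: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def record(self, source: str, step: str, target: str) -> None:
        """Store one provenance edge."""
        self.derived_from.setdefault(target, []).append((source, step))

    def lineage(self, artifact: str) -> list[tuple[str, str, str]]:
        """Walk back from an artifact to its raw sources.

        Returns (source, step, target) triples. For a simple linear chain
        the triples are ordered from the raw data forward.
        """
        chain: list[tuple[str, str, str]] = []
        seen: set[str] = set()
        stack = [artifact]
        while stack:
            target = stack.pop()
            if target in seen:
                continue
            seen.add(target)
            for source, step in self.derived_from.get(target, []):
                chain.append((source, step, target))
                stack.append(source)
        chain.reverse()
        return chain


# Illustrative data only: these names are assumptions, not from any real system.
graph = ProvenanceGraph()
graph.record("crm_export.csv", "drop rows with missing age", "cleaned_customers")
graph.record("cleaned_customers", "scale income to thousands", "training_features")
graph.record("training_features", "fit classifier", "churn_model")

for source, step, target in graph.lineage("churn_model"):
    print(f"{source} --[{step}]--> {target}")
```

Someone operationalizing the model could read that printed chain from top to bottom to see which transformations need to exist in the production data flow. A real graph database with semantic modeling would carry far richer metadata than this toy structure.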
Oftentimes, data scientists have to perform a number of extensive transformations on data to get the inputs that engender the desired outputs of predictive models. With the simplified traceability provided by graph technologies, organizations can “recreate the stream of data that you used to make the model itself operational,” Martin noted. It’s important to understand that such data lineage is accessible to data scientists (and largely determined by them), but ultimately benefits those operationalizing the models. “The wooly haired data scientist is not the same guy that’s putting this into operations,” Martin commented. “It’s somebody who’s more of a DevOps person who’s going to take the model that’s been trained and make it actually work outside of a lab and inside of a production environment where it’s operating to the fullest.”
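One way to picture recreating that stream of data is to treat the transformations as an ordered, named recipe that is written once and then replayed unchanged in production. The sketch below is only an illustration under that assumption. The field names, the default age value, and the function names are hypothetical, and it does not represent any specific vendor's tooling.

```python
from typing import Callable

Row = dict[str, float]
Step = tuple[str, Callable[[Row], Row]]


def fill_missing_age(row: Row) -> Row:
    """Hypothetical step: use an assumed default when age is absent."""
    return {**row, "age": row.get("age", 40.0)}


def income_to_thousands(row: Row) -> Row:
    """Hypothetical step: rescale income so it matches the training inputs."""
    return {**row, "income": row["income"] / 1000.0}


# The recipe: an ordered list of named transformations, defined by the data
# scientist during training and handed over unchanged for production use.
RECIPE: list[Step] = [
    ("fill_missing_age", fill_missing_age),
    ("income_to_thousands", income_to_thousands),
]


def apply_recipe(row: Row, recipe: list[Step] = RECIPE) -> tuple[Row, list[str]]:
    """Apply every step in order and return the result plus a trace of step names."""
    applied: list[str] = []
    for name, step in recipe:
        row = step(row)
        applied.append(name)
    return row, applied


# Same recipe in the lab and in operations, so the model sees the same shape of data.
training_row, _ = apply_recipe({"age": 31.0, "income": 52000.0})
production_row, trace = apply_recipe({"income": 48000.0})
print(production_row, trace)
```

Because the trace lists every step that touched a row, the person running the model in production can confirm that the data passed through the same transformations used during training.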

Supervised and Unsupervised Learning

The merit of data provenance for machine learning models is realized whether organizations are deploying supervised or unsupervised learning. Data lineage is vital for supervised learning, in which machine learning models rely almost exclusively on the input training data to make accurate predictions.
Although the reliance on training data is not as great in unsupervised learning (in which models are initially trained, yet eventually devise their own patterns and predictions from datasets) as it is in supervised learning, provenance is still necessary there as well. “They still need the data to come from somewhere initially, but then they start iterating on what they’re learning themselves,” Martin mentioned about unsupervised learning models. In the vast majority of cases in which training data is involved, it’s critical to recreate the provenance of that data for effective production.

The primary point of commonality between these two branches of machine learning is “whether you’re using a relational database with a big model with lots of tables and stuff is keyed in, or you’re doing that in an equivalent data model that reflects what’s in the relational model, either way reality is being reflected in those models,” Martin posited. “The schema of the data models reflect some undeniable reality that you’re trying to capture.”

That reality is formed by the training data used so “you’ve still got to get stuff out regardless if you’re doing supervised or unsupervised; you’ve still got to get the data out of that complex structure into something that’s digestible by the next phase,” Martin said. When that phase involves operationalizing machine learning models, retracing the manipulations of the data used for those models is key for successfully putting them into production.

White Box and Black Box Explainability

The traceability of the training data used for machine learning models also has a positive effect on the proverbial black box of explainability associated with machine learning, as well as its rules-oriented, white box AI counterpart.
For the former, it’s still necessary to know where the training data came from as an initial step in trying to decipher the results of machine learning. “Even if you’re doing black box, you still have to recreate the flow of data, unchanged; otherwise, the model’s not going to work,” Martin cautioned. However, provenance takes on renewed emphasis with white box rules systems, since those rules are the basis for the action taken by AI.
“With white box you can actually read the rules, in which case provenance is absolutely essential,” Martin said. With white box systems, organizations need to know what data was used and what happened to that data to see which rules apply to it. Provenance provides that insight.
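As a rough illustration of why readable rules and provenance go together, the sketch below pairs each rule with the input fields it depends on and reports the recorded source of those fields alongside the decision. Everything here is hypothetical: the rule names, thresholds, field names, and lineage strings are assumptions made for the example, not a description of any real rules engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    description: str             # human-readable text of the rule
    requires: tuple[str, ...]    # input fields the rule depends on
    applies: Callable[[dict], bool]
    outcome: str


# Hypothetical rules with made-up thresholds, for illustration only.
RULES = [
    Rule(
        name="high_usage_drop",
        description="usage fell by more than 50 percent",
        requires=("usage_change",),
        applies=lambda r: r["usage_change"] < -0.5,
        outcome="likely_churn",
    ),
    Rule(
        name="recent_complaint",
        description="at least one complaint in the last 30 days",
        requires=("complaints_30d",),
        applies=lambda r: r["complaints_30d"] > 0,
        outcome="needs_review",
    ),
]


def decide(record: dict, lineage: dict[str, str]) -> dict:
    """Return the first matching rule's outcome plus where its inputs came from."""
    for rule in RULES:
        has_inputs = all(name in record for name in rule.requires)
        if has_inputs and rule.applies(record):
            return {
                "outcome": rule.outcome,
                "rule": rule.name,
                "why": rule.description,
                "inputs": {
                    name: lineage.get(name, "unknown source")
                    for name in rule.requires
                },
            }
    return {"outcome": "no_rule_applied", "rule": None, "why": None, "inputs": {}}


decision = decide(
    {"usage_change": -0.6, "complaints_30d": 0},
    {"usage_change": "billing_db -> monthly_rollup"},
)
print(decision)
```

The decision can be read alongside the rule text, and the inputs entry shows which upstream data the rule relied on. A field with no recorded lineage shows up as an unknown source, which is the kind of gap provenance is meant to close.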

Performance Value

Altogether, traceability is a necessary element for ensuring machine learning models function in production as well as they did during their testing phase. It’s useful for both supervised and unsupervised learning, as well as for white box and black box applications of this technology. Those who understand how data was configured to work in testing know which measures to duplicate in operations so models perform up to par.

Provenance not only benefits those operationalizing machine learning models, but also decreases the time data scientists devote to retraining models. “Provenance is what gives that DevOps person or the data engineering people the recipe, if you like, for how to construct the data flow for the particular model that the data scientist has made,” Martin said. “Traceability describes all the transformations that data undergoes.”

Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.