Idealizing Big Data Logistics for Machine Learning Models
July 27, 2018
By Jelani Harper
Operationalizing machine learning for a specific business purpose has traditionally been an exacting process. Data scientists were tasked with procuring representative data samples, understanding business objectives in relation to them, then working on a seemingly ceaseless cycle of testing and retesting for useful predictive results.
Attempts to alter the model, or perhaps supplant it with another, only aggravated the process, considerably delaying time to value for business users attempting to leverage the performance benefits of Artificial Intelligence. Moreover, there were oftentimes intrinsic delays simply with putting those models into production, which created the same effect.
“When you’re deploying a new model you have to figure out how to test it,” MapR Senior Vice President of Data and Operations Jack Norris said about this traditional method. “You have to sample the data, expose it, do some tweaks, and then deploy that new model, figure out how to use it, etcetera.”
Nonetheless, organizations can considerably shorten this process by effectively ingraining data science into the core—as opposed to the fringe—of their organizations to build machine learning models with the same data business units deploy. By optimizing this data into a consistent stream readily available for multiple enterprise purposes, they can build machine learning models faster, more accurately, and with much greater flexibility than they can using conventional silo methods that involve “a lot more moving parts and a lot more issues to get wrong,” Norris said.
Machine learning data logistics become simplified when they’re based on continuously streaming event data that reflect or affect the specific business task at hand. This approach contrasts sharply with typical machine learning model building, in which data scientists have to access data from a data lake or other store. The primary use of such a data stream may be its applicability to business processes, but it is also ripe for informing the creation of machine learning models designed to enhance those processes.
Norris noted it’s critical to organize this streaming data “by topics, so you can slice and dice it in ways that make sense” to business objectives. The efficiency of this approach is attributed to what’s essentially multiple uses of the same data, which impacts “a model that’s using that input and processing it, and then publishing results of that model, which could be part of a larger business process,” Norris explained.
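The pattern Norris describes can be illustrated with a minimal sketch. The `Stream` class below is a hypothetical in-memory stand-in for a persisted, topic-based event stream (a system like MapR Streams or Kafka would play this role in practice), and the `score` function is a placeholder model; both names and the threshold logic are invented for illustration.

```python
from collections import defaultdict

class Stream:
    """Hypothetical persisted, topic-based event stream (illustration only)."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered event log

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def read(self, topic, start=0):
        # Events persist, so any subscriber can read from any offset.
        return self.topics[topic][start:]

def score(event):
    # Placeholder model: flag transactions above an arbitrary threshold.
    return {"id": event["id"], "flagged": event["amount"] > 100}

stream = Stream()
for i, amount in enumerate([50, 250, 75]):
    stream.publish("transactions", {"id": i, "amount": amount})

# The model is just another subscriber to the "transactions" topic;
# it publishes its results to its own topic for downstream processes.
for event in stream.read("transactions"):
    stream.publish("scores", score(event))

print(stream.read("scores"))
```

The same "transactions" topic remains available to every other business process, which is the multiple-uses-of-the-same-data point Norris makes above.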
Testing and Performance
One of the premier boons of relying on streaming data in this capacity is the agility of calibrating and comparing additional machine learning models. For instance, data scientists may opt to select different models emphasizing distinct data traits. If so, they simply “take a new model and have it subscribe to the same stream,” Norris said. “Because the stream has persisted, I can have that read from the beginning of the year and go through all the same events that the original model went through and compare on the same dataset what was the output of the new model.”
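Because the stream persists, a replay comparison of this kind can be sketched in a few lines. The two model functions below are invented stand-ins (simple thresholds) used only to show that both models consume the identical event sequence from the beginning of the log.

```python
# A persisted event log, replayed from offset zero for both models.
events = [{"id": i, "amount": a} for i, a in enumerate([30, 120, 480, 60])]

def model_v1(event):
    return event["amount"] > 100   # original model (hypothetical)

def model_v2(event):
    return event["amount"] > 300   # candidate model, stricter threshold

# The new model replays exactly the events the original model went through.
v1_out = [model_v1(e) for e in events]
v2_out = [model_v2(e) for e in events]

disagreements = [e["id"] for e, a, b in zip(events, v1_out, v2_out) if a != b]
print(disagreements)  # event ids where the two models differ
```

Comparing outputs event by event on the same replayed data is what makes the performance difference attributable to the models rather than to the inputs.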
The ability to build, train, and test models on the exact same data helps isolate differences in their performance. Undergoing this process with different data for each model could introduce biases, a distinct possibility with the conventional approach to creating machine learning models.
Operationalizing in Production
There’s also a host of advantages to relying on streaming business data for machine learning model inputs when it comes to putting models into production. Again, organizations can deploy different models without interrupting production since the same stream is applicable to both the models and larger business processes. “Production’s happening all around the model,” Norris commented. “All I did was add a new subscriber.”
In addition to eschewing the issues which usually accompany putting new models into production—in which production temporarily ceases for this alteration in the workflow—these simplified machine learning logistics give data scientists several options for how their models impact production. Since they can implement additional models with newfound adaptability, they may choose to “start paying attention to the new model and ignore the old model, or…split traffic, or take the aggregate results,” Norris said. “I’ve got a lot of flexibility in terms of how do I leverage that model.” That flexibility enables organizations to pinpoint the model or combination of models that’s most appropriate for business tasks, since they’re no longer encumbered by the sheer logistics of simply building those models and putting them in production.
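The three options Norris lists—replacing the old model, splitting traffic, or aggregating results—can be sketched as a simple router. The function and model names here are hypothetical, and the aggregation rule (requiring both models to agree) is one arbitrary choice among many.

```python
import random

def old_model(event):
    return event["amount"] > 100   # hypothetical incumbent model

def new_model(event):
    return event["amount"] > 300   # hypothetical candidate model

def route(event, mode, rng=random.Random(0)):
    if mode == "replace":
        # Pay attention to the new model and ignore the old one.
        return new_model(event)
    if mode == "split":
        # Send a fraction of traffic to each model.
        return new_model(event) if rng.random() < 0.5 else old_model(event)
    if mode == "aggregate":
        # Take the aggregate: here, flag only when both models agree.
        return old_model(event) and new_model(event)
    raise ValueError(f"unknown mode: {mode}")

event = {"amount": 200}
print(route(event, "replace"))    # False: only the stricter new model runs
print(route(event, "aggregate"))  # False: the models disagree at 200
```

Since every mode reads from the same subscribed stream, switching strategies is a configuration change rather than a redeployment, which is the flexibility described above.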
Centralized Data Science
A substantial amount of the value derived from classic or advanced machine learning pertains to the logistics of creating their underlying models. By centralizing the data science function throughout the enterprise, organizations can more readily involve it with various business processes. The result not only includes increased speed and agility for building machine learning models, but also the penchant for “getting people to think differently about the data,” Norris said.
Jelani Harper is an editorial consultant serving the information technology market, specializing in data-driven applications focused on semantic technologies, data governance, and analytics.