Streamlining Data Science and Analytics Workflows for Maximum ROI
August 2, 2018
By Jelani Harper
Fundamentally, the relationship between data science and analytics is causal: accurate analytics is one of the outputs of data science, and the more effective the data science, the more accurate the analytics. Nonetheless, most organizations tend to isolate these two functions from each other, which inevitably circumscribes the value of both.
“A lot of times, business intelligence, analytics [and] reporting all live in one silo, and data science lives in a totally different silo,” Looker Chief Data Evangelist Daniel Mintz observes. “Each of them builds out their own totally independent, parallel workflow. A lot of those workflows, they branch off and do their own thing, at a certain point. But their beginning steps are all the same.”
Organizations can increase the business value of both data science and analytics by streamlining their fundamental underpinnings of quality, accurate data in a format meaningful to business end users. Those successful in this endeavor not only decrease the impact of silo culture in data-centric firms, but also improve the speed, usability, and ROI of predictive analytics—and the data scientists empowering it.
Efficient Data Wrangling
Despite the multitude of tasks associated with the data science position, its basic workflow (in terms of analytics) is readily codified into three steps. The first is data preparation, or data wrangling, where the data scientist starts with raw data and “just tries to make sense of it before they’re doing anything real with it,” Mintz explains.
“Then there’s the actual model building when they’re building a machine learning model. Assuming they find something valuable, there’s getting that insight back into the hands of the people who can use it to make the business run better.”
Typically, data scientists approach building a new analytics solution for a specific business problem by accessing raw data from what might be a plethora of sources. Next, they engage in a lengthy process to prepare the data for consumption. “So much time and energy goes into that,” says Mintz. “You look at the surveys of data scientists and they say 70-80% of my time goes to data cleaning.”
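As a rough illustration of where that time goes, here is a minimal wrangling sketch in Python with pandas; the file name and columns are hypothetical stand-ins for a raw export from one of many source systems.

```python
import pandas as pd

# Hypothetical raw export pulled from one of many source systems.
raw = pd.read_csv("orders_export.csv")

# Typical wrangling chores before any real analysis can start:
# normalize column names, coerce types, drop duplicates, and
# reconcile inconsistent category labels.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["revenue"] = pd.to_numeric(raw["revenue"], errors="coerce")
clean = (
    raw.drop_duplicates(subset="order_id")
       .dropna(subset=["order_date", "revenue"])
       .assign(region=lambda df: df["region"].str.title())
)
```

Multiply a handful of these steps across dozens of sources and the 70-80% figure becomes easy to believe.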
A much more efficient alternative is for data scientists to get their initial data from platforms designed for universal business access, regardless of where the data physically resides.
The benefits of this method are multifaceted. Data scientists can decrease the complexity of the preparation process by using data already deployed for specific business purposes, which “schematically syncs up the workflow and…also means they’re more likely to come up with something useful for the business,” Mintz remarks.
Moreover, accessing highly distributed data from a centralized platform decreases the time spent on data preparation, since the data is already streamlined for both uses: business consumption and data science. The result is efficient data wrangling, enabling organizations to “unify that, leverage everybody’s expertise, do that once rather than twice, and then go and do their own thing with that data once it’s clean,” Mintz says.
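By contrast with the cleanup above, starting from a curated, business-facing model can collapse that first step into a single query. A minimal sketch, assuming a hypothetical warehouse connection and table name:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to the warehouse behind the shared data
# platform; the connection string and table name are illustrative.
engine = create_engine("postgresql://analyst@warehouse/prod")

# The business-facing model has already standardized types, labels,
# and joins, so the data scientist starts from curated data rather
# than from raw exports.
orders = pd.read_sql("SELECT * FROM analytics.orders_clean", engine)
```

The wrangling logic lives once, upstream, where both the business and the data science team can rely on it.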
Operationalizing Analytics
Perhaps the most important aspect of a data scientist’s job is translating insight from advanced analytics into tangible business value. Conventional data science tools are ideal for building advanced analytics and machine learning models, but much less suited to informing business users how to maximize their yield.
Centralized data access platforms are worthy tools in this regard because they enable data scientists to directly inform business users of the actions associated with the findings of artificial intelligence or other predictive analytics. The result is an enhanced capacity to operationalize analytics results, which is the point of performing them in the first place.
“Try building an interactive map that all your business users can use in R or Python,” Mintz says. “That takes a lot of steps. Try building a scheduling server or an alerting server in R or Python or Databricks. That’s not what they’re for.” However, partly because comprehensive data access platforms are designed with end users in mind, they’re equipped with a number of alerts, schedules, and interactive maps that can address almost any requisite analytics response.
Both data scientists and business users benefit from this functionality: such notifications help data scientists put analytics results in front of the business, and help business users act on them.
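For a concrete sense of what such a notification does, the sketch below approximates a threshold alert in plain Python; real platforms ship this as configuration rather than code, and the connection string, table, metric, and addresses here are all hypothetical.

```python
import smtplib
from email.message import EmailMessage

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse holding model output scored by the data
# science team; all names are illustrative only.
engine = create_engine("postgresql://analyst@warehouse/prod")
churn = pd.read_sql(
    "SELECT region, predicted_churn_rate FROM analytics.churn_scores",
    engine,
)

# Notify business users only when the prediction crosses a threshold
# they care about.
breaches = churn[churn["predicted_churn_rate"] > 0.15]
if not breaches.empty:
    msg = EmailMessage()
    msg["Subject"] = "Churn alert: predicted rate above 15%"
    msg["From"] = "alerts@example.com"
    msg["To"] = "regional-managers@example.com"
    msg.set_content(breaches.to_string(index=False))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```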
Building Machine Learning Models
Another effective means of streamlining the workflows between data scientists and business users is to leverage a common access platform to build analytic models. Perhaps more than the other two steps discussed herein, this capability effectively democratizes data science while reinforcing some of its self-service attributes.
Top data access solutions function as all-inclusive environments in which users access their data, then leverage the capabilities of additional machine learning tools to “run the queries and build the [machine learning] model,” Mintz explains. Thus, all three steps of the data science workflow are performed in a single environment; the data never has to move, and sophisticated end users (“data analysts” according to Mintz, or what Gartner terms citizen data scientists) are able to build their own models.
With preset coding building blocks, these users don’t need to write their own scripts or modify someone else’s for their purposes. Consequently, there’s “none of the coding that people have come to associate with doing advanced machine learning,” Mintz comments. “It’s just a ton simpler. So from that end user’s perspective it’s one tool that they’re using for doing all three steps.”
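A rough open-source analogue of those building blocks is scikit-learn’s Pipeline, where each stage is a preconfigured, reusable component; the synthetic data, feature names, and model choice below are purely illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for curated platform data; in practice this
# would come from the shared data model, as in the earlier sketches.
data = pd.DataFrame({
    "tenure_months": [3, 24, 12, 36, 6, 48, 9, 30],
    "monthly_spend": [20.0, 85.5, 40.0, 120.0, 25.5, 150.0, 33.0, 99.0],
    "region": ["east", "west", "east", "north", "west", "north", "east", "west"],
    "churned": [1, 0, 1, 0, 1, 0, 1, 0],
})
features, labels = data.drop(columns="churned"), data["churned"]

# Each stage is a preconfigured component snapped together, rather
# than bespoke modeling code written from scratch.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["tenure_months", "monthly_spend"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)
model.fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```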
Governed Flexibility
The primary advantages of streamlining data science and analytics workflows are centralized (and accelerated) data wrangling, improved delivery of analytics results, and democratization of machine learning model building. Implicit to all of these benefits, however, is the ability for all users—both data scientists and otherwise—to operate in a structured, governed environment with a reduced reliance on data silos.
In this regard, streamlining analytics and data science culminates in a form of governed self-service data access, one integral to the long-term reuse of data as an enterprise asset: “you can standardize, you can get that governance back, but you can retain the flexibility and agility that folks had before,” Mintz adds.
About the Author
Jelani Harper is an editorial consultant serving the information technology market, specializing in data-driven applications focused on semantic technologies, data governance, and analytics.