2020 trends in data science: Vanquishing the skills shortage for good

by Jelani Harper 25 October 2019

Data science has surged to the forefront of the data ecosystem, with demonstrable business value derived from the numerous expressions of Artificial Intelligence currently being adopted in the enterprise.

It represents the nucleus of the power of predictive analytics, and the extension of data culture throughout modern organizations. Consequently, data science trends are more impactful than those in other data management domains, which is why its increasing consumerization (beyond the realm of data scientists) is perhaps the most meaningful vector throughout IT today.

“You can’t find the data scientist talent to build models? Well, how about if those models can be built by a business analyst with one mouse click and one API call?” asked Oliver Schabenberger, CTO and COO at SAS, in conversation with AI Business.

“You don’t know how to make analytics operational, how to deploy it in the real world? Well, how about if deployment and model management are built right into the framework? You do not know how to make AI work for you? Well, how about if [it’s embedded] for you into tools and solutions?”

This summarizes the state of data science as it heads into the next decade. Self-service access to intuitive options for visualization, data preparation, data modeling, virtual agents, and more is quietly enabling organizations to overcome the skills shortage ascribed to data science, leaving them with its considerable operational boons.

Data modeling

Data modeling is the veritable foundation of data science. Data scientists perform a plethora of time-consuming tasks to perfect the data models necessary for achieving business objectives via analytics or applications. Specific data modeling rigors involve “building a model, testing a model, deploying a model,” Schabenberger said. “Models are at the heart of analytics and data science.”

However, simply getting to the model building process requires addressing a number of issues, including “lack of talent, lack of data of the right quality and quantity, difficulty operationalizing analytics [and] taking it from the science project to operational excellence,” Schabenberger added. Self-service platforms embedding machine learning and natural language capabilities assist with virtually all these issues, including:

  • Lowering the barrier to entry: The automation of these AI elements makes data modeling possible for non-data scientists, allowing organizations to accomplish more with existing resources.
  • Selecting variables: Once relevant data sources have been identified, automated data modeling solutions can identify variables pertaining to model outcomes; users select the one most representative of their use case.
  • Selecting models: Machine learning can “consider the variety of different models, finding the best model type for your data,” explained Susan Haller, director of advanced analytics R&D at SAS. “It’s going to look at things like gradient boosting models, neural networks, and random forest, to name a few.”
  • Iteratively reassessing models: Competitive options in this space use cognitive computing to refine models once they’ve been selected. “At each step along the way the [cognitive computing framework] is going to continually reassess,” Haller said. “It’s going to add steps to the model; it’s going to remove things that are unnecessary. It may revisit existing steps and make modifications to them.”

Automated model building processes also include ensemble modeling and displaying predictive attributes.
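To make the model selection step concrete, here is a minimal sketch in plain Python. It is not how SAS or any other vendor implements the feature; the three candidate models, the toy data, and every function name are assumptions chosen for illustration. The idea it shows is simply this: fit several model types on the same training data, score each on held-out data, and keep the one with the lowest error.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate: least-squares line through the (x, y) pairs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Candidate: copy the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical candidate list; a real tool would try far richer model types.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}


def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def select_model(train, holdout):
    """Fit every candidate on train, score on holdout, return the best name."""
    scores = {name: mse(fit(*train), *holdout) for name, fit in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


# Toy data following y = 2x + 1 (an assumption for the example).
train = ([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11])
holdout = ([6, 7], [13, 15])
best, scores = select_model(train, holdout)
print(best, scores)
```

On this toy data the straight-line candidate wins, because the data was generated from a straight line. The iterative reassessment Haller describes would wrap a loop around `select_model`, changing the candidates or the preparation steps between rounds.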

Data engineering

Data scientists must find and prepare the appropriate datasets for achieving business objectives prior to building data models. Although such data engineering has typically been the most time-consuming aspect of data science, developments in data preparation have substantially reduced the time required. In addition to data discovery (to find appropriate datasets) and data cleansing, data preparation also includes the need to “explore my data: are [there] any issues that I need to resolve?” Haller said. “Second, you have to iterate through different data preparation techniques: transformations, imputations, things like that.” Modern data preparation methods automate these different procedures via:

  • Interactive visual methods: Whether the data is prepared by a formal data scientist or citizen data scientist, both benefit from working in data engineering settings that are “point and click,” remarked Piet Loubser, SVP of global marketing at Paxata. “You can reuse, you can drag and drop things. If you want to move steps around like, ‘I did a join then a find-and-replace, I should’ve done a find-and-replace and then a join’, you just drag and drop it.” Conventional data science frameworks like R require writing and rewriting code for this functionality.
  • Intelligent algorithms: Machine learning capabilities are supplemented with “dozens of other algorithms, search-based algorithms and so on that can help make the non-technical expert proficient with data,” Loubser said.
  • Natural language generation: Certain data preparation tools provide simple, natural language explanations for model performance or different feature engineering factors.
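Loubser’s join versus find-and-replace example shows why step order matters. The sketch below illustrates the underlying idea in plain Python and is not Paxata’s product or API: the step functions, column names, and sample rows are all hypothetical. Each preparation step is a function, the pipeline is a list of steps, and "dragging" a step amounts to reordering the list.

```python
from functools import partial


def find_and_replace(rows, column, old, new):
    """Return copies of rows with `old` swapped for `new` in one column."""
    return [{**r, column: new if r[column] == old else r[column]} for r in rows]


def inner_join(rows, lookup, key):
    """Keep rows whose key appears in `lookup`, merging in the lookup fields."""
    index = {item[key]: item for item in lookup}
    return [{**r, **index[r[key]]} for r in rows if r[key] in index]


def run(rows, steps):
    """Apply each preparation step in order."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical sample data: one order spells the state inconsistently.
orders = [{"id": 1, "state": "NY"}, {"id": 2, "state": "N.Y."}]
regions = [{"state": "NY", "region": "Northeast"}]

fix = partial(find_and_replace, column="state", old="N.Y.", new="NY")
join = partial(inner_join, lookup=regions, key="state")

join_first = run(orders, [join, fix])  # order 2 is dropped before it is fixed
fix_first = run(orders, [fix, join])   # both orders survive the join
print(len(join_first), len(fix_first))
```

Swapping two list entries is all it takes to correct the mistake, which is the code equivalent of the drag-and-drop reordering described above.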

Feature engineering

Feature engineering is the process whereby data scientists determine which factors or attributes in datasets determine the specific outcome of an advanced analytics model. For example, data scientists may want to analyze different data related to fraud detection to see which events are precursors of fraudulent activity. Modern data science tools can automate this process with machine learning algorithms “running different feature engineering techniques,” said Ilknur Kabul, senior manager of AI and machine learning R&D at SAS. However, feature engineering is particularly challenging on high-dimensionality data and what Kabul termed “wide data, where there’s so many more features than you will use.” Various techniques useful for implementing feature engineering include:

  • Embedding: With embedding, “you’re transforming data that is meaningful to people into numbers,” explained Cambridge Semantics CTO Sean Martin. “You’re basically vectorizing the data so it becomes numbers. And then you’re looking to see if there’s any sort of patterns in those numbers that will allow you to have an equation that predicts if you have one number, can you predict the other number.” Embedding in graph settings is effective for implementing the mathematical data transformations required for machine learning, because they’re primed for maintaining the necessary relationships.
  • Imputations: Organizations can circumvent issues with missing information in datasets, or specific values missing in tables, with imputations. According to Kabul, one method of facilitating imputations is to average the values that are present and insert that average in place of each missing value. Thus, organizations have more complete datasets for determining features.
  • Ensemble modeling: Ensembling techniques (such as stacking, in which the predictive prowess of multiple models is combined for a more accurate model in a method akin to how the layers of neural networks are stacked atop each other) can aid feature engineering processes. With stacking, one can use an ensemble “output as a feature to another ensemble; it can get bigger,” Kabul said.
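The averaging approach to imputation that Kabul mentions is simple enough to sketch in a few lines of plain Python. The function name and the sample column are assumptions for illustration, and missing values are represented here as `None`.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [30, None, 50, 40, None]
imputed = impute_mean(ages)
print(imputed)
```

Mean imputation keeps every row usable, but it also shrinks the column’s variance, which is one reason tools typically offer it alongside other imputation methods.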

Regardless of which techniques are used, the goal is the same—to minimize the time spent on feature engineering. According to Kabul, these temporal concerns include trying to “help data scientists save time. Also…computational time too, because resources are important. You don’t want to run those models forever. If you’re running in the cloud, that’s money.”

Virtual agents

Another automated means of deploying contemporary data science staples like machine learning throughout the enterprise is with bots or virtual agents. The practicalities of this form of automation are applicable to production settings, specifically for ingraining data science into process automation. According to Joe Bellini, COO of One Network, an integral part of going from data scientist sandboxes to deploying their handiwork in operations is “you have the AI machine learning autonomous agents that can run on top so you can execute in real time.” More importantly, virtual agents can effectively democratize the dimensions of AI utilized by the enterprise.

Whereas most people consider the statistical branch of AI (typified by machine learning) as synonymous with AI itself, the knowledge-based, rules-oriented side of AI is equally vital in verticals like supply chain management. For this use case, virtual agents can “deploy heuristics or deploy the optimizer or the rule-based engine,” Bellini said. “Those are all things people call AI in the market today.” Regardless of which dimension of AI is utilized—statistical or knowledge-based—virtual agents are a credible means of implementing these data science outputs. “Now that you’ve got real-time visibility into the data, there’s a lot of decision opportunities that you just don’t have enough arms and legs [for] on your own,” Bellini added. “So, you program the system to do it.”
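A rule-based agent of the kind Bellini describes can be very small. The sketch below is a hypothetical illustration in plain Python and does not represent One Network’s platform: the `StockLevel` fields, the reorder heuristic, and the two-day safety buffer are all assumptions. It shows the knowledge-based side of AI in miniature, with a hand-written rule applied automatically to every item in a data snapshot.

```python
from dataclasses import dataclass


@dataclass
class StockLevel:
    """Hypothetical inventory record for one product."""
    sku: str
    on_hand: int
    daily_demand: float
    lead_time_days: int


def reorder_rule(item, safety_days=2):
    """Heuristic: reorder when stock will not cover lead time plus a buffer."""
    needed = item.daily_demand * (item.lead_time_days + safety_days)
    if item.on_hand < needed:
        quantity = round(needed - item.on_hand)
        return {"sku": item.sku, "action": "reorder", "quantity": quantity}
    return None  # no decision needed for this item


def run_agent(items, rules):
    """Apply every rule to every item and collect the resulting decisions."""
    decisions = []
    for item in items:
        for rule in rules:
            decision = rule(item)
            if decision is not None:
                decisions.append(decision)
    return decisions


snapshot = [
    StockLevel("A-100", on_hand=40, daily_demand=10.0, lead_time_days=5),
    StockLevel("B-200", on_hand=500, daily_demand=10.0, lead_time_days=5),
]
decisions = run_agent(snapshot, [reorder_rule])
print(decisions)
```

Because rules are ordinary functions in a list, a statistical model’s prediction could be plugged in as one more rule, which mirrors the point that agents can deliver either dimension of AI.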

Real progress

Although the talent shortage still exists in data science, developments in automated model building, data engineering, feature engineering, and implementation (via virtual agents) have all but overcome its effects on organizations. These trends ensure that data science is more accessible to a wider range of enterprise users than ever before. “People have to be able to contribute if their skill set does not include a master’s in statistics or in data science,” Schabenberger noted. “This is one important way in which to address the talent gap that we’re seeing in data science. If the business analyst can easily, with access to data, build models, deploy models, and contribute back to the data science team, that is real progress.”

And it’s possibly an indicator that other enterprise users will soon be able to make the same claim.