The Cornerstone of Data Science: Progressive Data Modeling
by Ciarán Daly
By Jelani Harper
One of the direct consequences of the increasing diversity of today’s data landscape is the growing complexity of data modeling. Organizations are consistently broadening their array of data sources, incorporating more structures, formats, and types of data in order to maintain competitive advantage.
However, such diversification can considerably complicate the data modeling process, particularly for those still relying on typical relational methods. Each time business requirements change or additional sources are added, data modelers must recalibrate the underlying schema for repositories or applications. This recurring cycle delays time to value.
[Image: Jans Aasman, CEO of Franz Inc]
Contemporary data science and artificial intelligence requirements simply can’t wait for this ongoing, dilatory process. According to Jans Aasman, CEO of Franz, they no longer have to. By deploying what Aasman called an “events-based approach to schema,” companies can model datasets with any number of differences alongside one another for expedited enterprise value.
The resulting schema is simplified, uniform, and useful in multiple ways. “You achieve two goals,” Aasman noted. “One is you define what data you trust to be in the main repository to have all the truth. The second thing is you make your data management a little more uniform. By doing those two things your AI and your data science will become better, because the data that goes into them is better.”
The key to the simplicity of schema in this approach is converting business data into schematic events. Aasman observed that doing so is usually a lot easier than it may initially sound: “You can look at any interaction with a customer: whether it’s the sale of a product, or if they return something, or if they have a complaint, or call a call center, or whether they got an invoice or maybe not paid their invoice. All that can be an event.”
Similarly, organizations can convert other data-driven occurrences—including healthcare procedures, sensor data from the Internet of Things, or almost anything else—into an event that’ll be modeled with the same schema, regardless of data format or source. “We’re still talking about the same simple structure,” Aasman revealed. “We have an event type, start time, end time, and one or more actors.”
Such schema can adequately describe any type of event, especially with the addition of key-value pairs for details critical to data’s use or meaning. “Every type of event will have a few very specific key values or attributes,” Aasman said. “That’s why this events approach works far better in a graph database, although you still can do it in a relational database.”
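The structure Aasman describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Franz's implementation: the class name `Event`, the field names, the identifier format for actors, and the sample values are all hypothetical. Only the four core fields and the idea of event-specific key-value pairs come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    """One uniform record shape for every kind of business occurrence."""

    event_type: str                  # e.g. "sale", "return", "complaint"
    start_time: datetime
    end_time: Optional[datetime]     # None for an event still in progress (an assumption)
    actors: List[str]                # one or more participants
    # Event-specific details live in key-value pairs, not in new columns.
    attributes: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.actors:
            raise ValueError("an event needs at least one actor")
        if self.end_time is not None and self.end_time < self.start_time:
            raise ValueError("end_time precedes start_time")


# Two very different occurrences share the same shape (all values are made up).
sale = Event(
    event_type="sale",
    start_time=datetime(2019, 3, 1, 10, 15),
    end_time=datetime(2019, 3, 1, 10, 20),
    actors=["customer:42", "store:7"],
    attributes={"sku": "A-100", "amount": 59.90},
)
call = Event(
    event_type="call_center_call",
    start_time=datetime(2019, 3, 5, 9, 0),
    end_time=datetime(2019, 3, 5, 9, 30),
    actors=["customer:42", "agent:3"],
    attributes={"queue": "billing"},
)
```

Adding a new kind of event in this sketch means adding new keys to `attributes`, not altering the record layout, which is the property that lets differing sources sit side by side.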
The events-based schema methodology only works with enterprise taxonomies—or at least with taxonomies spanning the different sources included in a specific repository, such as a Master Data Management hub. Taxonomies are necessary so that “the type of event can be specified,” Aasman said.
Moreover, taxonomies are indispensable for clarifying terms and their meaning across different data formats, which may represent similar concepts in distinct ways. Therefore, practically all objects in a database should be “taxonomy based,” Aasman said, because these hierarchical classifications enable organizations to query their repositories via this uniform schema.
A fairly widespread mistake is to create such a taxonomy and not fully implement it in the underlying repository, which complicates the query process while negating the value of even having taxonomies. Ideally, users should ensure that each significant concept in their repositories corresponds to a term in a taxonomy, which functions as the means of querying those repositories and taking advantage of uniform schema.
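To make the role of the taxonomy concrete, here is a minimal sketch of querying events through a hierarchical classification. The taxonomy terms, the `TAXONOMY` mapping, and the helper names `is_a` and `events_of_type` are hypothetical and chosen only for illustration; a real deployment would more likely hold the taxonomy in the repository itself.

```python
from typing import Any, Dict, List, Optional

# Hypothetical taxonomy: child term -> parent term (None marks the root).
TAXONOMY: Dict[str, Optional[str]] = {
    "customer_interaction": None,
    "transaction": "customer_interaction",
    "sale": "transaction",
    "return": "transaction",
    "support": "customer_interaction",
    "complaint": "support",
    "call_center_call": "support",
}

# Made-up events; each event_type is a term from the taxonomy above.
EVENTS: List[Dict[str, Any]] = [
    {"id": 1, "event_type": "sale"},
    {"id": 2, "event_type": "return"},
    {"id": 3, "event_type": "complaint"},
    {"id": 4, "event_type": "call_center_call"},
]


def is_a(term: str, ancestor: str) -> bool:
    """Return True if term equals ancestor or sits below it in the taxonomy."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = TAXONOMY.get(current)  # walk one level up; None ends the walk
    return False


def events_of_type(events: List[Dict[str, Any]], term: str) -> List[Dict[str, Any]]:
    """Select events whose type falls under the given taxonomy term."""
    return [e for e in events if is_a(e["event_type"], term)]


support_events = events_of_type(EVENTS, "support")
all_interactions = events_of_type(EVENTS, "customer_interaction")
```

Because every event type is a taxonomy term, a query for a broad term such as `support` picks up complaints and call center calls without naming each one. An event whose type was never entered in the taxonomy would be missed by such queries, which is the cost of the partial implementation described above.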
In addition to assisting with uniform queries for assorted data types, implementing a single schema also simplifies other foundational aspects of data science. Foremost among these is reducing the complications associated with time-series data and temporal representations.
Time-series analysis is crucial to more advanced machine learning applications for streaming or continuously generated data. Aasman noted that traditional approaches to schema may include several hundred ways “to specify time as the name of a column…it’s utterly crazy.” However, event-based approaches standardize temporal representations by using “one word to specify what a begin time and an end time is,” Aasman said.
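As a rough illustration of that standardization, the sketch below maps source-specific time columns onto a single pair of names. The column aliases, the function name `normalize_times`, and the choice to treat a record without an end time as instantaneous are all assumptions made for the example.

```python
from datetime import datetime
from typing import Any, Dict

# Hypothetical column names that different source systems might use.
START_ALIASES = {"order_date", "admitted_at", "call_started", "reading_ts"}
END_ALIASES = {"ship_date", "discharged_at", "call_ended"}


def normalize_times(record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename source-specific time columns to start_time and end_time."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        if key in START_ALIASES:
            out["start_time"] = datetime.fromisoformat(value)
        elif key in END_ALIASES:
            out["end_time"] = datetime.fromisoformat(value)
        else:
            out[key] = value  # non-temporal fields pass through unchanged
    if "start_time" not in out:
        raise ValueError("record has no recognizable start time")
    # Assumption: a record with no end time is treated as instantaneous.
    out.setdefault("end_time", out["start_time"])
    return out


order = normalize_times(
    {"order_date": "2019-03-01T10:15:00", "ship_date": "2019-03-03T08:00:00", "sku": "A-100"}
)
reading = normalize_times({"reading_ts": "2019-03-01T10:15:00", "temp_c": 21.5})
```

After this step, an order and a sensor reading can both be filtered or sorted on `start_time` and `end_time`, whatever their sources originally called those columns.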
Additionally, conventional approaches to schema require considerable forethought about which analytics questions users will ask of the data, since those questions form the basis of the schema’s design. Hence, new questions, changing business requirements, or additional datasets each bring a lengthy recalibration period.
Conversely, “We get our simplification by addressing the complexity by turning everything into an event so now you don’t have to think about all the different ways to approach the data,” Aasman said. “You just have only a few basic queries to get you going.”
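What such basic queries might look like is sketched below over a plain list of event records. Aasman does not enumerate the queries, so the three shown here (by actor, by time window, and an actor's timeline) are assumptions about what a reasonable starting set could be, and every name and value is hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List

# Made-up events in the uniform shape: type, start, end, actors.
EVENTS: List[Dict[str, Any]] = [
    {"event_type": "complaint", "start_time": datetime(2019, 3, 5, 9, 0),
     "end_time": datetime(2019, 3, 5, 9, 30), "actors": ["customer:42", "agent:3"]},
    {"event_type": "sale", "start_time": datetime(2019, 3, 1, 10, 15),
     "end_time": datetime(2019, 3, 1, 10, 20), "actors": ["customer:42", "store:7"]},
    {"event_type": "sale", "start_time": datetime(2019, 4, 2, 14, 0),
     "end_time": datetime(2019, 4, 2, 14, 5), "actors": ["customer:99", "store:7"]},
]


def by_actor(events: List[Dict[str, Any]], actor: str) -> List[Dict[str, Any]]:
    """Every event a given actor took part in, whatever its type."""
    return [e for e in events if actor in e["actors"]]


def overlapping(events: List[Dict[str, Any]], window_start: datetime,
                window_end: datetime) -> List[Dict[str, Any]]:
    """Every event whose time span intersects the given window."""
    return [e for e in events
            if e["start_time"] <= window_end and e["end_time"] >= window_start]


def timeline(events: List[Dict[str, Any]], actor: str) -> List[Dict[str, Any]]:
    """An actor's events in chronological order."""
    return sorted(by_actor(events, actor), key=lambda e: e["start_time"])


march = overlapping(EVENTS, datetime(2019, 3, 1), datetime(2019, 3, 31, 23, 59))
history = timeline(EVENTS, "customer:42")
```

None of these functions needs to know whether an event is a sale or a complaint, which is why the same handful of queries keeps working as new event types arrive.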
Accelerating Data Science
Data modeling is arguably the foundation of data science. It’s a vital prerequisite for integrating, analyzing, and deriving action from data. By simplifying schema concerns with an events-based method supported by taxonomies, data scientists can accelerate this step and spend more time solving business problems.
“Some people think that if you simplify schema you leave data out,” Aasman said. “But that’s not the case; it’s more that you make the data uniform, and in the process you simplify it.”
Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.