by Jelani Harper
SAN FRANCISCO - The rigors of data modeling are perhaps the most longstanding obstacle to the ongoing decentralization of the contemporary data landscape.
As organizations compile data sources from on-premises systems, the cloud, and the cloud’s edge, often distributed across geographic regions, they encounter stringent data modeling restrictions that prevent them from deriving comprehensive value from the myriad interrelations of these datasets.
Although this concern is reflective of the increasing heterogeneity of the data ecosystem as a whole, it’s particularly acute in financial services. Financial organizations are mandated by regulations and international treaties to know their customers, requiring them to access data from traditional and non-traditional sources. Typically, this analysis is most effective when performed holistically.
Such holistic analysis, for these and other use cases, means accounting for the immensely different data models that coexist in centralized approaches to heterogeneous data sources, like data lakes. According to TopQuadrant CEO Irene Polikoff, “Everyone today is looking to upload more and more data into data lakes, both for self-service access but also to save money.”
By modeling data of different types in a consistent, standard way, organizations get these advantages along with a contextualized, horizontal understanding of all their data. By relying on universal standards for data modeling, businesses (including one Polikoff describes as “the largest bank”) can model data of any type side by side for comprehensive analytics and application use.
Metadata and data profiling
Frequently, metadata is the foundation for reconciling different data models for a single purpose, or for governed self-service access to centralized repositories.
The first step in accounting for the differences in schema found in different data models is to “take data sources and profile their metadata and bring that metadata into a knowledge graph,” Polikoff explains. Knowledge graphs based on the Resource Description Framework (RDF) are instrumental for surmounting data modeling challenges because “RDF is great as an inter-lingua because you can put any format into it and get any format out of it,” Polikoff says.
For example, the financial services institution Polikoff mentioned has a number of data sources in relational databases, and must first profile the metadata from those databases to overcome data modeling disparities. “In order to generate those datasets with consistent descriptions and consistent columns so they are connectable and well understood in a data lake, what they do is take the definitions of the databases, like what tables are in the databases, what columns, etc.,” Polikoff comments. “They bring those definitions, that metadata, into the knowledge graph.”
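The metadata profiling step Polikoff describes can be sketched in a few lines. The following is a minimal illustration, not the institution's or TopQuadrant's actual implementation: it assumes a SQLite database standing in for the relational sources, and the `http://example.org/meta#` vocabulary, the `profile_metadata` function and the `customer` table are all hypothetical. It reads table and column definitions and expresses them as subject, predicate, object triples, the basic unit of an RDF graph.

```python
import sqlite3

# Hypothetical namespace for this illustration; not a published vocabulary.
META = "http://example.org/meta#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def profile_metadata(conn):
    """Return (subject, predicate, object) triples describing tables and columns."""
    triples = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        table_iri = f"{META}table/{table}"
        triples.append((table_iri, RDF_TYPE, f"{META}Table"))
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, col, col_type, notnull, _default, _pk in conn.execute(
            f"PRAGMA table_info('{table}')"
        ):
            col_iri = f"{table_iri}/column/{col}"
            triples.append((col_iri, RDF_TYPE, f"{META}Column"))
            triples.append((table_iri, f"{META}hasColumn", col_iri))
            triples.append((col_iri, f"{META}datatype", col_type))
            triples.append((col_iri, f"{META}nullable", str(not notnull).lower()))
    return triples


# Example source database (assumed for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER NOT NULL, name TEXT, country TEXT)")
triples = profile_metadata(conn)
for s, p, o in triples:
    print(s, p, o)
```

In practice the triples would be loaded into an RDF store rather than printed, and the vocabulary would come from pre-built models rather than an ad hoc namespace.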
The universal standards environment of RDF is critical for aligning the schema of this multiplicity of data sources. According to Polikoff, profiling metadata helps to clarify the meaning of the data it describes, as does profiling the actual data. “For example, like how many unique values are in this field or allowed in this field,” Polikoff says. “You need to profile the data to get that enrichment of description.” Once the metadata of these respective databases is profiled, the RDF graph can produce uniform, aligned schema, regardless of the originating data models, via self-describing data formats such as JSON and Avro.
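Profiling the data itself, as opposed to its metadata, can be as simple as counting values per field. The sketch below is an assumption-laden example rather than a description of any product: it again uses SQLite, and the `profile_column` helper and the `customer` table are hypothetical names. It reports how many rows, non-null values and unique values a column holds, the kind of enrichment Polikoff mentions.

```python
import sqlite3


def profile_column(conn, table, column):
    """Count rows, non-null values, and distinct values for one column."""
    # Identifiers are interpolated directly, so they must come from a trusted catalog.
    total, non_null, distinct = conn.execute(
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}") '
        f'FROM "{table}"'
    ).fetchone()
    return {"rows": total, "non_null": non_null, "distinct": distinct}


# Example data (assumed for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "US"), (2, "US"), (3, "DE"), (4, None)],
)
stats = profile_column(conn, "customer", "country")
print(stats)
```

Statistics like these could then be attached to the column's description in the knowledge graph alongside its name and datatype.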
This combination enables organizations to input data into centralized repositories so that “the schema’s already consistent; the schema’s already connected to each other,” Polikoff says. The coupling of Avro and JSON is an integral aspect of this process. “Avro schema’s in JSON,” Polikoff observes. The resulting format in the financial services use case Polikoff discussed is JSON, in which the Avro schema is written.
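To make the Avro and JSON relationship concrete, here is a rough sketch of how profiled column metadata might be turned into an Avro record schema, which is itself a JSON document. The type mapping, the `avro_schema` function and the `example.datalake` namespace are illustrative assumptions, not part of the use case Polikoff described.

```python
import json

# Assumed mapping from SQL column types to Avro primitive types; illustrative only.
SQL_TO_AVRO = {"INTEGER": "long", "TEXT": "string", "REAL": "double", "BLOB": "bytes"}


def avro_schema(table, columns, namespace="example.datalake"):
    """Build an Avro record schema (a plain dict) from (name, sql_type, nullable) tuples."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO.get(sql_type.upper(), "string")
        if nullable:
            # Avro expresses optional fields as a union with "null".
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table, "namespace": namespace, "fields": fields}


schema = avro_schema("customer", [("id", "INTEGER", False), ("country", "TEXT", True)])
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Because every source is run through the same mapping, the schema that land in the repository share one set of conventions regardless of the database they came from.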
The mechanics of uniform data modeling
In this use case, “we shaped these models that described tables, columns, [and] the kind of metadata you may have about tables and columns that describe datasets,” Polikoff remarks. “So we already have all those pre-built models that let you simplify the mechanics of it.” The critical facet of reconciling the differences in schema among the various databases is data profiling. When the different data sources are profiled, “information about those sources is expressed in terms of those models,” Polikoff says. “It could be easily transformed into information that describes datasets, for example, because it’s all connected.”
The resulting datasets have a uniformity of data modeling that is linked within knowledge graphs and is ideal for centralized repositories for self-service analytics or applications. The standards-based environments of these graphs allow organizations to align data models that are inherently diverse at point of origin for singular use cases.
Moreover, this method lets organizations focus more on the value generated from aligning data of different formats, instead of worrying about building and re-building schema to do so. This process considerably “assists in generating consistent and well-annotated datasets that are then easy to discover,” Polikoff concludes. Such orderly datasets greatly enhance both self-service and business use of data.
Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.