by Jelani Harper 28 November 2019
For a long time, there were only two main types of graph databases: Resource Description Framework (RDF) and Labeled Property Graphs (LPGs). Each of these graphs excelled at different things; you had to choose one based on your main use case and hope it didn’t change too much.
RDF graphs are either triple or quad stores. Their chief benefits are interoperable standards that eliminate silos, the ease of use of their data models (ontologies), and their superior inferencing that supports machine intelligence.
The principal benefit of LPGs is the richness of detail they give data. These graphs let users quickly add properties to nodes to describe them with any assortment of additional information, like dates. Properties easily enhance data with colorful descriptions.
Today, there’s a third graph type that combines the best of these approaches. RDF* (or RDF star) is a knowledge graph triple store that leverages properties. This emerging standard extends RDF’s benefits to include the level of detail of LPGs quite painlessly.
Consequently, it’s no longer necessary to choose between RDF and LPGs. RDF* offers each graph’s advantages so you don’t have to give up anything. Best of all, this option provides a new environment for solving almost any business problems related to data provenance, data science, and analytics.
Before and After RDF*
Traditional RDF knowledge graphs express data as triples. Triples are semantic facts made up of a subject, predicate, and object. For example, a triple might state “John is married to Sue”. But if you want to add properties to this fact—like the date they were married—it’s not a straightforward process in RDF. You have to go through complex reification steps and create extra nodes. Granted, if you’re adept at this process it can work well. Although there are supporters for adding properties in RDF with reification, many people find it confusing.
It’s much simpler to add properties to triples in RDF*. This cutting-edge standard was recently proposed to the World Wide Web Consortium, but hasn’t been accepted yet. Today, it’s used by more progressive graph databases. Reification and blank nodes aren’t necessary for adding properties to triples with this standard; they’re added similar to the way they are in LPGs. This simplicity makes adding properties less complicated and time consuming, while achieving two key benefits. It makes RDF as colorful as LPGs while maintaining their ease of adding properties. Additionally, you can now quickly add properties without creating silos, which is a consequence of using LPGs.
The swift addition of properties to knowledge graphs creates many benefits. The most convincing of these might be an improved capability for data provenance. RDF* lets you add properties to any subject, predicate, or object—for any purpose. This practically unlimited expressivity is valuable for data provenance. For example, you might think a reference in your data to Columbia is about a country. Depending on the data source and who’s asked, it may or may not be a legitimate country (but a reference to something else). With RDF*, you can document provenance with properties to determine data source and its journey through the enterprise. Previously, this use case and similar ones would require a quad store for data lineage.
Also, when you store data there are different confidence levels about the data sources. RDF* lets you rapidly add properties so you can not only keep track of the data sources, but also identify the confidence levels in them for more complete provenance. With issues of regulatory compliance and data privacy continuing to grow, it’s increasingly important to provide traceability with data provenance. Also, this type of granular data lineage is essential for operationalizing data science outputs—especially putting machine learning models in production.
Training data requirements are still the biggest data science obstacle for machine learning. Models require different types of data, factors, and vectors—at scale—for accurate predictions. With RDF*’s inherent flexibility, you can assemble factors from anywhere into a single graph to train models. Quickly adding properties to semantic statements can enhance their usefulness for training datasets. This framework lets you incorporate even disparate data sources that land in the RDF* standard to maximize training data for machine learning and AI.
The provenance provided by adding properties to RDF* is vital for operationalizing machine learning models. This data lineage is essential for detecting differences in model outputs and inputs, and understanding why they happened. It’s also necessary for understanding why results in production settings might vary from training setting results—providing a roadmap for what can be done to get better predictions.
The rare graph database designed to support analytics processing and RDF* helps analytics in several ways. Firstly, most graph databases—whether LPGs or knowledge graphs—are transactional databases not expressly created for analytics jobs. The few designed for analytics that support RDF* allow you to keep adding data from different sources until you get accurate machine learning models.
If you encounter missing values in datasets you can practically ignore them and just use the data without the missing values. Alternatively, you can fill in missing values with averages of the values you have. Either way, you can still use the datasets. You can also look at data from various angles in a triple store in a performant manner. It’s extremely difficult to do this in relational settings. Ultimately, this option lets you involve more data in more flexible ways for more effective graph analytics.
Graph users have many options today. LPGs and RDF graphs are still around, but RDF* is as well. This new standard merges the best LPG qualities into an RDF environment that lets you easily add properties to knowledge graphs to augment their intelligent inferences and standards-based settings with colorful expressivity. Downstream advantages include revamped data lineage, data science, and analytics capabilities, making RDF* the ultimate playground for implementing these data management staples. The result is a profound improvement in the ability to achieve organizational goals with data.
Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.