Best practices for implementing data science sandboxes

Jelani Harper

January 21, 2020

More often than not, the advanced analytics techniques enhancing data-driven processes across the enterprise are spawned from a data scientist’s sandbox. Even users who access cloud-based libraries of machine learning models on demand can credit much of their productivity to insights first gleaned in these data science playgrounds.

Although there’s no shortage of tools data scientists can deploy to scrutinize information, credible options need to meet several requirements:

  • Data Modeling: Sandboxes should be free of data modeling constraints so data scientists can rapidly include data of any variety, regardless of format, schema, or point of origination.

  • Integration: Ideal sandboxes enable data scientists to quickly integrate and aggregate data to determine how different sources relate to one another and to business objectives for predictive models.

  • Scalability: Sandboxes that scale are essential for duplicating production environments or determining how analytic models function within them. This functionality is critical for tweaking algorithms once predictive models are deployed.

According to Steve Sarsfield, VP of product at Cambridge Semantics, the Resource Description Framework (RDF) methodology of knowledge graphs provides each of these capabilities to serve as viable sandboxes for testing data for the statistical side of Artificial Intelligence. “That’s the nice thing about a knowledge graph or RDF,” Sarsfield explained. “I can add any data, of any type, at any time, and I can take that and use that as part of my analytics or my machine learning.”
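The schema-free property Sarsfield describes can be sketched with a toy in-memory triple set. This is an illustrative Python example, not a real RDF triple store or any Cambridge Semantics product: every fact is a (subject, predicate, object) tuple, so data of a new shape can be added at any time with no schema migration.

```python
# Toy illustration of the RDF idea: each fact is one
# (subject, predicate, object) triple, so there is no fixed schema
# to migrate when new kinds of data arrive. Names are hypothetical.
triples = set()

def add(subject, predicate, obj):
    """Record one fact; no predefined table or column is required."""
    triples.add((subject, predicate, obj))

# Facts from a customer database...
add("customer:42", "name", "Ada")
add("customer:42", "segment", "enterprise")

# ...and, later, clickstream data with entirely different attributes,
# added without altering any existing structure.
add("customer:42", "clicked", "product:7")
add("product:7", "category", "sensors")
```

A relational sandbox would need a new table or an `ALTER TABLE` for the clickstream attributes; here they simply become more triples alongside the rest.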

Machine learning factors

The sandbox is the setting in which data scientists initially examine data to see how it would affect advanced analytics models. These days, such models frequently involve various elements of machine learning. The triple and quad stores of knowledge graphs can readily incorporate “a huge amount of data and factors that you’re looking at when you’re trying to feed your algorithms,” Sarsfield said.

While in other sandbox settings data scientists may get bogged down reconciling the schema differences of inherently dissimilar data models, “the flexibility of RDF in general is you can add factors from a lot of different places and bring them all in one place, and then your machine learning algorithms can go after them,” Sarsfield said. Knowledge graphs harmonize data across data models to enable quick inclusion of factors of various types. “That’s the value of RDF: you have all these factors, you’re bringing them in, you can store them in triples, and then the algorithms can go at them very easily,” he added.
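The harmonization step can be sketched as follows. This is a hedged, minimal example with made-up source names: two differently shaped record sets are each mapped onto triples using their own field names as predicates, so downstream algorithms see one uniform pool of factors without negotiating a shared schema up front.

```python
# Two hypothetical sources with different shapes and field names.
crm_rows = [{"id": "cust:1", "name": "Ada", "region": "EU"}]
support_tickets = [{"customer": "cust:1", "issue": "billing", "priority": 2}]

triples = set()

# Each source maps onto (subject, predicate, object) triples with its
# own field names as predicates; no up-front schema reconciliation.
for row in crm_rows:
    subject = row["id"]
    for key, value in row.items():
        if key != "id":
            triples.add((subject, key, value))

for ticket in support_tickets:
    subject = ticket["customer"]
    for key, value in ticket.items():
        if key != "customer":
            triples.add((subject, key, value))

# All factors for one entity are now reachable in one place.
factors = {(p, o) for (s, p, o) in triples if s == "cust:1"}
```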

Training data issues

Once data scientists have ingested information into their sandboxes, it’s necessary to examine the datapoints in a variety of ways to understand their relationships to each other and to machine learning goals. The flexibility of the knowledge graph environment buttresses this critical use case for sandboxes, so scientists can understand potential training datasets. “If I want to look at data from various angles, it’s pretty easy to do that, and do that in a performant way, with an RDF triple store,” Sarsfield said. “[It’s] pretty tough to do that if you have a standard relational database, because it may involve joins and a lot of things that use a lot of different resources.”
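Looking at the same data “from various angles” amounts to pattern matching over triples rather than joining tables. The sketch below is a toy stand-in for a SPARQL-style query engine, assuming hypothetical data, with `None` acting as a wildcard in the pattern.

```python
# Hypothetical triples about people, models, and datasets.
triples = {
    ("ada", "works_on", "model_a"),
    ("bo",  "works_on", "model_a"),
    ("ada", "uses",     "dataset_x"),
    ("model_a", "trained_on", "dataset_x"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Angle 1: who works on model_a?
contributors = [s for (s, _, _) in match(p="works_on", o="model_a")]

# Angle 2: everything known about ada, from the same data, no joins.
ada_facts = match(s="ada")
```

In a relational database each new “angle” tends to mean a different join path; here both questions are one pattern against one pool of triples.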

Knowledge graphs also provide utility for overcoming common training data issues affecting the iterative process of data science. The general idea is to “incorporate a bunch of data together, and if it’s not enough to give accurate machine learning and AI, then I add more data and see if that really can impact the outcome of it,” Sarsfield explained. The adaptability and scalability of knowledge graphs makes doing this relatively painless. Moreover, if the situation arises in which “data has a lot of missing values in it and I want to incorporate that, it doesn’t matter,” he added. “We’ll sort of ignore the missing values and use the data that does have value.”
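The missing-values point follows directly from the representation: an attribute a record lacks is simply a triple that was never stated, so feature extraction uses whatever is present. A minimal sketch, with hypothetical sample data:

```python
# Hypothetical samples; sample:2 has no income triple at all, rather
# than a NULL placeholder in a fixed-width row.
triples = {
    ("sample:1", "age", 34),
    ("sample:1", "income", 72000),
    ("sample:2", "age", 29),
}

def features(subject):
    """Collect the factors actually recorded for a subject."""
    return {p: o for (s, p, o) in triples if s == subject}

full = features("sample:1")
partial = features("sample:2")  # missing attribute is simply absent
```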

Refining data science

The preceding best practices illustrate some of the more tangible reasons why graph approaches are increasingly sought after to help with various aspects of Artificial Intelligence.

They furnish adaptive environments in which data scientists can combine and examine data and create the predictive models driving machine learning. They’re useful for the iterations of this discipline, in which scientists create models, assess the results, then improve them by adding different types of data. They’re also beneficial for viewing the myriad facets of high-dimensional data and selecting the features necessary to improve the results of predictive models. Knowledge graphs are a viable medium for helping data scientists go from the testing and training phase to the production phase—and back again for the best results.

Jelani Harper is an editorial consultant serving the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.
