A closer look at building predictive models: The criticality of data access

What is machine learning doing to your storage?

by Jelani Harper 23 March 2020

Most organizations are aware that building predictive models with machine learning requires copious amounts of data. The more accomplished of them realize that data should come from a range of sources to mitigate the threat of bias. Oftentimes, these two requirements involve semi-structured and unstructured data.

Nonetheless, even the most AI-savvy organizations often don’t realize the demands this combination of requirements places on data access and retrieval, or its potential for producing bottlenecks that can slow or even derail these cognitive computing undertakings, until after they’ve begun.

“What happens with machine learning is, many times your access is random,” explained Surya Varanasi, CTO at Nexsan. “When it’s random, because you’re accessing all kinds of data and building models… HDDs, you know, spinning media, does poorly.”

Unfortunately, most corporate data still resides on spinning disk drives. However, a number of recent advancements in storage can meet the random access demands of building compelling predictive models for deployments at enterprise scale.

Accelerating data access

One of the initial steps in building machine learning models is to gather the data that will inform them. “If you have a lot of data and you can’t access it, the models take forever,” Varanasi said. The three chief considerations for the speed of data access are compute, main memory, and storage. These concerns are amplified when constructing predictive models because of the quantity and variety of the data involved, as well as the differences in how that data is structured. “In order to build anything you need your compute, your main memory, and your storage together to form a holistic computer system, if you will, so you can access, process, and create,” Varanasi added.

One of the solutions for overcoming the limitations of HDDs for building predictive models is to utilize flash storage, which delivers noticeable speed benefits. Traditional storage options enhanced with just a fraction of flash will see performance gains for retrieving data for machine learning models. “By adding just five percent flash to 95 percent spinning media, you speed up random access filing,” Varanasi said. “I’m not saying it’s 100x like all-flash, but if you get 5x and the price is extremely incremental to what you [originally] had, then you have a solution that people can afford.”
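To see how a small flash tier could plausibly produce a gain of that order, consider a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not Nexsan's method: it treats the hybrid system as a cache in which some fraction of random reads (the hit ratio) is served from flash and the rest from spinning disk. The latency figures (0.1 ms for flash, 10 ms for an HDD random read) and the function names are illustrative assumptions, not numbers from Varanasi. Note too that the hit ratio depends on how concentrated the workload's accesses are, not simply on the five percent of capacity that is flash.

```python
# Illustrative model only: latencies and names are assumptions, not vendor figures.

def average_latency_ms(hit_ratio, flash_ms=0.1, hdd_ms=10.0):
    """Average random-read latency when hit_ratio of reads are served from flash."""
    return hit_ratio * flash_ms + (1.0 - hit_ratio) * hdd_ms


def speedup(hit_ratio, flash_ms=0.1, hdd_ms=10.0):
    """Speedup of the hybrid tier relative to reading everything from HDD."""
    return hdd_ms / average_latency_ms(hit_ratio, flash_ms, hdd_ms)


def hit_ratio_for_speedup(target, flash_ms=0.1, hdd_ms=10.0):
    """Flash hit ratio needed to reach a target speedup over HDD alone."""
    # Solve hdd / (h * flash + (1 - h) * hdd) = target for h.
    return (hdd_ms - hdd_ms / target) / (hdd_ms - flash_ms)


if __name__ == "__main__":
    # Under these assumed latencies, a 5x gain needs about 81% of reads to hit flash.
    print(round(hit_ratio_for_speedup(5.0), 2))
```

Under these assumed latencies, a 5x improvement requires roughly four out of five random reads to land on the flash tier, which is feasible only if the frequently accessed data is a small share of the total. With every read served from flash, the same model gives about 100x, but that figure follows directly from the latencies chosen here.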

Expediting model building

A specific option for hastening data access is to leverage Quad-Level Cell (QLC) flash. According to Varanasi, benefits of employing QLC include performance and cost. He characterized QLC as a new generation of flash, coming out with 4 bits per cell: “What it does against mainstream flash technology is it allows you to pack a lot more density into a single cell, but its endurance is much lower than standard flash. And, of course, it’s much more affordable.” When using this type of flash to help with the storage underpinning data retrieval for predictive models, “you give customers a couple choices,” Varanasi said. “Hey, you have a few hard drives, let me speed it up for you so you can get random access. Okay, your data is critical, you really need to speed it up but you don’t want to pay for mainstream flash. Okay, let’s add QLC for you and help you with that.”

The performance effects of updated storage on the process of training predictive models are considerable. Varanasi referenced a facial recognition use case pertaining to security video feeds. In this project, the organization was utilizing GPUs, yet it was still taking over a week to build and refresh its advanced machine learning models with the data it had. By utilizing contemporary options for high-speed storage, it was able to reduce that time “to about a day, or a day and a half,” Varanasi estimated.

From training to production

Many people credit the recent advances in cognitive computing applications to advancements in processing power. Although these gains are certainly a large part of it, the ability to retrieve data quickly enough to build cognitive predictive models also required improvements to storage media. Options in traditional flash, QLC, and high-speed storage can help accomplish this, and are becoming increasingly necessary for the model building process at enterprise scale. More importantly, perhaps, they enable organizations to spend less time working with models in training settings, and more time doing so in production environments.


Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.