By Jelani Harper

Successful deep learning deployments require copious amounts of annotated training data as inputs for their predictive models. The resulting models are dynamic algorithms capable of linear and non-linear pattern recognition at scale, encompassing scores of variables.
In several verticals, this technology’s utility has been circumscribed by a dearth of such labeled input data. The healthcare industry, however, is one of the few exceptions. For years, medical institutions have amassed labeled patient data for various diagnostic and treatment purposes; these data are optimal for training both classic and advanced machine learning models.
According to Sandra Lillie, Hyland Global Director of Imaging Sales and Strategy, numerous healthcare facilities “sit on hundreds of millions of very clinically rich images. Our belief is how do you make it useful?”
One of the most immediate ways to leverage that data is to train deep learning models for image recognition systems detecting medical conditions for clinical diagnoses and potential cures. When coupled with sophisticated metadata management for cataloging, classifying, and categorizing healthcare data, these data sources make a pivotal difference in the underlying value deep learning furnishes the healthcare industry—and the patients relying on it.
According to Bryan Schnepf, Hyland Leader of Global Healthcare Solutions Marketing, these sources are invaluable because, “The pixel data is really the pattern recognition stuff for doing the deep learning on being able to identify disease automatically.” 

Image Recognition

Image recognition is perhaps one of the more prominent use cases for deep learning. The basic premise is that by training deep learning models on annotated images for extremely discrete functions, those models will eventually recognize images evincing similar, if not identical, traits. “If you run a hundred thousand breast cancer images through the machine and teach it where the tumors are, then when you send the hundred and first thousandth, it should be able to find it on its own,” Schnepf mentioned. The healthcare industry is advantaged in this regard because its care facilities have a wealth of such data at their disposal.
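Schnepf’s train-then-predict premise can be illustrated with a deliberately tiny stand-in for a deep network: a nearest-neighbor classifier over labeled “images.” Everything here, including the four-pixel images, the labels, and the function names, is a hypothetical sketch, not anything Hyland described; a real system would train a neural network on hundreds of thousands of annotated scans.

```python
import math

# Toy "images": flat lists of pixel intensities, each labeled by a clinician:
# 1 = tumor present, 0 = healthy. Real training sets hold annotated scans.
training_data = [
    ([0.9, 0.9, 0.1, 0.1], 1),  # bright region, labeled "tumor"
    ([0.8, 0.7, 0.2, 0.1], 1),
    ([0.1, 0.2, 0.1, 0.2], 0),  # uniformly dark, labeled "healthy"
    ([0.2, 0.1, 0.2, 0.1], 0),
]

def predict(image, k=3):
    """Label a new image by majority vote of its k nearest training images."""
    dists = sorted(
        (math.dist(image, pixels), label) for pixels, label in training_data
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# An unseen image resembling the annotated tumor examples is classified as 1.
print(predict([0.85, 0.8, 0.15, 0.1]))
```

The point of the sketch is the workflow, not the algorithm: labeled historical images are the training inputs, and the model’s only job is to generalize those labels to images it has never seen.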
Of particular use is the pixel data of the various imaging systems deployed throughout this vertical, which Schnepf described as “pure data” for training machine learning and deep learning models. Many of the foundational gains in deep learning involved image recognition systems and analyzing various healthcare images. “For a lot of what you’ve read, seen and heard, IBM type machines have been in full swing for years on this topic,” Schnepf commented. “The pixel data is what they’ve been analyzing.” Oftentimes, this data is already labeled according to particular medical procedures. The key to unlocking this data’s worth for image recognition systems, however, involves adequately cataloging and classifying them for the labeled inputs machine learning models rely on for training. Such data curation is vital for “ensuring that, it’s hard to say guaranteeing, but [for] giving confidence that the data for AI programs are the highest quality possible,” Schnepf explained.

The Merit of Metadata

Metadata management, then, is essential for curating data for deep learning. Medical imaging data contains all sorts of diverse metadata, from patients’ names and record numbers to the sorts of examinations they required and other means of identification. Moreover, this metadata also applies to “the actual machine that the images were recorded on by their make, their model, their serial number, so it can be traced back in case there was any kind of legal problem or a recall situation, etc.,” Schnepf said.
Such metadata provides the basis for classifying images and labeling them for machine learning training. It’s the foundation for deriving value from pixel data so it “may be used for AI purposes and machine learning capabilities down the road,” Schnepf acknowledged. In this regard, metadata offers the initial step towards actually facilitating the data quality machine learning and deep learning models need for training. The irony is “all this metadata was kind of dormant and associated with the medical files for a long time, but in recent years that metadata has become really rich and valuable and become part of the wave towards AI and automation,” Schnepf said.
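The cataloging step Schnepf describes can be sketched in a few lines: given metadata records attached to stored images, group them by a field such as exam type to assemble candidate training sets. The records, field names, and IDs below are hypothetical illustrations, not Hyland data structures.

```python
from collections import defaultdict

# Hypothetical metadata records of the kind attached to stored images:
# patient-facing fields plus equipment details for traceability.
studies = [
    {"record_id": "A100", "exam_type": "mammogram", "scanner_model": "ModelX"},
    {"record_id": "A101", "exam_type": "chest x-ray", "scanner_model": "ModelY"},
    {"record_id": "A102", "exam_type": "mammogram", "scanner_model": "ModelX"},
]

def catalog_by(records, field):
    """Group image records by a metadata field to build labeled training sets."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec["record_id"])
    return dict(groups)

print(catalog_by(studies, "exam_type"))
# {'mammogram': ['A100', 'A102'], 'chest x-ray': ['A101']}
```

Grouping by `scanner_model` instead would support the recall and traceability scenario Schnepf mentions; the same dormant metadata serves both curation and compliance.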

DICOM and Data Quality

Although there are a number of ways to maintain and manage metadata, one of the most effective within the healthcare vertical is to use interoperable industry standards such as the Digital Imaging and Communications in Medicine (DICOM) standard. This standard can accommodate the storage of data for any number of medical images, which might include “a CT scan, or an ultrasound scan, or an MRI scan or even a digital x-ray,” Schnepf said. Furthermore, DICOM defines a bevy of metadata that is an integral part of the classification of medical images. “There’s tons of metadata that’s part of the DICOM standard,” Schnepf remarked. The management of this metadata, in an open standard accessible between systems and even between healthcare organizations, immensely helps with implementing the consistent data quality levels required for deep learning model accuracy. “These tools are only as good as the data that gets to them,” Lillie commented about the value of data quality for machine learning applications.

From the Bottom Up

In retrospect, data quality is actually the foundation of the advanced image recognition systems facilitated by deep learning for the healthcare industry. Metadata management, enhanced by the use of open standards such as DICOM, is instrumental for categorizing the array of big data sources that lend clinical value to medical images. The assortment of pixel data found in those images provides a fertile training ground for deep learning model inputs. That training culminates in heightened image recognition systems that detect patterns in images correlating to the diagnosis and treatment of specific healthcare conditions.

“There are two aspects of the AI story,” Schnepf summarized. “The metadata that we referred to is kind of this rich harness, this amount of data, that can be used for identification purposes for big data type scenarios because we know the type of exam done, we can categorize thousands of exams of similar types, and they can be used then with some algorithm.” Increasingly, those algorithms are predicated on machine learning and neural network technologies.

Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.