On the offensive and defensive elements of data strategy
by Jelani Harper 20 December 2019
As the new decade approaches, the initial excitement around technologies deemed “new” in the present one is waning considerably. Organizations now focus far more on the business value derived from various applications of cognitive computing, the Internet of Things, and even blockchain than on the hype surrounding them.
Realizing the promise of these technologies and applications, however, requires overcoming the inherent obstacles of dealing with data at scale, external to the enterprise, and with low latency. Each of these factors, like each of the aforementioned technologies and their applications, only increases the importance of data strategy for dealing effectively with what amounts to more data, arriving faster and more widely distributed than ever before.
The basic paradigm of utilizing both offensive and defensive dimensions of data strategy remains as relevant as it ever was. “To give this analogy, if you are only doing offense when it comes to data strategy, or only doing defense, you are losing the game,” cautioned EWSolutions president and CEO David Marco. “Your data strategy must include a strong offense coupled with a strong defense.”
The defensive aspect of data strategy includes reducing enterprise risk by dealing with regulatory compliance, legal issues, discovery, and cost. The offensive side is predicated on the reusability of data assets to increase monetization, personalization, and optimization so that, according to David Schweer, product marketing director at Aprimo, organizations can ideally “reuse this [data] multiple times” for competitive advantage—even across use cases.
Shades of metadata
Metadata management will likely always remain the core of data strategy, data management, and data governance. Metadata management is foundational to accounting for the massive quantities of training data for machine learning models. When pairing facets of the Internet of Things with applications of Artificial Intelligence (the AIoT), organizations must standardize their metadata management. Whether utilized for offensive or defensive purposes, astute metadata management involves:
- Classifications: According to Marco, the most strategic means of classifying metadata is binary and based on business meaning: “When you look at it, you must first have your classification built from the business concept, which is a thing, state, city: something granular that I cannot break down further. And then also from a business group level, because then each one of those business concepts are tied to a technical instantiation.” Classifications are helpful for identifying PII.
- Tagging: Tags are used to describe various dimensions of data so, for example, users can “describe an image in as many unique ways as possible so I can find what I’m searching for,” Schweer explained. “Metadata is broader, but tagging is a form of metadata.” Tagging is particularly advantageous for reusing data or content. Numerous options exist for automating tagging via natural language technologies and machine learning.
- Taxonomies: The hierarchies of the terms used to describe data are especially important for leveraging a centralized approach to accommodate the burgeoning decentralized data landscape. Schweer said the relationship between tagging and taxonomies is “not important to the user experience; it is the user experience.” When several users are relying on the same repository for multiple purposes or use cases, “how that’s organized and what you have access to is courtesy of your taxonomy,” Schweer added. “My user experience and ability to get my personal job done is 100 percent driven by taxonomy, metadata, and tagging capabilities.”
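The relationship between tagging and taxonomies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any product mentioned here: the `TAXONOMY` terms, the `Catalog` class, and the asset identifiers are all hypothetical. Each tag is indexed under its own term and every ancestor term, so a search on a broad term also finds assets tagged more specifically.

```python
from collections import defaultdict

# Hypothetical taxonomy: each term maps to its parent (None marks a root).
TAXONOMY = {
    "media": None,
    "image": "media",
    "product-photo": "image",
    "video": "media",
}


def ancestors(term):
    """Return the term followed by each of its ancestors, up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = TAXONOMY[term]  # raises KeyError for terms outside the taxonomy
    return chain


class Catalog:
    """Toy in-memory index from taxonomy terms to asset identifiers."""

    def __init__(self):
        self._by_term = defaultdict(set)

    def tag(self, asset_id, *terms):
        # Index the asset under each tag and all of that tag's ancestors.
        for term in terms:
            for t in ancestors(term):
                self._by_term[t].add(asset_id)

    def find(self, term):
        # Sorted list of assets tagged at or below the given term.
        return sorted(self._by_term.get(term, set()))


catalog = Catalog()
catalog.tag("asset-001", "product-photo")
catalog.tag("asset-002", "video")
print(catalog.find("image"))
print(catalog.find("media"))
```

Because the hierarchy decides which assets a search returns, two users querying the same repository with different terms see different slices of it, which is one way to read Schweer's point that taxonomy shapes the user experience.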
The governance framework
Marco explained that while metadata management is the part of data management pertaining to the technical applications of data, data governance is “your people processes. That’s how we’re going to build a structure and an organizational framework that allows us to make enterprise decisions about our data.” Organizations must have a formal data governance construct in place as part of their overall data strategy, especially for supervising the massive amounts of data associated with contemporary cognitive computing applications. Both the defensive and offensive sides of data strategy involve the data governance hallmarks of lifecycle management, data quality, and data provenance.
Lifecycle management is closely associated with the defensive concerns of data management. Organizations must become increasingly aware of how long they retain data in relation to stringent regulatory compliance for PII and other data types, as well as legal concerns. Marco predicted that a federal version of the General Data Protection Regulation (which focuses on data privacy and has counterparts in several states, most notably California and New York) will eventually be adopted within the U.S. Organizations can substantially decrease the risk their data assets carry by focusing on retention policies fundamental to their defensive data strategy, which apply not only to regulations but also to what Marco termed “general legal defense. A lot of companies do a very poor job of purging emails and getting rid of old data sources.”
Lifecycle management is a critical aspect of Digital Asset Management and managing other types of data in which there are clear expiration dates. The complexity of the process is significantly ameliorated by modern centralized repositories. According to Schweer, centralized tools “help inform, hey, should I renew something? So, we can kind of turn risk around and say okay, maybe I should renew this, because this is used 500 different places.” The data provenance capabilities of these options are instrumental for mitigating risk, so when data or content have expired, you know every instance that they have been used, Schweer noted. “And if I need to pull it out immediately, I can pull everything down in an instant.”
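A retention policy of the kind discussed in this section can be expressed as a simple check of each record's age against a limit for its classification. The sketch below is illustrative only: the `RETENTION_DAYS` values are made-up numbers, not drawn from any regulation, and the `Record` type and `expired` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods in days, keyed by data classification.
RETENTION_DAYS = {"email": 365, "pii": 730}


@dataclass(frozen=True)
class Record:
    record_id: str
    classification: str
    created: date


def expired(records, today):
    """Return the records whose age exceeds their classification's limit.

    Records with no configured retention period are left alone.
    """
    flagged = []
    for record in records:
        limit = RETENTION_DAYS.get(record.classification)
        if limit is not None and today - record.created > timedelta(days=limit):
            flagged.append(record)
    return flagged


records = [
    Record("r1", "email", date(2018, 1, 1)),
    Record("r2", "pii", date(2019, 6, 1)),
    Record("r3", "contract", date(2010, 1, 1)),
]
for record in expired(records, today=date(2019, 12, 20)):
    print(record.record_id, "is past its retention period")
```

Passing `today` in explicitly keeps the check repeatable, which matters when a purge decision has to be explained later.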
The data lineage tale
The traceability of data provenance is useful for both the offensive and defensive aspects of data strategy. In terms of defense, organizations must prioritize data lineage to understand where data was used, by whom, and how, in order to ensure, as well as demonstrate to regulators, that regulatory compliance is being met. The same applies to any litigation issues involving risk. Provenance is also one of the best tools for reusing data across multiple use cases to enhance its value. In this respect, a data strategy best practice is to equip data assets with what Schweer called a “unique identifier,” which is immensely helpful in facilitating provenance and some of its advantages for offense.
Understanding the journey of data throughout the enterprise, and as deployed outside the enterprise, is not only useful for reusing it across use cases, but also enables organizations to “track how many times I’m using it so I can really get a good sense of, hey, that’s a really good image there,” Schweer observed. In terms of content, metrics based on the multiple elements of provenance are critical for reusing assets for targeted campaigns, specific audiences, and increased personalization so organizations can “start to really optimize my spend and the pieces of content that I have,” Schweer added. Provenance for data assets delivers the same advantages for reusing data across use cases. It’s also essential for ensuring machine learning models function in production as they did during training. In almost all instances, provenance is revealed by metadata.
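The idea of a unique identifier that makes usage traceable can be sketched as a small log keyed by that identifier. Everything named here (`ProvenanceLog`, `register`, `record_use`, and the sample locations and users) is hypothetical and stands in for whatever a real repository provides; the identifier is generated with Python's standard `uuid` module.

```python
import uuid


class ProvenanceLog:
    """Toy record of where, by whom, and how each asset has been used."""

    def __init__(self):
        self._uses = {}

    def register(self):
        # Issue a unique identifier for a new asset and start its history.
        asset_id = str(uuid.uuid4())
        self._uses[asset_id] = []
        return asset_id

    def record_use(self, asset_id, location, user, purpose):
        # Append one usage event; unknown identifiers raise KeyError.
        self._uses[asset_id].append((location, user, purpose))

    def use_count(self, asset_id):
        # Offense: how often is this asset being reused?
        return len(self._uses[asset_id])

    def locations(self, asset_id):
        # Defense: every place the asset must be pulled from if it expires.
        return sorted({location for location, _, _ in self._uses[asset_id]})


log = ProvenanceLog()
image_id = log.register()
log.record_use(image_id, "homepage", "alice", "spring campaign")
log.record_use(image_id, "newsletter", "bob", "spring campaign")
print(log.use_count(image_id), log.locations(image_id))
```

The same history serves both sides of the strategy: the count suggests which assets are worth reusing, and the list of locations shows what has to come down when an asset expires.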
Strategic outcome: data quality
Data quality has a reciprocal relationship with data strategy. The boons of trustworthy, recent, de-duplicated data are a verifiable output of successful data strategy. The reliability of any AI or IoT deployment hinges on whether or not data quality levels are met. From a defensive perspective, data quality typifies cost concerns, since redundancies and inaccuracies escalate operational expenses of maintaining IT systems. “People are using that data today to make decisions on their companies, and it’s leading them to the wrong decisions,” Marco suggested.
However, when organizations are able to act on the basic precepts of data strategy, data quality becomes immensely useful for offense, reusing data assets, and generating value from data. Simply standardizing all the different fields and terminology of data elements (such as dates, names, etc.), and employing this standardization for classifications and taxonomies, substantially diminishes duplications and inaccuracies. By implementing the proper data governance processes for delineating how people use data, monitoring lifecycle concerns, and tracking data provenance, organizations can leverage quality data for the determinations necessary to exploit machine learning and the IoT.
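To see why standardizing fields reduces duplication, consider a minimal sketch that normalizes names and dates before comparing records. The accepted `DATE_FORMATS`, the function names, and the sample rows are assumptions made for illustration; real data would need a broader set of rules.

```python
from datetime import datetime

# Hypothetical set of date layouts seen in source systems.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def standardize_date(text):
    """Parse a date in any accepted layout and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def standardize_name(text):
    """Collapse whitespace and ignore letter case."""
    return " ".join(text.split()).casefold()


def deduplicate(rows):
    """Return unique (name, date) pairs in standardized form, in first-seen order."""
    seen = set()
    unique = []
    for name, day in rows:
        key = (standardize_name(name), standardize_date(day))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


rows = [
    ("Jane  Doe", "12/20/2019"),
    ("jane doe", "2019-12-20"),
    ("JANE DOE", "20.12.2019"),
]
print(deduplicate(rows))
```

All three sample rows describe the same person and date, yet none of them match as raw strings; they only collapse to one record after standardization.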
Ultimately, the test for success in the data strategy domain is how well organizations deploy data assets and reuse them. As Schweer observed, “You want to make the minimum amount of content needed to deliver the most valuable experience.”