Securing AI systems: Addressing data piling and data access in AI training environments
July 13, 2021
For AI and analytics, data is what fuel is for your car: the more you have, the further and faster you can go. There is just one big difference: you burn the fuel, but the data remains in your systems.
A data science team piles up massive amounts of data within just a few months – a security nightmare. An unscrupulous competitor only has to turn one data scientist against you – and they have full access to all your commercial data and IP.
Consequently, this article has one ambition: elaborate on reducing this security risk without blocking the daily work of the data scientists and the AI organization.
Three measures are the ticket to success: generic system hardening, governance processes, and compartmentalization. The latter two require innovative thinking from traditional IT security experts and data scientists alike.
Generic system hardening
Generic system hardening covers traditional IT security measures such as network zones, firewalls, and IAM integration. It was and remains mandatory for “normal” software, and it applies equally to any AI system or component. With it, AI organizations reduce the security risks associated with AI run-time servers. These servers run the AI models for other applications, e.g., for webshops. Securing them means applications can only invoke the AI models they are supposed to use, and attackers cannot (easily) break into a server.
The other two measures are AI-specific. They require a closer look at how the AI organization works and the systems it uses. The training environment is where all the AI magic happens. Data scientists (with the support of data engineers or AI translators) take the data from their data collection or data lake. They create models in training areas (e.g., Jupyter notebooks on VMs) and put the AI models, scripts, and documentation in a repository (Figure 1). The challenge is to give data scientists access to all the data they need while preventing unnecessary access – without creating a useless bureaucratic monster.
Figure 1: Understanding AI Training Environments
The distinction between adequate and necessary data access on the one hand and unnecessary and dangerous access on the other requires human judgment. Governance processes structure the decision-making, make its steps transparent, and guarantee repeatable outcomes. This decision-making has two aspects. First, is it adequate to load the data into the training environment (data upload governance process)? Second, is the use in a particular use case – and the use case itself – acceptable (usage governance)? (Figure 2)
The separation into two parts is a cost optimization. It eases reuse without compromising the approval aspects. The first project needing telemetry data from a new turbine implements the data transfer and preparation and goes through the upload approval process. Ideally, ten or twenty later projects can reuse the data. They only need approval to use the data themselves; they can skip the upload approval and reuse the implemented and approved data copy and preparation infrastructure, saving time and money.
Figure 2: Compartmentalization and Governance Processes for AI Training Environments
The data upload governance process examines, from an approval perspective, whether particular data may reside in the AI training environment. Data owners, data privacy officers, legal teams, and IT security typically drive the decision-making. Obvious concerns relate to data privacy when AI training environments are in a different jurisdiction than the humans whose data the company analyzes. A second concern relates to companies in highly competitive and sensitive industry sectors or to governmental agencies. They might hesitate to put every piece of information in every public cloud, no matter how good its AI features are.
While the data upload governance process has this kind of control function, it also helps make training data more valuable by easing (re-)use. The process can act as a checkpoint for documentation in the data catalog. Here, the company’s data governance function can enforce that data only enters the training environment when the metadata is complete. Data scientists must describe the content of their data, assess and document its quality, clarify potential usage restrictions, and analyze its lineage. This governance process is the shortcut to high-quality data catalogs for all data sets, at least within the AI training environment.
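As an illustration, such a catalog checkpoint can be sketched as a simple completeness gate. The `DatasetMetadata` fields and the `missing_metadata` helper below are hypothetical names for this sketch, not part of any specific catalog product:

```python
from dataclasses import dataclass

@dataclass
class DatasetMetadata:
    """Catalog entry a data set must carry before entering the training environment."""
    name: str
    description: str = ""          # what the data contains
    quality_assessment: str = ""   # documented data quality check
    usage_restrictions: str = ""   # clarified legal/contractual limits
    lineage: str = ""              # where the data comes from

REQUIRED_FIELDS = ["description", "quality_assessment", "usage_restrictions", "lineage"]

def missing_metadata(entry: DatasetMetadata) -> list:
    """Return the catalog fields still empty; an empty list means the upload may proceed."""
    return [f for f in REQUIRED_FIELDS if not getattr(entry, f).strip()]
```

An upload pipeline would call `missing_metadata` and reject any data set whose list is non-empty, which is exactly how the governance checkpoint enforces complete catalog entries.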
However, there is also the option to discuss and implement measures to make data less sensitive. For example, shopping cart information related to individual customers is much more problematic than anonymous shopping cart data. The upload to the training environment is the natural point for anonymizing data, thereby reducing the overall risk exposure of the AI training environment and its data.
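A minimal sketch of such a de-identification step at upload time might look as follows. Note the hedge: salted hashing pseudonymizes rather than fully anonymizes, so real deployments typically need stronger techniques (aggregation, k-anonymity, or dropping identifiers entirely). The field names are illustrative:

```python
import hashlib

def deidentify_cart(record: dict, salt: str) -> dict:
    """Drop direct contact details and replace the customer ID with a salted hash.

    This is pseudonymization, not full anonymization: with the salt, the
    mapping is reversible by brute force over known IDs.
    """
    # Remove fields that directly identify the customer.
    out = {k: v for k, v in record.items() if k not in {"customer_name", "email"}}
    # Replace the customer ID with a stable, salted pseudonym.
    digest = hashlib.sha256((salt + str(record["customer_id"])).encode()).hexdigest()
    out["customer_id"] = digest[:16]
    return out
```

Running the records through such a function during the upload to the training environment keeps the shopping cart structure usable for modeling while removing the most sensitive attributes.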
Ideally, the usage-related governance process is part of the organization’s management and approval processes for new (AI) projects. Following the data upload and before an actual use case implementation and AI model training, it focuses more on approvals than on enablement. It questions whether the intended data usage or the envisioned AI model violates data privacy laws, internal ethics guidelines, or any other policy or regulation. Again, data owners, data privacy officers, or legal and compliance teams are the natural decision-makers.
Compartmentalization balances two contrary wishes: fine-granular data access control and manageability. Achieving both at the same time is possible for repositories and the AI model training areas. If a specialist works on a project, they get access to that project’s training data; otherwise, they do not. Also, figuring out which application has to get models from a particular repository is straightforward. The challenge is access management for data lakes or the AI organization’s data collection within the AI model training environment.
Access to all data for every data scientist is not an option. Neither is managing data access separately for thousands of data sets and data classes. The latter might be a surprise. Super-fine-granular access rights appeal, at least at first glance. However, data scientists and data owners cannot work with them on a daily basis; the complexity is too high for the human mind. As a result, they would stop following the need-to-know principle to prevent the AI organization from being blocked. They would switch to a model of requesting and approving access to any data they cannot rule out needing. So, super-fine-granular access control looks good on paper but is likely to fail in reality.
Suppose a data owner from the business knows the data and the data model very well. Such an owner can handle ten or twenty subsets, not one hundred and certainly not thousands. AI organizations need a compartmentalization approach that limits the number of data access control roles and adjusts them to the organization’s actual size and complexity. A starting point is a three-dimensional compartmentalization approach (Figure 3). It works for medium enterprises and scales even to the largest corporations in the world.
The first dimension is the sensitivity categorization, typically with four levels: public, internal, confidential, and secret. Public data is (or could be made) available on the web, e.g., statistics published by public health organizations or marketing brochures. Then, there is internal data, meaning every employee (with a good reason) can access it. Customer contact information is one example. So is a categorization of whether a customer is a top or a standard customer, or one who frequently causes trouble. Complex offerings in a B2B context or a WHO tender are examples of confidential data. Deal or no deal can have a significant impact on the company’s bottom line, and any disclosure of the offer to competitors would likely mean losing the deal. Finally, there is secret data. Often, critical passwords or (master) keys for encrypting and signing files belong to this category. Information potentially impacting a corporation’s stock exchange price might – before disclosure – fall into this category. The latter examples also illustrate that data sensitivity can change over time: once such information is out, it becomes public data.
Additional standards and regulations impact the data sensitivity classifications for some sectors. Examples are GDPR, HIPAA, or PCI-DSS. First, they can influence the categorization of data into these four levels. For example, Monaco and Channel Island bank customer data is “secret,” other customer data “confidential.” Alternatively, organizations can introduce flags for each standard and regulation in addition to the four sensitivity levels. These flags state whether data falls into an additional dedicated category.
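The four sensitivity levels plus regulation flags can be modeled directly. The class names, the flag strings, and the `requires_deep_review` helper below are illustrative assumptions for this sketch, capturing the idea that low-sensitivity, unregulated data can take an approval shortcut:

```python
from dataclasses import dataclass, field
from enum import Enum

class Sensitivity(Enum):
    """The four typical sensitivity levels, ordered from least to most sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    SECRET = 4

@dataclass
class DataClassification:
    """Sensitivity level plus flags for applicable standards and regulations."""
    sensitivity: Sensitivity
    regulation_flags: set = field(default_factory=set)  # e.g. {"GDPR", "HIPAA", "PCI-DSS"}

def requires_deep_review(c: DataClassification) -> bool:
    """Confidential/secret or regulated data needs profound analysis; the rest can take a shortcut."""
    return c.sensitivity.value >= Sensitivity.CONFIDENTIAL.value or bool(c.regulation_flags)
```

For example, internal data without flags passes the fast path, while internal customer data flagged as GDPR-relevant triggers the deeper review.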
The second dimension is the data or business domain. Is it HR, Marketing, or R&D-related data? Data owners are responsible for one or more domains. The domain helps to involve the right persons in the governance processes. It is also the dimension on which the company size has the most significant impact, because it determines the approval granularity. Is HR one big domain, or does an organization work on a sub-domain level distinguishing various types of HR data, such as recruiting, payroll, insurance, performance management, and training? As a rule of thumb: the more sensitive the data, the more it benefits from fine-granular approvals.
The third dimension reflects the company’s organizational structure and the various jurisdictions in which it operates and to which data might belong. Different subsidiaries might have different internal rules, and different jurisdictions regulate data storage, data usage, and artificial intelligence differently.
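One way to operationalize the three dimensions is to derive one access-control role per combination of domain, organizational unit or jurisdiction, and sensitivity level. The naming scheme below is a hypothetical sketch, not a standard:

```python
def role_name(domain: str, org_unit: str, sensitivity: str) -> str:
    """Derive a deterministic role name from the three compartmentalization dimensions."""
    parts = [domain, org_unit, sensitivity]
    return "train-data_" + "_".join(p.lower().replace(" ", "-") for p in parts)

# With, say, 15 domains, 5 organizational units, and 4 sensitivity levels,
# the scheme yields at most 15 * 5 * 4 = 300 roles -- a bounded number,
# unlike per-data-set permissions for thousands of data sets.
```

For instance, `role_name("HR Payroll", "CH", "Confidential")` yields one role that a data owner can grant to exactly the specialists who need Swiss payroll data.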
To conclude, domains, organizational structures, and jurisdictions are natural ways to divide work, route requests in governance processes, and determine quickly which rules and regulations apply. The data sensitivity dimension allows for assessing quickly whether a request requires a more profound analysis or can take a shortcut. These dimensions allow for compartmentalization and right-sizing the granularity for data upload and usage approvals. The approval processes are the organizational counterpart to the actual access control measures. Together, they actively mitigate security challenges related to data lakes and extensive data collections within an AI organization’s training environments.
Figure 3: Compartmentalization Dimensions for Data Lakes in AI Training Environments
Klaus Haller is a Senior IT Project Manager with in-depth business analysis, solution architecture, and consulting know-how. His experience covers Data Management, Analytics & AI, Information Security and Compliance, and Test Management. He enjoys applying his analytical skills and technical creativity to deliver solutions for complex projects with high levels of uncertainty. Typically, he manages projects consisting of 5-10 engineers.
Since 2005, Klaus has worked in IT consulting and for IT service providers, often (but not exclusively) in the financial industry in Switzerland.