December 14, 2022
From driver assistance systems to early health care diagnostics, humans are benefiting from insights generated by Artificial Intelligence (AI) and Machine Learning (ML) models. To generate and iterate these ML models, algorithms often comb through copious amounts of raw data, which may include personally identifiable information (PII) or intellectual property (IP). Taking steps to preserve the privacy and confidentiality of that data within AI/ML is crucial to maintaining customer trust and preserving reputation.
There are several approaches for doing this. Among them are well-established encryption algorithms for data at rest and in transit, as well as data abstraction or anonymization. There are also relatively new methods in AI that learn from data without ever moving it.
More recently, methods have emerged to protect data while it is in use, including fast-evolving homomorphic encryption and the more mature confidential computing with trusted execution environments. These can operate at the virtual machine layer, or further down on the silicon itself.
These provide enhanced protection to data and workloads, irrespective of how they are packaged – virtual machines, containers or even bare-metal native applications. Each of these methods operates differently, and their benefits and implementations need to be taken into consideration along with data sensitivity and workload optimization.
To determine the level and location of risk exposure to your AI and ML, you can walk through the entire chain of data, from point of origin through inference. When doing that, consider five key questions: What? Where? Who? Why? How? From there, you can begin to determine how best to protect your models and data.
What collects the data?
Data is often generated at ‘the edge,’ that is, far away from the central learning model that will process the data, learn, and make inferences from it. The edge could, for example, be an imaging machine in a hospital generating an MRI of a patient, a satellite taking pictures of Earth, or a human being talking on her mobile phone as she walks around the city. In each case, the device at the edge is collecting data and doing some processing of the data on site. To trust the data, you must first trust the device that is collecting it.
To assess device security, begin at the hardware layer, and work your way up. Is the device what you think it is? Do you have a mechanism to authenticate the device and the components within it that run active firmware? Do you have digital assurance that the device has not been tampered with between manufacture and provisioning? Do you have a way to verify that the device has been provisioned correctly and is running the version of the firmware and operating system that you expect? Is the device up to date with hardware and software patches, and is it running security and manageability software?
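One of the checks above, confirming that a device is running the firmware you expect, can be sketched in a few lines. This is a simplified illustration, not hardware attestation: the manifest, version string, and image bytes are hypothetical, and a real deployment would rely on signed measurements anchored in a hardware root of trust rather than a dictionary in application code.

```python
import hashlib
import hmac

# Hypothetical manifest of approved firmware: version -> SHA-256 digest of the image.
# In practice this would come from a signed source that you already trust.
APPROVED_FIRMWARE = {
    "1.4.2": hashlib.sha256(b"firmware-image-1.4.2").hexdigest(),
}


def firmware_is_expected(reported_version: str, image_bytes: bytes) -> bool:
    """Return True only if the version is approved and the image hash matches."""
    expected = APPROVED_FIRMWARE.get(reported_version)
    if expected is None:
        # Unknown or unapproved version: do not trust the device.
        return False
    actual = hashlib.sha256(image_bytes).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(actual, expected)
```

A device that reports an approved version but presents a modified image fails the check, as does a device reporting a version that is not in the manifest.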
Where will data travel?
In a Federated Learning model, raw data will remain where it originated, for example at the hospital where the patient was imaged, or locally, on the phone. Model aggregators ferry insights from the data to the central model. In this case, at minimum you should consider how to protect the aggregators to preserve the integrity of the model itself.
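To make the aggregator's role concrete, here is a minimal sketch of one common aggregation rule, weighted averaging of client model weights (often called federated averaging). The article does not say which rule a given system uses, so treat the function name and the flat list-of-floats representation as assumptions for illustration only.

```python
from typing import List, Tuple


def federated_average(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Combine client weight vectors into one, weighted by local sample count.

    Each update is (weights, num_samples). Only these weights leave the
    client; the raw data they were trained on stays where it originated.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one client must contribute samples")
    size = len(updates[0][0])
    averaged = [0.0] * size
    for weights, n in updates:
        if len(weights) != size:
            raise ValueError("all clients must send vectors of the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged
```

Because the central model is built entirely from what the aggregator receives, a tampered or spoofed update flows straight into the result, which is why protecting the aggregators matters.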
However, it is far more common in AI and ML for data to move to a central learning model for further processing and analysis. Encrypting data while in transit is a well-established practice. Future proofers will want to take into consideration how long their products will operate in the world, and whether that timeframe could intersect with quantum computing.
There are steps you can take now that offer some protection, including using longer encryption keys or sending the encryption key in an out-of-band channel, separate from the encrypted data itself.
Who has access to the data?
Once you have taken the steps above, you will need to verify the identity of any person who has access to the device. This includes end users and administrators. Multi-factor authentication has become commonplace, and biometrics are becoming increasingly user friendly. Technology is expanding from widely used fingerprint and facial recognition to typing patterns and even heart rhythms. Even without advanced verification tools, you can easily apply these best practices: Set strong passwords. Allow access only to those people who require it.
Even then, apply what is known as the principle of least privilege: Allow access only to that portion of data required by those who need to complete the task at hand. There are new advances in Confidential Computing that are designed to let even verified administrators run public clouds without ever being able to access the data, even while it is being processed.
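The principle of least privilege can be illustrated with a small filter that returns only the fields a given role needs. The role names and field names here are hypothetical, and a production system would enforce this in the data store or access layer rather than in application code alone.

```python
# Hypothetical roles and the only fields each one needs for its task.
ROLE_FIELDS = {
    "radiologist": {"scan_id", "image_uri", "clinical_notes"},
    "billing": {"scan_id", "invoice_total"},
}


def view_for(role: str, record: dict) -> dict:
    """Return only the portion of the record that the role is allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles get nothing
    return {key: value for key, value in record.items() if key in allowed}
```

Defaulting unknown roles to an empty view means that a missing or misspelled role results in no access rather than full access.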
Why are you collecting the data?
This is both an old school and a bleeding edge question. Data can be a gift or a liability. The least risky approach is never to collect or store any PII or proxies for PII that could be used to infer PII. If you can abstract or differentiate or separate PII from other parts of the dataset, then do so.
Other simple questions to ask are, “Do I require this information or is it a nice to have? What would I do differently if I did not have this information?” Ultimately, collect as little data as possible to accomplish the task, and store even less.
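If you do need to keep PII, one way to separate it from the rest of the dataset is to split each record in two and link the halves with a keyed pseudonym. The sketch below assumes a hypothetical set of PII field names and a secret key that is stored apart from the training data. Note that this is pseudonymization, not anonymization: the remaining fields can still act as proxies for PII and need their own review.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as PII in this dataset.
PII_FIELDS = {"name", "email", "phone"}


def split_record(record: dict, id_field: str, secret_key: bytes):
    """Split a record into a PII part and a feature part linked by a pseudonym."""
    # Keyed hash of the identifier: without the key, the pseudonym cannot be
    # recomputed from a guessed identifier.
    pseudonym = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    pii = {k: v for k, v in record.items() if k in PII_FIELDS or k == id_field}
    features = {
        k: v for k, v in record.items() if k not in PII_FIELDS and k != id_field
    }
    pii["pseudonym"] = pseudonym
    features["pseudonym"] = pseudonym
    return pii, features
```

Only the feature part would be passed to the learning pipeline. The PII part and the key stay in a separate, more tightly controlled store, or are never retained at all.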
The latest thinking on responsible AI encourages data scientists to consider the reasons for data collection, who will have access to the data, and whether the questions asked are pertinent and fair to the people who are responding.
Some very basic questions to start this process include, “Do I have permission to collect this data? Have I explained what the intended uses are and any risks to the people implicated? Am I able to delete the data upon request? Am I able to share my inferences from the data with those who provided the data? Am I structuring the data and asking questions of the data in a manner that is consistent with the desires of the people who are providing the data?”
Data privacy and risk assessment are key to ensuring a healthy and productive AI. If you understand the risks, you can better protect against them, and use the AI and ML to improve overall cybersecurity and meet Zero Trust requirements. If you want to learn more about data privacy and risk or AI, check out https://partnershiponai.org/.
How can data be protected when running in the cloud?
As noted already in this article, most machine learning models run in a cloud environment. Clouds can be on premises, but for many reasons including flexibility, scalability, economics, and security, organizations are increasingly moving workloads, including machine learning, to public clouds with shared infrastructure. In some cases, ML models are expressly run in public clouds to facilitate multi-party collaboration.
With the advent of Confidential Computing, data that was once processed in the clear in such environments can now be processed from within a Trusted Execution Environment. Before moving sensitive data or workloads to a public cloud environment, be sure to inquire whether the cloud provider can provide a Confidential Computing environment. For an industry-wide definition, check out the Confidential Computing Consortium at https://confidentialcomputing.io/.