AI and Privacy
Artificial intelligence (AI) is used extensively in domains such as finance, automotive, health tech, and manufacturing. In general, the more data is available, the better AI models perform. But this voracious need for data must not compromise the privacy of individuals. The state of security and privacy in AI adoption has been widely debated ever since incidents such as the Cambridge Analytica data scandal came to light.
Anonymizing private data seems like a viable way of protecting privacy, but it has limitations. If the dataset to be studied stays within an organization or a department, anonymization may suffice to protect sensitive private information. If the dataset is to be released to the public or to a different organization, anonymization has serious limitations. Even if all the private information in a database is anonymized, an attacker who has access to a similar or related public database may be able to recover the private information by cross-referencing the two datasets.
Data anonymization case studies
Netflix released an anonymized dataset as part of a competition to build a recommendation system. A linkage attack that cross-referenced it with public IMDb data re-identified some of the users in the anonymized dataset. In another case, the medical records of a Massachusetts governor were identified in a de-identified database by matching it against the voter registration list.
These examples show the ease with which data scientists can re-identify individuals within de-identified data.
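A linkage attack of this kind can be sketched in a few lines. The example below is a minimal illustration, not a reconstruction of either case study: the records, names, and the choice of quasi-identifiers (ZIP code, birth date, sex) are all made up, and `link_records` is a hypothetical helper.

```python
# Toy "de-identified" medical records: names removed, quasi-identifiers kept.
# All values are fabricated for illustration.
medical_records = [
    {"zip": "11111", "birth_date": "1960-01-01", "sex": "F", "diagnosis": "asthma"},
    {"zip": "22222", "birth_date": "1975-05-05", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "33333", "birth_date": "1980-09-09", "sex": "F", "diagnosis": "hypertension"},
]

# Toy public list (for example, a voter roll) that still carries names.
voter_list = [
    {"name": "Alice Example", "zip": "11111", "birth_date": "1960-01-01", "sex": "F"},
    {"name": "Bob Example", "zip": "22222", "birth_date": "1975-05-05", "sex": "M"},
    {"name": "Carol Example", "zip": "33333", "birth_date": "1980-09-09", "sex": "F"},
    {"name": "Dana Example", "zip": "33333", "birth_date": "1980-09-09", "sex": "F"},
    {"name": "Eve Example", "zip": "44444", "birth_date": "1990-02-02", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(deidentified, public, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records whose quasi-identifiers
    match exactly one row of the public list."""
    # Index the public list by its quasi-identifier tuple.
    index = {}
    for row in public:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the person
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


if __name__ == "__main__":
    for name, diagnosis in link_records(medical_records, voter_list):
        print(name, "->", diagnosis)
```

In this toy data the third medical record matches two people in the public list, so it is not uniquely re-identified. The fewer people share a combination of quasi-identifiers, the more likely a record can be linked back to a name.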
The goal of differential privacy is to ensure that statistical analyses of a dataset do not compromise the privacy of the individuals in it. A modern definition of privacy, given in the book The Algorithmic Foundations of Differential Privacy, is as follows:
“Differential privacy describes a promise, made by a data holder, or curator, to a data subject: You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”
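This promise is usually made precise with the standard definition of ε-differential privacy. The symbols M, D1, D2, and S are introduced here for illustration and do not appear in the quote: a randomized mechanism M is ε-differentially private if, for every pair of databases D1 and D2 that differ in a single record and for every set S of possible outputs,

```latex
\[
\Pr[\, M(D_1) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D_2) \in S \,]
\]
```

The smaller ε is, the closer the two output distributions must be, so the presence or absence of any one person's data changes the result of the analysis very little.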
Local differential privacy
In local differential privacy, noise is added to each data point in the dataset. Differential privacy always requires some form of randomness to protect against attacks such as a differencing attack, in which an attacker compares the results of two queries that differ only in whether one individual is included. Let us illustrate this with an example. Suppose we want to find the percentage of the population in some locality that has a history of a certain disease. To ensure local differential privacy, we collect the data using the following procedure.
For each subject, while collecting the data:
- Flip a coin.
- If the result is heads, store the actual response in the dataset.
- If the result is tails, flip the coin again. If the second result is heads, store the actual response in the dataset. Otherwise, store the opposite of the actual response.
This adds a controlled amount of randomness to the dataset, so that each individual is protected by plausible deniability: any recorded answer could be the result of the coin flips rather than the person's true response.
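The coin-flip procedure above can be simulated as follows. This is a minimal sketch that assumes yes/no responses and a fair coin; the function names are hypothetical. Under the procedure, a response is stored truthfully with probability 3/4 and flipped with probability 1/4, so if p is the true fraction of "yes" answers, the expected observed fraction is 0.25 + 0.5p. Solving for p gives the correction used in `estimate_true_rate`.

```python
import random


def randomized_response(truth, rng):
    """Apply the two-coin procedure to one yes/no (True/False) answer."""
    if rng.random() < 0.5:  # first flip: heads, store the actual response
        return truth
    if rng.random() < 0.5:  # second flip: heads, store the actual response
        return truth
    return not truth        # second flip: tails, store the opposite


def estimate_true_rate(reported):
    """Recover an estimate of the true 'yes' rate from the noisy responses.

    Observed rate = 0.75 * p + 0.25 * (1 - p) = 0.25 + 0.5 * p,
    so p = 2 * observed - 0.5.
    """
    observed = sum(reported) / len(reported)
    return 2 * observed - 0.5


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    # Hypothetical population in which 30% have a history of the disease.
    population = [rng.random() < 0.3 for _ in range(100_000)]
    reported = [randomized_response(answer, rng) for answer in population]
    print("true rate:     ", sum(population) / len(population))
    print("estimated rate:", estimate_true_rate(reported))
```

The estimate is noisier than one computed from the raw answers, which is the price paid for protecting each individual response.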
Global differential privacy
In global differential privacy, by contrast, the training database is not altered. Instead, randomness or noise is added to the result of each query that is run on the original data. Because the noise is added once to the query result rather than to every record, this approach gives more accurate results than local differential privacy at the same level of privacy protection. In short, the database curator keeps the original database intact and releases only noisy results to external parties.
The Laplace mechanism adds noise drawn from a Laplace distribution to the output of a function or query.
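Let d be a positive integer, let $\mathcal{D}$ be a collection of databases, and let $f: \mathcal{D} \to \mathbb{R}^d$ be a function. The sensitivity of the function f is given by
$$\Delta f = \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_1,$$
where the maximum is taken over all pairs of databases $D_1, D_2 \in \mathcal{D}$ that differ in at most one record, and $\lVert \cdot \rVert_1$ denotes the L1 norm.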
The Laplace noise is parameterized by a scale b (beta), given by b = sensitivity / ε, where ε is a constant, often called the privacy budget, that bounds how much information the query is allowed to leak. A smaller ε means more noise and stronger privacy.
Noise drawn from a Laplace distribution with scale b is then added to the true query result before it is released.
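As a minimal sketch of the Laplace mechanism, consider a counting query. Adding or removing one person changes a count by at most 1, so the sensitivity is taken to be 1 here. The function names and the toy data are assumptions made for illustration. The noise is sampled using only the standard library, relying on the fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with scale b.

```python
import random


def laplace_noise(scale, rng):
    """Sample from a Laplace distribution with location 0 and the given scale.

    The difference of two independent exponential variables with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer 'how many records satisfy predicate?' with Laplace noise."""
    sensitivity = 1.0              # one person changes a count by at most 1
    scale = sensitivity / epsilon  # b = sensitivity / epsilon
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical records: True means the person has a history of the disease.
    records = [True] * 40 + [False] * 60
    for epsilon in (0.1, 1.0, 10.0):
        answer = private_count(records, lambda r: r, epsilon, rng)
        print(f"epsilon={epsilon}: noisy count = {answer:.2f}")
```

Each call spends privacy budget, so answering the same query many times and averaging the results would gradually erode the protection. A real deployment would need to track the total ε spent.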
Differential privacy for deep learning
Perfect privacy in the context of deep learning would mean that removing any single data point or row from the dataset and retraining yields a model comparable to the one trained on the original dataset. In practice, neural models rarely converge to the same weights even when trained twice on the same dataset, which makes it hard to demonstrate the privacy of a neural model by comparing trained models directly.
Let us take an example to show how a differentially private deep learning model can be developed. Suppose we want to build a neural network for classifying medical images. Hospitals may have privacy concerns about sharing their patient data, so instead of collecting annotated data from N different hospitals, we can proceed as follows to obtain a model that protects the privacy of the participating population.
- Develop a neural network and train it separately on the dataset available at each hospital. We now have N trained models, one per hospital.
- Use each of the N models to predict on our own dataset, generating N candidate labels for each data point.
- For each data point, perform a differentially private query over the N candidate labels to pick its final label. The query here is a max function, where max returns the label most frequently assigned by the N models, with noise added to the vote counts.
- Train a new model on our dataset using the labels generated in the previous step.
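The label-aggregation step above can be sketched as a noisy-max query over the models' votes. This is a minimal illustration under stated assumptions: labels are integers from 0 to num_classes - 1, Laplace noise with scale 1/ε is added to each vote count, and the function names are hypothetical. The precise privacy guarantee of such a scheme depends on how the noise scale and the number of queries are accounted for, which is not covered here. The overall approach resembles what the literature calls PATE (Private Aggregation of Teacher Ensembles).

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_max_label(votes, num_classes, epsilon, rng):
    """Return the label with the highest noisy vote count.

    `votes` holds one predicted label per hospital model for a single
    data point.
    """
    counts = Counter(votes)
    noisy_counts = [
        counts[label] + laplace_noise(1.0 / epsilon, rng)  # illustrative scale
        for label in range(num_classes)
    ]
    return max(range(num_classes), key=lambda label: noisy_counts[label])


def aggregate_labels(votes_per_point, num_classes, epsilon, rng):
    """Label every data point in our dataset with a noisy-max query."""
    return [
        noisy_max_label(votes, num_classes, epsilon, rng)
        for votes in votes_per_point
    ]


if __name__ == "__main__":
    rng = random.Random(3)
    # Hypothetical votes from N = 100 hospital models on two images.
    votes_per_point = [
        [0] * 80 + [1] * 20,  # strong agreement on class 0
        [0] * 52 + [1] * 48,  # near tie: the noise may flip the outcome
    ]
    print(aggregate_labels(votes_per_point, num_classes=2, epsilon=0.5, rng=rng))
```

When the models agree strongly, the noise rarely changes the winning label. When they are nearly tied, the outcome is less certain, which is exactly where a single hospital's data would otherwise have the most influence.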
Differential privacy is a powerful way to protect privacy and to quantify the associated risks.