Collecting confidential information about users has become a major part of the operational business of many companies. The biggest American tech companies were estimated to spend around $19 billion in 2018 to collect, buy, and analyze personal information about existing and potential customers in order to enhance their knowledge about them.
The more an internet business knows about the preferences of an individual, the better it can recommend products, content, or third-party advertisements. Third-party advertising revenue increased by 16% to $124.1 billion in 2019, making it an ever-growing line of business for existing and upcoming internet platforms such as social media companies like Facebook and TikTok. It is therefore in their interest to personalize the experience for each user to maximize profit. To achieve this, they must model the relationship between an individual user and their interest in certain product categories, which is why they have collected more and more detailed information about their users over the last decade. This made me wonder: what kind of information do they collect, how is it stored, and most importantly, is it safe? While the last concern is part of data security, the former questions are the domain of data privacy. But what is data privacy and how is it preserved?
In general, data privacy should protect any individual from being identified based on the personal data stored in a database, and from being harmed by its statistical output. This goal is met if the curator of the database knows no more about an individual in the dataset than before having access to it. The first cornerstone of data privacy, preventing re-identification, requires some form of anonymization. The anonymization of confidential data must happen before any person has access to the data; it is therefore part of the data collection itself. Moreover, the anonymization must guarantee that its output prevents re-identification based on any combination of attributes, or by linking it to another database, at any point in time. But how is this done, and does it really work?
Unfortunately, not always. Recent adversarial attacks by researchers in the field of data privacy have shown that leaving out the most obvious personal traits, such as names and private addresses, is not sufficient to prevent the re-identification of individuals in publicly available datasets. Using linkage techniques that combine queries to multiple datasets potentially containing the same individual, researchers were able to single out individuals from what at first seemed to be anonymized data.
Altogether, anonymization, and therefore data privacy, should not be treated as a property of the data itself, but as a property of the analysis: the output of a statistical analysis is privacy-preserving if the computation itself is privacy-preserving.
The recurring demonstrations of insufficient data privacy techniques have highlighted the need for a technique that can mathematically guarantee how much privacy a statistical output offers. This led to the definition of differential privacy. Differential privacy has proven to be a robust definition that is applicable to any kind of data analysis. Hence, leading tech companies have put effort into developing new algorithms that allow a differentially private analysis of their customers' behavior.
Differential privacy is not an algorithm, but a definition that is applicable to any statistical algorithm
Before we look at differential privacy, we will look at anonymization techniques that were, and still are, commonly used in the majority of statistical analyses. More importantly, I will outline why they fail to meet the high standards of data privacy and thus why we need differentially private algorithms instead.
Why conventional anonymization techniques fail
A common technique intended to preserve data privacy is to leave out personally identifiable information such as the age or the private residential address. The first problem with this technique is the missing definition of a personally identifiable attribute. In some contexts, the age of a person can be personally identifiable; in others, it is not. A similar example is the zip code of the residential address. Both features can be strong predictors in certain use cases and can significantly improve the accuracy of a statistical model, such as age in medical studies and zip code in socio-economic scenarios. Leaving them out in the name of data privacy may hurt the performance of the statistical analysis, although it might also increase the privacy of each subject. No mathematical definition is given that would allow a statistically grounded decision. This highlights the underlying problem of this technique: a lack of privacy quantification. Without a mathematical foundation, the data collector cannot quantify the privacy loss linked to the decision of removing, or keeping, a feature.
Another problem that cannot be solved by simple anonymization is caused by the amount and detail of information collected about each subject. A combination of attributes can identify a subject even when no single attribute does. The feature richness of a single dataset decreases everyone's privacy by allowing the creation of a personal profile that is unique to a single subject. The chance of creating a unique profile from a single dataset increases with the number of features holding personally identifiable information. Furthermore, with each added feature, the chance of sharing personally identifiable features with some other dataset increases as well. This makes any dataset vulnerable to linkage attacks, which combine information extracted from different datasets that share features in order to identify a single subject. As an example, the medical records of the governor of Massachusetts were identified by linking anonymized medical encounter data with the voter registration records. To demonstrate the insufficiency of plain anonymization of a single dataset, two journalists identified a single person from a publicly available dataset provided by AOL that stored around 20 million search requests. They inferred the marital status, number of pets, region of living, age, and career status from the search log of a specific user, and linked this information with a regional register to identify the person.
To enhance privacy, anonymized datasets that are accessible via queries can be further protected by auditing. Auditing tries to identify pairs of queries that aim to extract unique combinations of features in order to identify a subject. However, this technique has two weaknesses. First, it can be computationally infeasible, or far too expensive, to decide whether a pair of queries aims to break the privacy promise. Second, refusing a pair of queries can reveal as much information about a subject as the output of the queries itself.
In addition to the lack of a mathematical, robust, and thus quantifiable definition of data privacy, the data holder itself is a privacy risk. A common misunderstanding in the statistical analysis of confidential datasets is the assumption that if the so-called typical subject is protected against privacy loss, it is fair to risk the privacy of a few special subjects whose attributes differ significantly from the rest. However, those subjects would most likely suffer the most from data leakage precisely because of their special attributes, and should therefore be protected the most.
Okay, now that we know what data privacy should not look like, let's have a look at differential privacy. Is it the ultimate guarantee for data privacy?
What is Differential Privacy?
Differential privacy provides a measure of privacy loss and tools to handle it. The essential idea behind differential privacy is to learn nothing about an individual while still identifying patterns in the overall population. If an individual participates in a data collection, the effect that the conclusions drawn in the statistical analysis have on that individual should be the same as if the subject had not participated.
Differential privacy thus states that if the individual consequences of a study's outcome depend solely on the attributes of an individual, and not on whether they participated in the study, the privacy of each subject in the study was not weakened.
Therefore, differential privacy does not claim that subjects who are part of a statistical analysis cannot be negatively affected by its outcome. Nor does differential privacy promise privacy where it did not exist before. The following example illustrates these key ideas.
In a differentially private medical study about lung cancer, each participant specifies whether they are a smoker. The study's outcome indicates that smokers are at higher risk of getting lung cancer. This result leads health insurance companies to increase their premiums for customers who smoke. As a result, every participant in the study who stated that they smoke is financially affected by the study. However, since the premium is raised for all smokers, not just those who participated in the study, the data analysis preserved everyone's privacy: its consequences depend only on the individual's attributes, not on their participation.
How does Differential Privacy work?
At a high level, a differentially private analysis introduces random noise to turn a deterministic statistical analysis into a probabilistic one. In contrast to the deterministic output of a conventional statistical analysis, the output of a probabilistic analysis may vary when executed several times with the same input data, because of the random noise. Consequently, an adversary receives less information from a query to a differentially private database, because they cannot be sure what the true value is. Randomization is the key to guaranteeing any level of privacy, regardless of the medium in which the data is presented (e.g., newspapers, databases, studies). The randomized algorithm should show similar behavior on different input databases.
In more detail, for two datasets that differ in only a single record (called the opt-out scenario in the following), the probability of any given output of a differentially private algorithm differs by at most a multiplicative factor of e^ɛ. This specific property gave differential privacy its name.
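Formally (as introduced by Dwork 2006), a randomized algorithm M is ɛ-differentially private if, for all pairs of datasets D and D′ that differ in a single record and for every set S of possible outputs:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

For small ɛ, e^ɛ ≈ 1 + ɛ, so the two output distributions are nearly indistinguishable.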
The hyperparameter ɛ defines the strength of the privacy an algorithm assures. Setting it to 0 would result in no meaningful information being extractable from the differentially private analysis: in that case, the algorithm would need to ignore all information from any individual in the dataset in order to produce the same output distribution for every opt-out scenario. Hence, tuning the hyperparameter ɛ starts with small values (e.g., 0.1). This allows a statistical analysis to draw meaningful conclusions while preserving a strong privacy guarantee. The hyperparameter ɛ not only defines the strength of the provided privacy but also has a huge impact on the accuracy of the algorithm's outputs: decreasing its value increases the privacy guarantee but decreases the accuracy.
A more mathematical explanation of how differential privacy meets its expectations requires the following definition. As is common in probability theory, an event E is defined as a subset of the potential outcomes of an analysis. If we assume that E occurs with probability p, and with probability p' in the opt-out scenario, differential privacy guarantees that p and p' are close in the following sense:

e^(−ɛ) ≤ p' / p ≤ e^ɛ
Moreover, differential privacy guarantees this bound on the ratio of p and p' for any event E that can be defined for a given statistical analysis. The effect of adding random noise to a statistical analysis can be illustrated with a simple example. Suppose that in a population of n individuals, k individuals correspond to a certain outcome of the analysis. The result of a count query would then not be k, but k + Y, with Y being the random noise. The privacy loss parameter ɛ defines the magnitude of Y. This can be achieved by sampling Y from a suitable distribution, typically a Laplace distribution centered at 0 whose scale is proportional to 1/ɛ. This definition ensures that a low value of ɛ corresponds to large noise values Y, and vice versa.
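A minimal sketch of such a noisy count query, assuming the noise Y is drawn from a Laplace distribution with scale 1/ɛ (the standard choice for a count query, whose sensitivity is 1):

```python
import math
import random

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return true_count + Y with Y ~ Laplace(0, 1/epsilon).

    A count changes by at most 1 when a single record is added or
    removed (sensitivity 1), so scale b = 1/epsilon yields epsilon-DP.
    """
    b = 1.0 / epsilon
    u = rng.random() - 0.5  # uniform on (-0.5, 0.5)
    # Inverse-CDF sampling of the Laplace distribution
    y = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + y

rng = random.Random(42)
samples = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to the true count of 100
```

Individual answers fluctuate around the true count, but the noise averages out, which is exactly the trade-off described above: each single query output protects the individual, while aggregate statistics stay usable.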
Using this concept, a more precise estimate of the expected difference |p − p'| can be constructed by analyzing and adding up two sources of error. First, the sampling error of the count k is of order √k. Second, the added random noise contributes an error of order 1/ɛ. Combining both, the expected difference between p' and p is of magnitude roughly √k + 1/ɛ. Hence, for a large population n, the sampling error is much larger than the noise that was added to preserve privacy, so the privacy-preserving noise barely affects the result. Consequently, recent research has focused on developing more sophisticated techniques for adding random noise to a differentially private analysis than the one shown above, in order to increase the expected accuracy of the analysis.
Differential Privacy in the industry
Due to the previously mentioned strengths of differential privacy, its usage in industry-leading corporations like Apple, Microsoft, and Google is continuously growing. Each company has put effort into developing algorithms that meet the requirements of differential privacy while achieving the desired performance in specific use cases.
Google has introduced the Randomized Aggregatable Privacy-Preserving Ordinal Response, or RAPPOR. This technique allows researchers to study crowdsourced datasets without being able to gain information about any single individual within the set. RAPPOR is based on the randomized response technique that is popular in social science surveys. The authors claim that RAPPOR allows the collection of population-wide statistics without harming individuals' privacy, in particular by strongly preventing linkage attacks.
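The underlying randomized response idea (going back to Warner 1965) can be sketched as follows; the 50/50 coin probabilities and the 30% example rate are illustrative choices, not RAPPOR's exact parameters:

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Flip a coin: heads -> answer truthfully, tails -> answer at random.

    Each respondent can plausibly deny their answer, yet the true
    population rate stays recoverable in aggregate.
    """
    if rng.random() < 0.5:          # heads: tell the truth
        return truth
    return rng.random() < 0.5       # tails: uniform random answer

def estimate_true_rate(responses: list[bool]) -> float:
    """Observed 'yes' rate = 0.25 + 0.5 * true rate, so invert it."""
    observed = sum(responses) / len(responses)
    return (observed - 0.25) / 0.5

rng = random.Random(0)
truths = [rng.random() < 0.3 for _ in range(20_000)]   # ~30% true 'yes'
responses = [randomized_response(t, rng) for t in truths]
print(round(estimate_true_rate(responses), 2))  # close to 0.3
```

No single stored response reveals a respondent's true answer, yet the population rate is recovered with high accuracy, which is the property RAPPOR builds on.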
Apple published their use of differential privacy in a technical report outlining their interpretation and implementation of its principles. They use local differential privacy instead of central differential privacy, which has the simple advantage that the central statistical machine, the server, never receives the unfiltered confidential data, because the data is already randomized on the individual's local device before being sent to the server. The data is privatized with event-level differential privacy, a term that describes the goal of making it indistinguishable whether a data sample relates to the presence or absence of a specific event. Apple states that for each data sample of interest (e.g., an emoji being used), the parameter ɛ that controls the privacy loss is tuned specifically, based on empirical experiments.
Private count mean sketch (CMS) is a variant of the count-min-sketch algorithm. The CMS algorithm outputs a histogram of counts over a list of domain-relevant elements, e.g., how often a specific website was visited within a predefined period of time. Apple has split this process into client-side processing and server-side aggregation. On the device, the differentially private algorithm randomly chooses a hash function, which maps the element to a value that differs from hash function to hash function. A binary vector of predefined size gets a 1 at the position given by the hash value, and each bit of the vector is then flipped with probability 1/(1 + e^(ɛ/2)). This vector, together with the information about which hash function was used to create it, is sent to the server. The server stores a sketch matrix M of size k × m, where k is the number of different hash functions and m is the length of the hash vector. The server adds each newly received hash vector to the row of the corresponding hash function. In the end, M is scaled so that it yields an unbiased estimate of the frequency of each element.
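A simplified, illustrative sketch of the client-side step (the hash family, vector encoding, and scaling details differ in Apple's actual implementation):

```python
import hashlib
import math
import random

def cms_client(element: str, epsilon: float, k: int, m: int,
               rng: random.Random) -> tuple[list[int], int]:
    """Client side of a simplified private count mean sketch.

    Chooses one of k hash functions at random, one-hot encodes the
    element's bucket in a length-m bit vector, and flips every bit
    independently with probability 1 / (1 + e^(epsilon / 2)).
    """
    j = rng.randrange(k)  # index of the randomly chosen hash function
    # Stand-in for the j-th hash function h_j: domain -> {0, ..., m-1}
    digest = hashlib.sha256(f"{j}:{element}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % m
    p_flip = 1.0 / (1.0 + math.exp(epsilon / 2.0))
    noisy = [(1 if i == bucket else 0) ^ (rng.random() < p_flip)
             for i in range(m)]
    return noisy, j

rng = random.Random(7)
vector, j = cms_client("example.com", epsilon=4.0, k=16, m=64, rng=rng)
```

The server only ever sees the noisy vector and the hash index, never the raw element; it accumulates vectors into row j of the sketch matrix and debiases the counts at the end.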
Private Hadamard count mean sketch (HCMS) is the successor of CMS and aims to minimize the transmission overhead caused by bandwidth limitations. Instead of transmitting the complete hash vector, HCMS transmits only a single bit. A single random coordinate is drawn from the hash vector after it has been transformed with the Hadamard basis transformation; its value is flipped with a probability analogous to the one used in CMS and sent to the server together with the coordinate and the index of the hash function.
On the server side, the sketch matrix M is transformed back with the inverse Hadamard transformation, scaled, and aggregated (Apple 2017).
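Because the hash vector is one-hot, its Hadamard transform can be evaluated at a single random coordinate without ever building the full vector. The following is an illustrative sketch, not Apple's exact algorithm; the flip probability 1/(1 + e^ɛ) and the ±1 encoding are assumptions for this example, and m is assumed to be a power of two:

```python
import math
import random

def hcms_client_bit(bucket: int, epsilon: float, m: int,
                    rng: random.Random) -> tuple[int, int]:
    """Transmit one privatized bit of the Hadamard-transformed vector.

    For a one-hot vector e_bucket, the Hadamard transform at column c
    equals (-1)^popcount(bucket & c), so no length-m vector is needed.
    The sampled bit is sign-flipped with probability 1 / (1 + e^epsilon).
    """
    col = rng.randrange(m)  # random coordinate to transmit
    v = -1 if bin(bucket & col).count("1") % 2 == 1 else 1
    if rng.random() < 1.0 / (1.0 + math.exp(epsilon)):
        v = -v  # privatizing flip
    return v, col

rng = random.Random(3)
bit, col = hcms_client_bit(bucket=5, epsilon=4.0, m=64, rng=rng)
```

Only one sign bit and one coordinate cross the network per report, which is the bandwidth saving HCMS is designed for.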
Both algorithms require knowledge of the number of domain elements, e.g., the number of emojis. However, this space is too large or unknown for many applications. Thus, Apple developed a pre-processing step, called private sequence fragment puzzle, which downsamples the number of domain elements to a finite, computationally acceptable size (Apple 2017).
Microsoft has focused their research on the problem of composition in differential privacy. They argue that differentially private algorithms struggle to deliver their guaranteed privacy when used repeatedly, which is common practice when collecting data from customers over time. Local differential privacy (LDP) models, also known as randomized response models, have the property that the output of the algorithm reveals little about any specific input. The main change they made to existing randomized response models was adding a rounding technique, controlled by a hyperparameter α, that enables memoization in the context of private data collection. Memoization is a popular technique that caches the results of recurring function calls to speed up computation; under the assumption that the user does not change their habits, memoization preserves privacy under repeated collection. Their algorithm consists of three main steps: first, an LDP algorithm (such as HCMS) extracts the confidential data; afterwards, the α-rounding is applied to the extracted information in a randomized manner, prior to the last step, memoization.
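The memoization idea can be sketched as follows; the 50/50 coin and the boolean input are illustrative simplifications, not Microsoft's exact α-rounding mechanism:

```python
import random

class MemoizedResponder:
    """Caches the randomized answer per distinct input value, so that
    repeated collection of an unchanged value always sends the same
    response and therefore leaks nothing new over time."""

    def __init__(self, seed: int):
        self._rng = random.Random(seed)
        self._cache: dict[bool, bool] = {}

    def respond(self, truth: bool) -> bool:
        if truth not in self._cache:
            if self._rng.random() < 0.5:          # heads: truthful
                self._cache[truth] = truth
            else:                                 # tails: random answer
                self._cache[truth] = self._rng.random() < 0.5
        return self._cache[truth]

user = MemoizedResponder(seed=11)
first = user.respond(True)
# The same input always yields the same randomized response:
assert all(user.respond(True) == first for _ in range(100))
```

Without memoization, averaging many fresh randomized responses from the same user would wash out the noise and expose the true value; caching one response per value closes exactly that composition leak.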
Data privacy has come a long way, from simple anonymization techniques to the robust and quantifiable definition of differentially private algorithms. To date, differential privacy is the only concept in data privacy that guarantees a predefined privacy loss while being applicable to all kinds of statistical algorithms. It is important to remember that differential privacy itself is a definition, not an algorithm. The current state of differential privacy still has some hurdles to overcome before it is applicable to all statistical analyses, though, one of the biggest being composition.
In general, differential privacy as a concept still lacks implementations in the common statistical packages for Python, R, Java, or C (to my knowledge), which makes it hard for everyone to make full use of it. Once those libraries are ready to use, differential privacy can become the backbone of statistical analysis, so that data privacy is preserved not only in the data pipelines of big corporations but also in smaller institutes and companies.
I hope you got an idea of the concept of differential privacy. If you are interested in the topic, feel free to have a look at one of the sources I used for this blog post.
Apple (2017): Differential Privacy Team, Apple. Learning with Privacy at Scale. Technical Report.
Duchi, John C.; Jordan, Michael I.; Wainwright, Martin J. (2013): Local Privacy, Data Processing Inequalities, and Statistical Minimax Rates. Available online at http://arxiv.org/pdf/1302.3203v4.
Dwork, Cynthia (2006): Differential Privacy. In: David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell et al. (Eds.): Automata, Languages and Programming. ICALP 2006, Venice, Italy, July 10–14, 2006, Vol. 4052. Berlin, Heidelberg: Springer (Lecture Notes in Computer Science), pp. 1–12.
Dwork, Cynthia; Kenthapadi, Krishnaram; McSherry, Frank; Mironov, Ilya; Naor, Moni (2006a): Our Data, Ourselves: Privacy Via Distributed Noise Generation. In: Serge Vaudenay (Ed.): Advances in Cryptology. Proceedings, Vol. 4004. Berlin: Springer (Lecture Notes in Computer Science, 4004), pp. 486–503.
Dwork, Cynthia; McSherry, Frank; Nissim, Kobbi; Smith, Adam (2006b): Calibrating Noise to Sensitivity in Private Data Analysis. In: David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell et al. (Eds.): Theory of Cryptography, Vol. 3876. Berlin, Heidelberg: Springer (Lecture Notes in Computer Science), pp. 265–284.
Dwork, Cynthia; Roth, Aaron (2013): The Algorithmic Foundations of Differential Privacy. In: Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407. DOI: 10.1561/0400000042.
Erlingsson, Úlfar; Pihur, Vasyl; Korolova, Aleksandra (2014): RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In: Gail-Joon Ahn, Moti Yung and Ninghui Li (Eds.): Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). Scottsdale, Arizona, USA, November 3–7, 2014. New York, NY, USA: ACM, pp. 1054–1067.
Muise, Daniel; Nissim, Kobbi (2016): Differential Privacy in CDFs. Privacy Tools for Sharing Research Data. Center for Research on Computation and Society, Harvard University, 2016. Available online at https://privacytools.seas.harvard.edu/files/privacytools/files/dpcdf_usermanual_2016.pdf.
Narayanan, Arvind; Shmatikov, Vitaly (2008): Robust De-anonymization of Large Sparse Datasets. In: Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP 2008). Oakland, CA, USA, May 18–21, 2008. Los Alamitos, CA: IEEE Computer Society, pp. 111–125.
Warner, S. L. (1965): Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. In: Journal of the American Statistical Association 60 (309), pp. 63–66.
Wood, Alexandra; Altman, Micah; Bembenek, Aaron; Bun, Mark; Gaboardi, Marco; Honaker, James et al. (2018): Differential Privacy: A Primer for a Non-Technical Audience. In: SSRN Journal. DOI: 10.2139/ssrn.3338027.