Is your personal data safe?

Collecting confidential information about users has become a major part of the operational business of many companies. It is estimated that the biggest American tech companies spent around $19 billion in 2018 to collect, buy, and analyze personal information about existing and potential customers in order to learn more about them.

Why conventional anonymization techniques fail

A common technique to preserve data privacy is to leave out personally identifiable information such as the age or the private residential address. The first problem with this technique is the missing definition of a personally identifiable attribute. In some contexts, the age of a person can be personally identifiable, in others not. A similar example is the zip code of the residential address. Both features can be strong indicators in different use cases and can improve the accuracy of a statistical model significantly, like the age in medical studies or the zip code in socio-economic scenarios. Leaving them out in the name of data privacy could hurt the performance of the statistical analysis, but might also increase the privacy of each subject. There is no mathematical definition that would allow for a statistically based decision. This highlights the underlying problem of this technique: a lack of privacy quantification. Due to the missing mathematical foundation of this process, the data collector cannot quantify the privacy loss linked to his or her decision to remove or keep a feature. A minimal sketch of this conventional approach is shown below.
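The following sketch illustrates the conventional approach under stated assumptions: a hypothetical pandas DataFrame with the columns name, age, zip_code, and diagnosis (all names and values are illustrative, not taken from the article).

```python
import pandas as pd

# Hypothetical survey data; column names and values are assumptions for illustration.
records = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 45, 29],
    "zip_code": ["10115", "80331", "50667"],
    "diagnosis": ["A", "B", "A"],
})

# Conventional "anonymization": simply drop columns deemed personally identifiable.
# Which columns belong in this list is a judgment call -- age and zip_code may be
# strong predictors (e.g. in medical or socio-economic studies), yet they may also
# identify a person in some contexts. Nothing in this step quantifies how much
# privacy is gained or how much utility is lost by removing them.
pii_columns = ["name", "age", "zip_code"]
anonymized = records.drop(columns=pii_columns)
print(anonymized)
```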

What is Differential Privacy?

Differential privacy provides a measure of privacy loss and tools to handle it. The essential idea behind differential privacy is to learn nothing about any individual while still identifying patterns in the overall population. If an individual participates in a data collection, the conclusions drawn in the statistical analysis should be (almost) the same as if the subject had not participated. The formal definition is sketched below.
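Formally, ε-differential privacy is usually stated as follows in the literature: a randomized algorithm M is ε-differentially private if, for any two databases D and D' that differ in a single record and for any set of outputs S,

```latex
\Pr[M(D) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The smaller ε is, the closer the two output distributions are, and the less an observer can learn about any single individual's participation.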

How does Differential Privacy work?

At a high level, a differentially private analysis introduces random noise to transform the statistical analysis from a deterministic into a probabilistic one. In contrast to the deterministic output of a statistical analysis, the output of a probabilistic analysis may vary when executed several times on the same input data, because of the random noise. Consequently, an adversary receives less information from a query to a differentially private database, because they cannot be sure what the true value is. Randomization is the key to guaranteeing any level of privacy, regardless of the medium in which the data is presented (e.g. newspapers, databases, studies, …). The randomized algorithm should show similar behavior on different input databases. One common way to achieve this is sketched below.
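A widely used mechanism that fits this description is the Laplace mechanism, which adds noise drawn from a Laplace distribution scaled to the query's sensitivity. The following sketch applies it to a simple counting query; the dataset, predicate, and choice of ε are illustrative assumptions, not taken from the article.

```python
import numpy as np

def noisy_count(data, predicate, epsilon, rng=None):
    """Answer a counting query with the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing a single
    record changes the true count by at most 1, so Laplace noise with
    scale 1/epsilon yields an epsilon-differentially private answer.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for record in data if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: ages of survey participants (hypothetical values).
ages = [34, 45, 29, 61, 52, 38]

# Repeated queries return different answers because of the random noise;
# a smaller epsilon means more noise and therefore stronger privacy.
for epsilon in (0.1, 1.0):
    answers = [noisy_count(ages, lambda a: a >= 40, epsilon) for _ in range(3)]
    print(epsilon, [round(a, 2) for a in answers])
```

Note that every additional query on the same data consumes privacy budget; this is where composition (discussed in the Take Away) comes in.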

The output distributions of a differentially private algorithm differ by at most a factor of e^ε for two queries on two datasets that differ only in a single record X. Taken from (Wood et al. 2018)
Differential privacy with different values of ε applied to a cumulative distribution function (CDF). Top left: the original data; top right: ε = 0.005; bottom left: ε = 0.01; bottom right: ε = 0.1. Taken from (Muise and Nissim 2016)

Differential Privacy in the industry

Due to the previously mentioned strengths of differential privacy, its use in industry-leading corporations like Apple, Microsoft, and Google is continuously growing. Each company has put effort into developing algorithms that meet the requirements of differential privacy while achieving the desired performance in specific use cases.

Privatization stage in Apple's ecosystem. Taken from (Apple 2017)
Hadamard Count Mean Sketch algorithm on the client side. Taken from (Apple 2017)
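Apple privatizes data on the client, before it ever leaves the device (local differential privacy). The Hadamard Count Mean Sketch itself is too involved for a short snippet, so the following simplified randomized-response sketch only illustrates the same client-side idea under stated assumptions; it is not Apple's actual algorithm.

```python
import math
import random

def randomized_response(true_bit, epsilon, rng=random):
    """Locally privatize a single yes/no answer.

    With probability e^eps / (e^eps + 1) the true answer is reported,
    otherwise it is flipped. Each client runs this before sending
    anything to the server, so the raw value never leaves the device.
    """
    keep_probability = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if rng.random() < keep_probability else 1 - true_bit

def estimate_true_rate(reports, epsilon):
    """Server-side, unbiased estimate of the fraction of true 'yes' answers."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed_rate = sum(reports) / len(reports)
    return (observed_rate - (1.0 - p)) / (2.0 * p - 1.0)

# Illustrative simulation: 10,000 clients, 30% of which truly answer "yes".
epsilon = 1.0
truth = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
reports = [randomized_response(bit, epsilon) for bit in truth]
print(round(estimate_true_rate(reports, epsilon), 3))
```

The server only ever sees the noisy reports, yet it can still recover an accurate aggregate estimate over the whole population.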

Take Away

Data privacy has come a long way, from simple anonymization techniques to the robust and quantifiable definition of differentially private algorithms. To date, it is the only framework in data privacy that guarantees a predefined privacy loss while being applicable to all kinds of statistical algorithms. It is important to remember that differential privacy itself is a definition, not an algorithm. The current state of differential privacy still has some hurdles to pass before it is applicable to all statistical analyses, though. One of the biggest challenges is composition, sketched below.
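Composition refers to the fact that privacy loss accumulates over repeated analyses of the same data. The basic sequential composition theorem, as commonly stated in the differential privacy literature, bounds the total loss by the sum of the individual budgets:

```latex
% If M_1, ..., M_k are eps_1-, ..., eps_k-differentially private mechanisms,
% then running all of them on the same database is differentially private with
\varepsilon_{\text{total}} \;=\; \sum_{i=1}^{k} \varepsilon_i
```

In practice this means the privacy budget is consumed with every query, which limits how many analyses can be run on the same dataset at a useful accuracy.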

Sources

Apple Differential Privacy Team (2017): Learning with Privacy at Scale. Technical Report.
