DOCTORAL SEMINAR

On the privacy risk of releasing data and models

Speaker
Mr Ashish Deepak Dandekar
Advisor
Dr Stephane Bressan, Associate Professor, School of Computing


19 Dec 2018 Wednesday, 02:00 PM to 03:30 PM

Executive Classroom, COM2-04-02

Abstract:

Organisations are amassing data on an unprecedented scale. They can release either the raw data or models trained on the collected data. In this talk, we address the risk of privacy breaches in the publication of datasets as well as machine learning models.

Synthetically generated datasets, which preserve the relationships among attributes of the original dataset, contain no data point that corresponds to a data point in the real world. Hence, they provide a convenient way to publish datasets without the risk of a privacy breach. We develop two generative models for two use cases. First, we adapt and extend Latent Dirichlet Allocation to handle spatiotemporal data, and we use the trained model to synthetically generate travel records of commuters. Second, we propose a Recurrent Neural Network based model that learns patterns in annotated sequence data, and we use the trained model to synthetically generate news headlines for a specified topic.
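To give a concrete sense of the generative use of a topic model, the sketch below samples synthetic "travel records" from a fitted LDA-style model. It is a minimal illustration only: the topic mixtures, location vocabulary, and parameter values are hypothetical placeholders, not the spatiotemporal model presented in the talk.

```python
import numpy as np

# Hypothetical fitted LDA parameters: 2 latent topics over a 4-location vocabulary.
rng = np.random.default_rng(42)
alpha = np.array([0.5, 0.5])                      # Dirichlet prior over topics
topic_word = np.array([[0.70, 0.20, 0.05, 0.05],  # topic 0: morning-commute stops
                       [0.05, 0.05, 0.20, 0.70]]) # topic 1: evening-commute stops
locations = ["home", "bus_stop", "office", "mall"]

def generate_record(n_events=5):
    """Sample one synthetic travel record via the LDA generative process."""
    theta = rng.dirichlet(alpha)                  # commuter-specific topic mixture
    events = []
    for _ in range(n_events):
        z = rng.choice(len(alpha), p=theta)       # draw a latent topic
        w = rng.choice(len(locations), p=topic_word[z])  # draw a location from it
        events.append(locations[w])
    return events

record = generate_record()
```

Every generated record is a plausible sequence of locations drawn from the learned distributions, yet none of them is copied from any real commuter's trajectory.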

Using machine learning models to generate synthetic datasets does not completely eliminate the risk of a privacy breach. We use differential privacy, a widely accepted privacy definition, to provide quantifiable privacy guarantees for the publication of parametric as well as non-parametric machine learning models. We observe two difficulties in using differential privacy in a real-world setting. First, the privacy level $\epsilon$ in differential privacy is too abstract to be actionable in a business setting. We propose a cost model that bridges the gap between a privacy level and the compensation budget estimated by a GDPR-compliant business entity. Second, the privacy level bounds the worst-case privacy loss, which leads to a higher loss in utility. We propose privacy at risk, an extension of differential privacy that provides probabilistic bounds on the privacy level by accounting for various sources of randomness. Privacy at risk provides a way to compute the probability of satisfying a specified privacy level for a fixed loss in utility. The proposed cost model also helps in balancing the privacy-utility tradeoff.
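For readers unfamiliar with how $\epsilon$-differential privacy is realised in practice, the sketch below shows the standard Laplace mechanism applied to a counting query. This is a textbook illustration, not the mechanisms proposed in the talk; the example data and the choice $\epsilon = 0.5$ are arbitrary.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release value plus Laplace(sensitivity / epsilon) noise.

    Adding noise with this scale satisfies epsilon-differential privacy
    for a query whose output changes by at most `sensitivity` when one
    individual's record is added or removed.
    """
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

# Example: privately release the number of people older than 30.
rng = np.random.default_rng(0)
ages = np.array([23, 35, 41, 29, 52, 47, 31])
true_count = int((ages > 30).sum())   # a counting query has sensitivity 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0,
                                epsilon=0.5, rng=rng)
```

A smaller $\epsilon$ means a larger noise scale and hence stronger privacy but lower utility, which is exactly the worst-case tradeoff that privacy at risk and the cost model aim to soften.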

In conclusion, we study the problem of data privacy and propose solutions along two complementary directions: synthetic dataset generation lies at the heart of the publication of data, whereas differential privacy lies at the heart of the publication of models.