On privacy risk of publishing data and models
COM2 Level 4
Executive Classroom, COM2-04-02
Abstract:
In this talk, we address the privacy risks that arise when publishing datasets as well as machine learning models.
Synthetically generated datasets, which preserve the relationships among the attributes of the original data, provide a convenient way to publish datasets with a reduced risk of privacy breach. We conduct experiments on traditional techniques for partially and fully synthetic dataset generation using various discriminative models. We complement these experiments by adapting and extending a generative model, namely Latent Dirichlet Allocation, to handle spatiotemporal data, and we use the trained model to generate travel records of commuters.
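To make the generative approach concrete, here is a minimal sketch of how synthetic travel records can be sampled from a fitted topic model. It uses scikit-learn's `LatentDirichletAllocation` on toy count data over discretised (location, time-slot) tokens; the data, token vocabulary, and hyperparameters are all illustrative assumptions, not the talk's actual model or dataset.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Toy corpus: each row is a commuter, each column a discretised
# (location, time-slot) token; entries are visit counts.
# (Illustrative random data only -- not the talk's actual dataset.)
n_commuters, n_tokens = 100, 20
counts = rng.poisson(1.0, size=(n_commuters, n_tokens))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(counts)

# Topic-token distributions: normalise the rows of components_.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

def sample_synthetic_record(n_trips=10, alpha=0.1):
    """Draw one synthetic travel record from the fitted model."""
    theta = rng.dirichlet([alpha] * phi.shape[0])          # topic mixture
    topics = rng.choice(phi.shape[0], size=n_trips, p=theta)
    return np.array([rng.choice(phi.shape[1], p=phi[t]) for t in topics])

synthetic = sample_synthetic_record()  # token ids of one synthetic record
```

Because every synthetic record is sampled from the model rather than copied from the data, only the aggregate relationships captured by the topics are published, not any individual's actual trips.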
Recent attacks on machine learning models, such as the membership inference attack and the model inversion attack, demonstrate that information leaks through trained models. We use differential privacy to provide quantifiable privacy guarantees for the publication of both parametric and non-parametric machine learning models.
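As a minimal sketch of how differential privacy can protect a published model, the standard Laplace mechanism adds calibrated noise before release. The weight vector and sensitivity value below are hypothetical placeholders; calibrating the true sensitivity of a trained model is the hard part and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    """Release `value` with epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# Hypothetical example: perturb trained model parameters before
# publication (weights and sensitivity are illustrative values).
weights = np.array([0.5, -1.2, 0.8])
private_weights = laplace_mechanism(weights, sensitivity=0.1, epsilon=1.0)
```

Smaller epsilon means a stronger privacy guarantee but larger noise, which is the privacy-utility tradeoff discussed below.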
We observe two difficulties in using differential privacy in real-world settings. First, the privacy level of differential privacy is an upper bound on the worst-case privacy loss; a loose upper bound leads to a greater loss in utility. We propose privacy at risk, which provides probabilistic bounds on the privacy level by accounting for various sources of randomness. Privacy at risk thus quantifies the confidence in a privacy level at a specified value of utility. Second, the privacy level of differential privacy is too abstract to be actionable in a business setting. We propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR-compliant business entity. The proposed cost model also helps in balancing the privacy-utility tradeoff.
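The gap between the worst-case bound and typical behaviour can be seen numerically. The sketch below samples the realised privacy-loss random variable of a Laplace mechanism and estimates how often it stays below a level tighter than the worst-case epsilon; it is an illustration of the general idea of probabilistic privacy bounds, not the talk's actual privacy-at-risk estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Laplace mechanism with sensitivity 1, worst-case level eps = 1.0.
eps, sensitivity = 1.0, 1.0
scale = sensitivity / eps
f_D, f_Dprime = 0.0, 1.0   # query values on two neighbouring datasets

# Sample outputs on D and compute the realised privacy loss
# ln p(o|D) / p(o|D') for the Laplace density.
o = rng.laplace(f_D, scale, size=100_000)
loss = (np.abs(o - f_Dprime) - np.abs(o - f_D)) / scale

# Estimated confidence that the realised loss stays within a tighter
# level eps' < eps -- the loss never exceeds eps, but it is usually
# far smaller, which is what a probabilistic bound can exploit.
eps_prime = 0.5
confidence = np.mean(np.abs(loss) <= eps_prime)
```

Here the worst-case guarantee is eps = 1.0, yet a nontrivial fraction of outputs already satisfy the tighter level 0.5, illustrating why a loose worst-case bound can force more noise (and utility loss) than the typical case requires.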