Interpreting and Manipulating Models via their Training Data
Koh Pang Wei, PhD student in Computer Science, Stanford
Monday, 17 Dec 2018, 2:00 PM to 3:00 PM
Executive Classroom, COM2-04-02 (COM2 Level 4)
Machine learning models base their predictions on training data. In this work, we use influence functions -- a classic technique from robust statistics -- to quantify the extent to which a model's predictions are based on each individual training point. We show that this technique allows us to better interpret model predictions, e.g., by showing the subset of training points that are most responsible for a given prediction.
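To make the idea concrete, here is a minimal, self-contained sketch of influence-function scoring for a logistic regression model on synthetic data. This is an illustration of the general technique (score each training point by -grad_loss(test)ᵀ H⁻¹ grad_loss(train)), not the speaker's implementation; all data, hyperparameters, and variable names are invented for the example.

```python
# Hedged sketch: influence-function scores for logistic regression.
# Synthetic data and hand-rolled training; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.where(X @ w_true + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit L2-regularized logistic regression by gradient descent (labels in {-1,+1}).
lam = 0.01
w = np.zeros(d)
for _ in range(2000):
    m = y * (X @ w)
    grad = -(X * (y * sigmoid(-m))[:, None]).mean(axis=0) + lam * w
    w -= 0.5 * grad

# Per-example loss gradients and the regularized Hessian at the fitted weights.
m = y * (X @ w)
grads = -(X * (y * sigmoid(-m))[:, None])            # shape (n, d)
p = sigmoid(m) * sigmoid(-m)
H = (X * p[:, None]).T @ X / n + lam * np.eye(d)

# Influence of each training point z on the loss at a test point z_test:
#   I(z, z_test) = -grad_L(z_test)^T H^{-1} grad_L(z)
# Here we reuse training point 0 as the "test" point for illustration.
x_test, y_test = X[0], y[0]
g_test = -y_test * x_test * sigmoid(-y_test * (x_test @ w))
influences = -grads @ np.linalg.solve(H, g_test)

# The most negative scores mark the training points most "helpful"
# for this prediction; the most positive mark the most "harmful".
most_helpful = np.argsort(influences)[:5]
```

Since the Hessian is positive definite here, a point's influence on its own loss is non-positive, matching the intuition that upweighting a training point reduces its own loss.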
Moreover, better understanding the link between a model's predictions and its training data gives us insight into data poisoning attacks, whereby an adversary injects malicious points into the training data in order to corrupt the learned model. In particular, we show that influence-function-based data poisoning attacks can increase the test error on the Enron spam detection dataset from 3% to 21% by adding just 3% poisoned data, even in the presence of a broad range of data sanitization defenses. These results underscore the urgent need to develop more sophisticated and robust defenses against data poisoning attacks.
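The flavor of a data poisoning attack can be sketched in a few lines: inject a small fraction of adversarially placed, mislabeled points and refit the model. This toy label-flip attack on synthetic logistic regression is only a caricature of the influence-function-guided attack on Enron spam data described above; every name and number below is an assumption of the example.

```python
# Hedged sketch: a toy 3%-budget label-flip poisoning attack on
# logistic regression with synthetic data. Illustrative only; the
# actual attack in the talk is influence-function-guided.
import numpy as np

rng = np.random.default_rng(1)

def fit_logreg(X, y, lam=0.01, lr=0.5, steps=2000):
    """Fit L2-regularized logistic regression by gradient descent; y in {-1,+1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        m = y * (X @ w)
        s = 1.0 / (1.0 + np.exp(m))              # = sigmoid(-m)
        grad = -(X * (y * s)[:, None]).mean(axis=0) + lam * w
        w -= lr * grad
    return w

def error_rate(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))

# Clean, roughly linearly separable train/test data.
d = 5
w_true = rng.normal(size=d)
X_tr = rng.normal(size=(300, d)); y_tr = np.sign(X_tr @ w_true)
X_te = rng.normal(size=(300, d)); y_te = np.sign(X_te @ w_true)

w_clean = fit_logreg(X_tr, y_tr)

# Poison: add 3% of points along the clean model's weight direction,
# but with the flipped label, so they exert a large pull on the fit.
k = int(0.03 * len(X_tr))
X_poison = np.tile(5.0 * w_clean / np.linalg.norm(w_clean), (k, 1))
y_poison = -np.ones(k)
w_poisoned = fit_logreg(np.vstack([X_tr, X_poison]),
                        np.concatenate([y_tr, y_poison]))

err_clean = error_rate(w_clean, X_te, y_te)
err_poisoned = error_rate(w_poisoned, X_te, y_te)
```

Even this crude attack visibly shifts the learned weights; the point of the influence-function machinery is to choose poison points that maximize that shift per injected example while evading sanitization defenses.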
This is joint work with Percy Liang and Jacob Steinhardt.
Koh Pang Wei is a PhD student in Computer Science at Stanford, advised by Percy Liang. He works on machine learning and its applications to biology and medicine. Pang Wei's research has been recognized by awards at ICML 2017 and ISMB 2017, and by Stanford's David M. Kennedy Honors Thesis Prize. He is supported by the Facebook PhD Fellowship. Before starting his PhD, Pang Wei was the third employee and Director of Partnerships at Coursera, an online education company that has served more than 30 million learners around the world.