Does the Data Induce Capacity Control in Deep Learning?
Accepted statistical wisdom suggests that the larger the model class, the more likely it is to overfit the training data. And yet, deep networks generalize extremely well: the larger the deep network, the better its accuracy on new data. This talk seeks to shed light on this apparent paradox. We will argue that deep networks are successful because of a characteristic structure in the space of learning tasks. The input correlation matrix for typical tasks has a peculiar ("sloppy") eigenspectrum where, in addition to a few large eigenvalues (salient features), there are a large number of small eigenvalues that are distributed uniformly over a very large range. This structure in the input data is strongly mirrored in the representation learned by the network. A number of quantities, such as the Hessian and the Fisher Information Matrix, as well as correlations of activations or Jacobians, are also sloppy. Even if the model class for deep networks is very large, only a tiny subset of models fits such sloppy tasks. Using these ideas, this talk will demonstrate an analytical non-vacuous generalization bound for deep networks that does not use compression. It will also discuss how these ideas can be harnessed into algorithms that learn from unlabeled data optimally.
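As a rough illustration of what a "sloppy" eigenspectrum looks like, the sketch below builds synthetic inputs and computes the eigenvalues of their correlation matrix. It is not taken from the talk: the helper names (`make_sloppy_data`, `input_correlation_spectrum`), the use of the uncentered matrix X^T X / n, and the choice to space the eigenvalues evenly on a log scale over six orders of magnitude are all assumptions made for illustration.

```python
import numpy as np


def input_correlation_spectrum(X):
    """Eigenvalues of the uncentered input correlation matrix X^T X / n, largest first."""
    n = X.shape[0]
    C = X.T @ X / n
    eigvals = np.linalg.eigvalsh(C)  # symmetric input, eigenvalues returned in ascending order
    return eigvals[::-1]


def make_sloppy_data(n=2000, d=50, decades=6.0, seed=0):
    """Hypothetical generator: Gaussian inputs whose population eigenvalues
    are evenly spaced in log10 over `decades` orders of magnitude."""
    rng = np.random.default_rng(seed)
    target = 10.0 ** np.linspace(0.0, -decades, d)     # assumed spectrum shape
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
    Z = rng.standard_normal((n, d))
    X = (Z * np.sqrt(target)) @ Q.T                    # population covariance is Q diag(target) Q^T
    return X, target


X, target = make_sloppy_data()
spectrum = input_correlation_spectrum(X)

# Orders of magnitude spanned by the empirical spectrum.
decades_spanned = np.log10(spectrum[0] / spectrum[-1])

# Gaps between consecutive eigenvalues on a log scale; roughly constant gaps
# mean the eigenvalues are spread evenly across the whole range.
log_gaps = np.diff(np.log10(spectrum))

print(f"eigenvalues span about {decades_spanned:.1f} orders of magnitude")
print(f"mean log10 gap between neighbours: {log_gaps.mean():.3f}")
```

On real data one would replace `make_sloppy_data` with the flattened inputs of an actual dataset and plot `spectrum` on a logarithmic axis; the synthetic generator only fixes the shape of the spectrum so that the computation can be run in isolation.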
Pratik Chaudhari is an Assistant Professor in Electrical and Systems Engineering and Computer and Information Science at the University of Pennsylvania. He is a member of the GRASP Laboratory. From 2018-19, he was a Senior Applied Scientist at Amazon Web Services and a Postdoctoral Scholar in Computing and Mathematical Sciences at Caltech. Pratik received his PhD (2018) in Computer Science from UCLA, and his Master's (2012) and Engineer's (2014) degrees in Aeronautics and Astronautics from MIT. He was part of NuTonomy Inc. (now Hyundai-Aptiv Motional) from 2014-16. He received the NSF CAREER award and the Intel Rising Star Faculty Award in 2022.