PH.D. DEFENCE - PUBLIC SEMINAR

Exploiting Gradient Information for Modern Machine Learning Problems

Speaker
Mr Chen Yizhou
Advisor
Dr Low Kian Hsiang, Associate Professor, School of Computing


28 Apr 2022 Thursday, 09:00 AM to 10:30 AM

Zoom presentation

Abstract:

Many deep learning achievements are attributed to the back-propagation (BP) algorithm, which exploits the gradient information of deep neural network (DNN) models: BP efficiently computes the gradient of the loss function with respect to the weights of a DNN for a batch of examples, and stochastic gradient descent can then use this gradient to learn / optimize the DNN model. Beyond standard DNN training, however, there remain important scenarios in which gradient information can be exploited to overcome optimization difficulties. A significant challenge faced by ML practitioners is therefore whether we can design efficient algorithms that use model gradients for training / optimization in these broader deep learning scenarios. This thesis identifies four such scenarios and, for each of them, proposes a novel algorithm that utilizes gradient information for optimization in a manner that is both theoretically grounded and practically effective.
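For concreteness, the interplay between BP and stochastic gradient descent on a minibatch can be sketched in a few lines of PyTorch (a generic illustration with made-up dimensions, not code from the thesis):

    import torch
    import torch.nn as nn

    # Toy DNN, loss, and optimizer (illustrative sizes only).
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    x, y = torch.randn(64, 10), torch.randn(64, 1)  # one synthetic minibatch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()    # BP: gradient of the loss w.r.t. all DNN weights
    optimizer.step()   # SGD: update the weights along the negative gradient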

Firstly, the training process of a machine learning (ML) model may be subject to adversarial attacks from an attacker who attempts to undermine the test performance of the ML model by perturbing the training minibatches; the training process thus needs to be protected by a defender. Such a problem setting is referred to as training-time adversarial ML. We formulate it as a two-player game and propose a principled Recursive Reasoning-based Training-Time adversarial ML (R2T2) framework to model this game. R2T2 models the reasoning process between the attacker and the defender and captures their bounded reasoning capabilities (due to bounded computational resources) through the recursive reasoning formalism. In particular, we associate a deeper level of recursive reasoning with the use of a higher-order gradient to derive the attack (defense) strategy, which naturally improves its performance while requiring greater computational resources.
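The role of higher-order gradients in deeper recursive reasoning can be illustrated with a small PyTorch sketch (hypothetical function names and update rules, not the actual R2T2 strategies): a level-0 attacker takes a first-order gradient step on the minibatch, while a level-1 defender differentiates through that step, which requires a second-order gradient.

    import torch

    def simulated_attack(model, x, y, loss_fn, eps):
        # Level-0 attacker: one gradient-ascent step on the training loss
        # w.r.t. the minibatch; the graph is kept so that a deeper reasoner
        # can differentiate through this step.
        grad_x, = torch.autograd.grad(loss_fn(model(x), y), x, create_graph=True)
        return x + eps * grad_x

    def level1_defense(model, x, y, loss_fn, eps=0.1, lr=0.05):
        # Level-1 defender: treats its corrective perturbation `delta` as the
        # decision variable, simulates the attacker's first-order step on the
        # defended batch, and differentiates through it (a higher-order gradient).
        delta = torch.zeros_like(x, requires_grad=True)
        x_attacked = simulated_attack(model, x + delta, y, loss_fn, eps)
        loss = loss_fn(model(x_attacked), y)
        grad_delta, = torch.autograd.grad(loss, delta)
        return (x + delta - lr * grad_delta).detach()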

Secondly, a multi-layer deep Gaussian process (DGP) model is a hierarchical composition of GP models with greater expressive power. The state-of-the-art DGP inference is our implicit posterior variational inference (IPVI) framework, which can ideally recover an unbiased posterior belief while preserving time efficiency. However, since a generator and a discriminator are integrated into each layer of the DGP, training becomes unstable and prone to optimization difficulties. To resolve these issues, we propose a novel gradient-bridging architecture of the generator and discriminator for the DGP model, which uses the inducing inputs as the context and thus leads to faster training and more accurate predictions.
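To give a rough idea of what conditioning on the inducing inputs as context can look like, below is a generic per-layer generator sketch (hypothetical class and dimensions; it is not the actual gradient-bridging architecture proposed in the thesis):

    import torch
    import torch.nn as nn

    class ContextConditionedGenerator(nn.Module):
        # Illustrative only: a per-layer generator that samples inducing outputs
        # conditioned on that layer's inducing inputs (the "context"), so the
        # generator is tied to a structured, trainable input rather than noise alone.
        def __init__(self, dim_in, dim_out, noise_dim=16, hidden=64):
            super().__init__()
            self.noise_dim = noise_dim
            self.net = nn.Sequential(
                nn.Linear(dim_in + noise_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, dim_out),
            )

        def forward(self, inducing_inputs):
            noise = torch.randn(inducing_inputs.shape[0], self.noise_dim)
            return self.net(torch.cat([inducing_inputs, noise], dim=-1))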

Thirdly, we present a novel implicit process-based meta-learning (IPML) algorithm that explicitly represents each task as a continuous latent vector and models its probabilistic belief within the highly expressive implicit process (IP) framework. We tackle meta-training in IPML with a novel expectation-maximization algorithm based on the stochastic gradient Hamiltonian Monte Carlo (SGHMC) sampling method. Our careful design of the neural network architecture for meta-training in IPML allows competitive meta-learning performance to be achieved. IPML offers the benefits of being amenable to the characterization of a principled distance measure between tasks via the maximum mean discrepancy, to active task selection, and to synthetic task generation.
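As background for the sampling component, a simplified SGHMC update takes the following form (generic variable names, not IPML's actual implementation): a friction-damped momentum step driven by a stochastic gradient of the log-posterior, with matched Gaussian noise injected at each step.

    import torch

    def sghmc_step(theta, stoch_grad_log_post, momentum, step_size=1e-3, friction=0.05):
        # One simplified SGHMC update: the momentum is damped by the friction
        # term, pushed by a stochastic gradient of the log-posterior, and
        # perturbed by Gaussian noise whose variance matches the friction.
        noise = torch.randn_like(theta) * (2.0 * friction * step_size) ** 0.5
        momentum = (1.0 - friction) * momentum + step_size * stoch_grad_log_post + noise
        theta = theta + momentum
        return theta, momentum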

Last but not least, in the problem of active task selection, which involves selecting the most informative tasks for meta-learning, we propose a novel active task selection criterion based on the mutual information between latent task vectors. Unfortunately, such a criterion scales poorly in the number of candidate tasks when optimized. To resolve this issue, we exploit the submodularity property of our new criterion to devise the first active task selection algorithm for meta-learning with a near-optimal performance guarantee. To further improve efficiency, we propose an online variant of Stein variational gradient descent to perform fast belief updates of the meta-parameters by maintaining a set of forward (and backward) particles when learning (and unlearning) from each selected task.
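To sketch how submodularity is typically exploited, the standard greedy algorithm below selects tasks by marginal gain and enjoys the classic near-optimality guarantee for monotone submodular maximization; the marginal_gain oracle stands in for our mutual information criterion and is a hypothetical placeholder here.

    def greedy_select(candidate_tasks, marginal_gain, budget):
        # Standard greedy maximization of a monotone submodular criterion:
        # repeatedly add the candidate task with the largest marginal gain
        # relative to the tasks selected so far.
        selected, remaining = [], list(candidate_tasks)
        for _ in range(budget):
            best = max(remaining, key=lambda t: marginal_gain(t, selected))
            selected.append(best)
            remaining.remove(best)
        return selected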