PH.D. DEFENCE - PUBLIC SEMINAR

Variational Distribution Designs for Approximate Thompson Sampling in Deep Reinforcement Learning

Speaker
Mr Siddharth Aravindan
Advisor
Dr Lee Wee Sun, Professor, School of Computing


Thursday, 12 May 2022, 04:00 PM to 05:30 PM

Zoom presentation

Abstract:

Exploration is a vital ingredient of reinforcement learning algorithms and has contributed largely to their success in various applications. Standard exploration strategies used in deep reinforcement learning, such as $\epsilon$-greedy exploration, Boltzmann exploration, or action-space noise injection, are effective in simple tasks, but they do not perform well in tasks with high-dimensional state-action spaces because they are undirected, i.e., they do not make use of the agent's understanding of the environment. Thompson sampling is a well-known, directed and principled approach to balancing exploration and exploitation. However, it requires a posterior distribution over action-value functions or environment models to be maintained, which is generally intractable for tasks with high-dimensional state-action spaces.
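
As a point of reference, the sketch below shows Thompson sampling in the simplest setting where the posterior is tractable: a Bernoulli bandit with Beta posteriors. The arm probabilities, round count and uniform prior are illustrative assumptions, not taken from the thesis; the point is only the sample-then-act-greedily loop that the thesis scales up to deep networks.

```python
import numpy as np

def thompson_sampling_bernoulli(true_probs, num_rounds=1000, rng=None):
    """Thompson sampling on a Bernoulli bandit with a Beta posterior per arm.

    Each round: sample a model from the posterior, act greedily with respect
    to the sample, then update the posterior with the observed reward.
    """
    rng = rng or np.random.default_rng(0)
    num_arms = len(true_probs)
    successes = np.ones(num_arms)  # Beta(1, 1) uniform prior
    failures = np.ones(num_arms)

    total_reward = 0
    for _ in range(num_rounds):
        sampled_means = rng.beta(successes, failures)  # one posterior sample per arm
        arm = int(np.argmax(sampled_means))            # exploit the sampled model
        reward = rng.random() < true_probs[arm]        # pull the arm
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms; the sampler should concentrate on the 0.8 arm over time.
print(thompson_sampling_bernoulli([0.2, 0.5, 0.8]))
```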

We interpret the successful NoisyNets method as an approximation to a variational Thompson sampling method for Deep Q-Networks, showing that such approximations, while effective, are also computationally feasible. NoisyNets, however, does not exploit the domain knowledge of a task in its variational design. In this thesis, we argue that incorporating domain knowledge when formulating variational distributions to approximate posterior distributions is useful in reinforcement learning. We explore this assertion by designing variational distributions for two different scenarios.
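
For intuition, a NoisyNets-style layer keeps a Gaussian variational distribution over its own parameters and draws one sample per forward pass, so each forward pass corresponds to one sampled Q-network. The sketch below is a minimal version using independent (rather than factorised) Gaussian noise; the initialisation and sigma_init value are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable Gaussian perturbations on its parameters."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(
            torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.weight_sigma = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        # One noise sample per forward pass ~ one sample from the variational posterior.
        weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        return F.linear(x, weight, bias)
```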

In the first scenario, we learn variational distributions for Deep Q-Learning agents that aim to learn good policies on tasks with a mix of \textit{high risk} and \textit{low risk} states. For such agents, we propose State Aware Noisy Exploration (SANE), a variational distribution design that seeks to improve on the distributions used by NoisyNets by allowing non-uniform perturbation, where the amount of parameter perturbation is conditioned on the state of the agent. This is done with the help of an auxiliary perturbation module, whose output is state-dependent and which is learnt end-to-end with gradient descent. We hypothesize that such state-aware noisy exploration is particularly useful in problems where exploration in certain \textit{high risk} states may cause the agent to fail badly.
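
One way to picture this design, as a minimal sketch rather than the thesis implementation: scale the Gaussian parameter noise of a noisy layer by a factor produced by a small state-conditioned network. The auxiliary module's architecture, its use of the layer input as a stand-in for the state, and the batch-averaged scale below are all simplifying assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAwareNoisyLinear(nn.Module):
    """Noisy linear layer whose perturbation magnitude is scaled by a
    state-dependent factor from an auxiliary perturbation module."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(
            torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.weight_sigma = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
        # Auxiliary perturbation module: maps the input features to a positive
        # noise scale; trained end-to-end along with the rest of the layer.
        self.perturbation_module = nn.Sequential(
            nn.Linear(in_features, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

    def forward(self, x):
        # Simplification: average the per-sample scales over the batch so a
        # single perturbed weight matrix can be used for the whole batch.
        scale = self.perturbation_module(x).mean()
        weight = self.weight_mu + scale * self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + scale * self.bias_sigma * torch.randn_like(self.bias_sigma)
        return F.linear(x, weight, bias)
```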

In the second scenario, we propose Event-based Variational Distributions for Exploration (EVaDE), variational distributions that are useful for model-based reinforcement learning, especially when the underlying domain is object-based. We leverage the general domain knowledge of object-based domains to design three types of event-based convolutional layers that direct exploration, namely the noisy event interaction layer, the noisy event weighting layer and the noisy event translation layer. These layers rely on Gaussian dropouts and are inserted between the layers of the deep neural network model to facilitate variational Thompson sampling.
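
To make the Gaussian-dropout idea concrete, the sketch below shows a layer in the spirit of the noisy event weighting layer: each feature map, treated as an "event" channel, is reweighted with multiplicative Gaussian noise. The per-channel parameterisation and the initial noise scale are illustrative assumptions, not the exact EVaDE layers.

```python
import torch
import torch.nn as nn

class NoisyEventWeightingLayer(nn.Module):
    """Reweights each feature map ("event" channel) with multiplicative
    Gaussian noise (Gaussian dropout), sampled once per forward pass."""

    def __init__(self, num_channels, sigma_init=0.1):
        super().__init__()
        # Learnable mean weight and noise scale per event channel.
        self.weight_mu = nn.Parameter(torch.ones(num_channels))
        self.weight_sigma = nn.Parameter(torch.full((num_channels,), sigma_init))

    def forward(self, feature_maps):
        # feature_maps: (batch, channels, height, width)
        noise = torch.randn(feature_maps.size(1), device=feature_maps.device)
        weights = self.weight_mu + self.weight_sigma * noise
        return feature_maps * weights.view(1, -1, 1, 1)
```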

In both of the above scenarios, we demonstrate the effectiveness of our variational designs empirically, comparing agents equipped with our designs against popular baselines on a suite of Atari games selected to suit each scenario. Finally, through an empirical study, we find that the event-based layers, when used appropriately, may also help drive meaningful exploration in model-free agents that operate in object-based domains.