PH.D. DEFENCE - PUBLIC SEMINAR

New Advances on Bayesian and Decision-Theoretic Approaches for Interactive Machine Learning

Speaker
Mr Hoang Trong Nghia
Advisor
Dr Low Kian Hsiang, Assistant Professor, School of Computing


21 Jan 2015 Wednesday, 10:30 AM to 12:00 PM

MR3, COM2-02-26

Abstract:

The exploration-exploitation trade-off is a fundamental dilemma in many interactive learning scenarios that involve aspects of both reinforcement learning (RL) and active learning (AL): an autonomous agent, situated in an unknown environment, has to actively extract knowledge from the environment by taking actions (or conducting experiments) based on its previously collected information in order to make accurate predictions or to optimize some utility function. Thus, to make the most effective use of its resource-constrained budget (e.g., processing time, experimentation cost), the agent must choose carefully between (a) exploiting options (e.g., actions, experiments) recommended by its current, possibly incomplete model of the environment, and (b) exploring other, ostensibly sub-optimal choices to gather more information.

For example, an RL agent faces a dilemma between (a) exploiting the most rewarding action according to its current statistical model of the environment, at the risk of running into catastrophic situations if the model is inaccurate, and (b) exploring a sub-optimal action to gather more information and improve the model's accuracy, at the potential price of losing short-term reward. Similarly, an AL algorithm/agent has to choose between (a) conducting the most informative experiments according to its current estimate of the environment model's parameters (i.e., exploitation), and (b) running experiments that help improve the estimation accuracy of these parameters (i.e., exploration).
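
To make the dilemma concrete, the following minimal sketch (illustrative only, not part of the thesis) contrasts a purely greedy policy with an epsilon-greedy one on a toy multi-armed bandit; the reward probabilities, number of steps, and exploration rate are arbitrary assumptions chosen for the example.

    import numpy as np

    # Toy 3-armed Bernoulli bandit; the true reward probabilities below are
    # assumptions made purely for this illustration.
    rng = np.random.default_rng(0)
    true_p = np.array([0.3, 0.5, 0.7])

    def run(epsilon, steps=5000):
        """Run an epsilon-greedy agent; epsilon = 0 is the purely greedy policy."""
        counts = np.zeros(3)   # how often each arm has been pulled
        values = np.zeros(3)   # empirical mean reward of each arm
        total = 0.0
        for _ in range(steps):
            if rng.random() < epsilon:
                arm = int(rng.integers(3))    # explore: try an ostensibly sub-optimal arm
            else:
                arm = int(np.argmax(values))  # exploit: trust the current (possibly wrong) model
            reward = float(rng.random() < true_p[arm])
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]
            total += reward
        return total / steps

    print("greedy (no exploration):", run(epsilon=0.0))
    print("epsilon-greedy (10% exploration):", run(epsilon=0.1))

The purely greedy agent can lock onto whichever arm happens to look best early on and never revise that belief, while the epsilon-greedy agent sacrifices some short-term reward to keep its estimates of all arms accurate; this is precisely the trade-off described above.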

More often than not, learning strategies that ignore exploration will exhibit sub-optimal performance due to their imperfect knowledge while, conversely, those that focus entirely on exploration incur the cost of learning without benefiting from it. A good exploration-exploitation trade-off is therefore critical to the success of such interactive learning agents: in order to perform well, they must strike the right balance between these two conflicting objectives. Unfortunately, while this trade-off has been well recognized since the early days of RL, studies of exploration and exploitation have mostly been confined to theoretical settings within RL and, perhaps surprisingly, glossed over in the existing AL literature. From a practical point of view, we see three limiting factors:

1. Previous works addressing the exploration-exploitation trade-off in RL have largely focused on simple choices of the environment model and are consequently not practical enough to accommodate real-world applications with far more complicated environment structures. In fact, most recent advances in Bayesian reinforcement learning (BRL) have only been able to trade off exploration and exploitation analytically under simple model choices such as the Flat-Dirichlet-Multinomial (FDM), whose independence and modeling assumptions do not hold for many real-world applications (a minimal sketch of such an FDM model is given after this list).

2. Nearly all of the notable works in the AL literature primarily advocate the use of greedy/myopic algorithms whose rates of convergence (i.e., the number of experiments required by the learning algorithm to achieve a desired performance in the worst case) are provably minimax optimal for simple classes of learning tasks (e.g., threshold learning). While these results have greatly advanced our understanding of the limits of myopic AL in worst-case scenarios, significantly less is known about whether it is possible to devise non-myopic AL strategies that optimize the exploration-exploitation trade-off to achieve the best expected performance in budgeted learning scenarios.

3. The scalability of the existing predictive models (e.g., Gaussian processes) used in AL has generally been underrated, since the majority of the literature considers small-scale environments consisting of only a few thousand candidate experiments, which are selected by single-mode AL algorithms one at a time prior to retraining the model. In contrast, large-scale environments usually have a massive set of millions of candidate experiments, among which tens or hundreds of thousands should be actively selected for learning. For such data-intensive problems, it is often more cost-effective to consider batch-mode AL algorithms which select and conduct multiple experiments in parallel at each stage to collect observations in batch. Retraining the predictive model after incorporating each batch of observations then becomes a computational bottleneck, as the collected dataset at each stage quickly grows to tens or even hundreds of thousands of data points.
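
The following minimal sketch (an assumed illustration, not the thesis's framework) shows the kind of Flat-Dirichlet-Multinomial transition model referred to in point 1: each (state, action) pair keeps its own independent Dirichlet posterior over next states, updated by simple pseudo-counts; the state and action sizes are arbitrary.

    import numpy as np

    # Sketch of a Flat-Dirichlet-Multinomial (FDM) transition model: every
    # (state, action) pair has an independent Dirichlet posterior over the
    # next-state distribution.  Sizes below are arbitrary for illustration.
    N_STATES, N_ACTIONS = 4, 2

    # Dirichlet pseudo-counts, initialised to a flat prior of 1 everywhere.
    alpha = np.ones((N_STATES, N_ACTIONS, N_STATES))

    def update(state, action, next_state):
        """Conjugate posterior update: an observed transition adds one pseudo-count."""
        alpha[state, action, next_state] += 1.0

    def posterior_mean(state, action):
        """Posterior mean of the transition distribution for one (state, action) pair."""
        a = alpha[state, action]
        return a / a.sum()

    update(0, 1, 2)
    update(0, 1, 2)
    print(posterior_mean(0, 1))   # probability mass shifts toward next state 2

Note that each (state, action) row is updated in isolation; it is exactly this independence assumption that the thesis argues is too restrictive for real-world environments, motivating the more general parametric models supported by I-BRL below.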

This thesis outlines the recent progress we have made toward satisfactory answers to the above challenges, along with practical algorithms that realize them:

1. In particular, in order to put BRL into practice for more complicated and realistic problems, we propose a novel framework called Interactive Bayesian Reinforcement Learning (I-BRL) that accommodates a general class of parametric models and model priors, thus allowing the practitioners' domain knowledge to be exploited to produce a fine-grained and compact representation of the environment, as often required in many real-world applications. Interestingly, we show how the non-myopic Bayes-optimal policy can be derived analytically by solving I-BRL exactly, and we propose an approximation algorithm to compute it efficiently in polynomial time. Our empirical studies show that the proposed approach performs competitively with the existing state-of-the-art algorithms.

2. Then, to establish a theoretical foundation for the exploration-exploitation trade-off in single-mode active learning scenarios with resource-constrained budgets, we present a novel ε-Bayes-optimal Decision-Theoretic Active Learning (ε-BAL) framework which advocates the use of differential entropy as a performance measure and, consequently, derives a learning policy that can approximate the optimal expected performance arbitrarily closely (i.e., within an arbitrary loss bound ε). To meet the real-time requirement of time-critical applications, we then propose an asymptotically ε-optimal, branch-and-bound anytime algorithm based on ε-BAL with performance guarantees. In practice, we empirically demonstrate on both synthetic and real-world datasets that the proposed approach outperforms the state-of-the-art algorithms in budgeted scenarios (a sketch of the myopic entropy criterion that ε-BAL generalizes is given after this list).

3. Lastly, to facilitate the future development of large-scale, non-myopic AL applications, we further introduce a highly scalable family of anytime predictive models for AL which provably converge toward a well-known class of sparse Gaussian processes (SGPs). Unlike the existing predictive models in AL, which cannot be updated incrementally and are only capable of processing medium-sized datasets (i.e., a few thousand data points), our proposed models can process massive datasets in an anytime fashion, thus providing a principled trade-off between processing time and predictive accuracy. The efficiency of our framework is then demonstrated empirically on a variety of large-scale real-world datasets containing hundreds of thousands of data points (a minimal sparse-GP sketch also follows below).
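
The non-myopic ε-BAL policy itself is beyond a short sketch, but the myopic, entropy-based criterion it builds on can be illustrated as follows. This is a hedged, self-contained sketch with an assumed RBF kernel and noise level: it selects the candidate experiment whose Gaussian-process predictive entropy (equivalently, predictive variance) is largest, which is a standard baseline rather than the thesis's algorithm.

    import numpy as np

    # Myopic entropy-based experiment selection with a Gaussian process.
    # Kernel choice, lengthscale, and noise level are assumptions for the sketch.
    def rbf(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def gp_predictive_variance(X_train, X_cand, noise=1e-2):
        """Predictive variance of a zero-mean GP at each candidate input."""
        K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
        Ks = rbf(X_train, X_cand)
        prior_var = np.ones(len(X_cand))              # rbf kernel: k(x, x) = 1
        return prior_var - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))

    def most_informative(X_train, X_cand):
        """Pick the candidate maximizing the Gaussian predictive entropy
        0.5 * log(2 * pi * e * var), i.e. the one with the largest variance."""
        return int(np.argmax(gp_predictive_variance(X_train, X_cand)))

    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 1, size=(5, 2))      # experiments already conducted
    X_cand = rng.uniform(0, 1, size=(100, 2))     # candidate experiments
    print("next experiment:", most_informative(X_train, X_cand))

Because a Gaussian's differential entropy is monotone in its variance, maximizing predictive entropy and maximizing predictive variance pick the same candidate; roughly speaking, the non-myopic ε-BAL policy goes further by accounting for how each selection affects the choices available over the remaining budget.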
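
Finally, as a rough illustration of why sparse GPs help at this scale (again an assumed sketch, not the thesis's anytime family of models), the subset-of-regressors approximation below replaces the full n x n kernel matrix with one built from m inducing points, reducing the dominant training cost from O(n^3) to O(n m^2).

    import numpy as np

    # Subset-of-regressors sparse GP: m inducing inputs stand in for the full
    # training set when forming the predictive mean.
    def rbf(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def sor_predict(X, y, Z, X_test, noise=1e-2):
        """Subset-of-regressors predictive mean using inducing inputs Z."""
        Kuu = rbf(Z, Z)                      # m x m
        Kuf = rbf(Z, X)                      # m x n
        Ksu = rbf(X_test, Z)                 # n_test x m
        A = noise * Kuu + Kuf @ Kuf.T        # solve an m x m system, not n x n
        return Ksu @ np.linalg.solve(A, Kuf @ y)

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(2000, 1))                   # large training set
    y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(2000)
    Z = X[rng.choice(2000, size=50, replace=False)]         # 50 inducing points
    X_test = np.linspace(0, 1, 5)[:, None]
    print(sor_predict(X, y, Z, X_test))

The inducing points summarize the full dataset, so incorporating a new batch of observations only requires updating the m x m system rather than refitting an ever-growing n x n one, which is the kind of saving that makes batch-mode AL on massive datasets tractable.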