Generalization Techniques in Deep Reinforcement Learning
COM2 Level 4
Executive Classroom, COM2-04-02
Abstract:
Reinforcement learning (RL) has achieved remarkable success in recent years, enabling computers to solve problems traditionally thought to be "impossible". An important factor in this success is the advances in deep reinforcement learning (DRL). DRL combines the representational capability of deep learning with the generality of reinforcement learning, allowing it to tackle complex problems with high-dimensional input, such as robotics and games. Despite its success, however, DRL suffers from several significant issues: it is often sensitive to changes in the environment, can be unstable to train, and may require a large amount of data to work well. Because DRL is multi-faceted in nature, solving these issues is not trivial; it may require studying many aspects of DRL that are interrelated.
Among the many aspects of DRL, generalization is often considered one of the most important. Improving generalization in DRL could lead to agents that perform better, are more consistent, and are more robust in various situations. The importance of generalization is also shared by supervised learning -- another branch of machine learning -- where many techniques to improve generalization have been developed. In this thesis, we investigate some of these techniques in the context of DRL to understand when they are useful, examine the details and changes necessary to make them work well, and identify the remaining issues.
We first investigate a generalization technique that imposes no assumptions on the problem: ensemble learning. The performance of ensemble learning largely depends on its capability to produce diverse estimators, which classically relies on the assumption that multiple training datasets are available -- an assumption that generally does not hold in RL. Despite this, much research has demonstrated the unreasonable effectiveness of ensembles in RL. We investigate this phenomenon by proposing an ensemble agent of random learners. We then derive a refined bias-variance-covariance analysis under an assumption more appropriate to RL settings: instead of requiring multiple datasets, we assume that randomness comes from the learning algorithm itself. We theoretically show how this type of ensemble agent is useful, empirically validate these findings, and discuss the limitations of our analysis.
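For context, the classical bias-variance-covariance decomposition of an ensemble's squared error -- the starting point that a refined analysis of this kind adapts -- can be written as follows. The notation here is illustrative; in the thesis the expectations are taken over algorithmic randomness rather than over training datasets.

\mathbb{E}\!\left[(\bar{f}-y)^2\right] = \overline{\mathrm{bias}}^2 + \frac{1}{M}\,\overline{\mathrm{var}} + \left(1-\frac{1}{M}\right)\overline{\mathrm{covar}}, \qquad \bar{f} = \frac{1}{M}\sum_{i=1}^{M} f_i,

where \overline{\mathrm{bias}}, \overline{\mathrm{var}}, and \overline{\mathrm{covar}} are the averages, over the M ensemble members, of their biases, variances, and pairwise covariances. For fixed bias and variance, the ensemble's error shrinks as the members' errors become less correlated, which is why producing diverse estimators matters.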
Next, we extend the above approach by incorporating a generalization technique that "weakly" imposes assumptions on the problem. In contrast to the previous approach, which relies on randomness in the learning algorithm to provide diversity among the estimators, we induce diversity more explicitly by imposing a preference bias through auxiliary task learning. We propose a set of auxiliary tasks suitable for DRL, built on common assumptions that hold generally in RL, and augment an ensemble agent with them. We then use the refined bias-variance-covariance analysis derived earlier to analyze the ensemble and propose several ways to optimize it. Experimental evaluation shows that our approach outperforms strong baselines, including the ensemble of random learners.
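To illustrate the mechanism (not the exact architecture or task set from the thesis), a minimal PyTorch sketch of an agent with a shared encoder, a main Q-value head, and a hypothetical auxiliary reward-prediction head might look like this; AuxTaskAgent, the hidden size, and aux_weight are all assumptions for illustration.

import torch
import torch.nn as nn

class AuxTaskAgent(nn.Module):
    """Shared encoder with a main Q-value head and one auxiliary head.
    The auxiliary task here (one-step reward prediction) is a common
    generic choice, standing in for the tasks proposed in the thesis."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)  # main RL objective
        self.aux_head = nn.Linear(hidden, 1)        # auxiliary: predict reward

    def forward(self, obs):
        z = self.encoder(obs)
        return self.q_head(z), self.aux_head(z)

def loss_fn(q_pred, q_target, r_pred, r_target, aux_weight=0.5):
    """Combined loss: the auxiliary term shapes the shared representation,
    injecting a preference bias that can diversify ensemble members."""
    td_loss = nn.functional.mse_loss(q_pred, q_target)
    aux_loss = nn.functional.mse_loss(r_pred, r_target)
    return td_loss + aux_weight * aux_loss

Giving each ensemble member a different auxiliary task (or task weight) is one natural way such a design yields the explicit diversity discussed above.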
Finally, we look into another generalization technique that enforces assumptions more "strongly" through a restriction bias, namely an architectural inductive bias that encodes the structure of a planning algorithm. One general planning algorithm for sequential decision-making problems is value iteration, whose structure is captured by the value iteration network (VIN). However, VIN imposes stringent assumptions about the problem, limiting its applicability to problems of limited complexity. Here, we identify important properties common to real-world problems and propose generalized value iteration networks, which principally extend VIN to handle these properties. Our methods can handle complex transition functions, temporally changing environments, and a form of hierarchical planning by design, and may be able to do more thanks to end-to-end learning. Experimental evaluation shows that our methods perform near optimally on problems well-aligned with their inductive bias, and still do well on other problems.
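As background, the core of the original VIN (Tamar et al., 2016) for 2-D grid worlds can be sketched as follows: each round of value iteration becomes a convolution (expected next value under each action) followed by a max over actions. The kernel size, number of actions, and iteration count K below are illustrative choices, not the settings used in the thesis.

import torch
import torch.nn as nn

class VINCore(nn.Module):
    """K iterations of value iteration on a grid, as a recurrent conv block."""
    def __init__(self, n_actions=8, k=20):
        super().__init__()
        self.k = k
        # One 3x3 kernel per action plays the role of the local transition
        # model P(s'|s,a); it is learned end-to-end rather than specified.
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)

    def forward(self, reward_map):
        # reward_map: (B, 1, H, W) learned reward image
        v = torch.zeros_like(reward_map)
        for _ in range(self.k):
            q = self.q_conv(torch.cat([reward_map, v], dim=1))  # Q(s, a)
            v, _ = q.max(dim=1, keepdim=True)                   # V(s) = max_a Q(s, a)
        return v

The stringency of VIN's assumptions is visible here: the shared 3x3 kernels presume local, spatially invariant, time-invariant transitions, which is exactly what the generalized networks above relax.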
Another property of real-world problems, unexplored in the above work, is their large-scale nature. In the last part of this thesis, we look into the learning instability that arises when scaling value iteration networks to large input sizes. We propose distributional value iteration networks, which encode distributional value iteration as a neural network architecture. Experimental evaluation shows the benefit of the networks' dense representation in alleviating learning instability.
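As background for the distributional view (not the thesis's architecture itself), a single categorical distributional backup in the style of the C51 projection (Bellemare et al., 2017) can be sketched in NumPy as follows; the support range and atom count are illustrative assumptions.

import numpy as np

def categorical_backup(atoms, probs, reward, gamma):
    """Project the shifted distribution reward + gamma * Z back onto a
    fixed support of atoms (the standard categorical projection)."""
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    tz = np.clip(reward + gamma * atoms, v_min, v_max)  # shifted support
    b = (tz - v_min) / dz                               # fractional atom index
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    np.add.at(out, lo, probs * (hi - b))  # split mass between neighbours
    np.add.at(out, hi, probs * (b - lo))
    eq = lo == hi                         # b landed exactly on an atom
    np.add.at(out, lo[eq], probs[eq])
    return out

atoms = np.linspace(-10.0, 10.0, 51)  # fixed support; bounds/atoms illustrative
probs = np.full(51, 1.0 / 51)         # uniform initial value distribution
probs = categorical_backup(atoms, probs, reward=1.0, gamma=0.99)

Carrying a full probability vector per state, rather than a single scalar value, is a denser value representation; plausibly this is the sense in which the networks' dense representation helps stabilize learning at scale.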