Selective Exploration Methods for Experience Transfer in Reinforcement Learning

Mr Akshay Narayan
Dr Leong Tze Yun, Professor (Practice), School of Computing

13 Jan 2020 Monday, 03:00 PM to 04:30 PM

Executive Classroom, COM2-04-02


Transfer learning, the reuse of previously learned knowledge, can speed up learning in many reinforcement learning tasks. In this work, we propose a new selective exploration framework (SEF) for experience transfer to solve problems that require fast responses adapted from incomplete prior knowledge. We consider the setting where the source and target tasks share similar objectives but differ in their transition dynamics, e.g., a robotic agent operating in similar but challenging environments such as care homes and hospital wards.

In this dissertation, we focus on policies as the source of experience for transfer. Policy reuse is effected by identifying the sub-spaces of the target environment that differ from the source, where the source knowledge is insufficient. We present the selective exploration and policy transfer (SEAPoT) algorithm, an instantiation of the selective exploration framework. We describe methods to construct sub-spaces for local exploration and a strategy that selectively and efficiently explores the target task. We define a task similarity metric based on the Jensen-Shannon distance between the tasks' transition-probability distributions. We demonstrate the flexibility of the proposed framework by incorporating different exploration mechanisms for learning. We demonstrate the efficacy of SEAPoT in large experiments, using real-world scenarios modeled in Minecraft and discrete grid-world environments as test-beds, and empirically show that our method performs better than state-of-the-art policy reuse methods in terms of jump-starts and cumulative average rewards.
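As a rough illustration of the kind of similarity metric mentioned above, the following Python sketch computes the Jensen-Shannon distance between two discrete next-state distributions and averages it over the state-action pairs that two tasks share. The function names, the dictionary layout of the transition models, and the choice to average over shared pairs are assumptions made for this example; the abstract does not specify how the dissertation aggregates the per-pair distances.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions.

    p and q are equal-length sequences of probabilities that each sum to 1.
    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with disjoint support.
    """
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence in bits; terms with a_i == 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


def task_distance(source_T, target_T):
    """Mean JS distance over the (state, action) pairs present in both tasks.

    Hypothetical layout: each model maps (state, action) to a list of
    next-state probabilities over the same ordered set of next states.
    Averaging is an assumption of this sketch, not the dissertation's rule.
    """
    shared = source_T.keys() & target_T.keys()
    if not shared:
        raise ValueError("tasks share no (state, action) pairs")
    total = sum(js_distance(source_T[k], target_T[k]) for k in shared)
    return total / len(shared)


# Toy example: the two tasks agree at s1 but disagree completely at s0.
source_T = {("s0", "right"): [1.0, 0.0], ("s1", "right"): [0.5, 0.5]}
target_T = {("s0", "right"): [0.0, 1.0], ("s1", "right"): [0.5, 0.5]}
distance = task_distance(source_T, target_T)
```

In this toy case the per-pair distances are 1 at `s0` and 0 at `s1`, so the averaged task distance is 0.5. Pairs with a large per-pair distance are the natural candidates for the sub-spaces where the source knowledge is insufficient.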

Further, we introduce the partial policy generation (PPG) algorithm. PPG builds on the concept of motif discovery in gene sequencing and computational biology. We show that using sub-goal information, and clustering the input strings by sub-goal, yields an efficient set of partial policies that the agent can reuse in the target task. PPG is beneficial in scenarios where reusing the entire policy may not be effective.

Finally, we showcase the effectiveness of the algorithms in various discrete environments that reflect real-world tasks, each highlighting key aspects of the algorithms.