Continual Mixture of Expert Learning Under Spatial & Temporal Concept Drift
COM2 Level 4
Executive Classroom, COM2-04-02
Abstract:
Concept drift is highly detrimental to the performance of machine learning models that are not tuned to the latest trends in data. This phenomenon is typical of evolving real-world datasets, such as transaction graphs for fraud detection and time-series user preference modelling. At large-scale companies such as Grab, machine learning practitioners often struggle with model re-training under tight service level agreements (SLAs) and resource budgets. Consequently, more attention has been devoted to continual learning, in which models are updated incrementally as new data is ingested. As a drawback, models become susceptible to catastrophic forgetting and tend to underperform on earlier tasks once their parameters have been tuned for the present. One viable solution is therefore to employ capacity scaling over a Mixture of Experts (MoE) ensemble. New expert models are trained and introduced into an existing ensemble to accommodate the growth of tasks, while a routing protocol presides over all experts and decides which expert(s) are selected during inference. Capacity-scaled MoEs are promising as they theoretically eliminate catastrophic forgetting in continual learning.
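To make the capacity scaling idea concrete, below is a minimal sketch of an ensemble that grows only by appending experts and delegates inference through a router. The Expert and CapacityScaledMoE classes and the router_score callback are illustrative placeholders, not components described in this thesis.

```python
# Minimal sketch of capacity scaling over an expert ensemble.
# Expert, CapacityScaledMoE and router_score are illustrative names only.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Expert:
    model: Callable      # e.g. a trained tf.keras.Model
    signature: dict      # summary of the data slice the expert was trained on

@dataclass
class CapacityScaledMoE:
    experts: List[Expert] = field(default_factory=list)

    def add_expert(self, model, signature):
        # Capacity scaling: new experts are appended, old ones are never
        # overwritten, so knowledge of earlier tasks is preserved.
        self.experts.append(Expert(model, signature))

    def predict(self, x, router_score):
        # The router scores every expert against the incoming sample and
        # delegates inference to the most relevant one.
        best = max(self.experts, key=lambda e: router_score(x, e.signature))
        return best.model(x)
```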
In this thesis, we present practical algorithms to address the shortfalls of training capacity-scaled MoE models under concept drift. Our work is specific to Grab, a superapp in Southeast Asia operating primarily in the ride-hailing and delivery sectors, but may be generalized to the industry at large. We claim and justify that existing techniques for scaling and inference over capacity-scaled MoE models are inefficient, in both resource usage and accuracy, when operating over data of large volume and velocity. We explore these inefficiencies from three different angles and propose novel solutions as follows.
Our first work is Deep Dynamic Graph Partitioning (DDGP), a two-pronged iterative pipeline for segmenting and training experts over large graph structures. The motivation for DDGP stems from the lack of work addressing spatial concept drift within large graphs, which is detrimental to learning a single model over the entire graph. DDGP keeps expert feedback in the loop to iteratively divide a graph into sub-graphs, with the objective of maximizing the collective performance of all experts.
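In highly simplified form, the two-pronged loop can be pictured as alternating between partitioning and training until the collective score stops improving; partition_graph, train_expert and evaluate below are hypothetical helpers standing in for DDGP's actual components.

```python
# Simplified partition-and-train loop in the spirit of DDGP; the helper
# callables are hypothetical and do not reflect the algorithm's internals.
def iterative_partition_and_train(graph, num_experts, rounds,
                                  partition_graph, train_expert, evaluate):
    feedback = None
    best_experts, best_subgraphs, best_score = None, None, float("-inf")
    for _ in range(rounds):
        # Prong 1: split the graph into sub-graphs, guided by expert feedback
        # from the previous round (None on the first pass).
        subgraphs = partition_graph(graph, num_experts, feedback=feedback)
        # Prong 2: train one expert per sub-graph and score each of them.
        experts = [train_expert(sg) for sg in subgraphs]
        feedback = [evaluate(e, sg) for e, sg in zip(experts, subgraphs)]
        score = sum(feedback)   # objective: collective performance of all experts
        if score <= best_score:
            break               # repartitioning no longer improves the ensemble
        best_experts, best_subgraphs, best_score = experts, subgraphs, score
    return best_experts, best_subgraphs
```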
Our second work is the Indexed Router, a framework for low-latency indexing and search across large expert ensembles. The Indexed Router was designed for systems that require low-latency model updates and high queries-per-second (QPS) throughput. These requirements are not satisfied by present capacity scaling algorithms, which have linear time complexities for update and inference. The Indexed Router eliminates catastrophic forgetting, even when learning under temporal concept drift, by persisting all models to storage and providing a routing logic that selects the relevant historical experts during inference.
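As an illustration of indexed lookup over persisted experts, one could keep experts sorted by the time window they cover and binary-search that index at query time. The storage layout and bisect-based lookup below are assumptions for the sketch, not the Indexed Router's actual design.

```python
# Illustrative index over persisted experts keyed by the start of the time
# window each expert was trained on; not the Indexed Router's implementation.
import bisect

class TemporalExpertIndex:
    def __init__(self):
        self._start_times = []    # sorted start timestamps of expert windows
        self._expert_paths = []   # where each frozen expert is persisted

    def add_expert(self, window_start, checkpoint_path):
        # New experts are appended to storage; earlier experts are never
        # retrained, so nothing learned in the past can be forgotten.
        pos = bisect.bisect_right(self._start_times, window_start)
        self._start_times.insert(pos, window_start)
        self._expert_paths.insert(pos, checkpoint_path)

    def route(self, query_time):
        # Binary search for the expert whose window covers the query timestamp,
        # giving O(log n) lookup instead of scanning every expert.
        pos = bisect.bisect_right(self._start_times, query_time) - 1
        if pos < 0:
            raise KeyError("query predates the earliest expert")
        return self._expert_paths[pos]
```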
Finally, in spite of its merits, applications of capacity-scaled MoE models for continual learning are almost non-existent, with single-model training frameworks such as Avalanche and Mammoth leading in popularity and industry adoption. In addition, we observed multiple data science pipelines at Grab that are unable to transition towards continual learning due to the steep engineering cost involved. Motivated by this, we propose the AdaptiveStream model building library. The library is an extension of the Indexed Router framework and is written primarily in Python and TensorFlow, with the goal of enabling machine learning practitioners to harness the potential of capacity-scaled MoE models and of ensuring a seamless transition towards continual learning workloads. The code base for AdaptiveStream has been open sourced to support future improvements to the framework.
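To give a flavour of what such a transition can look like, the sketch below wraps a drift detector and a Keras model factory into a pipeline that appends a new expert whenever drift is flagged. The MoEPipeline class and its methods are invented for illustration and are not the actual AdaptiveStream API.

```python
# Hypothetical usage pattern for a capacity-scaled MoE wrapper in a continual
# learning pipeline; MoEPipeline is not the real AdaptiveStream interface.
import tensorflow as tf

def build_expert():
    # Each expert is an independent Keras model; the architecture here is an
    # arbitrary choice for the sketch.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

class MoEPipeline:
    """Illustrative wrapper: grows the ensemble when drift is flagged."""

    def __init__(self, build_fn, drift_detector):
        self.build_fn = build_fn
        self.drift_detector = drift_detector
        self.experts = []

    def ingest(self, x_batch, y_batch):
        if not self.experts or self.drift_detector(x_batch):
            # Concept drift detected: scale capacity with a fresh expert
            # instead of overwriting an existing one.
            model = self.build_fn()
            model.compile(optimizer="adam", loss="mse")
            self.experts.append(model)
        # Only the newest expert is updated; older experts stay frozen.
        self.experts[-1].fit(x_batch, y_batch, epochs=1, verbose=0)
```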