Reliable and Affordable Training of Large DNNs over Preemptible GPUs
COM1 Level 2
DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training and unpalatable (and often unaffordable) costs for organizations and research labs of all scales. In this talk, I will present our work on affordable AI, which aims to significantly reduce training costs through effective use of preemptible instances: instances that can be obtained at a much lower price while idle, but may be preempted whenever they are requested by priority users. Exploiting such instances, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions, a failure model drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. I will describe the techniques that enable these properties, as well as our recent effort to commercialize them in a startup called BreezeML.
Harry Xu is a Professor of Computer Science at the University of California, Los Angeles. His work spans a range of systems areas, including operating and distributed systems, runtime systems, compilers, AI/ML systems, and data analytics systems. Harry is a winner of the Dahl-Nygaard (Junior) Prize, multiple NSF, Google, Cisco, Alibaba, and Huawei faculty awards, as well as several best paper awards at top conferences (such as OSDI). He co-founded BreezeML, a startup focused on the development of affordable ML infrastructure, and currently serves as its CEO.