Seminar by Mobilewalla on 14 January 2022
Abstract:
When building predictive models, model accuracy measures like precision, recall, area under the curve (AUC), etc., have traditionally been the primary driver of model design and operationalization. While this leads to high-fidelity model construction at training and testing time, operationalized models frequently demonstrate a lack of resiliency. In other words, performance in production often degrades, producing results far worse than those at training and testing. As machine learning (ML) matures within organizations, resiliency often overrides raw predictive accuracy as the defining criterion for productionized models. Increasingly, ML practitioners are leaning towards operationalizing decently performing, predictable production models rather than those that exhibit high performance at test time but don't quite deliver on that promise when deployed.
The data that power a model in deployment often differ from the original data used to create the features on which the model was trained - this phenomenon is called data drift. Of the many reasons that cause operationalized models to lack resilience, data drift is arguably the most common.
Existing tools handle this reactively - after models misbehave, mechanisms exist to check for drift. If drift is discovered, corrective measures are taken, which usually involve retraining the misbehaving model. This does not lead to the construction of models that are resilient from first principles. We believe resilient models should be powered by data that exhibit low drift over time - such models, by definition, would exhibit less drift-induced misbehaviour. To capture this property, i.e., drift over time, we have introduced the notion of data stability. While drift is a point measure, stability is a longitudinal metric. Stable data drift little over time, whereas unstable data drift substantially.
In this talk, we will introduce Anovos (anovos.ai), an open-source project that seeks to create a feature engineering pipeline to build resilient models. In Phase 1 of Anovos, which we will describe and demonstrate, we have launched a set of tools to ingest, analyze, clean, and prepare data for creating resilient and high-performing features, of which data stability is a key driver.
Biodata:
Anindya Datta, the CEO and Chairman of Mobilewalla, is widely regarded as a front-running technologist, leader, and innovator, with core contributions to the state of the art in large-scale data management and internet technologies. Atlanta-based Mobilewalla has pioneered consumer intelligence by applying groundbreaking AI & data science techniques to the industry's largest volumetric database of mobile engagement data. Prior to Mobilewalla, Anindya founded and ran Chutney Technologies (where he was backed by Kleiner Perkins), which evolved into one of the earliest entrants in the application virtualization area. The company was acquired by Cisco Systems in 2005. Anindya has also been on the faculties of major research universities and institutes, including the National University of Singapore, Georgia Institute of Technology, the University of Arizona, and Bell Laboratories.
Anindya obtained his undergraduate degree from the Indian Institute of Technology (IIT) Kharagpur, and his M.S. and Ph.D. degrees from the University of Maryland, College Park.