Optimizing the Trade-Off Between Resources and Information in Summarizing Massive Graph Data Streams

Professor Nick Duffield
Texas A&M University

Chaired by
Dr TAN Kian Lee, Shaw Senior Professor, School of Computing

06 Nov 2017 Monday, 10:30 AM to 12:00 PM

Executive Classroom, COM2-04-02


Sampling is a powerful approach for reducing Big Data to Small Data, relieving storage demands and enabling faster query response when an approximate answer suffices. The focus of this talk is a cost-based formulation of data reduction that allows flexible expression of query goals and weights the sampling to minimize information loss under a given space constraint. This approach was motivated by problems in streaming traffic measurement in ISPs. We describe some new applications to subgraph estimation in massive graph streaming data, both for query execution on stored transactional data and for construction of reference samples for retrospective queries. We establish a framework for unbiased estimation of the cardinality of arbitrary sets of subgraphs, together with estimates of their variance. The result hinges on a martingale formulation of weighted priority sampling that establishes unbiasedness for edge-product form estimators and enables us to decouple the sampling and estimation steps. This property enables execution on billion-scale graphs with accuracy comparable to existing approaches while using significantly less memory.
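
For readers unfamiliar with weighted priority sampling, the following is a minimal sketch of the classic fixed-size scheme for weighted items, not the graph-stream method presented in the talk. Each item of weight w receives priority w/u for an independent uniform u in (0, 1]; the k items of highest priority are kept, and each kept item is assigned the estimate max(w, tau), where tau is the (k+1)-th highest priority. The function name `priority_sample` and its parameters are illustrative assumptions, and weights are assumed to be positive.

```python
import heapq
import random


def priority_sample(weights, k, rng):
    """Return {index: estimated weight} for a priority sample of size k.

    Illustrative sketch only: assumes positive weights and k >= 1.
    """
    heap = []  # min-heap holding the (k + 1) highest (priority, index) pairs
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()  # uniform in (0, 1], so w / u is always defined
        heapq.heappush(heap, (w / u, i))
        if len(heap) > k + 1:
            heapq.heappop(heap)  # discard the lowest priority seen so far
    if len(heap) <= k:
        # Fewer than k + 1 items in total: keep all of them with exact weights.
        return {i: weights[i] for _, i in heap}
    tau, _ = heapq.heappop(heap)  # (k + 1)-th highest priority is the threshold
    return {i: max(weights[i], tau) for _, i in heap}


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [float(w) for w in range(1, 51)]
    sample = priority_sample(weights, 10, rng)
    print("true total:", sum(weights))
    print("estimated total:", sum(sample.values()))
```

Summing the estimates of the sampled items that fall in any chosen subset gives an unbiased estimate of that subset's total weight, which is why a single sample can serve queries that were not known at sampling time.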


Nick Duffield (http://nickduffield.net/work) is a Professor in the Department of Electrical and Computer Engineering at Texas A&M University, and Director of the Texas A&M Engineering Big Data Initiative. From 1995 until 2013 he worked at AT&T Labs-Research, Florham Park, NJ, where he was a Distinguished Member of Technical Staff and an AT&T Fellow. He obtained a PhD in Mathematical Physics from the University of London, and a BA and MMath from the University of Cambridge, UK. His research focuses on the foundations and applications of Data Science and Computer Networking, including graph streaming, rough path analysis, network measurement and resilience, transportation, and hydrology. He is Chief Editor for Big Data at Frontiers in ICT and an Editor-at-Large for the IEEE/ACM Transactions on Networking. Dr. Duffield is an IEEE Fellow, an IET Fellow, a member of the Board of Directors of ACM SIGMETRICS, and was a co-recipient of the ACM SIGMETRICS Test of Time Award in 2012 and 2013. His research is supported by awards from NSF, DARPA, Google, and Intel.