PH.D DEFENCE - PUBLIC SEMINAR

Efficient Sliding-Windowalgorithms For Real-Time Data Stream Analytics

Speaker
Mr Wang Yanhao
Advisor
Dr Tan Kian Lee, Tan Sri Runme Shaw Senior Professor, School of Computing


03 Feb 2020 Monday, 10:00 AM to 11:30 AM

Executive Classroom, COM2-04-02

Abstract:

In recent years, data streams have become ubiquitous because many different applications, e.g., social media, communication networks, and sensors, are continuously generating huge volumes of data at an unprecedented rate. Data streams provide both great opportunities and immense challenges for data analysis. On the one hand, they are important sources of valuable information and knowledge. On the other hand, the volumes and velocities of streams pose new challenges for data analytics. In practice, streaming algorithms have been widely used for data stream analytics. However, streaming algorithms still have some limitations in the time-sensitive scenarios because they cannot capture the highly dynamic and time-evolving patterns of streaming data.

To address these problems, we adopt the sliding window model where only recent data in a stream is used for computation to avoid including outdated information and guarantee the timeliness in real-time analysis. Nevertheless, since sliding-window algorithms are generally more challenging than their (append-only) streaming counterparts, efficient sliding-window algorithms for analyzing time-sensitive data streams in real time are still largely unexplored yet. In this thesis, we investigate different problems of real-time data stream analytics and propose efficient sliding-window solutions for these problems.

First, we study the influence maximization problem on social streams. We formulate the definition of influence based on user actions in a social network, and propose the Stream Influence Maximization (SIM) query to track a seed set with the maximum influence over the sliding window of recent actions in a social stream. Then, we develop the Influential Checkpoints (IC) and Sparse Influential Checkpoints (SIC) frameworks for efficient SIM processing with theoretical guarantees. Moreover, we extend SIM to the Location-aware SIM (LSIM) query to find a seed set with the maximum influence in a spatial region from a geo-tagged social stream. We further develop the Location-based SIC (LSIC) framework and its improved version LSIC+ for efficient LSIM processing.

Second, we investigate the representative subset selection (RSS) problem in data streams. We formulate dynamic RSS as maximizing a monotone submodular utility function subject to a d-knapsack constraint (SMDK) in the sliding window model. Then, we devise the KW and KW+ frameworks for this problem. Theoretically, KW can provide a (1-eps)/(1+d)-approximate solution where eps is a real number in (0,1) for SMDK over a sliding window of elements. KW+ is more efficient than KW at the expense of a lower approximation factor of (1-eps)/(2+2d).

Third, we consider the problem of maintaining a coreset for the minimum enclosing ball (MEB) of a sliding window of points. We propose the AOMEB algorithm to maintain a (sqrt(2)+eps)-coreset for MEB in an append-only stream. By using AOMEB as a building block, we propose two sliding-window algorithms, namely SWMEB and SWMEB+, to maintain coresets for MEBs with approximation ratios of sqrt(2)+eps and 9.66+eps respectively. Furthermore, we generalize our proposed algorithms to maintain coresets for kernelized MEBs in reproducing kernel Hilbert space (RKHS).

For each algorithm in this thesis, we conduct extensive experiments on real-world and synthetic datasets in different parameter settings. The experimental results demonstrate the effectiveness and efficiency of our proposed algorithms.