Streaming Anomaly Detection
COM1 Level 3
MR1, COM1-03-19
closeAbstract:
Anomaly detection is critical for finding suspicious behavior in innumerable systems, such as intrusion detection, fake ratings, and financial fraud. We need to detect anomalies in real-time or near real-time, i.e. determine if an incoming entity is anomalous or not, as soon as we receive it, to minimize the effects of malicious activities and start recovery as soon as possible. Therefore, online algorithms that can detect anomalies in a streaming manner are essential. Also, since the data increases as the stream is processed, we can only afford constant memory which makes the problem of streaming anomaly detection more challenging.
We first propose Midas which detects anomalous edges in dynamic graphs in an online manner, using constant time and memory. Midas focuses on detecting microcluster anomalies, or suddenly arriving groups of suspiciously similar edges such as denial of service attacks in network traffic data. In addition, by using a principled hypothesis testing framework, Midas provides theoretical bounds on the false positive probability, which previous methods do not provide. We then propose two variants, Midas-R which incorporates temporal and spatial relations, and Midas-F which aims to filter away anomalous edges to prevent them from negatively affecting the algorithm's internal data structures. Our experimental results show that Midas outperforms baselines in accuracy by up to 62% while processing the data orders of magnitude faster.
We then extend the count-min sketch data structure to a Higher-Order Sketch to capture complex relations in graph data, and to reduce detecting suspicious dense subgraph problem to finding a dense submatrix in constant time. Using this sketch, we propose four streaming methods to detect edge and subgraph anomalies in constant time and memory. Furthermore, our approach is the first streaming work that incorporates dense subgraph search to detect graph anomalies in constant memory and constant update time per newly arriving edge. We also provide theoretical guarantees on the higher-order sketch estimate and the submatrix density measure. Experimental results on real-world datasets demonstrate our effectiveness as opposed to popular state-of-the-art streaming edge and graph baselines.
Next, we broaden the graph setting to multi-aspect data. We propose MStream which detects anomalies in multi-aspect data streams including both categorical and numeric attributes and is online, thus processing each record in constant time and constant memory. Moreover, the anomalies detected by MStream are explainable. We further propose MStream-PCA, MStream-IB, and MStream-AE to incorporate correlation between features.
Finally, we consider multi-dimensional data streams with concept drift and propose MemStream, a streaming anomaly detection framework, allowing us to detect unusual events as they occur while being resilient to concept drift. MemStream leverages the power of a denoising autoencoder to learn representations and a memory module to learn the dynamically changing trend in data without the need for labels. We prove a theoretical bound on the size of memory for effective drift handling. In addition, we allow quick retraining when the arriving stream becomes sufficiently different from the training data. Furthermore, MemStream makes use of two architecture design choices to be robust to memory poisoning. Experimental results show the effectiveness of our approach compared to state-of-the-art streaming baselines.