PH.D DEFENCE - PUBLIC SEMINAR

From Raw Data to Processable Informative Data: Training Data Management for Big Data Analytics

Speaker
Mr Gao Jinyang
Advisor
Dr Ooi Beng Chin, Professor, School of Computing


31 Jan 2017 Tuesday, 10:00 AM to 11:30 AM

Executive Classroom, COM2-04-02

Abstract:

Due to the surging volume of Big Data, data-driven approaches are playing an ever-increasing role in today's knowledge discovery and decision making. Though cheap raw data from various sources is produced everywhere, most of it cannot be used directly as training data to benefit analytics tasks. This is mainly because raw data is usually too large to be processed directly, and its informative value is not as high as that of data collected from deliberately designed experiments. To make full use of Big Data, there is an increasing need to establish an infrastructure for training data management that transforms raw data into processable, informative data by leveraging both human effort and computational resources.

In this thesis, we aim to develop effective and efficient solutions for transforming Big Data into a processable and informative form. Two challenging problems are discussed and addressed. The first challenge is to increase the information value of Big Data, mainly by acquiring extra supervised information through data annotation. We propose a preference-quantified model to annotate complex tasks where the supervised information is difficult to represent with simple labels, and adapt an active learning approach to reduce the cost of human effort. To further reduce the cost of data annotation through crowd-sourcing, we develop a cost-sensitive method for crowd-sourced data quality management. The second challenge is to squeeze and reorganize the data into a processable form without losing much of the information in the original data, which typically involves representing, compressing, indexing and sampling the data to increase computational efficiency. We propose a hashing method to transform the training data into a more compact representation, while preserving both the internal information in each instance and the external relations among instances. Moreover, we index the data, which is usually high-dimensional, to support similarity queries based on the distance-independent k-nearest neighbor measure. Finally, we study the effect of data sampling patterns on the efficiency of analytics model training, aiming to provide the most informative data in a processable size to the analytics model and so speed up the model training procedure.