Data Analytics In Complex Data Environments: Methods Towards Missing Value And Transfer Learning

Ms Peng Jiaxu
Dr Hahn Jungpil, Associate Professor, School of Computing

25 Jun 2019, Tuesday, 03:30 PM to 05:00 PM

 Executive Classroom, COM2-04-02


The rapid accumulation of data and advances in data analytics methods create not only opportunities but also challenges for data analytics. Statistical models can now be built with great ease and speed, but researchers remain concerned about the validity of statistical inference and the viability of statistical models over time. This thesis proposes two studies to investigate two fundamental and frequently recurring problems faced by academic researchers and data scientists: handling missing values (the focus of the first study), and statistical learning in dynamic data environments (the focus of the second study).

In my first study, entitled "A Semi-Supervised Learning Approach to Missing Value Imputation," I propose a new semi-supervised learning approach for imputing missing values. Traditional imputation models are often built on complete records, i.e., records in the dataset without missing values. For instance, the mean imputation approach replaces each missing value with the average of the observed values. This approach is simple to implement but does not take the missingness mechanism into consideration. If the missing values do not occur at random (e.g., a value is more likely to be missing when it is larger), then estimates based on the mean-imputed data will be biased. The method I propose adopts a semi-supervised strategy and explicitly incorporates the missingness mechanism, albeit in a probabilistic manner, into the imputation approach. I demonstrate analytically that the proposed method generates better imputations than traditional methods that ignore the missingness mechanism. Simulation results confirm this. Finally, the proposed method is evaluated in the context of two real-world prediction tasks -- credit default prediction and earnings prediction. By imputing an important predictor variable in each of the two data sets, I show that the proposed method yields higher prediction accuracy than benchmark imputation methods. Future work includes investigating whether, how, and to what extent my method can improve the validity of coefficient estimation in regression analysis.
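The bias of mean imputation under non-random missingness can be illustrated with a small simulation. The sketch below is only an illustration of the problem, not the proposed semi-supervised method; the variable, its distribution, and the logistic missingness rule are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variable (assumed normal, true mean 50, standard deviation 10).
x = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Assumed missingness mechanism: larger values are more likely to be missing
# (missing not at random). The logistic form is an illustrative choice.
p_missing = 1.0 / (1.0 + np.exp(-(x - 50.0) / 5.0))
missing = rng.random(x.size) < p_missing

# Mean imputation: fill every missing entry with the mean of the observed values.
observed_mean = x[~missing].mean()
x_imputed = np.where(missing, observed_mean, x)

true_mean = x.mean()
imputed_mean = x_imputed.mean()

print(f"true mean:         {true_mean:.2f}")
print(f"mean after impute: {imputed_mean:.2f}")
print(f"true std:          {x.std():.2f}")
print(f"std after impute:  {x_imputed.std():.2f}")
```

Because large values are preferentially missing, the observed mean underestimates the true mean, and filling the gaps with that same value carries the bias into the completed data while also shrinking its variance.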

My second study, entitled "Transfer Learning in Dynamic Business Environments," takes up the challenge of building statistical machine learning models in dynamically changing data environments. In the real world, the true underlying pattern of the data may change. Statistical models built using historical data may not adapt to changes in the environment, which appear as changes in the structural patterns of the data that the new environment generates. A simple solution is to re-train the machine learning model using newly collected current data. However, current data are often scarce. Therefore, it would be beneficial to transfer the machine learning model built on past data to the current period. In this thesis, I study the effectiveness of existing transfer learning methods in an important business application -- earnings forecasts for public firms. Moreover, I propose a two-step transfer learning method for improving the performance of machine learning in dynamic data environments. My method leverages the insight that, by comparing current data with historical data, we may obtain information on the change in the data environment, which can subsequently guide the training of machine learning models using historical and current data jointly. Results show that transfer learning improves the predictive performance of machine learning models when the data environment undergoes significant changes. Future work for this study involves evaluating transfer learning methods in broader simulated and real-world contexts, with the goal of understanding their proper use in empirical research and thus improving research and practice that employ predictive models.
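The trade-off between plentiful but outdated historical data and scarce current data can be sketched with a simulated linear model. The example below uses a generic instance-weighting baseline (down-weighted historical records trained jointly with current records); it is not the two-step method proposed in the thesis. The data-generating process, the size of the structural change, and the value of `hist_weight` are all hypothetical.

```python
import numpy as np


def fit_weighted_ls(X, y, sample_weight):
    """Weighted least squares: minimize sum_i w_i * (y_i - x_i @ beta)^2."""
    sw = np.sqrt(sample_weight)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


def mse(X, y, beta):
    """Mean squared prediction error of a linear model."""
    return float(np.mean((y - X @ beta) ** 2))


rng = np.random.default_rng(0)
d = 10
beta_hist = np.ones(d)          # assumed coefficients in the historical period
beta_cur = beta_hist.copy()
beta_cur[0] += 2.0              # hypothetical structural change in one coefficient


def simulate(n, beta):
    """Draw n records from a zero-mean linear model with unit-variance noise."""
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(size=n)
    return X, y


X_hist, y_hist = simulate(2000, beta_hist)   # plentiful historical data
X_cur, y_cur = simulate(15, beta_cur)        # scarce current data
X_test, y_test = simulate(5000, beta_cur)    # evaluation in the current period

# Baseline 1: model built on historical data only.
beta_hist_only = fit_weighted_ls(X_hist, y_hist, np.ones(len(y_hist)))
# Baseline 2: model re-trained on the scarce current data only.
beta_cur_only = fit_weighted_ls(X_cur, y_cur, np.ones(len(y_cur)))

# Joint training: each historical record gets a small weight (hypothetical value).
hist_weight = 0.01
X_joint = np.vstack([X_hist, X_cur])
y_joint = np.concatenate([y_hist, y_cur])
w_joint = np.concatenate([np.full(len(y_hist), hist_weight), np.ones(len(y_cur))])
beta_joint = fit_weighted_ls(X_joint, y_joint, w_joint)

mse_hist_only = mse(X_test, y_test, beta_hist_only)
mse_cur_only = mse(X_test, y_test, beta_cur_only)
mse_joint = mse(X_test, y_test, beta_joint)

print(f"historical only: {mse_hist_only:.2f}")
print(f"current only:    {mse_cur_only:.2f}")
print(f"joint, weighted: {mse_joint:.2f}")
```

Here the historical weight is fixed by hand. A method that first compares current with historical data could instead use the estimated change to decide how much to trust the historical records, which is the kind of guidance the paragraph above describes.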

Overall, this dissertation comprises two method-oriented studies. The first study aims to promote advanced missing value handling methods by reviewing existing approaches and, more importantly, by proposing a new missing value imputation method that reduces imputation error due to non-random missingness. The second study aims to maintain predictive performance in dynamic data environments by answering whether and how transfer learning can improve decision accuracy. The review and evaluation of existing methods, together with the proposed new methods, are expected to provide valuable contributions to data analytics researchers and practitioners.