Data Analytics in Complex Data Environments: Methods Towards Missing Values and Dynamic Data Patterns
Join Zoom Meeting
https://nus-sg.zoom.us/j/8904491843?pwd=V2JuR2JqQjhqSUNYbXk0R0NwbXh3QT09 Meeting ID: 890 449 1843 Password: 651125
Abstract:
The rapid accumulation of data and advances in data analytics methods create both opportunities and challenges for data analytics. One fundamental challenge arises from heterogeneity in data patterns. This thesis investigates two frequently recurring problems that produce hard-to-observe data heterogeneity: missing values (the focus of the first study) and dynamically changing data patterns (the focus of the second study).
In my first study, entitled "Handling Missing Values without Assuming Missing at Random," I propose approaches to handling missing values that do not occur at random. Traditional imputation models are often built on complete records, i.e., records in the dataset without missing values. However, if missing values do not occur at random, the data patterns may differ between the observed and the missing information (e.g., a value is more likely to be missing when it is larger), and then even a simple mean estimate would be biased. In the proposed approaches, which include a missing value imputation method based on semi-supervised learning and a Monte Carlo likelihood estimation approach for correcting estimation bias caused by missing values, I explicitly incorporate the missingness mechanism into the data analytics process. I analytically demonstrate that accommodating the missingness mechanism yields better imputation and statistical estimates than traditional methods that ignore it. In particular, in two real-world prediction tasks, the proposed semi-supervised missing value imputation achieves higher prediction accuracy than benchmark imputation methods. For bias correction in regression analysis, the proposed Monte Carlo based approach produces unbiased estimates of regression coefficients under different missingness mechanisms.
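The claim that even a simple mean estimate is biased when larger values are more likely to be missing can be illustrated with a small simulation. The sketch below is not the thesis's semi-supervised or Monte Carlo likelihood method. It swaps in textbook inverse probability weighting and assumes the missingness mechanism is known exactly, which is rarely true in practice. The function name `simulate`, the logistic mechanism, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def simulate(n=100_000, seed=0):
    """Compare the full-data mean, the complete-case mean, and a weighted mean."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def p_observed(v):
        # Assumed (hypothetical) mechanism: the larger the value,
        # the less likely it is to be observed.
        return 1.0 / (1.0 + math.exp(2.0 * v))

    # Keep each value with probability p_observed(value).
    observed = [v for v in values if rng.random() < p_observed(v)]

    true_mean = sum(values) / len(values)
    # Ignoring the mechanism: average the observed records only.
    complete_case_mean = sum(observed) / len(observed)

    # Using the mechanism: weight each observed record by 1 / P(observed).
    weights = [1.0 / p_observed(v) for v in observed]
    weighted_mean = sum(w * v for w, v in zip(weights, observed)) / sum(weights)
    return true_mean, complete_case_mean, weighted_mean


true_mean, complete_case_mean, weighted_mean = simulate()
print(f"full data: {true_mean:.3f}")
print(f"complete cases only: {complete_case_mean:.3f}")
print(f"weighted by the assumed mechanism: {weighted_mean:.3f}")
```

Under these assumptions the complete-case mean falls well below the full-data mean, while the weighted mean stays close to it. This is the sense in which accounting for the missingness mechanism can reduce bias.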
My second study, entitled "Transfer Learning in Dynamic Business Environments: Trade-offs in Response to Changes," takes up the challenge that, in changing data environments, we often have little information with which to respond to changes and adjust statistical prediction models in a timely manner. I investigate whether and how we can use all of the source data (both the recent same-distribution source data and the remaining different-distribution past source data) to achieve better prediction accuracy for a target task when only a small amount of source data exhibits the target data pattern. I aim to bridge the research gap in theoretically understanding when and to what extent transfer learning works by using a sample selection perspective to represent changes in data patterns. Based on the sample selection model, I derive a probabilistic weighting scheme using the large source data set. Moreover, to examine transfer learning in the broader picture of changing data environments, I conduct a simulation analysis of two trade-offs that arise when changes are detected: whether we should use transfer learning to adjust the prediction model, and whether we should adjust the prediction model immediately or wait until more same-distribution source data become available. The results, implications, and contributions are discussed.
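The abstract does not specify the weighting scheme derived from the sample selection model, so the following is only a generic sketch of the broader idea of reweighting past source data toward a target pattern. It uses standard importance weighting with a density ratio that is assumed to be known. The function name `weighted_target_mean`, the Gaussian distributions, and the size of the shift are all hypothetical.

```python
import math
import random


def weighted_target_mean(n=50_000, shift=1.0, seed=0):
    """Estimate the target-pattern mean of y using only past source data."""
    rng = random.Random(seed)

    # Past source data (hypothetical): x ~ N(0, 1), outcome y = x + noise.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 0.5) for x in xs]

    # Assumed-known density ratio N(shift, 1) / N(0, 1) evaluated at x.
    # Records that resemble the target pattern receive larger weights.
    weights = [math.exp(shift * x - 0.5 * shift ** 2) for x in xs]

    unweighted = sum(ys) / n
    weighted = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return unweighted, weighted


unweighted, weighted = weighted_target_mean()
print(f"past data, unweighted: {unweighted:.3f}")
print(f"past data, reweighted toward the target: {weighted:.3f}")
```

In this toy setting the target mean of y equals `shift`. The unweighted estimate stays near the old pattern, while the reweighted one moves toward the target. In practice the weights would have to be estimated, and the gain depends on how far the two patterns differ.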
Throughout my dissertation, I seek to understand the underlying mechanisms of the heterogeneity in data patterns arising from missing values and changing data environments, and to provide theoretical insights on how to approximate and make use of these often-overlooked mechanisms. For example, the first study incorporates the missingness mechanism to improve imputation accuracy or to reduce bias in parameter estimation. The second study explores transfer learning from the sample selection perspective, which facilitates investigating the trade-offs in response to changes. By unveiling the underlying theory and assumptions, this dissertation promotes more robust application of data analytics in complex data environments.