DOCTORAL SEMINAR

Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Speaker
Ms Qiao Mengke
Advisor
Dr Huang Ke-Wei, Associate Professor, School of Computing


24 Dec 2019 Tuesday, 02:00 PM to 03:30 PM

Executive Classroom, COM2-04-02

Abstract:

As a result of advances in data mining, a growing number of empirical studies in the social sciences apply classification algorithms to construct independent and dependent variables for further analysis with standard regression methods. Specifically, in the first stage, researchers apply a classification algorithm to construct a new categorical variable. In the second stage, this newly constructed categorical variable is used directly as an independent or dependent variable in a regression. We define this type of study as an "individual-level hybrid study". Alternatively, the new variable from the first, classification stage can be aggregated (e.g., by mean or sum) and then used as a variable in an aggregate-level regression. We define this type of study as an "aggregate-level hybrid study". In the classification phase of these studies, the standard procedure requires researchers to subjectively choose a classification performance metric to optimize. Regardless of the metric chosen, the constructed variable still contains classification error because the variable cannot be classified perfectly. This misclassification biases the estimated regression coefficients in the second phase, a problem documented in the econometrics literature as measurement error.
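
To make the two-stage setup concrete, the sketch below simulates a minimal individual-level hybrid pipeline: a classifier constructs a binary variable in stage one, and that noisy predicted variable is then used as a regressor in stage two, where its misclassification attenuates the estimated coefficient. The simulated data, variable names, and libraries (numpy, scikit-learn, statsmodels) are illustrative assumptions and are not taken from the talk.

# Minimal sketch of an individual-level "hybrid study" (illustrative assumptions only):
# stage 1 classifies a latent binary trait, stage 2 regresses an outcome on the
# predicted (hence misclassified) trait instead of the true one.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Latent true trait x_star and noisy features the classifier can observe.
x_star = rng.binomial(1, 0.4, size=n)
features = x_star[:, None] + rng.normal(scale=2.0, size=(n, 3))  # weak signal -> imperfect classifier

# Outcome generated from the *true* trait (true slope = 2.0).
y = 1.0 + 2.0 * x_star + rng.normal(scale=1.0, size=n)

# Stage 1: train the classifier on a labeled subsample, then predict for everyone.
train = rng.random(n) < 0.3
clf = LogisticRegression().fit(features[train], x_star[train])
x_hat = clf.predict(features)

# Stage 2: regress the outcome on the constructed (misclassified) variable.
naive = sm.OLS(y, sm.add_constant(x_hat)).fit()
oracle = sm.OLS(y, sm.add_constant(x_star)).fit()
print("slope using predicted trait:", round(naive.params[1], 3))   # attenuated
print("slope using true trait:     ", round(oracle.params[1], 3))  # close to 2.0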

In the first study, we systematically investigate the theoretical foundation of the measurement error problem in individual-level hybrid studies. In these studies, the individual-level measurement error can be observed and quantified directly on the training set, and this information can be used to correct the estimation bias. Our theoretical analysis shows that consistent regression estimators can be recovered in all models studied in this paper. The main implication of this result is that researchers do not need to tune the classification algorithm to minimize the bias of the estimated regression coefficients, because this bias can be corrected by theoretical formulas even when classification accuracy is poor. Instead, we propose that the classification algorithm be tuned to minimize the standard error of the focal regression coefficient derived from the corrected formula. As a result, researchers can obtain consistent and most efficient estimators in all models studied in this paper.
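
As an illustration of why such bias is correctable, consider the textbook case of a single misclassified binary regressor with non-differential errors: the naive OLS slope converges to beta1 * (1 - alpha0 - alpha1) * pi * (1 - pi) / (q * (1 - q)), where alpha0 and alpha1 are the two misclassification rates, pi is the true share of ones, and q = alpha0 + (1 - alpha0 - alpha1) * pi is the observed share. All of these quantities can be estimated from a labeled training set, so the attenuation can be undone. The sketch below applies this classical method-of-moments correction; it is a stand-in for, not a reproduction of, the formulas derived in the talk.

# Sketch of the classical method-of-moments correction for a single misclassified
# binary regressor (illustrative; the talk derives its own formulas and models).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, beta0, beta1 = 50_000, 1.0, 2.0

x_star = rng.binomial(1, 0.4, size=n)              # true binary variable
y = beta0 + beta1 * x_star + rng.normal(size=n)    # outcome depends on the truth

# Non-differential misclassification with fixed error rates.
alpha0, alpha1 = 0.15, 0.20                        # P(xhat=1|x*=0), P(xhat=0|x*=1)
flip_to_1 = (x_star == 0) & (rng.random(n) < alpha0)
flip_to_0 = (x_star == 1) & (rng.random(n) < alpha1)
x_hat = np.where(flip_to_1, 1, np.where(flip_to_0, 0, x_star))

# Naive regression on the misclassified regressor: attenuated slope.
naive = sm.OLS(y, sm.add_constant(x_hat)).fit()
b1_naive = naive.params[1]

# In a hybrid study, alpha0 and alpha1 would be estimated from the training-set
# confusion matrix; here we simply plug in the rates used to generate the data.
q = x_hat.mean()                                   # observed share of ones
pi = (q - alpha0) / (1 - alpha0 - alpha1)          # implied true share
b1_corrected = b1_naive * q * (1 - q) / ((1 - alpha0 - alpha1) * pi * (1 - pi))

print(f"true slope      {beta1:.3f}")
print(f"naive slope     {b1_naive:.3f}")
print(f"corrected slope {b1_corrected:.3f}")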

In the second study, we analyze the measurement error issue in aggregate-level hybrid studies. In these studies, the aggregated measurement error cannot be observed and quantified directly on the training set, because the training set is generally a random subsample of the full dataset at the individual level, not at the aggregate level. Moreover, little prior literature has analyzed this problem, which arises from the unique two-stage structure of hybrid studies. We therefore derive new solutions to correct the estimation bias in aggregate-level hybrid studies. Our theoretical analysis shows that consistent regression estimators can be recovered in all cases studied in this paper. As in the first study, our theoretical results show that the estimation bias can be corrected by theoretical formulas even when classification performance is poor.
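
For intuition on the aggregate case, the sketch below averages misclassified individual predictions within groups and then adjusts each group share with a Rogan-Gladen-style plug-in correction, (observed share - alpha0) / (1 - alpha0 - alpha1), before running the aggregate regression. It assumes constant, non-differential error rates that are known (or estimated at the individual level); it only illustrates why the aggregated regressor is biased without correction and is not the solution developed in the talk.

# Sketch of an aggregate-level hybrid study (illustrative assumptions only):
# predicted labels are averaged within groups, and each group share is adjusted
# with a Rogan-Gladen-style correction before the aggregate regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_groups, group_size = 400, 2_000
alpha0, alpha1 = 0.15, 0.20          # misclassification rates, assumed constant

# True group-level shares and an aggregate outcome driven by them (true slope = 3.0).
true_share = rng.uniform(0.1, 0.9, size=n_groups)
y_group = 0.5 + 3.0 * true_share + rng.normal(scale=0.2, size=n_groups)

# Individual labels within each group, then misclassified predictions, then group means.
obs_share = np.empty(n_groups)
for g in range(n_groups):
    x_star = rng.binomial(1, true_share[g], size=group_size)
    flip_to_1 = (x_star == 0) & (rng.random(group_size) < alpha0)
    flip_to_0 = (x_star == 1) & (rng.random(group_size) < alpha1)
    x_hat = np.where(flip_to_1, 1, np.where(flip_to_0, 0, x_star))
    obs_share[g] = x_hat.mean()

# Plug-in correction of the aggregated regressor, then the aggregate regressions.
corr_share = (obs_share - alpha0) / (1 - alpha0 - alpha1)
naive = sm.OLS(y_group, sm.add_constant(obs_share)).fit()
corrected = sm.OLS(y_group, sm.add_constant(corr_share)).fit()
print("naive slope:    ", round(naive.params[1], 3))      # biased away from 3.0
print("corrected slope:", round(corrected.params[1], 3))  # close to 3.0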