PH.D DEFENCE - PUBLIC SEMINAR

Correcting Misclassification Bias in Regression Models with Variables Generated Via Data Mining

Speaker

Ms Qiao Mengke

Advisor

Dr Huang Ke Wei, Associate Professor, School of Computing

07 Jan 2021 Thursday, 03:00 PM to 04:30 PM

Zoom presentation

Join Zoom Meeting
https://nus-sg.zoom.us/j/81722568511?pwd=S0dCL0dNbWZLY1Fhamw5TC9RL3FXdz09
Meeting ID: 817 2256 8511
Password: 964896

Abstract:

As a result of advances in data mining, a growing number of empirical studies in the social sciences apply classification algorithms to construct independent and dependent variables, which are then analyzed with standard regression methods. Specifically, in the first stage, researchers apply a classification algorithm to construct a new categorical variable. In the second stage, this newly constructed categorical variable is used directly as an independent or dependent variable in a regression. We define this type of study as an "individual-level hybrid study". Alternatively, the new variable from the first classification stage can be aggregated (e.g. as a mean or sum) and then used as a variable in an aggregate-level regression. We define this type of study as an "aggregate-level hybrid study". In the classification phase of these studies, the standard procedure requires researchers to choose, subjectively, a classification performance metric to optimize. No matter which performance metric is chosen, the constructed variable still contains classification error because it cannot be classified perfectly. This misclassification leads to inconsistent estimates of the regression coefficients in the second phase, a problem documented in the econometrics literature as measurement error.

In the first study, we systematically investigate the theoretical foundation of the measurement error problem in individual-level hybrid studies. In these studies, individual-level measurement error can be directly observed and quantified on the labeled set, and this information can be used to correct the estimation inconsistency. Our theoretical analysis shows that consistent regression estimators can be recovered in all models studied in this paper. The main implication of this result is that researchers do not need to tune the classification algorithm to minimize the inconsistency of the estimated regression coefficients, because the inconsistency can be corrected by theoretical formulas even if the classification accuracy is poor. Instead, we propose that the classification algorithm be tuned to minimize the standard error of the focal regression coefficient, as derived from the corrected formula. As a result, researchers can obtain consistent estimators that are also the most precise in all models studied in this paper.
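
To illustrate the kind of correction described above, the following is a minimal sketch for the simplest possible case only: a single binary regressor whose misclassification is assumed to be independent of the outcome given the true class. It is not the formula derived in the thesis. Under that assumption, a standard identity says the naive slope equals the true slope multiplied by (PPV + NPV - 1), where PPV and NPV are the classifier's positive and negative predictive values, both of which can be estimated on a labeled subsample. The function name, parameter values, and simulated data are all hypothetical.

```python
import numpy as np


def simulate_and_correct(n=200_000, n_labeled=5_000, alpha=1.0, beta=2.0,
                         sensitivity=0.8, specificity=0.9, seed=0):
    """Simulate a misclassified binary regressor and apply a textbook correction.

    Assumption (not from the thesis): misclassification is independent of the
    outcome y given the true class, and there is only one regressor.
    """
    rng = np.random.default_rng(seed)

    # True binary variable (unobserved outside the labeled set).
    x_true = rng.random(n) < 0.4

    # First stage: a classifier with the given sensitivity and specificity.
    correct_pos = rng.random(n) < sensitivity
    correct_neg = rng.random(n) < specificity
    x_hat = np.where(x_true, correct_pos, ~correct_neg)

    # Outcome depends on the TRUE variable.
    y = alpha + beta * x_true + rng.normal(0.0, 1.0, n)

    # Second stage: with one binary regressor, the OLS slope equals the
    # difference in group means of y.
    naive_slope = y[x_hat].mean() - y[~x_hat].mean()

    # Labeled set: a random individual-level subsample where x_true is known.
    idx = rng.choice(n, size=n_labeled, replace=False)
    xt, xh = x_true[idx], x_hat[idx]
    ppv = xt[xh].mean()        # P(true = 1 | predicted = 1)
    npv = (~xt[~xh]).mean()    # P(true = 0 | predicted = 0)

    # Undo the attenuation factor (PPV + NPV - 1).
    corrected_slope = naive_slope / (ppv + npv - 1.0)
    return naive_slope, corrected_slope


if __name__ == "__main__":
    naive, corrected = simulate_and_correct()
    print(f"naive slope:     {naive:.3f}")
    print(f"corrected slope: {corrected:.3f}  (true beta = 2.0)")
```

In this sketch the naive slope is pulled toward zero, while the corrected slope lands close to the true coefficient even though the classifier is imperfect. That is the intuition behind correcting the estimate rather than tuning the classifier solely for accuracy.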

In the second study, we analyze the measurement error issue in aggregate-level hybrid studies. In these studies, the aggregated measurement error cannot be directly observed and quantified on the labeled set, because the labeled dataset is generally a random subsample of the whole dataset at the individual level, not at the aggregate level. Moreover, few studies have analyzed this problem, because it arises from the two-stage structure unique to hybrid studies. We therefore derive new solutions to correct the estimation inconsistency in aggregate-level hybrid studies. Our theoretical analysis shows that consistent regression estimators can be recovered in all cases studied in this paper. As in the first study, our theoretical result shows that the estimation inconsistency can be corrected by theoretical formulas, even if the classification performance is poor.