Optimal Classification Performance Metric for Constructing Variables for Further Regression Analysis
05 Oct 2018 Friday, 02:00 PM to 03:30 PM
COM2 Level 4
Executive Classroom, COM2-04-02
Examiners: Associate Professor Goh Khim Yong and Assistant Professor Phan Tuan Quang
Advances in text mining have led a growing number of empirical studies in the social sciences to apply text classification algorithms to construct independent and dependent variables for subsequent analysis with standard regression methods. In the classification phase of these studies, researchers must choose a classification performance metric, and this choice is subjective. Whichever metric is chosen, the constructed variable still contains error because the underlying observations cannot be classified perfectly. This misclassification biases the estimated regression coefficients in the subsequent phase. To the best of our knowledge, this study is the first to systematically investigate the theoretical relationship between the classification error of a newly constructed variable and the accuracy of the regression estimators when that variable is used as the dependent or independent variable in both linear and nonlinear regressions. Our theoretical analysis shows that consistent regression estimators can be recovered in all cases. In other words, researchers do not need to tune the classification algorithm to minimize the bias of the estimated regression coefficients, because that bias can be corrected with theoretical formulas. Instead, we propose that future studies in the social sciences tune the classification algorithm to minimize the standard errors of the focal regression coefficients computed from the corrected formulas.
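The abstract does not state the correction formulas themselves, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes the simplest textbook case: a binary independent variable produced by a classifier, a simple linear regression, and classification errors that are independent of the outcome given the true class. Under those assumptions the naive slope is shrunk by the factor PPV + NPV - 1 (positive plus negative predictive value, minus one), and dividing by that factor recovers the true slope. All names and parameter values below are hypothetical, and the predictive values are computed from the simulated true labels, whereas in practice they would have to be estimated, for example from a hand-labeled validation sample.

```python
import random


def simulate(n=100_000, alpha=1.0, beta=2.0, p=0.4, sens=0.85, spec=0.90, seed=0):
    """Draw (true class, classifier output, outcome) for n observations.

    Assumption: the classifier's errors depend only on the true class,
    not on the outcome y (non-differential misclassification).
    """
    rng = random.Random(seed)
    x_true, x_obs, y = [], [], []
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        if x == 1:
            x_hat = 1 if rng.random() < sens else 0   # true positive w.p. sens
        else:
            x_hat = 0 if rng.random() < spec else 1   # true negative w.p. spec
        x_true.append(x)
        x_obs.append(x_hat)
        y.append(alpha + beta * x + rng.gauss(0.0, 1.0))
    return x_true, x_obs, y


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def predictive_values(x_true, x_obs):
    """Return (PPV, NPV) of the classifier output against the true class."""
    pos = [t for t, o in zip(x_true, x_obs) if o == 1]
    neg = [t for t, o in zip(x_true, x_obs) if o == 0]
    ppv = sum(pos) / len(pos)          # P(true = 1 | predicted = 1)
    npv = 1.0 - sum(neg) / len(neg)    # P(true = 0 | predicted = 0)
    return ppv, npv


x_true, x_obs, y = simulate()
naive = ols_slope(x_obs, y)                  # regression on the classified variable
ppv, npv = predictive_values(x_true, x_obs)
corrected = naive / (ppv + npv - 1.0)        # undo the attenuation

print(f"slope using true classes:       {ols_slope(x_true, y):.3f}")
print(f"naive slope (classified input): {naive:.3f}")
print(f"corrected slope:                {corrected:.3f}")
```

With these hypothetical settings the population attenuation factor is 0.75, so the naive slope comes out at roughly three quarters of the true slope, while the corrected slope lands close to the true value. The sketch covers only the bias; it does not address the standard errors that the abstract proposes as the tuning target.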