Towards High Quality and Interpretable Healthcare Data Analytics

Ms Zheng Kaiping
Dr Ooi Beng Chin, Distinguished Professor, School of Computing

03 Dec 2019 Tuesday, 10:00 AM to 11:30 AM

Executive Classroom, COM2-04-02

In recent years, the increasing availability of Electronic Medical Records (EMR) has brought a vast array of promising opportunities to automate healthcare data analytics. Automation gradually reduces the need for traditional manual analytics, which relies on domain expertise, experience, and costly, painstakingly designed experiments. However, the complexity of EMR data and of EMR data analytics poses challenges to analytic performance, diminishing its potential and hence its usability in practice. It is therefore vital to resolve the challenges in both EMR data and EMR data analytics in order to boost performance and enable high-quality, interpretable healthcare data analytics that provide useful medical insights.

In this thesis, we study four main challenges in EMR data and EMR data analytics, namely irregularity, bias, lack of reliability and lack of interpretability, and propose a solution to each of them.

Firstly, we identify the irregularity challenge in EMR data and argue that it should be resolved at the feature level to reduce the loss of time information. We propose an adapted Gated Recurrent Unit (GRU) model that incorporates fine-grained, feature-level time span information. This model can differentiate between medical features by learning their decay parameters. Experimental results show that our proposed model effectively improves the accuracy of EMR data analytics.
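As a rough illustration of what feature-level time span information can look like, the sketch below computes, for every feature, the time elapsed since that feature was last observed, and turns it into a decay weight. The exponential decay form and the names `feature_time_spans`, `decay_weights`, `w` and `b` are assumptions made for this example (in the spirit of decay-based recurrent models); they are not taken from the thesis.

```python
import math

def feature_time_spans(timestamps, observed_mask):
    """For each step and feature, return the time since that feature was last observed.

    timestamps: increasing visit times, e.g. [0.0, 1.0, 4.0]
    observed_mask: one row per visit, 1 if the feature was recorded, else 0
    """
    n_features = len(observed_mask[0])
    last_seen = [None] * n_features
    spans = []
    for t, mask in zip(timestamps, observed_mask):
        row = []
        for d in range(n_features):
            # Span is measured against the previous observation of this feature.
            row.append(0.0 if last_seen[d] is None else t - last_seen[d])
            if mask[d]:
                last_seen[d] = t
        spans.append(row)
    return spans

def decay_weights(span_row, w, b):
    """Hypothetical per-feature decay: exp(-max(0, w_d * span_d + b_d)).

    w and b stand in for parameters that a model would learn per feature.
    """
    return [math.exp(-max(0.0, wd * s + bd)) for s, wd, bd in zip(span_row, w, b)]

timestamps = [0.0, 1.0, 4.0]
observed_mask = [[1, 1], [0, 1], [1, 0]]
spans = feature_time_spans(timestamps, observed_mask)
weights = decay_weights(spans[-1], w=[0.5, 0.1], b=[0.0, 0.0])
```

Here the first feature decays faster than the second (larger `w`), so after a long gap its last observed value would be trusted less. Giving each feature its own decay rate is the kind of differentiation the paragraph above describes.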

Secondly, we investigate the irregularity challenge in EMR data further and find that irregularity is only a phenomenon, whereas bias is the underlying cause. Hence, we formalize the bias challenge in EMR data and propose a general method to transform biased EMR time series into unbiased data. Our inference model takes into account two characteristics of medical features, Condition Change Rate and Observation Rate, which respectively represent the probability that a feature's actual condition changes from the past and the probability that a feature is observed if it is abnormal. Experimental results demonstrate that our proposed bias-resolving method not only imputes missing data more accurately but also boosts the performance of downstream data analytic applications.
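To make the two rates more concrete, the following sketch applies Bayes' rule to a single binary feature that was not observed at the current step. It is only an illustration under strong assumptions: the binary normal/abnormal state, the extra parameter `normal_observation_rate` (how often a normal feature is recorded) and the function name are all hypothetical and are not part of the thesis's inference model.

```python
def posterior_abnormal_given_unobserved(prev_abnormal, change_rate,
                                        observation_rate, normal_observation_rate):
    """Probability that a feature is abnormal now, given it was NOT observed.

    change_rate: probability that the condition changed since the previous step
    observation_rate: probability that the feature is observed if abnormal
    normal_observation_rate: ASSUMED probability that it is observed if normal
    """
    # Prior belief about the current condition, before looking at observation status.
    prior_abnormal = (1.0 - change_rate) if prev_abnormal else change_rate
    # Likelihood of "not observed" under each condition.
    miss_if_abnormal = 1.0 - observation_rate
    miss_if_normal = 1.0 - normal_observation_rate
    evidence = prior_abnormal * miss_if_abnormal + (1.0 - prior_abnormal) * miss_if_normal
    if evidence == 0.0:
        raise ValueError("a missing value is impossible under these rates")
    return prior_abnormal * miss_if_abnormal / evidence

p = posterior_abnormal_given_unobserved(
    prev_abnormal=False, change_rate=0.2,
    observation_rate=0.9, normal_observation_rate=0.3,
)
```

With these made-up numbers, the posterior falls below the prior of 0.2: because abnormal values are usually recorded, a missing value is itself evidence that the feature is probably normal. This is the sense in which the missingness pattern is biased rather than random.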

Thirdly, because EMR data analytics is a high-stakes application in which every patient must be considered equally important, it is crucial for the model to handle the samples it can predict well while relying on humans for the difficult ones. We treat this as a sample decomposition problem and propose to optimize a partial coverage model with a reject option as a solution. In particular, we devise a general two-level approach to optimize sample decomposition for healthcare applications, which re-weights the sample distribution in two ways: (i) selecting easy samples during training, and (ii) tuning the loss for each training sample based on its difficulty. We evaluate the proposed optimization approach on two real-world EMR datasets, and the experimental results show that it substantially outperforms the baselines in terms of the model's prediction performance on easy samples.
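The idea of a partial coverage model with a reject option can be sketched as follows for a binary classifier. Rejecting by a fixed confidence threshold is a common, simple choice that is assumed here for illustration; it is not the two-level optimization approach of the thesis, and the function names are hypothetical.

```python
def selective_predict(probs, threshold):
    """Predict a label when confident enough, otherwise reject (None).

    probs: predicted probabilities of the positive class
    threshold: minimum confidence required to accept a sample
    """
    decisions = []
    for p in probs:
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            decisions.append(1 if p >= 0.5 else 0)
        else:
            decisions.append(None)  # deferred to a human expert
    return decisions

def coverage_and_accuracy(decisions, labels):
    """Return the fraction of accepted samples and the accuracy on them."""
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, None
    accuracy = sum(1 for d, y in accepted if d == y) / len(accepted)
    return coverage, accuracy

probs = [0.95, 0.55, 0.10, 0.40]
labels = [1, 0, 0, 1]
decisions = selective_predict(probs, threshold=0.8)
coverage, accuracy = coverage_and_accuracy(decisions, labels)
```

In this toy example the two uncertain samples are rejected, so the model covers half of the samples and is evaluated only on those. Performance on the accepted ("easy") samples at a given coverage is the quantity the paragraph above refers to.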

Finally, for a high-stakes application, we need to improve the interpretability of analytic models. We propose to categorize feature importance into a global level and a local level, which provide general and time-specific explanations respectively. Accordingly, we model the global-level feature importance in one subnetwork with a feature-wise transformation mechanism and the local-level feature importance in another subnetwork with a self-attention mechanism. By training both subnetworks jointly, we aim to achieve accurate predictions and derive medically interpretable insights simultaneously. Experimental results, validated with doctors' assistance, confirm the model's effectiveness in terms of both prediction performance and interpretation capability.
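The two mechanisms named above can be illustrated in isolation with a minimal sketch. It assumes a per-feature affine map (scale and shift) as the feature-wise transformation and scaled dot-product scores as the self-attention. The function names and toy values are hypothetical, and the sketch does not reproduce the thesis's subnetworks or their joint training.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def featurewise_transform(x, scale, shift):
    """Global view: the same per-feature scale and shift at every time step."""
    return [g * v + b for v, g, b in zip(x, scale, shift)]

def attention_weights(query, keys):
    """Local view: scaled dot-product attention of one step over all steps."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

# Three visits with two features each (toy values).
visits = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
transformed = [featurewise_transform(v, scale=[2.0, 0.5], shift=[0.0, 0.0]) for v in visits]
local_weights = attention_weights(query=visits[-1], keys=visits)
```

Under these assumptions, a larger learned scale would mark a feature as generally more influential, while the attention weights show which time steps matter most for a particular step.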