PH.D. DEFENCE - PUBLIC SEMINAR

Two Essays on Interpretable Predictive Models and Multicalibrated Survival Analysis for Healthcare

Speaker
Mr. Thiti Suttaket
Advisor
Dr Stanley Kok, Assistant Professor, School of Computing


30 Oct 2023 Monday, 02:00 PM to 03:30 PM

Zoom presentation

Abstract:

Information systems are pivotal in healthcare, enhancing care quality, reducing costs, and improving health outcomes. Machine learning, an important tool in information systems, has been instrumental in addressing a range of healthcare challenges such as health risk prediction, survival analysis, and medical concept embedding. Moreover, machine learning models have been deployed to tackle specific issues tied to healthcare data, which are often collected in the form of electronic health records (EHRs). These issues include de-identification for privacy preservation and data augmentation to support data-intensive models such as deep learning models.

The first essay of this thesis addresses the challenge of interpretability in machine learning models for health risk prediction. These tasks are often formulated as binary classification problems, in which we use information from EHRs to predict whether a clinical event will occur at a specific time point or within a fixed time frame. For this class of problems, deep learning has demonstrated state-of-the-art predictive performance. However, the black-box nature of deep learning models prevents both clinicians and patients from trusting them. Attention mechanisms are commonly employed to improve the transparency of deep learning models, but they only highlight important inputs without making clear how those inputs relate to one another. To address this drawback, I develop a novel model called Rational Multi-Layer Perceptrons (RMLP) that is constructed from weighted finite state automata. RMLP provides a better interpretation by linking relevant inputs at different timesteps into distinct sequences. RMLP can also be shown to be a generalization of a multi-layer perceptron to sequential, dynamic data. In sum, RMLP works on longitudinal time-series data and learns interpretable patterns. Empirical comparisons on six real-world clinical tasks demonstrate RMLP’s efficacy.

Formulating a predictive healthcare task as a simple binary classification problem has a shortcoming, however. The model can answer whether a medical event of interest (e.g., death) will occur at a predefined time point, but it cannot answer when the event will occur. Answering that question requires modeling patients' risks at all possible time points. This is the approach taken by survival analysis, the topic of the second essay of this thesis. Survival analysis models the relationship between an individual's covariates and the onset time of an event of interest. It is important for survival models to be “well-calibrated” (i.e., for their predicted probabilities to be close to ground-truth probabilities) because poorly calibrated models can lead to erroneous clinical decisions. Existing survival models are typically calibrated at the population level only, and thus run the risk of being poorly calibrated for one or more minority subpopulations. I propose a model called GRADUATE that achieves “multicalibration” by ensuring that all subpopulations are well-calibrated as well. GRADUATE frames multicalibration as a constrained optimization problem, and optimizes both calibration and discrimination during training to achieve a good balance between them. Empirical comparisons against state-of-the-art baselines on real-world clinical datasets demonstrate GRADUATE's efficacy. In a detailed analysis, I elucidate the shortcomings of the baselines vis-à-vis GRADUATE's strengths.