PH.D DEFENCE - PUBLIC SEMINAR

Novel Multiple Instance Learning Models for Digital Histopathology

Speaker
Mr Mustafa Umit Oner
Advisors
Dr Ken Sung Wing Kin, Professor, School of Computing
Dr Lee Hwee Kuan, Adjunct Associate Professor, School of Computing


15 Nov 2021 Monday, 02:00 PM to 03:30 PM

Zoom presentation

Abstract:

Histopathology is the gold standard in the clinic for cancer diagnosis and treatment planning. Recently, slide scanners have brought histopathology into the digital era: glass slides are digitized and stored as whole-slide images (WSIs). WSIs provide valuable data that powerful deep learning models can exploit. However, a WSI is a gigapixel image, too large for traditional deep learning models to process directly. In addition, deep learning models require large amounts of labeled data, yet most WSIs are either unannotated or annotated only with weak labels indicating slide-level properties, such as whether a slide is a tumor slide or a normal slide.

This thesis develops novel multiple instance learning (MIL) models that handle huge images and exploit weak labels to reveal fine-grained information within the images. MIL is a machine learning paradigm that learns the mapping between bags of instances and bag labels. We treat a WSI as a bag of small patches cropped over the WSI and use the WSI's weak label as the bag label. First, we developed a weakly supervised clustering framework. Given only weak labels indicating whether an image contains metastases, this framework successfully segmented breast cancer metastases in lymph node sections.
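The bag construction described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names `crop_patches` and `make_bag`, the non-overlapping grid, and the nested-list image are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[float]]  # toy stand-in for a WSI: rows of pixel values
Patch = List[List[float]]


def crop_patches(image: Image, patch_size: int) -> List[Patch]:
    """Crop non-overlapping square patches on a regular grid (assumed layout)."""
    height = len(image)
    width = len(image[0]) if height else 0
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def make_bag(image: Image, slide_label: int, patch_size: int) -> Tuple[List[Patch], int]:
    """Pair the bag of patches with the slide-level (weak) label.

    No patch receives its own label; the only supervision is slide_label.
    """
    return crop_patches(image, patch_size), slide_label


# Toy 4x6 "slide" labeled 1 (for example, tumor), cut into 2x2 patches.
toy_slide = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
bag, bag_label = make_bag(toy_slide, slide_label=1, patch_size=2)
```

A real WSI would be read region by region from disk rather than held in memory, but the bag-plus-weak-label structure stays the same.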

Second, one common component in all MIL methods is the MIL pooling filter, which obtains a bag-level representation from the extracted features of the instances. We introduced distribution-based pooling filters that obtain a bag-level representation by estimating marginal feature distributions. We formally proved that distribution-based pooling filters are more expressive than their point estimate-based counterparts (like 'max' and 'mean' pooling) in terms of the amount of information captured while obtaining bag-level representations. Moreover, we empirically showed that models with distribution-based pooling filters perform as well as or better than those with point estimate-based pooling filters on distinct real-world MIL tasks.
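To make the contrast concrete, here is a small sketch of the three pooling styles. The abstract does not say how the marginal distributions are estimated, so the Gaussian kernel density estimate on a fixed grid, the bandwidth `sigma`, the grid size `num_bins`, and the assumption that features lie in [0, 1] are all illustrative choices, not the thesis method.

```python
import math
from typing import List

Features = List[List[float]]  # one row per instance, one column per feature


def mean_pooling(features: Features) -> List[float]:
    """Point estimate: one number (the mean) per feature."""
    return [sum(col) / len(col) for col in zip(*features)]


def max_pooling(features: Features) -> List[float]:
    """Point estimate: one number (the maximum) per feature."""
    return [max(col) for col in zip(*features)]


def distribution_pooling(features: Features, num_bins: int = 5,
                         sigma: float = 0.1) -> List[List[float]]:
    """Estimate each feature's marginal distribution on a grid over [0, 1].

    Assumed estimator: Gaussian kernels centered on the instance values,
    evaluated at num_bins grid points and normalized to sum to 1.
    """
    grid = [i / (num_bins - 1) for i in range(num_bins)]
    pooled = []
    for col in zip(*features):
        density = [sum(math.exp(-0.5 * ((g - v) / sigma) ** 2) for v in col)
                   for g in grid]
        total = sum(density)
        pooled.append([d / total for d in density])
    return pooled


# Two toy bags with a single feature, sharing the same mean and maximum.
bag_a = [[0.0], [0.5], [1.0]]
bag_b = [[0.25], [0.25], [1.0]]
```

In this toy case `mean_pooling` and `max_pooling` return identical representations for `bag_a` and `bag_b`, while `distribution_pooling` separates them. That is only one example of the kind of information a point estimate discards, not a substitute for the formal proof.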

Third, we developed a MIL model with a distribution pooling filter that predicts tumor purity (the percentage of cancer cells within a tissue section) from digital histopathology slides. Accurate tumor purity estimation is crucial for sample selection to minimize normal cell contamination in genomic analysis. Our model successfully predicted tumor purity in eight TCGA cohorts and a local Singapore cohort. The predictions were highly consistent with genomic tumor purity values, which were inferred from genomic data and are accepted as accurate for downstream analysis. Furthermore, our model provided tumor purity maps showing the spatial variation within sections, which can help to better understand the tumor microenvironment.

Finally, we give a recipe for preparing machine learning datasets for digital histopathology tasks. We show that incorrect data segregation during dataset preparation leads to data leakage: the model gives deceptively good results on the test set that do not hold for a new patient walking into the clinic. We conclude that patient-level data segregation is necessary to avoid data leakage in digital histopathology tasks, because it ensures that each patient in the test set is like a new patient walking into the clinic.
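Patient-level segregation can be sketched in a few lines. This is an illustrative example only: the function name `patient_level_split`, the `(slide_id, patient_id)` input format, and the test fraction are assumptions, and the identifiers are made up.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def patient_level_split(slides: List[Tuple[str, str]],
                        test_fraction: float = 0.25,
                        seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split slides so that all slides of a patient land in the same set.

    slides holds (slide_id, patient_id) pairs. Patients, not slides, are
    shuffled and divided, so no patient contributes to both sets.
    """
    slides_by_patient: Dict[str, List[str]] = defaultdict(list)
    for slide_id, patient_id in slides:
        slides_by_patient[patient_id].append(slide_id)

    patients = sorted(slides_by_patient)       # fixed order before shuffling
    random.Random(seed).shuffle(patients)      # reproducible shuffle
    num_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:num_test])

    train = [s for p in patients if p not in test_patients
             for s in slides_by_patient[p]]
    test = [s for p in patients if p in test_patients
            for s in slides_by_patient[p]]
    return train, test


# Hypothetical data: four patients with two slides each.
toy_slides = [(f"slide_{p}_{i}", f"patient_{p}")
              for p in "ABCD" for i in range(2)]
train_slides, test_slides = patient_level_split(toy_slides)
```

Splitting at the slide or patch level instead would let slides from one patient appear on both sides of the split, which is the leakage scenario the abstract warns about.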