Transfer learning for robust predictions in computational genomics
Dr Niranjan Nagarajan, Associate Professor, School of Computing
The use of machine learning on large and complex biological datasets has been instrumental in deriving meaningful insights from biological experiments. Even with constant advances in high-throughput experimental strategies, biological experiments are frequently constrained by sample availability (e.g., human tumor tissue), high experimental cost, and intrinsic biological and experimental variability (e.g., sequencing errors). Biological datasets therefore frequently pose unique challenges to machine learning: differences in sample type and experimental setup, small sample sizes from complex experiments, and distribution shifts (e.g., in vitro to in vivo). This thesis focuses on transfer learning techniques, which aim to improve the performance of data-limited target tasks by utilizing data-rich source tasks. Three case studies present the challenges and motivations, and introduce novel methods in computational genomics that demonstrate the utility of transfer learning.
The first case study focuses on predicting cancer drug responses from in vitro omics data. Because of the multitude of drugs relative to the available in vitro models, some drugs have been tested on only a limited number of samples. Combined with the high dimensionality of the data, this can undermine the generalizability of models trained for individual drugs. Multi-task learning (MTL) has been used to address this problem with joint models that predict responses across tasks, i.e., drugs. However, MTL performance can be inferior to that of single-drug models (i.e., negative transfer). We present TUGDA, a method that estimates task uncertainties and uses them to weight task-to-feature transfers. Compared to existing MTL methods, TUGDA consistently reduced negative transfer on a large in vitro dataset. We also observed the fewest negative-transfer cases for drugs with limited data, demonstrating TUGDA's usefulness in pharmacogenomics.
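The idea of down-weighting unreliable tasks can be sketched with a simple uncertainty-weighted loss: each task's loss is scaled by a learned precision term so that high-uncertainty tasks contribute less to the shared feature representation. This is a minimal illustrative sketch of the general uncertainty-weighting scheme; the function name and formulation here are assumptions for illustration, and TUGDA's exact objective may differ.

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses, down-weighting high-uncertainty tasks.

    Each task's loss is scaled by exp(-log_var) (i.e., 1/sigma^2), and
    log_var is added as a penalty so the model cannot trivially inflate
    all uncertainties. Noisy tasks thus pass weaker gradients to the
    shared features. (Illustrative only; not TUGDA's exact objective.)
    """
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    precisions = np.exp(-log_vars)  # 1 / sigma^2 per task
    return float(np.sum(precisions * task_losses + log_vars))

# Two tasks with equal loss: raising the second task's uncertainty
# (log_var = 2) shrinks its effective loss weight from 1.0 to exp(-2).
confident = uncertainty_weighted_loss([1.0, 1.0], [0.0, 0.0])
uncertain = uncertainty_weighted_loss([1.0, 1.0], [0.0, 2.0])
```

In a joint model, `log_vars` would be learned jointly with the network weights, so data-poor or noisy drugs are automatically given smaller task-to-feature influence.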
The second case study analyzes in vivo cancer drug response prediction. The key limitation in this setting is the scarcity of in vivo drug response data. While in vitro datasets are frequently used to learn predictive models of cancer drug response, such models do not fully reflect in vivo drug response due to the tumor microenvironment, immune response, and other patient health factors. To overcome this limitation, prior works either assumed that batch effects were the main source of differences to correct between domains, or relied on assumptions that could lead to negative transfer. We extended TUGDA to generalize cancer drug response prediction from in vitro to in vivo. In contrast to existing methods, TUGDA adopts more realistic assumptions, penalizing noisy features and presuming that they are not conserved across domains. Comprehensive analysis showed that TUGDA outperformed several existing methods on a number of metrics, demonstrating its utility in transferring knowledge from in vitro to in vivo drug response prediction.
The third case study examines taxonomic classification, where sequencing reads from unknown organisms are compared against genome databases to identify the microbial species in a metagenomic sample. We focused on long-read sequencing technologies, which offer advantages in accessibility and read length over existing next-generation sequencing technologies. The key challenges, however, are the moderate-to-high sequencing error rate and the limited number of microbial species with long-read data. We present MetageNN, a neural-network taxonomic classifier for long reads that is robust to sequencing errors and missing genomes. MetageNN overcomes the lack of long-read data by training on k-mer profiles of sequences drawn from a large genome database. The model relies on short k-mer profiles, which are known to be less affected by sequencing errors, to reduce the distribution gap between clean genome sequences and noisy long reads. MetageNN outperforms existing deep-learning tools on real long-read datasets. Compared to conventional tools, MetageNN offers a practical trade-off: better sensitivity, particularly for novel species, a smaller memory footprint than k-mer-matching tools, and faster prediction than read-mapping tools.
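The k-mer profile featurization described above can be sketched as follows: a sequence is mapped to a normalized frequency vector over all k-mers, and because short k-mers span few bases, a single sequencing error perturbs only a handful of the counts. The function below is a generic illustration under that assumption; MetageNN's actual k-mer size and preprocessing may differ.

```python
from itertools import product

def kmer_profile(seq, k=3):
    """Normalized k-mer frequency vector (length 4^k) for a DNA string.

    Short k-mers are less perturbed by the insertion/deletion errors
    typical of long reads, which is what lets profiles like this bridge
    clean reference genomes and noisy reads. (Illustrative sketch;
    not MetageNN's exact featurization.)
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:  # skip windows containing ambiguous bases (e.g., N)
            counts[index[km]] += 1
    total = sum(counts) or 1  # avoid division by zero for short/empty input
    return [c / total for c in counts]

# 6 windows over "ACGTACGT": ACG, CGT, GTA, TAC, ACG, CGT
profile = kmer_profile("ACGTACGT", k=3)
```

Such fixed-length vectors can be fed to a standard feed-forward classifier trained on database genomes, sidestepping the scarcity of labeled long reads.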
Taken together, this thesis demonstrates the broad utility of transfer learning for challenges posed by biological complexity. We explored biological and experimental variability in settings ranging from cancer drug response prediction to microbial taxonomic classification. These findings illustrate the potential of the novel methods introduced here, and of the paradigms they employ, to address unique challenges in computational genomics datasets.