Gene expression analysis in the presence of heterogeneity

Ms Abha Belorkar
Dr Wong Lim Soon, KITHCT Chair Professor, School of Computing

  20 Jun 2018 Wednesday, 01:30 PM to 03:00 PM

 Executive Classroom, COM2-04-02


Differential expression analysis is a popular approach for identifying genomic biomarkers that distinguish phenotype conditions. The identified biomarkers are then used to infer the biological mechanisms responsible for the phenotype differences. While methods for such analysis have evolved significantly over the last two decades, they cannot account for undeclared heterogeneity in the groups under comparison. Yet heterogeneity, of either biological or non-biological origin, is invariably observed in gene expression datasets. Delineating the basis of gene expression heterogeneity in relation to biological pathways is a difficult problem. Our work aims to address this challenge:

First, we propose Gene Fuzzy Scores (GFS), a normalization technique based on rank-fuzzification that retains meaningful variation in gene expression and attenuates obscuring noise. This is important for two reasons: (a) the quality of preprocessing heavily impacts the reliability of downstream gene expression analysis; and (b) popular normalization methods are reported to seldom enhance the quality of expression data. A comparison of GFS with other popular techniques (mean-scaling, quantile normalization, and z-score normalization) showed that the output of our normalization approach is more consistent and biologically coherent.
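To make the idea of rank-fuzzification concrete, the following is a minimal sketch of a rank-based fuzzy score for a single sample. It is an illustration under stated assumptions, not the GFS definition itself: the function name `fuzzy_rank_scores`, the two rank thresholds (`top` and `cutoff`), and the linear interpolation between them are all hypothetical choices made for this example.

```python
def fuzzy_rank_scores(expression, top=0.05, cutoff=0.15):
    """Map one sample's expression values to scores in [0, 1] by rank.

    Assumed scheme (illustrative only): genes in the top `top` fraction
    by rank score 1.0, genes ranked below the `cutoff` fraction score 0.0,
    and genes in between are linearly interpolated by rank.
    """
    if not 0 < top < cutoff <= 1:
        raise ValueError("thresholds must satisfy 0 < top < cutoff <= 1")
    n = len(expression)
    # Gene indices ordered from highest to lowest expression.
    order = sorted(range(n), key=lambda g: expression[g], reverse=True)
    scores = [0.0] * n
    for rank, gene in enumerate(order):
        # Fraction of genes ranked at or above this gene.
        frac = (rank + 1) / n
        if frac <= top:
            scores[gene] = 1.0
        elif frac <= cutoff:
            scores[gene] = (cutoff - frac) / (cutoff - top)
        # Genes ranked below the cutoff keep the default score of 0.0.
    return scores
```

Because the scores depend only on the ordering of genes within a sample, any monotone rescaling of that sample's values (for example, a change in overall intensity) leaves the scores unchanged, which is one reason a rank-based scheme of this kind can be robust to sample-level scaling effects.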

Second, we present SPSNet, a method for differential expression analysis of samples with potential heterogeneity. SPSNet reports a list of significant subnetworks (smaller components of biological pathways) whose expression reveals undeclared subpopulations within the given sample phenotypes. Current approaches to studying heterogeneity compare individual genes across phenotypes, and thus obscure a holistic view of the underlying biological mechanisms. In contrast, our approach reveals factors relevant to biological heterogeneity (e.g. disease subtypes, developmental stages) or non-biological heterogeneity (e.g. platform differences, batch effects) in the form of gene subnetworks, amplifies their effects in the data, and facilitates discrimination of subpopulations within phenotypes. Using publicly available gene expression datasets containing disease heterogeneity and batch effects, we show that SPSNet has a low false-positive rate, high sensitivity, and high biological coherence in analyzing heterogeneous gene expression data.

Finally, with the help of an illustrative case study, we demonstrate the potential of our methods for normalization and heterogeneity analysis, GFS and SPSNet, to analyze RNA-Seq datasets. We observe that data generated on RNA-Seq platforms, unlike microarray data, is subject to sampling stochasticity when sequencing depth is insufficient. This fact plays a critical role in the performance of methods that analyze RNA-Seq data. We present a Bernoulli trial-based model to explain sampling stochasticity, and propose the use of discretized GFS (D-GFS) to attenuate its effect. In our analyses, we also note that the silhouette score fails to accurately represent the degree of clustering in data characterized by high dispersion. In response, we suggest a simple and effective alternative for clustering assessment, based on a metric we define as the kNN score: the proportion of samples whose label matches the majority label of their k nearest neighbors.
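The kNN score as defined above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `knn_score`, the use of Euclidean distance, and the tie-breaking rule (ties in the majority vote go to the label of the nearer neighbor) are assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_score(points, labels, k=3):
    """Fraction of samples whose label equals the majority label
    among their k nearest neighbors (the sample itself is excluded).

    Assumptions for this sketch: Euclidean distance, and vote ties
    resolved in favor of the label seen first among the nearest neighbors.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    matches = 0
    for i, point in enumerate(points):
        # Indices of all other samples, nearest first.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(point, points[j]),
        )
        neighbor_labels = [labels[j] for j in others[:k]]
        majority_label, _ = Counter(neighbor_labels).most_common(1)[0]
        if majority_label == labels[i]:
            matches += 1
    return matches / len(points)
```

A score near 1 indicates that samples sit among neighbors with the same label, while a score near 0 indicates that labels are interleaved. With two classes, choosing an odd k avoids vote ties altogether.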