Modeling Dependencies with Mixture Models and Copulas

Mr. Siva Rajesh Kasa
Dr. Vaibhav Rajan, Assistant Professor, School of Computing

07 Mar 2023 Tuesday, 12:00 PM to 01:30 PM

Zoom presentation


Model-based clustering is a well-established paradigm for clustering multivariate data. The data is assumed to be generated by a finite mixture model in which each component represents a cluster. For continuous-valued variables, it is common to model each component density by a multivariate Gaussian distribution, leading to Gaussian Mixture Models (GMM). Traditionally, modelling cluster dependencies with GMMs has centered on Pearson correlation and the assumption of Gaussianity, both of which often do not hold in real-world data. In this work, we study the modeling of data when these assumptions are violated.
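To make the setup concrete, the following is a minimal sketch (not the speaker's code) of model-based clustering with a two-component one-dimensional Gaussian mixture fitted by standard EM; the simulated data, initial values, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters in 1-D (illustrative data).
x = np.concatenate([rng.normal(-5.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Initial guesses for mixture weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):  # EM iterations
    # E-step: posterior responsibility of each component for each point.
    dens = w * normal_pdf(x[:, None], mu, var)      # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

# Cluster assignment: the component with the highest responsibility.
labels = resp.argmax(axis=1)
```

Each point is then assigned to the component with the highest posterior responsibility, which is the clustering the abstract refers to.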

First, we discuss how misspecified Gaussian mixture models lead to inferior clusterings. We devise a new KL-divergence penalty term, based on the fitted components, to improve the clustering accuracy. Inference of this penalized objective is intractable using Expectation Maximization (EM); however, it can be done effortlessly using Automatic Differentiation (AD) based Gradient Descent (GD). We show how the AD-GD inference approach can be extended to high-dimensional data and to clusters that are not well separated. Through theoretical and empirical work, we show that this approach has favourable properties.
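The AD-GD idea can be illustrated by gradient descent on the (unpenalized, for brevity) GMM negative log-likelihood in an unconstrained parametrization. The sketch below is not the speaker's method: it omits the KL penalty, uses finite differences as a stand-in for a real AD tool such as PyTorch or JAX, and all data and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])

def neg_log_lik(params):
    # params = [logit_w, mu1, mu2, log_var1, log_var2]: an unconstrained
    # reparametrization so plain gradient descent respects the constraints
    # (weights sum to one, variances positive).
    w1 = 1.0 / (1.0 + np.exp(-params[0]))
    w = np.array([w1, 1.0 - w1])
    mu = params[1:3]
    var = np.exp(params[3:5])
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return -np.log(dens.sum(axis=1)).sum()

def num_grad(f, p, eps=1e-5):
    # Central finite differences as a stand-in for automatic differentiation.
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

params = np.array([0.0, -1.0, 1.0, 0.0, 0.0])  # illustrative initialization
lr = 1e-3
for _ in range(2000):  # plain gradient descent on the exact likelihood
    params -= lr * num_grad(neg_log_lik, params)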

Second, we discuss fitting a more flexible extension of GMM through the use of copulas. Copulas provide a modular parametrization of multivariate distributions that decouples the modeling of marginals from the dependencies between them. The Gaussian Mixture Copula Model (GMCM) is a highly flexible copula model that can capture many kinds of multi-modal, asymmetric dependencies. It has been effectively used in clustering non-Gaussian data and in reproducibility analysis of high-throughput genomic experiments. Parameter estimation for GMCM is challenging due to its intractable likelihood. We propose the use of Automatic Differentiation (AD) tools to develop a method, called AD-GMCM, that can maximize the exact GMCM likelihood. In our simulation studies and experiments with real data, AD-GMCM finds more accurate parameter estimates than the pseudo-EM (PEM) approach and yields better performance in clustering and reproducibility analysis. We also present a variant of GMCM for high-dimensional data, known as HD-GMCM, and discuss its theoretical properties and real-world applications.
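The decoupling that copulas provide can be illustrated with the probability integral transform: mapping each margin to the (0, 1) scale through its empirical CDF strips away the marginal distributions and leaves only the dependence structure. The sketch below is purely illustrative and is not the AD-GMCM estimator; the simulated latent correlation and the non-Gaussian marginal transforms are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Latent Gaussian dependence with correlation 0.8.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
# Impose non-Gaussian marginals: lognormal in one coordinate, cubed in the other.
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

def ecdf(v):
    # Empirical CDF ranks scaled into (0, 1): the probability integral transform.
    return (np.argsort(np.argsort(v)) + 1) / (len(v) + 1)

# Copula-scale data: marginals are (approximately) uniform, dependence is intact.
u = np.column_stack([ecdf(x[:, 0]), ecdf(x[:, 1])])

# Rank correlation on the copula scale reflects the latent dependence, even
# though correlations computed on x itself are distorted by the marginals.
rank_corr = np.corrcoef(u[:, 0], u[:, 1])[0, 1]
```

A copula model such as GMCM is then fit to `u` rather than to `x`, which is how the modeling of marginals is decoupled from the modeling of dependence.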