Effective GPU-Based Similarity Search

Mr Guo Qi
Dr Anthony Tung Kum Hoe, Professor, School of Computing

  15 Nov 2019 Friday, 10:00 AM to 11:30 AM

 Executive Classroom, COM2-04-02


The data currently generated and collected increase fast not only in volume, but also in complexity, which brings great challenges to the field of data analysis. Similarity search has been acknowledged as one of the most important and useful query operators, and it plays a fundamental role of many applications. Thus, in this thesis, we investigate the problems concerning effective similarity search. We address the term "effective" from three perspectives, aiming to provide our solutions to modern similarity search challenges.

First of all, many optimizing strategies for similarity search tailored to special data types and similarity measures cannot be easily extended to support others. It is worth to have a generic solution to tackle the data complexity challenge utilizing the superior processing power of GPUs, which can reduce the burden on developing index structures and searching methods for specific data types. To this end, we propose GENIE, an efficient and parallelizable GPU-based Generic Inverted Index framework, which can support a wide variety of data types and similarity measures. The principal intuition is that we notice many data types can be transformed into a form by either the Locality Sensitive Hashing (LSH) scheme or the Shotgun and Assembly (SA) scheme, so that similarity search can be conducted by an inverted-index-like structure on GPUs in parallel.

Secondly, the effectiveness of the basic similarity search results is limited, largely because many redundantly similar elements are included. Extensive literatures have discussed the problem, and one of the most popular approaches is to consider result diversification. . However, many existing work focus on one type of diversity, which is defined based on the distance between two data points in the feature space. Due to the concentration phenomenon, effectiveness of distance-based diversity is not desirable, especially in high dimensional space. We, thus, propose a novel definition of diversity based on spatial angles, which helps to retrieve angular diverse neighbors of a given query. Because of the nature of the proposed diversity, these angular diverse neighbors can not only be close to the query but also encircle the query in different directions. Such properties could not be guaranteed by using the distance-based diversity.

Thirdly, the importance of distance metrics in similarity search has been proved through many applications. It is non-trivial to select or learn proper distance metrics or similarity measures, especially when considering that each individual user may have own demands. Existing metric learning approaches are mainly based on static information coming with the data, for example, class labels. But from human's experience, people could have different perceived similarity in mind. None of the literatures address the metric learning problem with consideration of user's individuality. Hence, we introduce the personalized metric learning framework and study a specific problem of finding perceived facial similarity over a new collected data set.

We also discuss several potential future research problems, including the construction and application of diverse nearest neighbor graph, and online personalized metric learning.