Incorporating User Activity Data for Improved User Entity Resolution

Mr Sapumal Ahangama
A/P Poo Chiang Choon, Danny, Associate Professor, School of Computing

  19 Sep 2019 Thursday, 02:00 PM to 03:30 PM

 MR3, COM2-02-26


Widespread adoption of information systems has created vast amounts of digital data traces about users, their relationships and their personal activities. Interconnecting a user's digital data traces across different information systems is an open problem, commonly known as user entity resolution. Prior research appears to have paid inadequate attention to incorporating user activity data into user entity resolution. This thesis therefore develops methods that incorporate user activity data into user entity resolution to improve its accuracy.

Prior research on user entity resolution has proceeded in two directions. The first is approximate matching, or blocking, which aims to reduce the search space. The second is finer-grained user matching, which identifies the top-K user matches and thereby further filters the blocks created. Acknowledging both the limitations and the differentiating power of user activity data, this thesis presents methods that incorporate user activity data effectively in both directions.

The first part of the thesis presents an approximate matching method for text generated by users in related domains. In the proposed approach, text data is first mapped to a latent space using topic modelling, and different localities within the latent space are then searched using locality-sensitive hashing to generate the user blocks. The hash boundaries are relaxed iteratively to account for cross-domain differences. The results exceed those of state-of-the-art baseline methods, which validates the proposed approach of blocking users in the latent space. The proposed method reduces the search space by approximately 80% for 90% of the users in a highly homogeneous cross-domain dataset.
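
The blocking idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that topic vectors have already been produced by some topic model, uses random-hyperplane hashing as one possible form of locality-sensitive hashing, and interprets "relaxing the hash boundaries" as widening the allowed Hamming distance between signatures. The names `make_hyperplanes`, `signature`, `block_for` and the parameter `min_size` are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random hyperplanes; each one contributes one bit of the signature."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """Hash a topic vector to a bit tuple: which side of each hyperplane it lies on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def block_for(query_vec, candidates, planes, min_size=2):
    """Return (block, radius) for one user from the source domain.

    candidates maps user id -> topic vector in the other domain.
    Assumption: relaxation means increasing the allowed Hamming radius
    until the block holds at least min_size candidates.
    """
    q_sig = signature(query_vec, planes)
    sigs = {uid: signature(vec, planes) for uid, vec in candidates.items()}
    for radius in range(len(planes) + 1):
        block = [uid for uid, s in sigs.items() if hamming(q_sig, s) <= radius]
        if len(block) >= min_size:
            return block, radius
    # Fewer than min_size candidates exist: return all of them.
    return list(sigs), len(planes)
```

For example, `block_for(query, candidates, make_hyperplanes(8, 3), min_size=2)` returns the smallest relaxed neighbourhood of `query` that contains at least two candidate users, together with the radius that was needed.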

The second part of the thesis develops a cross-domain transfer model for user activity data based on deep learning. The intuition is that a user's activity data in one domain can be converted by the proposed model into an estimated representation in a secondary domain. This secondary-domain representation can then be used to search and further filter the blocks generated by approximate matching methods in order to identify the user. The study presents four variants of the model: one for an incomplete view in the secondary domain, one for the cold-start scenario in the secondary domain, one that incorporates auxiliary information, and one that incorporates the first-order network proximity of the users.
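
The filtering step described above can be sketched as follows. This is only an illustration: the deep transfer model itself is not shown, and the sketch assumes that it has already produced an estimated secondary-domain vector for the user. Cosine similarity is an assumed scoring function, and the names `cosine` and `top_k_matches` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(estimated_vec, block, k=3):
    """Rank the candidates in a block against an estimated representation.

    estimated_vec: assumed output of a transfer model for one source-domain user.
    block: dict mapping candidate user id -> observed secondary-domain vector.
    Returns the ids of the k most similar candidates, best first.
    """
    scored = [(cosine(estimated_vec, vec), uid) for uid, vec in block.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [uid for _, uid in scored[:k]]
```

Here the block would come from an approximate matching step, so the ranking only has to consider a small fraction of the users in the secondary domain.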

In the evaluation, the knowledge transfer capabilities of the models are first validated on a recommendation problem. Once validated, the model is applied to further filter the blocks generated by approximate matching methods. The results indicate that the proposed approach is better at identifying the top-K most probable matches within these blocks, with results far exceeding the state-of-the-art baselines.