22 May 2018 Tuesday, 02:30 PM to 04:00 PM
COM2 Level 4
Executive Classroom, COM2-04-02
Multimedia content dominates today's Web. The explosive growth in the amount of multimedia content available online has created a need to filter, prioritize, and efficiently deliver relevant information to users. This thesis is concerned with multimedia recommendation: the task of predicting the rating or preference that a user would give to a multimedia item (e.g., a video, photo, or song).
First, we develop an effective scheme to predict the popularity of multimedia content before it is published on social networks; popularity is known to play an important role in recommender systems. It is demonstrated that popularity feedback (e.g., number of downloads, user ratings, number of views) can be a powerful determinant under data sparsity and in cold-start scenarios. Specifically, we present a novel transductive multi-modal learning approach to predict the popularity of multimedia content with multiple modalities. The approach takes modality relatedness and modality limitation into account by utilizing a common space shared by all modalities. Multimedia content of differing popularity can also be better separated in this optimal common space than in the space of any single modality.
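As a rough illustration of the common-space idea, the sketch below projects each modality's features into one shared space, averages the projections of whichever modalities an item has, and applies a linear popularity predictor on top. It is a minimal sketch under stated assumptions, not the thesis's method: the function names, the hand-picked projection matrices, the averaging rule, and the linear predictor are all hypothetical, and the transductive learning of the common space is not modeled.

```python
def project(matrix, features):
    """Multiply a (k x d) projection matrix by a length-d feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def common_representation(item, projections):
    """Average the common-space projections of the modalities the item has.

    A modality whose features are None is treated as missing and skipped
    (an assumed, simplified reading of "modality limitation").
    """
    projected = [project(projections[name], feats)
                 for name, feats in item.items() if feats is not None]
    if not projected:
        raise ValueError("item has no usable modality")
    k = len(projected[0])
    return [sum(p[i] for p in projected) / len(projected) for i in range(k)]


def predict_popularity(item, projections, weights, bias=0.0):
    """Linear popularity score computed on the common-space representation."""
    z = common_representation(item, projections)
    return bias + sum(w * v for w, v in zip(weights, z))


# Hypothetical, hand-picked parameters; in practice they would be learned.
PROJECTIONS = {
    "visual": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # 3-d visual -> 2-d common
    "text": [[0.5, 0.5], [0.0, 1.0]],              # 2-d text   -> 2-d common
}
WEIGHTS = [2.0, 1.0]

full_item = {"visual": [1.0, 2.0, 3.0], "text": [2.0, 4.0]}
text_only = {"visual": None, "text": [2.0, 4.0]}

print(predict_popularity(full_item, PROJECTIONS, WEIGHTS))  # 7.0
print(predict_popularity(text_only, PROJECTIONS, WEIGHTS))  # 10.0
```

Because every modality maps into the same space, an item with a missing modality can still be scored by the same predictor, which is the property the sketch is meant to show.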
Second, we explore in depth the implicitness of users' preferences with respect to different items as well as to the components within each item. Multimedia user-item interactions are by nature binary (1/0) implicit feedback (e.g., photo likes, video views, song downloads). We argue that there are two types of implicit feedback, item-level and component-level, both of which are usually neglected in conventional methods. To this end, we introduce an item-level and component-level attention model that assigns attentive weights for inferring the underlying user preferences encoded in implicit feedback. In particular, our attention model is a neural network consisting of two attention modules. The component-level attention module, which can start from any content feature extraction network (e.g., a CNN for images or videos), learns to select informative components of multimedia items. The item-level attention module learns to score item preferences.
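The two attention levels can be sketched as two softmax-weighted pooling steps: one over the components of a single item, and one over the items a user has interacted with. This is a minimal sketch under simplifying assumptions. In the thesis the attention modules are neural networks; here a plain dot product with a user vector stands in for the learned scoring function, and all names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine equal-length vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def component_attention(user_vec, components):
    """Pool one item's component features (e.g., image regions or video
    frames), weighting each component by its relevance to the user."""
    weights = softmax([dot(user_vec, c) for c in components])
    return weighted_sum(weights, components), weights


def item_attention(user_vec, item_vecs):
    """Pool the user's interacted items, weighting each item by how
    strongly it is assumed to reflect the user's preference."""
    weights = softmax([dot(user_vec, v) for v in item_vecs])
    return weighted_sum(weights, item_vecs), weights


# Hypothetical toy data: one user, two items with two components each.
user = [1.0, 0.0]
item_a = [[1.0, 0.0], [0.0, 1.0]]
item_b = [[0.2, 0.8], [0.1, 0.9]]

vec_a, comp_weights_a = component_attention(user, item_a)
vec_b, comp_weights_b = component_attention(user, item_b)
profile, item_weights = item_attention(user, [vec_a, vec_b])
```

In this toy setup the first component of `item_a` aligns with the user vector and so receives the larger component-level weight, and `item_a` in turn receives the larger item-level weight.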
Third, we tackle the problem of predicting the specific venue of user-generated multimedia content, with the goal of improving multimedia recommendation through location context. In particular, we aim to predict the exact venue where the user took the photo or video rather than a general venue category. Content information alone is insufficient for this task because of the high diversity of multimedia content. To alleviate the difficulty, an intuitive idea is to use a user's historical locations to restrict the candidate venues and to discover possible movement patterns. We therefore develop a generic embedding model based on matrix factorization that captures the interaction between visual content and temporal patterns.
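The candidate-restriction idea can be illustrated as follows: keep only venues near places the user has been before, then rank the survivors with embedding dot products. This is a minimal sketch, not the thesis's model. The radius filter, the additive content-plus-time score, the venue names, coordinates, and embeddings are all hypothetical, and the embeddings would in practice be learned (e.g., by matrix factorization) rather than written by hand.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def candidate_venues(venues, history, radius_km):
    """Keep venues within radius_km of any location in the user's history."""
    return [name for name, v in venues.items()
            if any(haversine_km(v["loc"], past) <= radius_km for past in history)]


def predict_venue(content_vec, time_vec, venues, history, radius_km=1.0):
    """Return the best-scoring candidate venue, or None if there is none."""
    candidates = candidate_venues(venues, history, radius_km)
    if not candidates:
        return None

    def score(name):
        emb = venues[name]["emb"]
        # Assumed scoring rule: content affinity plus temporal affinity.
        return dot(content_vec, emb) + dot(time_vec, emb)

    return max(candidates, key=score)


# Hypothetical venues with hand-written embeddings.
VENUES = {
    "cafe": {"loc": (1.2950, 103.7740), "emb": [1.0, 0.0]},
    "library": {"loc": (1.2966, 103.7732), "emb": [0.0, 1.0]},
    "airport": {"loc": (1.3644, 103.9915), "emb": [5.0, 5.0]},
}
HISTORY = [(1.2955, 103.7735)]  # places this user has posted from before

print(predict_venue([0.2, 0.9], [0.1, 0.3], VENUES, HISTORY))  # library
```

The far-away "airport" entry has the highest raw score but is never considered, which shows how the history filter narrows the prediction before content and time are compared.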
Extensive experiments have been conducted on several real-world datasets. The experimental results support the following key findings. First, utilizing multiple modalities of multimedia items improves the performance of popularity prediction. Second, it is important to take different levels of implicitness into consideration in multimedia recommendation. Third, exploring external knowledge beyond item content is helpful for identifying the exact location context.