PH.D. DEFENCE - PUBLIC SEMINAR

Contextually Grounded Affective Analysis of Media

Speaker
Mr Devamanyu Hazarika
Advisor
Dr Roger Zimmermann, Professor, School of Computing


07 Dec 2021 Tuesday, 09:00 AM to 10:30 AM

Zoom presentation

Abstract:

Endowing machines with the capability to sense and express affect (emotions, sentiments, mood, personality, etc.) is a long-standing goal of Artificial Intelligence (AI). Over the past three decades, the field of Affective Computing has made significant progress in building both affect detectors and generators. However, most works have primarily explored affective learning from data in isolation. In contrast, human communication is multimodal and occurs largely through interactions (dialogs and conversations). Training computational affective models in a contextual environment, with the ability to leverage such heterogeneous information, is therefore an important problem.

In this thesis, we pursue this direction and explore the role of context in the affective analysis of user-generated media. We focus on conversational videos containing two major forms of context: conversation histories and multimodal information. Affective understanding of such resources presents multiple challenges, including modeling interpersonal emotional influences in conversations, leveraging heterogeneous and incongruent multimodal signals, building affective systems robust to sarcasm, and designing sample-efficient models for low-resource training. We study these challenges in detail and propose novel solutions for various affective tasks, including emotion, sentiment, and sarcasm analysis.

In the first part of the thesis, we delve into conversational context and introduce the task of Emotion Recognition in Conversations (ERC). ERC involves detecting the emotion of each utterance in a conversational video. Within this task, we focus on improving affective understanding by modeling speaker-specific conversational histories and interpersonal influences. We posit a theoretical framework that governs emotional dynamics in conversations and propose two models, CMN and ICON. These models achieve significant improvements over baselines on multiple benchmarks, demonstrating the importance of conversational modeling for affective tasks.
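
As an illustration of the general idea (not the actual CMN or ICON architectures), the following minimal PyTorch sketch conditions an utterance-level emotion classifier on the conversation history; the module names, feature dimensions, and number of emotion labels are illustrative assumptions.

import torch
import torch.nn as nn

class ContextualERC(nn.Module):
    """Toy emotion classifier that conditions on conversation history."""
    def __init__(self, utt_dim=100, ctx_dim=128, n_emotions=6):
        super().__init__()
        # GRU summarises the preceding utterances of the conversation.
        self.context_gru = nn.GRU(utt_dim, ctx_dim, batch_first=True)
        # Classifier sees the current utterance plus its conversational context.
        self.classifier = nn.Linear(utt_dim + ctx_dim, n_emotions)

    def forward(self, history, current):
        # history: (batch, n_previous_utterances, utt_dim) pre-extracted features
        # current: (batch, utt_dim) features of the utterance being classified
        _, ctx = self.context_gru(history)           # ctx: (1, batch, ctx_dim)
        fused = torch.cat([current, ctx.squeeze(0)], dim=-1)
        return self.classifier(fused)                # emotion logits

# Example: batch of 2 conversations, each with 5 context utterances.
model = ContextualERC()
logits = model(torch.randn(2, 5, 100), torch.randn(2, 100))

Speaker-specific memories and interpersonal influence modeling, as used in CMN and ICON, would replace the single shared GRU with per-speaker context encoders.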

In addition to improving performance on ERC, we also investigate means to improve the sample efficiency of these models via transfer learning. In particular, we propose an approach, TL-ERC, where we pre-train a hierarchical dialogue model on multi-turn conversations and then utilize its parameters to warm-start a conversational emotion classifier. Through several experiments, we find that knowledge acquired from resource-rich neural response generation can indeed help the resource-poor ERC task.
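
A minimal sketch of this warm-start step is shown below, assuming a context encoder architecture shared between the pre-trained dialogue model and the emotion classifier; the file name and encoder definition are illustrative stand-ins, not the released TL-ERC code.

import torch
import torch.nn as nn

# Hypothetical context encoder shared by the generative dialogue model and
# the emotion classifier; the actual TL-ERC architectures are richer.
def make_context_encoder():
    return nn.GRU(input_size=100, hidden_size=128, batch_first=True)

# Pre-training on resource-rich response generation would produce these
# weights; here an untrained encoder simply stands in for them.
pretrained = make_context_encoder()
torch.save(pretrained.state_dict(), "dialogue_encoder.pt")

# Warm-start: initialise the ERC context encoder from the saved weights,
# then fine-tune on the smaller labelled emotion corpus.
erc_encoder = make_context_encoder()
erc_encoder.load_state_dict(torch.load("dialogue_encoder.pt"))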

In the second part of the thesis, we explore multimodal context. First, we look into pragmatic aspects of multimodal conversations and study the task of automated sarcasm detection. Sarcasm is highly prevalent in opinionated text, and its detection is essential for building robust affective systems, such as sentiment analysis tools. While most works explore sarcasm through the lens of individual modalities, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. Towards this goal, we publicly release a new sarcasm dataset, the Multimodal Sarcasm Detection Dataset (MUStARD), consisting of audiovisual utterances from popular TV shows annotated with high-quality sarcasm labels.

Second, we focus on improving multimodal representations for affective tasks. While the literature primarily proposes complex fusion mechanisms, we take another route and explore the precursory step to fusion: multimodal representation learning. In particular, we propose a modality-invariant and -specific representation learning model named MISA. We evaluate MISA extensively on multiple affective tasks, such as multimodal sentiment analysis and multimodal humor detection. Results show significant improvements over state-of-the-art models, establishing MISA as an effective approach for learning multimodal representations.
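
The invariant/specific decomposition behind MISA can be sketched roughly as follows; this is a simplified illustration under assumed feature dimensions and loss forms, not the released implementation.

import torch
import torch.nn as nn

dim, hid = 128, 64
modalities = ("text", "audio", "video")

# One shared projector maps every modality into the invariant subspace;
# per-modality projectors capture modality-specific information.
shared = nn.Linear(dim, hid)
private = nn.ModuleDict({m: nn.Linear(dim, hid) for m in modalities})

feats = {m: torch.randn(8, dim) for m in modalities}   # pre-extracted features, batch of 8
inv = {m: shared(feats[m]) for m in modalities}        # modality-invariant representations
spec = {m: private[m](feats[m]) for m in modalities}   # modality-specific representations

# Similarity loss: invariant views of the same sample should agree across modalities.
sim_loss = sum(torch.dist(inv["text"], inv[m]) for m in ("audio", "video"))
# Difference loss: a soft orthogonality penalty keeps invariant and specific parts distinct.
diff_loss = sum(torch.norm(inv[m].t() @ spec[m]) ** 2 for m in modalities)
# The fused representation feeds the downstream affective predictor (sentiment, humor, etc.).
fused = torch.cat([torch.cat([inv[m], spec[m]], dim=-1) for m in modalities], dim=-1)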

Overall, this thesis proposes models and resources that leverage conversational and multimodal contexts to address affective computing applications. Across these works, we find that providing the right kind of inductive bias is crucial for modeling heterogeneous contextual signals, which ultimately improves performance on the respective tasks. We also highlight limitations, open challenges, and research trends, and discuss potential future directions for contextual affective computing.