PH.D DEFENCE - PUBLIC SEMINAR

Relation Understanding in Videos

Speaker
Mr Shang Xindi
Advisor
Dr Chua Tat Seng, KITHCT Chair Professor, School of Computing


01 Sep 2021 Wednesday, 02:00 PM to 03:30 PM

Zoom presentation

Abstract:

Computational understanding of video content is in urgent demand in the era of big data. With the emergence of more complex applications, such as semantic video retrieval, multimedia question answering with inference-type questions, and various aspects of public safety, traditional techniques based on coarse-grained analytics, which rely on just objects and their co-locations, are no longer adequate. There has been increasing research attention on fine-grained visual analytics, ranging from basic object entities to detailed attributes, and from static images to dynamic videos. However, most research works still treat these fine-grained elements as an ensemble while overlooking the importance of the relations between them, which inevitably limits the upper-bound performance and explainability of such models. In fact, relation understanding is key to recognizing fine-grained events and activities within a video. In many AI and knowledge-based systems, a fact is represented as a relation between a subject entity and an object entity, i.e. a relation triplet, which forms the fundamental building block for high-level inference and decision-making tasks. Hence, relation understanding is also an important fundamental research problem towards developing computer vision algorithms that can truly understand what they "see".

To this end, this thesis studies a novel task of video visual relation detection (VidVRD). The task aims to recognize visual relations, in the form of ⟨subject, predicate, object⟩ triplets, for each pair of entities detected in a video. In particular, the thesis focuses on spatial and verb relations such as "A in front of B" and "A chase B", respectively. Such relations are typical of the visual domain, yet challenging to detect due to the sparsity and large variance in relation representation, which mainly stem from the huge number of possible entity and relationship combinations, as well as the visual variation in videos, from blurring to deformation. To initiate research on VidVRD, the thesis first proposes a general pipeline framework to tackle the task and provides various baseline analyses.
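For illustration only, the following minimal Python sketch shows what a VidVRD output could look like: entity trajectories (tracked bounding boxes with categories) paired into scored relation triplets. All names and structures here are assumptions for exposition, not the thesis's actual interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A 2-D bounding box (left, top, right, bottom) in pixel coordinates.
BBox = Tuple[float, float, float, float]

@dataclass
class EntityTrajectory:
    """An object entity tracked over a span of video frames."""
    category: str            # e.g. "person", "dog"
    boxes: Dict[int, BBox]   # frame index -> bounding box

@dataclass
class RelationInstance:
    """One detected visual relation triplet, e.g. <person, chase, dog>."""
    subject: EntityTrajectory
    predicate: str           # spatial or verb relation, e.g. "in_front_of", "chase"
    obj: EntityTrajectory
    score: float             # detection confidence

def detect_relations(video_path: str) -> List[RelationInstance]:
    """Placeholder: a concrete VidVRD model would map a video to triplet instances."""
    raise NotImplementedError
```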

However, the pipeline approach treats the VidVRD task as several independent sub-tasks. Such separation blocks the flow of information between the different sub-models, leading to redundant representations, since the sub-tasks cannot share a common set of task-specific features. Therefore, the thesis further connects the sub-models in an end-to-end manner by proposing a 3-D relation proposal module that serves as a critical bridge for relation feature learning. The idea is incorporated into a novel deep neural network architecture that learns deep relation representations from spatio-temporal, multi-modal cues.
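Very roughly, one can picture a proposal stage that fuses the features of a candidate subject-object pair and feeds the shared representation to downstream heads. The PyTorch sketch below is a hypothetical illustration of that general idea; the module name, dimensions, and heads are assumptions and do not reproduce the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class RelationProposal3D(nn.Module):
    """Illustrative 3-D (spatio-temporal) relation proposal module: pairs
    subject and object tubelet features into one shared relation representation."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 512, num_predicates: int = 100):
        super().__init__()
        # Fuse the spatio-temporal features of a candidate subject-object pair.
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
        )
        # Score whether the pair is a plausible relation proposal at all.
        self.relatedness = nn.Linear(hidden_dim, 1)
        # Classify the predicate from the same shared representation
        # (num_predicates is dataset-dependent).
        self.predicate_head = nn.Linear(hidden_dim, num_predicates)

    def forward(self, subj_feat: torch.Tensor, obj_feat: torch.Tensor):
        pair = self.fuse(torch.cat([subj_feat, obj_feat], dim=-1))
        return self.relatedness(pair), self.predicate_head(pair)
```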

Further, the thesis investigates the core problem of VidVRD, namely the accurate classification of relation triplets. Prior approaches classify the three relation components in either an independent or a cascaded manner, and thus fail to fully exploit the inter-dependencies among them. To exploit these inter-dependencies in tackling the challenges of visual relation understanding, a novel iterative relation inference approach is proposed, which is lightweight yet effective. Correspondingly, a training approach is proposed to better learn the dependency knowledge from likely correct triplet combinations. As such, the proposed inference approach can gradually refine each component based on the learned dependency knowledge and the latent predictions of the other two components. Ablation studies show that this iterative relation inference empirically converges within a few steps and effectively boosts performance over prior approaches.
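As a purely illustrative sketch (not the thesis's algorithm), the iterative refinement described above might look like the following, where the classifier logits for subject, predicate, and object are repeatedly updated from each other's soft predictions. The refine_* arguments stand in for small learned modules whose existence and form are assumptions here.

```python
import torch
import torch.nn.functional as F

def iterative_triplet_inference(subj_logits, pred_logits, obj_logits,
                                refine_subj, refine_pred, refine_obj,
                                num_steps: int = 3):
    """Gradually refine each triplet component from the latent predictions
    of the other two components (hypothetical sketch)."""
    s, p, o = subj_logits, pred_logits, obj_logits
    for _ in range(num_steps):
        # Soft (latent) predictions of each component at the current step.
        s_prob = F.softmax(s, dim=-1)
        p_prob = F.softmax(p, dim=-1)
        o_prob = F.softmax(o, dim=-1)
        # Update every component in parallel, conditioned on the other two.
        s, p, o = (s + refine_subj(p_prob, o_prob),
                   p + refine_pred(s_prob, o_prob),
                   o + refine_obj(s_prob, p_prob))
    return s, p, o
```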

Meanwhile, the thesis also tackles the efficiency of constructing large-scale video datasets with object entity and relation annotations, and contributes two VidVRD benchmark datasets based on user-generated videos. With these two datasets, extensive experiments are conducted to demonstrate the effectiveness of the proposed approaches and to unveil future challenges and directions for relation understanding in videos.