Learning Structured Representations of Visual Scenes
Abstract:
Recent advances in deep learning, large-scale data, and significantly more powerful computing have brought numerous breakthroughs in computer vision over the past few years. For instance, machines have achieved near-human, or even superhuman, performance on certain lower-level visual recognition tasks, including image classification, segmentation, and object detection. However, on higher-level vision tasks that require a more detailed understanding of visual content, such as visual question answering (VQA) and visual captioning (VC), machines still lag behind humans. This is partly because, unlike human beings, machines lack the ability to establish a comprehensive, structured understanding of the content on which reasoning can be performed. Specifically, higher-level vision tasks are usually overly simplified by operating models directly on images and are tackled by end-to-end neural networks without taking the compositional semantics of scenes into account. It has been shown that deep neural network-based models sometimes make serious mistakes by taking shortcuts learned from biased datasets. Moreover, the “black-box” nature of neural networks means their predictions are barely explainable, which is unfavorable for visual reasoning tasks like VQA. As intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairwise objects, have been shown not only to benefit compositional models in learning to reason along the structures but also to provide higher interpretability for model decisions. Nevertheless, these representations have received much less attention than traditional recognition tasks, leaving numerous open challenges (e.g., the predicate class imbalance problem) unsolved.
In this thesis, we study how to describe the content of an individual image or video with visual relationships as structured representations. A visual relationship between two objects (the subject and the object, respectively) is defined by a triplet of the form (subject,predicate,object), which includes the subject's and object's bounding boxes and category labels along with the predicate label. The triplet form of visual relationships naturally resembles how humans describe the interaction between two objects with a language sentence: a predicate (typically a verb or preposition) connects a subject to an object, e.g., “person is sitting on a chair” is represented by (person,sitting on,chair). To establish a holistic representation of a scene, a graph structure called a scene graph, constructed with visual objects as nodes and predicates as directed edges, is usually utilized to take object and relation contexts into account. For instance, “a person sitting on a chair is holding a glass” can be represented by (person,sitting on,chair) and (person,holding,glass), where “person” in the two visual relation triplets refers to the same entity.
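As a concrete illustration of these structures, the following Python sketch encodes relationship triplets and a small scene graph for the example above; the class names and bounding-box values are hypothetical and do not correspond to any implementation in the thesis.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    """A detected entity: a category label plus a bounding box (x1, y1, x2, y2)."""
    category: str
    bbox: Tuple[float, float, float, float]

@dataclass
class SceneGraph:
    """Objects are nodes; (subject_index, predicate, object_index) triplets are directed edges."""
    objects: List[SceneObject] = field(default_factory=list)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_relation(self, subj_idx: int, predicate: str, obj_idx: int) -> None:
        self.relations.append((subj_idx, predicate, obj_idx))

    def triplets(self) -> List[Tuple[str, str, str]]:
        """Return human-readable (subject, predicate, object) triplets."""
        return [(self.objects[s].category, p, self.objects[o].category)
                for s, p, o in self.relations]

# "Person sitting on a chair is holding a glass": the shared person node
# appears only once, so both edges refer to the same entity.
g = SceneGraph()
g.objects = [
    SceneObject("person", (10, 20, 110, 220)),   # node 0
    SceneObject("chair",  (30, 120, 160, 260)),  # node 1
    SceneObject("glass",  (90, 60, 110, 100)),   # node 2
]
g.add_relation(0, "sitting on", 1)
g.add_relation(0, "holding", 2)
print(g.triplets())  # [('person', 'sitting on', 'chair'), ('person', 'holding', 'glass')]
```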
In the first part of the thesis, we devote two chapters to learning structured representations of images with visual relationships and scene graphs, formulated as Visual Relationship Detection (VRD) and Scene Graph Generation (SGG), respectively. First, we delve into how to incorporate external knowledge to perform VRD. Inspired by the recent success of pretrained representations, we propose a Transformer-based multimodal model that recognizes visual relations with both visual and language commonsense knowledge learned via pretraining on large-scale corpora. The proposed model is also equipped with an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. These designs are shown to benefit VRD and help the model achieve competitive results on two challenging VRD datasets. Second, we rethink the role of the datasets' knowledge and argue that some of it is “bad” knowledge that introduces biases into visual relationship prediction and should be removed. Specifically, we tackle the critical data imbalance problem from the novel perspective of reporting bias, which arises from the datasets themselves and causes machines to prefer easy predicates, such as (person,on,chair) or (bird,in,room), over more informative ones, such as (person,sitting on,chair) or (bird,flying in,room). To remove this reporting bias, we develop a model-agnostic debiasing method that generates more informative scene graphs by taking into account the chances of predicate classes being labeled. We also shift the focus from VRD to SGG to generate holistic, graph-structured representations and leverage message passing networks to incorporate context. Extensive experiments show that our approach significantly alleviates the long-tail problem, achieves state-of-the-art SGG debiasing performance, and produces notably more fine-grained scene graphs.
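To give a flavor of frequency-aware debiasing, the following sketch adjusts raw predicate scores with a log-frequency prior so that rarely labeled but informative predicates (e.g., “sitting on”) are not drowned out by ubiquitous ones (e.g., “on”). This is only a generic illustration under assumed labeling frequencies, not the debiasing algorithm proposed in the thesis.

```python
import numpy as np

def debias_predicate_scores(logits: np.ndarray, label_freq: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Subtract a scaled log-frequency prior from raw predicate logits.

    logits:     (num_predicates,) raw scores from a relation classifier.
    label_freq: (num_predicates,) how often each predicate is labeled in the
                training data (a rough proxy for reporting bias).
    tau:        strength of the correction; tau = 0 disables debiasing.
    """
    prior = np.log(label_freq + 1e-12)
    adjusted = logits - tau * prior
    # Softmax over the adjusted scores.
    exp = np.exp(adjusted - adjusted.max())
    return exp / exp.sum()

# Toy example: "on" is labeled far more often than "sitting on".
logits = np.array([2.0, 1.8])    # scores for [on, sitting on]
freq   = np.array([0.70, 0.05])  # assumed labeling frequencies
print(debias_predicate_scores(logits, freq))  # probability mass shifts toward "sitting on"
```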
In the second part of the thesis, we extend the static-image VRD setting into the temporal domain and consider human-object interaction (HOI) detection, a special case of VRD in which the subjects of visual relationships are restricted to humans. Conventional HOI methods operating only on static images have been used to predict temporally related HOIs in videos; however, such models neglect temporal context and may deliver sub-optimal performance. Another related task, video visual relationship detection (VidVRD), is also not a suitable setting because i) VidVRD methods generally neglect human-related features, ii) video object detection remains challenging, and iii) action boundary labeling itself can be inconsistent. We thus propose to bridge these gaps by explicitly considering temporal information and adopting keyframe-based detection for video HOI detection. We also show that a naive temporal-aware variant of a common action detection baseline underperforms on video-based HOIs due to a feature-inconsistency issue. We then propose a novel neural network-based model utilizing temporal information such as human and object trajectories, frame-wise localized visual features, and spatial-temporal masked human pose features. Experiments show that our approach is not only a solid baseline on our proposed video HOI benchmark but also a competitive alternative on a popular video relationship detection benchmark.
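The following skeleton illustrates the keyframe-based formulation in its simplest form: sample keyframes, detect humans and objects on each keyframe, pool temporal features around it, and score every human-object pair. All callables here are hypothetical placeholders; this is a sketch of the general setting, not the proposed model.

```python
from typing import Callable, Dict, List

def detect_video_hois(
    frames: List,                       # decoded video frames
    keyframe_stride: int,
    detect: Callable,                   # placeholder: frame -> (human boxes, object boxes)
    pool_temporal_features: Callable,   # placeholder: (frames, t, box) -> feature vector
    score_interactions: Callable,       # placeholder: (human feat, object feat) -> interaction scores
) -> List[Dict]:
    """Illustrative keyframe-based video HOI detection skeleton."""
    results = []
    for t in range(0, len(frames), keyframe_stride):
        humans, objects = detect(frames[t])
        # Aggregate temporal context (e.g., trajectories, pose) in a window around keyframe t.
        human_feats = [pool_temporal_features(frames, t, h) for h in humans]
        object_feats = [pool_temporal_features(frames, t, o) for o in objects]
        for h_box, h_feat in zip(humans, human_feats):
            for o_box, o_feat in zip(objects, object_feats):
                results.append({
                    "keyframe": t,
                    "human_box": h_box,
                    "object_box": o_box,
                    "interaction_scores": score_interactions(h_feat, o_feat),
                })
    return results
```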
Overall, across these works, we explore how structured representations of visual scenes can be effectively constructed and learned in both static-image and video settings, with improvements resulting from incorporating external knowledge, bias-reducing mechanisms, and/or enhanced representation models. At the end of the thesis, we also discuss open challenges and limitations to shed light on future directions for structured representation learning of visual scenes.