PH.D. DEFENCE - PUBLIC SEMINAR

Solving Computer Vision Problems under the Compositionality Principle

Speaker
Mr. Xu Ziwei
Advisor
Dr Mohan Kankanhalli, Provost's Chair Professor, School of Computing


Tuesday, 11 Apr 2023, 2:00 PM to 3:30 PM

Executive Classroom, COM2-04-02

Abstract:

Finding representations for the rich visual world is a long-standing problem that intrigues researchers in cognitive sciences, linguistics, and computer vision. Among the different theories, a commonly acknowledged principle is that of compositionality: complex concepts are formed out of primitive concepts. Guided by this principle, this thesis presents solutions to various computer vision problems in images and videos. We discuss how compositionality can be used to relate different aspects of these problems and produce efficient and robust solutions.

The first problem is learning transformation-invariant representations of motion signals. We introduce the Motion Capsule Autoencoder (MCAE), which models a motion signal as a composition of identity and variation. Specifically, a novel capsule autoencoder design represents a motion signal as a set of transformation-invariant templates together with the corresponding geometric transformations. This yields a robust identity-variation composition, where the identity is the category of the motion signal and the variation captures the transformation-induced changes.
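To make the identity-variation decomposition concrete, here is a minimal sketch of the idea, not the thesis's MCAE: a 2D motion signal is explained as an affine transformation (the variation) applied to one of a few fixed templates (the identity). All names and the least-squares fitting procedure are illustrative assumptions.

```python
import numpy as np

def fit_affine(template, signal):
    """Least-squares affine transform mapping template points to signal points.

    template, signal: (T, 2) arrays of 2D coordinates over T timesteps.
    Returns (A, b, reconstruction_error).
    """
    T = template.shape[0]
    X = np.hstack([template, np.ones((T, 1))])      # (T, 3) homogeneous coords
    W, *_ = np.linalg.lstsq(X, signal, rcond=None)  # (3, 2): X @ W approximates signal
    A, b = W[:2].T, W[2]
    recon = X @ W
    return A, b, float(np.mean((recon - signal) ** 2))

def explain(signal, templates):
    """Pick the template (identity) whose affine fit (variation) reconstructs best."""
    fits = [fit_affine(t, signal) for t in templates]
    best = int(np.argmin([err for _, _, err in fits]))
    return best, fits[best]

# A circular template and its sheared, shifted copy: same identity, new variation.
t = np.linspace(0, 2 * np.pi, 50)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
line = np.stack([t, t], axis=1)
shear = np.array([[1.0, 0.5], [0.0, 1.0]])
signal = circle @ shear.T + np.array([2.0, -1.0])

identity, (A, b, err) = explain(signal, [circle, line])
print(identity, err)  # the circle template (index 0), near-zero error
```

Because the transformation is factored out explicitly, the identity stays the same however the signal is sheared or shifted, which is the transformation invariance the paragraph describes.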

The second problem is recognising objects described by adjective-noun pairs in images. This problem can naturally be viewed as a semantic composition: two primitive concepts combine into a complex one. The key challenge is that some concept pairs appear only at test time, so the system must minimise its bias toward pairs seen during training. Essentially, the system must learn to model the composition of pairs, rather than treating each pair as an individual class. To address this, we present the Blocked Message Passing Network (BMP-Net), which features an improved message passing mechanism. We show that this mechanism produces less biased predictions than prior work.
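The contrast between per-pair classifiers and composed representations can be sketched as follows. This toy example is not BMP-Net: the embeddings are hand-made one-hot vectors and the composition is a plain sum, chosen only to show that a pair never seen as a unit can still be scored from its primitives.

```python
import numpy as np

# Hand-made primitive embeddings (illustrative, deterministic).
adjectives = {"wet": np.array([1., 0., 0., 0.]), "old": np.array([0., 1., 0., 0.])}
nouns = {"dog": np.array([0., 0., 1., 0.]), "car": np.array([0., 0., 0., 1.])}

def compose(adj, noun):
    # Simplest possible composition: sum of primitive embeddings.
    return adjectives[adj] + nouns[noun]

def predict(image_feature, pairs):
    # Rank candidate pairs by similarity to the image feature.
    return max(pairs, key=lambda p: image_feature @ compose(*p))

# An image feature matching a pair that was never trained as a unit
# is still recognised, because the primitives themselves are known.
feature = compose("wet", "dog")
candidates = [("wet", "dog"), ("old", "dog"), ("wet", "car"), ("old", "car")]
print(predict(feature, candidates))  # ('wet', 'dog')
```

A system that instead learned one classifier per pair would have no weights at all for the unseen pair, which is exactly the bias toward seen pairs that the paragraph describes.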

The last problem is the temporal segmentation of human activity videos: we would like to discover and temporally locate a series of actions in a long video. A compositional view of this problem is that the activity depicted in the video is composed of a logically meaningful collection of actions. The challenge is that the dependencies between actions are complicated and not explicitly provided in data annotations. To solve this problem, we develop a differentiable temporal logic (DTL) framework that allows end-to-end training of deep learning models with constraints written as temporal logic formulae. We show that this framework improves the performance of various deep learning models on this task by reducing the logical errors in their outputs.
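The core idea of turning a temporal constraint into a training signal can be sketched in a few lines. The soft semantics below is an illustrative assumption, not the thesis's DTL framework: a formula like "action B may only occur after action A has occurred" becomes a smooth penalty on per-frame probabilities, so it can be added to a model's loss.

```python
import numpy as np

def after_constraint_loss(p_a, p_b):
    """Penalise frames where B is likely but A has not yet occurred.

    p_a, p_b: (T,) per-frame probabilities of actions A and B.
    """
    seen_a = np.maximum.accumulate(p_a)   # soft "A has occurred by frame t"
    violation = p_b * (1.0 - seen_a)      # B active while A is still absent
    return float(violation.mean())

# A prediction that respects the ordering (A then B) incurs little loss...
good_a = np.array([0.9, 0.9, 0.1, 0.1])
good_b = np.array([0.1, 0.1, 0.9, 0.9])
# ...while one that puts B before A is penalised heavily.
bad_a = np.array([0.1, 0.1, 0.9, 0.9])
bad_b = np.array([0.9, 0.9, 0.1, 0.1])

print(after_constraint_loss(good_a, good_b))  # 0.05
print(after_constraint_loss(bad_a, bad_b))    # 0.41
```

Because every operation is smooth enough to backpropagate through, minimising such a penalty alongside the usual segmentation loss steers the model away from logically impossible action orderings, which is the effect described above.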

In summary, this thesis presents solutions to various computer vision problems based on the compositionality principle. We conclude with an overview of the limitations of these solutions and suggestions for future research directions.