PH.D. DEFENCE - PUBLIC SEMINAR

A Human-Centered Approach for Visual Understanding

Speaker
Mr. Shen Zhiqi
Advisor
Dr Mohan Kankanhalli, Provost's Chair Professor, School of Computing


31 Jul 2023 Monday, 02:30 PM to 04:00 PM

MR3, COM2-02-26

Abstract:

The field of deep learning-based artificial intelligence has grown rapidly in recent years, and computer vision, one of its crucial areas, has achieved remarkable success in various fields, including video surveillance, social service robots, and autonomous vehicles. While many applications have been developed and integrated into our daily lives, most algorithms simply learn from data without deep consideration of the underlying human perceptual processes, resulting in models that are often difficult to explain and whose performance is inadequate. We believe that humans are at the center of these artificial intelligence algorithms, and that the aforementioned problems can be significantly mitigated if we understand how humans perceive and understand visual content.

Our research focuses on conducting fundamental studies on two types of human visual perception - human visual sensitivity and human visual attention - and using the findings from these studies to guide computer model design. Our thesis outlines three works related to human visual perception, namely: 1) human visual sensitivity to images, 2) human visual attention to images, and 3) human visual sensitivity to videos.

The study of human visual sensitivity to images aims to investigate and understand how humans perceive differences in images. For example, an image is composed of various objects and regions, some of which may have been tampered with. To determine which tampered objects and regions are most easily identified by people, we perform a human-centered analysis of human visual sensitivity at three levels: 1) low-level object features, including object color, illumination, and texture; 2) mid-level object features, including object size, convexity, solidity, and complexity; and 3) high-level object attributes, including object sentiment and semantics. After quantifying the degree of human sensitivity to objects, we introduce a new concept - the human sensitivity map - to quantitatively measure human sensitivity to visual changes. We apply this sensitivity map to privacy protection by adding perturbation noise to images to prevent machines from learning sensitive information from the visual scene, while keeping the noise imperceptible to human observers.
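As a rough illustration of how a human sensitivity map could guide such privacy-preserving perturbation, the Python (PyTorch) sketch below scales adversarial noise inversely to human sensitivity, so larger perturbations are placed where humans are least likely to notice them. The names (classifier, target_label) and the gradient-sign update with step size alpha are illustrative assumptions, not the exact formulation used in the thesis.

    import torch

    def privacy_perturbation(image, sensitivity, classifier, target_label,
                             alpha=2/255, eps=8/255, steps=10):
        """Add noise that confuses the (hypothetical) classifier, weighted so that
        regions humans are sensitive to receive smaller perturbations.

        image:       (1, 3, H, W) tensor in [0, 1]
        sensitivity: (1, 1, H, W) human sensitivity map in [0, 1] (1 = most noticeable)
        """
        # Perturb least where human sensitivity is highest.
        weight = 1.0 - sensitivity
        delta = torch.zeros_like(image, requires_grad=True)

        for _ in range(steps):
            logits = classifier(image + delta)
            # Maximise the loss on the sensitive attribute so the machine
            # can no longer infer it from the perturbed image.
            loss = torch.nn.functional.cross_entropy(logits, target_label)
            loss.backward()
            with torch.no_grad():
                delta += alpha * weight * delta.grad.sign()  # sensitivity-weighted step
                delta.clamp_(-eps, eps)                      # bound the overall noise
                delta.grad.zero_()
        return (image + delta).clamp(0, 1).detach()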

We extend the analysis of human visual sensitivity to visual attention, investigating what factors make an image region attractive to humans. While most existing works strive to optimize the model structure to achieve better performance, we believe that high-level attributes, such as emotion-eliciting features, make important contributions and are more explainable because they are closer to human cognition. Based on these findings, we design a model with Atrous Spatial Pyramid Pooling (ASPP) and channel-weighting modules, which achieves state-of-the-art performance on four datasets.
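To make the two modules concrete, here is a minimal PyTorch sketch of an ASPP block and a squeeze-and-excitation-style channel-weighting module; the dilation rates, channel sizes, and reduction ratio are illustrative assumptions rather than the exact configuration reported in the thesis.

    import torch
    import torch.nn as nn

    class ASPP(nn.Module):
        """Atrous Spatial Pyramid Pooling: parallel dilated convolutions capture
        context at multiple scales, then their outputs are fused."""
        def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
                for r in rates
            ])
            self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

        def forward(self, x):
            feats = [branch(x) for branch in self.branches]
            return self.fuse(torch.cat(feats, dim=1))

    class ChannelWeighting(nn.Module):
        """Channel weighting: learn a per-channel importance score from globally
        pooled features and rescale the feature maps accordingly."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, x):
            w = self.fc(x.mean(dim=(2, 3)))           # global average pool -> weights
            return x * w.unsqueeze(-1).unsqueeze(-1)  # rescale each channel

A saliency head, for example a 1x1 convolution to a single-channel map, could then be stacked on top of these modules.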

Finally, we extend our research on human visual sensitivity to the video modality. While human visual sensitivity within each frame is similar to that in images, videos contain both spatial and temporal information, and because of the persistence of human vision, even tiny differences between consecutive frames become noticeable. We therefore study visual sensitivity to tampered videos, aiming to make the changes to videos as imperceptible as possible, and design a loss that addresses imperceptibility across both the spatial and temporal dimensions. Additionally, due to computational limitations, video frame sampling is typically used to select the most representative frames, so it is unknown in advance which frames will be selected; to address this, we apply the Markov Chain Monte Carlo (MCMC) technique. As with images, we apply our research to video privacy protection, preventing machines from learning sensitive information from videos.
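As a rough sketch of what a spatio-temporal imperceptibility constraint might look like (the plain L2 formulation and the weighting term lambda_t are assumptions for illustration, not the loss proposed in the thesis), one can penalise both how far each perturbed frame drifts from the original and how much the perturbation flickers between consecutive frames:

    import torch

    def imperceptibility_loss(original, perturbed, lambda_t=1.0):
        """original, perturbed: (T, C, H, W) video clips.

        Spatial term: per-frame distortion relative to the clean video.
        Temporal term: change of the perturbation between consecutive frames,
        which persistence of vision makes especially noticeable.
        """
        noise = perturbed - original
        spatial = noise.pow(2).mean()
        temporal = (noise[1:] - noise[:-1]).pow(2).mean()
        return spatial + lambda_t * temporal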

In summary, our thesis presents three studies that explore the characteristics of human visual perception in image and video tasks. We demonstrate that understanding human perception can help build more explainable computational models with better performance. Furthermore, these fundamental studies of human visual perception have great potential for new computer vision tasks, such as privacy-aware visual analytics and social advertising.