PH.D. DEFENCE - PUBLIC SEMINAR

Towards Learning Scene Semantics on 3D Point Clouds

Speaker
Ms Zhao Na
Advisor
Dr Chua Tat-Seng, KITHCT Chair Professor, School of Computing


03 Mar 2021 Wednesday, 04:00 PM to 05:30 PM

Zoom presentation

Abstract:

3D scene understanding is essential for intelligent agents (e.g., domestic robots and autonomous vehicles) to progress towards human-level intelligence. Recent advances in 3D sensors (e.g., RGB-D cameras and LiDARs) have enabled intelligent agents to easily sense their surrounding 3D scenes and acquire 3D point clouds of those scenes. However, raw 3D point clouds are of limited use for understanding 3D scenes, since they lack structure and semantics. Hence, there is a strong need to learn scene semantics from 3D point clouds. In particular, the abilities to learn object-level and point-level semantics are vital to understanding point clouds of 3D scenes. More specifically, learning object-level semantics refers to predicting the semantic category and oriented 3D bounding box of every object of interest in a 3D point cloud, a task known as 3D object detection. Learning point-level semantics, in turn, refers to automatically assigning a semantic class to each point, a task known as 3D semantic segmentation.
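
To make these two task definitions concrete, the following minimal Python sketch shows the kind of inputs and outputs involved. The array shapes and field names here are illustrative assumptions, not the data format used in the thesis.

import numpy as np

num_points = 2048
points = np.random.rand(num_points, 3)      # raw point cloud: one (x, y, z) coordinate per point

# 3D semantic segmentation: assign one semantic class to every point.
num_classes = 13
point_labels = np.random.randint(0, num_classes, size=num_points)

# 3D object detection: predict a semantic category plus an oriented 3D
# bounding box (center, size, heading angle) for each object of interest.
detection = {
    "category": "chair",                    # semantic category
    "center": np.array([1.2, 0.4, 0.5]),    # box center (x, y, z)
    "size": np.array([0.6, 0.6, 1.0]),      # box extents (width, length, height)
    "heading": 0.3,                         # rotation about the vertical axis, in radians
}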

In this thesis, we focus on these two fundamental and challenging tasks, i.e., 3D object detection and 3D semantic segmentation, towards learning scene semantics on 3D point clouds. We utilize deep learning techniques and develop deep models to infer meaningful semantics from 3D data. In developing deep models for these two tasks, we investigate a series of learning settings, including fully-supervised learning, few-shot learning, semi-supervised learning, and incremental learning. These settings evolve from the standard learning setting to more practical ones that take various real-world scenarios into account when defining problems and designing models.

This thesis begins with the task of 3D semantic segmentation. First, we focus on fully-supervised learning for 3D semantic segmentation, where we design a permutation-invariant deep neural network that incorporates both local and global context of points to improve segmentation performance. Deep models have undoubtedly achieved significant performance improvements on 3D semantic segmentation with massive annotated datasets. However, their performance is largely hampered when learning new classes from scarce examples, which limits their practical usage. In view of this limitation, this thesis introduces few-shot learning for 3D semantic segmentation. The goal of few-shot 3D semantic segmentation is to segment novel classes given only a few annotated samples for training. We propose a novel attention-aware multi-prototype transductive inference method to solve this challenging problem.
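
As a rough illustration of the permutation-invariance idea, the sketch below (in PyTorch) applies a shared per-point MLP and a symmetric max-pooling operation to fuse local per-point features with a global scene feature, in the spirit of PointNet-style segmentation networks. The layer sizes are assumptions; the network designed in the thesis is considerably more elaborate.

import torch
import torch.nn as nn

class PointSegNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # A shared MLP applies the same weights to every point, so the
        # per-point features do not depend on the ordering of the input.
        self.local_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        # Per-point classifier over concatenated local + global features.
        self.head = nn.Sequential(
            nn.Linear(128 + 128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3)
        local_feat = self.local_mlp(points)                         # (B, N, 128)
        # Max-pooling is a symmetric function, so the resulting global
        # context feature is also permutation-invariant.
        global_feat = local_feat.max(dim=1, keepdim=True).values    # (B, 1, 128)
        global_feat = global_feat.expand(-1, points.shape[1], -1)   # (B, N, 128)
        fused = torch.cat([local_feat, global_feat], dim=-1)        # (B, N, 256)
        return self.head(fused)                                     # per-point class logits

logits = PointSegNet(num_classes=13)(torch.rand(2, 1024, 3))

In the few-shot setting, such per-point embeddings can then support prototype-based labeling: class prototypes are derived from the few annotated support points, and each query point is assigned to its most similar prototype. The attention-aware multi-prototype transductive inference method proposed in the thesis refines this basic scheme.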

Subsequently, this thesis investigates the task of 3D object detection. Despite the remarkable performance achieved by deep-learning-based approaches to this task, two major issues hinder the deployment of existing approaches in real-world scenarios: 1) their performance relies heavily on large-scale, high-quality 3D annotations, which are often tedious and expensive to collect; and 2) they suffer a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting past data. To mitigate the first issue of data hunger, this thesis studies semi-supervised 3D object detection, which requires only a few labeled samples, and proposes a self-ensembling semi-supervised 3D object detection method that achieves impressive detection results. Furthermore, to address the second issue of continual learning, this thesis investigates the unexplored class-incremental 3D object detection problem and presents a novel static-dynamic co-teaching method that largely preserves detection capability on previously seen classes.
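
As a rough illustration of the self-ensembling idea, the sketch below shows the exponential-moving-average (EMA) teacher update that such methods commonly build on: a teacher model, whose weights slowly track the student's, produces stable targets for consistency training on unlabeled scenes. The stand-in detector and decay value are assumptions; the method in the thesis couples this style of update with detection-specific consistency constraints.

import copy
import torch

def update_teacher(student: torch.nn.Module,
                   teacher: torch.nn.Module,
                   ema_decay: float = 0.999) -> None:
    # The teacher's weights track a slow exponential moving average of
    # the student's weights, yielding more stable predictions.
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)

student = torch.nn.Linear(8, 4)      # stand-in for a full 3D detector
teacher = copy.deepcopy(student)     # teacher starts as a copy of the student
update_teacher(student, teacher)     # called once per training step

The static-dynamic co-teaching method for the class-incremental setting pairs two guiding models in a related spirit: a static, frozen copy of the previously trained detector helps preserve knowledge of old classes, while a dynamic teacher guides the learning of new ones.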

Overall, this thesis studies two important tasks in 3D scene understanding from four different learning perspectives. The success of these studies would enable intelligent agents to progress towards learning scene semantics on 3D point clouds in real-world scenarios.