PH.D DEFENCE - PUBLIC SEMINAR

Monocular Image/Video-based Human Pose Estimation

Speaker
Mr. Lin Jiahao
Advisor
Dr Lee Gim Hee, Associate Professor, School of Computing


30 Sep 2020 Wednesday, 03:30 PM to 05:00 PM

Zoom presentation

Join Zoom Meeting
https://nus-sg.zoom.us/j/99066922155?pwd=djEyYmcrWUJzZTdXU1RxRWxVUktiZz09

Meeting ID: 990 6692 2155
Password: 980279

Abstract:
Human pose estimation is a challenging yet important area in computer vision with a wide variety of applications in building machine intelligence. Monocular 3D human pose estimation is a typical task that infers human pose in 3D coordinate space from a single image or a sequence of images. Temporal information such as long-range dependencies provides extra cues and can be exploited to address this ill-posed problem. We propose a trajectory space factorization approach that models sequential data and produces temporally consistent 3D pose estimates for all frames in a sequence of images. Specifically, we adopt matrix factorization to transform 2D motion trajectories into coefficients in a trajectory space spanned by pre-defined trajectory bases. 2D-to-3D regression is then performed in trajectory space by learning a discriminative deep network to regress the 3D trajectory coefficients. Our framework produces estimates for all frames in a video sequence with a single forward pass, while achieving state-of-the-art 3D pose estimation performance.
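
The trajectory space factorization idea can be sketched as follows. This is a toy illustration only: it assumes DCT-II vectors as the pre-defined trajectory bases (the actual basis choice in the thesis may differ), and it projects a 2D trajectory onto the bases with least squares in place of the learned regression network.

```python
import numpy as np

def dct_bases(T, K):
    # K pre-defined trajectory bases over T frames (DCT-II, an illustrative choice);
    # columns are normalized so the basis matrix is orthonormal
    n = np.arange(T)
    B = np.stack([np.cos(np.pi * (n + 0.5) * k / T) for k in range(K)], axis=1)  # (T, K)
    return B / np.linalg.norm(B, axis=0, keepdims=True)

# toy 2D trajectory of one joint over T frames
T, K = 50, 8
t = np.linspace(0.0, 1.0, T)
traj_2d = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)  # (T, 2)

B = dct_bases(T, K)                                   # (T, K) trajectory bases
coeffs = np.linalg.lstsq(B, traj_2d, rcond=None)[0]   # (K, 2) trajectory-space coefficients
recon = B @ coeffs                                    # trajectory recovered for all T frames at once
```

In the full framework, a deep network regresses the 3D trajectory coefficients from the 2D ones, so a whole sequence is estimated in one forward pass rather than frame by frame.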

To address human pose estimation in multi-person scenarios, two classes of two-stage approaches, top-down and bottom-up, have been adopted. Existing bottom-up approaches utilize embeddings learned by CNNs to obtain pairwise affinities for grouping human joints. Such visual affinities are agnostic to the underlying human poses and may lead to infeasible pose estimates. We formulate the grouping task as a graph partitioning problem and propose a geometry-aware grouping framework based on a Graph Neural Network (GNN), which incorporates the coordinates of joint detections into the estimation of affinities. We empirically show that combining geometry-aware affinities with existing visual-based association approaches effectively eliminates implausible pose estimates and improves the robustness of the grouping process.
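
A minimal sketch of affinity-based joint grouping as graph partitioning is given below. The distance-based geometric term, the mixing weight `alpha`, and the threshold are all hypothetical stand-ins for the learned GNN affinities; only the overall structure (combine affinities, then partition the joint graph) reflects the approach described above.

```python
import numpy as np

def group_joints(coords, visual_aff, alpha=0.5, thresh=0.5):
    # Combine visual affinities with a simple geometric term (distance-based,
    # illustrative only), then partition the joint graph by union-find on
    # high-affinity edges; each resulting component is one person hypothesis.
    n = len(coords)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    geo_aff = np.exp(-d / (d.mean() + 1e-8))          # nearby joints get high affinity
    aff = alpha * visual_aff + (1 - alpha) * geo_aff  # fused pairwise affinity

    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if aff[i, j] > thresh:
                parent[find(i)] = find(j)  # merge joints connected by a strong edge

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

With two well-separated clusters of detections, the fused affinity keeps within-person edges above threshold and cross-person edges below it, so each person's joints end up in their own partition.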

3D pose estimation can also be extended to images containing multiple persons. Estimating the global location of each person in camera space is critical for scene understanding, including human actions and human-human interactions. We propose our Human Depth Estimation Network (HDNet) to globally localize each person in camera space. In particular, HDNet estimates the 2D human pose and uses the 2D pose heatmaps as attention masks to extract pose-related features for depth estimation. These features are further refined with a Graph Neural Network (GNN) to estimate the root joint depth via a classification output. We show state-of-the-art results on both the root joint localization and absolute 3D pose estimation tasks.
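
The two key operations in this pipeline can be sketched in a few lines. Both functions below are illustrative simplifications, not the exact HDNet operators: the heatmap is used as a soft attention mask over a feature map, and the depth is read out from a classification over discretized depth bins via a probability-weighted expectation (a soft-argmax; bin layout and shapes are assumptions).

```python
import numpy as np

def heatmap_attention_pool(features, heatmap):
    # features: (C, H, W), heatmap: (H, W).
    # Normalize the 2D pose heatmap into attention weights and pool the
    # feature map into a single pose-related feature vector of length C.
    w = heatmap / (heatmap.sum() + 1e-8)
    return (features * w[None]).sum(axis=(1, 2))  # (C,)

def depth_from_logits(logits, bins):
    # Depth estimation as classification over discretized depth bins,
    # decoded as the expectation of depth under the softmax distribution.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float((p * bins).sum())
```

For example, with uniform logits over bins at 1 m to 5 m, the expectation readout returns the middle of the range (3 m); a peaked distribution would pull the estimate toward the most likely bin.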