
Object viewpoint estimation in the wild

Abstract: The goal of this thesis is to develop deep-learning approaches for estimating the 3D pose (viewpoint) of an object pictured in an image in different situations: (i) the object location in the image and the exact 3D model of the object are known; (ii) both the object location and its class are predicted, and an exemplar 3D model is provided for each object class; (iii) no 3D model is used, and the object location is predicted without the object being classified into a specific category.
The key contributions of this thesis are the following. First, we propose a deep-learning approach to category-free viewpoint estimation. This approach can estimate the pose of any object conditioned only on its 3D model, whether or not the object is similar to those seen at training time. The proposed network contains distinct modules for image feature extraction, shape feature extraction and pose prediction. These modules can have different variants for different representations of 3D models, but remain trainable end-to-end. Second, to allow inference without exact 3D object models, we develop a class-exemplar-based viewpoint estimation approach that learns to condition the viewpoint prediction on the corresponding class feature, extracted from a few 3D models during training. This approach differs from the previous one in that we extract an exemplar feature for each class instead of treating each object independently. We show that the proposed approach is robust to the precision of the provided 3D models and can be adapted quickly to novel classes using only a few labeled examples. Third, we define a simple yet effective unifying framework that tackles both few-shot object detection and few-shot viewpoint estimation. We exploit, in a meta-learning setting, task-specific class information present in existing datasets, such as images with bounding boxes for object detection and exemplar 3D models of different classes for viewpoint estimation. We also propose a joint evaluation of object detection and viewpoint estimation in the few-shot regime. Finally, we develop a class-agnostic object viewpoint estimation approach that estimates the viewpoint directly from an image embedding, where the embedding space is optimized for object pose estimation through geometry-aware contrastive learning. Rather than blindly pulling together features of the same object in different augmented views and pushing apart features of different objects while ignoring the pose difference between them, we propose a pose-aware contrastive loss that pushes away the image features of objects having different poses, regardless of the class of these objects. By sharing the network weights across all categories during training, we obtain a class-agnostic viewpoint estimation network that can work on objects of any category. Our method achieves state-of-the-art results on the Pascal3D+, ObjectNet3D and Pix3D category-level object pose estimation benchmarks, under both intra-dataset and inter-dataset settings.
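To illustrate the idea behind the pose-aware contrastive loss described above, the following is a minimal numpy sketch, not the thesis implementation: the function name, the 15-degree positive threshold, and the decision to treat pose-similar samples as positives irrespective of class are all illustrative assumptions.

```python
import numpy as np

def pose_aware_contrastive_loss(z, poses, tau=0.1):
    """Toy pose-aware contrastive loss (hypothetical simplification).

    z:     (N, D) array of L2-normalized image embeddings
    poses: (N,) object azimuth angles in radians
    Samples with similar poses are treated as positives; all other
    samples are pushed apart, regardless of their object class.
    """
    N = z.shape[0]
    sim = z @ z.T / tau                       # cosine similarities / temperature
    # Angular distance between every pair of poses, wrapped to [0, pi]
    d = np.abs(poses[:, None] - poses[None, :])
    d = np.minimum(d, 2 * np.pi - d)
    pos = d < np.deg2rad(15)                  # pose-similar pairs (assumed threshold)
    loss, count = 0.0, 0
    for i in range(N):
        logits = np.delete(sim[i], i)         # exclude self-similarity
        labels = np.delete(pos[i], i)
        if not labels.any():
            continue                          # no pose-similar partner in the batch
        log_den = np.log(np.exp(logits).sum())
        # InfoNCE-style term, averaged over sample i's pose-similar pairs
        loss += np.mean(log_den - logits[labels])
        count += 1
    return loss / max(count, 1)
```

In this sketch, class labels never appear: only the pose difference decides which pairs attract and which repel, which is what makes the resulting embedding class-agnostic.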
Contributor: ABES STAR
Submitted on: Monday, January 24, 2022 - 6:38:14 PM
Last modification on: Thursday, September 29, 2022 - 10:47:06 AM
Long-term archiving on: Tuesday, April 26, 2022 - 8:33:06 AM


Version validated by the jury (STAR)


  • HAL Id: tel-03541699, version 1


Yang Xiao. Object viewpoint estimation in the wild. Computer Vision and Pattern Recognition [cs.CV]. École des Ponts ParisTech, 2021. English. ⟨NNT : 2021ENPC0021⟩. ⟨tel-03541699⟩


