Skip to Main content Skip to Navigation

Effective and annotation efficient deep learning for image understanding

Abstract : Recent development in deep learning have achieved impressive results on image understanding tasks. However, designing deep learning architectures that will effectively solve the image understanding tasks of interest is far from trivial. Even more, the success of deep learning approaches heavily relies on the availability of large-size manually labeled (by humans) data. In this context, the objective of this dissertation is to explore deep learning based approaches for core image understanding tasks that would allow to increase the effectiveness with which they are performed as well as to make their learning process more annotation efficient, i.e., less dependent on the availability of large amounts of manually labeled training data. We first focus on improving the state-of-the-art on object detection. More specifically, we attempt to boost the ability of object detection systems to recognize (even difficult) object instances by proposing a multi-region and semantic segmentation-aware ConvNet-based representation that is able to capture a diverse set of discriminative appearance factors. Also, we aim to improve the localization accuracy of object detection systems by proposing iterative detection schemes and a novel localization model for estimating the bounding box of the objects. We demonstrate that the proposed technical novelties lead to significant improvements in the object detection performance of PASCAL and MS COCO benchmarks. Regarding the pixel-wise image labeling problem, we explored a family of deep neural network architectures that perform structured prediction by learning to (iteratively) improve some initial estimates of the output labels. The goal is to identify which is the optimal architecture for implementing such deep structured prediction models. In this context, we propose to decompose the label improvement task into three steps: 1) detecting the initial label estimates that are incorrect, 2) replacing the incorrect labels with new ones, and finally 3) refining the renewed labels by predicting residual corrections w.r.t. them. We evaluate the explored architectures on the disparity estimation task and we demonstrate that the proposed architecture achieves state-of-the-art results on the KITTI 2015 benchmark.In order to accomplish our goal for annotation efficient learning, we proposed a self-supervised learning approach that learns ConvNet-based image representations by training the ConvNet to recognize the 2d rotation that is applied to the image that it gets as input. We empirically demonstrate that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. Specifically, the image features learned from this task exhibit very good results when transferred on the visual tasks of object detection and semantic segmentation, surpassing prior unsupervised learning approaches and thus narrowing the gap with the supervised case.Finally, also in the direction of annotation efficient learning, we proposed a novel few-shot object recognition system that after training is capable to dynamically learn novel categories from only a few data (e.g., only one or five training examples) while it does not forget the categories on which it was trained on. In order to implement the proposed recognition system we introduced two technical novelties, an attention based few-shot classification weight generator, and implementing the classifier of the ConvNet based recognition model as a cosine similarity function between feature representations and classification vectors. We demonstrate that the proposed approach achieved state-of-the-art results on relevant few-shot benchmarks
Document type :
Complete list of metadata

Cited literature [188 references]  Display  Hide  Download
Contributor : ABES STAR :  Contact
Submitted on : Monday, October 28, 2019 - 12:02:35 PM
Last modification on : Saturday, January 15, 2022 - 3:56:55 AM
Long-term archiving on: : Wednesday, January 29, 2020 - 3:38:53 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02335437, version 1


Spyridon Gidaris. Effective and annotation efficient deep learning for image understanding. Other. Université Paris-Est, 2018. English. ⟨NNT : 2018PESC1143⟩. ⟨tel-02335437⟩



Record views


Files downloads