
Representation learning for robust audio-visual scene analysis

Abstract: The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence.

For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches.

To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust AV event classification, visual object localization, audio event detection and source separation.

We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
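To make the factorization framework mentioned above concrete, here is a minimal sketch of plain non-negative matrix factorization with Lee-Seung multiplicative updates, applied to a toy magnitude "spectrogram". This is only the vanilla baseline the thesis builds on, not its motion-coupled or cross-modal-regression variants; all names and the toy data are illustrative assumptions.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Vanilla NMF via multiplicative updates minimizing Euclidean distance.

    Factorizes a non-negative matrix V (freq x time) as V ~= W @ H, where the
    columns of W act as spectral templates and the rows of H as their
    time-varying activations.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Lee-Seung multiplicative update rules (Euclidean cost).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": two spectral patterns active at different time frames,
# standing in for two sound sources mixed together.
w1 = np.array([1.0, 0.0, 0.5, 0.0])
w2 = np.array([0.0, 1.0, 0.0, 0.5])
h1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
h2 = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
V = np.outer(w1, h1) + np.outer(w2, h2)

W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error (should be small)
```

In source-separation use, each learned template/activation pair is turned back into a per-source spectrogram (e.g. `np.outer(W[:, k], H[k])`); the thesis's contribution is to constrain which activations belong to which source using visual motion cues, which this unconstrained sketch does not attempt.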

Cited literature: 167 references
Contributor: ABES STAR
Submitted on: Tuesday, April 30, 2019 - 12:14:07 PM
Last modification on: Saturday, June 25, 2022 - 9:12:36 PM


Version validated by the jury (STAR)


  • HAL Id: tel-02115465, version 1


Sanjeel Parekh. Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles. Traitement du signal et de l'image [eess.SP]. Université Paris-Saclay, 2019. Français. ⟨NNT : 2019SACLT015⟩. ⟨tel-02115465⟩


