Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles (Representation learning for robust audiovisual scene analysis)

Abstract: The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence.

For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches.

To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly supervised setting wherein we design a representation learning framework for robust audio-visual event classification, visual object localization, audio event detection and source separation.

We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
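For readers unfamiliar with the non-negative matrix factorization framework the abstract refers to, the sketch below shows plain NMF with multiplicative updates applied to a toy magnitude "spectrogram". This is only an illustrative baseline under assumed toy data; the thesis couples such factorizations with visual motion cues, which is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

# Illustrative sketch only: plain NMF, V ≈ W @ H with V, W, H non-negative,
# fit by the classic multiplicative updates for the Frobenius objective.
# In audio source separation, W holds spectral templates and H their
# temporal activations; sources are recovered by grouping components.
def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, rank)) + eps   # spectral templates (F x rank)
    H = rng.random((rank, N)) + eps   # temporal activations (rank x N)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "mixture": two sources with distinct spectral templates and
# complementary temporal envelopes (4 frequency bins, 50 frames).
t = np.linspace(0.0, 1.0, 50)
V = (np.outer([1.0, 0.0, 2.0, 0.0], np.sin(np.pi * t) ** 2)
     + np.outer([0.0, 3.0, 0.0, 1.0], np.cos(np.pi * t) ** 2))

W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because the toy mixture is exactly rank 2, the relative reconstruction error is near zero after a few hundred updates; the audio-visual methods summarized above additionally constrain the activations H using motion information.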
Document type :
Theses

Cited literature [167 references]

https://pastel.archives-ouvertes.fr/tel-02115465
Contributor : Abes Star
Submitted on : Tuesday, April 30, 2019 - 12:14:07 PM
Last modification on : Thursday, October 17, 2019 - 12:36:55 PM

File

78492_PAREKH_2019_archivage.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02115465, version 1

Citation

Sanjeel Parekh. Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles. Traitement du signal et de l'image [eess.SP]. Université Paris-Saclay, 2019. Français. ⟨NNT : 2019SACLT015⟩. ⟨tel-02115465⟩

Metrics

  • Record views : 255
  • Files downloads : 130