
Representation learning for sound scene analysis (Apprentissage de représentations pour l'analyse de scènes sonores)

Abstract: This thesis focuses on the computational analysis of environmental sound scenes and events. The objective of such tasks is to automatically extract information about the context in which a sound was recorded. Interest in this area of research has grown rapidly in the last few years, leading to a steady increase in the number of published works and proposed approaches. We explore and contribute to the main families of approaches to sound scene and event analysis, from feature engineering to deep learning. Our work centers on representation learning techniques based on nonnegative matrix factorization (NMF), which are particularly well suited to analysing multi-source environments such as acoustic scenes. As a first approach, we propose a combination of image processing features, with the goal of confirming that spectrograms contain enough information to discriminate between sound scenes and events. From there, we leave feature engineering behind and move towards learning the features automatically. The first step in that direction is to study the usefulness of matrix factorization for unsupervised feature learning, relying in particular on variants of NMF. Several of the compared approaches outperform feature engineering approaches on these tasks. Next, we propose to improve the learned representations by introducing the TNMF model, a supervised variant of NMF. The proposed TNMF models and algorithms jointly learn nonnegative dictionaries and classifiers by minimising a target classification cost. The last part of our work highlights the links and compatibility between NMF and certain deep neural network systems by proposing and adapting neural network architectures that take NMF as an input representation. The proposed models achieve state-of-the-art performance on scene classification and overlapping event detection tasks. Finally, we explore the possibility of jointly learning NMF and neural network parameters, grouping the different stages of our systems into a single optimisation problem.
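To make the idea of unsupervised feature learning with NMF more concrete, here is a minimal sketch in plain Python. It is not the thesis's implementation: it assumes the standard multiplicative updates for the Euclidean cost, and the toy "spectrogram", the two spectral templates, and all function names are hypothetical choices made for illustration. A nonnegative matrix V (frequency bins by time frames) is factored into a dictionary W and activations H, and the columns of H then serve as learned per-frame features.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    B_cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of the reconstruction residual V - WH."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr)) ** 0.5


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative V (F x N) into W (F x rank) and H (rank x N).

    Uses multiplicative updates for the Euclidean cost (an assumption of
    this sketch, chosen for simplicity).
    """
    rng = random.Random(seed)
    n_bins, n_frames = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n_bins)]
    H = [[rng.random() + eps for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Hypothetical toy data: 4 frequency bins, 5 time frames, 2 underlying sources.
templates = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
activations = [[1.0, 0.0, 0.5, 1.0, 0.0], [0.0, 1.0, 0.5, 0.2, 0.7]]
V = matmul(templates, activations)

W, H = nmf(V, rank=2)
features = transpose(H)  # one learned feature vector per time frame
print("reconstruction error:", frobenius_error(V, W, H))
```

In a classification pipeline of the kind the abstract describes, vectors such as `features` would be pooled over time and passed to a classifier. The supervised and neural variants mentioned above go further by learning the dictionary jointly with the classifier, which this sketch does not attempt.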
Contributor: ABES STAR
Submitted on: Wednesday, January 12, 2022 - 6:50:33 PM
Last modification on: Friday, January 14, 2022 - 4:24:01 PM
Long-term archiving on: Wednesday, April 13, 2022 - 11:37:04 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03523676, version 1



Victor Bisot. Apprentissage de représentations pour l'analyse de scènes sonores. Machine Learning [cs.LG]. Télécom ParisTech, 2018. French. ⟨NNT : 2018ENST0016⟩. ⟨tel-03523676⟩


