Alignement temporel musique-sur-partition par modèles graphiques discriminatifs

Cyril Joder 1
1 AAO
TSI - Département Traitement du Signal et des Images, LTCI - Laboratoire Traitement et Communication de l'Information
Abstract : This thesis addresses the problem of aligning a musical recording to its corresponding score, a task with numerous applications in music information retrieval. We adopt a probabilistic approach and introduce discriminative graphical models called conditional random fields (CRFs) for this task by casting it as a sequence labeling problem. The CRF framework is designed for sequence segmentation and labeling, and it allows more flexible models than the hidden Markov and hidden semi-Markov models commonly used in the alignment literature. In particular, CRFs allow the use of acoustic features extracted from a whole sequence of audio frames rather than from a single observation. We take advantage of this property to design features that implicitly model tempo at the lowest level of the model. Furthermore, we propose three dependency structures, corresponding to different degrees of precision in the modeling of musical event durations. Three types of features are used, characterizing the local harmony, note attacks and tempo. Experiments run on a large database of classical piano and popular music show very accurate alignments: with the best-performing system, more than 95% of the note onsets are detected with a precision finer than 100 ms.
Several traditional features, extracted from different representations of the audio, are considered for characterizing the local match between the score and the recording. These descriptors are compared on the basis of their effectiveness on the alignment task. We also address the design of novel features by learning a linear transformation from the symbolic representation to any time-frequency audio representation. We explore a best-fit strategy (minimum divergence) as well as a discriminative criterion (maximum likelihood) for estimating the optimal mapping, and show that such learning has the potential to increase alignment accuracy for all the tested audio representations.
Finally, we explore several strategies to account for constraints arising in real use cases. In particular, complexity is reduced by a novel, dedicated hierarchical pruning strategy. This method exploits the hierarchical structure of music in a multi-pass decoding approach, yielding better overall efficiency than the beam search traditionally used in HMM-based models. We additionally show how the proposed framework can be modified to be robust to possible structural differences between the score and the musical performance, and we study the scalability of the models used.
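To make the sequence-labeling formulation in the abstract concrete, the following is a minimal sketch and not the CRF developed in the thesis. It assumes a hypothetical table of per-frame match scores, `match[t][k]`, between audio frame `t` and score event `k`, and runs a plain Viterbi pass over a left-to-right chain in which each frame either stays on the current event or moves to the next one. The function names, the `stay` and `move` transition scores, and the example values are all assumptions made for illustration.

```python
from typing import List

NEG_INF = float("-inf")


def align_frames_to_events(match: List[List[float]],
                           stay: float = 0.0,
                           move: float = 0.0) -> List[int]:
    """Label each audio frame with a score-event index (illustrative only).

    match[t][k] is a hypothetical log-score for frame t matching event k.
    The path starts on event 0, ends on the last event, and at each frame
    either stays on the same event or advances by exactly one event.
    """
    n_frames, n_events = len(match), len(match[0])
    if n_frames < n_events:
        raise ValueError("need at least one frame per score event")

    best = [[NEG_INF] * n_events for _ in range(n_frames)]
    advanced = [[False] * n_events for _ in range(n_frames)]
    best[0][0] = match[0][0]

    # Forward pass: best cumulative score of any path ending on (t, k).
    for t in range(1, n_frames):
        for k in range(n_events):
            s_stay = best[t - 1][k] + stay
            s_move = best[t - 1][k - 1] + move if k > 0 else NEG_INF
            if s_move > s_stay:
                best[t][k] = s_move + match[t][k]
                advanced[t][k] = True
            else:
                best[t][k] = s_stay + match[t][k]

    # Backward pass: recover the labels from the last event backwards.
    k = n_events - 1
    labels = [k]
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][k]:
            k -= 1
        labels.append(k)
    labels.reverse()
    return labels


def onset_frames(labels: List[int]) -> List[int]:
    """Return the first frame index at which each event label appears."""
    onsets = []
    for t, k in enumerate(labels):
        if k == len(onsets):
            onsets.append(t)
    return onsets


# Hypothetical scores: 5 frames, 3 score events (0 = good match, -1 = poor).
example = [
    [0, -1, -1],
    [0, -1, -1],
    [-1, 0, -1],
    [-1, -1, 0],
    [-1, -1, 0],
]
labels = align_frames_to_events(example)
onsets = onset_frames(labels)
```

In this toy setting the detected onsets are frame indices; converting them to seconds with the frame hop size would allow a comparison against reference onsets at a tolerance such as the 100 ms threshold quoted above.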


https://pastel.archives-ouvertes.fr/pastel-00664260
Contributor : Cyril Joder
Submitted on : Monday, January 30, 2012 - 11:11:34 AM
Last modification on : Monday, February 25, 2019 - 11:08:10 AM
Long-term archiving on : Wednesday, December 14, 2016 - 2:49:23 AM

Identifiers

  • HAL Id : pastel-00664260, version 1

Citation

Cyril Joder. Alignement temporel musique-sur-partition par modèles graphiques discriminatifs. Traitement du signal et de l'image [eess.SP]. Télécom ParisTech, 2011. Français. ⟨pastel-00664260⟩
