
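Reconstitution de la parole par imagerie ultrasonore et vidéo de l'appareil vocal : vers une communication parlée silencieuse (Speech reconstruction from ultrasound and video imaging of the vocal tract: towards silent spoken communication)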

Thomas Hueber 1 
1 SIGMA - Laboratoire Signaux, Modèles et Apprentissage Statistique
ESPCI Paris - Ecole Superieure de Physique et de Chimie Industrielles de la Ville de Paris, CNRS - Centre National de la Recherche Scientifique : UMR7084
Abstract : The aim of the thesis is the design of a "silent speech interface", a system that permits voice communication without vocalization. Two main applications are targeted: assistance to laryngectomized persons, and voice communication when silence must be maintained (public transport, military situations) or in extremely noisy environments. The system developed is based on capturing articulatory activity via ultrasound and video imaging. The problem addressed in this work is that of transforming multimodal observations of articulatory gestures into an audio speech signal. This "visuo-acoustic" conversion is achieved using machine learning methods, which require the construction of audiovisual training databases. To this end, a procedure based on two inertial sensors is first proposed to monitor the position of the ultrasound probe relative to the speaker's head during data acquisition. Subsequently, a system is presented that synchronously acquires high-speed ultrasound and video images of the vocal tract together with the uttered acoustic speech signal. Two databases containing approximately one hour of multimodal continuous speech data (in English) were recorded. Discrete cosine transform (DCT) and principal component analysis (EigenTongues/EigenLips approach) are then compared as techniques for visual feature extraction. A first approach to visuo-acoustic conversion is based on a direct mapping between visual and acoustic features using neural networks and Gaussian mixture models (GMM). In a second approach, an intermediate HMM-based phonetic decoding step is introduced in order to take advantage of a priori linguistic information.
Finally, two methods are compared for the inference of the acoustic features used in the speech synthesis step: one based on a unit selection procedure, and the other on HMMs (HMM-based synthesis system HTS). The "Harmonic plus Noise" model (HNM) of the speech signal is used in both approaches.
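As an illustration of the DCT-based visual feature extraction mentioned in the abstract, here is a minimal sketch, assuming a grayscale image stored as a list of rows. It computes an orthonormal 2D DCT-II and keeps only the low-frequency block as a feature vector. The function names, the `keep` parameter and the toy image are hypothetical and not taken from the thesis, whose actual implementation is not described in this record.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence of floats."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            v * math.cos(math.pi * (i + 0.5) * k / n)
            for i, v in enumerate(values)
        )
        # Scaling that makes the transform orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2D DCT-II: transform each row, then each column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed [vertical][horizontal].
    return [list(r) for r in zip(*cols)]


def dct_features(image, keep=3):
    """Flatten the keep x keep low-frequency block into a feature vector.

    `keep` is an illustrative assumption, not a value from the thesis.
    """
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


if __name__ == "__main__":
    # Toy 4x4 "image" standing in for an ultrasound or lip frame.
    frame = [
        [10.0, 12.0, 11.0, 9.0],
        [13.0, 40.0, 42.0, 10.0],
        [12.0, 41.0, 39.0, 11.0],
        [9.0, 11.0, 10.0, 8.0],
    ]
    print(dct_features(frame, keep=2))
```

Keeping only a small low-frequency block gives a compact description of each frame. The PCA-based EigenTongues/EigenLips alternative named in the abstract would instead learn its basis from the training images.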
Contributor : Ecole Espci Paristech
Submitted on : Wednesday, January 13, 2010 - 8:00:00 AM
Last modification on : Thursday, November 18, 2021 - 4:02:46 AM
Long-term archiving on : Friday, September 10, 2010 - 2:54:28 PM


  • HAL Id : pastel-00005707, version 1


Thomas Hueber. Reconstitution de la parole par imagerie ultrasonore et vidéo de l'appareil vocal : vers une communication parlée silencieuse. Université Pierre et Marie Curie - Paris VI, 2009. French. ⟨pastel-00005707⟩


