Speech reconstruction from ultrasound and video imaging of the vocal tract: towards silent spoken communication

Abstract : The aim of the thesis is the design of a "silent speech interface", a system that permits voice communication without vocalization. Two main applications are targeted: assistance to laryngectomized persons, and voice communication when silence must be maintained (public transport, military situations) or in extremely noisy environments. The system developed is based on capturing articulatory activity via ultrasound and video imaging. The problem addressed in this work is that of transforming multimodal observations of articulatory gestures into an audio speech signal. This "visuo-acoustic" conversion is achieved using machine learning methods, which require the construction of audiovisual training databases. To this end, a procedure based on two inertial sensors is first proposed to monitor the position of the ultrasound probe relative to the speaker's head during data acquisition. Subsequently, a system that synchronously acquires high-speed ultrasound and video images of the vocal tract together with the uttered acoustic speech signal is presented. Two databases containing approximately one hour of multimodal continuous speech data (in English) were recorded. Discrete cosine transform (DCT) and principal component analysis (EigenTongues/EigenLips approach) are then compared as techniques for visual feature extraction. A first approach to visuo-acoustic conversion is based on a direct mapping between visual and acoustic features using neural networks and Gaussian mixture models (GMM). In a second approach, an intermediate HMM-based phonetic decoding step is introduced in order to take advantage of a priori linguistic information. Finally, two methods are compared for the inference of the acoustic features used in the speech synthesis step, the first based on a unit selection procedure and the second on HMMs (the HMM-based synthesis system HTS), with the "Harmonic plus Noise" model (HNM) of the speech signal being used in both approaches.
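
To illustrate the DCT-based visual feature extraction mentioned in the abstract, here is a minimal sketch in plain Python. It is not the thesis implementation: the function names, the 4x4 toy frame, and the `keep` parameter (the size of the low-frequency block retained) are all assumptions made for illustration. The idea is to apply a 2-D DCT to an image and keep only the low-frequency coefficients as a compact feature vector.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in image]
    # zip(*rows) iterates over columns; each result is indexed [v][u]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # Transpose back so that coeffs[u][v] is (vertical, horizontal) frequency
    return [list(r) for r in zip(*cols)]


def dct_features(image, keep=2):
    """Flatten the keep x keep low-frequency block into a feature vector.

    `keep` is a hypothetical tuning parameter, not a value from the thesis.
    """
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


# Hypothetical 4x4 grayscale frame standing in for an ultrasound or lip image
frame = [
    [12, 14, 13, 11],
    [40, 44, 43, 39],
    [90, 95, 93, 88],
    [30, 33, 31, 29],
]
features = dct_features(frame, keep=2)
print(features)
```

In practice a library routine would normally replace the hand-written transform, and the PCA-based alternative (EigenTongues/EigenLips) would instead project each frame onto basis images learned from the training data.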
Document type : Theses

https://pastel.archives-ouvertes.fr/pastel-00005707
Contributor : Ecole Espci Paristech
Submitted on : Wednesday, January 13, 2010 - 8:00:00 AM
Last modification on : Wednesday, October 14, 2020 - 3:43:05 AM
Long-term archiving on : Friday, September 10, 2010 - 2:54:28 PM

Identifiers

  • HAL Id : pastel-00005707, version 1

Citation

Thomas Hueber. Reconstitution de la parole par imagerie ultrasonore et vidéo de l'appareil vocal : vers une communication parlée silencieuse. Université Pierre et Marie Curie - Paris VI, 2009. French. ⟨pastel-00005707⟩
