
New insights into hierarchical clustering and linguistic normalization for speaker diarization

Abstract : The ever-expanding volume of available audio and multimedia data has elevated technologies related to content indexing and structuring to the forefront of research. Speaker diarization, commonly referred to as the "who spoke when?" task, is one such example and has emerged as a prominent, core enabling technology in the wider speech processing research community. Speaker diarization involves the detection of speaker turns within an audio document (segmentation) and the grouping together of all same-speaker segments (clustering). Much progress has been made in the field over recent years, partly spearheaded by the NIST Rich Transcription evaluations and their focus on the meeting domain, in the proceedings of which two general approaches are found: top-down and bottom-up. Even though the best-performing systems over recent years have all been bottom-up approaches, we show in this thesis that the top-down approach is not without significant merit. Indeed, we first introduce a new purification component that leads to performance competitive with that of the bottom-up approach. Moreover, while investigating the two diarization approaches more thoroughly, we show that they behave differently in discriminating between individual speakers and in normalizing unwanted acoustic variation, i.e. that which does not pertain to different speakers. This difference in behaviour leads to a new top-down/bottom-up system combination that outperforms the respective baseline systems. Finally, we introduce a new technology able to limit the influence of linguistic effects, which are responsible for biasing the convergence of the diarization system. Our novel approach is referred to as Phone Adaptive Training (PAT).
Contributor : ABES STAR
Submitted on : Wednesday, March 12, 2014 - 11:42:08 AM
Last modification on : Thursday, July 1, 2021 - 3:09:26 AM
Long-term archiving on : Thursday, June 12, 2014 - 11:11:59 AM


Version validated by the jury (STAR)


  • HAL Id : pastel-00958322, version 1


Simon Bozonnet. New insights into hierarchical clustering and linguistic normalization for speaker diarization. Other. Télécom ParisTech, 2012. English. ⟨NNT : 2012ENST0019⟩. ⟨pastel-00958322⟩


