Inference and evaluation of the multinomial mixture model for unsupervised text clustering

Abstract : In this thesis, we investigate the use of a probabilistic model for unsupervised clustering of text collections. We focus in particular on the multinomial mixture model, with one latent theme variable per document. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification, or information extraction. Probabilistic clustering models that build "soft" theme-document associations have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between the document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The contribution of this study is twofold. First, we present and contrast various estimation procedures for the multinomial mixture model, some of which had not been tested before in this context. Second, we propose a systematic evaluation of the performance of these algorithms, thereby defining a framework to assess the quality of unsupervised text clustering methods. The comparison with the performance of other classical models demonstrates, in our opinion, the relevance of the simple multinomial mixture model for clustering corpora mainly composed of monothematic documents.
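To make the "soft" association concrete, here is a minimal sketch of how a per-document probability vector could be computed under a multinomial mixture with one latent theme per document. It is an illustration only, not the thesis's implementation: the function name `theme_posteriors`, the argument layout, and all parameter values are assumptions made up for this example, and the parameters are taken as given rather than estimated.

```python
import math


def theme_posteriors(counts, log_priors, log_word_probs):
    """Return P(theme | document) for one bag-of-words document.

    counts:         dict mapping word index -> count in the document
    log_priors:     list of log mixture weights, one per theme
    log_word_probs: per theme, a list of log word probabilities
    (All names are hypothetical; parameters are assumed known.)
    """
    # Unnormalized log posterior per theme:
    #   log prior + sum over words of count * log word probability.
    # The multinomial coefficient is identical for every theme and cancels.
    scores = [
        lp + sum(c * lw[w] for w, c in counts.items())
        for lp, lw in zip(log_priors, log_word_probs)
    ]
    # Normalize with log-sum-exp to avoid underflow on long documents.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


# Toy setting (invented values): 2 themes, 3-word vocabulary.
log_priors = [math.log(0.5), math.log(0.5)]
log_word_probs = [
    [math.log(p) for p in (0.7, 0.2, 0.1)],  # theme 0
    [math.log(p) for p in (0.1, 0.2, 0.7)],  # theme 1
]
document = {0: 2, 1: 1}  # word 0 twice, word 1 once

posteriors = theme_posteriors(document, log_priors, log_word_probs)
```

With these toy values the vector is approximately (0.98, 0.02), so the document is associated almost entirely with the first theme. Working in log space is a design choice for numerical stability: with a large vocabulary and long documents, products of word probabilities underflow quickly.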
Contributor : Ecole Télécom ParisTech
Submitted on : Thursday, May 10, 2007 - 8:00:00 AM
Last modification on : Friday, July 31, 2020 - 10:44:05 AM
Long-term archiving on : Wednesday, September 8, 2010 - 5:02:56 PM


  • HAL Id : pastel-00002424, version 1



Loïs Rigouste. Inference and evaluation of the multinomial mixture model for unsupervised text clustering. Télécom ParisTech, 2006. English. ⟨pastel-00002424⟩


