On unsupervised learning in high dimension

Abstract : In this thesis, we discuss two topics, high-dimensional clustering on the one hand and estimation of mixing densities on the other. The first chapter is an introduction to clustering. We present various popular methods and we focus on one of the main models of our work which is the mixture of Gaussians. We also discuss the problems with high-dimensional estimation (Section 1.3) and the difficulty of estimating the number of clusters (Section 1.1.4). In what follows, we present briefly the concepts discussed in this manuscript. Consider a mixture of $K$ Gaussians in $RR^p$. One of the common approaches to estimate the parameters is to use the maximum likelihood estimator. Since this problem is not convex, we can not guarantee the convergence of classical methods such as gradient descent or Newton's algorithm. However, by exploiting the biconvexity of the negative log-likelihood, the iterative 'Expectation-Maximization' (EM) procedure described in Section 1.2.1 can be used. Unfortunately, this method is not well suited to meet the challenges posed by the high dimension. In addition, it is necessary to know the number of clusters in order to use it. Chapter 2 presents three methods that we have developed to try to solve the problems described above. The works presented there have not been thoroughly researched for various reasons. The first method that could be called 'graphical lasso on Gaussian mixtures' consists in estimating the inverse matrices of covariance matrices $Sigma$ (Section 2.1) in the hypothesis that they are parsimonious. We adapt the graphic lasso method of [Friedman et al., 2007] to a component in the case of a mixture and experimentally evaluate this method. The other two methods address the problem of estimating the number of clusters in the mixture. The first is a penalized estimate of the matrix of posterior probabilities $ Tau in RR ^ {n times K} $ whose component $ (i, j) $ is the probability that the $i$-th observation is in the $j$-th cluster. Unfortunately, this method proved to be too expensive in complexity (Section 2.2.1). Finally, the second method considered is to penalize the weight vector $ pi $ in order to make it parsimonious. This method shows promising results (Section 2.2.2). In Chapter 3, we study the maximum likelihood estimator of density of $n$ i.i.d observations, under the assumption that it is well approximated by a mixture with a large number of components. The main focus is on statistical properties with respect to the Kullback-Leibler loss. We establish risk bounds taking the form of sharp oracle inequalities both in deviation and in expectation. A simple consequence of these bounds is that the maximum likelihood estimator attains the optimal rate $((log K)/n)^{1/2}$, up to a possible logarithmic correction, in the problem of convex aggregation when the number $K$ of components is larger than $n^{1/2}$. More importantly, under the additional assumption that the Gram matrix of the components satisfies the compatibility condition, the obtained oracle inequalities yield the optimal rate in the sparsity scenario. That is, if the weight vector is (nearly) $D$-sparse, we get the rate $(Dlog K)/n$. As a natural complement to our oracle inequalities, we introduce the notion of nearly-$D$-sparse aggregation and establish matching lower bounds for this type of aggregation. Finally, in Chapter 4, we propose an algorithm that performs the Kullback-Leibler aggregation of components of a dictionary as discussed in Chapter 3. We compare its performance with different methods: the kernel density estimator , the 'Adaptive Danzig' estimator, the SPADES and EM estimator with the BIC criterion. We then propose a method to build the dictionary of densities and study it numerically. This thesis was carried out within the framework of a CIFRE agreement with the company ARTEFACT.
Complete list of metadatas

Cited literature [105 references]  Display  Hide  Download

https://pastel.archives-ouvertes.fr/tel-01677233
Contributor : Abes Star <>
Submitted on : Monday, January 8, 2018 - 10:56:07 AM
Last modification on : Tuesday, April 30, 2019 - 6:05:06 PM
Long-term archiving on : Friday, May 4, 2018 - 12:31:04 AM

File

75993_SEBBAR_2017_archivage.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01677233, version 1

Citation

Mehdi Sebbar. On unsupervised learning in high dimension. Machine Learning [cs.LG]. Université Paris-Saclay, 2017. English. ⟨NNT : 2017SACLG003⟩. ⟨tel-01677233⟩

Share

Metrics

Record views

748

Files downloads

349