
Recherche d'une représentation des données efficace pour la fouille des grandes bases de données

Abstract : The data preparation step of the data mining process represents 80% of the problem and is both time-consuming and critical for the quality of the modeling. In this thesis, our purpose is to design an evaluation criterion for data representations, in order to automate data preparation. To this end, we introduce a non-parametric family of density estimation models, named data grid models. Each variable is partitioned into intervals or into groups of values, according to whether it is numerical or categorical, and the whole data space is partitioned into a grid of cells resulting from the cross-product of the univariate partitions. We then consider density estimation models where the density is assumed constant within each data grid cell. Because of their high expressiveness, data grid models are hard to regularize and to optimize. We exploit a model selection technique based on a Bayesian approach and obtain an exact analytic criterion for the posterior probability of data grid models. We introduce combinatorial optimization algorithms which leverage the properties of our evaluation criterion and the sparseness of data in high dimensions. These algorithms have a guaranteed algorithmic complexity, which is super-linear in the sample size. We evaluate data grid models on numerous data analysis tasks: supervised classification, regression, clustering and coclustering. The results demonstrate the validity of the approach, which makes it possible to automatically and efficiently detect fine-grained and reliable information useful for the data preparation step.
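As an illustration of the grid structure described in the abstract, here is a minimal Python sketch, assuming one numerical and one categorical variable whose partitions are fixed by hand. It only shows how the cross-product of two univariate partitions yields cells with a constant estimate per cell. It does not implement the Bayesian evaluation criterion or the optimization algorithms of the thesis, and the names `interval_index`, `build_data_grid`, `cell_probabilities`, `cut_points` and `value_groups` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def interval_index(x, cut_points):
    """Return the index of the interval containing x, given sorted cut points."""
    return bisect_right(cut_points, x)


def build_data_grid(rows, cut_points, value_groups):
    """Count the instances falling in each cell of a two-variable grid.

    rows: iterable of (numerical_value, categorical_value) pairs.
    cut_points: sorted boundaries partitioning the numerical variable.
    value_groups: dict mapping each categorical value to a group index.
    """
    counts = Counter()
    for x, v in rows:
        cell = (interval_index(x, cut_points), value_groups[v])
        counts[cell] += 1
    return counts


def cell_probabilities(counts):
    """Piecewise-constant estimate: empirical probability of each non-empty cell."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


# Toy data and hand-picked partitions (assumptions made for illustration only).
rows = [(0.5, "a"), (1.5, "b"), (2.5, "c"), (0.7, "a"), (2.9, "b")]
cut_points = [1.0, 2.0]                   # intervals: x < 1, 1 <= x < 2, x >= 2
value_groups = {"a": 0, "b": 1, "c": 1}   # groups: {a} and {b, c}

counts = build_data_grid(rows, cut_points, value_groups)
probs = cell_probabilities(counts)
```

In this toy setting the grid has 3 x 2 = 6 cells, of which only three are non-empty. Choosing the cut points and value groups automatically is the model selection problem the abstract refers to.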

Cited literature : 152 references
Contributor : Ecole Télécom ParisTech
Submitted on : Friday, May 23, 2008 - 8:00:00 AM
Last modification on : Friday, October 23, 2020 - 4:37:49 PM
Long-term archiving on : Wednesday, September 8, 2010 - 5:36:13 PM


  • HAL Id : pastel-00003023, version 1


Marc Boullé. Recherche d'une représentation des données efficace pour la fouille des grandes bases de données. domain_other. Télécom ParisTech, 2007. English. ⟨pastel-00003023⟩


