
Recherche d'une représentation des données efficace pour la fouille des grandes bases de données [Search for an efficient data representation for mining large databases]

Abstract : The data preparation step of the data mining process represents 80% of the problem and is both time-consuming and critical to the quality of the modeling. In this thesis, our purpose is to design an evaluation criterion for data representations, in order to automate data preparation. To this end, we introduce a nonparametric family of density estimation models, named data grid models. Each variable is partitioned into intervals or into groups of values, according to whether it is numerical or categorical, and the whole data space is partitioned into a grid of cells resulting from the cross-product of the univariate partitions. We then consider density estimation models where the density is assumed constant within each data grid cell. Because of their high expressiveness, data grid models are hard to regularize and to optimize. We exploit a model selection technique based on a Bayesian approach and obtain an exact analytic criterion for the posterior probability of data grid models. We introduce combinatorial optimization algorithms that leverage the properties of our evaluation criterion and the sparseness of data in high dimensions. These algorithms have a guaranteed algorithmic complexity, which is super-linear in the sample size. We evaluate data grid models on numerous data analysis tasks: supervised classification, regression, clustering, and coclustering. The results demonstrate the validity of the approach, which makes it possible to automatically and efficiently detect fine-grained and reliable information useful for the data preparation step.
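
As a rough illustration of the grid construction described in the abstract, the following Python sketch partitions one numerical variable into intervals and one categorical variable into groups of values, then estimates a constant probability for each cell of the resulting grid. It is only a minimal sketch under stated assumptions: the cut point, the grouping of categories, the toy sample, and the function names `cell_of` and `grid_frequencies` are all hypothetical, and the partition is fixed by hand, whereas the thesis selects it with a Bayesian criterion that is not reproduced here.

```python
from bisect import bisect_right
from collections import Counter


def cell_of(row, edges, groups):
    """Map a (numeric, categorical) row to its grid cell (interval index, group index)."""
    x, c = row
    interval = bisect_right(edges, x)  # edges holds the interior cut points, sorted
    group = groups[c]                  # groups maps each category to a group index
    return interval, group


def grid_frequencies(rows, edges, groups):
    """Estimate one constant probability per occupied cell as its share of the sample."""
    counts = Counter(cell_of(r, edges, groups) for r in rows)
    n = len(rows)
    return {cell: k / n for cell, k in counts.items()}


# Toy sample and hand-picked partitions (assumptions for illustration only).
rows = [(1.0, "a"), (2.5, "b"), (3.0, "a"), (7.5, "c"), (8.0, "c"), (9.1, "c")]
edges = [5.0]                      # two intervals: x < 5 and x >= 5
groups = {"a": 0, "b": 0, "c": 1}  # two groups of values: {a, b} and {c}

freq = grid_frequencies(rows, edges, groups)
print(freq)
```

With this toy sample, only two of the four grid cells are occupied, and each receives half of the probability mass. Cells absent from the returned dictionary are empty, which hints at the sparseness that the abstract says the optimization algorithms exploit.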
Document type :
Theses

Cited literature : 152 references

https://pastel.archives-ouvertes.fr/pastel-00003023
Contributor : Ecole Télécom Paristech
Submitted on : Friday, May 23, 2008 - 8:00:00 AM
Last modification on : Friday, October 23, 2020 - 4:37:49 PM
Long-term archiving on : Wednesday, September 8, 2010 - 5:36:13 PM

Identifiers

  • HAL Id : pastel-00003023, version 1

Citation

Marc Boullé. Recherche d'une représentation des données efficace pour la fouille des grandes bases de données. Télécom ParisTech, 2007. English. ⟨pastel-00003023⟩

Metrics

  • Record views : 439
  • File downloads : 1899