Fast and Slow Machine Learning (Apprentissage automatique rapide et lent)

Abstract : The Big Data era has revolutionized the way in which data is created and processed. In this context, multiple challenges arise from the massive amount of data that must be handled and processed efficiently in order to extract knowledge. This thesis explores the symbiosis of batch and stream learning, which the literature traditionally treats as antagonists. We focus on the problem of classification from evolving data streams.

Batch learning is a well-established approach in machine learning based on a finite sequence: first data is collected, then predictive models are created, then the model is applied. Stream learning, on the other hand, considers data as infinite, which turns learning into a continuous (never-ending) task. Furthermore, data streams can evolve over time, meaning that the relationship between the features and the corresponding response (the class, in classification) can change.

We propose a systematic framework to predict over-indebtedness, a real-world problem with significant implications in modern society. The two versions of the early warning mechanism (batch and stream) outperform the baseline solution implemented by Groupe BPCE, the second largest banking institution in France. Additionally, we introduce a scalable model-based imputation method for missing data in classification. This method casts the imputation problem as a set of classification/regression tasks which are solved incrementally.

We present a unified framework that serves as a common learning platform where batch and stream methods can positively interact. We show that batch methods can be efficiently trained in the stream setting under specific conditions. The proposed hybrid solution builds on these positive interactions between batch and stream methods. We also propose an adaptation of the Extreme Gradient Boosting (XGBoost) algorithm for evolving data streams. The proposed adaptive method generates and updates the ensemble incrementally using mini-batches of data. Finally, we introduce scikit-multiflow, an open source framework in Python that fills the need for a development/research platform for learning from evolving data streams.
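As an illustration of the stream setting described in the abstract, the following is a minimal sketch of test-then-train (prequential) evaluation over mini-batches on a synthetic stream whose feature-to-class relationship flips halfway through. It is not the thesis's method and does not use scikit-multiflow or XGBoost: the names `drifting_stream`, `OnlinePerceptron`, and `prequential`, as well as the batch size and learning rate, are hypothetical choices made only for this example.

```python
import random


def drifting_stream(n, drift_at, seed=0):
    """Yield (x, y) pairs; the feature-to-class relationship flips at drift_at."""
    rng = random.Random(seed)
    for i in range(n):
        x = rng.uniform(-1.0, 1.0)
        # Before the drift the class is 1 when x > 0; afterwards it is inverted.
        y = int(x > 0) if i < drift_at else int(x <= 0)
        yield x, y


class OnlinePerceptron:
    """Hypothetical stand-in learner: one weight and a bias, updated per sample."""

    def __init__(self, lr=0.1):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def predict(self, x):
        return int(self.w * x + self.b > 0)

    def partial_fit(self, batch):
        # Classic perceptron rule: weights change only on misclassified samples.
        for x, y in batch:
            err = y - self.predict(x)
            self.w += self.lr * err * x
            self.b += self.lr * err


def prequential(stream, model, batch_size=50):
    """Test-then-train: score each mini-batch before learning from it."""
    accuracies, batch = [], []
    for sample in stream:
        batch.append(sample)
        if len(batch) == batch_size:
            correct = sum(model.predict(x) == y for x, y in batch)
            accuracies.append(correct / batch_size)
            model.partial_fit(batch)
            batch = []
    return accuracies


accuracies = prequential(drifting_stream(2000, drift_at=1000), OnlinePerceptron())
for i, acc in enumerate(accuracies):
    print(f"mini-batch {i:2d}: accuracy {acc:.2f}")
```

In this sketch, accuracy falls sharply on the first mini-batch after the drift point, because the model is scored before it has seen any sample of the new concept, and it recovers only as later mini-batches update the weights. A model trained once on the first half of the data and never updated would stay wrong, which is the motivation for incremental updates on evolving streams.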
Document type :
Theses
https://pastel.archives-ouvertes.fr/tel-02098633
Contributor : Abes Star
Submitted on : Friday, April 12, 2019 - 6:18:06 PM
Last modification on : Friday, May 17, 2019 - 12:56:46 PM

File

75748_MONTIEL_LOPEZ_2019_archi...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02098633, version 1

Citation

Jacob Montiel López. Apprentissage automatique rapide et lent. Base de données [cs.DB]. Université Paris-Saclay, 2019. Français. ⟨NNT : 2019SACLT014⟩. ⟨tel-02098633⟩
