Fast and Slow Machine Learning (Apprentissage automatique rapide et lent)

Abstract : The Big Data era has revolutionized the way data is created and processed. In this context, multiple challenges arise from the massive amount of data that must be handled and processed efficiently in order to extract knowledge. This thesis explores the symbiosis of batch and stream learning, which are traditionally considered antagonists in the literature. We focus on the problem of classification from evolving data streams.

Batch learning is a well-established approach in machine learning based on a finite sequence: first data is collected, then a predictive model is created, then the model is applied. Stream learning, on the other hand, considers data as infinite, rendering the learning problem a continuous (never-ending) task. Furthermore, data streams can evolve over time, meaning that the relationship between features and the corresponding response (the class, in classification) can change.

We propose a systematic framework to predict over-indebtedness, a real-world problem with significant implications in modern society. The two versions of the early warning mechanism (batch and stream) outperform the baseline solution implemented by Groupe BPCE, the second largest banking institution in France. Additionally, we introduce a scalable model-based imputation method for missing data in classification. This method casts the imputation problem as a set of classification/regression tasks that are solved incrementally.

We present a unified framework that serves as a common learning platform where batch and stream methods can positively interact. We show that batch methods can be efficiently trained in the stream setting under specific conditions. The proposed hybrid solution relies on the positive interactions between batch and stream methods. We also propose an adaptation of the Extreme Gradient Boosting (XGBoost) algorithm for evolving data streams. The proposed adaptive method generates and updates the ensemble incrementally using mini-batches of data. Finally, we introduce scikit-multiflow, an open source Python framework that fills the need for a development/research platform for learning from evolving data streams.
Contributor : Abes Star
Submitted on : Friday, April 12, 2019 - 6:18:06 PM
Last modification on : Tuesday, April 16, 2019 - 9:48:55 AM


Version validated by the jury (STAR)


  • HAL Id : tel-02098633, version 1


Jacob Montiel López. Apprentissage automatique rapide et lent. Databases [cs.DB]. Université Paris-Saclay, 2019. French. ⟨NNT : 2019SACLT014⟩. ⟨tel-02098633⟩


