Skip to Main content Skip to Navigation

Adapting machine learning methods to U-statistics

Abstract : With the increasing availability of large amounts of data, computational complexity has become a keystone of many machine learning algorithms. Stochastic optimization algorithms and distributed/decentralized methods have been widely studied over the last decade and provide increased scalability for optimizing an empirical risk that is separable in the data sample. Yet, in a wide range of statistical learning problems, the risk is accurately estimated by U-statistics, i.e., functionals of the training data with low variance that take the form of averages over d-tuples. We first tackle the problem of sampling for the empirical risk minimization problem. We show that empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on O(n) terms only, usually referred to as incomplete U-statistics, without damaging the learning rate. We establish uniform deviation results and numerical examples show that such approach surpasses more naive subsampling techniques. We then focus on the decentralized estimation topic, where the data sample is distributed over a connected network. We introduce new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the U-statistic of interest. We establish convergence rate bounds with explicit data and network dependent terms. Finally, we deal with the decentralized optimization of functions that depend on pairs of observations. Similarly to the estimation case, we introduce a method based on concurrent local updates and data propagation. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. Our simulations illustrate the practical interest of our approach.
Complete list of metadata

Cited literature [56 references]  Display  Hide  Download
Contributor : ABES STAR :  Contact
Submitted on : Tuesday, February 6, 2018 - 1:16:11 AM
Last modification on : Friday, July 31, 2020 - 10:44:09 AM
Long-term archiving on: : Tuesday, May 8, 2018 - 6:21:48 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01701636, version 1


Igor Colin. Adapting machine learning methods to U-statistics. Machine Learning [cs.LG]. Télécom ParisTech, 2016. English. ⟨NNT : 2016ENST0070⟩. ⟨tel-01701636⟩



Record views


Files downloads