Y. Abbasi-Yadkori et al., Improved Algorithms for Linear Stochastic Bandits, Advances in Neural Information Processing Systems, 2011.

R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability, vol.27, issue.4, pp.1054-1078, 1995.

R. Agrawal et al., Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: finite parameter space, IEEE Transactions on Automatic Control, vol.34, issue.3, pp.258-267, 1989.
DOI : 10.1109/9.16415

S. Agrawal and N. Goyal, Analysis of Thompson Sampling for the multi-armed bandit problem, Proceedings of the 25th Conference On Learning Theory, 2012.

S. Agrawal and N. Goyal, Further Optimal Regret Bounds for Thompson Sampling, Proceedings of the 16th Conference on Artificial Intelligence and Statistics, 2013.

S. Agrawal and N. Goyal, Thompson Sampling for Contextual Bandits with Linear Payoffs, International Conference on Machine Learning (ICML), 2013.

J. Asmuth et al., A Bayesian sampling approach to exploration in reinforcement learning, Uncertainty in Artificial Intelligence (UAI), 2009.

J.-Y. Audibert and S. Bubeck, Regret Bounds and Minimax Policies under Partial Monitoring, Journal of Machine Learning Research, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654356

J.-Y. Audibert et al., Best Arm Identification in Multi-armed Bandits, Proceedings of the 23rd Conference on Learning Theory, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654404

J.-Y. Audibert et al., Exploration-exploitation trade-off using variance estimates in multi-armed bandits, Theoretical Computer Science, vol.410, 2009.
DOI : 10.1016/j.tcs.2009.01.016

P. Auer et al., Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

P. Auer et al., The Nonstochastic Multiarmed Bandit Problem, SIAM Journal on Computing, vol.32, issue.1, pp.48-77, 2002.
DOI : 10.1137/S0097539701398375

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.158

R. Bechhofer et al., Sequential identification and ranking procedures, 1968.

R. Bellman, The theory of dynamic programming, Bulletin of the American Mathematical Society, vol.60, issue.6, pp.503-515, 1954.
DOI : 10.1090/S0002-9904-1954-09848-8

R. Bellman, A problem in the sequential design of experiments, Sankhyā: The Indian Journal of Statistics, pp.221-229, 1956.

D. Berry and B. Fristedt, Bandit Problems: Sequential Allocation of Experiments, 1985.

P. Bickel and K. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, 2001.

C. Bishop, Pattern Recognition and Machine Learning, 2006.

S. Boucheron et al., Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00751496

R. Bradt et al., On Sequential Designs for Maximizing the Sum of n Observations, The Annals of Mathematical Statistics, vol.27, issue.4, pp.1060-1074, 1956.
DOI : 10.1214/aoms/1177728073

S. Bubeck, Jeux de bandits et fondations du clustering (Bandits games and clustering foundations), PhD thesis, 2010.

S. Bubeck and N. Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Foundations and Trends in Machine Learning, pp.1-122, 2012.
DOI : 10.1561/2200000024

S. Bubeck et al., Towards Minimax Policies for Online Linear Optimization with Bandit Feedback, Proceedings of the 25th Conference On Learning Theory, 2012.

S. Bubeck and C.-Y. Liu, Prior-free and prior-dependent regret bounds for Thompson Sampling, Advances in Neural Information Processing Systems, 2013.
DOI : 10.1109/CISS.2014.6814158

S. Bubeck et al., Pure exploration in finitely-armed and continuous-armed bandits, Theoretical Computer Science, vol.412, issue.19, pp.1832-1852, 2011.
DOI : 10.1016/j.tcs.2010.12.059

URL : https://hal.archives-ouvertes.fr/hal-00609550

S. Bubeck et al., Bounded regret in stochastic multi-armed bandits, Proceedings of the 26th Conference On Learning Theory, 2013.

S. Bubeck et al., Multiple Identifications in multi-armed bandits, International Conference on Machine Learning (ICML), 2013.

A. Burnetas and M. Katehakis, Optimal Adaptive Policies for Sequential Allocation Problems, Advances in Applied Mathematics, vol.17, issue.2, pp.122-142, 1996.
DOI : 10.1006/aama.1996.0007


A. Burnetas and M. Katehakis, Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem, Probability in the Engineering and Informational Sciences, pp.53-82, 2003.
DOI : 10.1017/S0269964803171045

N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, 2006.
DOI : 10.1017/CBO9780511546921

N. Cesa-Bianchi and G. Lugosi, Combinatorial bandits, Journal of Computer and System Sciences, vol.78, issue.5, pp.1404-1422, 2012.
DOI : 10.1016/j.jcss.2012.01.001

K. Chandrasekaran and R. Karp, Finding a most biased coin with fewest flips, Proceedings of the 27th Conference on Learning Theory, 2014.

F. Chang and T. Lai, Optimal stopping and dynamic allocation, Advances in Applied Probability, vol.19, pp.829-853, 1987.
DOI : 10.2307/1427104

O. Chapelle and L. Li, An empirical evaluation of Thompson Sampling, Advances in Neural Information Processing Systems, 2011.

O. Chapelle et al., Simple and Scalable Response Prediction for Display Advertising, ACM Transactions on Intelligent Systems and Technology, 2014.
DOI : 10.1145/2532128

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.389.7316

E. Contal et al., Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration, Proceedings of the European Conference on Machine Learning, 2013.
DOI : 10.1007/978-3-642-40988-2_15

URL : http://arxiv.org/abs/1304.5350

V. Dani et al., The Price of Bandit Information in Online Optimization, Advances in Neural Information Processing Systems, 2007.

V. Dani et al., Stochastic Linear Optimization under Bandit Feedback, Proceedings of the 21st Conference On Learning Theory, pp.355-366, 2008.

V. de la Peña et al., Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws, The Annals of Probability, vol.32, pp.1902-1933, 2004.

V. de la Peña et al., Self-Normalized Processes: Limit Theory and Statistical Applications, 2009.

R. Durrett, Probability: Theory and Examples, 2010.
DOI : 10.1017/CBO9780511779398

E. Even-Dar et al., Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, Journal of Machine Learning Research, vol.7, pp.1079-1105, 2006.

D. Feldman, Contributions to the "two-armed bandit" problem, The Annals of Mathematical Statistics, vol.33, issue.3, pp.947-956, 1962.
DOI : 10.1214/aoms/1177704454

S. Filippi et al., Optimism in reinforcement learning and Kullback-Leibler divergence, 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2010.
DOI : 10.1109/ALLERTON.2010.5706896

URL : https://hal.archives-ouvertes.fr/hal-00476116

S. Filippi et al., Parametric Bandits: The Generalized Linear Case, Advances in Neural Information Processing Systems, 2010.

R. Fonteneau et al., An optimistic posterior sampling strategy for Bayesian reinforcement learning, Workshop on Bayesian Optimization, NIPS, 2013.

E. Frostig and G. Weiss, Four proofs of Gittins' multiarmed bandit theorem, Annals of Operations Research, 1999.
DOI : 10.1007/s10479-013-1523-0

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.295.444

V. Gabillon et al., Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Advances in Neural Information Processing Systems, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00747005

A. Garivier, Informational confidence bounds for self-normalized averages and applications, IEEE Information Theory Workshop (ITW), 2013.
DOI : 10.1109/ITW.2013.6691311

URL : https://hal.archives-ouvertes.fr/hal-00862062

A. Garivier and O. Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, Proceedings of the 24th Conference on Learning Theory, 2011.

A. Garivier and E. Moulines, On Upper-Confidence Bound Policies for Switching Bandit Problems, Proceedings of the 22nd Conference on Algorithmic Learning Theory, 2011.
DOI : 10.1007/978-3-642-24412-4_16

J. Ginebra and M. Clayton, Small-sample frequentist properties of Bernoulli two-armed bandit Bayesian strategies, 1994.

J. Ginebra and M. Clayton, Small-sample performance of Bernoulli two-armed bandit Bayesian strategies, Journal of Statistical Planning and Inference, vol.79, issue.1, pp.107-122, 1999.
DOI : 10.1016/S0378-3758(98)00230-4

J. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society, Series B, vol.41, issue.2, pp.148-177, 1979.
DOI : 10.1002/9780470980033

J. Gittins and D. Jones, A dynamic allocation index for the sequential design of experiments, Progress in Statistics (proceedings of the 1972 European Meeting of Statisticians), 1974.

A. Gopalan et al., Thompson Sampling for Complex Online Problems, International Conference on Machine Learning (ICML), 2014.

O. Granmo, Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton, International Journal of Intelligent Computing and Cybernetics, vol.3, issue.2, pp.207-234, 2010.
DOI : 10.1108/17563781011049179

T. Graves and T. Lai, Asymptotically Efficient Adaptive Choice of Control Laws in Controlled Markov Chains, SIAM Journal on Control and Optimization, vol.35, issue.3, pp.715-743, 1997.
DOI : 10.1137/S0363012994275440

S. Guha and K. Munagala, Stochastic Regret Minimization via Thompson Sampling, Proceedings of the 27th Conference On Learning Theory, 2014.

V. Heidrich-Meisner and C. Igel, Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553426

W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, vol.58, pp.13-30, 1963.

M. Hoffman et al., On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning, Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.

J. Honda and A. Takemura, An Asymptotically Optimal Bandit Algorithm for Bounded Support Models, Proceedings of the 23rd Conference on Learning Theory, 2010.

J. Honda and A. Takemura, Optimality of Thompson Sampling for Gaussian Bandits depends on priors, Proceedings of the 17th Conference on Artificial Intelligence and Statistics, 2014.

T. Jaksch et al., Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, vol.11, pp.1563-1600, 2010.

K. Jamieson et al., lil'UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits, Proceedings of the 27th Conference on Learning Theory, 2014.

H. Jeffreys, An Invariant Form for the Prior Probability in Estimation Problems, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol.186, issue.1007, pp.453-461, 1946.
DOI : 10.1098/rspa.1946.0056

C. Jennison et al., Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means, Statistical Decision Theory and Related Topics III, pp.55-86, 1982.

S. Kalyanakrishnan and P. Stone, Efficient Selection in Multiple Bandit Arms: Theory and Practice, International Conference on Machine Learning (ICML), 2010.

S. Kalyanakrishnan et al., PAC subset selection in stochastic multi-armed bandits, International Conference on Machine Learning (ICML), 2012.

Z. Karnin et al., Almost optimal Exploration in multi-armed bandits, International Conference on Machine Learning (ICML), 2013.

M. Katehakis and H. Robbins, Sequential choice from several populations, Proceedings of the National Academy of Sciences, pp.8584-8585, 1995.
DOI : 10.1073/pnas.92.19.8584

URL : http://www.ncbi.nlm.nih.gov/pmc/articles/PMC41010

E. Kaufmann et al., On Bayesian Upper-Confidence Bounds for Bandit Problems, Proceedings of the 15th Conference on Artificial Intelligence and Statistics, 2012.

E. Kaufmann et al., On the Complexity of A/B Testing, Proceedings of the 27th Conference On Learning Theory, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00990254

E. Kaufmann et al., On the Complexity of Best Arm Identification in Multi-Armed Bandit Models, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01024894

E. Kaufmann and S. Kalyanakrishnan, Information complexity in bandit subset selection, Proceedings of the 26th Conference On Learning Theory, 2013.

E. Kaufmann et al., Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, Proceedings of the 23rd Conference on Algorithmic Learning Theory, 2012.
DOI : 10.1007/978-3-642-34106-9_18

URL : https://hal.archives-ouvertes.fr/hal-00830033

N. Korda et al., Thompson Sampling for 1-dimensional Exponential Family Bandits, Advances in Neural Information Processing Systems, 2013.

A. Krause and C. Ong, Contextual Gaussian Process Bandit Optimization, Advances in Neural Information Processing Systems, 2011.

T. Lai, Adaptive Treatment Allocation and the Multi-Armed Bandit Problem, The Annals of Statistics, vol.15, issue.3, pp.1091-1114, 1987.
DOI : 10.1214/aos/1176350495

T. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, issue.1, pp.4-22, 1985.
DOI : 10.1016/0196-8858(85)90002-8

B. Laurent and P. Massart, Adaptive estimation of a quadratic functional of a density by model selection, The Annals of Statistics, vol.28, issue.5, pp.1302-1338, 2000.
DOI : 10.1051/ps:2005001

M. Lelarge et al., Spectrum bandit optimization, IEEE Information Theory Workshop (ITW), 2013.
DOI : 10.1109/ITW.2013.6691221

URL : https://hal.archives-ouvertes.fr/hal-00917427

B. Levin and C. Leu, On a Conjecture of Bechhofer, Kiefer, and Sobel for the Levin-Robbins-Leu Binomial Subset Selection Procedures, Sequential Analysis, vol.27, pp.106-125, 2008.

O.-A. Maillard et al., A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences, Proceedings of the 24th Conference On Learning Theory, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00574987

S. Mannor and J. Tsitsiklis, The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, Journal of Machine Learning Research, pp.623-648, 2004.

O. Maron and A. Moore, The Racing Algorithm: Model Selection for Lazy Learners, Artificial Intelligence Review, vol.11, pp.113-131, 1997.
DOI : 10.1007/978-94-017-2053-3_8

J. Mellor and J. Shapiro, Thompson Sampling in Switching Environments with Bayesian Online Change Point Detection, Proceedings of the 16th Conference on Artificial Intelligence and Statistics, 2013.

V. Mnih et al., Empirical Bernstein stopping, Proceedings of the 25th International Conference on Machine Learning, ICML '08, 2008.
DOI : 10.1145/1390156.1390241

URL : https://hal.archives-ouvertes.fr/hal-00834983

M. Naghshvar and T. Javidi, Active sequential hypothesis testing, The Annals of Statistics, vol.41, issue.6, pp.2703-2738, 2013.
DOI : 10.1214/13-AOS1144SUPP

URL : http://arxiv.org/abs/1203.4626

J. Neveu, Martingales à temps discret, 1972.

J. Niño-Mora, Computing a Classic Index for Finite-Horizon Bandits, INFORMS Journal on Computing, vol.23, issue.2, pp.254-267, 2011.
DOI : 10.1287/ijoc.1100.0398

I. Osband et al., (More) Efficient Reinforcement Learning via Posterior Sampling, Advances in Neural Information Processing Systems, 2013.

E. Paulson, A Sequential Procedure for Selecting the Population with the Largest Mean from k Normal Populations, The Annals of Mathematical Statistics, vol.35, issue.1, pp.174-180, 1964.
DOI : 10.1214/aoms/1177703739

N. Pavlidis et al., Simulation studies of multi-armed bandits with covariates, Proceedings of the 10th International Conference on Computer Modeling, 2008.

M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, vol.58, issue.5, pp.527-535, 1952.
DOI : 10.1090/S0002-9904-1952-09620-8

H. Robbins, Statistical Methods Related to the Law of the Iterated Logarithm, The Annals of Mathematical Statistics, vol.41, issue.5, pp.1397-1409, 1970.
DOI : 10.1214/aoms/1177696786

P. Rusmevichientong and J. Tsitsiklis, Linearly Parameterized Bandits, Mathematics of Operations Research, vol.35, issue.2, pp.395-411, 2010.
DOI : 10.1287/moor.1100.0446

URL : http://arxiv.org/abs/0812.3465

D. Russo and B. Van Roy, Learning to Optimize via Posterior Sampling, Mathematics of Operations Research, vol.39, issue.4, 2014.
DOI : 10.1287/moor.2014.0650

A. Salomon and J.-Y. Audibert, Deviations of Stochastic Bandit Regret, Proceedings of the 22nd Conference on Algorithmic Learning Theory, 2011.
DOI : 10.1007/978-3-642-24412-4_15

URL : https://hal.archives-ouvertes.fr/hal-00624461

A. Schreck et al., A Shrinkage-Thresholding Metropolis Adjusted Langevin Algorithm for Bayesian Variable Selection, IEEE Journal of Selected Topics in Signal Processing, vol.10, issue.2, 2013.
DOI : 10.1109/JSTSP.2015.2496546

URL : https://hal.archives-ouvertes.fr/hal-00921130

S. Scott, A modern Bayesian look at the multi-armed bandit, Applied Stochastic Models in Business and Industry, vol.9, issue.2, pp.639-658, 2010.
DOI : 10.1002/asmb.874

D. Siegmund, Sequential Analysis, 1985.
DOI : 10.1007/978-1-4757-1862-1

W. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, issue.3-4, pp.285-294, 1933.
DOI : 10.1093/biomet/25.3-4.285

W. Thompson, On the Theory of Apportionment, American Journal of Mathematics, vol.57, issue.2, pp.450-456, 1935.
DOI : 10.2307/2371219

M. Valko et al., Finite-time analysis of kernelized contextual bandits, 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

A. Wald, Sequential Tests of Statistical Hypotheses, The Annals of Mathematical Statistics, vol.16, issue.2, pp.117-186, 1945.
DOI : 10.1214/aoms/1177731118

L. Wasserman, All of Statistics: A concise course in statistical inference, 2010.
DOI : 10.1007/978-0-387-21736-9

R. Weber, On the Gittins Index for Multiarmed Bandits, The Annals of Applied Probability, vol.2, issue.4, pp.1024-1033, 1992.
DOI : 10.1214/aoap/1177005588