X. S. Hu, S. Zagoruyko, and N. Komodakis, Exploring weight symmetry in deep neural networks, 2018.

A. Achille and S. Soatto, Emergence of invariance and disentangling in deep representations, 2017.

A. Achille and S. Soatto, Information dropout: Learning optimal representations through noisy computation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

D. Barber and F. Agakov, The IM algorithm: A variational approach to information maximization, Advances in Neural Information Processing Systems, vol.16, 2003.

A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, Deep variational information bottleneck, 2016.

A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy, An information-theoretic analysis of deep latent-variable models, 2017.

Alibaba's AI outguns humans in reading test, 2018.

R. Amit and R. Meir, Meta-learning by adjusting priors based on extended pac-bayes theory, International Conference on Machine Learning, pp.205-214, 2018.

S. Arimoto, An algorithm for computing the capacity of arbitrary discrete memoryless channels, IEEE Transactions on Information Theory, vol.18, issue.1, pp.14-20, 1972.

S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, Stronger generalization bounds for deep nets via a compression approach, 2018.

Y. Aytar and A. Zisserman, Tabula rasa: Model transfer for object category detection, IEEE International Conference on Computer Vision (ICCV), pp.2252-2259, 2011.

J. Ba and R. Caruana, Do deep nets really need to be deep?, Advances in neural information processing systems, pp.2654-2662, 2014.

F. Bach, Breaking the curse of dimensionality with convex neural networks, Journal of Machine Learning Research, vol.18, issue.19, pp.1-53, 2017.

W. Balzer, M. Takahashi, J. Ohta, and K. Kyuma, Weight quantization in boltzmann machines, Neural Networks, vol.4, issue.3, pp.405-409, 1991.

A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters, vol.31, issue.3, pp.167-175, 2003.

A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 2013.

D. Belanger and A. Mccallum, Structured prediction energy networks, International Conference on Machine Learning, vol.2, pp.983-992, 2016.

D. Belanger, D. Sheldon, and A. McCallum, Marginal inference in MRFs using Frank-Wolfe, NIPS Workshop on Greedy Optimization, 2013.

M. I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville, MINE: Mutual information neural estimation, 2018.

Y. Bengio, S. Bengio, and J. Cloutier, Learning a synaptic learning rule, IJCNN-91-Seattle International Joint Conference on Neural Networks, vol.2, p.969, 1991.

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE transactions on neural networks, vol.5, issue.2, pp.157-166, 1994.

Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte, Convex neural networks, Advances in Neural Information Processing Systems, pp.123-130, 2006.

Y. Bengio, N. Léonard, and A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, 2013.

J. O. Berger, Statistical Decision Theory and Bayesian Analysis, 1985.

U. Bertele and F. Brioschi, Nonserial dynamic programming, 1972.

L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi, Learning feed-forward one-shot learners, Advances in Neural Information Processing Systems, pp.523-531, 2016.

L. Bertinetto, J. F. Henriques, P. H. S. Torr, and A. Vedaldi, Meta-learning with differentiable closed-form solvers, arXiv, 2018.

D. P. Bertsekas, The method of multipliers for equality constraints, in Constrained Optimization and Lagrange Multiplier Methods, 1982.

D. P. Bertsekas, Nonlinear programming, Athena Scientific, 1999.

C. M. Bishop, Mixture density networks, Technical report, 1994.

R. Blahut, Computation of channel capacity and rate-distortion functions, IEEE transactions on Information Theory, vol.18, issue.4, pp.460-473, 1972.

M. B. Blaschko and C. H. Lampert, Learning to localize objects with structured output regression, European Conference on Computer Vision, pp.2-15, 2008.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research, vol.3, pp.993-1022, 2003.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association, vol.112, issue.518, pp.859-877, 2017.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, Learnability and the vapnik-chervonenkis dimension, Journal of the ACM (JACM), vol.36, issue.4, pp.929-965, 1989.

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, Weight uncertainty in neural networks, 2015.

J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd, First order methods beyond convexity and lipschitz gradient continuity with applications to quadratic inverse problems, SIAM Journal on Optimization, vol.28, issue.3, pp.2131-2151, 2018.

A. Boulch, Sharesnet: reducing residual network parameter number by sharing weights, Proceedings of the International Conference on Learning Representations, 2017.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, vol.3, pp.1-122, 2011.

Y. Boykov, O. Veksler, and R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.23, issue.11, pp.1222-1239, 2001.

J. Bruna and S. Mallat, Invariant scattering convolution networks, IEEE transactions on pattern analysis and machine intelligence, vol.35, pp.1872-1886, 2013.

W. Bulten, Getting started with GANs part 2: Colorful MNIST, blog post.

A. Canziani, A. Paszke, and E. Culurciello, An analysis of deep neural network models for practical applications, 2016.

F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi, Domain generalization by solving jigsaw puzzles, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

M. A. Carreira-Perpiñán and Y. Idelbayev, Model compression as constrained optimization, with application to neural nets, 2017.

R. Caruana, Learning many related tasks at the same time with backpropagation, Advances in neural information processing systems, pp.657-664, 1995.

P. Chaudhari and S. Soatto, Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks, 2018 Information Theory and Applications Workshop (ITA), pp.1-10, 2018.

C. Chen, B. He, Y. Ye, and X. Yuan, The direct extension of admm for multi-block convex minimization problems is not necessarily convergent, Mathematical Programming, vol.155, issue.1-2, pp.57-79, 2016.

S. Chen, C. Zhang, and M. Dong, Coupled end-to-end transfer learning with generalized fisher information, Computer Vision and Pattern Recognition, 2018.

W. Chen, Y. Liu, Z. Kira, Y. Wang, and J. Huang, A closer look at few-shot classification, 2019.

W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, Compressing neural networks with the hashing trick, International Conference on Machine Learning, pp.2285-2294, 2015.

W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, Compressing convolutional neural networks in the frequency domain, Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp.1475-1484, 2016.

K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.

H. Choi, O. Meshi, and N. Srebro, Fast and scalable structural svm with slack rescaling, Artificial Intelligence and Statistics, pp.667-675, 2016.

M. Collins, Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms, Proceedings of the ACL-02 conference on Empirical methods in natural language processing, vol.10, pp.1-8, 2002.

M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett, Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks, JMLR, vol.9, pp.1775-1822, 2008.

M. Courbariaux, Y. Bengio, and J. David, Binaryconnect: Training deep neural networks with binary weights during propagations, Advances in neural information processing systems, pp.3123-3131, 2015.

M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1, 2016.

T. M. Cover and J. A. Thomas, Elements of information theory, Wiley.

R. T. Cox, Probability, frequency and reasonable expectation, American Journal of Physics, vol.14, issue.1, pp.1-13, 1946.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-shwartz, and Y. Singer, Online passive-aggressive algorithms, Journal of Machine Learning Research, vol.7, pp.551-585, 2006.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, vol.2, issue.4, pp.303-314, 1989.

B. Dai, C. Zhu, and D. Wipf, Compressing neural networks using the variational information bottleneck, 2018.

A. Defazio, F. Bach, and S. Lacoste-julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, NIPS, pp.1646-1654, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01016843

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., Imagenet: A large-scale hierarchical image database, Computer Vision and Pattern Recognition, pp.248-255, 2009.

M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. De-freitas, Predicting parameters in deep learning, Advances in Neural Information Processing Systems, vol.26, pp.2148-2156, 2013.

E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, Advances in Neural Information Processing Systems, pp.1269-1277, 2014.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

O. Devolder, F. Glineur, and Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, vol.146, issue.1-2, pp.37-75, 2014.

L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, Sharp minima can generalize for deep nets, 2017.

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, et al., Flownet: Learning optical flow with convolutional networks, Proceedings of the IEEE International Conference on Computer Vision, pp.2758-2766, 2015.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol.12, pp.2121-2159, 2011.

G. K. Dziugaite and D. M. Roy, Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data, 2017.

D. Eigen, C. Puhrsch, and R. Fergus, Depth map prediction from a single image using a multi-scale deep network, Advances in Neural Information Processing Systems, pp.2366-2374, 2014.

E. Fiesler, A. Choudry, and H. Caulfield, Weight discretization paradigm for optical neural networks, Optical interconnections and networks, vol.1281, pp.164-174, 1990.

T. Finley and T. Joachims, Training structural SVMs when exact inference is intractable, International Conference on Machine Learning (ICML), pp.304-311, 2008.

C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.1126-1135, 2017.

M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval research logistics quarterly, vol.3, issue.1-2, pp.95-110, 1956.

T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, Born again neural networks, 2018.

S. Furuichi, Information theoretical properties of Tsallis entropies, Journal of Mathematical Physics, vol.47, issue.2, p.23302, 2006.

M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, et al., Conditional neural processes, Proceedings of the 35th International Conference on Machine Learning, vol.80, pp.1704-1713, 2018.

M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. M. A. Eslami, and Y. W. Teh, Neural processes, 2018.

S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.6, pp.721-741, 1984.

P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien, PAC-Bayesian theory meets Bayesian inference, Advances in Neural Information Processing Systems, pp.1884-1892, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01324072

S. Gidaris and N. Komodakis, Dynamic few-shot visual learning without forgetting, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4367-4375, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01829985

S. Gidaris, P. Singh, and N. Komodakis, Unsupervised representation learning by predicting image rotations, International Conference on Learning Representations, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01832768

S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord, Boosting few-shot visual learning with self-supervision, 2019.

G. Gidel, F. Pedregosa, and S. Lacoste-julien, Frank-wolfe splitting via augmented lagrangian method, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, vol.84, pp.1456-1465, 2018.

R. Girshick, Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision, pp.1440-1448, 2015.

R. Giryes, G. Sapiro, and A. M. Bronstein, Deep neural networks with random gaussian weights: A universal classification strategy?, IEEE Trans. Signal Processing, vol.64, issue.13, pp.3444-3457, 2016.

A. Globerson and T. Jaakkola, Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations, NIPS, 2007.

A. Globerson and T. Jaakkola, Convergent propagation algorithms via oriented trees, UAI, pp.133-140, 2007.

F. Gomez and J. Schmidhuber, Evolving modular fast-weight networks for control, International Conference on Artificial Neural Networks, pp.383-389, 2005.

Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing deep convolutional networks using vector quantization, 2014.

I. Good, Some history of the hierarchical bayesian methodology. Trabajos de estadística y de investigación operativa, vol.31, p.489, 1980.

J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. E. Turner, Decision-theoretic meta-learning: Versatile and efficient amortization of few-shot learning, 2018.

K. Goto and R. Van-de-geijn, High-performance implementation of the level-3 blas, ACM Trans. Math. Softw, vol.35, issue.1, 2008.

P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017.

E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths, Recasting gradient-based meta-learning as hierarchical bayes, 2018.

A. Graves, Practical variational inference for neural networks, Advances in neural information processing systems, pp.2348-2356, 2011.

K. Greff, R. Srivastava, and J. Schmidhuber, Highway and residual networks learn unrolled iterative estimation, Proceedings of the International Conference on Learning Representations, 2017.

P. Grunwald, A tutorial introduction to the minimum description length principle, 2004.

Y. Guo, A. Yao, and Y. Chen, Dynamic network surgery for efficient dnns, Advances In Neural Information Processing Systems, pp.1379-1387, 2016.

D. Ha, A. Dai, and Q. V. Le, Hypernetworks, International Conference on Learning Representations (ICLR), 2017.

B. D. Haeffele and R. Vidal, Global optimality in neural network training, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.7331-7339, 2017.

S. Han, J. Pool, J. Tran, and W. Dally, Learning both weights and connections for efficient neural network, Advances in Neural Information Processing Systems, pp.1135-1143, 2015.

S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, International Conference on Learning Representations (ICLR), 2016.

S. Han, J. Pool, S. Narang, H. Mao, S. Tang, et al., DSD: Regularizing deep neural networks with dense-sparse-dense training flow, International Conference on Learning Representations, 2017.

N. Harvey, C. Liaw, and A. Mehrabian, Nearly-tight VC-dimension bounds for piecewise linear neural networks, Conference on Learning Theory, pp.1064-1068, 2017.

B. Hassibi, D. G. Stork, and G. J. Wolff, Optimal brain surgeon and general network pruning, IEEE International Conference on Neural Networks, pp.293-299, 1993.

W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, vol.57, issue.1, pp.97-109, 1970.

J. Haugeland, Artificial intelligence: The very idea, 1989.

T. Hazan and R. Urtasun, A primal-dual message-passing algorithm for approximated large scale structured prediction, NIPS, pp.838-846, 2010.

T. Hazan and A. Shashua, Norm-product belief propagation: Primal-dual message-passing for approximate inference, IEEE Transactions on Information Theory, vol.56, issue.12, pp.6294-6316, 2010.

T. Hazan, J. Keshet, and D. A. Mcallester, Direct loss minimization for structured prediction, Advances in Neural Information Processing Systems, pp.1594-1602, 2010.

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, Proceedings of the IEEE International Conference on Computer Vision, pp.1026-1034, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.

T. Heskes, Convexity arguments for efficient minimization of the bethe and kikuchi free energies, J. Artif. Intell. Res.(JAIR), vol.26, pp.153-190, 2006.

G. E. Hinton, Deep belief networks. Scholarpedia, vol.4, p.5947, 2009.

G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, 2015.

G. E. Hinton and D. van Camp, Keeping the neural networks simple by minimizing the description length of the weights, Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp.5-13, 1993.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

S. Hochreiter and J. Schmidhuber, Flat minima, Neural Computation, vol.9, issue.1, pp.1-42, 1997.

M. Hong and Z. Luo, On the linear convergence of the alternating direction method of multipliers, Mathematical Programming, vol.162, issue.1-2, pp.165-199, 2017.

M. Hong, T. Chang, X. Wang, M. Razaviyayn, S. Ma et al., A block successive upper bound minimization method of multipliers for linearly constrained convex optimization, 2014.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, et al., Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, 2016.

Z. Huang and N. Wang, Like what you like: Knowledge distill via neuron selectivity transfer, 2017.

I. Hubara, M. Courbariaux, D. Soudry, R. El-yaniv, and Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, The Journal of Machine Learning Research, vol.18, issue.1, pp.6869-6898, 2017.

F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size, 2016.

F. D. Igual, G. Quintana-ortí, and R. A. Van-de-geijn, Level-3 blas on a GPU : Picking the low hanging fruit. FLAME working note #37, 2009.

H. Inan, K. Khosravi, and R. Socher, Tying word vectors and word classifiers: A loss framework for language modeling, 2016.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp.448-456, 2015.

M. Jaderberg, A. Vedaldi, and A. Zisserman, Speeding up convolutional neural networks with low rank expansions, Proceedings of the British Machine Vision Conference (BMVC), 2014.

M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, et al., Decoupled neural interfaces using synthetic gradients, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.1627-1635, 2017.

E. T. Jaynes, Information theory and statistical mechanics, Physical Review, vol.106, issue.4, p.620, 1957.

E. T. Jaynes, Probability theory: the logic of science, 1996.

H. Jeffreys, Theory of Probability. The Clarendon Press, 1939.

X. Jia, B. De Brabandere, T. Tuytelaars, and L. Van Gool, Dynamic filter networks, Advances in Neural Information Processing Systems, pp.667-675, 2016.

J. Jin, A. Dundar, and E. Culurciello, Flattened convolutional neural networks for feedforward acceleration, 2014.

T. Joachims, T. Finley, and C. Yu, Cutting-plane training of structural svms, Machine Learning, vol.77, pp.27-59, 2009.

J. K. Johnson and A. S. Willsky, Convex relaxation methods for graphical models: Lagrangian and maximum entropy approaches, 2008.

R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in neural information processing systems, pp.315-323, 2013.

M. Jordan, Artificial intelligence: The revolution hasn't happened yet, 2019.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning, vol.37, issue.2, pp.183-233, 1999.

J. Kappes, B. Andres, F. Hamprecht, C. Schnorr, S. Nowozin, et al., A comparative study of modern inference techniques for discrete energy minimization problems, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1328-1335, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00865699

K. Kawaguchi, Deep learning without poor local minima, Advances in Neural Information Processing Systems, pp.586-594, 2016.

J. E. Kelley, The cutting-plane method for solving convex programs, Journal of the Society for Industrial and Applied Mathematics, vol.8, issue.4, pp.703-712, 1960.

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima, 2016.

J. M. Keynes, A treatise on probability, Courier Corporation, 1921.

H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh, Attentive neural processes, 2019.

Y. Kim and A. M. Rush, Sequence-level knowledge distillation, EMNLP, 2016.

Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, Character-aware neural language models, AAAI, 2016.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

D. P. Kingma and M. Welling, Auto-encoding variational bayes, 2013.

D. P. Kingma, T. Salimans, and M. Welling, Variational dropout and the local reparameterization trick, Advances in Neural Information Processing Systems, pp.2575-2583, 2015.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, Improved variational inference with inverse autoregressive flow, Advances in Neural Information Processing Systems, pp.4743-4751, 2016.

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences, 2017.

D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques, 2009.

A. Kolmogorov, Foundations of the Theory of Probability, 1933.

V. Koltchinskii, Rademacher penalties and structural risk minimization, IEEE Transactions on Information Theory, vol.47, issue.5, pp.1902-1914, 2001.

N. Komodakis, N. Paragios, and G. Tziritas, MRF optimization via dual decomposition: Message-passing revisited, ICCV, pp.1-8, 2007.

N. Komodakis, Efficient training for pairwise or higher order CRFs via dual decomposition, Computer Vision and Pattern Recognition (CVPR), pp.1841-1848, 2011.

T. Koo, A. Globerson, X. Carreras, and M. Collins, Structured prediction models via the matrix-tree theorem, Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.141-150, 2007.

R. G. Krishnan, S. Lacoste-Julien, and D. Sontag, Barrier Frank-Wolfe for marginal inference, Advances in Neural Information Processing Systems, pp.532-540, 2015.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, pp.1097-1105, 2012.

A. Krizhevsky, Learning multiple layers of features from tiny images, Technical report, 2009.

F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory, vol.47, issue.2, pp.498-519, 2001.

A. Kucukelbir and D. M. Blei, Population empirical bayes, 2014.

A. Kulesza and F. Pereira, Structured learning with approximate inference, Advances in Neural Information Processing Systems, pp.785-792, 2007.

K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l'institut Fourier, pp.769-783, 1998.

A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshkin, W. Chung, et al., Deep prior, 2017.

S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, ICML, pp.53-61, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720158

J. Lafferty, A. Mccallum, and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, International Conference on Machine Learning, 2001.

G. Lan and R. D. C. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming, Mathematical Programming, vol.155, issue.1-2, pp.511-547, 2016.

P.-S. Laplace,

N. Lawrence, Machine learning systems design, 2019.

R. Le Priol, A. Piché, and S. Lacoste-Julien, Adaptive stochastic dual coordinate ascent for conditional random fields, 2018.

V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, Speeding-up convolutional neural networks using fine-tuned cp-decomposition, International Conference on Learning Representations (ICLR), 2016.

Y. Lecun, J. S. Denker, and S. A. Solla, Optimal brain damage, Advances in Neural Information Processing Systems, vol.2, pp.598-605, 1990.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.86, issue.11, pp.2278-2324, 1998.

Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol.521, pp.436-444, 2015.

K. Lee, S. Maji, A. Ravichandran, and S. Soatto, Meta-learning with differentiable convex optimization, CVPR, 2019.

C. Li, H. Farkhoor, R. Liu, and J. Yosinski, Measuring the intrinsic dimension of objective landscapes, 2018.

D. Li, Y. Yang, Y. Song, and T. M. Hospedales, Deeper, broader and artier domain generalization, Proceedings of the IEEE International Conference on Computer Vision, pp.5542-5550, 2017.

H. Li, Z. Xu, G. Taylor, and T. Goldstein, Visualizing the loss landscape of neural nets, 2017.

H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang, Finding task-relevant features for few-shot learning by category traversal, CVPR, 2019.

Z. Li, F. Zhou, F. Chen, and H. Li, Meta-SGD: Learning to learn quickly for few-shot learning, 2017.

Z. Li and D. Hoiem, Learning without forgetting, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

H. Lin, J. Mairal, and Z. Harchaoui, QuickeNing: A generic quasi-Newton algorithm for faster gradient-based optimization, 2016.

M. Lin, Q. Chen, and S. Yan, Network in network, 2013.

Z. Lin, R. Memisevic, and K. Konda, How far can we go without convolution: Improving fully-connected networks, 2015.

R. Linsker, An application of the principle of maximum information preservation to linear systems, Advances in neural information processing systems, pp.186-194, 1989.

Y. Liu, J. Lee, M. Park, S. Kim, E. Yang et al., Learning to propagate labels: Transductive propagation network for few-shot learning, 2018.

Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, et al., Learning efficient convolutional networks through network slimming, IEEE International Conference on Computer Vision (ICCV), pp.2755-2763, 2017.

S. Łojasiewicz, Sur la géométrie semi- et sous-analytique, Annales de l'Institut Fourier, vol.43, issue.5, pp.1575-1595, 1993.

B. London, B. Huang, and L. Getoor, The benefits of learning with strongly convex approximate inference, ICML, pp.410-418, 2015.

C. Louizos, K. Ullrich, and M. Welling, Bayesian compression for deep learning, Advances in Neural Information Processing Systems, pp.3288-3298, 2017.

H. Lu and K. Kawaguchi, Depth creates no bad local minima, 2017.

N. Ma, X. Zhang, H. Zheng, and J. Sun, Shufflenet V2: Practical guidelines for efficient CNN architecture design, European Conference on Computer Vision (ECCV), pp.122-138, 2018.

W. Maass, Networks of spiking neurons: the third generation of neural network models, Neural networks, vol.10, issue.9, pp.1659-1671, 1997.

A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, Adversarial autoencoders, 2015.

S. Mandt, M. D. Hoffman, and D. M. Blei, Stochastic gradient descent as approximate bayesian inference, The Journal of Machine Learning Research, vol.18, issue.1, pp.4873-4907, 2017.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, Building a large annotated corpus of english: The penn treebank, Computational Linguistics, vol.19, pp.313-330, 1993.

A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing, AD3: Alternating directions dual decomposition for MAP inference in graphical models, JMLR, vol.16, issue.2, pp.495-545, 2015.

M. Mathieu, M. Henaff, and Y. Lecun, Fast training of convolutional networks through ffts, International Conference on Learning Representations, 2014.

D. A. McAllester, PAC-Bayesian stochastic model selection, Machine Learning, vol.51, pp.5-21, 2003.

J. McCarthy, Programs with common sense, RLE and MIT Computation Center, 1960.

W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics, vol.5, issue.4, pp.115-133, 1943.

S. Merity, C. Xiong, J. Bradbury, and R. Socher, Pointer sentinel mixture models, 2016.

O. Meshi, A. Jaimovich, A. Globerson, and N. Friedman, Convexifying the bethe free energy, UAI, 2009.

O. Meshi, T. Jaakkola, and A. Globerson, Convergence rate analysis of MAP coordinate minimization algorithms, NIPS, 2012.

O. Meshi, M. Mahdavi, and A. G. Schwing, Smooth and strong: MAP inference with linear convergence, NIPS, pp.298-306, 2015.

O. Meshi, N. Srebro, and T. Hazan, Efficient training of structured SVMs via soft constraints, AISTATS, pp.699-707, 2015.

O. Meshi, D. Sontag, A. Globerson, and T. S. Jaakkola, Learning efficiently with approximate inference via dual losses, ICML, pp.783-790, 2010.

O. Meshi, B. London, A. Weller, and D. Sontag, Train and test tightness of LP relaxations in structured prediction, Journal of Machine Learning Research, vol.20, issue.13, pp.1-34, 2019.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics, vol.21, pp.1087-1092, 1953.

T. Minka, Discriminative models, not discriminative training, Microsoft Research Technical Report, 2005.

M. L. Minsky, Logical versus analogical or symbolic versus connectionist or neat versus scruffy, AI Magazine, vol.12, issue.2, 1991.

N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, A simple neural attentive meta-learner, 2017.

K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.

R. Nath, S. Tomov, T. Dong, and J. Dongarra, Optimizing symmetric dense matrix-vector multiplication on GPUs, High Performance Computing, Networking, Storage and Analysis (SC), 2011.

Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, vol.103, issue.1, pp.127-152, 2005.

Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol.87, 2013.

A. Newell and H. A. Simon, GPS, a program that simulates human thought, 1961.

B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, Exploring generalization in deep learning, Advances in Neural Information Processing Systems, pp.5947-5956, 2017.

B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, Towards understanding the role of over-parametrization in generalization of neural networks, 2018.

A. Y. Ng and M. I. Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes, Advances in Neural Information Processing Systems, pp.841-848, 2002.

Q. Nguyen and M. Hein, Optimization landscape and expressivity of deep cnns, International Conference on Machine Learning, pp.3727-3736, 2018.

A. Nichol, J. Achiam, and J. Schulman, On first-order meta-learning algorithms, 2018.

R. Novak, Y. Bahri, A. Daniel, J. Abolafia, J. Pennington et al., Sensitivity and generalization in neural networks: an empirical study, 2018.

S. J. Nowlan and G. E. Hinton, Simplifying neural networks by soft weight-sharing, Neural Computation, vol.4, issue.4, pp.473-493, 1992.

S. Nowozin and C. H. Lampert, Structured learning and prediction in computer vision, Foundations and Trends in Computer Graphics and Vision, vol.6, issue.3-4, pp.185-365, 2011.

J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, and H. Koepke, Coordinate descent converges faster with the Gauss-Southwell rule than random selection, ICML, pp.1632-1641, 2015.

P. Okunev and C. R. Johnson, Necessary and sufficient conditions for existence of the LU factorization of an arbitrary matrix, 2005.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, Wavenet: A generative model for raw audio, 2016.

OpenAI, OpenAI Dota 2 1v1 bot.

B. N. Oreshkin, P. Rodríguez López, and A. Lacoste, TADAM: Task dependent adaptive metric for improved few-shot learning, Advances in Neural Information Processing Systems (NIPS), 2018.

A. E. Orhan and X. Pitkow, Skip connections eliminate singularities, 2017.

S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, vol.22, issue.10, pp.1345-1359, 2010.

G. Parisi, Statistical field theory, 1988.

J. Pearl, Bayesian networks: A model of self-activated memory for evidential reasoning, Proc. of Cognitive Science Society (CSS-7), 1985.

J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988.

G. Valle-Pérez, C. Q. Camargo, and A. A. Louis, Deep learning generalizes because the parameter-function map is biased towards simple functions, 2018.

C. Peterson, A mean field theory learning algorithm for neural networks, Complex systems, vol.1, pp.995-1019, 1987.

P. Pletscher, C. S. Ong, and J. M. Buhmann, Spanning tree approximations for conditional random fields, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pp.408-415, 2009.

P. Pletscher, C. S. Ong, and J. M. Buhmann, Entropy and margin maximization for structured output learning, ECML, pp.83-98, 2010.

L. Y. Pratt, Discriminability-based transfer between neural networks, Advances in Neural Information Processing Systems, pp.204-211, 1993.

S. Qiao, C. Liu, W. Shen, and A. L. Yuille, Few-shot image recognition by predicting parameters from activations, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

A. Quattoni and A. Torralba, Recognizing indoor scenes, Computer Vision and Pattern Recognition, pp.413-420, 2009.

M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, On the expressive power of deep neural networks, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.2847-2854, 2017.

M. Ranjbar, T. Lan, Y. Wang, S. N. Robinovitch, Z.-N. Li, and G. Mori, Optimizing nondecomposable loss functions in structured prediction, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, pp.911-924, 2013.

A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, Semi-supervised learning with ladder networks, Advances in Neural Information Processing Systems, pp.3546-3554, 2015.

M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks, European Conference on Computer Vision, pp.525-542, 2016.

S. Ravi and A. Beatson, Amortized bayesian meta-learning, International Conference on Learning Representations (ICLR), 2018.

S. Ravi and H. Larochelle, Optimization as a model for few-shot learning, International Conference on Learning Representations, 2016.

P. Ravikumar and J. Lafferty, Quadratic programming relaxations for metric labeling and markov random field map estimation, Proceedings of the 23rd International Conference on Machine Learning, pp.737-744, 2006.

P. Ravikumar, A. Agarwal, and M. Wainwright, Message-passing for graph-structured linear programs: Proximal methods and rounding schemes, Journal of Machine Learning Research, vol.11, pp.1043-1080, 2010.

D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, 2014.

M. Richardson and P. Domingos, Markov logic networks, Machine learning, vol.62, issue.1-2, pp.107-136, 2006.

P. Richtárik and M. Takáč, Stochastic reformulations of linear systems: algorithms and convergence theory, 2017.

J. Rissanen, Modeling by shortest data description, Automatica, vol.14, issue.5, pp.465-471, 1978.

H. Robbins, An empirical bayes approach to statistics, Herbert Robbins Selected Papers, pp.41-47, 1985.

H. Robbins and S. Monro, A stochastic approximation method. The annals of mathematical statistics, pp.400-407, 1951.

C. Robert and G. Casella, Monte Carlo statistical methods, 2013.

A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, Fitnets: Hints for thin deep nets, 2014.

N. Rosenfeld, O. Meshi, D. Tarlow, and A. Globerson, Learning structured models with the auc loss and its generalizations, Artificial Intelligence and Statistics, pp.841-849, 2014.

N. Le Roux, M. Schmidt, and F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, NIPS, pp.2663-2671, 2012.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol.323, issue.6088, p.533, 1986.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.

S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2009.

A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu et al., Meta-learning with latent embedding optimization, International Conference on Learning Representations, 2019.

L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, Empirical analysis of the hessian of over-parametrized neural networks, 2017.

R. Salakhutdinov and G. Hinton, Deep boltzmann machines, Artificial intelligence and statistics, pp.448-455, 2009.

A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, et al., A simple neural network module for relational reasoning, NIPS, 2017.

V. Garcia Satorras and J. Bruna, Few-shot learning with graph neural networks, arXiv:1711.04043, 2017.

B. Savchynskyy, J. Kappes, S. Schmidt, and C. Schnörr, A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling, CVPR, pp.1817-1823, 2011.

A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, et al., On the information bottleneck theory of deep learning, International Conference on Learning Representations (ICLR), 2018.

T. Schlegl, J. Ofner, and G. Langs, Unsupervised pre-training across image domains improves lung tissue classification, International MICCAI Workshop on Medical Computer Vision, pp.82-93, 2014.

J. Schmidhuber, Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook, Diploma thesis, 1987.

J. Schmidhuber, Learning to control fast-weight memories: An alternative to dynamic recurrent networks, Neural Computation, vol.4, issue.1, pp.131-139, 1992.

J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks, vol.61, issue.1, pp.85-117, 2015.

M. Schmidt, N. Le Roux, and F. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, NIPS, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00618152

M. Schmidt, R. Babanezhad, M. Ahmed, A. Defazio, A. Clifton, et al., Non-uniform stochastic average gradient method for training conditional random fields, AISTATS, pp.819-828, 2015.

S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, Mathematical Programming, vol.155, pp.105-145, 2016.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Mathematical Programming, vol.127, pp.3-30, 2011.

C. E. Shannon, A mathematical theory of communication, Bell system technical journal, vol.27, issue.3, pp.379-423, 1948.

J. Shotton, J. Winn, C. Rother, and A. Criminisi, Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation, European Conference on Computer Vision, pp.1-15, 2006.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, et al., Mastering the game of go without human knowledge, Nature, vol.550, issue.7676, pp.354-359, 2017.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.

J. Snell, K. Swersky, and R. Zemel, Prototypical networks for few-shot learning, Advances in Neural Information Processing Systems, pp.4077-4087, 2017.

S. Sonnenburg, H. Strathmann, S. Lisitsyn, V. Gal, F. J. García et al., Alesis Novik, Abinash Panda, Evangelos Anagnostopoulos, Liang Pang, Alex Binder, serialhex, and Björn Esser. shogun-toolbox/shogun: Shogun 6.1.0, 2017.

D. Sontag and T. Jaakkola, Tree block coordinate descent for map in graphical models, Artificial Intelligence and Statistics, pp.544-551, 2009.

D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss, Tightening LP relaxations for MAP using message passing, UAI, pp.503-510, 2008.

D. Sontag, A. Globerson, and T. Jaakkola, Introduction to dual decomposition for inference, Optimization for Machine Learning, 2011.

D. Soudry and Y. Carmon, No bad local minima: Data independent training error guarantees for multilayer neural networks, 2016.

D. Soudry and E. Hoffer, Exponentially vanishing sub-optimal local minima in multilayer neural networks, 2017.

J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for simplicity: The all convolutional net, 2014.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, vol.15, pp.1929-1958, 2014.

R. K. Srivastava, K. Greff, and J. Schmidhuber, Training very deep networks, Advances in Neural Information Processing Systems, vol.28, pp.2377-2385, 2015.

R. L. Stratonovich, Conditional Markov processes, Non-linear Transformations of Stochastic Processes, pp.427-453, 1965.

Y. Sun, X. Wang, and X. Tang, Deep learning face representation from predicting 10,000 classes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1891-1898, 2014.

F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, Learning to compare: Relation network for few-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, et al., Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, 2015.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.2818-2826, 2016.

K. Tang, N. Ruozzi, D. Belanger, and T. Jebara, Bethe learning of graphical models via MAP decoding, AISTATS, pp.1096-1104, 2016.

B. Taskar, C. Guestrin, and D. Koller, Max-margin markov networks, Proceedings of the 16th International Conference on Neural Information Processing Systems, pp.25-32, 2003.

M. Teye, H. Azizpour, and K. Smith, Bayesian uncertainty estimation for batch normalized deep networks, International Conference on Machine Learning (ICML), 2018.

S. Thrun and L. Pratt, Learning to learn: Introduction and overview, Learning to learn, vol.6, pp.3-17, 1998.

N. Tishby and N. Zaslavsky, Deep learning and the information bottleneck principle, Information Theory Workshop (ITW), pp.1-5, 2015.

N. Tishby, F. C. Pereira, and W. Bialek, The information bottleneck method, 1999.

J. M. Tomczak and M. Welling, VAE with a VampPrior, 2017.

A. Toshev and C. Szegedy, Deeppose: Human pose estimation via deep neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1653-1660, 2014.

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large margin methods for structured and interdependent output variables, Journal of machine learning research, vol.6, pp.1453-1484, 2005.

K. Ullrich, E. Meeds, and M. Welling, Soft weight-sharing for neural network compression, International Conference on Learning Representations (ICLR), 2017.

G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, A. Mohamed, M. Philipose, M. Richardson, and R. Caruana, Do deep convolutional nets really need to be deep and convolutional?, ICLR, 2017.

L. G. Valiant, A theory of the learnable, Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pp.436-445, 1984.

V. Vapnik, E. Levin, and Y. Le-cun, Measuring the vc-dimension of a learning machine, Neural computation, vol.6, issue.5, pp.851-876, 1994.

S. I. Venieris, A. Kouris, and C.-S. Bouganis, Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions, ACM Computing Surveys, vol.51, 2018.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research, vol.11, pp.3371-3408, 2010.

O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, Matching networks for one shot learning, Advances in Neural Information Processing Systems, vol.29, pp.3630-3638, 2016.

M. J. Wainwright, Estimating the wrong graphical model: Benefits in the computationlimited setting, JMLR, vol.7, issue.8, pp.1829-1859, 2006.

M. J. Wainwright, Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, vol.1, pp.1-305, 2008.

M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches, IEEE Transactions on Information Theory, 2005.

M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, A new class of upper bounds on the log partition function, IEEE Transactions on Information Theory, vol.51, issue.7, pp.2313-2335, 2005.

A. S. Weigend and B. Huberman, Predicting the future: A connectionist approach, International Journal of Neural Systems, vol.1, issue.3, pp.193-209, 1990.

P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff et al., Caltech-UCSD Birds 200, 2010.

M. Welling, Intelligence per kilowatt-hour, 2018.

M. Welling, Do we still need models or just more data and compute?, 2019.

P. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, PhD thesis, 1974.

T. Werner, A linear programming approach to max-sum problem: A review, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.7, pp.1165-1179, 2007.

C. Wu, J. Luo, and J. D. Lee, No spurious local minima in a two hidden unit ReLU network, International Conference on Learning Representations Workshop, 2018.

L. Wu and Z. Zhu, Towards understanding generalization of deep learning: Perspective of loss landscapes, 2017.

S. Wu, G. Li, F. Chen, and L. Shi, Training and inference with integers in deep neural networks, 2018.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated residual transformations for deep neural networks, Computer Vision and Pattern Recognition (CVPR), pp.5987-5995, 2017.

A. Xu, Information-theoretic limitations of distributed information processing, 2016.

A. Xu and M. Raginsky, Information-theoretic analysis of generalization capability of learning algorithms, Advances in Neural Information Processing Systems, pp.2524-2533, 2017.

J. Xu, L. Xiao, and A. M. López, Self-supervised domain adaptation for computer vision tasks, IEEE Access, vol.7, pp.156694-156706, 2019.

J. Yang, R. Yan, and A. G. Hauptmann, Cross-domain video concept detection using adaptive svms, Proceedings of the 15th ACM international conference on Multimedia, pp.188-197, 2007.

Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, Deep fried convnets, Proceedings of the IEEE International Conference on Computer Vision, pp.1476-1483, 2015.

C. Yanover, T. Meltzer, and Y. Weiss, Linear programming relaxations and belief propagation - an empirical study, Journal of Machine Learning Research, vol.7, pp.1887-1907, 2006.

J. S. Yedidia, W. T. Freeman, and Y. Weiss, Understanding belief propagation and its generalizations, pp.236-239, 2001.

J. S. Yedidia, W. T. Freeman, and Y. Weiss, Constructing free-energy approximations and generalized belief propagation algorithms, IEEE Transactions on Information Theory, vol.51, issue.7, pp.2282-2312, 2005.

I. E. Yen, X. Huang, K. Zhong, R. Zhang, P. Ravikumar, et al., Dual decomposed learning with factorwise oracle for structural SVM of large output domain, NIPS, pp.5024-5032, 2016.

J. Yim, D. Joo, J. Bae, and J. Kim, A gift from knowledge distillation: Fast optimization, network minimization and transfer learning, Computer Vision and Pattern Recognition, pp.4133-4141, 2017.

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How transferable are features in deep neural networks?, Advances in neural information processing systems, pp.3320-3328, 2014.

J. Yu and M. B. Blaschko, The Lovász hinge: A novel convex surrogate for submodular losses, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

S. Zagoruyko and N. Komodakis, Wide residual networks, BMVC, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01832503

S. Zagoruyko and N. Komodakis, Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01832769

W. Zaremba, I. Sutskever, and O. Vinyals, Recurrent neural network regularization, 2014.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, 2016.

S. Zhang and N. He, On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization, 2018.

X. Zhang, X. Zhou, M. Lin, and J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6848-6856, 2018.

L. Zhao, S. Liao, Y. Wang, Z. Li, J. Tang, and B. Yuan, Theoretical properties for neural networks with weight matrices of low displacement rank, International Conference on Machine Learning, pp.4082-4090, 2017.

W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz, Compressibility and generalization in large-scale deep learning, 2018.

X. Zhu, Z. Ghahramani, and J. Lafferty, Semi-supervised learning using gaussian fields and harmonic functions, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp.912-919, 2003.

B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, Learning transferable architectures for scalable image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.8697-8710, 2018.