S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici et al., YouTube-8M: a large-scale video classification benchmark, 2016.

S. Adavanne, P. Pertilä, and T. Virtanen, Sound event detection using spatial features and convolutional recurrent neural network, ICASSP, pp.771-775, 2017.

J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev et al., Unsupervised learning from narrated instruction videos, Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4575-4583, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01171193

J. Amores, Multiple instance classification: Review, taxonomy and comparative study, Artificial intelligence, vol.201, pp.81-105, 2013.

G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, Proc. of International Conference on Machine Learning, pp.1247-1255, 2013.

R. Arandjelović and A. Zisserman, Look, listen and learn, IEEE International Conference on Computer Vision, 2017.

R. Arandjelović and A. Zisserman, Objects that sound, CoRR, 2017.

H. Atilgan, S. M. Town, K. C. Wood, G. P. Jones, R. K. Maddox et al., Integration of visual information in auditory cortex promotes auditory scene analysis through multisensory binding, Neuron, vol.97, issue.3, pp.640-655, 2018.

Y. Aytar, C. Vondrick, and A. Torralba, SoundNet: learning sound representations from unlabeled video, Advances in Neural Information Processing Systems, pp.892-900, 2016.

Y. Aytar, C. Vondrick, and A. Torralba, See, hear, and read: Deep aligned representations, 2017.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

Z. Barzelay and Y. Y. Schechner, Harmony in motion, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.

S. Becker and G. E. Hinton, Self-organizing neural network that discovers surfaces in random-dot stereograms, Nature, vol.355, issue.6356, p.161, 1992.

F. Bießmann, F. C. Meinecke, A. Gretton, A. Rauch, G. Rainer et al., Temporal kernel CCA and its application in multimodal neuronal data analysis, Mach Learn, vol.79, issue.1-2, pp.5-27, 2010.

H. Bilen, V. P. Namboodiri, and L. Van Gool, Object and action classification with latent window parameters, International Journal of Computer Vision, vol.106, issue.3, pp.237-251, 2014.

H. Bilen, M. Pedersoli, and T. Tuytelaars, Weakly supervised object detection with posterior regularization, Proceedings BMVC 2014, pp.1-12, 2014.

H. Bilen and A. Vedaldi, Weakly supervised deep detection networks, CVPR, pp.2846-2854, 2016.

V. Bisot, S. Essid, and G. Richard, Overlapping sound event detection with supervised nonnegative matrix factorization, ICASSP, pp.31-35, 2017.

J. K. Bizley, G. P. Jones, and S. M. Town, Where are multisensory signals combined for perceptual decision-making?, Current opinion in neurobiology, vol.40, pp.31-37, 2016.

C. Boutsidis and E. Gallopoulos, SVD-based initialization: a head start for nonnegative matrix factorization, Pattern Recognition, vol.41, issue.4, pp.1350-1362, 2008.

H. Bredin and G. Chollet, Measuring audio and visual speech synchrony: methods and applications, IET International Conference on Visual Information Engineering (VIE 2006), pp.255-260, 2006.
URL : https://hal.archives-ouvertes.fr/hal-01987830

H. Bredin and G. Chollet, Audiovisual speech synchrony measure: application to biometrics, EURASIP Journal on Applied Signal Processing, issue.1, pp.179-179, 2007.
URL : https://hal.archives-ouvertes.fr/hal-01987803

A. Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval, Blind audiovisual source separation based on sparse redundant representations, IEEE Transactions on Multimedia, vol.12, issue.5, pp.358-371, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00541412

S. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa et al., Large-scale multimodal semantic concept detection for consumer video, Proceedings of the international workshop on Workshop on multimedia information retrieval, pp.255-264, 2007.

J. Chen, T. Mukai, Y. Takeuchi, T. Matsumoto, H. Kudo et al., Relating audio-visual events caused by multiple movements: in the case of entire object movement, Proceedings of the Fifth International Conference on Information Fusion, vol.1, pp.213-219, 2002.

L. Chen, S. Srivastava, Z. Duan, and C. Xu, Deep cross-modal audio-visual generation, Proc. of Thematic Workshops of ACM Multimedia, pp.349-357, 2017.

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly supervised object localization with multi-fold multiple instance learning, IEEE transactions on pattern analysis and machine intelligence, vol.39, pp.189-203, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01123482

P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent component analysis and applications, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00460653

S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, Readings in speech recognition, pp.65-74, 1990.

V. R. de Sa, Learning classification with unlabeled data, Advances in neural information processing systems, pp.112-119, 1994.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: a large-scale hierarchical image database, Computer Vision and Pattern Recognition, pp.248-255, 2009.

T. Deselaers, B. Alexe, and V. Ferrari, Localizing objects while learning their appearance, European conference on computer vision, pp.452-466, 2010.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial intelligence, vol.89, issue.1-2, pp.31-71, 1997.

L. Dong, A comparison of multi-instance learning algorithms, 2006.

N. Q. Duong, A. Ozerov, L. Chevallier, and J. Sirot, An interactive audio source separation framework based on non-negative matrix factorization, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1567-1571, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00960717

J.-L. Durrieu, B. David, and G. Richard, A musically motivated mid-level representation for pitch estimation and musical audio source separation, IEEE Journal of Selected Topics in Signal Processing, vol.5, issue.6, pp.1180-1191, 2011.

J. Eggert and E. Korner, Sparse coding and NMF, International Joint Conference on Neural Networks, vol.4, pp.2529-2533, 2004.

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson et al., Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation, ACM Trans. Graph, vol.37, issue.4, p.11, 2018.

S. Essid and C. Févotte, Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring, IEEE Transactions on Multimedia, vol.15, issue.2, pp.415-425, 2013.

C. Févotte and J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural computation, vol.23, issue.9, pp.2421-2456, 2011.

C. Févotte, E. Vincent, and A. Ozerov, Single-channel audio source separation with NMF: divergences, constraints and algorithms, Audio Source Separation, pp.1-24, 2018.

J. Fisher, T. Darrell, W. T. Freeman, and P. Viola, Learning Joint Statistical Models for Audio-Visual Fusion and Segregation, Advances in Neural Information Processing Systems, pp.772-778, 2001.

D. Fitzgerald, M. Cranitch, and E. Coyle, Using tensor factorisation models to separate drums from polyphonic music, Proc Int Conf Digit Audio Eff, 2009.

J. Fritsch and M. D. Plumbley, Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.888-891, 2013.

R. Gao, R. Feris, and K. Grauman, Learning to separate object sounds by watching unlabeled video, ECCV, 2018.

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence et al., Audio set: An ontology and human-labeled dataset for audio events, Acoustics, Speech and Signal Processing, pp.776-780, 2017.

R. Girshick, Fast R-CNN, ICCV, pp.1440-1448, 2015.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.580-587, 2014.

G. Gkioxari, R. Girshick, and J. Malik, Contextual action recognition with R*CNN, Proceedings of the IEEE international conference on computer vision, pp.1080-1088, 2015.

R. Goecke and J. B. Millar, Statistical Analysis of the Relationship between Audio and Video Speech Parameters for Australian English, Proc ISCA Tutor Res Workshop Audit-Vis Speech Process, pp.133-138, 2003.

X. Guo, S. Uhlich, and Y. Mitsufuji, NMF-based blind source separation using a linear predictive coding error clustering criterion, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.261-265, 2015.

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput, vol.16, issue.12, pp.2639-2664, 2004.

K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, 2017 IEEE International Conference on Computer Vision (ICCV), pp.2980-2988, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, vol.37, pp.1904-1916, 2015.

T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, Sound event detection in multisource environments using source separation, Machine Listening in Multisource Environments, 2011.

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen et al., CNN architectures for large-scale audio classification, ICASSP, pp.131-135, 2017.

J. Hosang, R. Benenson, and B. Schiele, How good are detection proposals, really?, 25th British Machine Vision Conference, pp.1-12, 2014.

H. Hotelling, Relations between two sets of variates, Biometrika, vol.28, issue.3 -4, pp.321-377, 1936.

P. Huang, M. Kim, M. Hasegawa-johnson, and P. Smaragdis, Deep learning for monaural speech separation, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.1562-1566, 2014.

D. R. Hunter and K. Lange, A tutorial on MM algorithms, The American Statistician, vol.58, issue.1, pp.30-37, 2004.

M. Ilse, J. M. Tomczak, and M. Welling, Attention-based deep multiple instance learning, 2018.

H. Izadinia, I. Saleemi, and M. Shah, Multimodal analysis for identification and segmentation of moving-sounding objects, IEEE Transactions on Multimedia, vol.15, issue.2, pp.378-390, 2013.

R. Jaiswal, D. Fitzgerald, D. Barry, E. Coyle, R. et al., Clustering NMF basis functions using shifted NMF for monaural sound source separation, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.245-248, 2011.

I. Jhuo, G. Ye, S. Gao, D. Liu, Y. Jiang et al., Discovering joint audio-visual codewords for video event detection, Mach Vision Appl, vol.25, issue.1, pp.33-47, 2014.

W. Jiang, C. Cotton, S. F. Chang, D. Ellis, and A. Loui, Short-term audiovisual atoms for generic video concept classification, Proceedings of the 17th ACM International Conference on Multimedia, pp.5-14, 2009.

W. Jiang and A. C. Loui, Audio-visual grouplet: temporal audio-visual interactions for general video concept classification, Proc ACM Int Conf Multimed, pp.123-132, 2011.

Y. Jiang, S. Bhattacharya, S. Chang, and M. Shah, High-level event recognition in unconstrained videos, International journal of multimedia information retrieval, vol.2, issue.2, pp.73-101, 2013.

Y. Jiang, Z. Wu, J. Wang, X. Xue, C. et al., Exploiting feature and class relationships in video categorization with regularized deep neural networks, IEEE transactions on pattern analysis and machine intelligence, vol.40, pp.352-364, 2018.

C. Joder, S. Essid, and G. Richard, Temporal Integration for Audio Classification with Application to Musical Instrument Classification, IEEE Trans Audio Speech Lang Process, 2008.

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, ContextLocNet: context-aware deep network models for weakly supervised localization, European Conference on Computer Vision, pp.350-365, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01421772

E. Karamatlı, A. T. Cemgil, and S. Kırbız, Weak label supervision for monaural source separation using non-negative denoising variational autoencoders, 2018.

T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen, Audio-driven facial animation by joint end-to-end learning of pose and emotion, ACM Transactions on Graphics (TOG), vol.36, issue.4, p.94, 2017.

J. Kay, Feature discovery under contextual supervision using mutual information, Proc Int Jt Conf Neural Netw, vol.4, pp.79-84, 1992.

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The kinetics human action video dataset, 2017.

M. Keuper, B. Andres, and T. Brox, Motion trajectory segmentation via minimum cost multicuts, Proc. of IEEE International Conference on Computer Vision (ICCV), pp.3271-3279, 2015.

E. Kidron, Y. Schechner, and M. Elad, Pixels that sound, Computer Vision and Pattern Recognition, vol.1, pp.88-95, 2005.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

A. Kolesnikov and C. H. Lampert, Seed, expand and constrain: Three principles for weakly-supervised image segmentation, European Conference on Computer Vision, pp.695-711, 2016.

Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, A joint separationclassification model for sound event detection of weakly labelled data, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.321-325, 2018.

O. Z. Kraus, J. L. Ba, and B. J. Frey, Classifying and segmenting microscopy images with deep multiple instance learning, Bioinformatics, vol.32, issue.12, pp.52-59, 2016.

L. Krishnan, M. Elhilali, and S. Shamma, Segregating complex sound sources through temporal coherence, PLoS computational biology, vol.10, issue.12, p.1003985, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in neural information processing systems, pp.1097-1105, 2012.

M. Kubovy and M. Schutz, Audio-visual objects, Review of Philosophy and Psychology, vol.1, issue.1, pp.41-61, 2010.

A. Kumar, M. Khadkevich, and C. Fügen, Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes, 2017.

A. Kumar and B. Raj, Audio event detection using weakly labeled data, Proceedings of the 2016 ACM on Multimedia Conference, pp.1038-1047, 2016.

M. P. Kumar, B. Packer, and D. Koller, Self-paced learning for latent variable models, Advances in Neural Information Processing Systems, pp.1189-1197, 2010.

L. Le Magoarou, A. Ozerov, and N. Q. Duong, Text-informed audio source separation. Example-based approach using non-negative matrix partial cofactorization, Journal of Signal Processing Systems, vol.79, issue.2, pp.117-131, 2015.
URL : https://hal.archives-ouvertes.fr/hal-00870066

J. Le Roux, F. Weninger, and J. R. Hershey, Sparse NMF: half-baked or well done?, Mitsubishi Electric Research Labs (MERL) technical report, 2015.

D. Lee, S. Lee, Y. Han, and K. Lee, Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input, DCASE2017 Challenge technical report, 2017.

D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, Advances in neural information processing systems, pp.556-562, 2001.

B. Li, Z. Duan, and G. Sharma, Associating players to sound sources in musical performance videos. Late Breaking Demo, 2016.

B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, Creating a musical performance dataset for multimodal music analysis: Challenges, insights, and applications, 2016.

D. Li, N. Dimitrova, M. Li, and I. Sethi, Multimedia content processing through cross-modal association, Proc ACM Int Conf Multimed, 2003.

A. Liutkus and R. Badeau, Generalized Wiener filtering with fractional power spectrograms, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.266-270, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01110028

A. Liutkus, J.-L. Durrieu, L. Daudet, and G. Richard, An overview of informed audio source separation, 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp.1-4, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00958661

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3431-3440, 2015.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, Advances In Neural Information Processing Systems, pp.289-297, 2016.

E. Maestre, Modeling instrumental gestures: an analysis/synthesis framework for violin bowing, 2009.

P. Maragos, P. Gros, A. Katsamanis, and G. Papandreou, Cross-modal integration for performance improving in multimedia: a review, Multimodal processing and interaction, pp.1-46, 2008.

M. Marchini, Analysis of Ensemble Expressive Performance in String Quartets: a Statistical and Machine Learning Approach, 2014.

M. Marchini, R. Ramirez, P. Papiotis, and E. Maestre, The sense of ensemble: a machine learning approach to expressive performance modelling in string quartets, Journal of New Music Research, vol.43, issue.3, pp.303-317, 2014.

A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations, ICASSP, pp.151-155, 2015.

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah et al., DCASE2017 challenge setup: Tasks, datasets and baseline system, Proc. of Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp.85-92, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01627981

A. Mesaros, T. Heittola, and T. Virtanen, Metrics for polyphonic sound event detection, Applied Sciences, vol.6, issue.6, p.162, 2016.

G. J. Mysore and P. Smaragdis, A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.17-20, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01084331

K. Nakadai, K. Hidai, H. G. Okuno, and H. Kitano, Real-time speaker localization and speech separation by audio-visual integration, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'02), vol.1, pp.1043-1049, 2002.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, Proc. of International Conference on Machine Learning, pp.689-696, 2011.

D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, Spatio-temporal object detection proposals, European conference on computer vision, pp.737-752, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01021902

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free?-weakly-supervised learning with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.685-694, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01015140

A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson et al., Visually indicated sounds, Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp.2405-2413, 2016.

A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba, Ambient sound provides supervision for visual learning, Proc. of European Conference on Computer Vision, pp.801-816, 2016.

A. Ozerov and C. Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.3, pp.550-563, 2010.

S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez et al., Weakly supervised representation learning for unsynchronized audio-visual events, CoRR, 2018.

S. Ramakrishnan, Cryptographic and Information Security Approaches for Images and Videos, 2018.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Advances in neural information processing systems, pp.91-99, 2015.

X. Ren and J. Malik, Learning a classification model for segmentation, Proc. of IEEE International Conference on Computer Vision (ICCV), 2003.

B. Rivet, L. Girin, and C. Jutten, Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.1, pp.96-108, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00174100

J. Salamon, B. McFee, and P. Li, DCASE 2017 submission: multiple instance learning for sound event detection, DCASE2017 Challenge technical report, 2017.

F. Sedighin, M. Babaie-zadeh, B. Rivet, J. , and C. , Two multimodal approaches for single microphone source separation, EUSIPCO, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01400542

N. Seichepine, S. Essid, C. Févotte, and O. Cappé, Soft nonnegative matrix co-factorization, IEEE Transactions on Signal Processing, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01116863

R. Serizel, V. Bisot, S. Essid, and G. Richard, Acoustic features for environmental sound analysis, Computational Analysis of Sound Scenes and Events, pp.71-101, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01575619

E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-shlizerman, Audio to body dynamics, Proc. CVPR, 2018.

C. Sigg, B. Fischer, B. Ommer, V. Roth, and J. Buhmann, Nonnegative CCA for Audiovisual Source Separation, IEEE Workshop on Machine Learning for Signal Processing, pp.253-258, 2007.

P. Smaragdis and M. Casey, Audio/visual independent components, Proc. of ICA, pp.709-714, 2003.

P. Smaragdis and G. J. Mysore, Separation by "humming": user-guided sound extraction from monophonic mixtures, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp.69-72, 2009.

L. Smith and M. Gasser, The development of embodied cognition: Six lessons from babies, Artificial life, vol.11, issue.1-2, pp.13-29, 2005.

H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell, Weakly-supervised discovery of visual pattern configurations, Advances in Neural Information Processing Systems, pp.1637-1645, 2014.

M. Spiertz and V. Gnann, Source-filter based clustering for monaural blind source separation, Proceedings of International Conference on Digital Audio Effects DAFx'09, 2009.

D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, Detection and classification of acoustic scenes and events, IEEE Transactions on Multimedia, vol.17, issue.10, pp.1733-1746, 2015.

S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-shlizerman, Synthesizing obama: learning lip sync from audio, ACM Transactions on Graphics (TOG), vol.36, issue.4, p.95, 2017.

J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, Selective search for object recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013.

J. C. van Gemert, M. Jain, E. Gati, and C. G. Snoek, APT: action localization proposals from dense trajectories, Proc. of BMVC, vol.2, p.4, 2015.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE transactions on audio, speech, and language processing, vol.14, issue.4, pp.1462-1469, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

E. Vincent, T. Virtanen, and S. Gannot, Audio source separation and speech enhancement, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01881431

T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria, IEEE transactions on audio, speech, and language processing, vol.15, issue.3, pp.1066-1074, 2007.

C. Vondrick, D. Patterson, and D. Ramanan, Efficiently scaling up crowdsourced video annotation, International Journal of Computer Vision, pp.1-21, 2012.

T. Vu, A. Osokin, and I. Laptev, Tube-CNN: modeling temporal evolution of appearance for object detection in video, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01980339

B. Wang and M. D. Plumbley, Investigating single-channel audio source separation methods based on non-negative matrix factorization, Proc. ICA Research Network International Workshop, pp.17-20, 2006.

J. Wang and J. Zucker, Solving multiple-instance problem: a lazy learning approach, Proc. of International Conference on Machine Learning, pp.1119-1126, 2000.
URL : https://hal.archives-ouvertes.fr/hal-01573329

X. Wang, M. Yang, S. Zhu, and Y. Lin, Regionlets for generic object detection, 2013 IEEE International Conference on Computer Vision (ICCV), pp.17-24, 2013.

S. Wisdom, T. Powers, J. Pitton, and L. Atlas, Deep recurrent NMF for speech separation by unfolding iterative thresholding, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp.254-258, 2017.

W. Woodhall, Audio Production and Post-production, 2011.

Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, Surrey-CVSSP system for DCASE2017 challenge task4, DCASE2017 Challenge technical report, 2017.

G. Ye, I. Jhuo, D. Liu, Y. Jiang, D. Lee et al., Joint audio-visual bi-modal codewords for video event detection, Proc. of 2nd ACM International Conference on Multimedia Retrieval, p.39, 2012.

N. Yokoya, T. Yairi, and A. Iwasaki, Coupled Nonnegative Matrix Factorization Unmixing for Hyperspectral and Multispectral Data Fusion, IEEE Trans Geosci Remote Sens, vol.50, issue.2, pp.528-537, 2012.

J. Yoo and S. Choi, Matrix co-factorization on compressed sensing, Proc Int Joint Conf Artif Intell, 2011.

B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, Integration of acoustic and visual speech signals using neural networks, IEEE Communications Magazine, vol.27, issue.11, pp.65-71, 1989.

C. Zhang, J. C. Platt, and P. A. Viola, Multiple instance boosting for object detection, Advances in neural information processing systems, pp.1417-1424, 2006.

Z. Zhang, J. Wu, Q. Li, Z. Huang, J. Traer et al., Generative modeling of audible shapes for object perception, Proc. of IEEE International Conference on Computer Vision (ICCV), 2017.

H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. Mcdermott et al., The sound of pixels, ECCV, 2018.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp.2921-2929, 2016.

Y. Zhou, Z. Xu, C. Landreth, E. Kalogerakis, S. Maji et al., VisemeNet: audio-driven animator-centric speech animation, ACM Trans. Graph, vol.37, issue.4, pp.161-162, 2018.

X. Zhuang, X. Zhou, M. A. Hasegawa-johnson, and T. S. Huang, Real-world acoustic event detection, Pattern Recognition Letters, vol.31, issue.12, pp.1543-1551, 2010.

C. L. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges, ECCV, pp.391-405, 2014.