Youtube-8M: A large-scale video classification benchmark, 2016. ,
Sound event detection using spatial features and convolutional recurrent neural network, ICASSP, pp.771-775, 2017. ,
Unsupervised learning from narrated instruction videos, Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4575-4583, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01171193
Multiple instance classification: Review, taxonomy and comparative study, Artificial intelligence, vol.201, pp.81-105, 2013. ,
Deep canonical correlation analysis, Proc. of International Conference on Machine Learning, pp.1247-1255, 2013. ,
Look, listen and learn, IEEE International Conference on Computer Vision, 2017. ,
Objects that sound. CoRR, 2017. ,
Integration of visual information in auditory cortex promotes auditory scene analysis through multisensory binding, Neuron, vol.97, issue.3, pp.640-655, 2018. ,
Soundnet: Learning sound representations from unlabeled video, Advances in Neural Information Processing Systems, pp.892-900, 2016. ,
See, hear, and read: Deep aligned representations, 2017. ,
Neural machine translation by jointly learning to align and translate, 2014. ,
Harmony in motion, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007. ,
Self-organizing neural network that discovers surfaces in random-dot stereograms, Nature, vol.355, issue.6356, p.161, 1992. ,
Temporal kernel cca and its application in multimodal neuronal data analysis, Mach Learn, vol.79, issue.1-2, pp.5-27, 2010. ,
Object and action classification with latent window parameters, International Journal of Computer Vision, vol.106, issue.3, pp.237-251, 2014. ,
Weakly supervised object detection with posterior regularization, Proceedings BMVC 2014, pp.1-12, 2014. ,
Weakly supervised deep detection networks, CVPR, pp.2846-2854, 2016. ,
Overlapping sound event detection with supervised nonnegative matrix factorization, ICASSP, pp.31-35, 2017. ,
Where are multisensory signals combined for perceptual decision-making?, Current opinion in neurobiology, vol.40, pp.31-37, 2016. ,
Svd based initialization: A head start for nonnegative matrix factorization, Pattern Recognition, vol.41, issue.4, pp.1350-1362, 2008. ,
Measuring audio and visual speech synchrony: methods and applications, IET International Conference on Visual Information Engineering (VIE 2006), pp.255-260, 2006. ,
URL : https://hal.archives-ouvertes.fr/hal-01987830
Audiovisual speech synchrony measure: application to biometrics, EURASIP Journal on Applied Signal Processing, issue.1, pp.179-179, 2007. ,
URL : https://hal.archives-ouvertes.fr/hal-01987803
Blind audiovisual source separation based on sparse redundant representations, IEEE Transactions on Multimedia, vol.12, issue.5, pp.358-371, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00541412
Large-scale multimodal semantic concept detection for consumer video, Proceedings of the international workshop on Workshop on multimedia information retrieval, pp.255-264, 2007. ,
Relating audio-visual events caused by multiple movements: in the case of entire object movement, Information Fusion, 2002. Proceedings of the Fifth International Conference on, vol.1, pp.213-219, 2002. ,
Deep cross-modal audio-visual generation, Proc. of Thematic Workshops of ACM Multimedia, pp.349-357, 2017. ,
Weakly supervised object localization with multi-fold multiple instance learning, IEEE transactions on pattern analysis and machine intelligence, vol.39, pp.189-203, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01123482
Handbook of Blind Source Separation: Independent component analysis and applications, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00460653
Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, Readings in speech recognition, pp.65-74, 1990. ,
Learning classification with unlabeled data, Advances in neural information processing systems, pp.112-119, 1994. ,
Imagenet: A large-scale hierarchical image database, Computer Vision and Pattern Recognition, pp.248-255, 2009. ,
Localizing objects while learning their appearance, European conference on computer vision, pp.452-466, 2010. ,
Solving the multiple instance problem with axis-parallel rectangles, Artificial intelligence, vol.89, issue.1-2, pp.31-71, 1997. ,
A comparison of multi-instance learning algorithms, 2006. ,
An interactive audio source separation framework based on non-negative matrix factorization, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1567-1571, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00960717
A musically motivated mid-level representation for pitch estimation and musical audio source separation, IEEE Journal of Selected Topics in Signal Processing, vol.5, issue.6, pp.1180-1191, 2011. ,
Sparse coding and nmf, International Joint Conference on Neural Networks, vol.4, pp.2529-2533, 2004. ,
Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation, ACM Trans. Graph, vol.37, issue.4, p.11, 2018. ,
Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring, IEEE Transactions on Multimedia, vol.15, issue.2, pp.415-425, 2013. ,
Algorithms for nonnegative matrix factorization with the β-divergence, Neural computation, vol.23, issue.9, pp.2421-2456, 2011. ,
Single-channel audio source separation with NMF: divergences, constraints and algorithms, Audio Source Separation, pp.1-24, 2018. ,
Learning Joint Statistical Models for Audio-Visual Fusion and Segregation, Advances in Neural Information Processing Systems, pp.772-778, 2001. ,
Using tensor factorisation models to separate drums from polyphonic music, Proc Int Conf Digit Audio Eff, 2009. ,
Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.888-891, 2013. ,
Learning to separate object sounds by watching unlabeled video, ECCV, 2018. ,
Audio set: An ontology and human-labeled dataset for audio events, Acoustics, Speech and Signal Processing, pp.776-780, 2017. ,
Fast R-CNN, ICCV, pp.1440-1448, 2015. ,
Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.580-587, 2014. ,
Contextual action recognition with r* cnn, Proceedings of the IEEE international conference on computer vision, pp.1080-1088, 2015. ,
Statistical Analysis of the Relationship between Audio and Video Speech Parameters for Australian English, Proc ISCA Tutor Res Workshop Audit-Vis Speech Process, pp.133-138, 2003. ,
NMF-based blind source separation using a linear predictive coding error clustering criterion, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.261-265, 2015. ,
Canonical correlation analysis: an overview with application to learning methods, Neural Comput, vol.16, issue.12, pp.2639-2664, 2004. ,
Mask r-cnn, 2017 IEEE International Conference on, pp.2980-2988, 2017. ,
Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE transactions on pattern analysis and machine intelligence, vol.37, pp.1904-1916, 2015. ,
Sound event detection in multisource environments using source separation, Machine Listening in Multisource Environments, 2011. ,
CNN architectures for large-scale audio classification, ICASSP, pp.131-135, 2017. ,
How good are detection proposals, really?, 25th British Machine Vision Conference, pp.1-12, 2014. ,
Relations between two sets of variates, Biometrika, vol.28, issue.3 -4, pp.321-377, 1936. ,
Deep learning for monaural speech separation, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.1562-1566, 2014. ,
A tutorial on mm algorithms, The American Statistician, vol.58, issue.1, pp.30-37, 2004. ,
Attention-based deep multiple instance learning, 2018. ,
Multimodal analysis for identification and segmentation of moving-sounding objects, IEEE Transactions on Multimedia, vol.15, issue.2, pp.378-390, 2013. ,
Clustering NMF basis functions using shifted NMF for monaural sound source separation, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.245-248, 2011. ,
Discovering joint audio-visual codewords for video event detection, Mach Vision Appl, vol.25, issue.1, pp.33-47, 2014. ,
Short-term audiovisual atoms for generic video concept classification, Proceedings of the 17th ACM International Conference on Multimedia, pp.5-14, 2009. ,
Audio-visual grouplet: temporal audio-visual interactions for general video concept classification, Proc ACM Int Conf Multimed, pp.123-132, 2011. ,
High-level event recognition in unconstrained videos, International journal of multimedia information retrieval, vol.2, issue.2, pp.73-101, 2013. ,
Exploiting feature and class relationships in video categorization with regularized deep neural networks, IEEE transactions on pattern analysis and machine intelligence, vol.40, pp.352-364, 2018. ,
Temporal Integration for Audio Classification with Application to Musical Instrument Classification, IEEE Trans Audio Speech Lang Process, 2008. ,
Contextlocnet: Context-aware deep network models for weakly supervised localization, European Conference on Computer Vision, pp.350-365, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01421772
Weak label supervision for monaural source separation using non-negative denoising variational autoencoders, 2018. ,
Audio-driven facial animation by joint end-to-end learning of pose and emotion, ACM Transactions on Graphics (TOG), vol.36, issue.4, p.94, 2017. ,
Feature discovery under contextual supervision using mutual information, Proc Int Jt Conf Neural Netw, vol.4, pp.79-84, 1992. ,
The kinetics human action video dataset, 2017. ,
Motion trajectory segmentation via minimum cost multicuts, Proc. of IEEE International Conference on Computer Vision (ICCV), pp.3271-3279, 2015. ,
Pixels that sound, Computer Vision and Pattern Recognition, vol.1, pp.88-95, 2005. ,
Adam: A method for stochastic optimization, 2014. ,
Seed, expand and constrain: Three principles for weakly-supervised image segmentation, European Conference on Computer Vision, pp.695-711, 2016. ,
A joint separation-classification model for sound event detection of weakly labelled data, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.321-325, 2018. ,
Classifying and segmenting microscopy images with deep multiple instance learning, Bioinformatics, vol.32, issue.12, pp.52-59, 2016. ,
Segregating complex sound sources through temporal coherence, PLoS computational biology, vol.10, issue.12, p.1003985, 2014. ,
Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems, pp.1097-1105, 2012. ,
Audio-visual objects, Review of Philosophy and Psychology, vol.1, issue.1, pp.41-61, 2010. ,
Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes, 2017. ,
Audio event detection using weakly labeled data, Proceedings of the 2016 ACM on Multimedia Conference, pp.1038-1047, 2016. ,
Self-paced learning for latent variable models, Advances in Neural Information Processing Systems, pp.1189-1197, 2010. ,
Text-informed audio source separation. example-based approach using non-negative matrix partial cofactorization, Journal of Signal Processing Systems, vol.79, issue.2, pp.117-131, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-00870066
, Sparse NMF-half-baked or well done? Mitsubishi Electric Research Labs (MERL), pp.2015-2038, 2015.
Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input, p.2017, 2017. ,
Algorithms for non-negative matrix factorization, Advances in neural information processing systems, pp.556-562, 2001. ,
Associating players to sound sources in musical performance videos. Late Breaking Demo, 2016. ,
Creating a musical performance dataset for multimodal music analysis: Challenges, insights, and applications, 2016. ,
Multimedia content processing through cross-modal association, Proc ACM Int Conf Multimed, 2003. ,
Generalized wiener filtering with fractional power spectrograms, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.266-270, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01110028
An overview of informed audio source separation, 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp.1-4, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00958661
Fully convolutional networks for semantic segmentation, Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3431-3440, 2015. ,
Hierarchical question-image co-attention for visual question answering, Advances In Neural Information Processing Systems, pp.289-297, 2016. ,
Modeling instrumental gestures: an analysis/synthesis framework for violin bowing, 2009. ,
Cross-modal integration for performance improving in multimedia: a review, Multimodal processing and interaction, pp.1-46, 2008. ,
Analysis of Ensemble Expressive Performance in String Quartets: a Statistical and Machine Learning Approach, 2014. ,
The sense of ensemble: a machine learning approach to expressive performance modelling in string quartets, Journal of New Music Research, vol.43, issue.3, pp.303-317, 2014. ,
Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations, ICASSP, pp.151-155, 2015. ,
DCASE2017 challenge setup: Tasks, datasets and baseline system, Proc. of Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp.85-92, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01627981
Metrics for polyphonic sound event detection, Applied Sciences, vol.6, issue.6, p.162, 2016. ,
A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics, Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.17-20, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-01084331
Real-time speaker localization and speech separation by audio-visual integration, Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on, vol.1, pp.1043-1049, 2002. ,
Multimodal deep learning, Proc. of International Conference on Machine Learning, pp.689-696, 2011. ,
Spatio-temporal object detection proposals, European conference on computer vision, pp.737-752, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01021902
Is object localization for free?-weakly-supervised learning with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.685-694, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01015140
Visually indicated sounds, Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp.2405-2413, 2016. ,
Ambient sound provides supervision for visual learning, Proc. of European Conference on Computer Vision, pp.801-816, 2016. ,
Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.3, pp.550-563, 2010. ,
Weakly supervised representation learning for unsynchronized audio-visual events, 2018. ,
Cryptographic and Information Security Approaches for Images and Videos, 2018. ,
Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems, pp.91-99, 2015. ,
Learning a classification model for segmentation, Proc. of IEEE International Conference on Computer Vision (ICCV), 2003. ,
Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.1, pp.96-108, 2007. ,
URL : https://hal.archives-ouvertes.fr/hal-00174100
DCASE 2017 submission: Multiple instance learning for sound event detection, p.2017, 2017. ,
Two multimodal approaches for single microphone source separation, EUSIPCO, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01400542
Soft nonnegative matrix co-factorization, IEEE Transactions on Signal Processing, p.99, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01116863
Acoustic features for environmental sound analysis, Computational Analysis of Sound Scenes and Events, pp.71-101, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01575619
Audio to body dynamics, Proc. CVPR, 2018. ,
Nonnegative CCA for Audiovisual Source Separation, IEEE Workshop on Machine Learning for Signal Processing, pp.253-258, 2007. ,
Audio/visual independent components, Proc. of ICA, pp.709-714, 2003. ,
Separation by "humming": user-guided sound extraction from monophonic mixtures, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp.69-72, 2009. ,
The development of embodied cognition: Six lessons from babies, Artificial life, vol.11, issue.1-2, pp.13-29, 2005. ,
Weakly-supervised discovery of visual pattern configurations, Advances in Neural Information Processing Systems, pp.1637-1645, 2014. ,
Source-filter based clustering for monaural blind source separation, Proceedings of International Conference on Digital Audio Effects DAFx'09, 2009. ,
Detection and classification of acoustic scenes and events, IEEE Transactions on Multimedia, vol.17, issue.10, pp.1733-1746, 2015. ,
Synthesizing obama: learning lip sync from audio, ACM Transactions on Graphics (TOG), vol.36, issue.4, p.95, 2017. ,
Selective search for object recognition, International journal of computer vision, vol.104, issue.2, pp.154-171, 2013. ,
Apt: Action localization proposals from dense trajectories, Proc. of BMVC, vol.2, p.4, 2015. ,
Performance measurement in blind audio source separation, IEEE transactions on audio, speech, and language processing, vol.14, issue.4, pp.1462-1469, 2006. ,
URL : https://hal.archives-ouvertes.fr/inria-00544230
Audio source separation and speech enhancement, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01881431
Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria, IEEE transactions on audio, speech, and language processing, vol.15, issue.3, pp.1066-1074, 2007. ,
Efficiently scaling up crowdsourced video annotation, International Journal of Computer Vision, pp.1-21, 2012. ,
Tube-cnn: Modeling temporal evolution of appearance for object detection in video, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01980339
Investigating single-channel audio source separation methods based on non-negative matrix factorization, Proc. ICA Research Network International Workshop, pp.17-20, 2006. ,
Solving multiple-instance problem: a lazy learning approach, Proc. of International Conference on Machine Learning, pp.1119-1126, 2000. ,
URL : https://hal.archives-ouvertes.fr/hal-01573329
Regionlets for generic object detection, Computer Vision (ICCV), 2013 IEEE International Conference on, pp.17-24, 2013. ,
Deep recurrent nmf for speech separation by unfolding iterative thresholding, Applications of Signal Processing to Audio and Acoustics (WASPAA), pp.254-258, 2017. ,
Audio Production and Post-production, 2011. ,
Surrey-CVSSP system for DCASE2017 challenge task4, p.2017, 2017. ,
Joint audio-visual bi-modal codewords for video event detection, Proc. of 2nd ACM International Conference on Multimedia Retrieval, p.39, 2012. ,
Coupled Nonnegative Matrix Factorization Unmixing for Hyperspectral and Multispectral Data Fusion, IEEE Trans Geosci Remote Sens, vol.50, issue.2, pp.528-537, 2012. ,
Matrix co-factorization on compressed sensing, Proc Int Joint Conf Artif Intell, 2011. ,
Integration of acoustic and visual speech signals using neural networks, IEEE Communications Magazine, vol.27, issue.11, pp.65-71, 1989. ,
Multiple instance boosting for object detection, Advances in neural information processing systems, pp.1417-1424, 2006. ,
Generative modeling of audible shapes for object perception, Proc. of IEEE International Conference on Computer Vision (ICCV), 2017. ,
The sound of pixels, ECCV, 2018. ,
Learning deep features for discriminative localization, Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp.2921-2929, 2016. ,
Audio-driven animator-centric speech animation, ACM Trans. Graph, vol.37, issue.4, pp.161-162, 2018. ,
Real-world acoustic event detection, Pattern Recognition Letters, vol.31, issue.12, pp.1543-1551, 2010. ,
Edge boxes: Locating object proposals from edges, ECCV, pp.391-405, 2014. ,