P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR, 2018.

Z. Antoniou, Real-Time Adaptation to Time-Varying Constraints for Reliable mHealth Video Communications, 2017.

R. Armbrust, Capturing Growth: Photo Apps and Open Graph, 2012.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2015.

N. Ballas, Modélisation de contextes pour l'annotation sémantique de vidéos, 2014.

T. Baltrušaitis, C. Ahuja, and L. Morency, Multimodal Machine Learning: A Survey and Taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Y. Belinkov, A. Poliak, S. M. Shieber, B. Van-durme, and A. M. Rush, Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference, ACL, 2019.

A. C. Berg, T. L. Berg, H. Daumé, J. Dodge, A. Goyal et al., Understanding and Predicting Importance in Images, CVPR, 2012.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, vol.3, 2003.

M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp et al., End to End Learning for Self-Driving Cars, 2016.

T. Brants and A. Franz, Web 1T 5-gram Version 1, Linguistic Data Consortium, 2006.

M. Buda, A. Maki, and M. A. Mazurowski, A systematic study of the class imbalance problem in convolutional neural networks, Neural Networks, 2018.

Y. W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng, HICO: A benchmark for recognizing human-object interactions in images, ICCV, 2015.

N. Chawla, K. Bowyer, L. Hall, and P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research, 2002.

G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, Learning Efficient Object Detection Models with Knowledge Distillation, NIPS, 2017.

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, PAMI, 2018.

K. Cho, B. Van-merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

N. Chomsky, Three models for the description of language, IRE Transactions on Information Theory, 1956.

A. Clark, J. Donahue, and K. Simonyan, Efficient Video Generation on Complex Datasets, 2019.

P. Covington, J. Adams, and E. Sargin, Deep neural networks for YouTube recommendations, RecSys 2016 - Proceedings of the 10th ACM Conference on Recommender Systems, 2016.

Y. Cui, F. Zhou, Y. Lin, and S. Belongie, Fine-grained Categorization and Dataset Bootstrapping using Deep Metric Learning with Humans in the Loop, CVPR, 2016.

B. Dai, Y. Zhang, and D. Lin, Detecting Visual Relationships with Deep Relational Networks, CVPR, 2017.

J. Dai, K. He, and J. Sun, BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation, ICCV, 2015. ISBN 9781467383912

J. Deng, J. Krause, A. C. Berg, and L. Fei-fei, Hedging Your Bets: Optimizing Accuracy-Specificity Trade-offs in Large Scale Visual Recognition, CVPR, 2012.

J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy et al., Large-Scale Object Classification Using Label Relation Graphs, European Conference on Computer Vision, 2014.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition, pp.2-9, 2009.

T. G. Dietterich, Ensemble Methods in Machine Learning, International workshop on multiple classifier systems, 2000.

C. Doersch, A. Gupta, and A. A. Efros, Unsupervised Visual Representation Learning by Context Prediction, ICCV, 2015.

A. Dosovitskiy, T. Springenberg, M. Riedmiller, and T. Brox, Discriminative Unsupervised Feature Learning with Convolutional Neural Networks, NIPS, 2014.

L. Engstrom, A. Ilyas, A. Madry, S. Santurkar, B. Tran, and D. Tsipras, A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Discussion and Author Responses, Distill, 2019.

M. Everingham, L. Van-gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.88, pp.303-338, 2010.

Y. Fang, K. Kuan, J. Lin, C. Tan, and V. Chandrasekhar, Object Detection Meets Knowledge Graphs, IJCAI, pp.1661-1667, 2017.

C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, vol.71, 1998.

P. F. Felzenszwalb, R. B. Girshick, D. Mcallester, and D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

B. Fernando, H. Bilen, E. Gavves, and S. Gould, Self-Supervised Video Representation Learning With Odd-One-Out Networks, 2017.

R. A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, 1936.

A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean et al., DeViSE: A Deep Visual-Semantic Embedding Model, NIPS, 2013.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, Arxiv, 2016.

C. Galleguillos, A. Rabinovich, and S. J. Belongie, Object categorization using co-occurrence, location and appearance, CVPR, 2008.

C. Gao, Y. Zou, and J. Huang, iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection, BMVC, 2018.

M. Geva, Y. Goldberg, and J. Berant, Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets, EMNLP-IJCNLP, 2019.

G. Ghiasi and C. C. Fowlkes, Laplacian pyramid reconstruction and refinement for semantic segmentation, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).

S. Gidaris, P. Singh, and N. Komodakis, Unsupervised Representation Learning by Predicting Image Rotations, ICLR, 2018.

J. Gilmer and D. Hendrycks, A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Adversarial Example Researchers Need to Expand What is Meant by 'Robustness', Distill, 2019.

R. Girdhar and D. Ramanan, Attentional Pooling for Action Recognition, NIPS, 2017.

R. Girshick, Fast R-CNN, ICCV, 2015.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.580-587, 2014.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.

J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, Neighbourhood Components Analysis, NIPS, 2004.

L. Gomez, Y. Patel, M. Rusinol, D. Karatzas, and C. Jawahar, Self-supervised learning of visual features through embedding images into text topic spaces, CVPR, 2017.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.

K. Gorman and S. Bedrick, We need to talk about standard splits, ACL, 2019.

Y. Goyal, T. Khot, A. Agrawal, D. Summers-stay, D. Batra et al., Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, International Journal of Computer Vision, 2019.

S. Gupta, J. Hoffman, and J. Malik, Cross Modal Distillation for Supervision Transfer, CVPR, 2016.

H. He, Y. Bai, E. A. Garcia, and S. Li, ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning, IEEE International Joint Conference on Neural Networks, 2008. ISBN 9781424418213

R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson, Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, NIPS, 2018.

G. Hinton, O. Vinyals, and J. Dean, Distilling the Knowledge in a Neural Network, NIPS Deep Learning Workshop, 2014.

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 1997.

S. Hong, J. Oh, H. Lee, and B. Han, Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network, CVPR, 2016.

Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing, Harnessing Deep Neural Networks with Logic Rules, ACL, 2016. ISBN 9781510827585

J. Johnson, M. Douze, and H. Jégou, Billion-scale similarity search with GPUs

J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma et al., Image Retrieval using Scene Graphs

Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio, Learning to Remember Rare Events, ICLR, 2017.

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The Kinetics Human Action Video Dataset

J. Kiefer and J. Wolfowitz, Stochastic Estimation of the Maximum of a Regression Function, The Annals of Mathematical Statistics, 1952.

T. N. Kipf and M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, ICLR, 2017.

R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal Neural Language Models, ICML, pp.595-603, 2014.

G. Koch, R. Zemel, and R. Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Advances In Neural Information Processing Systems, pp.1-9, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A Large Video Database for Human Motion Recognition, High Performance Computing in Science and Engineering, 2012.

A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin et al., The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale, 2018.

B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum, One shot learning of simple visual concepts, Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.

B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, Human-level concept learning through probabilistic program induction, Science, 2015.

B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, Building Machines That Learn and Think Like People, Behavioral and Brain Sciences, 2017.

C. H. Lampert, H. Nickisch, and S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, 2009.

C. Leacock and M. Chodorow, Filling in a sparse training space for word sense identification, 1994.

Y. Li, W. Ouyang, X. Wang, and X. Tang, ViP-CNN: Visual Phrase Guided Convolutional Neural Network, CVPR, 2017.

Y. Li, J. Yang, Y. Song, L. Cao, J. Luo et al., Learning from Noisy Labels with Distillation, 2017.

K. Liang, Y. Guo, H. Chang, and X. Chen, Visual Relationship Detection with Deep Structural Ranking, AAAI, 2018.

X. Liang, L. Lee, and E. P. Xing, Deep Variation-structured Reinforcement Learning for Visual Relationship and Attribute Detection, CVPR, 2017.

G. Lin and C. Shen, Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation, CVPR, 2016. ISBN 9781467388504

T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick et al., Microsoft COCO: Common Objects in Context

T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, Focal Loss for Dense Object Detection, 2017.

Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, Semantic image segmentation via deep parsing network, ICCV, 2015. ISBN 9781467383912

C. Lu, R. Krishna, M. Bernstein, and L. Fei-fei, Visual relationship detection with language priors, ECCV, 2016. ISBN 9783319464473

H. Macleod, C. L. Bennett, M. R. Morris, and E. Cutrell, Understanding Blind People's Experiences with Computer-Generated Captions of Social Media Images, 2017.

J. Macqueen, Some methods for classification and analysis of multivariate observations, Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.

J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR, 2019.

Y. Mao, C. Zhou, X. Wang, and R. Li, Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning, IJCAI, 2018.

G. Marcus, Deep Learning: A Critical Appraisal

K. Marino, R. Salakhutdinov, and A. Gupta, The More You Know: Using Knowledge Graphs for Image Classification, CVPR, 2017.

T. Mccoy, E. Pavlick, and T. Linzen, Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, ACL, 2019.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space, ICLR, 2013.

I. Misra, L. Zitnick, and M. Hebert, Shuffle and Learn: Unsupervised Learning using Temporal Order Verification, ECCV, 2016.

I. Misra, L. Zitnick, M. Mitchell, and R. Girshick, Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels, CVPR, 2016.

M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal et al., Moments in Time Dataset: one million videos for event understanding, 2019.

A. Neelakantan, Q. V. Le, M. Abadi, A. Mccallum, and D. Amodei, Learning a Natural Language Interface with Neural Programmer, 2017.

A. Newell and J. Deng, Pixels to Graphs by Associative Embedding, NIPS, 2017.

A. Newell, Z. Huang, and J. Deng, Associative Embedding: End-to-End Learning for Joint Detection and Grouping, NIPS, 2017.

M. Nickel and D. Kiela, Poincaré Embeddings for Learning Hierarchical Representations, 2017.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks, CVPR, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00911179

V. Ordonez, J. Deng, Y. Choi, A. C. Berg, and T. L. Berg, From Large Scale Image Categorization to Entry-Level Categories, ICCV, 2013.

M. Palatucci, G. E. Hinton, D. Pomerleau, and T. M. Mitchell, Zero-Shot Learning with Semantic Output Codes, Advances in Neural Information Processing Systems, vol.22, 2009.

D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, Context Encoders: Feature Learning by Inpainting, CVPR, 2016.

K. Pearson, On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1901.

J. Pennington, R. Socher, and C. Manning, Glove: Global Vectors for Word Representation, EMNLP, 2014.

J. Peyre, I. Laptev, C. Schmid, and J. Sivic, Weakly-supervised learning of visual relations, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01576035

F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, Visual Relationship Detection Based on Guided Proposals and Semantic Knowledge Distillation, ICME, 2018.

F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, Learning Prototypes for Visual Relationship Detection, CBMI, 2018.

N. Qian, On the Momentum Term in Gradient Descent Learning Algorithms

I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He, Data Distillation: Towards Omni-Supervised Learning, CVPR, 2018.

V. Ramanathan, C. Li, J. Deng, W. Han, Z. Li et al., Learning semantic relationships for better action retrieval in images, CVPR, 2015.

J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018.

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, CVPR, 2016.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015.

M. T. Ribeiro, S. Singh, and C. Guestrin, Explaining the Predictions of Any Classifier, KDD, 2016. ISBN 9781450342322

H. Robbins and S. Monro, A Stochastic Approximation Method, Annals of Mathematical Statistics, 1951.

M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, What helps where - and why? Semantic relatedness for knowledge transfer, CVPR, 2010. ISBN 9781424469840

A. Sarullo and T. Mu, On Class Imbalance and Background Filtering in Visual Relationship Detection, 2019.

M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van-den-berg, I. Titov et al., Modeling Relational Data with Graph Convolutional Networks, 2017.

F. Schroff, D. Kalenichenko, and J. Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR, 2015.

M. Schultz and T. Joachims, Learning a Distance Metric from Relative Comparisons, NIPS, 2003.

A. Shrivastava, A. Gupta, and R. Girshick, Training Region-based Object Detectors with Online Hard Example Mining, CVPR, 2016.

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science, 2018.

E. Simo-serra, E. Trulls, L. Ferraz, and F. Moreno-noguer, Fracking Deep Convolutional Image Descriptors, ICLR, 2015.

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR, 2015. ISBN 9781450341448

J. Snell, K. Swersky, and R. Zemel, Prototypical Networks for Few-shot Learning, NIPS, 2017.

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded Compositional Semantics for Finding and Describing Images with Sentences, Transactions of the Association for Computational Linguistics, 2014.

K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NIPS, 2016.

H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese, Deep Metric Learning via Lifted Structured Feature Embedding, CVPR, 2016.

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012.

R. Speer and C. Havasi, Representing General Relational Knowledge in ConceptNet 5, LREC, 2012.

T. Sun, B. Zhou, L. Lai, and J. Pei, Sequence-based prediction of protein protein interaction using a deep-learning algorithm, BMC Bioinformatics, 2017.

M. A. Tahir, J. Kittler, K. Mikolajczyk, and F. Yan, A multiple expert approach to the class imbalance problem using inverse random under sampling, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol.5519, pp.82-91, 2009.

J. R. R. Uijlings, K. E. A. Van-de-sande, T. Gevers, and A. W. M. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, 2013.

L. Van-der-maaten and G. Hinton, Visualizing Data using t-SNE, Journal of Machine Learning Research, vol.9, 2008.

B. Van-durme and L. Schubert, Extracting implicit knowledge from text, 2010.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention Is All You Need, NIPS, 2017.

I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, Order-Embeddings of Images and Language, ICLR, 2016.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, CVPR, 2015.

O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, Matching Networks for One Shot Learning, NIPS, 2016.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, vol.99, pp.1-1, 2016.

W. Wang, R. Wang, S. Shan, and X. Chen, Exploring Context and Visual Pattern of Relationship for Scene Graph Generation, CVPR, 2019.

X. Wang and A. Gupta, Unsupervised Learning of Visual Representations using Videos, ICCV, 2015.

X. Wang, Y. Ye, and A. Gupta, Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs, CVPR, 2018.

Z. Wang, N. De-freitas, and M. Lanctot, Dueling Network Architectures for Deep Reinforcement Learning, arXiv, issue.9, pp.1-16, 2016.

D. Wei, J. Lim, A. Zisserman, and W. Freeman, Learning and Using the Arrow of Time, CVPR, 2018.

K. Q. Weinberger, J. Blitzer, and L. K. Saul, Distance Metric Learning for Large Margin Nearest Neighbor Classification, Journal of Machine Learning Research, 2009.

S. Woo, D. Kim, D. Cho, and I. S. Kweon, LinkNet: Relational Embedding for Scene Graph, NIPS, 2018.

Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly, 2018.

E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, Distance Metric Learning, with Application to Clustering with Side-Information, NIPS, 2002.

C. Xu, S. Hsieh, C. Xiong, and J. J. Corso, Can humans fly? Action understanding with multiple classes of actors, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.7-12, 2015.

D. Xu, Y. Zhu, C. B. Choy, and L. Fei-fei, Scene Graph Generation by Iterative Message Passing, CVPR, 2017.

K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML, 2015.

F. Yan and K. Mikolajczyk, Deep Correlation for Matching Images and Text, CVPR, 2015.

J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, Graph R-CNN for Scene Graph Generation, ECCV, 2018.

B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas et al., Human action recognition by learning bases of action attributes and parts, Proceedings of the IEEE International Conference on Computer Vision, pp.1331-1338, 2011.

J. Yim, D. Joo, J. Bae, and J. Kim, A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning, CVPR, 2017.

G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang et al., Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition, ECCV, 2018.

R. Yu, A. Li, V. I. Morariu, and L. S. Davis, Visual Relationship Detection With Internal and External Linguistic Knowledge Distillation, 2017.

R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, Neural Motifs: Scene Graph Parsing with Global Context, CVPR, 2018.

H. Zhang, Z. Kyaw, S. Chang, and T. Chua, Visual Translation Embedding Network for Visual Relation Detection, 2017.

J. Zhang, M. Elhoseiny, S. Cohen, and W. Chang, Relationship Proposal Networks, CVPR, 2017.

J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny, Large-Scale Visual Relationship Understanding, AAAI, 2019.

J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro, Graphical Contrastive Losses for Scene Graph Parsing, CVPR, 2019.

R. Zhang, P. Isola, and A. A. Efros, Colorful Image Colorization, ECCV, 2016.

H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba, Open Vocabulary Scene Parsing, ICCV, 2017.

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, Pyramid scene parsing network, CVPR, 2017. ISBN 9781538604571

Y. Zhu, S. Jiang, and X. Li, Visual relationship detection with object spatial distribution, ICME, 2017.

Y. Zhu, A. Fathi, and L. Fei-fei, Reasoning about Object Affordances in a Knowledge Base Representation, ECCV, 2014.