Deep Learning: Multimodal Tasks


Visual commonsense reasoning: From Recognition to Cognition: Visual Commonsense Reasoning

Visual question answering: VQA: Visual Question Answering

Cross-modal retrieval: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Referring expression comprehension: ReferItGame: Referring to Objects in Photographs of Natural Scenes

Tasks:

Masked Language Modelling, Masked Region Prediction, and Image-Text Matching
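To make the three tasks concrete, here is a minimal sketch of how training examples for them might be built. It is an illustration under stated assumptions, not the procedure of any particular model. The names `mask_items` and `make_itm_examples`, the 15% masking rate, and the one-negative-per-pair sampling are all hypothetical choices. The same masking helper covers both masked language modelling (items are tokens) and masked region prediction (items are image-region features), and image-text matching is framed as binary classification over matched and mismatched pairs.

```python
import random


def mask_items(items, mask_value, mask_prob=0.15, rng=None):
    """Randomly replace items with mask_value.

    Returns (masked_items, labels). labels[i] holds the original item
    at masked positions and None elsewhere, so a model is only asked
    to predict what was hidden.
    Assumption: items can be text tokens (masked language modelling)
    or image-region features (masked region prediction).
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for item in items:
        if rng.random() < mask_prob:
            masked.append(mask_value)
            labels.append(item)
        else:
            masked.append(item)
            labels.append(None)
    return masked, labels


def make_itm_examples(pairs, rng=None):
    """Build image-text matching examples from (image_id, caption) pairs.

    Each pair yields one positive (label 1) and one negative (label 0)
    whose caption is borrowed from a different pair.
    Assumption: at least two pairs are given.
    """
    rng = rng or random.Random()
    examples = []
    for i, (image_id, caption) in enumerate(pairs):
        examples.append((image_id, caption, 1))
        # Pick the index of any other pair to supply a mismatched caption.
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append((image_id, pairs[j][1], 0))
    return examples


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["a", "dog", "runs", "on", "the", "grass"]
    print(mask_items(tokens, "[MASK]", mask_prob=0.3, rng=rng))

    pairs = [("img1", "a dog"), ("img2", "a cat"), ("img3", "a car")]
    for example in make_itm_examples(pairs, rng=rng):
        print(example)
```

For masked region prediction, `mask_value` would typically be something like an all-zero feature vector instead of a `"[MASK]"` string. That detail is also an assumption of this sketch.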