Visual commonsense reasoning: From Recognition to Cognition: Visual Commonsense Reasoning
Visual question answering: VQA: Visual Question Answering
Cross-modal retrieval: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
Referring expression comprehension: ReferItGame: Referring to Objects in Photographs of Natural Scenes
Tasks:
Masked Language Modelling, Masked Region Prediction, Image-Text Matching
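
As an illustration of the first objective, here is a minimal sketch of how input tokens might be corrupted for Masked Language Modelling. The notes do not specify a masking scheme, so the 15% selection rate and the 80/10/10 split (mask / random token / unchanged) are assumptions borrowed from the common BERT convention. The function name `mask_tokens`, the `[MASK]` string, and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; real tokenizers define their own


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels) for a masked-LM training step.

    labels[i] holds the original token at each selected position and
    None elsewhere, so a loss would only be computed on selected positions.
    The 15% / 80-10-10 scheme is an assumption, not taken from the notes.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK  # replace with the mask symbol
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # replace with a random token
        # otherwise keep the original token unchanged
    return inputs, labels


if __name__ == "__main__":
    caption = ["a", "dog", "chases", "a", "ball", "in", "the", "park"]
    toy_vocab = ["cat", "tree", "runs", "red", "ball", "dog"]
    corrupted, targets = mask_tokens(caption, toy_vocab, mask_prob=0.5, seed=1)
    print(corrupted)
    print(targets)
```

Masked Region Prediction can be sketched the same way by treating image region features as the sequence and zeroing out the selected ones instead of substituting a mask symbol. That analogy is likewise an assumption about a typical setup, not something the notes specify.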