Abstract: Vision Language Pre-training (VLP) has made significant progress in the field of universal multimodality in recent years. Universal multimodal datasets (such as MSCOCO, Flickr30k, etc.) have ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results