Abstract: In recent years, Vision-Language Pre-training (VLP) has made significant progress in the field of universal multimodality. Universal multimodal datasets (such as MSCOCO and Flickr30k) have ...