PowerPoint Presentation


Scene 1 (0s)

[Audio] Hello everyone. I am Haoyu Lu from Renmin University of China. Today I am glad to share with you our work: COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval.

Scene 2 (19s)

[Audio] Recently, large-scale multi-modal two-stream pre-training models like CLIP, ALIGN, and Wenlan have shown promising performance. Despite their efficiency advantage, these models are slightly inferior to single-stream models due to the absence of fine-grained cross-modal interaction.
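
[Code sketch] A minimal illustration, not taken from the paper, of why two-stream models are efficient for retrieval: each modality is encoded independently, so gallery embeddings can be pre-computed and ranked with a single matrix product, whereas a single-stream model needs a joint forward pass for every image-text pair. The encoders below are placeholder linear layers standing in for the real image and text towers.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in encoders; the real model uses deep image/text towers.
image_encoder = torch.nn.Linear(2048, 256)   # image feature -> shared space
text_encoder = torch.nn.Linear(768, 256)     # text feature  -> shared space

images = torch.randn(1000, 2048)             # 1000 gallery images
queries = torch.randn(5, 768)                # 5 text queries

with torch.no_grad():
    img_emb = F.normalize(image_encoder(images), dim=-1)   # precomputed offline
    txt_emb = F.normalize(text_encoder(queries), dim=-1)

# Text-to-image retrieval: one matrix multiply ranks the whole gallery.
scores = txt_emb @ img_emb.t()               # (5, 1000) cosine similarities
top5 = scores.topk(5, dim=-1).indices
print(top5.shape)                            # torch.Size([5, 5])
```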

Scene 3 (39s)

[Audio] To overcome the above limitation, we propose a novel collaborative two-stream vision-language pre-training model termed COTS, which enhances cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, we leverage two extra levels of cross-modal interaction in COTS. First, we propose a masked vision-language modeling learning objective for token-level interaction without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. Second, we introduce a KL-alignment learning objective between the text-to-image and image-to-text retrieval tasks for task-level interaction, where the probability distribution of each task is computed with the negative queues in momentum contrastive learning.
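
[Code sketch] A hedged PyTorch sketch of the instance-level momentum contrastive loss with negative queues and the task-level KL alignment between the image-to-text and text-to-image distributions. It is not the authors' implementation: the temperature, queue size, and embedding dimension are placeholder assumptions, and the token-level masked vision-language modeling term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

temperature = 0.07        # assumed value
queue_size = 4096         # assumed value
dim = 256

# Momentum queues of negative keys (filled by a momentum encoder in practice).
image_queue = F.normalize(torch.randn(queue_size, dim), dim=-1)
text_queue = F.normalize(torch.randn(queue_size, dim), dim=-1)

def contrastive_and_kl(img_q, txt_q, img_k, txt_k):
    """img_q/txt_q: online-encoder embeddings; img_k/txt_k: momentum-encoder keys."""
    img_q, txt_q = F.normalize(img_q, dim=-1), F.normalize(txt_q, dim=-1)
    img_k, txt_k = F.normalize(img_k, dim=-1), F.normalize(txt_k, dim=-1)

    # Image-to-text logits: positive pair + negatives from the text queue.
    pos_i2t = (img_q * txt_k).sum(-1, keepdim=True)           # (B, 1)
    neg_i2t = img_q @ text_queue.t()                          # (B, K)
    logits_i2t = torch.cat([pos_i2t, neg_i2t], dim=1) / temperature

    # Text-to-image logits: positive pair + negatives from the image queue.
    pos_t2i = (txt_q * img_k).sum(-1, keepdim=True)
    neg_t2i = txt_q @ image_queue.t()
    logits_t2i = torch.cat([pos_t2i, neg_t2i], dim=1) / temperature

    # Instance-level alignment: momentum contrastive (InfoNCE) loss; index 0 is the positive.
    targets = torch.zeros(img_q.size(0), dtype=torch.long)
    loss_cl = F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets)

    # Task-level alignment: symmetric KL between the i2t and t2i distributions,
    # both computed over the same-sized negative queues.
    p_i2t = F.log_softmax(logits_i2t, dim=1)
    p_t2i = F.log_softmax(logits_t2i, dim=1)
    loss_kl = 0.5 * (
        F.kl_div(p_i2t, p_t2i, log_target=True, reduction="batchmean")
        + F.kl_div(p_t2i, p_i2t, log_target=True, reduction="batchmean")
    )
    return loss_cl + loss_kl

loss = contrastive_and_kl(torch.randn(8, dim), torch.randn(8, dim),
                          torch.randn(8, dim), torch.randn(8, dim))
print(loss.item())
```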

Scene 4 (1m 35s)

[Audio] We compare our COTS with the state-of-the-art methods on Flickr and MSCOCO. As shown in the table, COTS outperforms all two-stream models by large margins on all evaluation metrics. Without using extra object detectors, our COTS† achieves a new state of the art on Flickr with respect to both single-stream and two-stream methods. On MSCOCO, COTS† also achieves higher performance than most single-stream methods and comparable results to the rest, but with a 10,800× faster inference speed.
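
[Code sketch] The metrics in such cross-modal retrieval tables are typically Recall@K, i.e. the fraction of queries whose ground-truth match appears among the top-K ranked results. A minimal sketch, assuming one relevant gallery item per query:

```python
import torch

def recall_at_k(scores: torch.Tensor, gt_index: torch.Tensor, k: int) -> float:
    """scores: (num_queries, num_gallery) similarity matrix;
    gt_index: (num_queries,) index of the correct gallery item per query."""
    topk = scores.topk(k, dim=-1).indices                  # (num_queries, k)
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)    # correct item in top-k?
    return hits.float().mean().item()

scores = torch.randn(100, 1000)   # hypothetical query-gallery similarities
gt = torch.arange(100)            # assume query i matches gallery item i
print(recall_at_k(scores, gt, k=1), recall_at_k(scores, gt, k=5))
```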

Scene 5 (2m 12s)

[Audio] We further compare our COTS with the state-of-the-art methods on the video-text retrieval task. COTS significantly outperforms the state of the art even without modeling the temporal information of videos, which demonstrates its general applicability and great potential. COTS also leads to better results than methods utilizing extra modalities (e.g., motion and audio) or those pre-trained on extra large-scale video data, indicating that a well pre-trained vision-language model may be the key to video-text retrieval.
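
[Code sketch] A minimal sketch of applying an image-text model to video-text retrieval without temporal modeling: sampled frames are encoded independently and pooled into one video embedding. Mean pooling is an assumption here; the talk only states that no temporal information is modeled, and the encoders below are again placeholder linear layers.

```python
import torch
import torch.nn.functional as F

image_encoder = torch.nn.Linear(2048, 256)   # placeholder for the image tower
text_encoder = torch.nn.Linear(768, 256)     # placeholder for the text tower

frames = torch.randn(4, 8, 2048)             # 4 videos, 8 sampled frames each
captions = torch.randn(4, 768)               # one caption feature per video

with torch.no_grad():
    frame_emb = F.normalize(image_encoder(frames), dim=-1)     # (4, 8, 256)
    video_emb = F.normalize(frame_emb.mean(dim=1), dim=-1)     # pool over frames
    txt_emb = F.normalize(text_encoder(captions), dim=-1)

scores = txt_emb @ video_emb.t()             # text-to-video similarity matrix
print(scores.shape)                          # torch.Size([4, 4])
```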

Scene 6 (2m 47s)

[Audio] This figure shows the visualized attention maps of our COTS on images/video frames corresponding to individual words. We can see from Figures (a)–(b) that COTS can accurately locate different objects in the same image, even fine-grained ones like "violin" and "cellphone" in Figure (a), and "hair" and "hands" in Figure (b). Figure (c) shows how COTS determines gender information. Given the word "children", COTS focuses on the faces. When recognizing "girl", COTS pays attention to the girl's long hair and pink clothes (and similarly for the word "boy"). Interestingly, COTS can also capture concepts and actions, as shown in Figure (d). It focuses on the five dancers for both "five" and "dancers", but pays more attention to the number "five", and it focuses on the feet when it comes to "jump". Figure (e) presents attention maps with respect to "stroller" on four frames from the same video, showing that COTS also works well for the video modality.
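
[Code sketch] One way such per-word attention maps can be produced, shown only as an illustration since the exact visualization procedure is not described in the talk: compare a word's token embedding against patch-level image embeddings from a ViT-style encoder and reshape the similarities into a spatial heat map.

```python
import torch
import torch.nn.functional as F

patch_emb = F.normalize(torch.randn(14 * 14, 256), dim=-1)  # ViT-style 14x14 patch grid
word_emb = F.normalize(torch.randn(256), dim=-1)            # embedding of e.g. "violin"

attn = patch_emb @ word_emb                                  # (196,) similarities
heatmap = attn.reshape(14, 14)

# Upsample to the input resolution so the map can be overlaid on the image.
heatmap = F.interpolate(heatmap[None, None], size=(224, 224),
                        mode="bilinear", align_corners=False)[0, 0]
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-6)
print(heatmap.shape)                                         # torch.Size([224, 224])
```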

Scene 7 (4m 4s)

[Audio] We investigated how to improve the performance of two-stream vision-language pre-training while still maintaining its advantage of high efficiency. Specifically, we proposed a novel COllaborative Two-Stream VLP model termed COTS by leveraging three levels of cross-modal interaction. Extensive experiments validate the effectiveness and high efficiency of COTS. It also shows general applicability, achieving a new state of the art on video-text retrieval without any modification. Thank you for listening!