An Image is Worth Multiple Words: Discovering Object-Level Concepts using Multi-Concept Prompt Learning

Scene 1 (0s)

[Audio] We present Multi-Concept Prompt Learning.

Scene 2 (4s)

[Audio] In nurseries, toddlers are shown pictures to learn new things. Teachers describe each picture using sentences that introduce new ideas, often containing unfamiliar words.

Scene 3 (15s)

[Audio] Similarly, we explore teaching machines new concepts through natural language, without requiring image annotations.

Scene 4 (24s)

[Audio] We consider language-driven visual concept discovery as a human-machine interaction process.

Scene 5 (32s)

[Audio] The human describes an image in a sentence, leaving multiple unfamiliar concepts as placeholders.

Scene 6 (38s)

[Audio] The machine then learns to link each new concept with a corresponding learnable prompt, i.e. a pseudo word, from the sentence-image pair.

Scene 7 (42s)

[Audio] Once learnt, the machine can assist the human in exploring hypothesis generation through local image editing, without concrete knowledge of the new visual concept.

Scene 8 (1m 14s)

[Audio] This capability supports discovering out-of-distribution knowledge, whether from experimental observations or from mining existing textbooks.

Scene 9 (1m 31s)

How did we do it?

Scene 10 (1m 37s)

[Audio] Textual Inversion, a prompt learning method, learns a single text embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural-language sentences to generate novel synthesised images.
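
To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of the Textual Inversion idea (our illustration, not the authors' released code): only one new pseudo-word embedding is optimised against a frozen model's denoising loss, while all network weights stay fixed. The `denoiser`, the prompt shapes, and the pseudo-word slot below are toy, hypothetical stand-ins for a pre-trained diffusion model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim, lat_dim = 32, 64

# Frozen stand-ins for the pre-trained text-encoder output and the denoiser.
frozen_prompt = torch.randn(1, 8, emb_dim)            # "a photo of <*>" tokens
denoiser = nn.Linear(8 * emb_dim + lat_dim, lat_dim)  # toy U-Net surrogate
for p in denoiser.parameters():
    p.requires_grad_(False)

# The only trainable parameter: the pseudo-word embedding v*.
pseudo_word = nn.Parameter(torch.randn(emb_dim))
opt = torch.optim.Adam([pseudo_word], lr=1e-2)

latents = torch.randn(1, lat_dim)                     # the example image

for step in range(200):
    cond = frozen_prompt.clone()
    cond[0, 3] = pseudo_word                          # splice v* into its slot
    noise = torch.randn(1, lat_dim)
    pred = denoiser(torch.cat([cond.flatten(1), latents + noise], dim=1))
    loss = F.mse_loss(pred, noise)                    # predict the added noise
    opt.zero_grad()
    loss.backward()                                   # gradients reach only v*
    opt.step()
```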

Scene 11 (1m 52s)

[Audio] However, identifying multiple unknown object-level concepts within one scene remains a complex challenge.

Scene 12 (2m 4s)

[Audio] Recent methods have resorted to cropping or masking individual images to learn multiple concepts, but these techniques require image annotations, which can be scarce or unavailable. For example, Custom Diffusion (CD) and Cones learn concepts from crops of objects, while Break-A-Scene uses masks. In contrast, our method learns object-level concepts from image-sentence pairs, aligning the cross-attention of each learnable prompt with a semantically meaningful region and enabling mask-free local editing.
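
As a rough illustration of what "aligning the cross-attention of each learnable prompt with a region" means, the sketch below computes per-token cross-attention maps over image patches; the shapes, the single-head attention, and the pseudo-word position are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_patches, n_tokens, dim = 16 * 16, 8, 32

image_feats = torch.randn(n_patches, dim)   # queries: spatial image features
token_embs  = torch.randn(n_tokens, dim)    # keys: prompt token embeddings

# Cross-attention weights: softmax over prompt tokens for every image patch.
attn = F.softmax(image_feats @ token_embs.t() / dim ** 0.5, dim=-1)

# The column for a pseudo-word token is its spatial "mask" over the image,
# which is what a region-alignment objective can supervise.
pseudo_word_index = 3                        # hypothetical position of <*>
concept_map = attn[:, pseudo_word_index].reshape(16, 16)
print(concept_map.shape)                     # torch.Size([16, 16])
```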

Scene 13 (2m 39s)

[Audio] To address this challenge, we introduce Multi-Concept Prompt Learning (MCPL), where multiple unknown "words" are simultaneously learned from a single sentence-image pair, without any image annotations.
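
Schematically (our sketch, with hypothetical names and slots), this setting differs from single-word Textual Inversion only in that several pseudo-word embeddings are optimised jointly from the one sentence-image pair:

```python
import torch
import torch.nn as nn

emb_dim = 32
# e.g. "a photo of <ball> on <rug>": two unknown concepts, two pseudo-words.
pseudo_words = nn.ParameterDict({
    "ball": nn.Parameter(torch.randn(emb_dim)),
    "rug":  nn.Parameter(torch.randn(emb_dim)),
})
opt = torch.optim.Adam(pseudo_words.parameters(), lr=1e-2)
# Per step: splice every embedding into its slot in the frozen prompt,
# run the frozen denoiser, and back-propagate the denoising loss into all
# pseudo-words simultaneously -- no crops or masks are needed.
```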

Scene 14 (2m 53s)

[Audio] To enhance the accuracy of word-concept correlation and refine attention mask boundaries, we propose three regularisation techniques: Attention Masking, Prompts Contrastive Loss, and Bind Adjective.
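
As one concrete example of these regularisers, here is a hedged sketch of an InfoNCE-style prompts contrastive term (our reading of the idea, not the released implementation); the view construction, shapes, and temperature are assumptions. The intent is that views of the same prompt attract while different prompts repel, keeping the learned concepts disentangled.

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(prompt_embs, temperature=0.07):
    """InfoNCE-style loss over per-prompt embeddings.

    prompt_embs: (n_prompts, n_views, dim), e.g. two augmented views of each
    learnable prompt's feature (assumed setup).
    """
    n, v, d = prompt_embs.shape
    z = F.normalize(prompt_embs.reshape(n * v, d), dim=-1)
    sim = z @ z.t() / temperature                  # cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-pairs

    labels = torch.arange(n).repeat_interleave(v)  # prompt id of each row
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos.fill_diagonal_(False)                      # positives: other views

    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    return -log_prob[pos].mean()

# Usage: embeddings for 3 prompts with 2 views each.
loss = prompt_contrastive_loss(torch.randn(3, 2, 16, requires_grad=True))
loss.backward()
```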

Scene 15 (3m 9s)

[Audio] We generated and collected a Multi-Concept Dataset comprising a total of 1400 images with masked objects/concepts, covering both in-distribution natural images and out-of-distribution biomedical images.

Scene 16 (3m 54s)

[Audio] Extensive quantitative comparisons on both real-world categories and biomedical images demonstrate that our method can learn new, semantically disentangled concepts.

Scene 17 (4m 42s)

[Audio] Thank you. More results can be found on our project page.