[Audio] We present Multi-Concept Prompt Learning.
[Audio] In nurseries, toddlers are shown pictures to learn new things. Teachers describe each picture in sentences that introduce new ideas, including unfamiliar words.
[Audio] Similarly, we explore teaching machines new concepts through natural language, without requiring image annotations.
[Audio] We view language-driven visual concept discovery as a human-machine interaction process.
[Audio] The human describes an image, leaving multiple unfamiliar concepts as placeholder words.
[Audio] The machine then learns to link each new concept with a corresponding learnable prompt, i.e. a pseudo word, from the sentence-image pair.
[Audio] Once learnt, the machine can assist the human in exploring hypothesis generation through local image editing, without concrete knowledge of the new visual concept.
[Audio] This opens the door to discovering out-of-distribution knowledge, either from experimental observations or from mining existing textbooks.
How did we do it?
[Audio] Textual Inversion, a prompt learning method, learns a single text embedding for a new "word" to represent an image's style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images.
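As a reference point, here is a minimal, hedged sketch of what such single-prompt optimisation looks like: one pseudo-word embedding is trained against a denoising-style objective while everything else stays frozen. ToyTextEncoder, ToyDenoiser, the token ids, and all shapes are illustrative assumptions, not Textual Inversion's actual implementation.

```python
# Toy sketch: optimise one new pseudo-word embedding with a frozen encoder/denoiser.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64
NEW_TOKEN_ID = VOCAB - 1                      # id reserved for the pseudo-word "*" (assumption)

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen text encoder: a plain embedding lookup."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)

    def forward(self, token_ids):             # (B, T) -> (B, T, DIM)
        return self.embed(token_ids)

class ToyDenoiser(nn.Module):
    """Stand-in for the frozen diffusion U-Net: predicts the added noise
    from a noisy latent conditioned on the pooled text embedding."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(DIM, DIM)

    def forward(self, noisy_latent, text_emb):
        return noisy_latent + self.proj(text_emb.mean(dim=1))

text_encoder, denoiser = ToyTextEncoder(), ToyDenoiser()
for p in list(text_encoder.parameters()) + list(denoiser.parameters()):
    p.requires_grad_(False)                   # everything is frozen ...

new_token_emb = nn.Parameter(torch.randn(DIM) * 0.01)   # ... except this one vector
optimizer = torch.optim.Adam([new_token_emb], lr=5e-3)

def encode_with_pseudo_word(token_ids):
    """Frozen lookups, except NEW_TOKEN_ID uses the learnable embedding."""
    emb = text_encoder(token_ids)
    mask = (token_ids == NEW_TOKEN_ID).unsqueeze(-1)     # (B, T, 1)
    return torch.where(mask, new_token_emb, emb)

# One training step on a (caption, image-latent) pair: the denoiser must
# recover the added noise, and the gradient flows only into new_token_emb.
caption = torch.tensor([[3, 17, NEW_TOKEN_ID, 42]])      # e.g. "a photo of *"
latent, noise = torch.randn(1, DIM), torch.randn(1, DIM)
pred = denoiser(latent + noise, encode_with_pseudo_word(caption))
loss = ((pred - noise) ** 2).mean()
loss.backward()
optimizer.step()
```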
[Audio] However, identifying multiple unknown object-level concepts within one scene remains a complex challenge.
[Audio] While recent methods have resorted to cropping or masking individual images to learn multiple concepts, these techniques require image annotations, which can be scarce or unavailable. For example, Custom Diffusion (CD) and Cones learn concepts from crops of objects, while Break-A-Scene uses masks. In contrast, our method learns object-level concepts from image-sentence pairs, aligning the cross-attention of each learnable prompt with a semantically meaningful region and enabling mask-free local editing.
[Audio] To address this challenge, we introduce Multi-Concept Prompt Learning (MCPL), where multiple unknown "words" are simultaneously learned from a single sentence-image pair, without any image annotations.
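The core difference from the single-prompt case is that several pseudo-word embeddings are optimised jointly from one caption-image pair. The sketch below illustrates only that idea; the toy vocabulary, the reserved token ids 998/999, and the one-line "denoiser" are stand-ins and do not reproduce the MCPL training code.

```python
# Toy sketch: learn two pseudo-word embeddings jointly from one sentence-image pair.
import torch
import torch.nn as nn

DIM = 64
frozen_vocab = nn.Embedding(1000, DIM)       # stand-in for the frozen text-encoder vocabulary
frozen_vocab.weight.requires_grad_(False)

# Two unknown object-level concepts in the same caption, e.g.
# "a photo of <concept-a> on a <concept-b>"; ids 998/999 are reserved for them (assumption).
pseudo_words = {998: nn.Parameter(torch.randn(DIM) * 0.01),
                999: nn.Parameter(torch.randn(DIM) * 0.01)}
optimizer = torch.optim.Adam(list(pseudo_words.values()), lr=5e-3)

def embed_caption(token_ids):
    """Frozen lookups, except the reserved ids use their learnable embeddings."""
    emb = frozen_vocab(token_ids)
    for tid, vec in pseudo_words.items():
        emb = torch.where((token_ids == tid).unsqueeze(-1), vec, emb)
    return emb

# One joint update: a single denoising-style reconstruction loss is
# back-propagated into both pseudo-word embeddings at once.
caption = torch.tensor([[3, 17, 998, 5, 999]])
latent, noise = torch.randn(1, DIM), torch.randn(1, DIM)
pred = (latent + noise) + embed_caption(caption).mean(dim=1)   # toy "denoiser"
loss = ((pred - noise) ** 2).mean()
loss.backward()
optimizer.step()
```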
[Audio] To enhance the accuracy of word-concept correlation and refine attention mask boundaries, we propose three regularisation techniques: Attention Masking, Prompts Contrastive Loss, and Bind Adjective.
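Purely for illustration, the sketch below writes down simplified versions of two of these regularisers: an attention-masked reconstruction loss and a contrastive penalty between the prompt embeddings. The threshold, shapes, and loss weight are assumptions and are intentionally simpler than the paper's formulation.

```python
# Toy sketch of two loss-style regularisers: attention masking and a prompts contrastive term.
import torch
import torch.nn.functional as F

H = W = 16                                   # toy cross-attention map resolution (assumption)

def attn_masked_recon_loss(pred_noise, true_noise, attn_maps, thresh=0.5):
    """Average the per-pixel reconstruction error only where at least one
    pseudo-word attends strongly (attn_maps: (K, H, W) in [0, 1])."""
    mask = (attn_maps.max(dim=0).values > thresh).float()       # (H, W)
    err = (pred_noise - true_noise) ** 2                        # (H, W)
    return (err * mask).sum() / mask.sum().clamp(min=1.0)

def prompts_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """Push the two learnable prompt embeddings apart by penalising high
    cosine similarity (a simplified stand-in for an InfoNCE-style objective)."""
    sim = F.cosine_similarity(emb_a, emb_b, dim=-1) / temperature
    return F.softplus(sim).mean()

# Toy usage: two pseudo-word embeddings and their cross-attention maps.
emb_a, emb_b = torch.randn(64), torch.randn(64)
attn = torch.rand(2, H, W)                   # attention map of each pseudo-word
pred, true = torch.randn(H, W), torch.randn(H, W)
total = attn_masked_recon_loss(pred, true, attn) + 0.01 * prompts_contrastive_loss(emb_a, emb_b)
```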
[Audio] We generated and collected a Multi-Concept Dataset of 1,400 images with masked objects/concepts, covering both in-distribution natural images and out-of-distribution biomedical images.
[Audio] Extensive quantitative comparisons on both real-world categories and biomedical images demonstrate that our method can learn new, semantically disentangled concepts.
[Audio] Thank you. More results can be found on our project page.