WP4: Integrative-AI for Multi-Modal Perception

Task Lead: UNITN, PI: Nicu Sebe, Co-PI: Elisa Ricci

We will investigate new paradigms and architectures for cross-modal content generation, understanding, and search, considering continual and lifelong learning strategies, dealing with scarce labeled data and multimodal fusion strategies, and studying and implementing novel approaches for deep network distillation, compression, pruning, and quantization.

Connected to TP Vision, Languages in Multimodal Challenges (TP2).

Effort: 30 person-months + cascading funds.


Description of work 

 

This WP addresses new challenges in the efficient processing of multiple modalities, in understanding their content with minimal supervision, and in supporting new forms of cross-modal content creation that deal with synthetic and fake data.

 

Task 2.4.1 – Cross-modal understanding and content generation (Task leader: Nicu Sebe)

The work will focus on effectively dealing with multimodal data, spanning visual (images, videos) and textual content and exploiting symbolic knowledge, by pursuing three overarching goals:

1) Cross-modal retrieval of text/audio-to-visual data and vice versa: design intelligent systems capable of fully comprehending multimodal digital input and of retrieving images/videos from textual content and vice versa. We will also design latent multimodal representations of textual, audio, and visual content that lie in the same space.

2) Multimodal generation of visual and textual data: achieve seamless cross-media generation between different modalities by developing new generative methodologies in a semi-supervised or unsupervised way.

3) Validation of cross-modal retrieval and generation in real-world use cases: new generations of AI systems that understand visual/audio/textual content with super-human capabilities will have disruptive effects in applications.

The research activities in this task will benefit from recent advances in the AI community on foundation models built from multi-modal data, such as CLIP, which is trained on image-text pairs from the Internet, and MERLOT, which is trained on YouTube videos with transcribed speech. We will demonstrate that the proposed models enable relevant large-scale applications in different industrial and societal contexts.
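To make the shared-latent-space idea in goal 1 concrete, the following minimal sketch (PyTorch; the CrossModalEmbedder module, encoder dimensions, and training setup are illustrative assumptions, not the project's final design) shows how image and text features can be projected into a common space and aligned with a symmetric contrastive objective in the spirit of CLIP.

# Minimal sketch of a shared image-text embedding space trained with a
# symmetric contrastive (InfoNCE) objective, in the spirit of CLIP.
# Encoder dimensions and module names are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, shared_dim=512):
        super().__init__()
        # Project each modality into the same latent space.
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt, self.logit_scale.exp()

def contrastive_loss(img, txt, scale):
    # Similarity matrix between all image-text pairs in the batch.
    logits = scale * img @ txt.t()
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with pre-extracted backbone features (batch of 8 paired samples).
model = CrossModalEmbedder()
image_feats = torch.randn(8, 2048)   # e.g. pooled CNN/ViT features
text_feats = torch.randn(8, 768)     # e.g. transformer [CLS] embeddings
img, txt, scale = model(image_feats, text_feats)
loss = contrastive_loss(img, txt, scale)
loss.backward()

Once trained, nearest-neighbour search in the shared space directly supports both text-to-image and image-to-text retrieval.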

 

Task 2.4.2 – Learning from multi-modal data streams with minimal supervision (Task leader: Elisa Ricci)

In this task the activity will focus on the study and development of novel deep learning frameworks for learning from multimodal data streams. In particular, we plan to advance the state of the art in deep continual learning by:

1) introducing novel algorithms that exploit recent advances in self-supervised and contrastive learning to learn incrementally in the absence of labeled data;

2) studying new approaches for fusing data from different modalities, considering both the design of specialized architectures (e.g., attention-based transformers, as sketched below) and novel training schemes (e.g., co-training);

3) proposing approaches for updating a deep model while simultaneously handling both semantic and domain shift, i.e., learning a model that effectively incorporates knowledge when new data in the stream correspond to labels and modalities different from those previously observed.
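As an illustration of the attention-based fusion mentioned in point 2, the sketch below (PyTorch; the CrossAttentionFusion module, dimensions, and token counts are illustrative assumptions) lets two modality streams attend to each other before pooling them into a joint embedding.

# Minimal sketch of attention-based fusion of two modality streams
# (e.g. visual and audio tokens). Layer sizes are illustrative only.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Each modality attends to the other (cross-attention).
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        self.head = nn.Linear(2 * dim, dim)

    def forward(self, tokens_a, tokens_b):
        # Modality A queries modality B, and vice versa.
        fused_a, _ = self.attn_a(tokens_a, tokens_b, tokens_b)
        fused_b, _ = self.attn_b(tokens_b, tokens_a, tokens_a)
        a = self.norm_a(tokens_a + fused_a).mean(dim=1)  # pool over tokens
        b = self.norm_b(tokens_b + fused_b).mean(dim=1)
        return self.head(torch.cat([a, b], dim=-1))

# Usage: batch of 4 samples, 16 visual tokens and 32 audio tokens.
fusion = CrossAttentionFusion()
visual_tokens = torch.randn(4, 16, 256)
audio_tokens = torch.randn(4, 32, 256)
joint = fusion(visual_tokens, audio_tokens)   # (4, 256) joint embedding

In a continual setting, the joint embedding produced by such a module would feed a self-supervised or contrastive objective, so the fusion block can be updated on the stream without labels.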

 

Task 2.4.3 – Resource-aware cross-modal deep architectures (Task leader: Elisa Ricci)

The work will focus on addressing one of the most important limitations of AI systems for processing multi-modal data, namely their reliance on large-scale, complex deep architectures with a computationally costly training phase. Novel approaches for deep network distillation, compression, pruning, and quantization will be studied and implemented, with a special focus on designing models that exploit the feature redundancies arising from processing multi-modal data. We will also investigate more recent techniques based on Neural Architecture Search, as well as strategies to dynamically adapt the neural network architecture according to specific requirements in terms of computational cost or energy consumption. Finally, the developed solutions will be deployed and tested in relevant applications, e.g., multi-modal behavior analysis in the context of Human-Robot Interaction.
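As a concrete illustration of two of the compression techniques named above, the following PyTorch sketch (with a toy teacher/student pair; all sizes and the 50% pruning ratio are illustrative assumptions, not project deliverables) combines knowledge distillation with magnitude-based weight pruning.

# Minimal sketch of knowledge distillation (student mimics a larger
# teacher) followed by magnitude pruning of the student's weights.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

teacher = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))
student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soft-target KL divergence between teacher and student predictions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

x = torch.randn(16, 512)                      # a batch of fused features
with torch.no_grad():
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits)
loss.backward()

# Magnitude pruning: zero out the 50% smallest weights of each linear layer.
for module in student:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")        # make the pruning permanent

sparsity = (student[0].weight == 0).float().mean().item()
print(f"Layer 0 sparsity after pruning: {sparsity:.0%}")

In the project, such techniques would be combined with quantization and Neural Architecture Search and tuned to the redundancies specific to multi-modal feature streams.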