TP2: Vision, Language and Multimodal Challenges

How can we create AI agents capable of perception in real, complex environments that combine multiple modalities (text, speech, images, video, …)?