How to create AI agents capable of perception in complex, real-world environments that combine multiple modalities (text, speech, images, video, …).