Recent progress in the Computer Vision and Natural Language Processing communities has made it possible to connect Vision and Language in a variety of tasks that lie at the intersection of Vision, Language, and Embodied AI. These tasks range from generating meaningful descriptions of images, to answering questions, to navigating agents in unseen environments via natural language instructions. This integration has grown to the point that it is becoming pervasive in the literature and a fundamental tool for developing AI algorithms. The lecture will provide an overview of these advancements, focusing on our recent work. We will delve into cutting-edge techniques for generating text from images and videos, addressing the controllability of AI systems with human involvement, and training large-scale models using web-based datasets. Additionally, we will explore the application of these approaches to embodied agents, which interact with the physical world for tasks like navigation and other embodied activities. Throughout the talk, we will emphasize the importance of developing appropriate evaluation metrics and discuss the emerging challenges in the field.
Lorenzo Baraldi is a Tenure Track Assistant Professor at the University of Modena and Reggio Emilia. He works under the supervision of Prof. Rita Cucchiara on Deep Learning, Video Analysis and Multimedia, and teaches the courses “Computer Vision and Cognitive Systems” and “Scalable AI”. His research has covered Egocentric Vision and Gesture Recognition, Temporal Video Segmentation and Retrieval, Saliency, Video Captioning, Visual-Semantic alignment and Embodied AI. He is the author of more than 80 publications in international journals and conferences, and serves as Associate Editor for Pattern Recognition Letters and as Area Chair for major multimedia conferences. He has been elected as a Scholar in the ELLIS society, the European Laboratory for Learning and Intelligent Systems, and coordinates the Modena ELLIS Unit. Since 2021, he has served as deputy director of the Interdepartmental Center on Digital Humanities of the University of Modena and Reggio Emilia. In 2017, he worked in the Facebook AI Research laboratory in Paris, under the supervision of Hervé Jégou, where he developed a video copy detection algorithm that has been adopted in production on the social network.
Zoom & Password: 148148
PDF & VIDEO