From Images to Text: New Forms of Human-AI Interaction

Author/s

Lorenzo Baraldi (University of Modena and Reggio Emilia)

About the resource/s
Recent progress in the Computer Vision and Natural Language Processing communities has made it possible to connect vision and language in a variety of tasks at the intersection of Vision, Language, and Embodied AI. These tasks range from generating meaningful descriptions of images to answering questions and guiding agents through unseen environments via natural language instructions. This integration has grown to the point that it is becoming pervasive in the literature and a fundamental tool for developing AI algorithms. The lecture will provide an overview of these advancements, focusing on our recent works. We will delve into cutting-edge techniques for generating text from images and videos, addressing the controllability of AI systems with human involvement, and training large-scale models on web-based datasets. Additionally, we will explore the application of these approaches to embodied agents, which interact with the physical world for tasks such as navigation and other embodied activities. Throughout the talk, we will emphasize the importance of developing appropriate evaluation metrics and discuss the emerging challenges in the field.
Media