People easily learn how to change a flat car tire or perform resuscitation by observing other people doing the same task, for example in an instructional video. This involves advanced visual intelligence abilities, such as interpreting sequences of human actions that manipulate objects to achieve a specific task. Currently, however, no artificial system has a similar level of cognitive visual competence. In this talk, I will describe our recent progress on learning how people manipulate objects from instructional videos, and demonstrate transferring the learnt skill to a robotic manipulator.
Josef Sivic holds a distinguished researcher position at the Institute of Robotics, Informatics and Cybernetics at the Czech Technical University in Prague, where he heads the Intelligent Machine Perception team and the ELLIS Unit Prague. He received his habilitation degree from Ecole Normale Superieure in Paris in 2014 and his PhD from the University of Oxford in 2006. After his PhD, he was a post-doctoral associate at the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. He received the British Machine Vision Association Sullivan Thesis Prize, three test-of-time awards at major computer vision conferences, an ERC Starting Grant and, in 2023, an ERC Advanced Grant.