Text-driven 3D human motion synthesis

Tuesday 26th November 2024 17:00 CET

Associate Prof. Gül Varol

ABSTRACT

We pose the question: is human motion a language without words? More specifically, can human motions be described or controlled by words? As an effort to answer this question, we develop methods for generating 3D human body movements given textual descriptions. This talk will present a series of works, each investigating increasingly fine-grained semantic control, allowing both simultaneous actions and sequences of actions. We will further describe follow-up works on text-to-motion retrieval and text-based motion editing. Our approaches employ variational autoencoders or diffusion models with transformer architectures, representing 3D motion as a sequence of parametric SMPL body models. The promising results underscore the potential of text-conditioned generative models in this domain, while limitations point to the need for future work on scaling up training data to unlock a broader vocabulary of action descriptions. The relevant reading materials are ACTOR, TEMOS, TMR, STMC [Petrovich 2021, 2022, 2023, 2024] and TEACH, SINC, MotionFix [Athanasiou 2022, 2023, 2024].

LECTURER SHORT CV

Gül Varol is a permanent researcher (~Assist. Prof.) in the IMAGINE group at École des Ponts ParisTech. Previously, she was a postdoctoral researcher at the University of Oxford (VGG), working with Andrew Zisserman. She obtained her PhD from the WILLOW team of Inria Paris and École Normale Supérieure (ENS). Her thesis, co-advised by Ivan Laptev and Cordelia Schmid, received PhD awards from ELLIS and AFRIF. During her PhD, she spent time at MPI, Adobe, and Google. Prior to that, she received her BS and MS degrees from Boğaziçi University. She regularly serves as an Area Chair at major computer vision conferences and is serving as a Program Chair at ECCV’24. She is an associate editor for IJCV and was on the award committee for ICCV’23. She has co-organized a number of workshops at CVPR, ICCV, ECCV, and NeurIPS. Her research interests cover vision and language applications, including video representation learning, human motion synthesis, and sign languages.