Learning robust and powerful representations is at the core of many problems in multimedia, including content representation, multi-modal fusion, and social signals. While the supervised and self-supervised learning paradigms have shown great progress in many applications, the learned representations are strongly tailored to one application or domain, and adapting them to a different scenario or dataset may require large amounts of data, which are not always available. Deep probabilistic models provide an opportunity to exploit various unsupervised mechanisms that enable several interesting properties. First, they can be combined with other deep or shallow probabilistic models within the same methodological framework. Second, they can include unsupervised mixture mechanisms useful for on-the-fly modality and/or model selection. Third, they are naturally suited not only to unsupervised learning but also to unsupervised adaptation, thus overcoming a potential domain shift with little data. In this talk, we will discuss the methodology of deep probabilistic models, i.e., variational learning, and showcase their usefulness for multi-modal applications with auditory and visual data of human activities (speech and motion).