Variational Audio-Visual Representation Learning