Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz*, Yossi Adi*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review


We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/.

Original languageAmerican English
Title of host publicationProceedings of the 38th AAAI Conference on Artificial Intelligence
Subtitle of host publicationAAAI-24 Technical Tracks 7
EditorsMichael Wooldridge, Jennifer Dy, Sriraam Natarajan
PublisherAssociation for the Advancement of Artificial Intelligence
Number of pages9
ISBN (Electronic)1577358872, 9781577358879
StatePublished - 25 Mar 2024
Event38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 202427 Feb 2024

Publication series

NameProceedings of the AAAI Conference on Artificial Intelligence
ISSN (Print)2159-5399
ISSN (Electronic)2374-3468


Conference38th AAAI Conference on Artificial Intelligence, AAAI 2024

Bibliographical note

Publisher Copyright:
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


Dive into the research topics of 'Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation'. Together they form a unique fingerprint.

Cite this