JOINT AUDIO AND SYMBOLIC CONDITIONING FOR TEMPORALLY CONTROLLED TEXT-TO-MUSIC GENERATION

Or Tal, Alon Ziv, Itai Gat, Felix Kreuk, Yossi Adi

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

Abstract

We present JASCO, a temporally controlled text-to-music generation model utilizing both symbolic and audio-based conditions. JASCO can generate high-quality music samples conditioned on global text descriptions along with fine-grained local controls. JASCO is based on the Flow Matching modeling paradigm together with a novel conditioning method that allows for both locally (e.g., chords) and globally (text description) controlled music generation. Specifically, we apply information bottleneck layers in conjunction with temporal blurring to extract relevant information with respect to specific controls. This allows the incorporation of both symbolic and audio-based conditions in the same text-to-music model. We experiment with various symbolic control signals (e.g., chords, melody), as well as with audio representations (e.g., separated drum tracks, full-mix). We evaluate JASCO considering both generation quality and condition adherence using objective metrics and human studies. Results suggest that JASCO is comparable to the evaluated baselines considering generation quality while allowing significantly better and more versatile controls over the generated music. Samples are available on our demo page: https://pages.cs.huji.ac.il/adiyoss-lab/JASCO.
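
As a rough illustration of the conditioning idea named in the abstract (temporal blurring followed by an information bottleneck), the sketch below averages frame-level features over fixed windows and then projects them to a lower dimension. It is not the authors' implementation: the function names `temporal_blur` and `bottleneck`, the non-overlapping mean-pooling scheme, and the fixed linear projection are all assumptions made for this example, and the abstract does not specify any of these details.

```python
from typing import List

Frame = List[float]


def temporal_blur(frames: List[Frame], window: int) -> List[Frame]:
    """Replace every frame with the mean of its non-overlapping window.

    Hypothetical helper: mean pooling is an assumed form of blurring.
    The output keeps the original number of frames.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    blurred: List[Frame] = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        mean = [sum(f[d] for f in chunk) / len(chunk) for d in range(dim)]
        # Repeat the window mean once per frame in the window.
        blurred.extend([list(mean) for _ in chunk])
    return blurred


def bottleneck(frames: List[Frame], weights: List[List[float]]) -> List[Frame]:
    """Project each frame with a fixed linear map (rows: out_dim, cols: in_dim).

    Hypothetical stand-in for a learned low-dimensional bottleneck layer.
    """
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in frames
    ]


# Toy audio-derived features: 5 frames, 2 dimensions each.
features = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0], [1.0, 1.0]]
blurred = temporal_blur(features, window=2)
compressed = bottleneck(blurred, weights=[[0.5, 0.5]])
```

In this toy setup, blurring removes detail finer than the window length and the projection reduces each frame to a single value, so only coarse information from the conditioning signal remains available to a downstream model.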

Original language: English
Title of host publication: Proceedings of the International Society for Music Information Retrieval Conference
Publisher: International Society for Music Information Retrieval
Pages: 264-271
Number of pages: 8
State: Published - 2024

Publication series

Name: Proceedings of the International Society for Music Information Retrieval Conference
Volume: 2024
ISSN (Electronic): 3006-3094

Bibliographical note

Publisher Copyright:
© O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi.
