CLIP-Flow: Decoding images encoded in CLIP space

Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To make effective use of the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from that information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, Real NVP is employed and modified with activation normalization and invertible convolution. Because images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. The experiments show that our approach generates high-quality images that match the text and is comparable to state-of-the-art methods, both qualitatively and quantitatively.
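The bridging component named in the abstract (Real NVP with activation normalization and invertible convolution) can be illustrated with a minimal, hedged sketch. This is not the authors' implementation. It is a toy, pure-Python flow step on a hypothetical 4-dimensional vector, assuming fixed (not learned) parameters, a unit lower-triangular matrix standing in for the invertible convolution (on a flat vector, a 1x1 convolution reduces to a matrix product), and a hand-written function `_scale_shift` standing in for the coupling network. All names and values below are assumptions made for illustration.

```python
import math

# Hypothetical fixed parameters; in a trained flow these would be learned.
SCALE = [1.5, 0.5, 2.0, 1.0]
BIAS = [0.1, -0.2, 0.0, 0.3]
W = [  # unit lower-triangular, hence always invertible
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [-0.3, 0.2, 1.0, 0.0],
    [0.1, 0.0, -0.4, 1.0],
]

def actnorm(x):
    """Activation normalization: per-dimension affine map y = s * x + b."""
    return [s * xi + b for xi, s, b in zip(x, SCALE, BIAS)]

def actnorm_inv(y):
    return [(yi - b) / s for yi, s, b in zip(y, SCALE, BIAS)]

def linear(x):
    """Invertible linear map y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def linear_inv(y):
    """Invert the unit lower-triangular W by forward substitution."""
    x = []
    for i, row in enumerate(W):
        x.append(y[i] - sum(row[j] * x[j] for j in range(i)))
    return x

def _scale_shift(xa):
    """Toy stand-in for the coupling network: returns (log-scale, shift)."""
    m = sum(xa) / len(xa)
    return math.tanh(m), 0.5 * m

def coupling(x):
    """Affine coupling: the first half passes through and conditions the second."""
    h = len(x) // 2
    xa, xb = x[:h], x[h:]
    log_s, t = _scale_shift(xa)
    return xa + [xi * math.exp(log_s) + t for xi in xb]

def coupling_inv(y):
    h = len(y) // 2
    ya, yb = y[:h], y[h:]
    log_s, t = _scale_shift(ya)  # ya equals xa, so the same parameters are recovered
    return ya + [(yi - t) * math.exp(-log_s) for yi in yb]

def flow_forward(x):
    return coupling(linear(actnorm(x)))

def flow_inverse(z):
    return actnorm_inv(linear_inv(coupling_inv(z)))

x = [0.2, -1.0, 0.7, 1.3]
z = flow_forward(x)
x_back = flow_inverse(z)
```

Every layer in the sketch has an exact inverse, so `flow_inverse(flow_forward(x))` recovers `x` up to floating-point error. That invertibility is the property that makes a normalizing flow usable as a mapping between two spaces of equal dimension.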

Original language: English
Journal: Computational Visual Media
DOIs
State: Accepted/In press - 2024

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.

Keywords

  • contrastive language-image pretraining (CLIP)
  • flow
  • image-to-image
  • StyleGAN
  • text-to-image
