TY - JOUR
T1 - CLIP-Flow: Decoding images encoded in CLIP space
T2 - Computational Visual Media
AU - Ma, Hao
AU - Li, Ming
AU - Yang, Jingyuan
AU - Patashnik, Or
AU - Lischinski, Dani
AU - Cohen-Or, Daniel
AU - Huang, Hui
N1 - Publisher Copyright:
© The Author(s) 2024.
PY - 2024
Y1 - 2024
AB - This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from that information. To bridge the embedding space of CLIP and the latent space of StyleGAN, Real NVP is employed and modified with activation normalization and invertible convolution. Because images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we evaluated text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. Experiments validated that our approach generates high-quality, text-matching images and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
KW - contrastive language-image pretraining (CLIP)
KW - flow
KW - image-to-image
KW - StyleGAN
KW - text-to-image
UR - http://www.scopus.com/inward/record.url?scp=85202498357&partnerID=8YFLogxK
DO - 10.1007/s41095-023-0375-z
M3 - Article
AN - SCOPUS:85202498357
SN - 2096-0433
JO - Computational Visual Media
JF - Computational Visual Media
ER -