From
Style Transfer
to
Text-Driven Image Manipulation

湯沂達

台灣人工智慧學校技術處

2021/12/10

Replay

Medium :

類神經影像藝術風格轉換系列筆記－基礎

資料 : {(影像,文字)} : OpenAI 2021 一月

用CLIP達成文字操弄/生成影像

講者

湯沂達

Email
changethewhat@gmail.com / yidar@aiacademy.tw

Did you see these?

(A) https://nightcafe.studio/blogs/blog/top-20-ai-generated-artworks

(B) https://twitter.com/CitizenPlain/status/1316760510709338112/photo/1
(C)https://www.ettoday.net/news/20210616/2007703.htm

(D) https://github.com/orpatashnik/StyleCLIP

<= only with

text & input

A

C

B

D

Content

Style Transfer
GAN & StyleGAN
Image Manipulation with StyleGAN
Text Driven Image Manipulation/Generation
Related Topics
Applications/Resource List
Paper List

This Talk

Spirit of some famous methods

Prerequisite

Mean / Std
Convolution & Activation
Loss Function
Gradient Descent

This Talk

Spirit of some famous methods

Warning

The following pages have some math equations.

However, I will explain them from the idea of the algorithms, not from the equations.

Style Transfer

Before Style Transfer

How to summarize texture?

https://paperswithcode.com/dataset/psu-near-regular-texture-database

Before Style Transfer

How to summarize texture?

Define some handmade feature representation
like color, gradient, frequency...
Then use statistics

Image => Feature Extract => Summarize => Distance

https://paperswithcode.com/dataset/psu-near-regular-texture-database

How to summarize texture?

Image

Feature
Extract

Summarize

Distance

Distance measure for vector/distribution

Manjunath , B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. pattern analysis and machine intelligence , 18 (8), 837IEEE Transactions on 52 842.

Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.

How to summarize texture?

Gabor Filter Bank

$ (H, W, 1) \rightarrow (H, W, f \cdot \theta)$

Image

Feature
Extract

Summarize

Distance

use $\mu, \sigma$

$ (H, W, f \cdot \theta) \rightarrow (2\cdot f \cdot \theta)$

Distance measure for vector/distribution

Manjunath , B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. pattern analysis and machine intelligence , 18 (8), 837IEEE Transactions on 52 842.

Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.

How to summarize texture?

Gabor Filter Bank

$ (H, W, 1) \rightarrow (H, W, f \cdot \theta)$

Image

Feature
Extract

Summarize

Distance

use $\mu, \sigma$

$ (H, W, f \cdot \theta) \rightarrow (2\cdot f \cdot \theta)$

Use Histogram

$ (H, W, 1)_{\in \{0,1,...,9\}} \rightarrow$

Distance measure for vector/distribution

Manjunath , B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. pattern analysis and machine intelligence , 18 (8), 837IEEE Transactions on 52 842.

Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.

9

Rotation Invariant Local Binary Pattern

$ (H, W, 1)_{\in \{0,1,...,255\}} \rightarrow (H, W, 1)_{\in \{0,1,...,9\}}$

How to summarize texture?

Gabor Filter Bank

$ (H, W, 1) \rightarrow (H, W, f \cdot \theta)$

Image

Feature
Extract

Summarize

Distance

use $\mu, \sigma$

$ (H, W, f \cdot \theta) \rightarrow (2\cdot f \cdot \theta)$

Use Histogram

$ (H, W, 1)_{\in \{0,1,...,9\}} \rightarrow$

Distance measure for vector/distribution

Manjunath , B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. pattern analysis and machine intelligence , 18 (8), 837IEEE Transactions on 52 842.

Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.

9

<

$\geq$

Thresholding

Rotation Invariant Local Binary Pattern

$ (H, W, 1)_{\in \{0,1,...,255\}} \rightarrow (H, W, 1)_{\in \{0,1,...,9\}}$

11000000

How to summarize texture?

Gabor Filter Bank

$ (H, W, 1) \rightarrow (H, W, f \cdot \theta)$

Rotation Invariant Local Binary Pattern

$ (H, W, 1)_{\in \{0,1,...,255\}} \rightarrow (H, W, 1)_{\in \{0,1,...,9\}}$

Image

Feature
Extract

Summarize

Distance

use $\mu, \sigma$

$ (H, W, f \cdot \theta) \rightarrow (2\cdot f \cdot \theta)$

Use Histogram

$ (H, W, 1)_{\in \{0,1,...,9\}} \rightarrow$

Distance measure for vector/distribution

Manjunath , B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. pattern analysis and machine intelligence , 18 (8), 837IEEE Transactions on 52 842.

Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.

9

rotation invariant

<

$\geq$

Thresholding

<

>

How to summarize data?

Data

Feature
Extract

Describe

Distance

If we have good feature extractor...

Distance measure for vector/distribution

Describe data w/ or w/o statistic...

A Cute Dog Staring You

Feature
Extract

Distance

If we have good feature extractor...

Use Other Task's Pretrained Weight
Create It By Yourself

Distance measure for vector/distribution

Describe

Describe data w/ or w/o statistic...

A Cute Dog Staring You

How to summarize data?

Data

Style Transfer

Lin, Tianwei, et al. "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer." arXiv preprint arXiv:2104.05376 (2021).

Content

Style

Styllized

Objective

Find a stylized image, which has

Content image's content
Style image's style

Style Transfer

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

Style Transfer

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

Style Transfer

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

Style Transfer

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

Image Optimization (Inference $\equiv$ train:minutes)
Find an image
Model Optimization (Inference:real time; Train:hours)
Find a model can transfer image
1. Per-Style-Per-Model (PSPM)
  Model contain 1 style
2. Multiple-Style-Per-Model (MSPM)
  Model contain n style
3. Arbitrary-Style-Per-Model (ASPM)
  Model contain any style

Style Transfer

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

Image Optimization (Inference $\equiv$ train:minutes)
Find an image
Model Optimization (Inference:real time; Train:hours)
Find a model can transfer image
1. Per-Style-Per-Model (PSPM)
  Model contain 1 style
2. Multiple-Style-Per-Model (MSPM)
  Model contain n style
3. Arbitrary-Style-Per-Model (ASPM)
  Model contain any style

Style Transfer

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

Image Optimization (Inference $\equiv$ train:minutes)
Find an image
Model Optimization (Inference:real time; Train:hours)
Find a model can transfer image
1. Per-Style-Per-Model (PSPM)
  1 style
2. Multiple-Style-Per-Model (MSPM)
  n styles
3. Arbitrary-Style-Per-Model (ASPM)
  any style

Image style transfer using convolutional neural networks

Notes

Arxiv : 1508.06576
First paper for "Neural" Style Transfer
Get the result by optimize the image
Plenty of later papers use their loss function
Cost minutes to generate an image
Cited by 3267 at 2021 Nov

Feature
Extract

Distance

If we have good feature extractor...

Use Other Task's Pretrained Weight
Create It By Yourself

Distance measure for vector/distribution

Describe

Describe data w/ or w/o statistic...

A Cute Dog Staring You

Data

VGG

(A Pretrained Model)

$\mathcal{L}_{content}$ : Feature tensor close to content image's feature tensor

$\mathcal{L}_{style}$ : Stat(feature) close to style image's stat(feature)

$\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}$

$\hat{I}$

VGG

(A Pretrained Model)

$\mathcal{L}_{content}$=( - )$^2$

$\mathcal{L}_{content}$ : Feature tensor close to content image's feature tensor

$\mathcal{L}_{style}$ : Stat(feature) close to style image's stat(feature)

$\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}$

$\hat{I}$

VGG

(A Pretrained Model)

$\mathcal{L}_{style}$=(G( )-G( ))$^2$

$\mathcal{L}_{content}$=( - )$^2$

$\mathcal{L}_{content}$ : Feature tensor close to content image's feature tensor

$\mathcal{L}_{style}$ : Stat(feature) close to style image's stat(feature)

$\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}$

$\hat{I}$

VGG

(A Pretrained Model)

$\mathcal{L}_{style}$=(G( )-G( ))$^2$

$\mathcal{L}_{content}$=( - )$^2$

$\mathcal{L}_{content}$ : Feature tensor close to content image's feature tensor

$\mathcal{L}_{style}$ : Stat(feature) close to style image's stat(feature)

$\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}$

result $\leftarrow argmin_{\color{red}\hat{I}}\mathcal{L}_{total}({\color{red}\hat{I}})$

$\hat{I}$

VGG

(A Pretrained Model)

$\mathcal{L}_{style}$=(G( )-G( ))$^2$

$\mathcal{L}_{content}$=( - )$^2$

$\mathcal{L}_{content}$ : Feature tensor close to content image's feature tensor

$\mathcal{L}_{style}$ : Stat(feature) close to style image's stat(feature)

$\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}$

result $\leftarrow argmin_{\color{red}\hat{I}}\mathcal{L}_{total}({\color{red}\hat{I}})$

$$\hat{I} = \hat{I} - \alpha\frac{\partial \mathcal{L}_{total}}{\partial \hat{I}}$$

$\hat{I}$

VGG

(A Pretrained Model)

$\mathcal{L}_{style}$=(G( )-G( ))$^2$

$\mathcal{L}_{content}$=( - )$^2$

$\mathcal{L}_{content}$ : Feature tensor close to content image's feature tensor

$\mathcal{L}_{style}$ : Stat(feature) close to style image's stat(feature)

$\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}$

$$F \in R^{C \times X \times Y}, G(F) \in R^{C \times C}$$

$$G(F)_{c1, c2} = \frac{1}{X\cdot Y}\sum_{x, y}[F_{c1,x,y} \cdot F_{c2,x,y}]$$

$G$ : gram matrix

result $\leftarrow argmin_{\color{red}\hat{I}}\mathcal{L}_{total}({\color{red}\hat{I}})$

$$\hat{I} = \hat{I} - \alpha\frac{\partial \mathcal{L}_{total}}{\partial \hat{I}}$$

Get more abstract result while use deeper layer for content loss

Their Result

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” arXiv preprint arXiv:1603.08155 (2016).

Notes

Arxiv : 1603.08155
Get the result by optimize a model
Thier research has 2 branch :
1. style transfer
2. super resolution
Plenty of later papers use the term "Perceptual"
Per-Style-Per-Model(PSPM)
Real time while inference
Cited by 5962 at 2021 Nov

$\hat{I}$

Perceptual

$\hat{I}$

prev : $argmin_{{\color{red}\hat{I}}}\mathcal{L}_{total}({\color{red}\hat{I}})$

Perceptual

$f_W({\color{blue}I})$

Model $f_W$

prev : $argmin_{{\color{red}\hat{I}}}\mathcal{L}_{total}({\color{red}\hat{I}})$

this: $argmin_{f_W}\sum_{{\color{blue}I}\in dataset}\mathcal{L}_{total}(f_W({\color{blue}I}))$

A learned representation for artistic style

Notes

Arxiv : 1610.07629
Get the result by optimizing a model
Multiple-Style-Per-Model (MSPM)
Use conditional instance normalization(CIN) for multiple style transfer
The standard setting contain 32 styles, each style contain about 0.2% total parameters.
Real time while inference
Cited by 727 at 2021 Nov

Before A Learned Representation For Artistic Style

Color Distrbution Matching

https://www.dreamstime.com/different-times-day-night-morning-noon-evening-city-landscape-buildings-skyscrapers-urban-concept-background-isolated-image189577922

Source

Stat(R)

Stat(G)

Stat(B)

Target

Stat(R)

Stat(G)

Stat(B)

Before A Learned Representation For Artistic Style

Color Distrbution Matching

https://www.dreamstime.com/different-times-day-night-morning-noon-evening-city-landscape-buildings-skyscrapers-urban-concept-background-isolated-image189577922

Before A Learned Representation For Artistic Style

https://www.dreamstime.com/different-times-day-night-morning-noon-evening-city-landscape-buildings-skyscrapers-urban-concept-background-isolated-image189577922

Color Distrbution Matching

Target

Stat(R)

Stat(G)

Stat(B)

Normalize
($\mu=0, \sigma=1$)
($\mu=0, \sigma=1$)
($\mu=0, \sigma=1$)

Source

Stat(R)

Stat(G)

Stat(B)

A Learned Representation For Artistic Style

Each Style Use a $(\gamma, \beta)$ pair

Target

Stat(R)

Stat(G)

Stat(B)

Normalize
($\mu=0, \sigma=1$)
($\mu=0, \sigma=1$)
($\mu=0, \sigma=1$)

Source

Stat(R)

Stat(G)

Stat(B)

$f_W({\color{blue}I})$

Model $f_W$

$n \times $

Act

Conv

$n \times $

Act

$f_W({\color{blue}I})$

Model $f_W$

$n \times $

Act

This

$S_2$

$S_1$

Interpolate

$S$ =

$\alpha S_1+(1-\alpha)S_2$

$S_2$

$S_1$

Interpolate

$S_3$

$S_4$

Exploring the structure of a real-time, arbitrary neural artistic stylization network

Notes

Arxiv : 1705.06830
Get the result by optimizing a model
Arbitrary-Style-Per-Model (ASPM)
It generalized CIN for adaptive to arbitrary Style
Real time to generate an image with gpu
Cited by 180 at 2021 Nov

Prev Work

Style Prediction Network

This Work

Architecture

Small Recap

Papers:

1508.06576 Image Optimization
1603.08155 Per-Style-Per-Model (PSPM)
1610.07629 Multiple-Style-Per-Model (MSPM)
1705.06830 Arbitrary-Style-Per-Model (ASPM)

About half year a big improve

Small Recap

Papers:

1508.06576 Image Optimization
1603.08155 Per-Style-Per-Model (PSPM)
1610.07629 Multiple-Style-Per-Model (MSPM)
1705.06830 Arbitrary-Style-Per-Model (ASPM)

About half year a big improve

Not Enough?

(Methods up to March 2018, Cited by 335 at Nov 2021)

Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).

My Medium
類神經影像藝術風格轉換系列筆記－基礎

GAN & StyleGAN

StyleGAN

Why Named StyleGAN?

Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." arXiv preprint arXiv:1812.04948 (2018).

GAN

Data

Feature
Extract

Distance

If we have good feature extractor...

Use Other Task's Pretrained Weight
Create It By Yourself

Distance measure for vector/distribution

Describe

Describe data w/ or w/o statistic...

A Man With Curly Hair

GAN & StyleGAN

GAN & StyleGAN

Notes :

Input for conv is a constant tensor
Apply AdaIN
Add random noise while inference & training
(stochastic items: hair, freckles, skin pores)

StyleGAN

$w \in \mathcal{W}$

StyleMixing

Interpolate

1:10~2:07

StyleGAN

Cost

StyleGAN

Solve Artifact (00:30~1:30)

StyleGAN2

Solve Interpoloate Artifact

StyleGAN3

Tero Karras

Official Branch

Un-Official Forest

Image Manipulation with StyleGAN

Methods shamelessly taken from this video

Image Manipulation with StyleGAN

Warning: We skip a lot

Data

Feature
Extract

Distance

If we have good feature extractor...

Use Other Task's Pretrained Weight
Create It By Yourself

Distance measure for vector/distribution

Describe

Describe data w/ or w/o statistic...

A Man With Curly Hair

Image Manipulation with StyleGAN

Modify

pretrained weight / hidden output

with smart measure

Specific

Image Manipulation with StyleGAN

Contents

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks
Add/Remove semantic of GAN's output
Semantic Photo Manipulation with a Generative Image Prior
Edit Your Own Photo
Rewriting a Deep Generative Model
Edit The Generative Model(like roof=> tree)

Image Manipulation with StyleGAN

Contents

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks
Add/Remove semantic of GAN's output
Semantic Photo Manipulation with a Generative Image Prior
Edit Your Own Photo
Rewriting a Deep Generative Model
Edit The Generative Model (like roof=> tree)

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks

Image generated by GAN

Output by zeroing some activation

Step 1 : Which hidden channels have high correlation to segmentation map?

Step 1 : Which hidden channels have high correlation to segmentation map?

Step 2 : Edit these channels (to constant, to 0)

Notes
It need segment model or manual label

Step 1 : Which hidden channels have high correlation to segmentation map?

Step 2 : Edit these channels (to constant, to 0)

Official GIFs

http://gandissect.csail.mit.edu/

Semantic Photo Manipulation with a Generative Image Prior

Find best matching latents in GAN.

Bad result :(

Find best matching latents in GAN

Allow slight weight modification

Nice :)

Use previous work's editing skill

Rewriting a Deep Generative Model

00:27~00:55

$W$ : weight of layer $L $

$k$ : normal input at layer $L$

$k_*$ : selected input at layer $L$

$v_*$: desired output for $k_*$ at layer $L$

normal output should not change

change source to target

go "Example Results"

Text Driven Image Manipulation/Generation

OpenAI : CLIP
StyleCLIP
CLIPDraw & StyleCLIPDraw
My Method : StyleTransferCLIP
OpenAI : Dall E

CLIP

Connecting Text and Images

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).

dog

cat

hen

bee

Traditional Classification

CLIP

Connecting Text and Images

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).

CLIP (Contrastive Language–Image Pre-training)

dog

cat

hen

bee

Traditional Classification

# https://github.com/openai/CLIP#usage
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

StyleCLIP

StyleCLIPDraw

Image Manipulation/Generation with

1 Image, 1 Text

Use CLIP Encoder

$-CLIP_{I}(img)\cdot CLIP_{T}(text)=\mathcal{L}_{CLIP}$

"...."

2021 Sep 09

2021 Nov 12

StyleCLIP Author : A Newstar

2021 Dec 08

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. (2021 Mar)

Patashnik, Or, et al. "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." arXiv e-prints (2021): arXiv-2103.

In Style Transfer/In StyleCLIP

Image Optimization
Latent Optimization
Model Optimization
1. Per-Style-Per-Model
  Latent Mapper
2. Multiple-Style-Per-Model
  No
3. Arbitrary-Style-Per-Model
  Global Directions

In Style Transfer/In StyleCLIP

Image Optimization
Latent Optimization
Model Optimization
1. Per-Style-Per-Model
  Latent Mapper (skip)
2. Multiple-Style-Per-Model
  No
3. Arbitrary-Style-Per-Model
  Global Directions (skip)

In Style Transfer/In StyleCLIP

Image Optimization
Latent Optimization
Model Optimization
1. Per-Style-Per-Model
  Latent Mapper (skip)
2. Multiple-Style-Per-Model
  No
3. Arbitrary-Style-Per-Model
  Global Directions (skip)

StyleCLIP(GAN Inv with e4e + official mapper)

$w$

"Curly Hair"

Genertate

Latent

StyleGan
(G)

$w_s$

StyleGan

Get reconstructed latent : $w_s$

Latent optimization

Face Regonition

Same Person?

$\mathcal{L}_{ID}$

Same Description?

$\mathcal{L}_{CLIP}$

$\textcolor{red}{w^*} = argmin_{\color{red} w}\mathcal{L}_{CLIP}+\lambda_{L2}||\textcolor{red}{w}-w_s||_2 + \lambda_{ID}\mathcal{L}_{ID}$

G($w$)

CLIPDraw (2021 Jun) & StyleCLIPDraw (2021 Nov)

$\mathcal{L}_{total} = \mathcal{L}_{content}+ \beta\mathcal{L}_{style}$

CLIPDraw : $\beta = 0$

StyleCLIPDraw : $\beta > 0$

Schaldenbrand, Peter, Zhixuan Liu, and Jean Oh. "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis." arXiv preprint arXiv:2111.03133 (2021).

CLIPDraw (2021 Jun) & StyleCLIPDraw (2021 Nov)

In Style Transfer/CLIPDraw

Image Optimization
Line Parameter Optimization
Model Optimization
1. Per-Style-Per-Model
  No
2. Multiple-Style-Per-Model
  No
3. Arbitrary-Style-Per-Model
  No

Before CLIPDraw

Gradient decent from loss to curve's parameters is possible, i.e.

$$\frac{\partial \mathcal{L}}{\partial P_i}$$can be computed

Parameter for control points :

position, rgba, thickness

StyleCLIPDraw

If no Augmentation, the result is bad.

CLIPDraw Results

The Eiffel Tower

StyleCLIPDraw Results

My Method : StyleTransferCLIP

edit style embedding
$E_{initial} = SP(S)$
with $\mathcal{L}_{CLIP}$

Input Image
(C)

Style Image
(S)

Output Image
NST(C, $E_{initial}$)

CLIP Result

Next pages

E

$$argmin_{\red E}(\mathcal{L}_{CLIP}(NST(C,{\red E}), Text))$$

clouds

starry night

colorful dots

honey

jewels

hentai

golf ball

pudding

https://imgur.com/a/XVyHH4f

My Experiment on Neural Style Transfer

clouds

starry night

colorful dots

honey

jewels

hentai

golf ball

pudding

https://imgur.com/a/NSqhTqa

My Experiment on Neural Style Transfer with Augmentation

You can play my method with replicate.ai

https://replicate.com/mistake0316/style-transfer-clip

https://github.com/mistake0316/CLIPStyleTransfer

Dall E : Zero-Shot Text-to-Image Generation

# https://github.com/huggingface/tokenizers
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
# string => tokens
# token => idx => embedding

Tokenize text

# https://github.com/huggingface/tokenizers
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
# string => tokens
# token => idx => embedding

Tokenize text

Autoregressive Model (Next token prediction)

$$P_{\theta}(\textbf{x})=\Pi_{i=1}^{n} P_{\theta}(x_i|x_1, x_2, \dots, x_{i-1})$$

# https://github.com/huggingface/tokenizers
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
# string => tokens
# token => idx => embedding

Tokenize text

Autoregressive Model (Next token prediction)

$$P_{\theta}(\textbf{x})=\Pi_{i=1}^{n} P_{\theta}(x_i|x_1, x_2, \dots, x_{i-1})$$

$P_{\theta}(\text{"sunny"}|\text{"The weather is"})$

$P_{\theta}(\text{"cookie"}|\text{"The weather is"})$

$P_{\theta}(\text{"furry"}|\text{"The weather is"})$

VQ-VAE can tokenize image to $n \times n$ tokens

Image Tokenization

VQ-VAE can tokenize image to $n \times n$ tokens

Autoregressive Model (Next token prediction)

$$P_{\theta}(\textbf{x})=\Pi_{i=1}^{n} P_{\theta}(x_i|x_1, x_2, \dots, x_{i-1})$$

Image Tokenization

Autoregressive Model (Next token prediction)

"a dog is watching you"

$\color{green}x_{t_1}, x_{t_2}, \dots, x_{t_n}$

$\color{blue}x_{i_1}, x_{i_2}, \dots, x_{i_m}$

$$P_{\theta}(\textbf{x})=\Pi_{i=1}^{n} P_{\theta}(x_i|x_1, x_2, \dots, x_{i-1})$$

Autoregressive Model (Next token prediction)

$$ P_{\theta}({\color{green} {x_t}}, {\color{blue}{x_i}}) = \Pi_{p=1}^{m}P_{\theta}({\color{blue}x_{i_p}}|{\color{green} x_{t_1}, x_{t_2}, \dots, x_{t_n}},{\color{blue} x_{i_1}, x_{i_2}, \dots, x_{i_p-1}})$$

Dall E

"a dog is watching you"

$\color{green}x_{t_1}, x_{t_2}, \dots, x_{t_n}$

$\color{blue}x_{i_1}, x_{i_2}, \dots, x_{i_m}$

$$P_{\theta}(\textbf{x})=\Pi_{i=1}^{n} P_{\theta}(x_i|x_1, x_2, \dots, x_{i-1})$$

Core Concept

Image to image token with VQ-VAE
Text to text token
Concat them and make this become an next token prediction problem.

Sad Things

12-billion parameter
($\approx$ 2264 $\times$ efficient-B0)
250 milllion (image, text) pairs
($\approx$18 $\times$ ImageNet)

Core Concept

Image to image token with VQ-VAE
Text to text token
Concat them and make this become an next token prediction problem.

An Explain

Not Enough?

Paper & Code

Not Enough?

My Medium:

Paper & Code

Applications

Resource List

https://www.tensorflow.org/hub/tutorials/tf2_arbitrary_image_stylization

https://replicate.com/explore

https://ali-design.github.io/deepcreativity/

https://www.youtube.com/playlist?list=PLCpMvp7ftsnIbNwRnQJbDNRqO6qiN3EyH

Paper List

Style Transfer

Manjunath , B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. pattern analysis and machine intelligence , 18 (8), 837IEEE Transactions on 52 842.
Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.
Lin, Tianwei, et al. "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer." arXiv preprint arXiv:2104.05376 (2021).
Jing, Yongcheng, et al. "Neural Style Transfer: A Review." _arXiv preprint arXiv:1705.04058_ (2017).
Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "Image style transfer using convolutional neural networks." _Proceedings of the IEEE conference on computer vision and pattern recognition_. 2016.
Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. "Perceptual Losses for Real-Time Style Transfer and Super-Resolution." arXiv preprint arXiv:1603.08155 (2016).
Dumoulin, Vincent, Jonathon Shlens, and Manjunath Kudlur. "A learned representation for artistic style." _arXiv preprint arXiv:1610.07629_ (2016).
Ghiasi, Golnaz, et al. "Exploring the structure of a real-time, arbitrary neural artistic stylization network." _arXiv preprint arXiv:1705.06830_ (2017).

GAN & StyleGAN & StyleGAN Manipulation

Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." _arXiv preprint arXiv:1812.04948_ (2018).
Karras, Tero, et al. "Analyzing and Improving the Image Quality of StyleGAN." arXiv preprint arXiv:1912.04958 (2019).
Karras, Tero, et al. "Alias-Free Generative Adversarial Networks." _arXiv preprint arXiv:2106.12423_ (2021).
Bau, David, et al. "Gan dissection: Visualizing and understanding generative adversarial networks." _arXiv preprint arXiv:1811.10597_ (2018).
Bau, David, et al. "Semantic photo manipulation with a generative image prior." _arXiv preprint arXiv:2005.07727_ (2020).
Bau, David, et al. "Rewriting a deep generative model." _European Conference on Computer Vision_. Springer, Cham, 2020.

Text Driven Image Manipulation/Genearation

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." _arXiv preprint arXiv:2103.00020_ (2021).
Patashnik, Or, et al. "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." arXiv preprint arXiv:2103.17249 (2021).
Frans, Kevin, L. B. Soros, and Olaf Witkowski. "Clipdraw: Exploring text-to-drawing synthesis through language-image encoders." arXiv preprint arXiv:2106.14843 (2021).
Schaldenbrand, Peter, Zhixuan Liu, and Jean Oh. "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis." arXiv preprint arXiv:2111.03133 (2021).
Ramesh, Aditya, et al. "Zero-shot text-to-image generation." _arXiv preprint arXiv:2102.12092_ (2021).

From Style Transfer to Text-Driven Image Manipulation

Replay

講者

Content

This Talk

Prerequisite

This Talk

Warning

Style Transfer

Before Style Transfer

Before Style Transfer

How to summarize texture?

How to summarize texture?

How to summarize texture?

How to summarize texture?

How to summarize texture?

How to summarize data?

How to summarize data?

Style Transfer

Style Transfer

Style Transfer

Style Transfer

Style Transfer

Style Transfer

Style Transfer

Image style transfer using convolutional neural networks

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

A learned representation for artistic style

Before A Learned Representation For Artistic Style

Before A Learned Representation For Artistic Style

Before A Learned Representation For Artistic Style

A Learned Representation For Artistic Style

Exploring the structure of a real-time, arbitrary neural artistic stylization network

Small Recap

Small Recap

Not Enough?

My Medium 類神經影像藝術風格轉換系列筆記－基礎

GAN & StyleGAN

StyleGAN

GAN

GAN & StyleGAN

GAN & StyleGAN

StyleGAN

StyleGAN

StyleGAN

Image Manipulation with StyleGAN

Image Manipulation with StyleGAN

Image Manipulation with StyleGAN

Image Manipulation with StyleGAN

Image Manipulation with StyleGAN

Image Manipulation with StyleGAN

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks

Semantic Photo Manipulation with a Generative Image Prior

Rewriting a Deep Generative Model

Text Driven Image Manipulation/Generation

Contents

CLIP

Connecting Text and Images

CLIP

Connecting Text and Images

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. (2021 Mar)

StyleCLIP(GAN Inv with e4e + official mapper)

CLIPDraw (2021 Jun) & StyleCLIPDraw (2021 Nov)

CLIPDraw (2021 Jun) & StyleCLIPDraw (2021 Nov)

Before CLIPDraw

StyleCLIPDraw

CLIPDraw Results

StyleCLIPDraw Results

My Method : StyleTransferCLIP

My Experiment on Neural Style Transfer

My Experiment on Neural Style Transfer with Augmentation

You can play my method with replicate.ai

Dall E : Zero-Shot Text-to-Image Generation

Takeaway

Related Topics

An old method : StarGAN(2017)

Applications

Resource List

Paper List

Style Transfer

From
Style Transfer
to
Text-Driven Image Manipulation

My Medium
類神經影像藝術風格轉換系列筆記－基礎