湯沂達
Technical Division, Taiwan AI Academy
2021/12/10
湯沂達
Email
changethewhat@gmail.com / yidar@aiacademy.tw
Did you see these?
(A) https://nightcafe.studio/blogs/blog/top-20-ai-generated-artworks
(B) https://twitter.com/CitizenPlain/status/1316760510709338112/photo/1
(C) https://www.ettoday.net/news/20210616/2007703.htm
(D) https://github.com/orpatashnik/StyleCLIP
<= only with text & input
Spirit of some famous methods
The following pages contain some math equations.
However, I will explain them from the ideas behind the algorithms, not from the equations.
How to summarize texture?
Define a handcrafted feature representation (e.g., color, gradient, frequency), then summarize it with statistics.
Image => Feature Extract => Summarize => Distance
https://paperswithcode.com/dataset/psu-near-regular-texture-database
Feature Extract
Gabor Filter Bank : \( (H, W, 1) \rightarrow (H, W, f \cdot \theta)\)
Rotation Invariant Local Binary Pattern : \( (H, W, 1)_{\in \{0,1,...,255\}} \rightarrow (H, W, 1)_{\in \{0,1,...,9\}}\)
(threshold each neighbor against the center pixel with < / \(\geq\) to get a binary code such as 11000000; grouping rotations together leaves 9 uniform patterns plus one "other" code)

Summarize
Gabor responses : use \(\mu, \sigma\), \( (H, W, f \cdot \theta) \rightarrow (2\cdot f \cdot \theta)\)
LBP codes : use a histogram over \(\{0,1,...,9\}\)

Distance
Distance measure for vector/distribution

Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837-842.
Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on Pattern Analysis and Machine Intelligence 24.7 (2002): 971-987.
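A minimal Python sketch of this handcrafted pipeline, assuming scikit-image and NumPy; the filter-bank settings (4 frequencies, 4 orientations, P=8 LBP) and the Euclidean / chi-square distances are illustrative choices, not the exact ones from the cited papers.

# Sketch of the handcrafted texture pipeline: extract -> summarize -> distance.
import numpy as np
from skimage.filters import gabor
from skimage.feature import local_binary_pattern

def gabor_descriptor(gray, freqs=(0.1, 0.2, 0.3, 0.4), thetas=4):
    """(H, W) grayscale image -> (2 * f * theta) vector of per-response mean and std."""
    stats = []
    for f in freqs:
        for t in range(thetas):
            real, imag = gabor(gray, frequency=f, theta=t * np.pi / thetas)
            mag = np.hypot(real, imag)
            stats += [mag.mean(), mag.std()]
    return np.array(stats)

def lbp_descriptor(gray, P=8, R=1.0):
    """(H, W) grayscale image -> 10-bin histogram of rotation-invariant uniform LBP codes (0..9)."""
    codes = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist

def texture_distance(img_a, img_b):
    """Euclidean distance between Gabor stats, chi-square distance between LBP histograms."""
    d_gabor = np.linalg.norm(gabor_descriptor(img_a) - gabor_descriptor(img_b))
    ha, hb = lbp_descriptor(img_a), lbp_descriptor(img_b)
    d_lbp = 0.5 * np.sum((ha - hb) ** 2 / (ha + hb + 1e-8))
    return d_gabor, d_lbp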
Data => Feature Extract => Describe => Distance
Feature Extract : if we have a good feature extractor...
Describe : describe the data with or without statistics (e.g., "A Cute Dog Staring at You")
Distance : distance measure for vector/distribution
Lin, Tianwei, et al. "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer." arXiv preprint arXiv:2104.05376 (2021).
Content / Style / Stylized
Objective
Find a stylized image which keeps the content of the content image and the style of the style image.
Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).
Notes
\(\hat{I}\) : the image being optimized
VGG (A Pretrained Model)
\(\mathcal{L}_{content}\) : feature tensor close to the content image's feature tensor
\(\mathcal{L}_{style}\) : Stat(feature) close to the style image's Stat(feature)
\(\mathcal{L}_{content}=(F(\hat{I})-F(I_{content}))^2\)
\(\mathcal{L}_{style}=(G(F(\hat{I}))-G(F(I_{style})))^2\)
\(\mathcal{L}_{total} = \mathcal{L}_{content} + \lambda \mathcal{L}_{style}\)
\(G\) : Gram matrix
$$F \in R^{C \times X \times Y}, \quad G(F) \in R^{C \times C}$$
$$G(F)_{c_1, c_2} = \frac{1}{X\cdot Y}\sum_{x, y}[F_{c_1,x,y} \cdot F_{c_2,x,y}]$$
result \(\leftarrow argmin_{\color{red}\hat{I}}\mathcal{L}_{total}({\color{red}\hat{I}})\), updated by gradient descent:
$$\hat{I} \leftarrow \hat{I} - \alpha\frac{\partial \mathcal{L}_{total}}{\partial \hat{I}}$$
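A hedged PyTorch sketch of this image optimization; `vgg_features` stands in for a pretrained VGG feature extractor returning a (C, X, Y) tensor, a single layer is used for both losses, and Adam replaces the plain gradient step shown above.

import torch

def gram(F):
    """F: (C, X, Y) feature tensor -> (C, C) Gram matrix, averaged over spatial positions."""
    C, X, Y = F.shape
    F = F.reshape(C, X * Y)
    return (F @ F.t()) / (X * Y)

def style_transfer(content_img, style_img, vgg_features, steps=300, lam=1e3, lr=0.05):
    """Optimize the image itself: result <- argmin_I  L_content + lambda * L_style."""
    I_hat = content_img.clone().requires_grad_(True)      # initialize from the content image
    opt = torch.optim.Adam([I_hat], lr=lr)
    with torch.no_grad():
        F_c = vgg_features(content_img)                    # target content features
        G_s = gram(vgg_features(style_img))                # target style statistics
    for _ in range(steps):
        F_hat = vgg_features(I_hat)
        loss_content = ((F_hat - F_c) ** 2).mean()
        loss_style = ((gram(F_hat) - G_s) ** 2).mean()
        loss = loss_content + lam * loss_style
        opt.zero_grad()
        loss.backward()
        opt.step()
    return I_hat.detach()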
The result becomes more abstract when a deeper layer is used for the content loss.
Their Result
Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” arXiv preprint arXiv:1603.08155 (2016).
Notes
Perceptual Losses : train a feed-forward model \(f_W\)
prev : \(argmin_{{\color{red}\hat{I}}}\mathcal{L}_{total}({\color{red}\hat{I}})\)
this : \(argmin_{f_W}\sum_{{\color{blue}I}\in dataset}\mathcal{L}_{total}(f_W({\color{blue}I}))\)
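A hedged training-loop sketch of the same idea: the losses are unchanged, but the minimization is now over the weights of a feed-forward network. `f_W`, `dataloader`, and `total_loss` are placeholders for the transform net, the content-image dataset, and the content + style loss above.

import torch

def train_feedforward(f_W, dataloader, total_loss, epochs=2, lr=1e-3):
    """prev: optimize each image; this: optimize the network weights once over a whole dataset."""
    opt = torch.optim.Adam(f_W.parameters(), lr=lr)
    for _ in range(epochs):
        for content in dataloader:                 # each batch of content images I
            stylized = f_W(content)                # one forward pass replaces per-image optimization
            loss = total_loss(stylized, content)   # same content + style losses as before
            opt.zero_grad()
            loss.backward()
            opt.step()
    return f_W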
Notes
Color Distribution Matching
Source : Stat(R), Stat(G), Stat(B)
Normalize each channel (\(\mu=0, \sigma=1\)), then match the target's channel statistics
Target : Stat(R), Stat(G), Stat(B)
Each style uses a \((\gamma, \beta)\) pair
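A minimal NumPy sketch of per-channel statistics matching, plus the per-style \((\gamma, \beta)\) variant (as in conditional instance normalization); array shapes and names are my own.

import numpy as np

def match_channel_stats(source, target):
    """source, target: (H, W, 3). Normalize each source channel to mu=0, sigma=1,
    then re-scale it to the target channel's statistics."""
    out = np.empty_like(source, dtype=np.float64)
    for c in range(3):
        s, t = source[..., c], target[..., c]
        normalized = (s - s.mean()) / (s.std() + 1e-8)     # mu = 0, sigma = 1
        out[..., c] = normalized * t.std() + t.mean()      # take on the target's stats
    return out

def conditional_instance_norm(feat, gamma, beta):
    """feat: (C, H, W) feature map; gamma, beta: (C,) pair chosen per style."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (feat - mu) / (sigma + 1e-8) + beta[:, None, None]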
Model \(f_W\)
prev : \(n \times\) [Conv, Act]
This : the same stack, but with a per-style \((\gamma, \beta)\) normalization inside
Interpolate between styles \(S_1, S_2, S_3, S_4\) : \(S = \alpha S_1+(1-\alpha)S_2\)
Notes
Prev Work
Style Prediction Network
This Work
Architecture
Papers : roughly every half year brings a big improvement
(Methods up to March 2018, cited by 335 as of Nov 2021)
Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).
Why Named StyleGAN?
Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." arXiv preprint arXiv:1812.04948 (2018).
Data => Feature Extract => Describe => Distance
Feature Extract : if we have a good feature extractor...
Describe : e.g., "A Man With Curly Hair"
Distance : distance measure for vector/distribution
Notes :
\(w \in \mathcal{W}\)
StyleMixing
Interpolate
1:10~2:07
Cost
Solve Artifact (00:30~1:30)
StyleGAN2
Solve Interpolation Artifact
StyleGAN3
Tero Karras
Un-Official Forest
Methods shamelessly taken from this video
Warning: We skip a lot
Modify the pretrained weights / hidden outputs with a smart measure
Specific Contents
Image generated by GAN
Output by zeroing some activations
Step 1 : Which hidden channels have a high correlation with the segmentation map?
Step 2 : Edit these channels (set to a constant, or to 0)
Notes
It needs a segmentation model or manual labels.
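A hedged PyTorch sketch of the two steps: score each hidden channel by the overlap (IoU) between its upsampled, thresholded activation and the concept's segmentation map, then set the top channels to a constant. The IoU scoring is a simplification of the paper's dissection procedure.

import torch
import torch.nn.functional as F

def rank_channels(activations, seg_mask, thresh=0.5):
    """activations: (C, h, w) hidden features; seg_mask: (H, W) binary mask for one concept.
    Step 1: IoU between each upsampled, thresholded channel and the segmentation map."""
    up = F.interpolate(activations[None], size=seg_mask.shape, mode="bilinear",
                       align_corners=False)[0]            # (C, H, W)
    binarized = (up > thresh).float()
    mask = seg_mask.float()
    inter = (binarized * mask).sum(dim=(1, 2))
    union = ((binarized + mask) > 0).float().sum(dim=(1, 2))
    return inter / (union + 1e-8)                          # IoU per channel

def edit_channels(activations, channel_ids, value=0.0):
    """Step 2: set the selected channels to a constant (e.g. 0) before the later layers run."""
    edited = activations.clone()
    edited[channel_ids] = value
    return edited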
Official GIFs
Find the best-matching latents in the GAN => bad result :(
Find the best-matching latents in the GAN + allow slight weight modification => nice :)
Use the editing techniques from the previous work
00:27~00:55
\(W\) : weight of layer \(L \)
\(k\) : normal input at layer \(L\)
\(k_*\) : selected input at layer \(L\)
\(v_*\): desired output for \(k_*\) at layer \(L\)
normal output should not change
change source to target
go "Example Results"
Traditional Classification : fixed labels (dog, cat, hen, bee)
CLIP (Contrastive Language–Image Pre-training)
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).
# https://github.com/openai/CLIP#usage
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
StyleCLIP
StyleCLIPDraw
Image Manipulation/Generation with
1 Image, 1 Text
Use CLIP Encoder
\(\mathcal{L}_{CLIP} = -CLIP_{I}(img)\cdot CLIP_{T}(text)\)
"...."
2021 Sep 09
2021 Nov 12
StyleCLIP Author : a rising new star
2021 Dec 08
Patashnik, Or, et al. "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." arXiv e-prints (2021): arXiv-2103.
In Style Transfer/In StyleCLIP
\(w\)
"Curly Hair"
Generate
Latent
StyleGAN
(G)
\(w_s\)
StyleGAN
Get reconstructed latent : \(w_s\)
Latent optimization
Face Recognition
Same Person?
\(\mathcal{L}_{ID}\)
Same Description?
\(\mathcal{L}_{CLIP}\)
\(\textcolor{red}{w^*} = argmin_{\color{red} w}\mathcal{L}_{CLIP}+\lambda_{L2}||\textcolor{red}{w}-w_s||_2 + \lambda_{ID}\mathcal{L}_{ID}\)
G(\(w\))
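A hedged PyTorch sketch of this latent optimization; `G` (the StyleGAN generator), `clip_loss`, and `id_loss` (face-recognition similarity) are placeholders, and the step count and λ values are illustrative, not the paper's.

import torch

def styleclip_optimize(w_s, G, clip_loss, id_loss, text, steps=200,
                       lam_l2=0.01, lam_id=0.01, lr=0.1):
    """w* = argmin_w  L_CLIP(G(w), text) + lam_l2 * ||w - w_s||^2 + lam_id * L_ID(G(w), G(w_s))."""
    w = w_s.clone().requires_grad_(True)          # start from the reconstructed latent w_s
    opt = torch.optim.Adam([w], lr=lr)
    with torch.no_grad():
        original = G(w_s)                         # reference image for the identity term
    for _ in range(steps):
        image = G(w)
        loss = (clip_loss(image, text)
                + lam_l2 * ((w - w_s) ** 2).sum()
                + lam_id * id_loss(image, original))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()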
\(\mathcal{L}_{total} = \mathcal{L}_{content}+ \beta\mathcal{L}_{style}\)
CLIPDraw : \(\beta = 0\)
StyleCLIPDraw : \(\beta > 0\)
Schaldenbrand, Peter, Zhixuan Liu, and Jean Oh. "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis." arXiv preprint arXiv:2111.03133 (2021).
In Style Transfer/CLIPDraw
Gradient descent from the loss to the curves' parameters is possible, i.e.
$$\frac{\partial \mathcal{L}}{\partial P_i}$$ can be computed
Parameters for control points :
position, RGBA, thickness
Without augmentation, the result is bad.
The Eiffel Tower
edit style embedding
\(E_{initial} = SP(S)\)
with \(\mathcal{L}_{CLIP}\)
Input Image
(C)
Style Image
(S)
Output Image
NST(C, \(E_{initial}\))
CLIP Result
Next pages
E
$$argmin_{{\color{red} E}}(\mathcal{L}_{CLIP}(NST(C,{\color{red} E}), Text))$$
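A hedged sketch of optimizing the style embedding E against the CLIP loss; `style_prediction` (SP), `nst` (the feed-forward stylization network that consumes an embedding), and `clip_loss` are placeholders for the components named above.

import torch

def edit_style_embedding(content, style, text, style_prediction, nst, clip_loss,
                         steps=100, lr=0.02):
    """Start from E_initial = SP(S), then E* = argmin_E  L_CLIP(NST(C, E), text)."""
    with torch.no_grad():
        E = style_prediction(style)               # E_initial = SP(S)
    E = E.clone().requires_grad_(True)
    opt = torch.optim.Adam([E], lr=lr)
    for _ in range(steps):
        output = nst(content, E)                  # stylized image NST(C, E)
        loss = clip_loss(output, text)            # does the output match the text description?
        opt.zero_grad()
        loss.backward()
        opt.step()
    return E.detach()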
Tokenize text
# https://github.com/huggingface/tokenizers
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-cased")  # assumption: any pretrained tokenizer works here

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
# string => tokens
# token => idx => embedding

Autoregressive Model (Next token prediction)
$$P_{\theta}(\textbf{x})=\prod_{i=1}^{n} P_{\theta}(x_i|x_1, x_2, \dots, x_{i-1})$$
\(P_{\theta}(\text{"sunny"}|\text{"The weather is"})\)
\(P_{\theta}(\text{"cookie"}|\text{"The weather is"})\)
\(P_{\theta}(\text{"furry"}|\text{"The weather is"})\)
Image Tokenization
VQ-VAE can tokenize an image into \(n \times n\) tokens.
DALL·E : text tokens and image tokens in one autoregressive model
"a dog is watching you" => text tokens \(\color{green}x_{t_1}, x_{t_2}, \dots, x_{t_n}\)
image => image tokens \(\color{blue}x_{i_1}, x_{i_2}, \dots, x_{i_m}\)
$$ P_{\theta}({\color{green} {x_t}}, {\color{blue}{x_i}}) = \prod_{p=1}^{m}P_{\theta}({\color{blue}x_{i_p}}|{\color{green} x_{t_1}, x_{t_2}, \dots, x_{t_n}},{\color{blue} x_{i_1}, x_{i_2}, \dots, x_{i_{p-1}}})$$
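A generic sketch of this factorization: concatenate text tokens and image tokens into one sequence and train a causal transformer with next-token cross-entropy on the image positions. `transformer` is any causal language model over the combined vocabulary, not DALL·E's actual architecture.

import torch
import torch.nn.functional as F

def dalle_style_loss(transformer, text_tokens, image_tokens):
    """text_tokens: (B, n) ids, image_tokens: (B, m) ids from a VQ-VAE codebook.
    Model P(image tokens | text tokens) autoregressively over the concatenated sequence."""
    seq = torch.cat([text_tokens, image_tokens], dim=1)        # (B, n + m)
    logits = transformer(seq[:, :-1])                          # causal model predicts the next token
    n = text_tokens.shape[1]
    image_logits = logits[:, n - 1:]                           # positions that predict image tokens
    return F.cross_entropy(image_logits.reshape(-1, image_logits.shape[-1]),
                           image_tokens.reshape(-1))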
Core Concept
Sad Things
An Explanation
Not Enough? Paper & Code
2021 Jan
2021 Jan
Edit the model without additional images
StyleGAN-NADA (2021 Aug)
Follow-up work to StyleCLIP
2021 Dec
It can train a feed-forward model in about 1 minute
Choi, Yunjey, et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." arXiv e-prints (2017): arXiv-1711.
VQ-GAN (2020 Dec)
a method parallel to DALL·E
2021 Nov
2020 Apr
Novel view synthesis
Semantic photo manipulation (This Slide)
Facial and Body Reenactment
Relighting
Free-Viewpoint Video
Photo-realistic avatars for AR/VR
2021 Dec
2021 Dec
Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837-842.
Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.
Bau, David, et al. "Gan dissection: Visualizing and understanding generative adversarial networks." _arXiv preprint arXiv:1811.10597_ (2018).
Bau, David, et al. "Semantic photo manipulation with a generative image prior." _arXiv preprint arXiv:2005.07727_ (2020).
Bau, David, et al. "Rewriting a deep generative model." _European Conference on Computer Vision_. Springer, Cham, 2020.
Text-Driven Image Manipulation/Generation
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." _arXiv preprint arXiv:2103.00020_ (2021).
Patashnik, Or, et al. "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." arXiv preprint arXiv:2103.17249 (2021).
Frans, Kevin, L. B. Soros, and Olaf Witkowski. "Clipdraw: Exploring text-to-drawing synthesis through language-image encoders." arXiv preprint arXiv:2106.14843 (2021).
Schaldenbrand, Peter, Zhixuan Liu, and Jean Oh. "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis." arXiv preprint arXiv:2111.03133 (2021).
Ramesh, Aditya, et al. "Zero-shot text-to-image generation." _arXiv preprint arXiv:2102.12092_ (2021).
If you have any feedback, please contact me
changethewhat+NST@gmail.com
yidar+NST@aiacademy.tw