湯沂達
Taiwan AI Academy, Technical Division
2021/12/10
湯沂達
Email
changethewhat@gmail.com / yidar@aiacademy.tw
Did you see these?
(A) https://nightcafe.studio/blogs/blog/top-20-ai-generated-artworks
(B) https://twitter.com/CitizenPlain/status/1316760510709338112/photo/1
(C)https://www.ettoday.net/news/20210616/2007703.htm
(D) https://github.com/orpatashnik/StyleCLIP
<= only with text & an input image
Spirit of some famous methods
The following pages contain some math equations.
However, I will explain them through the ideas behind the algorithms, not the equations.
How to summarize texture?
https://paperswithcode.com/dataset/psu-near-regular-texture-database
Define some handmade feature representations,
like color, gradient, frequency...
Then use statistics:

Image => Feature Extract => Summarize => Distance
Distance measure for vector/distribution
Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837-842.
Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.
Gabor Filter Bank
Feature Extract: (H,W,1) → (H,W,f⋅θ)
Summarize with μ,σ: (H,W,f⋅θ) → (2⋅f⋅θ)
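As a sketch of the Gabor pipeline above (filter bank, then per-response μ,σ), here is a minimal NumPy version; the kernel size, σ, and the frequency/orientation sets are illustrative choices, not the values from the cited paper:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(freq, theta, size=15, sigma=3.0):
    # Real part of a Gabor filter: Gaussian envelope times a sinusoid.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def gabor_features(img, freqs=(0.1, 0.2, 0.3), n_theta=4):
    # (H, W) -> vector of length 2 * f * theta: mean and std of each response.
    feats = []
    for f in freqs:
        for k in range(n_theta):
            kern = gabor_kernel(f, np.pi * k / n_theta)
            windows = sliding_window_view(img, kern.shape)  # valid-mode conv
            resp = np.einsum('ijkl,kl->ij', windows, kern)
            feats += [resp.mean(), resp.std()]
    return np.array(feats)

img = np.random.rand(32, 32)
v = gabor_features(img)
print(v.shape)  # (24,): 2 stats x 3 frequencies x 4 orientations
```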
Use Histogram
Summarize: (H,W,1) ∈ {0,1,…,9} → 10-bin histogram
Rotation Invariant Local Binary Pattern
(H,W,1)∈{0,1,...,255}→(H,W,1)∈{0,1,...,9}
Thresholding: compare each neighbor to the center pixel (< → 0, ≥ → 1),
giving a binary pattern such as 11000000
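The thresholding and rotation-invariant mapping above can be sketched in NumPy. This is the riu2 variant with 8 neighbors: uniform patterns (at most two 0/1 transitions around the circle) map to their count of ones (0–8), everything else to 9:

```python
import numpy as np

def lbp_riu2(img):
    # (H, W) grayscale -> (H-2, W-2) codes in {0, ..., 9}.
    c = img[1:-1, 1:-1]
    # 8 neighbors in circular order around the center pixel.
    n = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:],
         img[1:-1, 2:],   img[2:, 2:],     img[2:, 1:-1],
         img[2:, 0:-2],   img[1:-1, 0:-2]]
    bits = np.stack([(x >= c).astype(int) for x in n])  # thresholding: >= -> 1
    # Number of 0/1 transitions around the circle ("uniformity" measure).
    trans = np.abs(bits - np.roll(bits, 1, axis=0)).sum(axis=0)
    ones = bits.sum(axis=0)
    # riu2: uniform patterns -> count of ones (0..8), non-uniform -> 9.
    return np.where(trans <= 2, ones, 9)

img = np.random.randint(0, 256, (16, 16))
codes = lbp_riu2(img)
hist = np.bincount(codes.ravel(), minlength=10)  # the 10-bin histogram summary
print(codes.min() >= 0 and codes.max() <= 9)  # True
```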
Data => Feature Extract => Describe => Distance

If we have a good feature extractor...
Describe data w/ or w/o statistics...
Distance measure for vector/distribution

Example: "A Cute Dog Staring at You"
Lin, Tianwei, et al. "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer." arXiv preprint arXiv:2104.05376 (2021).
Content + Style → Stylized

Objective
Find a stylized image which keeps the content of the content image and the style of the style image.
Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).
Notes
VGG (A Pretrained Model)

L_content : feature tensor of Î close to the content image's feature tensor
L_style : Stat(feature) of Î close to the style image's Stat(feature)

L_content = (F(Î) − F(I_content))²
L_style = (G(F(Î)) − G(F(I_style)))²
L_total = L_content + λ⋅L_style

G : Gram matrix. For F ∈ ℝ^(C×X×Y), G(F) ∈ ℝ^(C×C):
G(F)_(c1,c2) = (1 / (X⋅Y)) ⋅ Σ_(x,y) [F_(c1,x,y) ⋅ F_(c2,x,y)]

result ← argmin_Î L_total(Î), via gradient descent:
Î ← Î − α ⋅ ∂L_total / ∂Î
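The Gram matrix and the two losses above, as a minimal NumPy sketch (F stands in for a VGG feature tensor; here we pass arrays directly rather than running a network):

```python
import numpy as np

def gram(F):
    # F in R^{C x X x Y} -> G(F) in R^{C x C}, averaged over spatial positions.
    C, X, Y = F.shape
    Fm = F.reshape(C, X * Y)
    return Fm @ Fm.T / (X * Y)

def total_loss(F_hat, F_content, F_style, lam=1.0):
    # L_content compares raw feature tensors; L_style compares Gram matrices.
    L_content = ((F_hat - F_content) ** 2).mean()
    L_style = ((gram(F_hat) - gram(F_style)) ** 2).mean()
    return L_content + lam * L_style

F_hat = np.random.rand(4, 8, 8)
print(total_loss(F_hat, F_hat, F_hat))  # 0.0: identical features and statistics
```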
Using a deeper layer for the content loss gives a more abstract result.
Their Result
Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” arXiv preprint arXiv:1603.08155 (2016).
Notes

Perceptual losses: instead of optimizing each image Î, train a feed-forward model f_W once.

prev: argmin_Î L_total(Î)
this: argmin_(f_W) Σ_(I ∈ dataset) L_total(f_W(I))
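A toy illustration of the "prev vs this" contrast, under assumed stand-ins: a linear map plays the role of the transformation network f_W, and a synthetic target plays the role of the per-image optimum. The point is that one training run amortizes the per-image optimization into a single forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.normal(size=(64, 8))     # toy "images"
target = dataset * 0.5 + 1.0           # toy per-image optimum

W = np.zeros((8, 8)); b = np.zeros(8)  # f_W(I) = I @ W + b
for _ in range(2000):                  # one training run, reused forever
    pred = dataset @ W + b
    grad = pred - target               # d(mean squared loss)/d(pred)
    W -= 0.01 * dataset.T @ grad / len(dataset)
    b -= 0.01 * grad.mean(axis=0)

new_image = rng.normal(size=(1, 8))
stylized = new_image @ W + b           # single forward pass at test time
print(np.allclose(stylized, new_image * 0.5 + 1.0, atol=0.1))  # True
```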
Color Distribution Matching

Match Stat(R), Stat(G), Stat(B) of the source to the target:
normalize each source channel to (μ=0, σ=1), then apply the target channel's statistics.
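The normalize-then-restyle recipe above, as a small NumPy sketch:

```python
import numpy as np

def match_color(source, target):
    # Per channel: normalize source to (mu=0, sigma=1), then apply target stats.
    out = np.empty_like(source, dtype=float)
    for ch in range(source.shape[-1]):           # R, G, B
        s, t = source[..., ch], target[..., ch]
        normalized = (s - s.mean()) / (s.std() + 1e-8)
        out[..., ch] = normalized * t.std() + t.mean()
    return out

rng = np.random.default_rng(0)
src = rng.random((16, 16, 3))
tgt = rng.random((16, 16, 3)) * 2 + 1
res = match_color(src, tgt)
# Each result channel now carries the target channel's mean.
print(np.allclose(res.mean(axis=(0, 1)), tgt.mean(axis=(0, 1))))  # True
```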
Each style uses a (γ,β) pair
Model f_W : n× (Conv, Act) blocks

Interpolate between styles S1, S2 (and similarly among S1...S4):
S = α⋅S1 + (1−α)⋅S2
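A minimal sketch of conditional instance normalization with one (γ,β) pair per style, and linear interpolation between two styles; shapes and the example (γ,β) values are illustrative:

```python
import numpy as np

def cond_instance_norm(x, gamma, beta, eps=1e-5):
    # x: (C, H, W). Normalize each channel, then apply a style's (gamma, beta).
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (x - mu) / (sigma + eps) + beta[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
g1, b1 = np.ones(4), np.zeros(4)        # style S1
g2, b2 = np.full(4, 2.0), np.ones(4)    # style S2

alpha = 0.5                             # S = alpha*S1 + (1 - alpha)*S2
g = alpha * g1 + (1 - alpha) * g2
b = alpha * b1 + (1 - alpha) * b2
y = cond_instance_norm(x, g, b)
print(y.shape)  # (4, 8, 8)
```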
Notes
Prev Work
Style Prediction Network
This Work
Architecture
Papers: a big improvement roughly every half year
(Methods up to March 2018; review cited 335 times as of Nov 2021)
Jing, Yongcheng, et al. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).
Why Named StyleGAN?
Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." arXiv preprint arXiv:1812.04948 (2018).
Data => Feature Extract => Describe => Distance
Example: "A Man With Curly Hair"
Notes :
w∈W
StyleMixing
Interpolate
1:10~2:07
Cost
Solve Artifact (00:30~1:30)
StyleGAN2
Solves interpolation artifacts
StyleGAN3
Tero Karras
Un-Official Forest
Methods shamelessly taken from this video
Warning: We skip a lot
Modify pretrained weights / hidden outputs with a smart measure
Specific Contents
Image generated by GAN
Output by zeroing some activation
Step 1 : Which hidden channels have high correlation with the segmentation map?
Step 2 : Edit these channels (set to a constant, or to 0)
Notes: it needs a segmentation model or manual labels
Official GIFs
Find best matching latents in GAN.
Bad result :(
Find best matching latents in GAN
Allow slight weight modification
Nice :)
Use previous work's editing skill
00:27~00:55
W : weight of layer L
k : normal input at layer L
k∗ : selected input at layer L
v∗: desired output for k∗ at layer L
normal output should not change
change source to target
go "Example Results"
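The two constraints above (normal outputs unchanged, k* mapped to v*) admit a closed-form rank-one update to W, found by constrained least squares. Here is a NumPy sketch under toy dimensions; in the actual method this is applied to a chosen convolutional layer of the generator:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))       # weight of layer L
K = rng.normal(size=(5, 100))     # normal inputs at layer L
k_star = rng.normal(size=5)       # selected input
v_star = rng.normal(size=5)       # desired output for k_star

# Rank-one update: send k_star -> v_star while disturbing the normal
# outputs W @ K as little as possible (in the least-squares sense).
C = K @ K.T                       # second moment of the normal inputs
d = np.linalg.solve(C, k_star)    # update direction C^{-1} k*
W_new = W + np.outer(v_star - W @ k_star, d) / (d @ k_star)

print(np.allclose(W_new @ k_star, v_star))  # True: source now maps to target
```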
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).
Traditional Classification: a fixed label set (dog / cat / hen / bee)
CLIP (Contrastive Language–Image Pre-training)
# https://github.com/openai/CLIP#usage
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
StyleCLIP
StyleCLIPDraw
Image Manipulation/Generation with
1 Image, 1 Text
Use CLIP Encoder
L_CLIP = −CLIP_I(img) ⋅ CLIP_T(text)
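The L_CLIP above is, up to normalization, a negative cosine similarity between the two embeddings; a tiny sketch:

```python
import numpy as np

def clip_loss(img_emb, txt_emb):
    # L_CLIP = -cosine similarity of CLIP image and text embeddings.
    img_emb = img_emb / np.linalg.norm(img_emb)
    txt_emb = txt_emb / np.linalg.norm(txt_emb)
    return -img_emb @ txt_emb

e = np.array([1.0, 2.0, 3.0])
print(round(clip_loss(e, e), 6))  # -1.0: identical direction, minimal loss
```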
"...."
2021 Sep 09
2021 Nov 12
StyleCLIP author: a rising new star
2021 Dec 08
Patashnik, Or, et al. "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." arXiv e-prints (2021): arXiv-2103.
In Style Transfer/In StyleCLIP
Latent w ∈ W ; text: "Curly Hair"
Generate image G(w) with StyleGAN (G)
Get reconstructed latent: ws

Latent optimization:
Same description? L_CLIP
Same person? (face recognition) L_ID

w* = argmin_w L_CLIP + λ_L2 ⋅ ||w − ws||² + λ_ID ⋅ L_ID
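To make the objective concrete, here is a toy optimization with stand-ins: a random linear map in place of StyleGAN's G, a random direction in place of the CLIP text embedding, and the L_ID term omitted. Only the structure of the loss matches the slide; nothing here is the real StyleCLIP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 8))      # stand-in "generator": G(w) = A @ w
clip_dir = rng.normal(size=16)    # stand-in CLIP text direction
w_s = rng.normal(size=8)          # reconstructed latent ws

def total_loss(w, lam_l2=0.1):
    img = A @ w
    # L_CLIP as negative cosine similarity; L_ID omitted in this toy.
    L_clip = -img @ clip_dir / (np.linalg.norm(img) * np.linalg.norm(clip_dir))
    return L_clip + lam_l2 * np.sum((w - w_s) ** 2)

# Finite-difference gradient descent on w, starting from ws.
w = w_s.copy()
for _ in range(200):
    grad = np.array([(total_loss(w + 1e-4 * e) - total_loss(w - 1e-4 * e)) / 2e-4
                     for e in np.eye(8)])
    w -= 0.05 * grad

print(total_loss(w) < total_loss(w_s))  # expect the loss to decrease
```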
Ltotal= Lcontent+βLstyle
CLIPDraw : β=0
StyleCLIPDraw : β>0
Schaldenbrand, Peter, Zhixuan Liu, and Jean Oh. "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis." arXiv preprint arXiv:2111.03133 (2021).
In Style Transfer/CLIPDraw
Gradient descent from the loss to the curve's parameters is possible, i.e.
∂L/∂P_i can be computed
Parameters for control points: position, rgba, thickness
Without augmentation, the result is bad.
The Eiffel Tower
Edit the style embedding E with L_CLIP
E_initial = SP(S)

Input Image (C), Style Image (S)
Output Image: NST(C, E_initial)
CLIP Result: next pages

E* = argmin_E L_CLIP(NST(C, E), Text)
Tokenize text

# https://github.com/huggingface/tokenizers
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
# string => tokens
# token => idx => embedding

Autoregressive Model (Next token prediction)
P_θ(x) = Π_(i=1..n) P_θ(x_i | x_1, x_2, …, x_(i−1))

P_θ("sunny" | "The weather is")
P_θ("cookie" | "The weather is")
P_θ("furry" | "The weather is")
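The factorization above, illustrated with a hypothetical bigram-style table of conditional probabilities (the values are invented for illustration; a real model conditions on the full history):

```python
# Toy next-token model: P(x_i | history) here depends only on the last token.
P_next = {  # hypothetical conditional probabilities
    "The":     {"weather": 0.9, "cookie": 0.1},
    "weather": {"is": 1.0},
    "is":      {"sunny": 0.7, "cookie": 0.3},
}

def sequence_prob(tokens):
    # P(x) = prod_i P(x_i | x_1 .. x_{i-1}); conditioned on x_{i-1} only.
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= P_next[prev].get(cur, 0.0)
    return p

print(round(sequence_prob(["The", "weather", "is", "sunny"]), 2))   # 0.63
print(round(sequence_prob(["The", "weather", "is", "cookie"]), 2))  # 0.27
```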
Image Tokenization
VQ-VAE can tokenize an image into n×n tokens
Dall E

"a dog is watching you" → text tokens x_t1, x_t2, …, x_tn
image tokens x_i1, x_i2, …, x_im

P_θ(x_t, x_i) = Π_(p=1..m) P_θ(x_ip | x_t1, …, x_tn, x_i1, …, x_i(p−1))
Core Concept
Sad Things
An Explanation
Not Enough? Paper & Code
2021 Jan
2021 Jan
Edit model without additional image
StyleGAN-NADA (2021 Aug)
Next work of StyleCLIP
2021 Dec
It can train a forward model in about 1 min
Choi, Yunjey, et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." arXiv e-prints (2017): arXiv-1711.
VQ-GAN (2020 Dec)
a parallel method to DALL E
2021 Nov
2020 Apr
Novel view synthesis
Semantic photo manipulation (This Slide)
Facial and Body Reenactment
Relighting
Free-Viewpoint Video
Photo-realistic avatars for AR/VR
2021 Dec
2021 Dec
Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837-842.
Ojala, Timo, Matti Pietikainen, and Topi Maenpaa. "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns." IEEE Transactions on pattern analysis and machine intelligence 24.7 (2002): 971-987.
Bau, David, et al. "GAN Dissection: Visualizing and Understanding Generative Adversarial Networks." arXiv preprint arXiv:1811.10597 (2018).
Bau, David, et al. "Semantic Photo Manipulation with a Generative Image Prior." arXiv preprint arXiv:2005.07727 (2020).
Bau, David, et al. "Rewriting a Deep Generative Model." European Conference on Computer Vision. Springer, Cham, 2020.
Text Driven Image Manipulation/Generation
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." _arXiv preprint arXiv:2103.00020_ (2021).
Patashnik, Or, et al. "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." arXiv preprint arXiv:2103.17249 (2021).
Frans, Kevin, L. B. Soros, and Olaf Witkowski. "Clipdraw: Exploring text-to-drawing synthesis through language-image encoders." arXiv preprint arXiv:2106.14843 (2021).
Schaldenbrand, Peter, Zhixuan Liu, and Jean Oh. "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis." arXiv preprint arXiv:2111.03133 (2021).
Ramesh, Aditya, et al. "Zero-Shot Text-to-Image Generation." arXiv preprint arXiv:2102.12092 (2021).
If you have any feedback, please contact me:
changethewhat+NST@gmail.com
yidar+NST@aiacademy.tw