YiDar, Tang
2022 Feb
Neural Rendering, a definition:
Deep neural networks for image or video generation that enable explicit or implicit control of scene properties
Generative networks that synthesize pixels
Controllable by interpretable parameters or by conditioning input
Illumination, camera, pose, geometry, appearance, semantic structure, animation, ...
Inference
Deferred Neural Rendering: Image Synthesis using Neural Textures
Neural Volumes
Details in the next section
Notes:
"Semantic Photo Synthesis" is listed in the 2020 Apr paper "State of the Art on Neural Rendering"
Items 2~6 are listed in the 2021 Nov paper "Advances in Neural Rendering"
(2018) GAN Dissection
(2019) GauGAN
Source : https://syncedreview.com/2019/04/22/everyone-is-an-artist-gaugan-turns-doodles-into-photorealistic-landscapes/
(2021) GauGAN2
(2021) GLIDE
A super hot approach in neural rendering
The first paper to use the term "neural rendering"
Rendering a given scene from new camera positions, given a set of images and their camera poses as the training set.
With less training data / fewer inputs.
HoloGAN can train a GAN model with purely 2D training data
The key idea is to apply a 3D transform to a 4D feature tensor (X, Y, Z, C).
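To make this concrete, here is a minimal sketch (not HoloGAN's actual code) of rotating a learned 3D feature volume with PyTorch's affine_grid / grid_sample; the tensor sizes and the rotation angle are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

# Hypothetical learned feature volume: (batch, channels, X, Y, Z) -- the "4D tensor (X, Y, Z, C)"
feat = torch.randn(1, 64, 16, 16, 16)

# A rigid rotation about one axis; the angle is an arbitrary illustrative choice
a = math.pi / 6
rot = torch.tensor([[[math.cos(a), -math.sin(a), 0.0, 0.0],
                     [math.sin(a),  math.cos(a), 0.0, 0.0],
                     [0.0,          0.0,         1.0, 0.0]]])  # (1, 3, 4) affine matrix

# Resample the volume under the transform; a 2D decoder (not shown) then renders it to pixels
grid = F.affine_grid(rot, feat.shape, align_corners=False)
rotated = F.grid_sample(feat, grid, align_corners=False)
print(rotated.shape)  # torch.Size([1, 64, 16, 16, 16])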
(2021) IBRNet
IBRNet renders a novel view by relating the camera rays of the target view to nearby source viewpoints
Handling dynamically changing content.
(1999) Bullet Time
Movie : The Matrix
(2020) non-rigid NeRF
Rearranging and affine transforming the objects and altering their structure and appearance.
This method can propagate coarse 2D user scribbles to the 3D space, to modify the color or shape of a local region.
Enabling a variety of editing functions
manipulating the scale and location
duplicating
retiming
(2:28~End)
A novel system for portrait relighting and background replacement.
Using a human video to drive a given monocular RGB image.
(2:20:10~2:22:18)
Editing an input mesh with a "text" prompt or an "image" prompt.
slide: link, video : https://youtu.be/otly9jcZ0Jg?t=4429
All topics in awesome-NeRF : Faster Inference / Faster Training / Unconstrained Images / Deformable / Video / Generalization / Pose Estimation / Lighting / Compositionality / Scene Labelling and Understanding / Editing / Object Category Modeling / Multi-scale / Model Reconstruction / Depth Estimation
Source : https://www.matthewtancik.com/nerf
Idea
Some Math
Suppose we have
color \(c\)
density \(\sigma\)
volume density : \(\sigma(t)\)
P[no hit before t] : \(T(t) = \exp\left(-\int_{t_0}^{t}\sigma(s)\,ds\right)\)
Probability of first hit at \(t\) : \(T(t)\sigma(t)\,dt\)
Expected color for the ray :
\(\int_{t_0}^{t_1}T(t)\sigma(t)c(t)\,dt\)
\(\approx\) a cheaper-to-compute discrete sum
\(=\sum_{i=1}^{n}T_i c_i \alpha_i\)
Notation for the computation detail :
\(T_i = \prod_{j=1}^{i-1} (1-\alpha_j)\)
\(\alpha_i = 1-\exp(-\sigma_i\delta_i)\)
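The discrete sum above can be computed directly; the sketch below is a minimal PyTorch version with my own variable names, assuming per-sample densities \(\sigma_i\), colors \(c_i\) and spacings \(\delta_i\) along one ray.

import torch

def composite_ray(sigma, rgb, delta):
    """Discrete volume rendering: sum_i T_i * alpha_i * c_i.

    sigma: (n,)   densities sigma_i along the ray
    rgb:   (n, 3) colors c_i at the samples
    delta: (n,)   distances delta_i between adjacent samples
    """
    alpha = 1.0 - torch.exp(-sigma * delta)                    # alpha_i = 1 - exp(-sigma_i * delta_i)
    # T_i = prod_{j<i} (1 - alpha_j): cumulative product shifted by one step
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha                                     # T_i * alpha_i
    return (weights[:, None] * rgb).sum(dim=0)                  # expected color of the ray

# Toy usage with random samples along a single ray
color = composite_ray(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.03))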
Estimated
color \(c\)
density \(\sigma\)
Store Data Approaches
Two ways to store the mapping : a Matrix (lookup table) or an MLP Network
2D image : \((x,y) \rightarrow r,g,b\)
Matrix : \(r,g,b = A[x,y]\) , storage \(\mathcal{O}(n^2)\)
MLP Network : \(r,g,b = f(x,y)\) , storage \(\mathcal{O}(p)\)
5D scene function : \((x, y, z, \theta, \phi) \rightarrow r,g,b,\sigma\)
Matrix : \(r,g,b,\sigma = A[x, y, z, \theta, \phi]\) , storage \(\mathcal{O}(n^5)\)
MLP Network : \(r,g,b,\sigma = f(x, y, z, \theta, \phi)\) , storage \(\mathcal{O}(p)\)
[particle density \(\sigma\) and color \(r,g,b\)] depend on spatial information & view direction \((x, y, z, \theta, \phi)\)
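To illustrate the MLP column, here is a toy sketch of \(f(x, y, z, \theta, \phi) \rightarrow (r,g,b,\sigma)\); it is much smaller than NeRF's real network, omits positional encoding, and the layer sizes are arbitrary choices.

import torch
import torch.nn as nn

class TinyRadianceMLP(nn.Module):
    """Toy f(x, y, z, theta, phi) -> (r, g, b, sigma); storage is the parameter count p."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),        # 3 color channels + 1 density
        )

    def forward(self, x):                # x: (..., 5) = (x, y, z, theta, phi)
        out = self.net(x)
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative density
        return rgb, sigma

mlp = TinyRadianceMLP()
print(sum(p.numel() for p in mlp.parameters()))  # the O(p) storage cost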
\(\operatorname{argmin}_{\mathrm{MLP}}\left(\sum_{i=1}^{n}T_i c_i \alpha_i - c_{true}\right)^2\)
A very nice way to connect traditional and modern techniques,
and many other techniques can be plugged in easily.
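Putting the previous two sketches together, one hypothetical optimization step on a single ray could look like the following; ray sampling, batching, positional encoding, and the coarse/fine scheme are omitted, and the numbers are placeholders.

import torch

# Reuses TinyRadianceMLP and composite_ray from the sketches above (illustrative only).
mlp = TinyRadianceMLP()
optim = torch.optim.Adam(mlp.parameters(), lr=5e-4)

# Fake supervision for one ray: 64 sample positions/directions and the true pixel color
samples = torch.rand(64, 5)            # (x, y, z, theta, phi) along the ray
delta = torch.full((64,), 0.03)        # spacing between samples
c_true = torch.rand(3)

rgb, sigma = mlp(samples)
c_pred = composite_ray(sigma.squeeze(-1), rgb, delta)    # sum_i T_i c_i alpha_i
loss = ((c_pred - c_true) ** 2).sum()                    # photometric loss from the slide
optim.zero_grad()
loss.backward()
optim.step()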
NeRF problems
NeRF website
Key Idea : Split a global MLP into sparse voxel-wise MLPs
Advantages :
Their method can train NeRF with fewer inputs and faster.
Credit : http://www.cs.cmu.edu/~dsnerf/
[Figure: RGB and depth renderings comparing NeRF vs. DS-NeRF]
All of the following results are trained with only 2 input views.
Key Idea :
Free lunch from the depth estimates of key points
Supervise it!
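As a rough illustration of "supervise it", the sketch below adds a simplified L2 depth term on the expected ray termination depth; this is my own simplification, not DS-NeRF's exact loss, and the keypoint depth value is a placeholder.

import torch

def render_depth(sigma, delta, t_vals):
    """Expected ray termination depth: sum_i T_i * alpha_i * t_i (same weights as the color sum)."""
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    return (trans * alpha * t_vals).sum()

# Simplified depth supervision on one ray that hits a triangulated key point
t_vals = torch.linspace(2.0, 6.0, 64)           # sample depths along the ray (illustrative range)
sigma = torch.rand(64)
delta = torch.full((64,), (6.0 - 2.0) / 64)
keypoint_depth = torch.tensor(3.7)               # "free" depth estimate, e.g. from SfM key points
depth_loss = (render_depth(sigma, delta, t_vals) - keypoint_depth) ** 2
# total loss = photometric loss + lambda * depth_loss (lambda is a weighting hyperparameter)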
Additional : Known light sources during training
Relighting and View Synthesis
Material Editing
NeRF
Particles absorb environment light, then self-emit
NeRV
Phantom Blood
Before :
won't use traditional techniques
After :
won't ignore CS history
arXiv 2112.03221
OpenAI's model : CLIP
This model can measure the similarity between a (text, image) pair
[Figure: example CLIP similarity scores for several (text, image) pairs, e.g. -0.2, 0.4, 0.3, ...]
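A minimal sketch of such a similarity measurement with the openly released openai/CLIP package; the image path and the candidate captions are placeholders.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and a few candidate captions
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the (text, image) pairs
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)
print(similarity)  # higher value = more similar pair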
(Detic) Detecting Twenty-thousand Classes using Image-level Supervision
arXiv 2201.02605, github
Directly fuses some CLIP text information into the training loop
Text2Mesh modifies an input mesh to conform to the target text by predicting color and geometric details. The weights of the neural style network are optimized by rendering multiple 2D images and applying 2D augmentations, which are given a similarity score to the target from the CLIP-based semantic loss.
Edit the given mesh with a trainable MLP
The MLP takes \((x,y,z)\) as input and predicts a per-vertex color and displacement.
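A toy sketch of such a neural style field (my own simplification, not the paper's exact architecture): it maps a vertex position to an RGB color and a scalar displacement along the vertex normal; the hidden size and displacement scale are arbitrary choices.

import torch
import torch.nn as nn

class StyleField(nn.Module):
    """Toy neural style field: vertex position (x, y, z) -> (rgb color, displacement)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.color_head = nn.Linear(hidden, 3)    # per-vertex RGB
        self.displ_head = nn.Linear(hidden, 1)    # displacement along the vertex normal

    def forward(self, xyz):                       # xyz: (num_vertices, 3)
        h = self.backbone(xyz)
        rgb = torch.sigmoid(self.color_head(h))
        displ = 0.1 * torch.tanh(self.displ_head(h))   # keep displacements small (assumed scale)
        return rgb, displ

rgb, displ = StyleField()(torch.rand(1000, 3))    # 1000 hypothetical mesh vertices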
Render a collection of images from multiple selected viewpoints
Global
Local
Displacement
Augmentation
# Random perspective augmentation applied to n copies of the rendered image x
from torchvision import transforms
aug_x = transforms.RandomPerspective(
    fill=1, p=0.8, distortion_scale=0.5
)(x.repeat(n, 1, 1, 1))
Total Loss
# The same recipe yields loss_global, loss_local and loss_displ for the three render types
code_x = CLIP_I(aug_x)                 # CLIP image embeddings of the augmented renders
avg_code_x = code_x.mean(dim=0)        # average over the augmentation batch
loss_x = -cos(avg_code_x, code_text)   # negative cosine similarity to the target text embedding
total_loss = (loss_global
              + loss_local + loss_displ)
Ablations
‘Candle made of bark’
Beyond Text-Driven Manipulation
The original target is \(CLIP_{Text}("...")\)
One can also use \(CLIP_{Image}(I)\) as the target and get good results
Donald John Trump
My Experiments (~20 min / 750 iterations with Colab Pro)
Based on this Kaggle kernel : https://www.kaggle.com/neverix/text2mesh/
Nike Sport Shoe
Feel free to send me an email
changethewhat+talk@gmail.com