Video Frame Synthesis using Deep Voxel Flow
Ziwei Liu, Xiaoou Tang, Raymond Yeh, Yiming Liu, Aseem Agarwala
The Chinese University of Hong Kong, University of Illinois at Urbana-Champaign, Google Inc.
Feb 8, 2017
Goal
Frame Interpolation/Extrapolation
Application
- Slow-motion effect
- Increase frame rate
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548718/fi_frame_1-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548726/fi_frame_0-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548729/fi_frame_2-01.png)
Optical Flow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548870/fi_frame_0-01_-_Copy.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548871/fi_frame_1-01_-_Copy.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548882/frame_01_optical_flow_color.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548883/color_direction_map.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3548895/frame_01_optical_flow_arrow.png)
Compute flow on each pixel
Related Work
CNN Approach
- Predict optical flow
Traditional Approach
- Estimate optical flow between frames
- Interpolate optical flow vector
⇒ Optical flow must be accurate
⇒ Require supervision (flow ground-truth)
- Directly hallucinate RGB pixel values
⇒ Blurry
Outline
Overview
Formulation
Refinement and Extension
Experiment
Summary
Overview
Deep Voxel Flow
Combine the strengths of traditional and CNN approaches
- CNN ⇒ voxel flow
- Volume sampling layer (blending) ⇒ synthesized frame
- Synthesized frame ⇔ ground-truth frame
*voxel = volume pixel, i.e., a 3D pixel
End-to-end trained deep network
No fully connected layers ⇒ works at any input resolution
Quantitatively and qualitatively improve upon the state-of-the-art
Formulation
Architecture
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3551468/Screen_Shot_2017-03-03_at_6.55.53_PM.png)
Input frames
Target frame
Synthesized frame
Formulation
CNN ⇒ Voxel Flow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3554741/voxel_flow_H_func-01.png)
Predict the voxel flow on every pixel of the input frames, given the network parameters
Voxel Flow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3542213/padding_strides.gif)
*Deconvolution
kernel sizes:
5x5, 5x5, 3x3, 3x3, 3x3, 5x5, 5x5
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3554738/voxel_flow_H_func-01.png)
Formulation
Volume Sampling Layer ⇒ Synthesized Frame
Assume optical flow is temporally symmetric around the in-between frame
Corresponding locations in:
- Previous frame
- Next frame
*(x, y): pixel location in the synthesized frame
Linear blending weight between the previous and next frames
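The symmetric-flow assumption and linear blending above can be sketched as follows. A minimal illustration: the function names are mine, not the paper's, and which frame receives the (1 - dt) weight is a convention choice here.

```python
def corresponding_locations(x, y, dx, dy):
    """Given pixel (x, y) in the synthesized frame and a spatial flow (dx, dy)
    assumed symmetric around the in-between frame, return the matching
    sample locations in the previous and next frames."""
    prev_loc = (x - dx, y - dy)  # location in the previous frame
    next_loc = (x + dx, y + dy)  # location in the next frame
    return prev_loc, next_loc

def blend(prev_pixel, next_pixel, dt):
    """Linearly blend the two sampled pixels; dt in [0, 1] is the
    temporal component of the voxel flow, used as the blending weight."""
    return (1 - dt) * prev_pixel + dt * next_pixel
```

For interpolation at the temporal midpoint, dt = 0.5 simply averages the two sampled pixels.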
Formulation
Volume Sampling Layer ⇒ Synthesized Frame
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3554938/flow_exp-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3554957/flow_exp_orig-01.png)
Formulation
Volume Sampling Layer ⇒ Synthesized Frame
Trilinear interpolation
Volume Sampling Function
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3555929/voxel_flow_vf_layer-01.png)
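Trilinear interpolation here factors into bilinear interpolation in space within each input frame plus linear interpolation in time. A minimal numpy sketch; the function names and the border clamping are my assumptions:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a (H, W) image at continuous location (x, y) with bilinear
    interpolation, clamping neighbour indices at the image border."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

def volume_sample(frame0, frame1, x, y, dx, dy, dt):
    """Trilinear volume sampling: bilinear in space within each frame,
    then a linear blend across time with weight dt."""
    p0 = bilinear_sample(frame0, x - dx, y - dy)  # previous frame
    p1 = bilinear_sample(frame1, x + dx, y + dy)  # next frame
    return (1 - dt) * p0 + dt * p1
```

Because every step is a weighted sum of pixel values, gradients flow back through both the sampled intensities and the flow coordinates, which is what makes the layer trainable end-to-end.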
Formulation Visualization
Synthesize frame in 1D
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3547102/voxel_flow_explained_2d_2.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3547152/voxel_flow_explained_3d-01.png)
Formulation Visualization
Synthesize frame in 2D
Digression
Spatial Transformer Networks
DeepMind, 2015 NIPS, Cited by 222
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3562857/STN-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3562868/STN_grids-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3562885/STN_Demo-01.png)
Formulation
Synthesized Frame ⇔ Ground-Truth Frame
Loss Function
total variation:
L1 approximated by Charbonnier loss
Empirically
Learning settings
- Batch size: 32
- Batch normalization
- Gaussian init:
- ADAM solver:
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556054/l1-01.png)
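The Charbonnier penalty is a smooth approximation to |x|, sqrt(x^2 + eps^2), and applying it to finite differences of the flow field gives a differentiable total-variation regularizer. A sketch; the eps value and the anisotropic TV form are my assumptions, not taken from the slide:

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    """Smooth, differentiable approximation to |x|: sqrt(x^2 + eps^2)."""
    return np.sqrt(x * x + eps * eps)

def total_variation(flow):
    """Anisotropic total variation of a (C, H, W) flow field, with each
    horizontal/vertical finite difference passed through Charbonnier."""
    dx = flow[:, :, 1:] - flow[:, :, :-1]
    dy = flow[:, 1:, :] - flow[:, :-1, :]
    return charbonnier(dx).sum() + charbonnier(dy).sum()
```

The total loss then combines the Charbonnier reconstruction error against the ground-truth frame with this TV term on the predicted flow.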
Formulation
End-to-end Fully Differentiable System
Note
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3547152/voxel_flow_explained_3d-01.png)
Formulation Visualization
Another Formulation
Refinement
Multi-scale Flow Fusion
Large motions that fall outside the kernel are hard to capture
Deal with both large and small motions
- Multiple encoder-decoders handle different scales, coarse ⇒ fine
e.g. - Predict voxel flow at that resolution
- Upsample and concatenate; only is retained
- Further convolve ( ) on the fused flow fields ⇒
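The coarse-to-fine fusion can be sketched as below. Note that when a coarse flow field is upsampled, its displacement magnitudes must be scaled by 2 to stay valid at the finer resolution; the paper's fusion convolutions are omitted here, and the helper names are hypothetical.

```python
import numpy as np

def upsample2x(flow):
    """Nearest-neighbour upsample of a (C, H, W) flow field; displacement
    magnitudes are doubled so they remain correct at the finer scale."""
    up = np.repeat(np.repeat(flow, 2, axis=1), 2, axis=2)
    return up * 2.0

def fuse(coarse_flow, fine_flow):
    """Concatenate the upsampled coarse-scale flow with the fine-scale flow
    along the channel axis; the paper then convolves this stack (omitted)."""
    return np.concatenate([upsample2x(coarse_flow), fine_flow], axis=0)
```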
Refinement
Multi-scale Flow Fusion
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556208/muti-scale_archi-01.png)
Refinement
Multi-scale Flow Fusion
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556733/mutiscale-01.png)
Extension
Multi-step Prediction
Predict D frames given current L frames
Smaller learning rate: 0.00005
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556796/multi-inter-01.png)
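Multi-step extrapolation can be run recursively, feeding synthesized frames back in as inputs. A sketch; the two-frame `predict` interface is an assumption:

```python
def multi_step(frames, predict, d):
    """Extrapolate d future frames: predict takes the last two frames and
    returns the next one; each prediction is fed back as input."""
    out = list(frames)
    for _ in range(d):
        out.append(predict(out[-2], out[-1]))
    return out[len(frames):]
```

With a toy linear extrapolator `predict = lambda a, b: 2 * b - a`, `multi_step([0, 1], predict, 3)` yields `[2, 3, 4]`. Errors compound across steps, which is one reason a smaller learning rate helps when fine-tuning for this setting.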
Experiment
Compete with the State-of-the-art
Training set: UCF-101 Train, 240k triplets
Test set: UCF-101 Test, THUMOS-15
Competing methods (state-of-the-art)
- EpicFlow + the interpolation algorithm from the Middlebury benchmark
- BeyondMSE (with minor tweaks)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556716/ucf101-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556720/exp1_result-01.png)
Experiment
Compete with the State-of-the-art
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556844/mutiscale-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556853/mutiscale-01.png)
Experiment
Compete with the State-of-the-art
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3556871/mutiscale-01.png)
Interpolation
Extrapolation
Multi-step comparisons
Experiment
Effectiveness of Multi-scale Voxel Flow
Appearance
Motion
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3558784/mutiscale-01.png)
UCF-101 test set
Experiment
Generalization to View Synthesis
Evaluate on the KITTI odometry dataset
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3559764/KTTI_fig_1-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3559788/KTTI_score_2-01.png)
Experiment
Generalization to View Synthesis
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3559777/KTTI_fig_2-01.png)
Experiment
Frame Synthesis as Self-Supervision
Video frame synthesis can serve as a self-supervision task for representation learning
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560071/KTTI_score-01.png)
Flow estimation
(endpoint error)
Action recognition
Experiment
Application
Produce slow-motion effects on HD videos (1280x720, 30 fps)
- Visual Comparison
- User Study
EpicFlow serves as a strong baseline
Experiment
Application-Visual Comparison
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560264/app-01.png)
EpicFlow
Ground Truth
DVF
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560297/app-01.png)
Experiment
Application-Visual Comparison
EpicFlow
Ground Truth
DVF
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560281/app-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560285/app-01.png)
Experiment
Application-User Study
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560327/app-01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/273031/images/3560339/app-01.png)
20 subjects were enrolled
For the null hypotheses:
- "EpicFlow is better than our method": p-value < 0.00001
- "DVF is better than ground truth": p-value < 0.838193
Experiment
Demo Video
Summary
- End-to-end deep network
- Copy pixels from existing video frames, rather than hallucinate them from scratch
- Improves upon both optical flow and recent CNN techniques
Future Work
- Combine flow layers with pure synthesis layers
⇒ predict pixels that cannot be copied from other video frames
- Use the desired temporal step as an input
- Compress the network, and run on mobile devices
Video Frame Synthesis using Deep Voxel Flow
By Maeglin Liao