Ziwei Liu, Xiaoou Tang, Raymond Yeh, Yiming Liu, Aseem Agarwala
The Chinese University of Hong Kong, University of Illinois at Urbana-Champaign, Google Inc.
Feb 8, 2017
Frame Interpolation/Extrapolation
Application
Traditional approach: compute flow on each pixel, then warp
CNN approach: directly synthesize pixel values
⇒ Optical flow must be accurate
⇒ Requires supervision (flow ground truth)
⇒ Blurry
Overview
Formulation
Refinement and Extension
Experiment
Summary
Deep Voxel Flow
Combine the strengths of traditional and CNN approaches
*voxel = volume pixel, i.e., a 3D pixel
End-to-end trained deep network
No fully connected layers ⇒ works at any resolution
Quantitatively and qualitatively improve upon the state-of-the-art
Architecture
Input frames
Target frame
Synthesized frame
CNN ⇒ Voxel Flow
Predict the voxel flow at every pixel of the target frame
network parameters
Voxel Flow
*Deconvolution
kernel sizes:
5x5, 5x5, 3x3, 3x3, 3x3, 5x5, 5x5
it should be:
Volume Sampling Layer ⇒ Synthesized Frame
Assume optical flow is temporally symmetric around the in-between frame
Corresponding locations in:
*(x,y):pixel location in the synthesized frame
Linear blending weight between the previous and next frames
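The symmetric-flow assumption above can be written out as follows (a sketch: the sign convention for sampling the previous vs. next frame, and using the temporal component of the voxel flow as the blend weight, are filled in where the slides elide details):

```latex
% Voxel flow at pixel (x, y): F = (\Delta x, \Delta y, \Delta t)
% Corresponding locations in the previous (X_0) and next (X_1) frames,
% mirrored around the in-between frame:
L_0 = (x - \Delta x,\; y - \Delta y), \qquad
L_1 = (x + \Delta x,\; y + \Delta y)

% Linear temporal blending with weight \Delta t \in [0, 1]:
\hat{Y}(x, y) = (1 - \Delta t)\, X_0(L_0) + \Delta t\, X_1(L_1)
```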
Trilinear interpolation
Volume Sampling Function
Synthesize frame in 1D
Synthesize frame in 2D
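The 2D case can be sketched in NumPy: bilinear interpolation in space on each of the two frames, plus linear blending in time, gives the trilinear volume sampling. The sampling convention (previous frame at minus the flow, next frame at plus the flow) is an assumption for illustration; the real layer is a differentiable network module, not a NumPy routine.

```python
import numpy as np

def volume_sample(prev_frame, next_frame, dx, dy, w):
    """Trilinear volume sampling sketch for grayscale frames.

    prev_frame, next_frame: (H, W) input frames.
    dx, dy: (H, W) spatial components of the voxel flow.
    w:      (H, W) temporal blend weight in [0, 1].
    """
    H, W = prev_frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    def bilinear(img, x, y):
        # Bilinear interpolation with border clamping.
        xf, yf = np.floor(x), np.floor(y)
        fx, fy = x - xf, y - yf
        x0 = np.clip(xf.astype(int), 0, W - 1)
        x1 = np.clip(x0 + 1, 0, W - 1)
        y0 = np.clip(yf.astype(int), 0, H - 1)
        y1 = np.clip(y0 + 1, 0, H - 1)
        top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
        bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
        return top * (1 - fy) + bot * fy

    # Symmetric flow: sample the two frames at mirrored locations,
    # then blend temporally -> trilinear interpolation overall.
    p = bilinear(prev_frame, xs - dx, ys - dy)
    n = bilinear(next_frame, xs + dx, ys + dy)
    return (1 - w) * p + w * n
```

With zero flow and w = 0.5 this reduces to a plain average of the two frames, which matches the intuition of interpolating the mid frame.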
Spatial Transformer Networks
DeepMind, 2015 NIPS, Cited by 222
Synthesized Frame ⇔ Ground-Truth Frame
Loss Function
total variation:
L1 approximated by Charbonnier loss
Empirically
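A minimal sketch of this loss, assuming the standard form (Charbonnier-approximated L1 reconstruction plus a total-variation smoothness term on the flow); the weight `lam` is illustrative, not the paper's exact setting:

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    # Differentiable approximation of |x|: sqrt(x^2 + eps^2).
    return np.sqrt(x * x + eps * eps)

def total_variation(field):
    # Anisotropic TV: Charbonnier-penalized finite differences.
    dy = field[1:, :] - field[:-1, :]
    dx = field[:, 1:] - field[:, :-1]
    return charbonnier(dy).sum() + charbonnier(dx).sum()

def dvf_loss(pred, target, flow_x, flow_y, lam=0.01):
    """Reconstruction loss + smoothness regularizer on the flow field."""
    recon = charbonnier(pred - target).sum()
    smooth = total_variation(flow_x) + total_variation(flow_y)
    return recon + lam * smooth
```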
Learning settings
End-to-end Fully Differentiable System
Note
Another Formulation
Multi-scale Flow Fusion
Large motions that fall outside the convolution kernel are hard to capture
Deal with large and small motions
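A coarse-to-fine sketch of the idea: estimate flow at a downsampled scale, where large motions fit inside the kernel, then upsample and fuse with finer-scale estimates. The helper `predict_flow` stands in for the per-scale network and is hypothetical; the pooling/upsampling choices are illustrative.

```python
import numpy as np

def downsample(img):
    # 2x average-pool downsampling (assumes even dimensions).
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(flow):
    # 2x nearest-neighbor upsampling; flow magnitudes double with resolution.
    return 2.0 * np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)

def multiscale_flow(frames, predict_flow, levels=3):
    """Build an image pyramid, predict flow at the coarsest level,
    then refine the upsampled estimate at each finer level."""
    pyramid = [frames]
    for _ in range(levels - 1):
        pyramid.append([downsample(f) for f in pyramid[-1]])
    flow = np.zeros_like(pyramid[-1][0])
    for scale in reversed(pyramid):  # coarsest -> finest
        if flow.shape != scale[0].shape:
            flow = upsample(flow)
        flow = predict_flow(scale, flow)  # fuse: refine the coarse estimate
    return flow
```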
Multi-step Prediction
Predict D frames given current L frames
Smaller learning rate: 0.00005
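Multi-step extrapolation can be sketched as a simple recursion: feed the last L frames to the network, append the synthesized frame, and repeat D times. `model` here is a hypothetical callable mapping a frame window to the next frame.

```python
def multi_step_predict(frames, model, d):
    """Predict d future frames from an initial window of frames,
    feeding each synthesized frame back in as input."""
    window = list(frames)
    out = []
    for _ in range(d):
        nxt = model(window)
        out.append(nxt)
        window = window[1:] + [nxt]  # slide the input window forward
    return out
```

Because predictions are fed back in, errors compound over steps, which is why a smaller learning rate helps when fine-tuning for this setting.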
Compete with the State-of-the-art
Training set: UCF-101 Train, 240k triplets
Test set: UCF-101 Test, THUMOS-15
Competing methods (state-of-the-art)
Interpolation
Extrapolation
Multi-step comparisons
Effectiveness of Multi-scale Voxel Flow
Appearance
Motion
UCF-101 test set
Generalization to View Synthesis
Evaluated on the KITTI odometry dataset
Frame Synthesis as Self-Supervision
Video frame synthesis can serve as a self-supervision task for representation learning
Flow estimation
(endpoint error)
Action recognition
Application
Produce slow-motion effects on HD videos (1080×720, 30 fps)
EpicFlow serves as a strong baseline
Application-Visual Comparison
EpicFlow
Ground Truth
DVF
Application-User Study
20 subjects were enrolled
For the null hypothesis:
Demo Video
Future Work