Depth-Aware Video Frame Interpolation
PhD Candidate
Fall 2020
Wenbo Bao et al.
Context
Temporal up-sampling of video data
Approach?
Machine learning, obviously, but what's the intuition?
- Depth aware optical flow calculation
- Incorporate warping into the architecture
- Compose sub-networks that are pre-trained on related problems
This work is a good example of using domain knowledge and transfer learning to produce a state-of-the-art model.
Optical Flow
Models motion between two frames of video

Optical Flow
Visualization


- Hue encodes direction
- Saturation encodes magnitude of displacement
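The hue/saturation encoding above can be sketched in a few lines of numpy. This is an illustrative helper (not from the paper): hue comes from the flow direction, saturation from the normalized magnitude.

```python
import numpy as np

def flow_to_hsv(flow):
    """Map a dense flow field (H, W, 2) to HSV channels:
    hue encodes direction, saturation encodes magnitude."""
    u, v = flow[..., 0], flow[..., 1]
    angle = np.arctan2(v, u)                 # direction in radians
    mag = np.sqrt(u ** 2 + v ** 2)           # displacement magnitude
    hue = (angle + np.pi) / (2 * np.pi)      # normalize direction to [0, 1]
    sat = mag / (mag.max() + 1e-8)           # normalize magnitude to [0, 1]
    val = np.ones_like(hue)                  # full brightness everywhere
    return np.stack([hue, sat, val], axis=-1)

# uniform rightward motion -> constant hue, full saturation
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
hsv = flow_to_hsv(flow)
```

A standard HSV-to-RGB conversion (e.g. `matplotlib.colors.hsv_to_rgb`) then produces the familiar color-wheel visualization.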
Schuster, R. et al. Combining Stereo Disparity and Optical Flow for Basic Scene Flow
Warping
Warping the image at time \(t\) by the optical flow produces an image similar (assuming no brightness change) to the image at time \(t+1\).
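A minimal sketch of warping (my own toy example, using nearest-neighbour sampling rather than the bilinear sampling used in practice): each output pixel reads from the location its flow vector points to.

```python
import numpy as np

def warp_nearest(img, flow):
    """Warp img by a dense flow field with nearest-neighbour sampling:
    out(x) = img(x + flow(x)), clipped at the image border."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

# a horizontal gradient image, shifted one pixel by a uniform flow
img = np.tile(np.arange(5.0), (5, 1))
flow = np.ones((5, 5, 2)) * np.array([1.0, 0.0])  # every pixel samples x+1
out = warp_nearest(img, flow)
```

With a flow estimated between two real frames, `out` would approximate the next frame wherever the brightness-constancy assumption holds.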



Network

Transfer learning!
Network
Flow Estimation
Depth estimation
D. Sun. PWC-net: CNNs for optical flow using pyramid, warping, and cost volume.
- Uses pretrained PWC weights
- Flow estimation is not a well-posed problem; additional constraints must be added (a good application for ML)
L. Zhengqi. MegaDepth: Learning Single-View Depth Prediction from Internet Photos
- Uses pretrained network weights
Provides depth maps \(D_0(\mathbf{x})\) and \(D_1(\mathbf{x})\) for the two input frames.
Provides flow functions \(\mathbf{F}_{0\to 1}(\mathbf{x})\) and \(\mathbf{F}_{1\to 0}(\mathbf{x})\) for the two input frames.
Network

Incorporate depth information
Depth Aware Flow Projection
Depth is an important cue: closer objects matter more, and depth helps define object boundaries.
\(\mathbf{F}_{t\to 0}(\mathbf{x})=-t\cdot \frac{\displaystyle\sum_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}w_0(\mathbf{y})\cdot\mathbf{F}_{0\to 1}(\mathbf{y})}{\displaystyle\sum_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}w_0(\mathbf{y})}\), with \(w_0(\mathbf{y})=\frac{1}{D_0(\mathbf{y})}\)
(similarly for \(\mathbf{F}_{t\to 1}(\mathbf{x})\) )
\(\mathcal{S}(\mathbf{x})=\{\mathbf{y}:\operatorname{round}(\mathbf{y} + t\cdot\mathbf{F}_{0\to 1}(\mathbf{y})) = \mathbf{x}\}\)
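The projection above can be read as a voting scheme: each source pixel \(\mathbf{y}\) casts its flow into the target location it lands on at time \(t\), weighted by inverse depth so closer objects dominate. A minimal numpy sketch of that reading (the paper additionally fills un-voted holes by averaging neighbours, which is omitted here):

```python
import numpy as np

def project_flow(flow01, depth0, t):
    """Depth-aware flow projection: approximate F_{t->0}(x) by gathering
    votes from all y with round(y + t * F_{0->1}(y)) == x, each weighted
    by inverse depth w_0(y) = 1 / D_0(y). Un-voted pixels stay zero."""
    h, w = depth0.shape
    f_t0 = np.zeros((h, w, 2))
    wsum = np.zeros((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.round(xs + t * flow01[..., 0]).astype(int)  # landing column
    ty = np.round(ys + t * flow01[..., 1]).astype(int)  # landing row
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    weight = 1.0 / depth0
    for y, x in zip(ys[valid], xs[valid]):
        j, i = ty[y, x], tx[y, x]
        f_t0[j, i] += weight[y, x] * flow01[y, x]
        wsum[j, i] += weight[y, x]
    hit = wsum > 0
    f_t0[hit] *= -t / wsum[hit][:, None]  # normalize and negate (F_{t->0})
    return f_t0

# one pixel moving right by 1; everything at unit depth
flow01 = np.zeros((5, 5, 2))
flow01[2, 2, 0] = 1.0
f = project_flow(flow01, np.ones((5, 5)), t=0.4)
```

The moving pixel's vote lands back on its own cell here, so `f[2, 2]` is \(-t\) times its flow, i.e. `[-0.4, 0]`, and everything else is zero.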
Network

Incorporate depth information
Adaptive Warping Layer
Adaptively warps the input
\(\mathbf{\hat{I}}(\mathbf{x})=\displaystyle\sum_{\mathbf{r}\in[-R+1, R]^2}k_\mathbf{r}(\mathbf{x})\,\mathbf{I}(\mathbf{x} + \lfloor \mathbf{F}(\mathbf{x})\rfloor + \mathbf{r})\)
Learned Kernel (makes this 'adaptive')
W. Bao et al. MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement
Note that all of these transformations are differentiable!
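A direct (deliberately slow, loop-based) numpy transcription of the adaptive warping equation, to make the indexing concrete. The per-pixel kernels would come from a learned estimation sub-network; here they are simply passed in as an array.

```python
import numpy as np

def adaptive_warp(img, flow, kernels, R=2):
    """Adaptive warping: out(x) = sum over r in [-R+1, R]^2 of
    k_r(x) * img(x + floor(flow(x)) + r). `kernels` has shape
    (H, W, 2R, 2R) and plays the role of the learned k_r(x)."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            fx, fy = np.floor(flow[y, x]).astype(int)
            acc = 0.0
            for j, ry in enumerate(range(-R + 1, R + 1)):
                for i, rx in enumerate(range(-R + 1, R + 1)):
                    sy = np.clip(y + fy + ry, 0, h - 1)  # clamp at border
                    sx = np.clip(x + fx + rx, 0, w - 1)
                    acc += kernels[y, x, j, i] * img[sy, sx]
            out[y, x] = acc
    return out

# zero flow + a delta kernel at r = (0, 0) reproduces the input exactly
img = np.arange(9.0).reshape(3, 3)
flow = np.zeros((3, 3, 2))
kernels = np.zeros((3, 3, 4, 4))
kernels[:, :, 1, 1] = 1.0  # index 1 corresponds to offset r = 0 when R = 2
out = adaptive_warp(img, flow, kernels)
```

When the kernel is a bilinear-interpolation stencil instead of a delta, this reduces to ordinary bilinear warping, which is why the learned kernel is what makes the layer "adaptive".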
Network

U-Net architecture (not pre-trained); its output is reshaped into the per-pixel kernels for the adaptive warping layer.
O. Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation
Network

Context extraction is straightforward: a stack of residual blocks (not pre-trained; 7×7 convolutions, ReLU, and skip connections).
Network

Frame synthesis: stacked residual blocks
Training

Params: AdaMax with \(\beta_1=0.9, \beta_2=0.999\), batch size of 2, learning rates of 1e-4, 1e-6, and 1e-7 (lower rates for the pretrained sub-networks)
Training: 30 epochs, halving the learning rate every 10 epochs
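The step schedule, as I read the slide (halve the base rate every 10 epochs over the 30-epoch run), can be sketched as a tiny helper; the smaller 1e-6 and 1e-7 rates for the pretrained sub-networks would scale the same way.

```python
def learning_rate(epoch, base_lr=1e-4):
    """Step schedule: halve the rate every 10 epochs.
    Epochs 0-9 -> 1e-4, 10-19 -> 5e-5, 20-29 -> 2.5e-5."""
    return base_lr * 0.5 ** (epoch // 10)
```

With a framework like PyTorch this is what `optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)` would produce.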
Experiments
- Ran on a variety of datasets: Middlebury, Vimeo90K, UCF101, HD (see paper for references)
- Different metrics: Interpolation Error (IE), PSNR, and SSIM (depending on the dataset)
- Assess the components of the network
- Assess the performance of the network
Experiments


Results

Results

DAIN
By Joshua Horacsek