Depth-Aware Video Frame Interpolation

PhD Candidate

Fall 2020

Wenbo Bao et al.

Context

Temporal up-sampling of video data

Approach?

Machine learning, obviously, but what's the intuition?

  • Depth aware optical flow calculation
  • Encorporate warping into architecture
  • Composition of architectures that are pre-trained to solve problems

This work is a good example of using domain knowledge and transfer learning to produce a state of the art model.

Optical Flow

Models motion between two frames of video

Optical Flow

Visualization

  • Hue encodes direction
  • Saturation encodes magnitude of displacement

Schuster, R. et al. Combining Stereo Disparity and Optical Flow for Basic Scene Flow 

Warping

Warping  the image at time \(t\) by the opical flow produces an image similar (no brightness change) to the image at time \(t+1\).

Network

Transfer learning!

Network

Flow Estimation

Depth estimation

D. Sun. PWC-net: CNNs for optical flow using pyramid, warping, and cost volume.

  • Uses pretrained PWC weights 
  • Flow calculation is not a well posed problem, additional constraints must be added (good application for ML)

L. Zhengqi. MegaDepth: Learning Single-View Depth Prediction from Internet Photos

  • Uses pretrained network weights 

Provides depth maps \(D_0(\mathbf{x})\)and \(D_1(\mathbf{x})\) for the two input frames.

Provides flow functions \(\mathbf{F}_{0\to 1}(\mathbf{x})\) and \(\mathbf{F}_{1\to 0}(\mathbf{x})\) for the two input frames.

Network

Incorporate depth information

Depth Aware Flow Projection

Depth is an important cue, closer objects are more important, helps define object boundaries

\(\mathbf{F}_{t\to 0}(\mathbf{x})=-t\cdot \frac{\displaystyle\sum_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}w(\mathbf{y})\cdot\mathbf{F}_{0\to 1}(\mathbf{y})}{\displaystyle\sum_{\mathbf{y}\in\mathcal{S(\mathbf{x})}}w(\mathbf{y})}\), with \(w_0(\mathbf{y})=\frac{1}{D_0(\mathbf{y})}\)

 (similarly for \(\mathbf{F}_{t\to 1}(\mathbf{x})\) )

\(S(\mathbf{x})=\{y:round(\mathbf{y} + t\cdot\mathbf{F}_{0\to 1}(\mathbf{y})) = \mathbf{x}, \ \forall \mathbf{y} \}\)

Network

Encorporate depth information

Adaptive Warping Layer

Actively warps input

\(\mathbf{\hat{I}}(\mathbf{x})=\displaystyle\sum_{r\in[-R+1, R]^2}k_\mathbf{r}(\mathbf{x})\mathbf{I}(\mathbf{x} + \lfloor \mathbf{F}(\mathbf{x})\rfloor + \mathbf{r})\)

Learned Kernel (makes this 'adaptive')

W. Bao et al. MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement

Note these transformations have been differentiable!

Network

U-Net architecture, not pre-trained, then reshaped for the adaptive warping layer. 

O. Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation

Network

Context extraction isn't that interesting, a bunch of residual blocks (not pretrained, 7x7 convolution blocks ReLU and skips...)

Network

Frame synthesis, stacked residual blocks 

Training

Params: AdaMax with \(\beta_1=0.9, \beta_2=0.999\), batch size of 2, learning rate = 1e-4, 1e-6, 1e-7

Pretrained

Training: 30 epochs, half the learning rate then 10

Experiments

  • Ran on a variety of datasets: Milddlebury, Vimeo90K, UCF101, HD (see paper for references)
  • Different metrics: Interpolation Error (IE) PSNR and SSIM (depending on the dataset)
  • Assess the components of the network
  • Assess the performance of the network

Experiments

Results

Results

Dain

By Joshua Horacsek

Dain

  • 787