PhD Candidate
Fall 2020
Wenbo Bao et al.
Temporal up-sampling of video data
Machine learning, obviously, but what's the intuition?
This work is a good example of using domain knowledge and transfer learning to produce a state-of-the-art model.
Models motion between two frames of video
R. Schuster et al. Combining Stereo Disparity and Optical Flow for Basic Scene Flow
Warping the image at time \(t\) by the optical flow produces an image similar to the image at time \(t+1\) (assuming no brightness change).
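To make the warping idea concrete, here is a minimal NumPy sketch. It assumes a dense flow field stored as per-pixel (dx, dy) displacements and uses nearest-neighbour rounding; the name `forward_warp` and the toy frames are illustrative only and do not come from the cited work.

```python
import numpy as np

def forward_warp(image, flow):
    """Move each pixel of `image` along its flow vector (nearest-neighbour splat).

    image: (H, W) array, the frame at time t.
    flow:  (H, W, 2) array of (dx, dy) displacements from t to t+1.
    Pixels that leave the frame are dropped; unfilled targets stay 0.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Target coordinates of every source pixel, rounded to the pixel grid.
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    warped = np.zeros_like(image)
    warped[ty[inside], tx[inside]] = image[ys[inside], xs[inside]]
    return warped

# Toy example: a bright 2x2 square that moves one pixel to the right.
frame_t = np.zeros((5, 5))
frame_t[1:3, 1:3] = 1.0
flow = np.zeros((5, 5, 2))
flow[..., 0] = 1.0  # every pixel moves +1 in x
frame_t1_estimate = forward_warp(frame_t, flow)
```

Under brightness constancy, `frame_t1_estimate` should match the true next frame wherever the flow is correct and nothing is occluded.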
Transfer learning!
Flow Estimation
Depth Estimation
D. Sun et al. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume
Z. Li and N. Snavely. MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Provides depth maps \(D_0(\mathbf{x})\) and \(D_1(\mathbf{x})\) for the two input frames.
Provides flow functions \(\mathbf{F}_{0\to 1}(\mathbf{x})\) and \(\mathbf{F}_{1\to 0}(\mathbf{x})\) for the two input frames.
Incorporate depth information
Depth is an important cue: closer objects are more important, and depth helps define object boundaries
\(\mathbf{F}_{t\to 0}(\mathbf{x})=-t\cdot \frac{\displaystyle\sum_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}w_0(\mathbf{y})\cdot\mathbf{F}_{0\to 1}(\mathbf{y})}{\displaystyle\sum_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}w_0(\mathbf{y})}\), with \(w_0(\mathbf{y})=\frac{1}{D_0(\mathbf{y})}\)
(similarly for \(\mathbf{F}_{t\to 1}(\mathbf{x})\))
\(\mathcal{S}(\mathbf{x})=\{\mathbf{y}:\mathrm{round}(\mathbf{y} + t\cdot\mathbf{F}_{0\to 1}(\mathbf{y})) = \mathbf{x}, \ \forall \mathbf{y} \}\)
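The projection formula can be sketched in NumPy as follows. This is an illustrative reading of the equations, not the authors' implementation: `depth_aware_flow_projection` is a hypothetical name, flow is assumed to be stored as (dx, dy), source pixels that project outside the frame are ignored, and target pixels that no source pixel lands on are simply left at zero.

```python
import numpy as np

def depth_aware_flow_projection(flow_01, depth_0, t):
    """Approximate F_{t->0} from F_{0->1} using inverse-depth weights.

    flow_01: (H, W, 2) flow from frame 0 to frame 1, as (dx, dy).
    depth_0: (H, W) positive depth map of frame 0.
    t:       time of the intermediate frame, 0 < t < 1.
    Returns the projected flow and a mask of pixels with a non-empty S(x).
    """
    h, w = depth_0.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each source pixel y lands at time t: round(y + t * F_{0->1}(y)).
    # Note: np.rint rounds exact halves to the nearest even integer.
    tx = np.rint(xs + t * flow_01[..., 0]).astype(int)
    ty = np.rint(ys + t * flow_01[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    weight = 1.0 / depth_0  # w_0(y) = 1 / D_0(y): near pixels count more
    num = np.zeros((h, w, 2))
    den = np.zeros((h, w))
    # np.add.at accumulates correctly when several pixels share a target.
    np.add.at(num, (ty[inside], tx[inside]),
              weight[inside][:, None] * flow_01[inside])
    np.add.at(den, (ty[inside], tx[inside]), weight[inside])

    flow_t0 = np.zeros((h, w, 2))
    hit = den > 0
    flow_t0[hit] = -t * num[hit] / den[hit][:, None]
    return flow_t0, hit

# Toy example: a near pixel (depth 1) moving 2 px right and a static
# far pixel (depth 3) both land on x = 1 at t = 0.5.
flow_01 = np.zeros((1, 4, 2))
flow_01[0, 0, 0] = 2.0
depth_0 = np.array([[1.0, 3.0, 3.0, 3.0]])
flow_t0, hit = depth_aware_flow_projection(flow_01, depth_0, t=0.5)
```

In this toy case the projected flow at x = 1 is -0.75, pulled toward the near pixel's motion (-1.0) compared with the unweighted average of -0.5.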
Incorporate depth information
Actively warps input
\(\mathbf{\hat{I}}(\mathbf{x})=\displaystyle\sum_{\mathbf{r}\in[-R+1, R]^2}k_\mathbf{r}(\mathbf{x})\mathbf{I}(\mathbf{x} + \lfloor \mathbf{F}(\mathbf{x})\rfloor + \mathbf{r})\)
Learned Kernel (makes this 'adaptive')
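A rough NumPy sketch of the warping sum, assuming the per-pixel kernels are already given as an array (in the model they would come from a learned network). The name `adaptive_warp`, the (dx, dy) flow layout, and clamping at the image border are assumptions made for illustration.

```python
import numpy as np

def adaptive_warp(image, flow, kernels):
    """Sample a local window around x + floor(F(x)) with per-pixel weights.

    image:   (H, W) input frame.
    flow:    (H, W, 2) flow as (dx, dy).
    kernels: (H, W, 2R, 2R) weights k_r(x); axis 2 indexes the vertical
             offset and axis 3 the horizontal offset, each in [-R+1, R].
    Out-of-range sample positions are clamped to the border (an assumption).
    """
    h, w = image.shape
    R = kernels.shape[2] // 2
    ys, xs = np.mgrid[0:h, 0:w]
    base_x = xs + np.floor(flow[..., 0]).astype(int)
    base_y = ys + np.floor(flow[..., 1]).astype(int)
    out = np.zeros((h, w))
    for i, ry in enumerate(range(-R + 1, R + 1)):
        for j, rx in enumerate(range(-R + 1, R + 1)):
            sy = np.clip(base_y + ry, 0, h - 1)
            sx = np.clip(base_x + rx, 0, w - 1)
            out += kernels[:, :, i, j] * image[sy, sx]
    return out

# Toy example with R = 1: all weight on offset r = (0, 0) and a flow of
# +1 px in x, so the output is the input shifted left by one pixel.
image = np.arange(12, dtype=float).reshape(3, 4)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
kernels = np.zeros((3, 4, 2, 2))
kernels[:, :, 0, 0] = 1.0
shifted = adaptive_warp(image, flow, kernels)
```

With non-trivial kernels, each output pixel becomes a weighted blend of its 2R x 2R neighbourhood, which is what lets the layer adapt per pixel.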
W. Bao et al. MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement
Note that all of these transformations are differentiable!
Kernel estimation: U-Net architecture, not pre-trained; its output is reshaped for the adaptive warping layer.
O. Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation
Context extraction isn't that interesting: a bunch of residual blocks (not pre-trained; 7x7 convolution blocks, ReLU, and skip connections...)
Frame synthesis: stacked residual blocks
Params: AdaMax with \(\beta_1=0.9, \beta_2=0.999\); batch size of 2; learning rates of 1e-4 (newly trained modules), 1e-6 (pre-trained flow network), and 1e-7 (pre-trained depth network)
Training: 30 epochs, then halve the learning rate and train for 10 more epochs
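The schedule can be written down as a small pure-Python helper. This is a sketch under stated assumptions: the mapping of the three learning rates to modules (1e-4 for the newly trained kernel, context, and synthesis networks; 1e-6 for flow; 1e-7 for depth) is my reading rather than something the slide spells out, and `BASE_LR` and `learning_rate` are hypothetical names.

```python
# Assumed assignment of the three base learning rates to sub-networks.
BASE_LR = {
    "kernel": 1e-4,
    "context": 1e-4,
    "synthesis": 1e-4,
    "flow": 1e-6,   # pre-trained, so updated more gently
    "depth": 1e-7,  # pre-trained, so updated more gently
}

def learning_rate(module, epoch, first_stage=30, decay=0.5):
    """Learning rate for `module` at a zero-based `epoch`.

    The base rate is used for the first `first_stage` epochs and is then
    multiplied by `decay` for the remaining fine-tuning epochs.
    """
    scale = 1.0 if epoch < first_stage else decay
    return BASE_LR[module] * scale

# Rates for the flow network over the 30 + 10 epoch run.
flow_schedule = [learning_rate("flow", e) for e in range(40)]
```

Keeping the pre-trained networks on much smaller rates is a common transfer-learning choice: it lets them adapt to the new task without discarding what they already learned.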