Arvin Liu @ MiRA
[Figure: problem setup. Paired inputs: Modal A (RGB image) into Model A, Modal B (depth image) into Model B; output is semantic segmentation.]
[Figure: cross-modal knowledge distillation. Model A takes Modal A (RGB image), Model B takes Modal B (depth image), trained on paired data for semantic segmentation; knowledge distillation between the two models; inference mode shown separately.]
[Figure: two ways to combine Modal A (RGB image) and Modal B (depth image). (A) Joint Representation: Feature A and Feature B are merged and fed to a single classifier. (B) Coordinate Representation: Feature A and Feature B keep separate classifiers, with a constraint / similarity term between the two features.]
Strategy 1 - Early Fusion
[Figure: Feature A and Feature B are merged, then passed to a single classifier.]
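A minimal sketch of early fusion in PyTorch, assuming each modality already yields a fixed-size feature vector (the dimensions and class count below are made up for illustration): the two features are concatenated and a single classifier runs on the merged feature.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Early fusion: merge Feature A and Feature B, then classify once."""
    def __init__(self, dim_a=128, dim_b=128, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, feat_a, feat_b):
        # "Merge" here is a simple concatenation of the two modality features.
        fused = torch.cat([feat_a, feat_b], dim=-1)
        return self.classifier(fused)

logits = EarlyFusion()(torch.randn(4, 128), torch.randn(4, 128))  # shape (4, 10)
```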
Strategy 2 - Late Fusion
[Figure: Feature A goes to Classifier A and Feature B to Classifier B; the two outputs are ensembled into the final output.]
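A matching sketch of late fusion under the same assumptions: each modality keeps its own classifier, and the ensemble here is simply an average of the two logits (other ensembling rules work as well).

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Late fusion: Classifier A and Classifier B run separately, then ensemble."""
    def __init__(self, dim_a=128, dim_b=128, num_classes=10):
        super().__init__()
        self.classifier_a = nn.Linear(dim_a, num_classes)
        self.classifier_b = nn.Linear(dim_b, num_classes)

    def forward(self, feat_a, feat_b):
        # A simple ensemble: average the two classifiers' logits.
        return 0.5 * (self.classifier_a(feat_a) + self.classifier_b(feat_b))

logits = LateFusion()(torch.randn(4, 128), torch.randn(4, 128))  # shape (4, 10)
```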
Tensor Fusion Network (TFN)
[Figure: Feature A and Feature B are fused by TFN before the classifier.]
TFN makes the fused feature more expressive by combining the two modality features with an outer product instead of a simple merge.
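A rough sketch of the TFN idea with made-up feature sizes: each feature is augmented with a constant 1 (so unimodal terms survive in the product) and the fused feature is the flattened outer product of the two, which grows quadratically with the feature sizes.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """TFN-style fusion: outer product of the 1-augmented modality features."""
    def __init__(self, dim_a=32, dim_b=32, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear((dim_a + 1) * (dim_b + 1), num_classes)

    def forward(self, feat_a, feat_b):
        ones = feat_a.new_ones(feat_a.size(0), 1)
        a = torch.cat([feat_a, ones], dim=-1)                # (B, dim_a + 1)
        b = torch.cat([feat_b, ones], dim=-1)                # (B, dim_b + 1)
        fused = torch.bmm(a.unsqueeze(2), b.unsqueeze(1))    # outer product, (B, dim_a+1, dim_b+1)
        return self.classifier(fused.flatten(1))

logits = TensorFusion()(torch.randn(4, 32), torch.randn(4, 32))  # shape (4, 10)
```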
Low-rank Multimodal Fusion (LMF)
LMF = low-rank approximation + TFN: it factorizes TFN's fusion tensor so the full outer product never has to be built explicitly.
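A sketch of the low-rank version under the same illustrative assumptions: instead of materializing the outer-product tensor, each modality is projected with rank-r factors, the projections are multiplied elementwise, and the results are summed over the rank dimension, which is algebraically a rank-r factorization of the TFN fusion.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """LMF-style fusion: avoid building the (dim_a+1) x (dim_b+1) tensor explicitly."""
    def __init__(self, dim_a=32, dim_b=32, out_dim=10, rank=4):
        super().__init__()
        # Rank-r factors, one set per modality (with the 1-augmentation as in TFN).
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a + 1, out_dim) * 0.1)
        self.factor_b = nn.Parameter(torch.randn(rank, dim_b + 1, out_dim) * 0.1)

    def forward(self, feat_a, feat_b):
        ones = feat_a.new_ones(feat_a.size(0), 1)
        a = torch.cat([feat_a, ones], dim=-1)                    # (B, dim_a + 1)
        b = torch.cat([feat_b, ones], dim=-1)                    # (B, dim_b + 1)
        # Project each modality with its rank-r factors, multiply elementwise,
        # then sum over the rank dimension: a low-rank equivalent of TFN.
        proj_a = torch.einsum('bd,rdo->rbo', a, self.factor_a)   # (r, B, out_dim)
        proj_b = torch.einsum('bd,rdo->rbo', b, self.factor_b)   # (r, B, out_dim)
        return (proj_a * proj_b).sum(dim=0)                      # (B, out_dim)

out = LowRankFusion()(torch.randn(4, 32), torch.randn(4, 32))    # shape (4, 10)
```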
Roughly: joint representation = multi-modal representation learning, while coordinate representation ~= cross-modal distillation.
[Figure: Feature A and Feature B with a constraint / similarity term between them, feeding the classifier.]
A GAN-based method can be used to make the two modalities' features merge into a shared space.
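One way to read the GAN idea (the concrete method on the slide is not specified here, so this is only an illustrative sketch with assumed feature shapes and network sizes): a discriminator tries to tell which modality a feature came from, and the encoders are trained to fool it, which pushes Feature A and Feature B toward a shared space.

```python
import torch
import torch.nn as nn

# Hypothetical setup: feat_a / feat_b are batches of modality features that
# already live in the same d-dimensional space.
d = 128
discriminator = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(feat_a, feat_b):
    # D should output "1" for modality-A features and "0" for modality-B features.
    logits_a = discriminator(feat_a.detach())
    logits_b = discriminator(feat_b.detach())
    return bce(logits_a, torch.ones_like(logits_a)) + bce(logits_b, torch.zeros_like(logits_b))

def alignment_loss(feat_a, feat_b):
    # The encoders are trained to fool D, pulling the two feature
    # distributions together (the "constraint / similarity" arrow).
    logits_a = discriminator(feat_a)
    logits_b = discriminator(feat_b)
    return bce(logits_a, torch.zeros_like(logits_a)) + bce(logits_b, torch.ones_like(logits_b))
```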
[Figure: cross-modal knowledge distillation, same paired RGB / depth semantic-segmentation setup as above; inference mode shown separately.]
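A minimal sketch of the distillation arrow at the logit level, assuming the depth model acts as teacher and the RGB model as student on paired inputs (both the direction and the soft-label KL loss are assumptions, not taken from the slides):

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label KD: the student (e.g. the RGB model) matches the softened
    predictions of the teacher (e.g. the depth model) on paired inputs."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction='batchmean') * (t * t)

loss = cross_modal_kd_loss(torch.randn(4, 10), torch.randn(4, 10))
```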
[Figure: FitNet. A Teacher Net and a Student Net are trained on the same dataset; the student's feature (S's feat) is passed through a regressor to produce a transformed feature, which is matched to the teacher's feature (T's feat) with an L2 loss.]
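The FitNet hint loss from the figure, as a small sketch with illustrative feature sizes: the student's feature is transformed by a regressor into the teacher's feature space and pulled toward the teacher's feature with an L2 loss.

```python
import torch
import torch.nn as nn

# Illustrative feature sizes; the regressor bridges the dimension gap
# between the student's feature and the teacher's feature.
student_dim, teacher_dim = 64, 256
regressor = nn.Linear(student_dim, teacher_dim)

def fitnet_hint_loss(student_feat, teacher_feat):
    # Transform S's feature, then pull it toward T's feature with an L2 loss.
    return nn.functional.mse_loss(regressor(student_feat), teacher_feat.detach())

loss = fitnet_hint_loss(torch.randn(4, 64), torch.randn(4, 256))
```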
[Figure: the same cross-modal knowledge distillation setup on paired RGB / depth data for semantic segmentation, now with one model ImageNet-pretrained; inference mode shown separately.]
Hallucination
Hallucination Network
[Figure: Model A takes Modal A (RGB image) and Model B takes Modal B (depth image); a third network, Model B', also takes the RGB image and is trained with FitNet-style distillation to hallucinate Model B's depth features; applied to object detection.]
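A sketch of one training step for the hallucination idea, with an assumed interface where each model returns (feature, logits) and task_loss_fn is the ordinary task loss: Model B' only ever sees the RGB image, but a FitNet-style L2 term makes its feature mimic the depth model's feature, so no depth image is needed at inference time.

```python
import torch
import torch.nn as nn

def hallucination_step(model_a, model_b, model_b_prime, rgb, depth, labels, task_loss_fn):
    """One training step: task losses plus a hallucination (feature-mimic) loss.

    model_a       -- RGB network
    model_b       -- depth network (provides the features to mimic)
    model_b_prime -- hallucination network: takes RGB, imitates model_b's features
    """
    feat_a, logits_a = model_a(rgb)
    feat_b, logits_b = model_b(depth)
    feat_h, logits_h = model_b_prime(rgb)   # hallucinated "depth" features from RGB

    # FitNet-style distillation: match hallucinated features to real depth features.
    hallucination_loss = nn.functional.mse_loss(feat_h, feat_b.detach())

    task_loss = (task_loss_fn(logits_a, labels)
                 + task_loss_fn(logits_b, labels)
                 + task_loss_fn(logits_h, labels))
    return task_loss + hallucination_loss
```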
[Figure: Model S and Model T (pretrained), with Modal A (RGB image) and Modal B (depth image); FitNet-style distillation between them.]
One of the contributions: a theoretical proof that Model S ends up better than Model T.
What Makes Multi-modal Learning Better than Single (Provably) (NIPS '21)
Tutorial:
Tutorial on Multimodal Machine Learning (ACL '17)