Intro 2 Multi-Modal
Arvin Liu @ MiRA
Agenda
- Intro to various MM problems
- Multimodal Representation
  - Joint Representation
  - Coordinated Representation
- Cross-Modal Distillation
- Conclusion & Comments
Intro to
Various Settings
Settings - Multimodal Representation
[Diagram: paired Modality A (RGB image) into Model A and Modality B (depth image) into Model B; the two branches merge into a single output (semantic segmentation).]
Settings - Cross-Modal Distillation
[Diagram: paired Modality A (RGB image) into Model A and Modality B (depth image) into Model B; knowledge distillation transfers across the paired models, and only one of them runs in inference mode for the task (semantic segmentation).]
Settings - Multi-Modal Hashing
- You can think of it as the inverse of disentangling: the different modalities are mapped into a single shared code.
Settings about
Multimodal Repr.
Settings - Multimodal Representation
[Diagram repeated from the settings overview: paired RGB and depth inputs, one model per modality, merged into a single output (semantic segmentation).]
- Key question: how do we merge the different modalities?
Types of Multimodal Representation
[Diagram: Modality A (RGB image) into Model A yields Feature A; Modality B (depth image) into Model B yields Feature B.]
- (A) Joint Representation: Feature A and Feature B are merged and fed to a single classifier.
- (B) Coordinated Representation: Feature A and Feature B keep separate classifiers, tied by a constraint / similarity term.
Joint Representation
Naive Methods for Joint Representation
- Strategy 1 - Early Fusion: merge Feature A and Feature B (e.g. concatenate them), then feed the merged feature to a single classifier.
- Strategy 2 - Late Fusion: run Classifier A on Feature A and Classifier B on Feature B, then ensemble the two predictions into one output.
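A minimal PyTorch sketch of the two strategies; the feature sizes, class count, and the averaging ensemble are illustrative assumptions, not from the slides:

```python
# Minimal sketch of the two naive joint-representation strategies.
import torch
import torch.nn as nn

feat_a = torch.randn(8, 128)  # Feature A (e.g. from an RGB encoder)
feat_b = torch.randn(8, 64)   # Feature B (e.g. from a depth encoder)

# Strategy 1 - Early fusion: merge the features, then one classifier.
early_classifier = nn.Linear(128 + 64, 10)
early_logits = early_classifier(torch.cat([feat_a, feat_b], dim=-1))

# Strategy 2 - Late fusion: one classifier per modality, then ensemble.
classifier_a = nn.Linear(128, 10)
classifier_b = nn.Linear(64, 10)
late_logits = (classifier_a(feat_a) + classifier_b(feat_b)) / 2  # simple averaging ensemble
```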
TFN (EMNLP '17) - Tensor Fusion Network
[Diagram: Feature A and Feature B pass through a TFN block before the classifier.]
- Idea: use the tensor (outer) product of the features to make the joint feature more expressive than plain concatenation.
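The core TFN operation is an outer product of the features with a constant 1 appended, so unimodal terms survive next to the bimodal interaction terms. A sketch of the bimodal case (the paper fuses three modalities; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

def tensor_fusion(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Bimodal tensor fusion: outer product of the 1-appended features."""
    ones = feat_a.new_ones(feat_a.size(0), 1)
    za = torch.cat([feat_a, ones], dim=-1)      # (B, Da + 1)
    zb = torch.cat([feat_b, ones], dim=-1)      # (B, Db + 1)
    fused = torch.einsum('bi,bj->bij', za, zb)  # (B, Da + 1, Db + 1)
    return fused.flatten(start_dim=1)           # keeps unimodal + bimodal terms

feat_a, feat_b = torch.randn(8, 128), torch.randn(8, 64)
classifier = nn.Linear((128 + 1) * (64 + 1), 10)
logits = classifier(tensor_fusion(feat_a, feat_b))
```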
LMF (ACL '18) - Low-rank Multimodal Fusion
- LMF = low-rank approximation + TFN: it factorizes TFN's fusion tensor so the full outer product never has to be materialized.
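A bimodal sketch of the usual rank-r factorization, assuming the standard LMF formulation (per-modality factors multiplied elementwise and summed over the rank axis); sizes and initialization are illustrative:

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Low-rank bimodal fusion: factorized version of the TFN outer product."""
    def __init__(self, dim_a: int, dim_b: int, dim_out: int, rank: int = 4):
        super().__init__()
        # One rank-r factor per modality, acting on the 1-appended feature.
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a + 1, dim_out) * 0.1)
        self.factor_b = nn.Parameter(torch.randn(rank, dim_b + 1, dim_out) * 0.1)

    def forward(self, feat_a, feat_b):
        ones = feat_a.new_ones(feat_a.size(0), 1)
        za = torch.cat([feat_a, ones], dim=-1)
        zb = torch.cat([feat_b, ones], dim=-1)
        # Project each modality with its rank-r factors: (B, rank, dim_out).
        pa = torch.einsum('bi,rio->bro', za, self.factor_a)
        pb = torch.einsum('bj,rjo->bro', zb, self.factor_b)
        # Elementwise product across modalities, summed over the rank axis;
        # the full (Da+1) x (Db+1) tensor is never materialized.
        return (pa * pb).sum(dim=1)  # (B, dim_out)

fusion = LowRankFusion(128, 64, dim_out=32, rank=4)
fused = fusion(torch.randn(8, 128), torch.randn(8, 64))
```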
Multimodal Deep Learning (ICML '11)
- It is multi-modal representation learning, and roughly cross-modal distillation as well (one modality can be reconstructed from the other).
Coordinated Representation
Coordinated Representation
[Diagram: Feature A and Feature B each feed their own classifier, tied together by a constraint / similarity term.]
- Goal: pull Feature A and Feature B closer together.
- Why does it work? You can think of Feature A and Feature B as distilling from each other.
- What constraints? Cosine similarity, L2, correlation, etc.
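A minimal sketch of a coordinated objective: each modality keeps its own classifier while a similarity constraint pulls the paired features together. The loss weight, feature sizes, and the choice of combining L2 and cosine terms are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Paired features from two modality-specific encoders (illustrative sizes).
feat_a = torch.randn(8, 128)
feat_b = torch.randn(8, 128)
labels = torch.randint(0, 10, (8,))

clf_a, clf_b = nn.Linear(128, 10), nn.Linear(128, 10)

# Each modality keeps its own classifier (as in the diagram)...
task_loss = F.cross_entropy(clf_a(feat_a), labels) + F.cross_entropy(clf_b(feat_b), labels)

# ...while a similarity constraint pulls the two features together.
l2_constraint = F.mse_loss(feat_a, feat_b)                               # L2
cos_constraint = 1 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()  # cosine
loss = task_loss + 0.1 * (l2_constraint + cos_constraint)                # 0.1: illustrative weight
```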
Adaptive Fusion (EACL '21)
- A GAN-based method for merging the two modalities.
Settings about
Cross-Modal Distillation
Settings - Cross-Modal Distillation
[Diagram repeated from the settings overview: paired RGB and depth models with knowledge distillation between them; single-modality inference mode.]
- Key question: how do we transfer knowledge at the intermediate (feature) level?
Recap: FitNet
- FitNet is a two-stage algorithm:
  - Stage 1: pretrain the first part of the student net (up to the guided layer) by regressing the teacher's hint feature.
  - Stage 2: train as in the original KD (baseline KD).
- Key point: FitNet distills knowledge from intermediate features.
[Diagram: the dataset feeds both the teacher net and the student net; the student's feature is transformed by a regressor and matched to the teacher's feature with an L2 loss.]
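A minimal sketch of the Stage-1 hint loss, assuming 2D conv features; the 1x1-conv regressor and all shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1 of FitNet: match the student's intermediate feature to the
# teacher's "hint" feature through a learned regressor, with an L2 loss.
t_feat = torch.randn(8, 256, 14, 14)   # teacher's hint feature (frozen)
s_feat = torch.randn(8, 128, 14, 14)   # student's guided feature

regressor = nn.Conv2d(128, 256, kernel_size=1)  # maps student dim -> teacher dim
hint_loss = F.mse_loss(regressor(s_feat), t_feat.detach())
# Stage 2 then trains the whole student with the usual soft-target KD loss.
```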
FitNet + MM: Cross Modal Distillation for Supervision Transfer (CVPR '16)
[Diagram: Model A (ImageNet-pretrained) takes Modality A (RGB image), Model B takes Modality B (depth image); knowledge distillation runs over the paired data, and Model B alone runs in inference mode for the task (semantic segmentation).]
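A sketch of the supervision-transfer step on one paired batch: the pretrained RGB teacher is frozen and the depth student regresses its mid-level features. The tiny conv encoders stand in for the paper's real networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the two networks; the paper uses deep conv nets.
rgb_net = nn.Conv2d(3, 64, 3, padding=1)    # ImageNet-pretrained teacher (frozen)
depth_net = nn.Conv2d(1, 64, 3, padding=1)  # depth student (trained)

rgb = torch.randn(4, 3, 32, 32)             # paired, unlabeled RGB-D batch
depth = torch.randn(4, 1, 32, 32)

with torch.no_grad():
    target = rgb_net(rgb)                    # teacher features, no gradient
loss = F.mse_loss(depth_net(depth), target)  # depth net learns RGB-derived supervision
```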
Modality Hallucination (CVPR '16)
[Diagram: Model A takes Modality A (RGB image), Model B takes Modality B (depth image); a hallucination network, Model B', takes the RGB image but is trained by FitNet-style distillation to mimic Model B's depth features; the branches feed object detection.]
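A sketch of the hallucination loss: Model B' sees only RGB but must match Model B's depth features, which is what lets inference run without the depth sensor. The conv stand-ins and shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature extractors; only the topology matters here.
rgb_net = nn.Conv2d(3, 64, 3, padding=1)     # Model A: RGB branch
depth_net = nn.Conv2d(1, 64, 3, padding=1)   # Model B: depth branch (teacher for B')
halluc_net = nn.Conv2d(3, 64, 3, padding=1)  # Model B': takes RGB, mimics depth features

rgb = torch.randn(4, 3, 32, 32)
depth = torch.randn(4, 1, 32, 32)

# FitNet-style hallucination loss: B' sees only RGB but must match B's
# depth features, so the detector can run on RGB alone at test time.
halluc_loss = F.mse_loss(halluc_net(rgb), depth_net(depth).detach())
```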
Cross-Modal Distillation + Multi-Modal?
Multimodal Knowledge Expansion (CVPR '21)
[Diagram: a pretrained teacher Model T on one modality (RGB image) distills, FitNet-style, into a multimodal student Model S that takes both Modality A (RGB image) and Modality B (depth image).]
- One of its contributions: a theoretical proof that the student Model S is better than the teacher Model T.
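A minimal sketch of the distillation flow as I read the diagram: a frozen unimodal teacher produces soft targets, and the multimodal student matches them with a KL loss. Module choices and sizes are assumptions; the paper's actual recipe involves more than this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical modules: a pretrained unimodal teacher, a multimodal student.
teacher = nn.Linear(128, 10)       # Model T: pretrained on Modality A only (frozen)
student = nn.Linear(128 + 64, 10)  # Model S: sees both modalities

feat_a, feat_b = torch.randn(8, 128), torch.randn(8, 64)

with torch.no_grad():
    soft_targets = F.softmax(teacher(feat_a), dim=-1)  # teacher's soft labels

student_logits = student(torch.cat([feat_a, feat_b], dim=-1))
kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1), soft_targets,
                   reduction='batchmean')
# The paper's claim: with the extra modality, S provably improves on T.
```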
Conclusion
- Joint Representation
  - Naive methods like early fusion are widely used nowadays (e.g. concatenation, outer product).
- Coordinated Representation
  - One branch is much like soft parameter sharing in MTL (multi-task learning); the other is like late fusion.
- Cross-Modal Distillation
  - People are bringing new KD techniques into the cross-modal setting, like BSKD, AT, etc.
- Theoretical proofs?
Comments
- Cross-Modal & Domain Adaptation
  - In fact, different modalities mean different domains, but here we have paired data.
  - This implies DA techniques can be applied in the cross-modal scenario.
- Cross-Modal & Knowledge Distillation
  - Just like the hallucination network, you can adopt various KD techniques and leave some degrees of freedom (like IE-KD).
- Multi-Modal & Multi-task Learning
  - The network architectures are similar, but mirrored:
  - joint repr. <-> hard parameter sharing, and coordinated repr. <-> soft parameter sharing.
Q & A ?
Reference
- Theoretical proof:
  - What Makes Multi-modal Learning Better than Single (Provably) (NeurIPS '21)
- Tutorial:
  - Tutorial on Multimodal Machine Learning (ACL '17)
- Implementations for multimodal representation:
  - Joint Representation:
    - Tensor Fusion Network for Multimodal Sentiment Analysis (EMNLP '17)
    - Adaptive Fusion Techniques for Multimodal Data (EACL '21)
  - Coordinated Representation:
    - Deep Canonical Correlation Analysis (ICML '13)
- FitNet + Multi-modal:
  - Leave some degrees of freedom: Learning with Side Information through Modality Hallucination (CVPR '16)
  - Distill from another domain: Cross Modal Distillation for Supervision Transfer (CVPR '16)
- DML + Multi-modal (but I think this kind of paper doesn't do DML very well):
  - Towards Cross-Modality Medical Image Segmentation with Online Mutual Knowledge Distillation (AAAI '20)
Cross Modal
By Arvin Liu