Intro 2 Multi-Modal

Arvin Liu @ MiRA

Agenda

  • Intro to various MM problems
  • Multimodal Representation
    • Joint Representation
    • Coordinated Representation
  • Cross-Modal Distillation
  • Conclusion & Comments

Intro to
Various Settings

Settings - Multimodal Repr

[Figure: paired inputs; Modal A (RGB image) → Model A, Modal B (depth image) → Model B; the two branches are merged into a single output (semantic segmentation).]

Settings - Cross-Modal Distillation

[Figure: paired inputs; Modal A (RGB image) → Model A, Modal B (depth image) → Model B, each predicting semantic segmentation; knowledge distillation links the two models, and only one of them is kept in inference mode.]

Settings - Multi-Modal Hashing

Just like the inverse of disentangling: instead of splitting one representation apart, multiple modalities are mapped into one shared code.

Settings about
Multimodal Repr.

Settings - Multimodal Repr

[Figure: paired inputs; Modal A (RGB image) → Model A, Modal B (depth image) → Model B; the two branches are merged into a single output (semantic segmentation).]

  • Key: how do we merge the different modalities?

Types of Multimodal Representation

[Figure: Modal A (RGB image) → Model A → Feature A, Modal B (depth image) → Model B → Feature B. (A) Joint Representation: the two features are merged and fed to a single classifier. (B) Coordinated Representation: each feature keeps its own classifier, with a constraint / similarity term tying Feature A and Feature B together.]

Joint Representation

Naive methods for Joint Representation

Strategy 1 - Early Fusion

[Figure: Feature A and Feature B are merged into one vector, then fed to a single Classifier.]

Strategy 2 - Late Fusion

[Figure: Classifier A runs on Feature A and Classifier B on Feature B; their outputs are ensembled into the final output.]
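As a concrete illustration, here is a minimal PyTorch sketch of both strategies; the feature sizes, class count, and variable names are illustrative assumptions, not from the slides.

import torch
import torch.nn as nn

feat_a = torch.randn(8, 128)  # features from the Modal A branch (e.g., RGB)
feat_b = torch.randn(8, 64)   # features from the Modal B branch (e.g., depth)

# Strategy 1 - Early Fusion: merge the features, then use a single classifier.
classifier = nn.Linear(128 + 64, 10)
early_logits = classifier(torch.cat([feat_a, feat_b], dim=-1))

# Strategy 2 - Late Fusion: one classifier per modality, then ensemble.
classifier_a = nn.Linear(128, 10)
classifier_b = nn.Linear(64, 10)
late_logits = (classifier_a(feat_a) + classifier_b(feat_b)) / 2  # simple averaging ensemble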

TFN (EMNLP '17)

Tensor Fusion Network

[Figure: Feature A and Feature B pass through a TFN block before the Classifier.]

TFN takes an outer product of the modality features, making the fused feature richer: all unimodal and bimodal interaction terms appear explicitly.
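A minimal sketch of the TFN fusion step, assuming two modalities and illustrative dimensions (the paper uses three modalities for sentiment analysis):

import torch

feat_a, feat_b = torch.randn(8, 128), torch.randn(8, 64)
ones = torch.ones(8, 1)

# Append a constant 1 to each feature so the outer product keeps
# unimodal terms as well as bimodal interaction terms.
za = torch.cat([feat_a, ones], dim=-1)       # (8, 129)
zb = torch.cat([feat_b, ones], dim=-1)       # (8, 65)

# Batched outer product, then flatten into the classifier input.
fused = torch.einsum('bi,bj->bij', za, zb)   # (8, 129, 65)
fused = fused.flatten(start_dim=1)           # (8, 129 * 65)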

LMF (ACL '18)

Low-rank Multimodal Fusion

LMF = low-rank approximation + TFN: it factorizes TFN's huge fusion tensor into small per-modality factors.
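A minimal sketch of the low-rank idea, assuming two modalities; the rank, the dimensions, and the omission of LMF's remaining details are simplifying assumptions:

import torch

rank, d_a, d_b, d_out = 4, 129, 65, 10        # features with the appended 1s

# Per-modality low-rank factors replacing TFN's huge fusion tensor.
proj_a = torch.randn(rank, d_a, d_out) * 0.01
proj_b = torch.randn(rank, d_b, d_out) * 0.01

za, zb = torch.randn(8, d_a), torch.randn(8, d_b)

# Project rank-by-rank, multiply elementwise across modalities, and sum
# over the rank axis: a rank-limited equivalent of fusing the full
# outer product with a dense weight tensor.
pa = torch.einsum('bi,rio->bro', za, proj_a)  # (8, rank, d_out)
pb = torch.einsum('bj,rjo->bro', zb, proj_b)  # (8, rank, d_out)
fused = (pa * pb).sum(dim=1)                  # (8, d_out)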

Multimodal Deep Learning (ICML '11)

One of its training settings is roughly cross-modal distillation; another is exactly multimodal representation learning.

Coordinated Representation

Coordinated Representation

[Figure: Feature A and Feature B keep separate pathways to the classifier, with a constraint / similarity term tying the two features together.]

  • Pull Feature A and Feature B closer together.
    • Why does it work?
      You can think of Feature A and Feature B as being distilled from each other.
    • Which constraints?
      • Cosine similarity, L2 distance, correlation, etc. (a minimal sketch follows).
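A minimal sketch of such a constraint in PyTorch, assuming paired 128-dim features and an arbitrary weight lambda:

import torch
import torch.nn.functional as F

feat_a = torch.randn(8, 128, requires_grad=True)  # paired output of encoder A
feat_b = torch.randn(8, 128, requires_grad=True)  # paired output of encoder B

# Two of the constraint choices listed above; either one is simply
# added to the usual task loss with a small weight.
cos_constraint = 1 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()
l2_constraint = F.mse_loss(feat_a, feat_b)

lam = 0.1                    # arbitrary weighting, an assumption
loss = lam * cos_constraint  # or: lam * l2_constraint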

Adaptive Fusion

A GAN-based method for merging the two modalities.

Settings about
Cross-Modal Distillation

Settings - Cross-Modal Distillation

[Figure: the cross-modal distillation setting again; paired Modal A (RGB image) → Model A and Modal B (depth image) → Model B for semantic segmentation, with knowledge distillation between the models and only one of them kept in inference mode.]

  • Key: how do we transfer knowledge at the intermediate (feature) level?

Recap: FitNet

  • FitNet is a two-stage algorithm:
    • Stage 1: pretrain the upper part of the student net.
    • Stage 2: act like the original KD (baseline KD).
  • Point: FitNet distills knowledge from features.

[Figure: the Dataset feeds both the Teacher Net and the Student Net; a Regressor transforms S's feat, which is matched to T's feat with an L2 loss.]
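A minimal sketch of the stage-1 hint loss, assuming illustrative feature sizes; the regressor plays the role of "S's feat transform" in the figure:

import torch
import torch.nn as nn
import torch.nn.functional as F

t_feat = torch.randn(8, 256)                       # teacher's hint feature
s_feat = torch.randn(8, 128, requires_grad=True)   # student's guided feature

regressor = nn.Linear(128, 256)  # maps the student feature to the teacher's size
hint_loss = F.mse_loss(regressor(s_feat), t_feat.detach())  # L2 loss; teacher frozen
hint_loss.backward()             # gradients flow into the student and regressor only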

FitNet + MM

Cross modal distillation for supervision transfer (CVPR '16)

[Figure: the same cross-modal setting, with Model A (ImageNet-pretrained, RGB) as the teacher and Model B (depth) as the student; knowledge distillation transfers supervision across the paired modalities, and only the distilled model is needed in inference mode.]

Modality hallucination (CVPR '16)

[Figure: Modal A (RGB image) → Model A and Modal B (depth image) → Model B; a hallucination network Model B' also takes the RGB input but is trained with FitNet-style distillation to mimic Model B's depth features; all streams feed an object-detection head.]

The training objective combines five loss terms: L_r, L_h, L_d, L_{rd}, and L_{rh}, where logits_{rh} = softmax(output_r + output_h).
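In code, that last term is just an ensemble of the two streams' raw outputs before the softmax (shapes are illustrative assumptions):

import torch

output_r = torch.randn(8, 21)   # RGB-stream class scores
output_h = torch.randn(8, 21)   # hallucination-stream class scores

# Named "logits" to follow the slide's notation, although the softmax
# output is strictly a probability vector.
logits_rh = torch.softmax(output_r + output_h, dim=-1)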

Cross-Modal Distillation +
Multi-Modal?

Multimodal Knowledge Expansion (CVPR '21)

[Figure: Modal A (RGB image) → Model S and Modal B (depth image) → Model T (pretrained); FitNet-style distillation from teacher to student, giving the student loss L_s.]

One of the contributions: a theoretical proof that Model S ends up better than Model T.
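A minimal sketch of this setup, assuming a pretrained unimodal teacher whose soft pseudo-labels supervise a multimodal student; the modality split, models, and shapes are illustrative assumptions, not the paper's exact pipeline:

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(64, 10)        # Model T: one modality, pretrained and frozen
student = nn.Linear(128 + 64, 10)  # Model S: sees both modalities

feat_a, feat_b = torch.randn(8, 128), torch.randn(8, 64)

with torch.no_grad():
    soft_labels = torch.softmax(teacher(feat_b), dim=-1)  # teacher pseudo-labels

# L_s: cross-entropy between the student's prediction and the soft labels.
log_probs = F.log_softmax(student(torch.cat([feat_a, feat_b], dim=-1)), dim=-1)
loss_s = -(soft_labels * log_probs).sum(dim=-1).mean()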

Conclusion

Conclusion

  • Joint Representation
    • Naive methods like early fusion are still widely used nowadays (e.g., concatenation, outer product).
  • Coordinated Representation
    • One branch is just like soft parameter sharing in MTL (multi-task learning); the other is like late fusion.
  • Cross-Modal Distillation
    • People try to bring new KD techniques into the cross-modal setting, e.g., BSKD, AT.
  • Theoretical Proof?

Comments

  • Cross-Modal & Domain Adaptation
    • In fact, different modalities are different domains, except that we have paired data.
    • This implies DA techniques can be applied in cross-modal scenarios.
  • Cross-Modal & Knowledge Distillation
    • Just like the hallucination network, you can adopt various KD techniques and leave some degrees of freedom (e.g., IE-KD).
  • Multi-Modal & Multi-task Learning
    • The network architectures are similar, but used in the reverse way.
    • Joint repr. <-> hard parameter sharing; coordinated repr. <-> soft parameter sharing.

Q & A ?

Reference

  • Theoretical proof:
    • What Makes Multi-modal Learning Better than Single (Provably) (NeurIPS '21)
  • Tutorial:
    • Tutorial on Multimodal Machine Learning (ACL '17)
  • Implementations for Multimodal Representation
    • Joint Representation
      • Tensor Fusion Network for Multimodal Sentiment Analysis (EMNLP '17)
      • Adaptive Fusion Techniques for Multimodal Data (EACL '21)
    • Coordinated Representation
      • Deep Canonical Correlation Analysis (ICML '13)
  • Implementations for Cross-Modal Distillation
    • FitNet + multi-modal
      • Leaves some degrees of freedom:
        Learning with Side Information through Modality Hallucination (CVPR '16)
      • Distills from another domain:
        Cross Modal Distillation for Supervision Transfer (CVPR '16)
    • DML + multi-modal (but I think these kinds of papers didn't do DML very well)
      • Towards Cross-Modality Medical Image Segmentation with Online Mutual Knowledge Distillation (AAAI '20)
