Intro 2 Multi-Modal

Arvin Liu @ MiRA

Agenda

  • Intro to various MM problems
  • Multimodal Representation
    • Joint Representation
    • Coordinated Representation
  • Cross-Modal Distillation
  • Conclusion & Comments

Intro to
Various Settings

Settings - Multimodal Repr

[Diagram: paired inputs Modal A (RGB Image) and Modal B (Depth Image) are encoded by Model A and Model B, and the two are merged into one output (Sem. Seg.).]

Settings - Cross-Modal Distillation

[Diagram: Model A on Modal A (RGB Image) teaches Model B on Modal B (Depth Image) via knowledge distillation over paired data; in inference mode only Model B runs, producing the output (Sem. Seg.).]

Settings - Multi-Modal Hashing

Map the different modalities into one shared code, just like the inverse of disentangling.

Settings about
Multimodal Repr.

Settings - Multimodal Repr

[Diagram: paired inputs Modal A (RGB Image) and Modal B (Depth Image) are encoded by Model A and Model B, and the two are merged into one output (Sem. Seg.).]

  • Key: How do we merge the different modalities?

Types of Multimodal Representation

Multimodal Representation

[Diagram: Modal A (RGB Image) and Modal B (Depth Image) are encoded by Model A and Model B into Feature A and Feature B. (A) Joint Representation: the features are merged and fed to a single classifier. (B) Coordinated Representation: each feature feeds its own classifier, with a constraint / similarity loss tying Feature A and Feature B together.]

Joint Representation

Naive method for Joint Representation

Strategy 1 - Early Fusion: merge Feature A and Feature B (e.g. concatenate them), then feed the merged feature to a single classifier.

Strategy 2 - Late Fusion: run Classifier A on Feature A and Classifier B on Feature B, then ensemble the two outputs into one.
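A minimal NumPy sketch of the two strategies (the feature sizes, the single linear layers standing in for classifiers, and the averaging ensemble are illustrative assumptions, not the deck's exact setup):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(4, 8))   # Feature A (e.g. from the RGB stream)
feat_b = rng.normal(size=(4, 8))   # Feature B (e.g. from the depth stream)
n_classes = 3

# Strategy 1 - Early fusion: merge the features first (here: concatenate),
# then apply one classifier on the merged feature.
W_joint = rng.normal(size=(16, n_classes))
early_probs = softmax(np.concatenate([feat_a, feat_b], axis=1) @ W_joint)

# Strategy 2 - Late fusion: one classifier per modality,
# then ensemble the two predictions (here: average the probabilities).
W_a = rng.normal(size=(8, n_classes))
W_b = rng.normal(size=(8, n_classes))
late_probs = 0.5 * (softmax(feat_a @ W_a) + softmax(feat_b @ W_b))

print(early_probs.shape, late_probs.shape)  # (4, 3) (4, 3)
```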

TFN (EMNLP '17)

[Diagram: Feature A and Feature B are fused by TFN before the classifier.]

Tensor Fusion Network: instead of concatenating, fuse the features with an outer product (after appending a constant 1 to each), so the fused feature also carries all cross-modal interaction terms. This makes the feature more expressive, at the cost of a much larger dimension.
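A tiny sketch of the outer-product fusion at TFN's core (two modalities only; the real network adds per-modality encoders and a classifier on top):

```python
import numpy as np

def tfn_fuse(feat_a, feat_b):
    # Append a constant 1 to each modality feature, then take the outer
    # product: the result keeps the unimodal terms (via the 1s) plus
    # every bimodal interaction term feat_a[i] * feat_b[j].
    za = np.append(feat_a, 1.0)
    zb = np.append(feat_b, 1.0)
    return np.outer(za, zb).ravel()

fused = tfn_fuse(np.arange(4.0), np.arange(8.0))
print(fused.shape)  # (45,) = (4 + 1) * (8 + 1)
```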

LMF (ACL '18)

Low-rank Multimodal Fusion

LMF = low-rank approximation + TFN: the TFN weight tensor is factorized into modality-specific low-rank factors, so the full outer-product tensor never has to be materialized.

Multimodal Deep Learning (ICML '11)

Its cross-modality learning setting is roughly cross-modal distillation, and its shared-representation learning setting is exactly multi-modal representation.

Coordinated Representation

Coordinated Representation

[Diagram: Feature A and Feature B each feed their own classifier, with a constraint / similarity loss tying the two features together.]

  • Make Feature A and Feature B closer.
    • Why does it work?
      You can think of Feature A and Feature B as being distilled from each other.
    • Which constraints?
      • Cosine similarity, L2 distance, correlation, etc.
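The first two constraints can be sketched directly (plain NumPy; the example vectors are arbitrary):

```python
import numpy as np

def l2_constraint(fa, fb):
    # L2 constraint: penalize the squared Euclidean distance
    # between a paired Feature A and Feature B.
    return np.sum((fa - fb) ** 2)

def cosine_constraint(fa, fb):
    # Cosine constraint: 1 - cosine similarity; zero when the directions
    # agree, so it only cares about the angle, not the magnitude.
    cos = fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb))
    return 1.0 - cos

fa = np.array([1.0, 2.0, 3.0])
print(l2_constraint(fa, fa))          # 0.0 for identical features
print(cosine_constraint(fa, 2 * fa))  # ~0.0: parallel features, any scale
```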

Adaptive Fusion

A GAN-based method that learns how to merge the two modalities.

Settings about
Cross-Modal Distillation

Settings - Cross-Modal Distillation

[Diagram: Model A on Modal A (RGB Image) teaches Model B on Modal B (Depth Image) via knowledge distillation over paired data; in inference mode only Model B runs, producing the output (Sem. Seg.).]

  • Key: How do we transfer knowledge at the intermediate layers?

Recap: FitNet

  • FitNet is a two-stage algorithm:
    • Stage 1: pretrain the upper part of the student net against the teacher's hint feature.
    • Stage 2: act like original KD (baseline KD).
  • Point: FitNet distills knowledge from intermediate features.

[Diagram: the dataset feeds both the Teacher Net (upper part) and the Student Net (upper part); the student's feature is passed through a regressor, and the transformed student feature is matched to the teacher's feature with an L2 loss.]
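Stage 1's hint loss can be sketched as follows (random features and a fixed linear regressor stand in for the real networks; in FitNet the regressor is learned jointly with the student):

```python
import numpy as np

rng = np.random.default_rng(0)
t_feat = rng.normal(size=(2, 64))   # teacher's intermediate ("hint") feature
s_feat = rng.normal(size=(2, 32))   # student's narrower guided-layer feature

# Regressor: maps the student feature into the teacher's feature space
# so the two can be compared even when their widths differ.
W_reg = rng.normal(size=(32, 64))
s_transformed = s_feat @ W_reg      # "S's feat transform" in the diagram

# Stage-1 loss: L2 between teacher feature and regressed student feature
hint_loss = 0.5 * np.sum((t_feat - s_transformed) ** 2)
print(s_transformed.shape, hint_loss > 0)  # (2, 64) True
```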

FitNet + MM

Cross modal distillation for supervision transfer (CVPR '16)

[Diagram: Model A (ImageNet-pretrained) on Modal A (RGB Image) distills knowledge into Model B on Modal B (Depth Image) over paired data; in inference mode only Model B runs, producing the output (Sem. Seg.).]

Modality hallucination (CVPR '16)


[Diagram: Model A takes Modal A (RGB Image); Model B takes Modal B (Depth Image); Model B', the hallucination network, also takes the RGB image but is trained to mimic Model B's depth features via FitNet distillation. All streams feed object detection heads.]

Losses: L_{r}, L_{h}, L_{d}, L_{rd}, L_{rh}, where logits_{rh} = softmax(output_r + output_h).
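The joint RGB+hallucination prediction just sums the two streams' raw scores before the softmax; a minimal sketch (the score values are made up):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(z - z.max())
    return e / e.sum()

output_r = np.array([2.0, 0.5, -1.0])   # RGB-stream class scores
output_h = np.array([1.5, 1.0, -0.5])   # hallucination-stream class scores

# logits_{rh} = softmax(output_r + output_h): adding scores before the
# softmax is equivalent to multiplying the streams' unnormalized beliefs.
logits_rh = softmax(output_r + output_h)
print(logits_rh.argmax())  # 0 (both streams favor the first class)
```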

Cross-Modal Distillation +
Multi-Modal?

Multimodal Knowledge Expansion (CVPR '21)

[Diagram: a pretrained teacher (Model T) and a student (Model S) learn from Modal A (RGB Image) and Modal B (Depth Image); the student is trained with FitNet-style distillation from the teacher plus its own loss L_{s}.]

One of the contributions: a theoretical proof that Model S (the student) ends up better than Model T (the pretrained teacher).

Conclusion

Conclusion

  • Joint Representation

    • Naive methods like early fusion are widely used nowadays (e.g. concatenation, outer product).
  • Coordinated Representation

    • One branch is just like soft parameter sharing in MTL (Multi-Task Learning). (The other is like late fusion.)
  • Cross-Modal Distillation

    • People try to bring new KD techniques into the cross-modal setting, like BSKD, AT, etc.
  • Theoretical Proof?

Comments

  • Cross-Modal & Domain Adaptation

    • In fact, different modalities mean different domains, but we have paired data.
    • This implies DA techniques can be applied in cross-modal scenario.
  • Cross-Modal & Knowledge Distillation

    • Just like the hallucination network, you can adopt various KD techniques and leave some degrees of freedom (like IE-KD).
  • Multi-Modal & Multi-task Learning

    • The network architectures are similar, but mirrored:
    • joint repr. <-> hard parameter sharing, and coordinated repr. <-> soft parameter sharing.

Q & A ?

Reference

  • Theoretical proof:
    • What Makes Multi-modal Learning Better than Single (Provably) (NeurIPS '21)

  • Tutorial:

    • Tutorial on Multimodal Machine Learning (ACL '17)

  • Implementation for Multimodal Representation
    • Joint Representation
      • Tensor Fusion Network for Multimodal Sentiment Analysis (EMNLP '17)
      • Efficient Low-rank Multimodal Fusion with Modality-Specific Factors (ACL '18)
      • Adaptive Fusion Techniques for Multimodal Data (EACL '21)
    • Coordinated Representation
      • Deep Canonical Correlation Analysis (ICML '13)
  • Implementation for Cross-Modal Distillation
    • FitNet + Multi-modal
      • Leaves some degrees of freedom:
        Learning with side information through modality hallucination (CVPR '16)
      • Distills from another modality:
        Cross modal distillation for supervision transfer (CVPR '16)
    • DML + Multi-modal (though I think these papers don't do DML very well):
      • Towards Cross-Modality Medical Image Segmentation with Online Mutual Knowledge Distillation (AAAI '20)