Can another modality help without exact pairs?
Pairing Bottleneck
Most multimodal methods assume matched pairs
Real clinical data are often unmatched
Question: can "extra but unpaired" data still help?
Modality-specific encoders + one shared backbone (see sketch below)
No pair mining
Keep the improved target-modality representation at inference
Unpaired Multimodal Learner (UML)
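A minimal PyTorch sketch of this setup. The class, feature sizes, and masked/reconstruction-style self-supervised loss below are illustrative assumptions, not details taken from the UML method; the point is only to show modality-specific encoders feeding one shared backbone, trained on unpaired batches.

```python
import torch
import torch.nn as nn

class UnpairedMultimodalSketch(nn.Module):
    def __init__(self, dim=256, feat_dim=1024):
        super().__init__()
        # One lightweight encoder per modality projects inputs into a shared space
        self.encoders = nn.ModuleDict({
            "fundus": nn.Linear(feat_dim, dim),
            "oct": nn.Linear(feat_dim, dim),
        })
        # A single backbone shared by every modality: the weight sharing that
        # pushes representations toward concept-level features
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Per-modality heads for a stand-in self-supervised reconstruction target
        self.heads = nn.ModuleDict({
            "fundus": nn.Linear(dim, feat_dim),
            "oct": nn.Linear(dim, feat_dim),
        })

    def forward(self, x, modality):
        tokens = self.encoders[modality](x)   # modality-specific projection
        shared = self.backbone(tokens)        # shared weights; no pairing required
        return shared, self.heads[modality](shared)

model = UnpairedMultimodalSketch()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Training: alternate *unpaired* batches from each modality; no pair mining.
# The fundus batch and the OCT batch need not come from the same patients.
for modality, batch in [("fundus", torch.randn(8, 16, 1024)),
                        ("oct", torch.randn(8, 16, 1024))]:
    _, recon = model(batch, modality)
    loss = nn.functional.mse_loss(recon, batch)  # placeholder self-supervised loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Inference: keep only the (now improved) target-modality path.
with torch.no_grad():
    oct_repr, _ = model(torch.randn(1, 16, 1024), "oct")
```

The only thing tying the modalities together is the shared backbone; at test time the other modality's encoder can be discarded entirely.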
Different modalities = different views of the same world
Shared weights push toward concept-level features
Related modalities transfer useful structure
When modalities describe similar structure, one can help the others
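In symbols (our own notation, a sketch of the idea rather than the exact UML objective): each modality m keeps its own encoder g_{phi_m}, all modalities share one backbone f_theta, and each term of the loss only needs unpaired samples from its own dataset.

\[
z_m = f_\theta\big(g_{\phi_m}(x_m)\big), \qquad
\mathcal{L}(\theta, \{\phi_m\}) = \sum_m \mathbb{E}_{x_m \sim \mathcal{D}_m}\big[\ell_m(z_m, x_m)\big]
\]

Because no term mixes samples from two modalities, cross-modal pairs never enter the objective; the shared parameters theta are the only coupling.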
UML - Could this work?
Benchmarks
Self-supervised setting: UML beats unimodal across reported tasks
Extends to audio + vision + text
Image + text help audio
Audio + text help image
Best results with all three
Clinical analogy: waveform + EHR + notes, or retinal imaging + outcomes + text
Beyond Image + Text: Three Modalities
Scale: ~900K fundus + ~700K OCT
CFP and OCT carry complementary signal
RETFound limitation: CFP-OCT fusion not investigated
Fundus + OCT After RETFound
Can we learn from fundus + OCT jointly, even with imperfect pairs?
Summary
Perfect pairing is not required
Shared weights are a strong, simple baseline
Key caveat: modalities must be semantically related
Use retinal data to explore unpaired fundus-OCT learning
Can we learn from fundus + OCT jointly, even with imperfect pairs?