Group 22
Du Mingzhe (G2204045F)
Liu Fengming (G2104574F)
Li Yanzhou (G2201488K)
Sun Dianxiang (G2201037D)
Traditional semantic segmentation: train the model on the seen classes and test it on the same seen classes
Zero-shot semantic segmentation: train the model on the seen classes and test it on unseen classes
MaskFormer:
Generate class-agnostic segments
CLIP:
Perform zero-shot classification on the segments
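A minimal sketch of the two-stage pipeline, assuming a placeholder get_segments() that yields MaskFormer's class-agnostic masks together with the corresponding image crops; the CLIP calls follow the openai/CLIP package.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    class_names = ["grass", "frisbee", "tree"]  # seen + unseen class names
    prompts = [f"This is a photo of {c}" for c in class_names]
    with torch.no_grad():
        text_feats = model.encode_text(clip.tokenize(prompts).to(device))
        text_feats /= text_feats.norm(dim=-1, keepdim=True)

    image = Image.open("example.jpg")
    # get_segments() is a hypothetical stand-in for MaskFormer's mask proposals
    for mask, crop in get_segments(image):
        with torch.no_grad():
            img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feats.T).softmax(dim=-1)
        label = class_names[probs.argmax().item()]  # zero-shot label for this segment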
Dataset: COCO-Stuff
156 seen classes + 15 unseen classes
Metric
Mean IoU (mIoU) between the prediction and the ground truth
Reported over seen classes, unseen classes, and their harmonic mean
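A minimal sketch of the evaluation, assuming integer label maps and known seen/unseen class-id sets; the harmonic mean combines the two mIoU scores.

    import numpy as np

    def mean_iou(pred, gt, class_ids):
        # pred, gt: integer label maps of identical shape
        ious = []
        for c in class_ids:
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:
                ious.append(inter / union)
        return float(np.mean(ious))

    # toy label maps; in practice these come from the model and COCO-Stuff annotations
    pred = np.array([[0, 1], [1, 2]])
    gt = np.array([[0, 1], [2, 2]])
    seen_ids, unseen_ids = [0, 1], [2]

    miou_seen = mean_iou(pred, gt, seen_ids)
    miou_unseen = mean_iou(pred, gt, unseen_ids)
    harmonic = 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)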
Modification 1
Add a text projection layer to further transform the text features produced by the CLIP text encoder
Implemented as an MLP (512 → 384 → 512)
Alleviates overfitting to the seen classes
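A minimal sketch of the projection head in PyTorch; the bottleneck sizes match the bullet above, while the ReLU nonlinearity is an assumption.

    import torch.nn as nn

    class TextProjection(nn.Module):
        # Bottleneck MLP applied to CLIP text features (512 -> 384 -> 512)
        def __init__(self, dim=512, hidden=384):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(inplace=True),   # activation choice is an assumption
                nn.Linear(hidden, dim),
            )

        def forward(self, text_feat):
            return self.mlp(text_feat)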
Prompt Engineering
e.g., "This is a photo of <class name>"
Performing referring image segmentation on the prompt = performing zero-shot semantic segmentation on the class name
Modification 2
In the vanilla setting, a large object segment carries the same weight in the loss as a small one.
The total loss is therefore less sensitive to mismatches on smaller objects.
To compensate, we multiply each segment's loss by a coefficient β that is inversely proportional to the segment size.
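A minimal sketch of the reweighting, assuming per-segment losses and pixel counts are available; the renormalization of β is an assumption.

    import torch

    def size_weighted_loss(per_segment_loss, segment_sizes):
        # per_segment_loss: (N,) loss of each matched segment
        # segment_sizes: (N,) pixel count of each ground-truth segment
        beta = segment_sizes.float().reciprocal()  # beta proportional to 1 / size
        beta = beta * (len(beta) / beta.sum())     # renormalize so weights average to 1 (assumption)
        return (beta * per_segment_loss).mean()

    # toy example: the small segment's mismatch now dominates the loss
    loss = size_weighted_loss(torch.tensor([0.2, 0.9]), torch.tensor([5000, 50]))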