CE6190 Project
Group 22
Du Mingzhe (G2204045F)
Liu Fengming (G2104574F)
Li Yanzhou (G2201488K)
Sun Dianxiang (G2201037D)
Zero-shot semantic segmentation
- Traditional semantic segmentation: train the model on the seen classes and test it on the seen classes
- Zero-shot semantic segmentation: train the model on the seen classes and test it on the unseen classes

Attempt 1
Inserting a Text Projection Layer into ZegFormer
What is ZegFormer?

- MaskFormer: generates class-agnostic segments
- CLIP: performs zero-shot classification on the segments
- Dataset: COCO-Stuff
  - 156 seen classes + 15 unseen classes
- Metric
  - Mean IoU (mIoU) between the prediction and the ground truth
  - Reported as Seen, Unseen, and Harmonic (the harmonic mean of the seen and unseen scores; sketched below)
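
The harmonic score penalizes models that trade unseen accuracy for seen accuracy, which is why it is the headline number in zero-shot segmentation. A minimal sketch:

```python
def harmonic_miou(seen: float, unseen: float) -> float:
    """Harmonic mean of seen and unseen mIoU (hIoU)."""
    if seen + unseen == 0:
        return 0.0
    return 2 * seen * unseen / (seen + unseen)

# A model that overfits to the seen classes scores lower:
print(harmonic_miou(40.0, 10.0))  # 16.0
print(harmonic_miou(30.0, 20.0))  # 24.0
```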
- Experiment
  - Add a text projection layer to further transform the text features produced by the CLIP text encoder
  - Implemented as an MLP (512 → 384 → 512); see the sketch below
  - Alleviates the overfitting to the seen classes
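
A minimal sketch of the added layer, assuming PyTorch; the 512 → 384 → 512 widths follow the bullet above, while the ReLU activation is our own choice:

```python
import torch
import torch.nn as nn

class TextProjection(nn.Module):
    """Bottleneck MLP applied on top of the frozen CLIP text encoder."""

    def __init__(self, dim: int = 512, hidden: int = 384):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),  # activation choice is an assumption
            nn.Linear(hidden, dim),
        )

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (num_classes, 512) CLIP text embeddings
        return self.mlp(text_feat)
```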

Modification
Result

- mIoU:
  - Decreases for the seen classes
  - Increases for the unseen classes
- Overfitting to the seen classes is slightly alleviated
Attempt 2
Adapting CRIS to Zero-shot Semantic Segmentation
What is CRIS?
- CRIS: CLIP-Driven Referring Image Segmentation
- Referring image segmentation
  - Segments out the object referred to by a sentence
  - Can be converted to zero-shot semantic segmentation
- Prompt engineering
  - e.g., "This is a photo of <class name>"
  - Performing referring image segmentation on the prompt = performing zero-shot semantic segmentation on the class name (see the sketch below)
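
A sketch of the conversion, with illustrative class names; `cris_segment` stands in for a single referring-segmentation call and is hypothetical:

```python
TEMPLATE = "This is a photo of {}"

class_names = ["grass", "frisbee", "person"]  # illustrative class names
prompts = [TEMPLATE.format(name) for name in class_names]

# One referring-segmentation query per class; the per-class masks can
# then be stacked and argmax'd into a semantic label map.
# masks = [cris_segment(image, p) for p in prompts]  # hypothetical call
```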


What is CRIS?
- Text-to-pixel matching
- The loss function pulls semantically similar pixels together and pushes dissimilar pixels apart (a simplified sketch follows)
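
A simplified sketch of a text-to-pixel loss in this spirit: per-pixel sigmoid cross-entropy on text–pixel similarities, which raises the similarity of pixels in the referred region and lowers it elsewhere (tensor shapes are our assumptions):

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(text_feat: torch.Tensor,   # (D,) sentence embedding
                       pixel_feat: torch.Tensor,  # (D, H, W) pixel embeddings
                       mask: torch.Tensor) -> torch.Tensor:  # (H, W), 1 = set P, 0 = set N
    # Similarity between the text vector and every pixel vector.
    logits = torch.einsum("d,dhw->hw", text_feat, pixel_feat)
    # Pull pixels in P toward the text embedding, push pixels in N away.
    return F.binary_cross_entropy_with_logits(logits, mask.float())
```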



Modification 1
Modification 2
Modifications
- Improve the segmentation performance on small objects
- Modification 1: image-level
- Modification 2: pixel-level

Modification 1: Image-level
- In the vanilla setting, a large object segment carries the same weight as a small object segment.
- The total loss is therefore less sensitive to mismatches on smaller objects.
- Therefore, we multiply the loss by a coefficient beta that is inversely proportional to the segment size (sketched below).
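
A sketch of the image-level weight under our reading of the slide; any smoothing or capping of beta is an implementation choice we have not specified:

```python
import torch

def image_level_weight(mask: torch.Tensor) -> torch.Tensor:
    """beta inversely proportional to segment size; mask: (H, W) binary."""
    area = mask.float().sum().clamp(min=1.0)  # guard against empty masks
    beta = mask.numel() / area                # small segment -> large beta
    return beta

# Weighted loss:
# image_level_weight(mask) * text_to_pixel_loss(text_feat, pixel_feat, mask)
```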


Modification 2: Pixel-level
- A smaller object occupies a smaller area.
- The pixels in the positive set P are far fewer than those in the negative set N.
- Missing a match in P therefore barely moves the total loss.
- Therefore, we multiply the loss on the positive side by a coefficient alpha > 1, making positive matches more prominent (sketched below).
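
A sketch of the positive-side weight using `pos_weight`, which scales exactly the positive term of the sigmoid cross-entropy; alpha = 2.0 is an illustrative value, not the tuned one:

```python
import torch
import torch.nn.functional as F

def pixel_level_loss(logits: torch.Tensor,  # (H, W) text-pixel similarities
                     mask: torch.Tensor,    # (H, W), 1 = set P, 0 = set N
                     alpha: float = 2.0) -> torch.Tensor:
    # pos_weight multiplies only the loss term of the positive pixels (set P).
    return F.binary_cross_entropy_with_logits(
        logits, mask.float(), pos_weight=torch.tensor(alpha))
```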



- Dataset: COCO-Stuff
  - 156 seen classes + 15 unseen classes
- Metric
  - Mean IoU between the prediction and the ground truth
  - Seen, Unseen, Harmonic
- Prompt engineering
  - e.g., "This is a photo of <class name>"
  - Zero-shot semantic segmentation guided by the class name
Experiment

Result
Result Analysis
Examples

Reflection
- Construct consistent evaluation metrics.
- Modify the MaskFormer part of ZegFormer to improve its class-agnostic segmentation performance.
- Polish the CRIS model to segment all objects in an image simultaneously.
- Produce a comprehensive survey covering ZegFormer, CRIS-like, and traditional models.
Conclusion
- We modified two models, ZegFormer and CRIS, for the zero-shot semantic segmentation task.
- Inserted a text projection module into ZegFormer so that it overfits less to the seen classes.
- Constructed two novel loss functions in the CRIS model.
- Achieved impressive performance.
Thank you
Milestones
- ZS3Net
- SPNet
