VISAPP 2025
University of Campinas
Institute of Computing
¹ O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and A.C. Berg. ImageNet Large Scale Visual Recognition Challenge. In International Journal of Computer Vision, 115, pp.211-252. 2015.
¹ N. Burkart and M.F. Huber. A Survey on the Explainability of Supervised Machine Learning. In Journal of Artificial Intelligence Research, 70, pp.245-317. 2021.
² M. Tan and Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning (ICML), PMLR. 2019.
Figure 2: Models of various architectures, pre-trained over ImageNet. Source: Tan and Le².
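A minimal sketch of loading such ImageNet-pretrained models, assuming torchvision is available (the model choices are illustrative, not the exact checkpoints from the figure):

```python
# Sketch: loading ImageNet-pretrained backbones with torchvision.
import torch
from torchvision import models

effnet = models.efficientnet_b0(weights="DEFAULT").eval()  # EfficientNet-B0
resnet = models.resnet50(weights="DEFAULT").eval()         # ResNet-50

x = torch.randn(1, 3, 224, 224)     # dummy ImageNet-sized input
with torch.no_grad():
    logits = effnet(x)              # (1, 1000) ImageNet class scores
```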
¹ H. Xiao, D. Li, H. Xu, S. Fu, D. Yan, K. Song, and C. Peng. Semi-Supervised Semantic Segmentation with Cross Teacher Training. In Neurocomputing, 508, pp.36-46. 2022.
² H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. In European Conference on Computer Vision (ECCV), pp. 405-420. 2018.
³ L. Chan, M.S. Hosseini, and K.N. Plataniotis. A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains. In International Journal of Computer Vision, 129, pp.361-384. 2021.
Figure 8: Example of an annotated CT scan image. Source: https://radiopaedia.org/cases/liver-segments-annotated-ct-1
Figure 6: Example of road segmentation in SpaceNet dataset. Source: https://www.v7labs.com/open-datasets/spacenet
Figure 7: Example of (a) morphological and (b) functional segmentation of samples in the Atlas of Digital Pathology dataset. Source: L. Chan et al.³
Figure 4: Example of samples and ground-truth panoptic segmentation annotation from the MS COCO 2017 dataset. Source: https://cocodataset.org/#panoptic-2020.
Figure 5: Example of semantic segmentation produced by ICNet for a video sample in the Cityscapes dataset. Source: https://gitplanet.com/project/fast-semantic-segmentation.
Figure 3: Samples, proposals¹ and ground-truth segmentation annotation from the Pascal VOC 2012 dataset.
¹ J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440. 2015.
Figure 9: Fully Convolutional Network (FCN) architecture¹, mapping image samples to their respective semantic segmentation maps.
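A minimal inference sketch with torchvision's pre-trained FCN (an assumption for illustration, not the exact model from the figure):

```python
# Sketch: per-pixel class prediction with a pre-trained FCN (torchvision).
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights="DEFAULT").eval()

x = torch.randn(1, 3, 520, 520)          # dummy RGB batch
with torch.no_grad():
    logits = model(x)["out"]             # (1, 21, 520, 520) per-pixel class scores
seg_map = logits.argmax(dim=1)           # (1, 520, 520) semantic segmentation map
```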
This information needs to be known and available at training time.
Equation 1: The (naive) categorical cross-entropy loss function.
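For reference, a standard form of this loss over N samples and C classes (notation assumed here: y is the one-hot ground truth, ŷ the softmax prediction):

$$\mathcal{L}_{\text{CE}}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{y}_{ic}$$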
Figure 10: Samples in the ImageNet 2012 dataset¹. Source: cs.stanford.edu/people/karpathy/cnnembed.
¹ B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921-2929. 2016.
Equation 4: Feed-forward pass for a Convolutional Network containing GAP layers, and the formulation of CAM¹.
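In the notation of Zhou et al.¹, with $f_k(x, y)$ the activation of feature map $k$ at spatial location $(x, y)$, GAP followed by a linear layer yields the class score $S_c$, and the CAM $M_c$ reuses the same classifier weights spatially:

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y), \qquad M_c(x, y) = \sum_k w_k^c f_k(x, y), \qquad \text{so } S_c = \sum_{x,y} M_c(x, y)$$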
Figure 11: Examples of CAMs and approximate bounding boxes found for different birds in the CUB200 dataset. Source: Zhou et al.¹
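A small numerical sketch of that formulation (array shapes and names are assumptions for illustration):

```python
# Sketch: CAM from the last conv feature maps and the GAP -> FC classifier weights.
import numpy as np

f = np.random.rand(512, 14, 14)    # f[k, x, y]: last conv feature maps (dummy)
w = np.random.rand(1000, 512)      # w[c, k]: linear classifier weights (dummy)
c = 42                             # class of interest

cam = np.tensordot(w[c], f, axes=1)                       # M_c(x, y) = sum_k w_k^c f_k(x, y)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```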
¹ S. Liu, S. Zhi, E. Johns, and A.J. Davison. Bootstrapping Semantic Segmentation with Regional Contrast. In International Conference on Learning Representations (ICLR). 2022.
² H. Hu, F. Wei, H. Hu, Q. Ye, J. Cui, and L. Wang. Semi-Supervised Semantic Segmentation via Adaptive Equalization Learning. In Advances in Neural Information Processing Systems, 34, pp.22106-22118. 2021.
³ Y. Wang, et al. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.4248-4257. 2022.
Figure 12: Illustration of Regional Contrast (ReCo). Representations are extracted for pixel queries and trained in a contrastive learning fashion. Source: Liu et al.¹
Figure 13: Illustration of Adaptive Equalization Learning (AEL). Unlabeled data is fed to both teacher and student models, and the response of the former is used to regularize the latter's. Source: Hu et al.²
Figure 14: Diagram of U²PL, which employs ideas from both ReCo and AEL for more efficient Semi-Supervised Semantic Segmentation training. Source: Wang et al.³
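A common ingredient in these teacher-student schemes is an exponential-moving-average (EMA) teacher; a generic sketch follows (the momentum α and the exact update rule are assumptions, not taken from any one of the cited papers):

```python
# Sketch: EMA update of teacher weights from the student (generic, PyTorch).
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """teacher <- alpha * teacher + (1 - alpha) * student, parameter-wise."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```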
Our approach is inspired by previous research on Mutual Promotion and Self-Supervised Learning.
Key differences:
Diagram: a warm-up stage, after which the Student and Teacher are trained in alternating roles; pseudo labels are derived as $c_i^\star = \arg\max_c s^{c,t}_{ihw}$, i.e., the class with the highest teacher score at each pixel of each image (see the sketch below).
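A sketch of that pseudo-labeling rule in PyTorch (tensor shapes and the confidence threshold δ are assumptions; δ matches the threshold analyzed below):

```python
# Sketch: hard pseudo labels from teacher scores, c* = argmax_c s[i, c, h, w].
import torch

def pseudo_labels(teacher_logits, delta=0.5, ignore_index=255):
    """teacher_logits: (N, C, H, W) raw scores from the teacher model."""
    s = teacher_logits.softmax(dim=1)       # per-pixel class probabilities
    conf, c_star = s.max(dim=1)             # confidence and argmax class
    c_star[conf < delta] = ignore_index     # mask out low-confidence pixels
    return c_star                           # (N, H, W) pseudo segmentation map
```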
Ls2c alone results in overfitting
+ Lu greatly improves the results
+ Lm produces the best outcome (see the combined form below)
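A plausible reading of the combined objective (the weighting scheme and λ values are assumptions, not taken from the paper):

$$\mathcal{L} = \mathcal{L}_{s2c} + \lambda_u \mathcal{L}_u + \lambda_m \mathcal{L}_m$$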
Robust segmentation scores obtained by CSRM: higher and less variable scores for almost all choices of the threshold δ.
CSRM results in the highest (individual) relative improvement
CSRM can be further improved with refinement methods
Figure 15: Qualitative results of pseudo labels generated by CSRM and refined with dCRF and SEPL.
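A sketch of the dCRF refinement step with the pydensecrf package (assumed installed; the kernel parameters are common defaults, not the paper's exact settings, and SEPL is not shown):

```python
# Sketch: refining CSRM pseudo labels with a dense CRF.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_dcrf(image, probs, n_iters=10):
    """image: (H, W, 3) uint8 RGB; probs: (C, H, W) softmax class scores."""
    C, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, C)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)        # spatial smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,       # appearance (color) kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = np.array(d.inference(n_iters)).reshape(C, H, W)
    return q.argmax(axis=0)                       # refined (H, W) label map
```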
Figure 16: Qualitative results of prediction proposals made by segmentation models trained over pseudo labels generated by CSRM.