VISAPP 2025
University of Campinas
Institute of Computing
¹ O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and A.C. Berg. ImageNet Large Scale Visual Recognition Challenge. In International Journal of Computer Vision, 115, pp.211-252. 2015.
¹ N. Burkart and M.F. Huber. A Survey on the Explainability of Supervised Machine Learning. In Journal of Artificial Intelligence Research, 70, pp.245-317. 2021.
² M. Tan and Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning (ICML), PMLR. 2019.
Figure 2: Models of various architectures, pre-trained over ImageNet. Source: Tan and Le².
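A minimal sketch of loading such ImageNet-pretrained models, assuming torchvision is available (the model choices are illustrative, not the exact checkpoints from the figure):

```python
# Sketch: loading ImageNet-pretrained backbones with torchvision.
import torch
from torchvision import models

effnet = models.efficientnet_b0(weights="DEFAULT").eval()  # EfficientNet-B0
resnet = models.resnet50(weights="DEFAULT").eval()         # ResNet-50

x = torch.randn(1, 3, 224, 224)     # dummy ImageNet-sized input
with torch.no_grad():
    logits = effnet(x)              # (1, 1000) ImageNet class scores
```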
¹ H. Xiao, D. Li, H. Xu, S. Fu, D. Yan, K. Song, and C. Peng. Semi-Supervised Semantic Segmentation with Cross Teacher Training. In Neurocomputing, 508, pp.36-46. 2022.
² H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. In European Conference on Computer Vision (ECCV), pp. 405-420. 2018.
³ L. Chan, M.S. Hosseini, and K.N. Plataniotis. A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains. In International Journal of Computer Vision, 129, pp.361-384. 2021.
Figure 8: Example of an annotated CT scan image. Source: https://radiopaedia.org/cases/liver-segments-annotated-ct-1
Figure 6: Example of road segmentation in SpaceNet dataset. Source: https://www.v7labs.com/open-datasets/spacenet
Figure 7: Example of (a) morphological and (b) functional segmentation of samples in the Atlas of Digital Pathology dataset. Source: L. Chan et al.³
Figure 4: Example of samples and ground-truth panoptic segmentation annotation from the MS COCO 2017 dataset. Source: https://cocodataset.org/#panoptic-2020.
Figure 5: Example of semantic segmentation produced by ICNet for a video sample in the Cityscapes dataset. Source: https://gitplanet.com/project/fast-semantic-segmentation.
Figure 3: Samples, proposals¹ and ground-truth segmentation annotation from the Pascal VOC 2012 dataset.
¹ J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440. 2015.
Figure 9: Fully Convolutional Network (FCN) architecture¹, mapping image samples to their respective semantic segmentation maps.
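A minimal inference sketch with torchvision's pre-trained FCN (an assumption for illustration, not the exact model from the figure):

```python
# Sketch: per-pixel class prediction with a pre-trained FCN (torchvision).
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights="DEFAULT").eval()

x = torch.randn(1, 3, 520, 520)          # dummy RGB batch
with torch.no_grad():
    logits = model(x)["out"]             # (1, 21, 520, 520) per-pixel class scores
seg_map = logits.argmax(dim=1)           # (1, 520, 520) semantic segmentation map
```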
This information needs to be known and available at training time.
Equation 1: The (naive) categorical cross-entropy loss function.
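For reference, a standard form of this loss over N samples and C classes (notation assumed here: y is the one-hot ground truth, ŷ the softmax prediction):

$$\mathcal{L}_{\text{CE}}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{y}_{ic}$$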
Figure 10: Samples in the ImageNet 2012 dataset¹. Source: cs.stanford.edu/people/karpathy/cnnembed.
¹ B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921-2929. 2016.
Equation 4: Feed-forward pass for a Convolutional Network containing GAP layers, and the formulation of CAM¹.
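In the notation of Zhou et al.¹, with $f_k(x, y)$ the activation of feature map $k$ at spatial location $(x, y)$, GAP followed by a linear layer yields the class score $S_c$, and the CAM $M_c$ reuses the same classifier weights spatially:

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y), \qquad M_c(x, y) = \sum_k w_k^c f_k(x, y), \qquad \text{so } S_c = \sum_{x,y} M_c(x, y)$$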
Figure 11: Examples of CAMs and approximate bounding boxes found for different birds in the CUB200 dataset. Source: Zhou et al.¹
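A small numerical sketch of that formulation (array shapes and names are assumptions for illustration):

```python
# Sketch: CAM from the last conv feature maps and the GAP -> FC classifier weights.
import numpy as np

f = np.random.rand(512, 14, 14)    # f[k, x, y]: last conv feature maps (dummy)
w = np.random.rand(1000, 512)      # w[c, k]: linear classifier weights (dummy)
c = 42                             # class of interest

cam = np.tensordot(w[c], f, axes=1)                       # M_c(x, y) = sum_k w_k^c f_k(x, y)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```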
¹ S. Liu, S. Zhi, E. Johns, and A.J. Davison. Bootstrapping Semantic Segmentation with Regional Contrast. In International Conference on Learning Representations (ICLR). 2022.
² H. Hu, F. Wei, H. Hu, Q. Ye, J. Cui, and L. Wang. Semi-Supervised Semantic Segmentation via Adaptive Equalization Learning. In Advances in Neural Information Processing Systems, 34, pp.22106-22118. 2021.
³ Y. Wang, et al. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.4248-4257. 2022.
Figure 12: Illustration of Regional Contrast (ReCo). Representations are extracted for pixel queries and trained in a contrastive learning fashion. Source: Liu et al.¹
Figure 13: Illustration of Adaptive Equalization Learning (AEL). Unlabeled data is fed to both teacher and student models, and the response of the former is used to regularize the latter's. Source: Hu et al.²
Figure 14: Diagram of U²PL, which employs ideas from both ReCo and AEL for more efficient Semi-Supervised Semantic Segmentation training. Source: Wang et al.³
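A common ingredient in these teacher-student schemes is an exponential-moving-average (EMA) teacher; a generic sketch follows (the momentum α and the exact update rule are assumptions, not taken from any one of the cited papers):

```python
# Sketch: EMA update of teacher weights from the student (generic, PyTorch).
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """teacher <- alpha * teacher + (1 - alpha) * student, parameter-wise."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```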
Our approach is inspired by previous research on Mutual Promotion and Self-Supervised Learning.
Key differences:
Diagram: a warm-up stage, after which the Student and Teacher are trained in alternating roles; pseudo labels are derived as $c_i^\star = \arg\max_c s^{c,t}_{ihw}$, i.e., the class with the highest teacher score at each pixel of each image (see the sketch below).
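A sketch of that pseudo-labeling rule in PyTorch (tensor shapes and the confidence threshold δ are assumptions; δ matches the threshold analyzed below):

```python
# Sketch: hard pseudo labels from teacher scores, c* = argmax_c s[i, c, h, w].
import torch

def pseudo_labels(teacher_logits, delta=0.5, ignore_index=255):
    """teacher_logits: (N, C, H, W) raw scores from the teacher model."""
    s = teacher_logits.softmax(dim=1)       # per-pixel class probabilities
    conf, c_star = s.max(dim=1)             # confidence and argmax class
    c_star[conf < delta] = ignore_index     # mask out low-confidence pixels
    return c_star                           # (N, H, W) pseudo segmentation map
```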
Ls2c alone results in overfitting
+ Lu greatly improves the results
+ Lm produces the best outcome (see the combined form below)
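A plausible reading of the combined objective (the weighting scheme and λ values are assumptions, not taken from the paper):

$$\mathcal{L} = \mathcal{L}_{s2c} + \lambda_u \mathcal{L}_u + \lambda_m \mathcal{L}_m$$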
Robust segmentation scores obtained by CSRM: higher and less variable scores for almost all choices of the threshold δ.
CSRM results in the highest (individual) relative improvement
CSRM can be further improved with refinement methods
Figure 15: Qualitative results of pseudo labels generated by CSRM and refined with dCRF and SEPL.
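A sketch of the dCRF refinement step with the pydensecrf package (assumed installed; the kernel parameters are common defaults, not the paper's exact settings, and SEPL is not shown):

```python
# Sketch: refining CSRM pseudo labels with a dense CRF.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_dcrf(image, probs, n_iters=10):
    """image: (H, W, 3) uint8 RGB; probs: (C, H, W) softmax class scores."""
    C, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, C)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)        # spatial smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,       # appearance (color) kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = np.array(d.inference(n_iters)).reshape(C, H, W)
    return q.argmax(axis=0)                       # refined (H, W) label map
```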
Figure 16: Qualitative results of prediction proposals made by segmentation models trained over pseudo labels generated by CSRM.