Exploring Explaining Methods in Multi-Label Problems
and Complementary Regularization Strategies
in Weakly Supervised Semantic Segmentation

University of Campinas

Doctoral Qualifying Exam

Candidate: Lucas Oliveira David
Advisor: Prof. Dr. Zanoni Dias
Co-advisor: Prof. Dr. Hélio Pedrini

Schedule

1. Introduction

2. Related Work

3. Research Proposal

4. Preliminary Results

5. Final Considerations


Schedule

1. Introduction

1.1. Representation Learning

1.2. Explaining and Interpreting Models

1.3. Weakly Supervised Semantic Segmentation

1.4. Research Goals

2. Related Work

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Figure 1: Samples in the ImageNet 2012 dataset¹. Source: cs.stanford.edu/people/karpathy/cnnembed.

¹ O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein and A.C. Berg. Imagenet Large Scale Visual Recognition Challenge.
In International Journal of Computer Vision, 115, pp.211-252, 2015.

Representation Learning Introduction

Figure 2: VGG-19, 34-layer plain, and ResNet-34 architectures¹.

Figure 3: DeepLabV3+ architecture².

Figure 4: Split-Attention Block in the ResNeSt architecture.³

¹ Source: K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778. 2016.

² Source: L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision (ECCV), pp. 801-818. 2018.

³ Source: H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun et al. ResNeSt: Split-Attention Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2736-2746. 2022.

Representation Learning Introduction

Complex Architectures Representation Learning

¹ N. Burkart, and M.F. Huber. A Survey on the Explainability of Supervised Machine Learning. In Journal of Artificial Intelligence Research, 70, pp.245-317. 2021.

² M. Tan, and Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning. PMLR. 2019.

But can we trust their predictions?

And why do we have to?¹

  • Critical operations
  • Medical diagnostics
  • Finance systems
  • Accountability and failure mitigation

Figure 5: Models of various architectures, pre-trained over ImageNet. Source: Tan and Le².

Models with millions of parameters
are now the standard.

Schedule

1. Introduction

1.1. Representation Learning

1.2. Explaining and Interpreting Models

1.3. Weakly Supervised Semantic Segmentation

1.4. Research Goals

2. Related Work

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Explaining and Interpreting Models Introduction

"An explanation is the collection of features of the interpretable domain, that have contributed for a given example to produce a decision (e.g., classification or regression).¹"

¹ G. Montavon, W. Samek, and K.R. Müller. Methods for Interpreting and Understanding Deep Neural Networks. In Digital Signal Processing, 73, pp.1-15. 2018.

² M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.

"An interpretation is the mapping of an abstract concept (e.g., a predicted class) into a domain that the human can make sense of.¹"

Figure 7: Example of the LRP method being applied to explain the prediction of class boat, given the image x. Source: Montavon et al.¹

Figure 6: Illustration of Activation Maximization² applied to finding the prototypes for each class in the MNIST dataset. Source: Montavon et al.¹

In Computer Vision Explainable AI

Explainability and explainable predictions:

¹ K. Simonyan, A. Vedaldi, A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. 2013.

² D. Smilkov, N. Thorat, B. Kim, F. Viégas, M. Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. 2017.

Figure 8: Sensitivity maps produced by Vanilla Gradient¹ (second row) and Smooth-Grad² (third row), when employed to explain the predictions made by a Xception model. Source: keras-explainable/methods/saliency/smoothgrad.

Interesting Properties:

  1. Completeness
  2. Weak dependence
  3. Class-specificity

In Computer Vision Explainable AI

Leveraging internalized knowledge
to solve different tasks:

Figure 9: Sensitivity maps produced by Smooth-Grad.
Source:
keras-explainable/methods/saliency/smoothgrad.

Schedule

1. Introduction

1.1. Representation Learning

1.2. Explaining and Interpreting Models

1.3. Weakly Supervised Semantic Segmentation

1.4. Research Goals

2. Related Work

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Semantic (and others) Segmentation Introduction

¹ H. Xiao, D. Li, H. Xu, S. Fu, D. Yan, K. Song, and C. Peng. Semi-Supervised Semantic Segmentation with Cross Teacher Training. Neurocomputing, 508, pp.36-46. 2022.

² H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. In European Conference on Computer Vision (ECCV), pp. 405-420. 2018.

³ L. Chan, M.S. Hosseini. and K.N. Plataniotis. A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains. In International Journal of Computer Vision, 129, pp.361-384. 2021.

Figure 15: Example of annotated CT Scan image. Source: https://radiopaedia.org/cases/liver-segments-annotated-ct-1

Figure 13: Example of road segmentation in SpaceNet dataset. Source: https://www.v7labs.com/open-datasets/spacenet

Figure 14: Example of (a) morphological and (b) functional segmentation of samples in the Atlas of Digital Pathology dataset. Source: L. Chan et al.

Figure 11: Example of samples and ground-truth panoptic segmentation annotation from the MS COCO 2017 dataset. Source: https://cocodataset.org/#panoptic-2020.

Figure 12: Example of semantic segmentation produced by ICNet for a video sample in the Cityscapes dataset. Source: https://gitplanet.com/project/fast-semantic-segmentation.

Figure 10: Samples, proposals¹ and ground-truth segmentation annotation from the Pascal VOC 2012 dataset.

How Is It Done? Semantic Segmentation

¹ J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440. 2015.

Figure 16: Fully Convolutional Network (FCN) architecture¹, mapping image samples to their respective semantic segmentation maps.

This information needs to be known and available at training time.

\text{CE}(p_i, y_i) = -\sum_{c=1}^M y_{ic}\log(p_{ic})

Equation 1: The (naive) categorical cross-entropy loss function.
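Below, a minimal TensorFlow sketch of Equation 1 applied pixel-wise, extended with an ignore value (255 is the Pascal VOC convention for unlabeled border pixels). The function name and defaults are ours, for illustration only:

```python
import tensorflow as tf

def pixelwise_cross_entropy(y_true, y_prob, ignore_value=255):
    """Categorical cross-entropy averaged over labeled pixels only.

    y_true: (batch, h, w) integer labels; y_prob: (batch, h, w, M) softmax outputs.
    """
    valid = tf.not_equal(y_true, ignore_value)               # mask unlabeled pixels
    y_safe = tf.where(valid, y_true, tf.zeros_like(y_true))  # avoid out-of-range one-hot
    y_onehot = tf.one_hot(y_safe, depth=tf.shape(y_prob)[-1])
    ce = -tf.reduce_sum(y_onehot * tf.math.log(y_prob + 1e-7), axis=-1)
    ce = tf.where(valid, ce, tf.zeros_like(ce))
    count = tf.reduce_sum(tf.cast(valid, ce.dtype))
    return tf.reduce_sum(ce) / tf.maximum(count, 1.0)
```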

Figure 18: Segmentation annotation example using Dataloop. Source: https://dataloop.ai/docs.

Figure 17: Segmentation annotation example using RoboFlow. Source: https://blog.roboflow.com/semantic-segmentation-roboflow.

Figure 19: Segmentation annotation example using LabelStudio. Source: https://labelstud.io/blog/perform-interactive-ml-assisted-labeling-with-label-studio-1-3-0.

(Fully) Supervised Learning Semantic Segmentation

Coarse annotations are quickly drawn, but lack quality (e.g., precision);
Detailed annotations take time, patience, people, and resources;
Assisted labeling tools can speed up this task.

(Weakly) Supervised Learning Semantic Segmentation

¹ O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein and A.C. Berg. Imagenet Large Scale Visual Recognition Challenge.
In International Journal of Computer Vision, 115, pp.211-252, 2015.

Figure 20: Samples in the ImageNet 2012 dataset¹. Source: cs.stanford.edu/people/karpathy/cnnembed.

Schedule

1. Introduction

1.1. Representation Learning

1.2. Explaining and Interpreting Models

1.3. Weakly Supervised Semantic Segmentation

1.4. Research Goals

2. Related Work

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Research Goals Introduction

  1. To study class-specific XAI methods in multi-label scenarios

  2. To study promising weakly supervised strategies and to propose new ones

  3. To investigate the behavior of WSSS solutions in more complex boundary cases, such as long-tail and ambiguous functional segmentation problems


Schedule

1. Introduction

2. Related Work

2.1. (Visual) Explainable Artificial Intelligence (XAI)

2.2. Weakly Supervised Semantic Segmentation (WSSS)

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Explainable AI Related Work

\text{If } f_c \approx w^\intercal I + b\text{, then } S_{f_c}(I_0) = \psi\Big(\frac{\partial f_c}{\partial I}\Big|_{I_0}\Big)

Equation 2: Saliency map for the concept c of a model f with respect to an input image I₀, generated by the (Vanilla) Gradients method¹.
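A minimal sketch of the method using TensorFlow's autodiff, assuming a Keras model outputting class scores; here ψ is taken as the channel-wise maximum of absolute gradients, one common choice:

```python
import tensorflow as tf

def gradient_saliency(model, images, c):
    """Vanilla Gradients: ψ(∂f_c/∂I) evaluated at the input images."""
    images = tf.convert_to_tensor(images)
    with tf.GradientTape() as tape:
        tape.watch(images)                 # inputs are not variables; watch explicitly
        fc = model(images)[:, c]           # score of the explained concept c
    grads = tape.gradient(fc, images)      # ∂f_c/∂I, same shape as the inputs
    return tf.reduce_max(tf.abs(grads), axis=-1)  # ψ: reduce the channel axis
```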

¹ K. Simonyan, A. Vedaldi, A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. 2013.

² S. Srinivas and F. Fleuret. Full-gradient representation for neural network visualization. In Advances in neural information processing systems, 32. 2019.

Figure 21: Sensitivity maps produced by Vanilla Gradient¹ (2nd col) and Full-Grad² (3rd col), when employed to explain the predictions made by a ResNet50 model.
Source:
keras-explainable.

S_{f_c}(I_0) = \psi\big(\nabla_I f(I)\big|_{I_0} \circ I_0\big) + \sum_{l\in L}\sum_{k\in C_l} \psi\big(f^k_b(I_0)\big)

Equation 3: Saliency map for the concept c of a model f with respect to an input image I₀, generated by the Full-Gradient method².

Lacks class-sensitivity

Expensive to compute

Class Activation Mapping Explainable AI

¹ B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921-2929. 2016.

Equation 4: Feed-forward pass for convolutional networks containing GAP layers, and the formulation of CAM¹.

f_c(x) = \sum_k w_k^c \text{GAP}(A^k) = \sum_k w_k^c \frac{1}{hw}\sum_{ij} A^k_{ij}
f_c(x) = \frac{1}{hw}\sum_{ij} \sum_k w_k^c A^k_{ij} = \text{GAP}(w^c \cdot A)
\implies L^c_\text{CAM}(f, x) = \sum_k w_k^c A^k
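Since GAP and the dense kernel commute, the map reduces to a single weighted sum of activations. A minimal NumPy sketch (the final normalization, for display, is an addition of ours):

```python
import numpy as np

def cam(A, w, c):
    """Class Activation Map for class c.

    A: (h, w, K) activations from the last convolutional layer;
    w: (K, C) kernel of the dense layer following GAP.
    """
    L = A @ w[:, c]                # sum_k w_k^c A^k, shape (h, w)
    L = np.maximum(L, 0)           # keep positive contributions only
    return L / (L.max() + 1e-7)    # normalize to [0, 1] for visualization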

Class Activation Mapping Explainable AI

¹ B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921-2929. 2016.

Figure 22: Examples of CAMs and approximate bounding boxes found for different birds in the CUB200 dataset. Source: Zhou et al.¹

Extensions and Alternatives CAM-Based Explaining Methods

¹ R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In International Conference on Computer Vision, pp. 618-626. 2017.

L^c_\text{Grad-CAM}(f, x) = \text{ReLU}(\sum_k \alpha_k^c A^k)
\alpha_k^c = \frac{1}{hw}\sum_{ij} \frac{\partial f_c(x)}{\partial A^k_{ij}}

Equation 5: Definition of the Grad-CAM visual explaining method for an arbitrary convolutional network f.
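A sketch of Equation 5 with TensorFlow, assuming a Keras classifier; the probe sub-model exposing intermediate activations is an implementation choice of ours, and the default layer name (ResNet50's last convolutional block in Keras) is merely illustrative:

```python
import tensorflow as tf

def grad_cam(model, x, c, conv_layer='conv5_block3_out'):
    """Grad-CAM: GAP-ed gradients α_k^c weight the activation maps A^k."""
    probe = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        A, y = probe(x)                                  # A: (b, h, w, K), y: (b, C)
        fc = y[:, c]
    grads = tape.gradient(fc, A)                         # ∂f_c/∂A
    alpha = tf.reduce_mean(grads, axis=(1, 2))           # α_k^c: GAP over h, w
    L = tf.nn.relu(tf.einsum('bk,bhwk->bhw', alpha, A))  # ReLU(Σ_k α_k^c A^k)
    return L / (tf.reduce_max(L) + 1e-7)
```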

Grad-CAM

Goal: to explain more complex networks, with non-linear (and yet smooth) operations after the GAP layer.

Figure 23: Examples of Grad-CAM being utilized to explain a Visual Question Answering network based on convolutional and LSTM layers. Source: Selvaraju et al.¹

¹ A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks.
In Winter Conference on Applications of Computer Vision (WACV), pp. 839-847. IEEE, 2018.

L^c_\text{Grad-CAM++}(f, x) = \text{ReLU}\Big(\sum_k \sum_{ij} \alpha^{kc}_{ij}\text{ReLU}\Big(\frac{\partial S_c}{\partial A_{ij}^k}\Big) A^k\Big)
\alpha^{kc}_{ij} = \frac{\frac{\partial^2 S_c}{(\partial A_{ij}^k)^2}}{2 \frac{\partial^2 S_c}{(\partial A_{ij}^k)^2} + \sum_{ab} A_{ab}^k \frac{\partial^3 S_c}{(\partial A_{ij}^k)^3}}

Equation 6: Definition of the Grad-CAM++ visual explaining method.

Grad-CAM++

Goal: to activate homogeneously over all instances of the explained concept lying in the visual receptive field.

Extensions and Alternatives CAM-Based Explaining Methods

Figure 24: Grad-CAM and Grad-CAM++ applied to samples in the ImageNet dataset. Source: Chattopadhay et al.¹

¹ H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Conference on Computer Vision and Pattern Recognition Workshops (CVPR), pp. 24-25. 2020.

L^c_\text{Score-CAM}(f, x) = \text{ReLU}\Big(\sum_k f_c(x \circ \frac{A^k}{\max A^k}) A^k\Big)

Equation 7: Definition of the Score-CAM visual explaining method¹.
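A gradient-free NumPy sketch of Equation 7, assuming activation maps already upsampled to the input resolution; the per-channel loop makes the computational cost evident (one forward pass per map):

```python
import numpy as np

def score_cam(model, A, x, c):
    """Score-CAM: activation maps weighted by the scores of masked inputs.

    model: callable mapping a batch of images to class scores;
    A: (H, W, K) activation maps, upsampled to the input size; x: (H, W, 3).
    """
    maps = A / (A.max(axis=(0, 1), keepdims=True) + 1e-7)   # A^k / max A^k
    scores = np.array([model((x * maps[..., k:k + 1])[None])[0, c]
                       for k in range(A.shape[-1])])        # f_c(x ∘ A^k / max A^k)
    L = np.maximum((A * scores).sum(axis=-1), 0)            # ReLU(Σ_k f_c(·) A^k)
    return L / (L.max() + 1e-7)
```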

Score-CAM

Goal: to combine the many activation maps, weighted by their contribution towards the Average Drop % metric.

Extensions and Alternatives CAM-Based Explaining Methods

Figure 25: Examples of sensitivity maps obtained from Grad-CAM, Grad-CAM++ and Score-CAM.
Source: Wang et al.¹

Schedule

1. Introduction

2. Related Work

2.1. (Visual) Explainable Artificial Intelligence (XAI)

2.2. Weakly Supervised Semantic Segmentation (WSSS)

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Weakly Supervised Semantic Segmentation Related Work

Coarse Semantic Segmentation Priors WSSS

Figure 26: Semantic Segmentation Priors produced by thresholding CAMs devised from a ResNet101 model trained over MS COCO 2017 dataset.

Refinement of Segmentation Masks WSSS

  1. Architectural
  2. Pixel neighborhood affinity and similarity
  3. Many other strategies: Seed-Expand-Constrain; region semantic-based clustering; token-based similarity matching, etc.

Refinement of Segmentation Masks WSSS

¹ Z. Wu, C. Shen, and A. Van Den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. In Pattern Recognition, 90, pp.119-133. 2019.

  1. Architectural

  • Fewer layers, more units
  • "Bottleneck" blocks
  • Strong dropout
  • Dilation: feature maps retain higher spatial resolution (e.g., a (3, 512, 512) input maps to (4096, 64, 64) features, instead of the (2048, 16, 16) produced by conventional backbones)

FC Conditional Random Fields Refinement of Segmentation Masks

¹ P. Krähenbühl, and V. Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In Advances in Neural Information Processing Systems, 24. 2011.

E(x) = \sum_i \psi_u(x_i) + \sum_{i < j} \psi_p(x_i, x_j)

\psi_p(x_i, x_j) = \mu(x_i, x_j)\Big[w^{(1)}\exp\big(-\frac{|p_i-p_j|^2}{2\theta_\alpha^2}-\frac{|I_i-I_j|^2}{2\theta_\beta^2}\big) + w^{(2)}\exp\big(-\frac{|p_i-p_j|^2}{2\theta_\gamma^2}\big)\Big]

where ψ_u is the unary potential and ψ_p the pairwise potential, composed of a learnable label compatibility function μ, an appearance kernel (first exponential term), and a smoothness kernel (second exponential term).

Figure 27: Qualitative results of dCRF. Source: Krähenbühl and Koltun¹.
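In practice, this refinement is commonly run with the pydensecrf package; a sketch under typical kernel parameters (the values below are illustrative defaults, not necessarily the ones used in our experiments):

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_dcrf(image, probs, iterations=10):
    """dCRF refinement of class probabilities (e.g., softened CAM priors).

    image: (H, W, 3) uint8 RGB; probs: (C, H, W) class probabilities.
    """
    C, H, W = probs.shape
    crf = dcrf.DenseCRF2D(W, H, C)
    crf.setUnaryEnergy(unary_from_softmax(probs))  # ψ_u = -log p
    crf.addPairwiseGaussian(sxy=3, compat=3)       # smoothness kernel
    crf.addPairwiseBilateral(sxy=80, srgb=13, rgbim=np.ascontiguousarray(image),
                             compat=10)            # appearance kernel
    q = crf.inference(iterations)
    return np.argmax(np.array(q).reshape(C, H, W), axis=0)
```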

Pixel Semantic Affinity Refinement of Segmentation Masks

¹ J. Ahn, and S. Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4981-4990. 2018.

Training

W_{ij} = \exp\{-\|f(x_i, y_i) - f(x_j, y_j)\|_1\}

\mathcal{L} = \mathcal{L}_\text{fg}^+ + \mathcal{L}_\text{bg}^+ + 2\mathcal{L}^- = -\frac{1}{|\mathcal{P}_\text{fg}^+|}\sum_{ij\in \mathcal{P}_\text{fg}^+}\log W_{ij} - \frac{1}{|\mathcal{P}_\text{bg}^+|}\sum_{ij\in \mathcal{P}_\text{bg}^+}\log W_{ij} - \frac{2}{|\mathcal{P}^-|}\sum_{ij\in \mathcal{P}^-}\log (1-W_{ij})

Inference

T = D^{-1}W^{\circ\beta}, \quad D_{ii} = \sum_j W_{ij}^\beta

\text{vec}(M_c^*) = T^t \cdot \text{vec}(M_c), \quad \forall c \in C \cup \{\text{bg}\}
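A dense NumPy sketch of the inference step just defined (real implementations keep W sparse, restricted to a local neighborhood; β and t are hyperparameters with illustrative defaults here):

```python
import numpy as np

def random_walk(M, W, beta=8, t=256):
    """Propagate a vectorized CAM M (h*w,) through affinities W (h*w, h*w)."""
    Wb = W ** beta                            # W^{∘β}: sharpen affinities
    T = Wb / Wb.sum(axis=1, keepdims=True)    # T = D^{-1} W^{∘β}
    for _ in range(t):                        # vec(M*) = T^t · vec(M)
        M = T @ M
    return M
```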

Pairs extraction

Figure 5: Illustration of pairs of pixels selected for affinity evaluation. Source: Ahn and Kwak¹.

Figure 5: AffinityNet architecture. Source: Ahn and Kwak¹.

Figure 28: Qualitative results of random walk using Affinity Network. Source: Ahn and Kwak¹.

Puzzle-CAM Better Segmentation Priors

¹ S. Jo, and I. Yu. Puzzle-CAM: Improved localization via matching partial and full features. In IEEE International Conference on Image Processing (ICIP), pp. 639-643. IEEE, 2021.

Figure 29: Puzzle-CAM architecture: the input image is forwarded into the model, producing the global stream. Concomitantly, the input is also cut into four "puzzle" pieces and forwarded separately, composing the "local" stream when merged. Source: Jo and Yu¹.

¹ H. Kweon, S. H. Yoon, H. Kim, D. Park, and K. J. Yoon. Unlocking the potential of ordinary classifier: Class-specific adversarial erasing framework for weakly supervised semantic segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6994-7003. 2021.

Figure 30: OC-CSE architecture: the input image is forwarded into the CGNet, producing a mask for a random class k. The mask is then used to erase objects of k in the image, which is fed to a fixed OC (ordinary classifier) model. Weights are adjusted so the mask provides a comprehensive erasure of the objects. Source: Kweon et al.¹

OC-CSE Better Segmentation Priors

C²AM Better Segmentation Priors

Figure 31: C²AM processing pipeline, comprising training, inference, and refinement stages. Source: Xie et al.¹

¹ J. Xie, J. Xiang, J. Chen, X. Hou, X. Zhao, and L. Shen. Contrastive learning of class-agnostic activation map for weakly supervised object localization and semantic segmentation. arXiv preprint arXiv:2203.13505. 2022.


Schedule

1. Introduction

2. Related Work

3. Research Proposal

3.1. Motivation

3.2. Proposed Approach and Research Questions

3.3. Experimental Setup

4. Preliminary Results

5. Final Considerations

Figure 32: Examples of sensitivity maps obtained from Grad-CAM, Grad-CAM++ and Score-CAM over samples in the Pascal VOC 2007 dataset. Predictions being explained are: person, train, person, sofa, dog, person, motorcycle, and person. Source: David et al.¹

¹ L. David, H. Pedrini, and Z. Dias. MinMax-CAM: Improving focus of CAM-based visualization techniques in multi-label problems. In 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, pages 106–117. INSTICC, SciTePress, 2022.

Motivation Research Proposal

Motivation Research Proposal

¹ W. Sun, J. Zhang, Z. Liu, Y. Zhong, N. Barnes. GETAM: Gradient-weighted element-wise transformer attention map for weakly-supervised semantic segmentation. arXiv preprint arXiv:2112.02841. 2021 Dec 6.

Figure 33: Semantic Segmentation priors produced by a ResNet38d model trained with OC-CSE. CAMs were generated using Grad-CAM and Test-Time Augmentation (TTA). Source: keras-explainable/wsol.

Figure 34: mIoU measured over Pascal VOC 2012 testing dataset. Source: https://paperswithcode.com/sota/semantic-segmentation-on-pascal-voc-2012.

Source: Sun et al.¹

Schedule

1. Introduction

2. Related Work

3. Research Proposal

3.1. Motivation

3.2. Proposed Approach and Research Questions

3.3. Experimental Setup

4. Preliminary Results

5. Final Considerations

  1. How do Explainable AI methods behave in multi-label scenarios?
  2. Can cross-contributions be erased from the CAMs produced by Grad-CAM?

1. Exploration of Explainable AI Methods in Multi-Label Problems

Proposed Approach Research Proposal

  • Can complementary strategies be conjointly employed to improve WSSS?
  • Is adversarial CAM generation beneficial to WSSS solutions?
  • Can context-decoupling help WSSS methods to segment cluttered scenes?

2. Complementary Regularization Strategies in WSSS

Proposed Approach Research Proposal

  • Can Vision Transformers improve fine-grained WSSS?
  • Can WSSS methods be adapted to Vision Transformers?

3. Exploration of Transformers and Spatial Attention for Highly-Detailed Segmentation

Proposed Approach Research Proposal

  • Can long-tail learning improve WSSS in boundary cases?
  • Which features can be drawn from functional segmentation problems to replace visual similarity, a fundamental aspect of WSSS methods?

4. Weak Supervision in Boundary and Difficult Scenarios: Class Imbalance, Long-tail and Functional Segmentation

Proposed Approach Research Proposal

  • Can WSSS ensembles improve noisy segmentation priors?
  • Is contextual information useful when combining predictions?
  • Which tasks share mutual information with Semantic Segmentation?
    • Saliency Detection
    • Edge Detection
    • Instance Segmentation

5. Ensemble of Weakly Supervised Semantic Segmentation Systems

Proposed Approach Research Proposal

Work Schedule Research Proposal

Activities

Class attendance and completion of required credits

Exploration of XAI methods in multi-label scenarios

Adversarial and complementary strategies in WSSS

Doctoral Qualifying Exam (EQE)

Participation in "Programa de Estágio Docente" (PED)

Exploration of Transformers and Spatial Attention

Boundary and difficult scenarios

Ensemble of solutions for WSSS

Writing and presentation of Doctoral thesis

Schedule

1. Introduction

2. Related Work

3. Research Proposal

3.1. Motivation

3.2. Proposed Approach and Research Questions

3.3. Experimental Setup

4. Preliminary Results

5. Final Considerations

Experimental Setup Research Proposal

Environment

SDumont Supercomputer:

  • 4x NVIDIA Volta V100 (training)
  • 2x NVIDIA K40 (inference)

Google Colab:

  • NVIDIA Tesla K80

Tools

  • TensorFlow and PyTorch

Metrics

WSSS:

  1. mean Intersection over Union (mIoU)
  2. Pixel Accuracy
  3. F1 Score

XAI:

  1. Increase in Confidence
  2. Average Drop %
  3. Average Drop of Others %
  4. Average Retention %
  5. Average Retention of Others %

Average Drop of Others %, Average Retention %, and Average Retention of Others % were proposed by us.


Schedule

1. Introduction

2. Related Work

3. Research Proposal

4. Preliminary Results

4.1. Contributions for Explainable AI

4.2. Contributions for WSSS

5. Final Considerations

L^c_\text{CAM}(f, x) = \sum_k w_k^c A^k
L^c_\text{Grad-CAM}(f, x) = \sum_k \sum_{ij} \frac{\partial f_c(x)}{\partial A^k_{ij}} A^k

Contribution towards the classification of class c.

ReLU and GAP omitted for conciseness

J_c = S_c - \frac{1}{|N_x|} \sum_{n\in N_x} S_n

Regions that contribute towards the classification (t.t.c.) of c, and do not contribute t.t.c. of the adjacent classes.

L^c_\text{MinMax-Grad-CAM}(f, x) = \sum_k \sum_{ij}\frac{\partial J_c}{\partial A_{ij}^k} A^k
L^c_\text{MinMax-CAM}(f, x) = \sum_k \big[w^c_k - \frac{1}{|N_x|} \sum_{n\in N_x} w^n_k\big]
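For GAP-based classifiers, MinMax-CAM thus only changes the combining weights. A minimal NumPy sketch (the normalization for display is an addition of ours):

```python
import numpy as np

def minmax_cam(A, w, c, negatives):
    """MinMax-CAM for class c, contrasting against the co-occurring classes N_x.

    A: (h, w, K) activations; w: (K, C) classifier kernel;
    negatives: indices of the other classes present in the image (N_x).
    """
    weights = w[:, c] - w[:, negatives].mean(axis=1)  # w_k^c - (1/|N_x|) Σ_n w_k^n
    L = np.maximum(A @ weights, 0)
    return L / (L.max() + 1e-7)
```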

MinMax-CAM Contributions for Explainable AI

L^c_\text{D-MinMax-Grad-CAM}(f, x) = \text{ReLU}\Big(\sum_k \alpha^c_k A^k\Big)

MinMax-CAM Contributions for Explainable AI

\alpha^c_k = \sum_{ij} \bigg[\text{ReLU}\Big(\frac{\partial S_c}{\partial A_{ij}^k}\Big) - \frac{1}{|N_x|}\text{ReLU}\Big(\sum_{n\in N_x} \frac{\partial S_n}{\partial A_{ij}^k}\Big) + \frac{1}{|C_x|}\min\Big(0, \sum_{n\in C_x} \frac{\partial S_n}{\partial A_{ij}^k}\Big) \bigg]

Positive contributions t.t.c. of c

Positive contributions t.t.c. of n

Negative contributions t.t.c. of all.

Qualitative Results over VOC MinMax-CAM

Figure 35: Comparison of CAMs obtained from various XAI methods. Predictions being explained are: person, train, motorcycle, person, chair, and table. Source: David et al.¹

Figure 36: Comparison of sensitivity maps from various XAI methods. Source: David et al.¹

Qualitative Results over COCO 2017 MinMax-CAM

Figure 37: Comparison of sensitivity maps obtained from various XAI methods over the MS COCO 2017 dataset. Source: David et al.¹

Qualitative Results over HPA MinMax-CAM

Figure 38: Comparison of sensitivity maps obtained from various XAI methods over the Human Protein Atlas Image Classification dataset. Source: David et al.¹

Quantitative Results MinMax-CAM

Table 2: Report of metric scores over multiple datasets.

Kernel Usage Regularization Contributions for Explainable AI

W^r_\alpha = W \circ \alpha \text{softmax}(W)
y = \sigma(g \cdot W^r_\alpha + b)
W=[w_k^c]_{K\times C}
g=[g^k]_{K} = \text{GAP}_{hw}(A^k)
b = [b_c]_C
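A sketch of this head as a Keras layer; we assume the softmax is taken over the kernel axis (encouraging distinct classes to use distinct kernels) and that α is a fixed hyperparameter, both illustrative choices:

```python
import tensorflow as tf

class KURDense(tf.keras.layers.Layer):
    """Sigmoid FC head with Kernel Usage Regularization: y = σ(g · W^r_α + b)."""

    def __init__(self, classes, alpha=2.0, **kwargs):
        super().__init__(**kwargs)
        self.classes = classes
        self.alpha = alpha

    def build(self, input_shape):
        self.w = self.add_weight('kernel', shape=(input_shape[-1], self.classes))
        self.b = self.add_weight('bias', shape=(self.classes,), initializer='zeros')

    def call(self, g):                       # g: (batch, K) GAP-pooled features
        wr = self.w * self.alpha * tf.nn.softmax(self.w, axis=0)  # W ∘ α softmax(W)
        return tf.sigmoid(g @ wr + self.b)
```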

Figure 39: Correlation between different weight vectors in a vanilla (unregularized) sigmoid FC layer. Source: David et al.¹

Figure 40: Correlation between different weight vectors in a sigmoid FC layer trained with Kernel Usage Regularization. Source: David et al.¹

Kernel Usage Regularization Contributions for Explainable AI

Table 3: Classification scores over multiple datasets, comparing a baseline classifier with the model trained with Kernel Usage Regularization (KUR).

Schedule

1. Introduction

2. Related Work

3. Research Proposal

4. Preliminary Results

4.1. Contributions for Explainable AI

4.2. Contributions for WSSS

5. Final Considerations

Exploration of Complementary WSSS Strategies Contributions for WSSS

\mathcal{L}_\text{P-OC} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{re-cls} + \lambda_\text{re}\mathcal{L}_\text{re} + \lambda_\text{cse}\mathcal{L}_\text{cse}
= \ell_\text{bce}(p_i, t_i) + \ell_\text{bce}(p^\text{re}_i, t_i) + \lambda_\text{re}\|A_i - A^\text{re}_i\|_1 + \lambda_\text{cse}\ell_\text{bce}(\hat{p}_i, \hat{t}_i)
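Schematically, with binary cross-entropy terms (the λ defaults and the mean-absolute reduction of the L1 term are placeholders, not our tuned values):

```python
import tensorflow as tf

def p_oc_loss(t, p, p_re, A, A_re, p_cse, t_cse, l_re=1.0, l_cse=1.0):
    """Sketch of the combined P-OC objective."""
    bce = tf.keras.losses.BinaryCrossentropy()
    return (bce(t, p)                                   # L_cls: global stream
            + bce(t, p_re)                              # L_re-cls: re-merged stream
            + l_re * tf.reduce_mean(tf.abs(A - A_re))   # L_re: CAM reconstruction
            + l_cse * bce(t_cse, p_cse))                # L_cse: class-specific erasing
```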

Exploration of Complementary WSSS Strategies Contributions for WSSS

Figure 41: Priors obtained by (from left to right): Vanilla (RandAugment), OC-CSE, Puzzle, P-OC.


P-NOC Contributions for WSSS

Figure 42: Overview of our adversarial training setup, in which f is optimized considering both the Puzzle module and the ordinary classifier oc. f is subsequently fixed, and oc is updated to shift its attention towards regions currently ignored by f.

P-NOC Contributions for WSSS

\mathcal{L}_\text{noc} = \lambda_\text{noc}\ell_\text{bce}(oc(x_i \circ (M_i^{c_k} < \delta_\text{noc})), t_i)
\mathcal{L}_\text{P-OC} = \ell_\text{bce}(p_i, t_i) + \ell_\text{bce}(p^\text{re}_i, t_i)
+ \lambda_\text{re}\|A_i - A^\text{re}_i\|_1 + \lambda_\text{cse}\ell_\text{bce}(\hat{p}_i, \hat{t}_i)
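The new adversarial term, sketched below: the CAM M for a randomly drawn class c_k is thresholded, the corresponding regions are erased, and the (not-so-)ordinary classifier oc is penalized for missing the labels. Names and the δ default are illustrative:

```python
import tensorflow as tf

def noc_loss(oc, x, M_ck, t, delta=0.5, l_noc=1.0):
    """L_noc: BCE of oc over images with regions of class c_k erased.

    x: (b, H, W, 3) images; M_ck: (b, H, W) CAM for class c_k, at input resolution.
    """
    keep = tf.cast(M_ck < delta, x.dtype)[..., None]   # x ∘ (M^{c_k} < δ_noc)
    return l_noc * tf.keras.losses.BinaryCrossentropy()(t, oc(x * keep))
```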

C²AM-H Contributions for WSSS

\mathcal{L}^\mathcal{B}_\text{C²AM-H} = \mathcal{L}^\mathcal{B}_\text{pos-f} + \mathcal{L}^\mathcal{B}_\text{pos-b} + \mathcal{L}^\mathcal{B}_\text{neg} + \lambda_{h}\sum_{i\in b}\sum_{h,w} \mathbb{1}_{[A_i^{hw} > \delta_\text{fg}]}\ell_\text{bce}(\hat{y}^{hw}_i, p^{hw}_i)

Figure 43: CAMs produced by a network trained with P-OC, when presented with samples from the Pascal VOC 2012 train set.

Figure 44: Hints obtained by binarizing the CAMs, using a threshold of 0.4.
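A one-line sketch of this hint extraction, assuming per-class CAMs normalized to [0, 1]:

```python
import numpy as np

def saliency_hints(cams, fg_threshold=0.4):
    """Binary foreground hints from (h, w, C) normalized CAMs."""
    return (cams.max(axis=-1) > fg_threshold).astype(np.uint8)
```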

C²AM-H Contributions for WSSS

Figure 45: Saliency proposals obtained from a PoolNet model, after being trained with C²AM-H pseudo saliency maps.

Figure 46: Affinity labels. From left to right: (a) ground-truth maps, (b) coarse priors, (c) priors +dCRF, and (d) priors +C²AM-H +dCRF.

Ablation Studies Contributions for WSSS

Table 4: Ablation studies of pseudo segmentation masks, measured in mIoU (%) over
Pascal VOC 2012 training and validation sets.

(Refined) Pseudo Segmentation Maps P-NOC +C²AM-H

Figure 47: Pseudo segmentation maps obtained by random walking over segmentation priors generated by a model trained with P-NOC proposals. The Affinity Network was trained over labels refined with saliency maps devised from C²AM-H.

Qualitative Results over VOC 2012 P-NOC +C²AM-H

Figure 48: Qualitative results over the Pascal VOC 2012 dataset. Segmentation proposals obtained by a DeepLabV3+ model trained with pseudo labels devised from P-NOC +C²AM-H.

Quantitative Results over VOC 2012 P-NOC +C²AM-H

Table 5: Comparison with other methods in the literature. mIoU (%) scores are reported for both Pascal VOC 2012 validation and testing sets.

Quantitative Results over COCO 2014 P-NOC +C²AM-H

Table 6: Comparison with other methods in the literature. mIoU (%) scores are reported for the MS COCO 2014 validation set. P-NOC and OC-CSE: priors employed, no refinement conducted.

Schedule

1. Introduction

2. Related Work

3. Research Proposal

4. Preliminary Results

5. Final Considerations

Final Considerations

We conducted studies on:

  • XAI in broader (multi-label) scenarios
    • MinMax-CAM
  • Complementary Regularization Strategies in WSSS
    • Adversarial CAM generation for more robust priors

 

As future work, we propose to investigate:

  • Transformers in WSSS
  • WSSS in Boundary and Difficult Scenarios
  • Ensemble and meta-learning strategies in WSSS

Scientific Production Final Considerations

  1. L. David, H. Pedrini, and Z. Dias. MinMax-CAM: Improving focus of CAM-based visualization techniques in multi-label problems. In 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), pages 106–117. INSTICC, SciTePress, 2022.
     
  2. L. David, H. Pedrini, and Z. Dias. MinMax-CAM: Increasing Precision of Explaining Maps by Contrasting Gradient Signals and Regularizing Kernel Usage. In 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), CCIS Series, Springer, 2023.
     
  3. L. David, H. Pedrini, and Z. Dias. Not so Ordinary Classifier: Revisiting Complementary Regularizing Strategies for More Robust Priors in Weakly Supervised Semantic Segmentation.

Technical Contributions Final Considerations

  1. Implemented pixel-ignoring functionality in the cross-entropy loss in Keras, for semantic segmentation problems.

  2. Ported the Wide ResNet38-d and ResNeSt architectures, originally trained in PyTorch, to TensorFlow.

  3. Created the keras-explainable library, containing out-of-the-box implementations of many Explainable AI algorithms.

  4. Various fixes in Keras and TensorFlow-Addons, often related to optimizers and mixed-precision training in a MultiWorkerMirroredStrategy environment.

Acknowledgements Final Considerations
