Gated Shape CNNs for Semantic Segmentation

arXiv

12 Jul 2019

Project Page

What's the idea?

Problem:

In current methods, color, shape, and texture information are all processed together, i.e. we are losing information

Solution:

Let's simply* process shape in a parallel stream

* with novel Gated Conv Layer, Atrous Spatial Pyramid Pooling, Dual Task Regularizer and other cool stuff

Architecture

Regular Stream $R_θ(I)$

Input Image $$I∈R^{3×H×W}$$

Output feature representation

$$r∈R^{C×\frac{H}{m}×\frac{W}{m}}$$, where $m$ is the stride of the regular stream

Shape Stream $S_φ$

Canny edge detector

Output boundary map

$$s∈R^{H×W}$$

Shape stream diagram: a chain of gated conv layers, each fed by a regular-stream feature map ($r_0$, …, $r_{m-2}$, $r_{m-1}$, $r_m$)

Each gated conv layer: concat → conv 1x1 → $σ$ → $\circ$ → $\bigoplus$ → conv, giving $\hat{s_1}$, …, $\hat{s_{m-2}}$, $\hat{s_{m-1}}$, $\hat{s_m}$

$\circ$ is an element-wise product

After $\hat{s_m}$: conv 1x1 → Edge BCE Loss, then concat with $\nabla I$ → conv 1x1
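A minimal NumPy sketch of one gated conv layer, following the order of operations in the diagram (concat, conv 1x1, $σ$, element-wise product, sum, conv). It is a forward-only illustration, not the authors' implementation. The helper names (`conv1x1`, `gated_conv_layer`), the single-channel gate, the use of a 1x1 kernel for the final conv, and the assumption that `r` has already been resized to the resolution of `s` are all choices made here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def gated_conv_layer(s, r, w_gate, w_out):
    """One gated conv layer (hypothetical helper).

    s: (Cs, H, W) shape-stream features
    r: (Cr, H, W) regular-stream features, assumed to be at the same resolution
    w_gate: (1, Cs + Cr) weights of the gating 1x1 conv
    w_out: (C_out, Cs) weights of the final conv (assumed 1x1 here)
    """
    # concat -> conv 1x1 -> sigmoid gives an attention map with values in (0, 1)
    alpha = sigmoid(conv1x1(np.concatenate([s, r], axis=0), w_gate))  # (1, H, W)
    # element-wise product with the shape features, then the sum with s itself
    gated = s * alpha + s
    # final conv produces the layer output
    return conv1x1(gated, w_out), alpha

rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8, 8))
r = rng.normal(size=(6, 8, 8))
w_gate = rng.normal(size=(1, 10))
w_out = rng.normal(size=(4, 4))
s_hat, alpha = gated_conv_layer(s, r, w_gate, w_out)
print(s_hat.shape, alpha.shape)
```

Because the gate is a single channel, the same spatial attention map scales every channel of `s`; where the gate is close to zero the layer passes `s` through almost unchanged because of the sum.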

Fusion Module $F_γ$

Inputs: $r$ (regular stream) and $s$ (shape stream), fused with Atrous Spatial Pyramid Pooling

$f=p(y|s, r)=F_γ(s,r)∈R^{C×H×W}$

Losses: Semantic CE Loss on $f$, plus the Dualtask Loss

$L^{θ φ,γ}=λ_1L^{θ,φ}_{BCE}(s,\hat{s})+λ_2L^{θ φ,γ}_{CE}(\hat{y}, f)$, where $λ_1=10, λ_2=1$
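The weighted sum above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the reference implementation: the function names (`bce_loss`, `ce_loss`, `joint_loss`) are made up here, both terms are averaged over pixels (the slide does not state the reduction), and the inputs are assumed to be probabilities rather than logits.

```python
import numpy as np

def bce_loss(s, s_gt, eps=1e-7):
    """Binary cross-entropy between a predicted boundary map s in (0, 1)
    and GT boundaries s_gt in {0, 1}; both are (H, W)."""
    s = np.clip(s, eps, 1.0 - eps)
    return float(-np.mean(s_gt * np.log(s) + (1.0 - s_gt) * np.log(1.0 - s)))

def ce_loss(y_gt, f, eps=1e-7):
    """Cross-entropy between GT labels y_gt (H, W), integer class ids,
    and predicted class probabilities f (K, H, W)."""
    h, w = y_gt.shape
    # probability assigned to the true class at every pixel
    p_true = f[y_gt, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))

def joint_loss(s, s_gt, y_gt, f, lambda1=10.0, lambda2=1.0):
    """lambda1 * BCE(boundaries) + lambda2 * CE(segmentation)."""
    return lambda1 * bce_loss(s, s_gt) + lambda2 * ce_loss(y_gt, f)

# toy example: maximally uncertain predictions on a 4x4 image with 4 classes
s = np.full((4, 4), 0.5)
s_gt = np.zeros((4, 4))
f = np.full((4, 4, 4), 0.25)
y_gt = np.zeros((4, 4), dtype=int)
loss = joint_loss(s, s_gt, y_gt, f)
print(loss)
```

With $λ_1=10$ and $λ_2=1$ the boundary term dominates the sum whenever the two per-pixel losses are of similar size.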

Let $ζ∈R^{H×W}$ be a potential that represents whether a particular pixel belongs to a semantic boundary in the input image $I$:

$ζ=\frac{1}{\sqrt{2}}||\nabla(G ∗ \underset{k}{\mathrm{argmax}} p(y^k|r,s))||$, where $G$ denotes a Gaussian filter and $\hat{ζ}$ is the corresponding GT binary mask.
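A forward-only NumPy sketch of the potential $ζ$. Several details are assumptions made here, not taken from the slides: the argmax is turned into one-hot class maps, each map is smoothed and differentiated separately, and the per-class gradient magnitudes are combined with a maximum. The Gaussian parameters (`sigma`, `radius`) and the helper names are also illustrative. Since argmax is not differentiable, a trainable version would need some relaxation, which this sketch does not attempt.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian filter G with edge padding; img is (H, W)."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k = k / k.sum()
    padded = np.pad(img, radius, mode="edge")
    # filter along rows, then along columns; "valid" undoes the padding
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def boundary_potential(probs, sigma=1.0, radius=2):
    """probs: (K, H, W) class probabilities p(y^k | r, s). Returns zeta (H, W)."""
    num_classes = probs.shape[0]
    labels = np.argmax(probs, axis=0)
    # assumption: one-hot encode the argmax so gradients do not depend on class ids
    one_hot = (labels[None] == np.arange(num_classes)[:, None, None]).astype(float)
    zeta = np.zeros(labels.shape)
    for c in range(num_classes):
        gy, gx = np.gradient(gaussian_blur(one_hot[c], sigma, radius))
        magnitude = np.sqrt(gx ** 2 + gy ** 2) / np.sqrt(2.0)
        zeta = np.maximum(zeta, magnitude)  # assumption: max over classes
    return zeta

# toy example: class 0 on the left half, class 1 on the right half
probs = np.zeros((2, 16, 16))
probs[0, :, :8] = 1.0
probs[1, :, 8:] = 1.0
zeta = boundary_potential(probs)
print(zeta[8, 7] > zeta[8, 0])
```

In the toy example $ζ$ is non-zero only in a narrow band around the vertical class boundary and zero far away from it.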

$L^{θ φ,γ}_{reg→} = λ_3\sum_{p+}|ζ(p+) − \hat{ζ}(p+)|$

$L^{θ φ,γ}_{reg←} = λ_4\sum_{k,p} 1_{s_p} [\hat{y}^k_p \log p(y^k_p|r, s)]$

$L^{θ φ,γ}_{reg}=L^{θ φ,γ}_{reg→}+L^{θ φ,γ}_{reg←}$

Where $p+$ contains the set of all non-zero pixel coordinates, $1_s=\{1: s > thrs\}$ is an indicator function, $thrs$ is a confidence threshold $=0.8$, and $λ_3=1, λ_4=1$
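The two regularizer terms can be sketched in NumPy as below. This is a hedged illustration, not the authors' code. The function names are hypothetical, $p+$ is taken as the pixels where either $ζ$ or $\hat{ζ}$ is non-zero, and a minus sign is added to the second term so that it behaves as a cross-entropy to be minimized (the formula on the slide carries no sign). The indicator uses the boundary prediction $s$ with the threshold $0.8$.

```python
import numpy as np

def reg_forward(zeta, zeta_gt, lambda3=1.0):
    """L_reg->: sum of |zeta - zeta_gt| over the pixel set p+."""
    # assumption: p+ = pixels where either map is non-zero
    mask = (zeta != 0) | (zeta_gt != 0)
    return lambda3 * float(np.sum(np.abs(zeta[mask] - zeta_gt[mask])))

def reg_backward(probs, y_gt, s, lambda4=1.0, thrs=0.8, eps=1e-7):
    """L_reg<-: cross-entropy restricted to pixels where s > thrs.

    probs: (K, H, W) class probabilities p(y^k | r, s)
    y_gt:  (H, W) integer GT labels
    s:     (H, W) predicted boundary map
    """
    num_classes = probs.shape[0]
    one_hot = (y_gt[None] == np.arange(num_classes)[:, None, None]).astype(float)
    indicator = (s > thrs).astype(float)  # the indicator 1_s
    log_p = np.log(np.clip(probs, eps, 1.0))
    # assumption: leading minus so the term is a loss to minimize
    return -lambda4 * float(np.sum(indicator[None] * one_hot * log_p))

def dual_task_loss(zeta, zeta_gt, probs, y_gt, s):
    return reg_forward(zeta, zeta_gt) + reg_backward(probs, y_gt, s)

# toy example on a 2x2 image with 2 classes
zeta = np.array([[0.0, 0.5], [0.0, 1.0]])
zeta_gt = np.array([[0.0, 1.0], [0.0, 1.0]])
probs = np.full((2, 2, 2), 0.5)
y_gt = np.zeros((2, 2), dtype=int)
s = np.array([[0.9, 0.1], [0.85, 0.8]])
total = dual_task_loss(zeta, zeta_gt, probs, y_gt, s)
print(total)
```

In the toy example only two pixels pass the strict threshold `s > 0.8`, so the second term sums the cross-entropy of those two pixels only.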