Learning mid-level representations for computer vision

Stavros Tsogkas

People don't just "see"

other cars

how far?

Don't run them over!!

which traffic light?

Image classification paradigm

\mathbf{f} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix}

Model

"car"

Input

Discriminative representation

Machine learning algorithm

The deep learning revolution

ImageNet top-5 error rate (%)

Object detection 2010-2012: performance plateau

Performance: %mAP (mean average precision)

DPM variants

Object detection 2013

Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014

DPM variants

Deep Learning

Brief history of learning in vision

Digit recognition (CNNs, 1989)

Face detection (Haar features + AdaBoost, 2001)

Object detection

(HOGs + SVMs, 2005-2010)

Image classification (Deep CNNs, 2012)

DL success stories

Image captioning

Semantic segmentation

Edge detection

Object detection

Why now?

More data

Better hardware

Support from industry

Better tools

Practical limitations

3-5 minutes per image annotation + $$$

days/weeks to train

low-powered hardware

not enough data

CNNs can be easily fooled...

"panda" (57.7% confidence) \(+\ \epsilon \cdot\) noise \(=\) "gibbon" (99.3% confidence)

...and do not generalize well

Correctly classified

Incorrectly classified

Original images

Negative images

CNNs favour appearance over shape

Zebra or an elephant with stripes?

People learn general rules instead

What makes a table, a table?

flat surface

vertical support

Constellation table by Fulo

What are mid-level representations?

Low-level: Edges

  • class agnostic
  • not discriminative

High-level: person segmentation

  • class specific
  • not generalizable

Mid-level:

  • discriminative and robust

  • shareable across object categories

  • simpler to model, scalable

Mid-level representations include

textures

object parts

symmetries

A "vocabulary" for images

"wheel"

"sand"

  • easier to model

  • shareable

"rotational symmetry"

Unsupervised feature learning

X = (       ,       )

Y = 3

Context prediction

Doersch et al., 2015

Jigsaw puzzles

Noroozi and Favaro, 2016


Results comparable to supervised methods!

Outline

1. Medial axes

       [ECCV 2012, ICCV 2017]

2. Object parts

3. Future research

Learning mid-level representations for shape and texture.

head

torso

arms

legs

hands

[arXiv 2016, ISBI 2016, MICCAI 2016]

Medial axis detection

Symmetry is everywhere

Global symmetry is unstable

But local symmetry is more robust

Medial Axis Transform (MAT)

A transformation for extracting new descriptors of shape, H. Blum, Models for the perception of speech and visual form, 1967


MAT applications

Shape matching and recognition

Shape simplification

Shape deformation with volume preservation

MAT for natural images is not obvious

So let's learn it from data!

Image from BSDS300

Ground-truth segmentation

Ground-truth skeleton

Medial point detection: binary classification problem

Features designed for bilateral symmetry of image regions

\(s\): scale

\(\theta\): orientation

Compute colour and texture histograms \(h_1, h_2, h_3\) for the "inside" and "outside" rectangles (regions 1, 2, 3).

High \(\chi^2\) histogram distance (\(\gg 0\)) for both inside–outside pairs: high chance of symmetry!

At a second candidate point (regions 4, 5, 6): \(\chi^2(h_4,h_5) \sim \chi^2(h_4,h_6) \approx 0\): low chance of symmetry!

\mathbf{f_{\mathbf{x}}}(s_i,\theta_j) = [\chi^2_{12}, \chi^2_{13},\chi^2_{23},\ldots]
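
A rough numpy sketch of assembling such pairwise \(\chi^2\) features (histogram extraction and the region layout are simplified here, and texture histograms are omitted):

import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def color_histogram(patch, bins=16):
    """Per-channel histogram (values assumed in [0, 1]), concatenated and L1-normalized."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 1))[0]
             for c in range(patch.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / (h.sum() + 1e-10)

def symmetry_features(regions):
    """regions: list of patches (the inside/outside rectangles at one scale s
    and orientation theta). Returns all pairwise chi-squared distances."""
    hists = [color_histogram(r) for r in regions]
    return np.array([chi2_distance(hists[i], hists[j])
                     for i in range(len(hists))
                     for j in range(i + 1, len(hists))])  # [chi2_12, chi2_13, chi2_23]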

Train with multiple instance learning

Symmetry probability:

p_{\mathbf{x}}(s_i,\theta_j) = \frac{1}{1+e^{-\mathbf{w}^T\mathbf{f_x}(s_i,\theta_j)}}

Goal: learn w

Challenge: no ground truth annotation for scale and orientation

Bag of instances for point \(\mathbf{x}\): \(\mathbf{f_x}(s_1,\theta_1),\ \mathbf{f_x}(s_2,\theta_1),\ \mathbf{f_x}(s_3,\theta_1),\ \mathbf{f_x}(s_4,\theta_1)\)

The bag is positive if at least one instance is positive

Bag of instances for point \(\mathbf{y}\): \(\mathbf{f_y}(s_1,\theta_2),\ \mathbf{f_y}(s_2,\theta_2),\ \mathbf{f_y}(s_3,\theta_2),\ \mathbf{f_y}(s_4,\theta_2)\)

The bag is negative if all instances are negative

P_{\mathbf{x}} = 1-\prod_i \prod_j (1-p_{\mathbf{x}}(s_i,\theta_j)) \sim \max_i \max_j p_{\mathbf{x}}(s_i, \theta_j)

Noisy-OR = differentiable "max"
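
A minimal numpy sketch of the noisy-OR bag probability over per-instance sigmoid scores (feature dimensions and the weight vector below are placeholders):

import numpy as np

def instance_probs(w, feats):
    """Sigmoid probability for every (scale, orientation) instance.
    feats: array of shape (n_scales, n_orients, n_features)."""
    return 1.0 / (1.0 + np.exp(-feats @ w))

def noisy_or(p):
    """Bag probability: 1 - prod(1 - p), a differentiable surrogate for max."""
    return 1.0 - np.prod(1.0 - p)

# Toy example: the bag is positive if at least one instance fires.
feats = np.random.rand(4, 8, 10)   # 4 scales, 8 orientations, 10-D features
w = np.random.randn(10)
P_x = noisy_or(instance_probs(w, feats))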

Dense feature extraction

Multiple scales

Multiple orientations

\theta_1, \theta_2, \theta_3

Computing symmetry probabilities

Orientation

Scale

Symmetry probability

Non-maximum suppression

Fast detection with decision forests

Symmetry "tokens"

Clustering

p_{medial} = \displaystyle\sum_{t=1}^{N_{tokens}} p_{t} = 1-p_{bg}

~0.5 sec per image

(40-60x faster than MIL)
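
A toy sketch of recovering the medial-point probability from per-pixel token probabilities (the 7-way distribution below is synthetic, standing in for the forest's output):

import numpy as np

# probs: (H, W, N_tokens + 1) per-pixel distribution; last channel = background.
probs = np.random.dirichlet(np.ones(7), size=(64, 64))

p_background = probs[..., -1]
p_medial = probs[..., :-1].sum(axis=-1)   # sum over symmetry tokens
assert np.allclose(p_medial, 1.0 - p_background)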

MAT should be invertible

MAT^{-1}

Generative definition of medial disks

  • Medial disk \(D^I_{\mathbf{p}_i,r_i}\): the image patch covered by a disk centered at \(\mathbf{p}_i\) with radius \(r_i\) (similarly \(D^I_{\mathbf{p}_j,r_j}\)).
  • f: summarizes the patch (encoding), e.g. its mean colour \([\bar{R}_i,\bar{G}_i,\bar{B}_i]\).
  • g: reconstructs the patch (decoding), producing \(\tilde{D}^I_{\mathbf{p}_i,r_i}\).
  • Reconstruction error: \(e(D^I_{\mathbf{p},r}, \tilde{D}^I_{\mathbf{p},r}) \approx 0\) when the encoding summarizes the disk well, \(\gg 0\) otherwise.

AppearanceMAT definition

Compute \(e_{\mathbf{p},r}\) for all \(\mathbf{p}, r\), comparing each disk \(D^I_{\mathbf{p},r}\) with its reconstruction \((g \circ f)(D^I_{\mathbf{p},r})\).

\text{Objective: } \min_{\mathbf{p},r} \sum_{i=1}^m e_{\mathbf{p}_i,r_i}

\text{Constraint: } I=\bigcup_{i=1}^m D^I_{\mathbf{p}_i,r_i}

A trivial solution

Select pixels as medial points (disks of radius 1).

Perfect reconstruction quality!

Not very useful in practice...

Goal: balance between sparsity and reconstruction

  • Dense representation: low reconstruction error

  • Sparse representation: high reconstruction error

Favor the selection of larger disks...

Increasing \( w \)

Add regularization term to disk cost: \( c_{\mathbf{p},r} = e_{\mathbf{p},r} + \orange{w}(\frac{1}{r}) \).

...as long as they do not incur a high reconstruction error

AMAT is a weighted geometric set cover problem

WGSC is NP-hard!

PTAS exist

  • Set we want to cover: the 2D image

  • Covering elements (range): disks of radii \(\{1,\ldots,R\}\)

  • Set costs: \( c_{\mathbf{p},r} \)

Greedy algorithm (see the sketch below)

  1. Compute all costs \( c_{\mathbf{p},r} \).

  2. While image has not been completely covered:

    • Select disk \( D_{\mathbf{p^*},r^*} \) with lowest cost.    

    • Add point \( (\mathbf{p^*},r^*,\mathbf{f}_{\mathbf{p^*},r^*}) \) to the solution.

    • Mark disk pixels as covered.

    • Update costs \( c_{\mathbf{p},r} \)

 

Approximation algorithms, Vijay V. Vazirani
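
A simplified Python sketch of the greedy step, in the textbook cost-per-newly-covered-pixel form of greedy set cover from the Vazirani reference; the actual AMAT implementation updates disk costs incrementally and also stores the per-disk encodings f. It assumes a precomputed cost table and that radius-1 disks exist at every pixel, so the loop always terminates:

import numpy as np

def disk_mask(h, w, p, r):
    """Boolean mask of the disk of radius r centered at p = (row, col)."""
    yy, xx = np.ogrid[:h, :w]
    return (yy - p[0]) ** 2 + (xx - p[1]) ** 2 <= r ** 2

def greedy_amat(costs, shape):
    """costs: dict {((row, col), r): cost}. Returns the selected (p, r) pairs."""
    h, w = shape
    covered = np.zeros(shape, dtype=bool)
    solution = []
    while not covered.all():
        # 1. Pick the disk with the lowest cost per newly covered pixel.
        best, best_score = None, np.inf
        for (p, r), c in costs.items():
            newly_covered = (~covered & disk_mask(h, w, p, r)).sum()
            if newly_covered == 0:
                continue
            score = c / newly_covered
            if score < best_score:
                best, best_score = (p, r), score
        # 2. Add it to the solution and mark its pixels as covered.
        solution.append(best)
        covered |= disk_mask(h, w, *best)
    return solution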

AMAT Demo

More reconstruction results

Input

MIL

GT-seg

GT-skel

AMAT

Grouping points together...

  • spatial proximity
  • smooth scale variation
  • color similarity


Input

AMAT

Groups

(color coded)

...opens up possibilities

Thinning

Segmentation

  • Object proposals

  • and more...

Qualitative results

Input

AMAT

Groups

Reconstruction

Part segmentation

Fully convolutional neural networks

P(person), P(horse), ..., P(dog)

dog

person

Finetune for part segmentation

head

torso

arms

legs

hands

Part segmentation in natural images

Small parts are lost due to downsampling

RGB: 152x152 → L1: 142x142 → L2: 71x71 → L3: 63x63 → L4: 55x55 → L5: 25x25 → L6: 21x21

Extract features at multiple scales

Scale 1x

Scale 1.5x

Scale 2x

Combine with object detector

Towards real-time object detection with region proposal networks, S.Ren et al., NIPS 2015

Find scale that is closest to the network's nominal scale

minimize:\, |\green{h_N}-\red{h_b}| + |\green{w_N}-\red{w_b}|

\(\red{h_b}, \red{w_b}\): bounding box size, adjusted for scale

\(\green{h_N}, \green{w_N}\): default size of the network's input

Use features from the ideal scale
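
A small sketch of the scale-selection rule; the 152x152 nominal size mirrors the input size shown earlier and is only an assumption:

def best_scale(box_h, box_w, scales, nominal_h=152, nominal_w=152):
    """Pick the pyramid scale whose rescaled box best matches the
    network's nominal input size."""
    def mismatch(s):
        return abs(nominal_h - s * box_h) + abs(nominal_w - s * box_w)
    return min(scales, key=mismatch)

# Example: a 60x45 detection, pyramid scales 1x, 1.5x, 2x
print(best_scale(60, 45, [1.0, 1.5, 2.0]))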

Multi-scale analysis improves results...

...and is efficient for images with many objects

Segmenting brain "parts" is also important

Alzheimer's:

structure degeneration

Schizophrenia: volume abnormalities
[Shenton M.E. et al.,  Psychiatry Res. 2002]

Tumors: avoid radiation on sensitive regions
[Hoehn D. et al., Journal of Medical Cases, 2012]

Why automatic segmentation?

Putamen

Ventricle

Caudate

Amygdala

Hippocampus

Visualization and inspection

No need for manual annotation
(time consuming, need experts,
limited reproducibility)

Non-invasive diagnosis and treatment

Intensity in MRI is not enough

Spatial arrangement patterns matter

Segmenting subcortical structures in 2D MRI with FCNNs

P(thalamus), P(putamen), P(caudate), ..., P(white matter)

2D slice

thalamus

white matter

From 2D slice to 3D volume segmentation

CNN architecture

  • 16 layers including max-pooling and dropout.

  • Dilated convolutions for higher resolution.

  • Compact architecture (~4GB GPU RAM)

MRF enforces volume homogeneity

S^{*}=\text{argmin}_S E(S) = \sum_{i\in\mathcal{V}}\green{U_i} + \lambda\sum_{(i,j)\in\mathcal{E}}\orange{P_{ij}}

\(\green{U_i} = f(\text{CNN output})\): unary term at voxel \(i\)

\(\orange{P_{ij}} = d(\text{intensities})\): pairwise term for neighbouring voxels \(i, j\)
Solve with \(\alpha\)-expansion
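
A sketch of evaluating this energy for a candidate labeling on a 4-connected grid; the exact unary and pairwise functions below (negative log CNN probabilities, contrast-sensitive Potts) are illustrative assumptions, not necessarily the ones used in the actual model:

import numpy as np

def mrf_energy(labels, cnn_probs, intensities, lam=1.0, sigma=10.0):
    """E(S) = sum_i U_i + lambda * sum_(i,j) P_ij on a 4-connected 2D grid.
    labels: (H, W) int labeling; cnn_probs: (H, W, K) softmax output."""
    h, w = labels.shape
    # Unary: negative log-probability of the chosen label, from the CNN.
    unary = -np.log(np.take_along_axis(cnn_probs, labels[..., None], axis=-1) + 1e-10).sum()
    # Pairwise: contrast-sensitive Potts penalty for neighbours with different labels.
    pairwise = 0.0
    for dy, dx in [(0, 1), (1, 0)]:
        a, b = labels[:h - dy, :w - dx], labels[dy:, dx:]
        diff = intensities[:h - dy, :w - dx] - intensities[dy:, dx:]
        weight = np.exp(-(diff ** 2) / (2 * sigma ** 2))
        pairwise += (weight * (a != b)).sum()
    return unary + lam * pairwise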

MRF removes spurious responses

CNN

CNN+MRF

3D segmentation results

Our results

Groundtruth

Deep priors for coregistration and cosegmentation

Future work

Recognition in line drawings

Chair

Monitor

Basket

Office

Intuitive form of communication

Sketch-based image retrieval

3D models from sketches

Grouping is key

Smart scribbles for sketch segmentation, Noris et al., Computer Graphics Forum 2012

Example-based sketch segmentation and labelling using CRFs, Schneider et al., TOG 2016

Gestalt grouping principles

Proximity

Parallelism

Continuity

Closure

Learn to group from synthetic data

Use CNN to extract point embeddings

\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3

||\mathbf{e}_1 - \mathbf{e}_3|| \approx 0

||\mathbf{e}_1 - \mathbf{e}_2|| \gg 0

Points on the same shape have similar embeddings

Points on different shapes have dissimilar embeddings

Cluster embeddings to obtain groups
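
One possible way to realize this, sketched in Python: a contrastive-style loss on embedding pairs, followed by clustering. The margin value and the use of k-means (via scikit-learn) are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

def contrastive_loss(e_i, e_j, same_shape, margin=1.0):
    """Pull embeddings together if the two points lie on the same shape,
    push them at least `margin` apart otherwise."""
    d = np.linalg.norm(e_i - e_j)
    return d ** 2 if same_shape else max(0.0, margin - d) ** 2

def group_points(embeddings, n_groups):
    """Cluster per-point embeddings to obtain groups (k-means is one option)."""
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(embeddings)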

RNNs for shape embeddings

Points \(\mathbf{p}_1,\ldots,\mathbf{p}_5\) are fed sequentially into a recurrent cell N, producing hidden states \(\mathbf{h}_1,\ldots,\mathbf{h}_5\).

triangle

square

circle

  1. grouping

  2. classification
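
A toy PyTorch sketch of such a recurrent model over stroke points; the GRU cell, layer sizes, and classifying from the final hidden state are placeholder choices:

import torch
import torch.nn as nn

class ShapeRNN(nn.Module):
    """Per-point hidden states plus a shape class from the final state."""
    def __init__(self, hidden=64, n_classes=3):   # triangle / square / circle
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, points):                     # points: (B, T, 2)
        states, last = self.rnn(points)            # states: (B, T, hidden)
        logits = self.classifier(last.squeeze(0))  # class from the final state
        return states, logits

model = ShapeRNN()
states, logits = model(torch.randn(1, 5, 2))       # p_1 ... p_5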

Application to complex scenes?

Edge detector result

Boundaries and medial axes are dual representations of objects

edge loss \(l_e\)

skeleton loss \(l_s\)

Edge detection network

Skeleton detection network

But they are usually extracted independently

Exploit duality to jointly learn boundaries and skeletons

edge loss \(l_e\)

skeleton loss \(l_s\)

consistency loss \(l_c\)

L_{total} = l_e + l_s + l_c
  • Single network (more efficient)

  • Joint optimization should improve accuracy
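
A sketch of the combined objective in PyTorch; the consistency term below (penalizing pixels where the boundary and skeleton maps both respond strongly) is only one possible instantiation, not the proposed formulation:

import torch
import torch.nn.functional as F

def total_loss(edge_logits, skel_logits, edge_gt, skel_gt, w_c=1.0):
    l_e = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)   # edge loss
    l_s = F.binary_cross_entropy_with_logits(skel_logits, skel_gt)   # skeleton loss
    # Illustrative consistency term: boundaries and medial axes should not coincide.
    l_c = (torch.sigmoid(edge_logits) * torch.sigmoid(skel_logits)).mean()
    return l_e + l_s + w_c * l_c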

Autoencoders for image patches

Input patch \(P\)

Reconstruction \(\tilde P\)

Reconstruction loss \(L(P, \tilde{P})\)

Self supervised task

encoder

decoder
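
A minimal convolutional autoencoder for small patches with an L2 reconstruction loss; layer sizes are arbitrary:

import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for small RGB patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, patch):
        return self.decoder(self.encoder(patch))

model = PatchAutoencoder()
patch = torch.rand(8, 3, 32, 32)                     # a batch of 32x32 patches
recon = model(patch)
loss = nn.functional.mse_loss(recon, patch)          # reconstruction loss L(P, P~)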

Learn homogeneity for textured patches from segmentation data

Rank loss: L(     ,     ) < L(     ,     )

high homogeneity

low homogeneity
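
A sketch of a margin-based version of this rank loss; the margin value and the per-patch MSE loss are assumptions:

import torch.nn.functional as F

def homogeneity_rank_loss(model, patch_homogeneous, patch_inhomogeneous, margin=0.1):
    """Encourage: L(P_high, P_high~) + margin < L(P_low, P_low~), i.e. highly
    homogeneous patches should be reconstructed better than inhomogeneous ones."""
    l_high = F.mse_loss(model(patch_homogeneous), patch_homogeneous)
    l_low = F.mse_loss(model(patch_inhomogeneous), patch_inhomogeneous)
    return F.relu(l_high - l_low + margin)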

Applications

Painterly rendering

Interactive segmentation

Constrained image editing

Acknowledgements

Mahsa Shakeri

Enzo Ferrante

Siddhartha Chandra

Eduard Trulls

P.A. Savalle

George Papandreou

Sven Dickinson

Nikos Paragios

Iasonas Kokkinos

Andrea Vedaldi

Thank you for your attention!

http://tsogkas.github.io/

Symmetry

Medical imaging

Segmentation and parts
