Learning mid-level representations for computer vision

Stavros Tsogkas

People don't just "see"

other cars

how far?

Don't run them over!!

which traffic light?

Image classification paradigm

\mathbf{f} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix}

Model

"car"

Input

Discriminative representation

Machine learning algorithm

The deep learning revolution

ImageNet top-5 error rate (%)

Object detection 2010-2012: performance plateau

Performance: %mAP (mean average precision)

DPM variants

Object detection 2013

Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014

DPM variants

Deep Learning

Brief history of learning in vision

Digit recognition (CNNs, 1989)

Face detection (Haar features + AdaBoost, 2001)

Object detection

(HOGs + SVMs, 2005-2010)

Image classification (Deep CNNs, 2012)

DL success stories

Image captioning

Semantic segmentation

Edge detection

Object detection

Why now?

More data

Better hardware

Support from industry

Better tools

Practical limitations

3-5 minutes per image annotation + $$$

days/weeks to train

low-powered hardware

not enough data

CNNs can be easily fooled...

"panda" (57.7% confidence) \(+\ \epsilon \cdot\) noise \(=\) "gibbon" (99.3% confidence)

...and do not generalize well

Correctly classified

Incorrectly classified

Original images

Negative images

CNNs favour appearance over shape

Zebra or an elephant with stripes?

People learn general rules instead

What makes a table, a table?

flat surface

vertical support

Constellation table by Fulo

What are mid-level representations?

Low-level: Edges

  • class agnostic
  • not discriminative

High-level: person segmentation

  • class specific
  • not generalizable

Mid-level:

  • discriminative and robust

  • shareable across object categories

  • simpler to model, scalable

Mid-level representations include

textures

object parts

symmetries

A "vocabulary" for images

"wheel"

"sand"

  • easier to model

  • shareable

"rotational symmetry"

Unsupervised feature learning

X = (       ,       )

Y = 3

Context prediction

Doersch et al., 2015

Jigsaw puzzles

Noroozi and Favaro, 2016


Results comparable to supervised methods!

Outline

1. Medial axes

       [ECCV 2012, ICCV 2017]

2. Object parts

3. Future research

Learning mid-level representations for shape and texture.

head

torso

arms

legs

hands

[arXiv 2016, ISBI 2016, MICCAI 2016]

Medial axis detection

Symmetry is everywhere

Global symmetry is unstable

But local symmetry is more robust

Medial Axis Transform (MAT)

A transformation for extracting new descriptors of shape, H. Blum, Models for the perception of speech and visual form, 1967


MAT applications

Shape matching and recognition

Shape simplification

Shape deformation with volume preservation

MAT for natural images is not obvious

So let's learn it from data!

Image from BSDS300

Ground-truth segmentation

Ground-truth skeleton

Medial point detection: binary classification problem

Features designed for bilateral symmetry of image regions

\(s\): scale

\(\theta\): orientation

Compute colour and texture histograms \(h_1, h_2, h_3\) for the "inside" and "outside" rectangles (regions 1, 2, 3).

High \(\chi^2\) histogram distance (\(\gg 0\)) for both inside–outside pairs: high chance of symmetry!

At a second candidate point (regions 4, 5, 6): \(\chi^2(h_4,h_5) \sim \chi^2(h_4,h_6) \approx 0\): low chance of symmetry!

\mathbf{f_{\mathbf{x}}}(s_i,\theta_j) = [\chi^2_{12}, \chi^2_{13},\chi^2_{23},\ldots]
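
A rough numpy sketch of assembling such pairwise \(\chi^2\) features (histogram extraction and the region layout are simplified here, and texture histograms are omitted):

import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def color_histogram(patch, bins=16):
    """Per-channel histogram (values assumed in [0, 1]), concatenated and L1-normalized."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 1))[0]
             for c in range(patch.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / (h.sum() + 1e-10)

def symmetry_features(regions):
    """regions: list of patches (the inside/outside rectangles at one scale s
    and orientation theta). Returns all pairwise chi-squared distances."""
    hists = [color_histogram(r) for r in regions]
    return np.array([chi2_distance(hists[i], hists[j])
                     for i in range(len(hists))
                     for j in range(i + 1, len(hists))])  # [chi2_12, chi2_13, chi2_23]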

Train with multiple instance learning

Symmetry probability:

p_{\mathbf{x}}(s_i,\theta_j) = \frac{1}{1+e^{-\mathbf{w}^T\mathbf{f_x}(s_i,\theta_j)}}

Goal: learn w

Challenge: no ground truth annotation for scale and orientation

Bag of instances for point \(\mathbf{x}\): \(\mathbf{f_x}(s_1,\theta_1),\ \mathbf{f_x}(s_2,\theta_1),\ \mathbf{f_x}(s_3,\theta_1),\ \mathbf{f_x}(s_4,\theta_1)\)

The bag is positive if at least one instance is positive

Bag of instances for point \(\mathbf{y}\): \(\mathbf{f_y}(s_1,\theta_2),\ \mathbf{f_y}(s_2,\theta_2),\ \mathbf{f_y}(s_3,\theta_2),\ \mathbf{f_y}(s_4,\theta_2)\)

The bag is negative if all instances are negative

P_{\mathbf{x}} = 1-\prod_i \prod_j (1-p_{\mathbf{x}}(s_i,\theta_j)) \sim \max_i \max_j p_{\mathbf{x}}(s_i, \theta_j)

Noisy-OR = differentiable "max"
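
A minimal numpy sketch of the noisy-OR bag probability over per-instance sigmoid scores (feature dimensions and the weight vector below are placeholders):

import numpy as np

def instance_probs(w, feats):
    """Sigmoid probability for every (scale, orientation) instance.
    feats: array of shape (n_scales, n_orients, n_features)."""
    return 1.0 / (1.0 + np.exp(-feats @ w))

def noisy_or(p):
    """Bag probability: 1 - prod(1 - p), a differentiable surrogate for max."""
    return 1.0 - np.prod(1.0 - p)

# Toy example: the bag is positive if at least one instance fires.
feats = np.random.rand(4, 8, 10)   # 4 scales, 8 orientations, 10-D features
w = np.random.randn(10)
P_x = noisy_or(instance_probs(w, feats))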

Dense feature extraction

Multiple scales

Multiple orientations

\theta_1, \theta_2, \theta_3

Computing symmetry probabilities

Orientation

Scale

Symmetry probability

Non-maximum suppression

Fast detection with decision forests

Symmetry "tokens"

Clustering

p_{medial} = \displaystyle\sum_{t=1}^{N_{tokens}} p_{t} = 1-p_{bg}

~0.5 sec per image

(40-60x faster than MIL)
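
A toy sketch of recovering the medial-point probability from per-pixel token probabilities (the 7-way distribution below is synthetic, standing in for the forest's output):

import numpy as np

# probs: (H, W, N_tokens + 1) per-pixel distribution; last channel = background.
probs = np.random.dirichlet(np.ones(7), size=(64, 64))

p_background = probs[..., -1]
p_medial = probs[..., :-1].sum(axis=-1)   # sum over symmetry tokens
assert np.allclose(p_medial, 1.0 - p_background)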

MAT should be invertible

MAT^{-1}

Generative definition of medial disks

  • Medial disk \(D^I_{\mathbf{p}_i,r_i}\): the image patch covered by a disk centered at \(\mathbf{p}_i\) with radius \(r_i\) (similarly \(D^I_{\mathbf{p}_j,r_j}\)).
  • f: summarizes the patch (encoding), e.g. its mean colour \([\bar{R}_i,\bar{G}_i,\bar{B}_i]\).
  • g: reconstructs the patch (decoding), producing \(\tilde{D}^I_{\mathbf{p}_i,r_i}\).
  • Reconstruction error: \(e(D^I_{\mathbf{p},r}, \tilde{D}^I_{\mathbf{p},r}) \approx 0\) when the encoding summarizes the disk well, \(\gg 0\) otherwise.

AppearanceMAT definition

Compute \(e_{\mathbf{p},r}\) for all \(\mathbf{p}, r\), comparing each disk \(D^I_{\mathbf{p},r}\) with its reconstruction \((g \circ f)(D^I_{\mathbf{p},r})\).

\text{Objective: } \min_{\mathbf{p},r} \sum_{i=1}^m e_{\mathbf{p}_i,r_i}

\text{Constraint: } I=\bigcup_{i=1}^m D^I_{\mathbf{p}_i,r_i}

A trivial solution

Select pixels as medial points (disks of radius 1).

Perfect reconstruction quality!

Not very useful in practice...

Goal: balance between sparsity and reconstruction

  • Dense representation: low reconstruction error

  • Sparse representation: high reconstruction error

Favor the selection of larger disks...

Increasing \( w \)

Add regularization term to disk cost: \( c_{\mathbf{p},r} = e_{\mathbf{p},r} + \orange{w}(\frac{1}{r}) \).

...as long as they do not incur a high reconstruction error

AMAT is a weighted geometric set cover problem

WGSC is NP-hard!

PTAS exist

  • Set we want to cover: the 2D image

  • Covering elements (range): disks of radii \(\{1,\ldots,R\}\)

  • Set costs: \( c_{\mathbf{p},r} \)

Greedy algorithm (see the sketch below)

  1. Compute all costs \( c_{\mathbf{p},r} \).

  2. While image has not been completely covered:

    • Select disk \( D_{\mathbf{p^*},r^*} \) with lowest cost.    

    • Add point \( (\mathbf{p^*},r^*,\mathbf{f}_{\mathbf{p^*},r^*}) \) to the solution.

    • Mark disk pixels as covered.

    • Update costs \( c_{\mathbf{p},r} \)

 

Approximation algorithms, Vijay V. Vazirani
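
A simplified Python sketch of the greedy step, in the textbook cost-per-newly-covered-pixel form of greedy set cover from the Vazirani reference; the actual AMAT implementation updates disk costs incrementally and also stores the per-disk encodings f. It assumes a precomputed cost table and that radius-1 disks exist at every pixel, so the loop always terminates:

import numpy as np

def disk_mask(h, w, p, r):
    """Boolean mask of the disk of radius r centered at p = (row, col)."""
    yy, xx = np.ogrid[:h, :w]
    return (yy - p[0]) ** 2 + (xx - p[1]) ** 2 <= r ** 2

def greedy_amat(costs, shape):
    """costs: dict {((row, col), r): cost}. Returns the selected (p, r) pairs."""
    h, w = shape
    covered = np.zeros(shape, dtype=bool)
    solution = []
    while not covered.all():
        # 1. Pick the disk with the lowest cost per newly covered pixel.
        best, best_score = None, np.inf
        for (p, r), c in costs.items():
            newly_covered = (~covered & disk_mask(h, w, p, r)).sum()
            if newly_covered == 0:
                continue
            score = c / newly_covered
            if score < best_score:
                best, best_score = (p, r), score
        # 2. Add it to the solution and mark its pixels as covered.
        solution.append(best)
        covered |= disk_mask(h, w, *best)
    return solution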

AMAT Demo

More reconstruction results

Input

MIL

GT-seg

GT-skel

AMAT

Grouping points together...

  • spatial proximity
  • smooth scale variation
  • color similarity


Input

AMAT

Groups

(color coded)

...opens up possibilities

Thinning

Segmentation

  • Object proposals

  • and more...

Qualitative results

Input

AMAT

Groups

Reconstruction

Part segmentation

Fully convolutional neural networks

P(person), P(horse), ..., P(dog)

dog

person

Finetune for part segmentation

head

torso

arms

legs

hands

Part segmentation in natural images

Small parts are lost due to downsampling

RGB: 152x152 → L1: 142x142 → L2: 71x71 → L3: 63x63 → L4: 55x55 → L5: 25x25 → L6: 21x21

Extract features at multiple scales

Scale 1x

Scale 1.5x

Scale 2x

Combine with object detector

Towards real-time object detection with region proposal networks, S.Ren et al., NIPS 2015

Find scale that is closest to the network's nominal scale

minimize:\, |\green{h_N}-\red{h_b}| + |\green{w_N}-\red{w_b}|

\(\red{h_b}, \red{w_b}\): bounding box size, adjusted for scale

\(\green{h_N}, \green{w_N}\): default size of the network's input

Use features from the ideal scale
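
A small sketch of the scale-selection rule; the 152x152 nominal size mirrors the input size shown earlier and is only an assumption:

def best_scale(box_h, box_w, scales, nominal_h=152, nominal_w=152):
    """Pick the pyramid scale whose rescaled box best matches the
    network's nominal input size."""
    def mismatch(s):
        return abs(nominal_h - s * box_h) + abs(nominal_w - s * box_w)
    return min(scales, key=mismatch)

# Example: a 60x45 detection, pyramid scales 1x, 1.5x, 2x
print(best_scale(60, 45, [1.0, 1.5, 2.0]))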

Multi-scale analysis improves results...

...and is efficient for images with many objects

Segmenting brain "parts" is also important

Alzheimer's:

structure degeneration

Schizophrenia: volume abnormalities
[Shenton M.E. et al.,  Psychiatry Res. 2002]

Tumors: avoid radiation on sensitive regions
[Hoehn D. et al., Journal of Medical Cases, 2012]

Why automatic segmentation?

Putamen

Ventricle

Caudate

Amygdala

Hippocampus

Visualization and inspection

No need for manual annotation
(time consuming, need experts,
limited reproducibility)

Non-invasive diagnosis and treatment

Intensity in MRI is not enough

Spatial arrangement patterns matter

Segmenting subcortical structures in 2D MRI with FCNNs

P(thalamus), P(putamen), P(caudate), ..., P(white matter)

2D slice

thalamus

white matter

From 2D slice to 3D volume segmentation

CNN architecture

  • 16 layers including max-pooling and dropout.

  • Dilated convolutions for higher resolution.

  • Compact architecture (~4GB GPU RAM)

MRF enforces volume homogeneity

S^{*}=\text{argmin}_S E(S) = \sum_{i\in\mathcal{V}}\green{U_i} + \lambda\sum_{(i,j)\in\mathcal{E}}\orange{P_{ij}}

\(\green{U_i} = f(\text{CNN output})\): unary term at voxel \(i\)

\(\orange{P_{ij}} = d(\text{intensities})\): pairwise term for neighbouring voxels \(i, j\)
Solve with \(\alpha\)-expansion
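
A sketch of evaluating this energy for a candidate labeling on a 4-connected grid; the exact unary and pairwise functions below (negative log CNN probabilities, contrast-sensitive Potts) are illustrative assumptions, not necessarily the ones used in the actual model:

import numpy as np

def mrf_energy(labels, cnn_probs, intensities, lam=1.0, sigma=10.0):
    """E(S) = sum_i U_i + lambda * sum_(i,j) P_ij on a 4-connected 2D grid.
    labels: (H, W) int labeling; cnn_probs: (H, W, K) softmax output."""
    h, w = labels.shape
    # Unary: negative log-probability of the chosen label, from the CNN.
    unary = -np.log(np.take_along_axis(cnn_probs, labels[..., None], axis=-1) + 1e-10).sum()
    # Pairwise: contrast-sensitive Potts penalty for neighbours with different labels.
    pairwise = 0.0
    for dy, dx in [(0, 1), (1, 0)]:
        a, b = labels[:h - dy, :w - dx], labels[dy:, dx:]
        diff = intensities[:h - dy, :w - dx] - intensities[dy:, dx:]
        weight = np.exp(-(diff ** 2) / (2 * sigma ** 2))
        pairwise += (weight * (a != b)).sum()
    return unary + lam * pairwise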

MRF removes spurious responses

CNN

CNN+MRF

3D segmentation results

Our results

Groundtruth

Deep priors for coregistration and cosegmentation

Future work

Recognition in line drawings

Chair

Monitor

Basket

Office

Intuitive form of communication

Sketch-based image retrieval

3D models from sketches

Grouping is key

Smart scribbles for sketch segmentation, Noris et al., Computer Graphics Forum 2012

Example-based sketch segmentation and labelling using CRFs, Schneider et al., TOG 2016

Gestalt grouping principles

Proximity

Parallelism

Continuity

Closure

Learn to group from synthetic data

Use CNN to extract point embeddings

\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3

||\mathbf{e}_1 - \mathbf{e}_3|| \approx 0

||\mathbf{e}_1 - \mathbf{e}_2|| \gg 0

Points on the same shape have similar embeddings

Points on different shapes have dissimilar embeddings

Cluster embeddings to obtain groups
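
One possible way to realize this, sketched in Python: a contrastive-style loss on embedding pairs, followed by clustering. The margin value and the use of k-means (via scikit-learn) are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

def contrastive_loss(e_i, e_j, same_shape, margin=1.0):
    """Pull embeddings together if the two points lie on the same shape,
    push them at least `margin` apart otherwise."""
    d = np.linalg.norm(e_i - e_j)
    return d ** 2 if same_shape else max(0.0, margin - d) ** 2

def group_points(embeddings, n_groups):
    """Cluster per-point embeddings to obtain groups (k-means is one option)."""
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(embeddings)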

RNNs for shape embeddings

Points \(\mathbf{p}_1,\ldots,\mathbf{p}_5\) are fed sequentially into a recurrent cell N, producing hidden states \(\mathbf{h}_1,\ldots,\mathbf{h}_5\).

triangle

square

circle

  1. grouping

  2. classification
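
A toy PyTorch sketch of such a recurrent model over stroke points; the GRU cell, layer sizes, and classifying from the final hidden state are placeholder choices:

import torch
import torch.nn as nn

class ShapeRNN(nn.Module):
    """Per-point hidden states plus a shape class from the final state."""
    def __init__(self, hidden=64, n_classes=3):   # triangle / square / circle
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, points):                     # points: (B, T, 2)
        states, last = self.rnn(points)            # states: (B, T, hidden)
        logits = self.classifier(last.squeeze(0))  # class from the final state
        return states, logits

model = ShapeRNN()
states, logits = model(torch.randn(1, 5, 2))       # p_1 ... p_5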

Application to complex scenes?

Edge detector result

Boundaries and medial axes are dual representations of objects

edge loss \(l_e\)

skeleton loss \(l_s\)

Edge detection network

Skeleton detection network

But they are usually extracted independently

Exploit duality to jointly learn boundaries and skeletons

edge loss \(l_e\)

skeleton loss \(l_s\)

consistency loss \(l_c\)

L_{total} = l_e + l_s + l_c
  • Single network (more efficient)

  • Joint optimization should improve accuracy
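
A sketch of the combined objective in PyTorch; the consistency term below (penalizing pixels where the boundary and skeleton maps both respond strongly) is only one possible instantiation, not the proposed formulation:

import torch
import torch.nn.functional as F

def total_loss(edge_logits, skel_logits, edge_gt, skel_gt, w_c=1.0):
    l_e = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)   # edge loss
    l_s = F.binary_cross_entropy_with_logits(skel_logits, skel_gt)   # skeleton loss
    # Illustrative consistency term: boundaries and medial axes should not coincide.
    l_c = (torch.sigmoid(edge_logits) * torch.sigmoid(skel_logits)).mean()
    return l_e + l_s + w_c * l_c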

Autoencoders for image patches

Input patch \(P\)

Reconstruction \(\tilde P\)

Reconstruction loss \(L(P, \tilde{P})\)

Self supervised task

encoder

decoder
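
A minimal convolutional autoencoder for small patches with an L2 reconstruction loss; layer sizes are arbitrary:

import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for small RGB patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, patch):
        return self.decoder(self.encoder(patch))

model = PatchAutoencoder()
patch = torch.rand(8, 3, 32, 32)                     # a batch of 32x32 patches
recon = model(patch)
loss = nn.functional.mse_loss(recon, patch)          # reconstruction loss L(P, P~)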

Learn homogeneity for textured patches from segmentation data

Rank loss: L(     ,     ) < L(     ,     )

high homogeneity

low homogeneity
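
A sketch of a margin-based version of this rank loss; the margin value and the per-patch MSE loss are assumptions:

import torch.nn.functional as F

def homogeneity_rank_loss(model, patch_homogeneous, patch_inhomogeneous, margin=0.1):
    """Encourage: L(P_high, P_high~) + margin < L(P_low, P_low~), i.e. highly
    homogeneous patches should be reconstructed better than inhomogeneous ones."""
    l_high = F.mse_loss(model(patch_homogeneous), patch_homogeneous)
    l_low = F.mse_loss(model(patch_inhomogeneous), patch_inhomogeneous)
    return F.relu(l_high - l_low + margin)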

Applications

Painterly rendering

Interactive segmentation

Constrained image editing

Acknowledgements

Mahsa Shakeri

Enzo Ferrante

Siddhartha Chandra

Eduard Trulls

P.A. Savalle

George Papandreou

Sven Dickinson

Nikos Paragios

Iasonas Kokkinos

Andrea Vedaldi

Thank you for your attention!

http://tsogkas.github.io/

Symmetry

Medical imaging

Segmentation and parts
