Convolutional Patch Representations for Image Retrieval: an Unsupervised Approach

Mattis Paulin · Julien Mairal · Matthijs Douze · Zaid Harchaoui · Florent Perronnin · Cordelia Schmid

Presented by:

Saeid Balaneshinkordan

Contributions:

1. Patch descriptor:
  • based on a CKN architecture,
  • uses a fast and simple stochastic procedure,
  • computes an explicit feature embedding.

2. Generated dataset: “RomePatches”

3. Unsupervised patch-level descriptors

Related work:

  • shallow patch descriptors
  • image retrieval based on deep learning
  • patch description based on deep learning

Instance-level Recognition

ref: https://www.robots.ox.ac.uk/~vgg/practicals/instance-recognition/index.html

Match (recognize) a specific object or scene

Instance-level Recognition


The object is recognized despite changes in:

  • scale,
  • camera viewpoint,
  • illumination conditions and
  • partial occlusion.

Three steps in instance-level retrieval systems:

1) interest point detection: select key points that are reproducible under scale and viewpoint changes

2) description: should be robust to viewing conditions

3) matching: define a suitable metric between two patch sets

Instance-level Recognition


application: image retrieval

  1. starting from an image (the query),
  2. search through an image dataset and
  3. retrieve those images that contain the target object.
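The three steps above can be sketched as a nearest-neighbor search, assuming each database image has already been aggregated into a fixed-length global descriptor (the data and the `retrieve` helper here are illustrative, not the paper's pipeline):

```python
import numpy as np

def retrieve(query_desc, db_descs, top_k=3):
    """Rank database images by L2 distance between global descriptors."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(0)
db = rng.standard_normal((10, 64))               # 10 image descriptors
query = db[4] + 0.01 * rng.standard_normal(64)   # near-duplicate of image 4
print(retrieve(query, db))                       # image 4 should rank first
```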

Local Image Patches

ref: http://www.cs.toronto.edu/~kyros/courses/320/Lectures.2013s/lecture.2013s.03.pdf

Image structures can be analyzed at:

  • a local level (e.g., small groups of nearby pixels) or
  • a global one (e.g., the entire image)

Local Image Patches


Perceptually-significant local structures:

  • corner
  • edge
  • uniform texture
  • single surface

 Image Retrieval Pipeline

extract interest points

encode in descriptor space

 aggregate into a compact representation


Image Retrieval Pipeline:

Interest Point Detection

Interest Point Description

Patch Matching

Keypoint Detection

Method: Hessian affine region detector

affine-invariant detector

preprocessing step to detect interest points

ref: http://www.mathworks.com/discovery/affine-transformation.html

Matching points in two different images (keypoint detection):

  1. extracting salient points
  2. rectifying the affine shape
  3. normalizing rotation
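The affine-rectification step can be sketched with a toy second-moment matrix: the detector maps the elliptical region x᷀ᵀ M x = 1 to a circle by whitening with A = M^(−1/2); rotation stays ambiguous, hence the separate rotation-normalization step. A numpy sketch (the matrix values are made up):

```python
import numpy as np

def inv_sqrt(M):
    """Whitening transform A = M^(-1/2) for a 2x2 SPD second-moment matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

M = np.array([[4.0, 1.0],
              [1.0, 2.0]])   # elongated elliptical region (toy values)
A = inv_sqrt(M)
# After rectification, the region's second-moment matrix is the identity:
print(A @ M @ A.T)
```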

Image Retrieval Pipeline:

Interest Point Detection

Interest Point Description

Patch Matching

Image Retrieval Pipeline:

Interest Point Description

robust to the perturbations that are not covered by the detector (lighting changes, small rotations, blur,...)

normalized patch: mapping the affine region to a fixed-size square

feature representation φ(M) in a Euclidean space

  1. Excellent: with large amounts of labeled visual data
  2. Moderate: in unsupervised tasks such as image retrieval

Convolutional neural networks (CNNs)

image classification

image retrieval

modelling natural images

handles local stationary structures

multi-scale

fashion

Performance:

image-level descriptors:

Features output by a CNN’s intermediate layers

Is it possible to derive patch-level descriptors from architectures designed for image-level descriptors?

image-level descriptors 

outputs of the penultimate layer

patch-level descriptors 

outputs of previous layers (typically the 4th one) 

Convolutional neural networks (CNNs)

tend to encode more task-independent information

tend to be similar regardless of the task, the objective function or the level of supervision

is supervised learning required to make good local convolutional features for patch matching and image retrieval?

earlier layers:

filters learned by the first layer:

is supervised learning required?

Convolutional neural networks (CNNs)

Convolutional Descriptors

use convolutional features

CNNs are normally trained with class supervision for a classification task.

To extend to image retrieval, i.e., to encode fixed-size image patches (51 × 51 pixels):

  • encoding local descriptors with a model that has been trained for an unrelated image classification task
  • devising a surrogate classification problem that is as related as possible to image retrieval
  • using unsupervised learning, such as a convolutional kernel network

Convolutional neural networks (CNNs)

two successive layers of a CNN

Convolutional Neural Networks

image

matrices corresponding to linear operations

pointwise non-linear functions

down-sampling operation (feature pooling)
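The three per-layer operations above (linear filtering, pointwise non-linearity, pooling) can be sketched in numpy; the filter and input here are toy values, not the paper's architecture:

```python
import numpy as np

def conv2d_valid(img, filt):
    """Linear operation: 'valid' 2-D cross-correlation with one filter."""
    fh, fw = filt.shape
    H, W = img.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out

def relu(x):
    """Pointwise non-linear function."""
    return np.maximum(x, 0.0)

def pool2x2(x):
    """Down-sampling operation: 2x2 max pooling."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.random.default_rng(0).standard_normal((8, 8))
filt = np.ones((3, 3)) / 9.0          # toy averaging filter
out = pool2x2(relu(conv2d_valid(img, filt)))
print(out.shape)                      # (3, 3)
```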

Learning from category labels

AlexNet

8 layers

the first five are convolutional

the last three are fully connected

input image size: 224 × 224

Convolutional neural networks (CNNs)

description of image patches without supervision

Approximation procedure:
stochastic gradient optimization 

Deep kernel-based convolutional approach

Convolutional Kernel Networks (CKNs)

CNN - CKN

based on a kernel (feature) map.

data-independent.

to yield a CKN that outputs patch descriptors:

using sub-sampling of patches and stochastic gradient optimization

CNN feature representation:

relies on filters that are learned

Kernel embedding approximation

exact computations are overwhelming

an explicit finite-dimensional embedding to approximate them:
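One classic way to build such an explicit finite-dimensional embedding of a Gaussian kernel is random Fourier features (Rahimi and Recht); note the CKN instead *learns* its embedding by stochastic optimization. A numpy sketch, with dimension and bandwidth chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 16, 4096, 4.0   # input dim, embedding dim, bandwidth (arbitrary)

# phi(x)^T phi(y) approximates exp(-||x - y||^2 / (2 sigma^2))
W = rng.standard_normal((D, d)) / sigma
b = rng.uniform(0, 2 * np.pi, D)

def phi(x):
    """Explicit random-Fourier-feature embedding of the Gaussian kernel."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
approx = phi(x) @ phi(y)
print(abs(exact - approx))    # small approximation error
```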

Multi-layer CKN kernel

two-layer convolutional kernel architecture

Image Retrieval Pipeline:

Interest Point Detection

Interest Point Description

Patch Matching

Image Retrieval Pipeline:

Patch Matching

matching all possible pairs of patches is too expensive

instead:

aggregating patch descriptors into a fixed-length image descriptor, using the VLAD representation 

normalization to the VLAD descriptor: 

  • A power normalization with exponent 0.5
  • An L2 normalization
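A sketch of VLAD aggregation with the two normalizations above, assuming a toy k-means vocabulary (`centroids`) is already given:

```python
import numpy as np

def vlad(descs, centroids):
    """Aggregate local descriptors into a VLAD vector with
    power (exponent 0.5) and L2 normalizations."""
    k, d = centroids.shape
    # Assign each descriptor to its nearest centroid.
    assign = np.argmin(((descs[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    v = np.zeros((k, d))
    for i, c in enumerate(assign):
        v[c] += descs[i] - centroids[c]      # accumulate residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalization, exponent 0.5
    n = np.linalg.norm(v)
    return v / n if n > 0 else v             # L2 normalization

rng = np.random.default_rng(0)
descs = rng.standard_normal((100, 8))        # toy local patch descriptors
centroids = rng.standard_normal((4, 8))      # toy visual vocabulary (k-means)
g = vlad(descs, centroids)
print(g.shape, np.linalg.norm(g))            # fixed-length, unit-norm descriptor
```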

Patch Matching

patches matched

significant changes in lighting,

smaller changes in rotation and skew.

Experiments

Datasets

Patch retrieval

Mikolajczyk Dataset

RomePatches

Image retrieval

RomePatches-Image

Oxford

UKbench and Holidays

Datasets

Patch and image retrieval on the Rome dataset.

Top: examples of matching patches.

Bottom: images of the same bundle, which therefore share the same class for image retrieval.

convolutional architectures for patch retrieval

Thank You!

Convolutional Kernel Networks (CKNs)

Let M and M′ be two patches of size m × m, and let Ω = {1, . . . , m}² be the set of pixel locations.

For z ∈ Ω, p_z (resp. p′_z) is the sub-patch of fixed size extracted from M (resp. M′), centered at location z.

Single-layer kernel:

K(M, M′) = Σ_{z, z′ ∈ Ω} ‖p_z‖ ‖p′_{z′}‖ exp(−‖z − z′‖² / (2β²)) exp(−‖p̃_z − p̃′_{z′}‖² / (2σ²))

where p̃_z = p_z / ‖p_z‖ denotes an L2-normalized sub-patch and β, σ are smoothing parameters of the Gaussian terms.
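A naive numpy evaluation of a single-layer match kernel of this form, on two random patches (the sub-patch size and the β, σ values here are arbitrary; a real CKN approximates this double sum with a learned finite-dimensional embedding rather than evaluating it exactly):

```python
import numpy as np

def single_layer_kernel(M1, M2, sub=3, beta=1.0, sigma=1.0):
    """Naive O(|Omega|^2) evaluation of the single-layer match kernel."""
    m = M1.shape[0]
    r = sub // 2
    locs = [(i, j) for i in range(r, m - r) for j in range(r, m - r)]

    def subpatch(M, z):
        i, j = z
        return M[i - r:i + r + 1, j - r:j + r + 1].ravel()

    K = 0.0
    for z in locs:
        for zp in locs:
            p, q = subpatch(M1, z), subpatch(M2, zp)
            norm_p, norm_q = np.linalg.norm(p), np.linalg.norm(q)
            if norm_p == 0 or norm_q == 0:
                continue
            spatial = np.exp(-((z[0] - zp[0]) ** 2 + (z[1] - zp[1]) ** 2)
                             / (2 * beta ** 2))
            appear = np.exp(-np.sum((p / norm_p - q / norm_q) ** 2)
                            / (2 * sigma ** 2))
            K += norm_p * norm_q * spatial * appear
    return K

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 7))
B = rng.standard_normal((7, 7))
print(single_layer_kernel(A, B))   # symmetric, positive similarity score
```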

learning without supervision with application to matching and instance-level retrieval

(named Patch-CKN) for patch representation, based on convolutional kernel networks

contribution:

convolutional descriptors

patch descriptors

Comparison with state-of-the-art image retrieval results.

Results with * use a Hessian-Affine detector with gravity assumption

 Implementation details

Patch Extraction

CNN Implementation

CKN Learning

Learning from surrogate labels

augment the dataset with perturbed versions of training patches to learn the filters Wk 

use “virtual patches”, obtained as transformations of randomly extracted ones, to fall back to a classification problem

For a set of patches P and a set of transformations T, the dataset consists of all τ(p), (τ, p) ∈ T × P.

PhilippNet:

three convolutional layers and one fully connected layer; takes as input 64×64 patches and produces a 512-dimensional output

Transformed versions of the same patch share the same label, thus defining surrogate classes.
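The surrogate-class construction above can be sketched as follows (the transformation set here, flips and 90° rotations, is a simplified stand-in for the richer perturbations used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplified transformation set T: identity, flips, and 90-degree rotations.
transforms = [
    lambda p: p,
    lambda p: np.fliplr(p),
    lambda p: np.flipud(p),
    lambda p: np.rot90(p, 1),
    lambda p: np.rot90(p, 2),
]

# The set P: randomly extracted patches (toy data here).
patches = [rng.standard_normal((64, 64)) for _ in range(10)]

# Surrogate dataset: all tau(p); transformed versions of patch i share label i.
data = [(tau(p), label) for label, p in enumerate(patches) for tau in transforms]
X, y = zip(*data)
print(len(X), sorted(set(y)))   # 50 samples, 10 surrogate classes
```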

Convolutional neural networks (CNNs)

Patch retrieval

Parametric exploration of CKNs

number of filters

sub-patch size

subsampling factor

Influence of dimensionality reduction on patch retrieval performance
