Multi-view CNNs for 3D shape recognition

Presented by: Elaheh Barati

elaheh@wayne.edu

Wayne State University

Multi-view CNN (MVCNN) for shape recognition

Hang Su, Subhransu, Maji Evangelos, Kalogerakis Erik, Learned-Miller

University of Massachusetts, Amherst

"Multi-view Convolutional Neural Networks for 3D shape Recognition"

"Proceedings of the IEEE international conference on computer vision pp. 945-953. 2015."

3D shapes can be represented with:

descriptors operating on 3D formats
view-based descriptors

3D shapes can be represented with:

descriptors operating on 3D formats
view-based descriptors

Recognizing 3D shapes from a collection of their rendered views on 2D images

2D representations

3D representations

3D representation:

Contains all of the information about an object
In order to use a voxel-based representation in a deep network, the resolution needs to be significantly reduced.
For example:
- 3D ShapeNets use a coarse representation of shape, a 30×30×30 grid of binary voxels.

[1] Wu, Zhirong, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. "3d shapenets: A deep representation for volumetric shapes." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912-1920. 2015.

[1]

2D representation:

A single projection of the 3D model of the same input size corresponds to an image of 164×164 pixels, or slightly smaller if multiple projections are used.

Contains all of the information about an object
In order to use a voxel-based representation in a deep network, the resolution needs to be significantly reduced.
For example:
- 3D ShapeNets use a coarse representation of shape, a 30×30×30 grid of binary voxels.

Trade-off

Increasing the amount of explicit depth information (3D models)

Increasing spatial resolution

(projected 2D models)

2D 3D

using 2D representations:

Leverage advances in image descriptors
Use massive image databases (ImageNet) to pre-train CNN architectures

2D 3D

Learn a good deal about generic features for 2D image categorization

Fine-tune to specifics about 3D model projections

This paper presents:

Classifying views independently
- Using existing 2D image features directly and producing a descriptor for each view
- It works remarkably well
Generating compact object descriptors
- using multi-view CNN

Multi-view CNN

Informative for classification and retrieval
Facilitates efficient retrieval using either a similar 3D object or a simple hand-drawn sketch
State-of-the-art results on:
- 3D object classification
- 3D object retrieval using 3D objects
- 3D object retrieval using sketches
- Sketch classification

View-based descriptors for 3D shapes:

Trainable
Produce informative representations for recognition and retrieval tasks
Efficient to compute

Multi-view Representation

Give an Input 3D Object

Multi-view Representation

Generating multi-views of the 3D object by

rendering engine

Multi-view Representation

Generating multi-views of the 3D object by

rendering engine

Multi-view Representation

Generating multi-views of the 3D object by

rendering engine

view N

Approaches to using multiple views

Average the individual descriptors

treating all the views as equally important

Generate a 2D image descriptor per each view

Use the individual descriptors directly for recognition tasks based on some voting or alignment scheme

Concatenate the 2D descriptors of all the views.

Method 1:

Approaches to using multiple views

An aggregated representation combining features from multiple views

Method 2:

Approaches to using multiple views

Combine information from multiple views using a unified CNN architecture that includes a view-pooling layer

Method 2:

An aggregated representation combining features from multiple views

Multi-view Representation

Multi-view Representation (Input Generation)

3D models in online databases are stored as polygon meshes:
- collections of points connected with edges forming faces
Generate rendered views of polygon meshes using the Phong reflection model

[1] B. T. Phong. Illumination for computer generated pictures. Commun. ACM, 18(6), 1975.

[1]

Multi-view Representation (Input Generation)

Phong reflection model

Approximate surface color as sum of three components:
- an ideal diffuse component
- a glossy/blurred specular component
- an ambient term

The mesh polygons are rendered under a perspective projection
The pixel color is determined by interpolating the reflected intensity of the polygon vertices
Shapes are uniformly scaled to fit into the viewing volume

Multi-view Representation (Input Generation)

Setup viewpoints (virtual cameras) for rendering each mesh:

1st camera setup

Assumption:
- The input shapes are upright oriented along a consistent axis (e.g., z-axis).
Create 12 rendered views by placing 12 virtual cameras around the mesh every 30 degrees
The cameras are elevated 30 degrees from the ground plane, pointing towards the centroid of the mesh
The centroid is calculated as the weighted average of the mesh face centers, where the weights are the face areas

Multi-view Representation (Input Generation)

Setup viewpoints (virtual cameras) for rendering each mesh:

2nd camera setup

Render from several more viewpoints
The renderings are generated by placing 20 virtual cameras at the 20 vertices of an icosahedron enclosing the shape.
All cameras point towards the centroid of the mesh.
Generate 4 rendered views from each camera, using 0, 90, 180, 270 degrees rotation along the axis passing through the camera and the object centroid, yielding total 80 views.

Multi-view Representation (Recognition)

The most straightforward approach to utilizing the multi-view representation:

Using existing 2D image features directly
Produce a descriptor for each view

It results in multiple 2D image descriptors per 3D shape, one per view

Need to be integrated for recognition tasks

Two Types of Image Descriptors:

image descriptor based on Fisher vectors with multiscale SIFT
CNN activation features

Multi-view Representation (Recognition)

[1]

[2]

[1] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. 2013.

[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.

The Fisher vector image descriptor:

Implemented using VLFeat
For each image:
1. multi-scale SIFT descriptors are extracted densely.
2. descriptors are projected to 80 dimensions with PCA
3. Fisher vector pooling with
  1. a Gaussian mixture model with 64 components,
  2. Square-root
  3. ℓ2 normalization

Multi-view Representation (Recognition)

[1]

[1] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www. vlfeat.org/, 2008.

CNN features:

VGG-M network :
- Five convolutional layers conv1,...,5
- Three fully connected layers fc6,...,8
- a softmax classification layer
- The layer fc7 (after ReLU non-linearity, 4096-dimensional) is used as image descriptor
- The network is pre-trained on ImageNet images from 1k categories
- fine-tuned on all 2D views of the 3D shapes in training set

Multi-view Representation (Recognition)

[1]

[1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC, 2014.

Classification

Train one-vs-rest linear SVMs to classify shapes using their image features
- each view is treated as a separate training sample
At test time
- Sum up the SVM decision values over all 12 views
- Return the class with the highest sum

Multi-view Representation (Recognition)

Retrieval

The distance between shape x and y is defined in:

Multi-view Representation (Recognition)

A distance or similarity measure is required for retrieval tasks

image descriptors

Multi-view CNN: Learning to Aggregate Views

Share the same parameters

Multi-view CNN: Learning to Aggregate Views

element-wise maximum pooling across the views

Multi-view CNN: Learning to Aggregate Views

Produce shape descriptors

Multi-view CNN: Learning to Aggregate Views

Fine-tuned using Stochastic gradient descent with back-propagation

Dataset

Evaluated on the Princeton ModelNet dataset
ModelNet currently contains 127,915 3D CAD models from 662 categories.
ModelNet40: a 40-class well-annotated subset containing 12,311 shapes from 40 common categories

[1] The Princeton ModelNet. http://modelnet.cs. princeton.edu/.

[1]

3D Shape Classification and Retrieval

Baselines

3D ShapeNets
Spherical Harmonic descriptor (SPH)
LightField descriptor (LFD)
Fisher vectors

Classification and retrieval results on the ModelNet40 dataset

Saliency map among views

Rank pixels in the 2D views w.r.t. their influence on the output score Fc of the network for the class c

a set of K 2D views

The saliency maps are computed by back-propagating the gradients of the class score onto the image via the view-pooling layer.

Examples of saliency maps

Top three views with the highest saliency

Sketch Recognition

Whether aggregating multiple views of a 2D image also improve performance?

Sketch Recognition (Jittering revisited)

Data jittering, or data augmentation:

Generate extra samples from a given image

the process of perturbing the image by transformations that change its appearance while leaving the high-level information (class, label, attributes, etc.) intact

Applied at training time to augment training samples and to reduce overfitting
Applied at test time to provide more robust predictions

Sketch Recognition (Jittering revisited)

Data jittering improves the performance of deep representations on 2D image classification tasks

includes random image translations (implemented as random crops), horizontal reflections, and color perturbations

only includes a few crops (e.g., four at the corners, one at the center and their horizontal reflections)

training

test

Sketch Recognition (Jittering revisited)

20,000 hand-drawn sketches of 250 object categories such as airplanes, apples, bridges, etc

[1]

[1] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph., 31(4):44:1–44:10, 2012.

human sketch dataset

The cleaned dataset (SketchClean) contains 160 categories

Sketch Recognition (Jittering revisited)

To get multiple views from 2D images, jittering is used to mimic the effect of views.

For each hand-drawn sketch
- in-plane rotation with three angles: −45°, 0°, 45°
- horizontal reflections (hence 6 samples per image).

Sketch Recognition