Structure-from-Motion

CS6501 - Feb 2015

Michael Holroyd, Ph.D.

Research digitizing objects/spaces

Arqball

Arqball Research

Triangulation

Core problem in computer vision: where are points in space?
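A minimal sketch of the idea in 2D, assuming both camera centers and the viewing direction toward the point are already known. The function name and the ray representation are illustrative choices, not something from the slides:

```python
def triangulate_2d(c1, d1, c2, d2, eps=1e-12):
    """Intersect two viewing rays c + t*d in the plane and return the point."""
    # Solve c1 + t1*d1 = c2 + t2*d2, i.e. t1*d1 - t2*d2 = c2 - c1, by Cramer's rule.
    bx, by = c2[0] - c1[0], c2[1] - c1[1]
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < eps:
        raise ValueError("rays are parallel: the point cannot be triangulated")
    t1 = (bx * (-d2[1]) - (-d2[0]) * by) / det
    return (c1[0] + t1 * d1[0], c1[1] + t1 * d1[1])

# Two cameras two units apart, both looking at the same point.
point = triangulate_2d((0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (-1.0, 2.0))
print(point)
```

With noisy real measurements the rays rarely meet exactly, so practical systems solve a least-squares version of the same problem.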

Triangulation

Image Formation

How does your camera take a 3D point in space and project it onto a 2D image plane?

2D example on the board for pinhole camera

x = f * X / Z

see excellent notes
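The projection formula can be tried out directly. This is a small sketch that assumes the point is given in camera coordinates and that y follows the same rule as x (y = f * Y / Z); the function name is hypothetical:

```python
def project_pinhole(point, f):
    """Project a 3D point (X, Y, Z) in camera coordinates onto the image plane."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Similar triangles: image coordinate = focal length * lateral offset / depth.
    return (f * X / Z, f * Y / Z)

near = project_pinhole((1.0, 0.5, 2.0), f=2.0)
far = project_pinhole((1.0, 0.5, 4.0), f=2.0)
print(near, far)
```

Doubling the depth halves the image coordinates, which is why distant objects look smaller.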

Triangulation

Could try to triangulate every pixel in a camera...

Image Formation

Q: How many "unknowns" are there in a real camera's configuration?

Structure from Motion

What if you don't know where the cameras are located?

5 unknowns for each camera

(x,y,z) position and (θ,φ) rotation

called the camera extrinsics

Structure from Motion

What if you don't know how 3D points (x,y,z) map to

2D points (x,y) in the image?

5 more unknowns for each camera:

focal length (f), principal point (cx,cy),

radial distortion (polynomial approximation with 3 parameters)

called the camera intrinsics

Structure from Motion

10 unknowns for each camera...

3 unknowns for each point in space...

2 equations (x,y) each time you see a point in an image

if (10*nCameras + 3*nPoints < 2*nObservations)

you're in business.
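The counting argument is easy to wrap in a helper. This sketch uses the per-camera and per-point counts from the slides as defaults; the function name is made up for illustration, and passing the check is only a necessary condition, not a guarantee that the solve will behave:

```python
def is_well_constrained(n_cameras, n_points, n_observations,
                        unknowns_per_camera=10, unknowns_per_point=3):
    """Return True when there are more equations than unknowns."""
    unknowns = unknowns_per_camera * n_cameras + unknowns_per_point * n_points
    equations = 2 * n_observations  # one (x, y) pair per observed point
    return unknowns < equations

# 3 cameras that each see all 100 points: 330 unknowns vs 600 equations.
print(is_well_constrained(3, 100, 300))
# 2 cameras that each see all 10 points: 50 unknowns vs 40 equations.
print(is_well_constrained(2, 10, 20))
```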

Feature Detection

So the entire problem boils down to finding corresponding points in multiple images.

Or track features through the frames of a video

Feature Detection

What makes a "good feature"?

Feature detectors

Gold standard is the Harris Corner Detector:

Moving the window slightly in any direction creates a big difference in the resulting window

Difference of Gaussians:

The feature responds differently to Gaussians of different scales.

FAST:

No time for complicated feature detectors? Just test whether the pixels in a circle around a point are much brighter or darker than the center.

Harris Corner Detector

Key idea: a good feature is one that looks different if the patch is shifted slightly in any direction.

Check explicitly, or check the two eigenvalues of the structure tensor: both are large at a corner. Even BETTER, skip the eigenvalues and just compare the determinant against the trace, e.g. det(M) - k * trace(M)^2.
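A minimal sketch of the determinant-versus-trace test on a tiny synthetic image. It assumes central-difference gradients, an unweighted square window, and k = 0.05; all three are illustrative choices, and real implementations usually weight the window with a Gaussian:

```python
def harris_response(img, x, y, radius=1, k=0.05):
    """Harris score det(M) - k * trace(M)^2 for the window centred on (x, y)."""
    sxx = syy = sxy = 0.0
    for v in range(y - radius, y + radius + 1):
        for u in range(x - radius, x + radius + 1):
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0  # horizontal gradient
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0  # vertical gradient
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace

# 10x10 image that is dark except for a bright square in one quadrant.
img = [[1.0 if (u >= 5 and v >= 5) else 0.0 for u in range(10)] for v in range(10)]
corner = harris_response(img, 5, 5)  # on the corner of the square
edge = harris_response(img, 5, 7)    # on a straight edge
flat = harris_response(img, 2, 2)    # in a featureless region
print(corner, edge, flat)
```

The score is positive at the corner, negative along the edge, and zero in the flat region, which is the behaviour the thresholding step relies on.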

Difference of Gaussians

Blur the image by two slightly different amounts and subtract.

Acts as a band-pass filter: it basically picks off one "frequency" band of the image's spectrum. The resulting blobs can sometimes be used as additional features if a corner detector is inappropriate.
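The blur-and-subtract step, sketched on a single image row to keep it short. The kernel radius of 3 sigma, the border clamping, and the scale ratio of 1.6 are assumptions made for this example:

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1D Gaussian weights out to three standard deviations."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)  # clamp at the borders
            acc += w * signal[idx]
        out.append(acc)
    return out

def difference_of_gaussians(signal, sigma=1.0, ratio=1.6):
    fine = blur(signal, sigma)
    coarse = blur(signal, sigma * ratio)
    return [a - b for a, b in zip(fine, coarse)]

flat = [5.0] * 21
bump = [0.0] * 10 + [1.0] + [0.0] * 10
dog_flat = difference_of_gaussians(flat)
dog_bump = difference_of_gaussians(bump)
print(max(abs(v) for v in dog_flat), dog_bump.index(max(dog_bump)))
```

A constant row gives essentially no response, while the isolated bump produces a peak at its own location.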

FAST detector

Check in a ring around each point. If several pixels in a row are "different enough" from the center, call it a feature.

Used for real-time applications where Harris is too expensive.
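A rough sketch of the ring test. It assumes the commonly used 16-pixel circle of radius 3, a brightness threshold of 20, and a required run of 9 contiguous pixels; those numbers are illustrative, and the speed tricks of the real detector are left out:

```python
# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, in order around the ring.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, threshold=20, n_contiguous=9):
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # first look for a brighter run, then a darker run
        flags = [sign * (p - center) > threshold for p in ring]
        run = 0
        for flag in flags + flags:  # doubling the list handles wrap-around
            run = run + 1 if flag else 0
            if run >= n_contiguous:
                return True
    return False

corner_img = [[200 if (u >= 4 and v >= 4) else 10 for u in range(9)] for v in range(9)]
edge_img = [[200 if u >= 4 else 10 for u in range(9)] for v in range(9)]
flat_img = [[10] * 9 for _ in range(9)]
print(is_fast_corner(corner_img, 4, 4),
      is_fast_corner(edge_img, 4, 4),
      is_fast_corner(flat_img, 4, 4))
```

On the corner image 11 ring pixels in a row are darker than the center, so the test fires; on the straight edge only 7 are, so it does not.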

Feature Descriptors

How to compare two similar features?

Pixel-by-pixel metrics only work for EXACT matches

Many feature descriptors have been proposed to achieve invariance to scale, rotation, lighting, etc.

Could probably be its own talk...

Scale-invariant feature transform (SIFT):

Compute histograms of gradients and rotate everything into a consistent frame. Slow, but first "effective" descriptor.

Speeded Up Robust Features (SURF):

Approximate gradient responses with fast integer filters.

Can precompute an integral image and compute each filter response in O(1) time.

Faster and works about as well as SIFT.
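The integral-image trick in a few lines. This is a generic sketch of the data structure rather than SURF's actual filters; the function names are illustrative:

```python
def integral_image(img):
    """ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    # One extra row and column of zeros keeps the lookup free of edge cases.
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img[y][x] for x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3), box_sum(ii, 1, 1, 3, 3))
```

Any box filter response is then four lookups, no matter how large the box is.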

Random Binary Descriptors

Make a bunch of pseudo-random pixel comparisons and create a big binary vector.

SIFT and SURF are very big descriptors (typically 128 and 64 floats, respectively). The BRIEF descriptor gets comparable performance with 16 bytes.
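A toy version of the pixel-comparison idea, assuming 128 comparisons (16 bytes) drawn uniformly from a 9x9 patch with a fixed seed. The sampling pattern and names here are hypothetical, and the smoothing that BRIEF applies to the patch first is omitted:

```python
import random

def make_test_pairs(n_bits=128, patch_radius=4, seed=0):
    """Fixed pseudo-random list of pixel-offset pairs, shared by every feature."""
    rng = random.Random(seed)
    def offset():
        return (rng.randint(-patch_radius, patch_radius),
                rng.randint(-patch_radius, patch_radius))
    return [(offset(), offset()) for _ in range(n_bits)]

def binary_descriptor(img, x, y, pairs):
    """Pack one bit per comparison into a Python int."""
    bits = 0
    for i, ((ax, ay), (bx, by)) in enumerate(pairs):
        if img[y + ay][x + ax] < img[y + by][x + bx]:
            bits |= 1 << i
    return bits

rng = random.Random(1)
img = [[rng.randint(0, 255) for _ in range(12)] for _ in range(12)]
brighter = [[p + 10 for p in row] for row in img]  # same patch, uniformly brighter

pairs = make_test_pairs()
d1 = binary_descriptor(img, 6, 6, pairs)
d2 = binary_descriptor(brighter, 6, 6, pairs)
print(d1 == d2)
```

Because only the ordering of pixel pairs matters, adding a constant brightness to the patch leaves the descriptor unchanged.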

Random Binary Descriptors

Also much faster to compare two features.

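Comparing two binary descriptors is a Hamming distance: XOR the bit vectors and count the set bits. Many chips have built-in XOR and bit-count (popcount) instructions.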
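In code the comparison is a one-liner. This sketch stores each descriptor as a Python int and counts bits through the string form for portability; lower-level languages would map this to a hardware popcount:

```python
def hamming_distance(a, b):
    """Number of bit positions in which two binary descriptors differ."""
    return bin(a ^ b).count("1")

print(hamming_distance(0b1011, 0b0010))
```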

2010 - BRIEF descriptor was first on the scene, but was not rotation invariant.

2011 - ORB (Oriented FAST and Rotated BRIEF) showed how to handle rotation invariance and might be considered state-of-the-art in realtime-ish descriptors.

Random Binary Descriptors

2011 - BRISK (Binary Robust Invariant Scalable Keypoints) adds scale invariance by considering a pyramid of images at different scales. Probably the best carefully tested descriptor out right now.

2012 - FREAK (Fast Retina Keypoint) uses a creative sampling pattern loosely based on spatial encoding in the human retina.

2013+ Attention has turned toward machine learning approaches to feature descriptors, but no real success yet.

In other words, come up with a clever acronym and design yet another feature descriptor.

Feature Matching

Detected good feature points... have a descriptor for each...

Feature Matching

Naive solution: compare M² feature pairs for each of N² image pairs.

2010 - Image Webs: "bag of words" approach to image matching. Requires post-processing to find matches initially missed. Good for dense image collections.

2012 - MatchMiner: Iteratively add photos. Algorithm for greedily testing images that are likely to match existing data first.

Skip redundant images and prioritize merging disconnected components.

Clusters of related photos:

Feature Matching

Naive solution: compare M² feature pairs for each of N² image pairs.

Classic acceleration structure is a k-d tree:

Feature Matching

k-d trees break down for data with more than about 20 dimensions, but this is still what almost everybody uses.
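For reference, a compact k-d tree with an exact nearest-neighbour query. It is a sketch using plain dicts for nodes and a median split on cycling axes, and it assumes low-dimensional points where the pruning step still pays off:

```python
import random

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if diff * diff < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

rng = random.Random(0)
points = [tuple(rng.random() for _ in range(3)) for _ in range(200)]
tree = build_kdtree(points)
query = (0.5, 0.5, 0.5)
print(nearest(tree, query))
```

In high dimensions the "could a closer point lie beyond the plane" test almost always says yes, so the search degrades toward visiting every node.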

Locality Sensitive Hashing (LSH) has become popular for high-dimensional data. Choose random low-dimensional projections and bin nearby points. Do set-union to find potential nearest neighbors. Verify distances on this smaller set.
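A small sketch of the project, bin, union, and verify loop. It assumes one particular hash family (the sign of a few random projections per table), and the class name, table count, and bit count are illustrative choices:

```python
import random
from collections import defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class RandomProjectionLSH:
    def __init__(self, dim, n_tables=8, n_bits=6, seed=0):
        rng = random.Random(seed)
        # Each table gets its own small set of random projection directions.
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(n_bits)] for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = []

    def _key(self, planes, v):
        # Bin by the sign of each projection.
        return tuple(sum(a * b for a, b in zip(p, v)) >= 0 for p in planes)

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, v)].append(idx)

    def query(self, v):
        candidates = set()  # set-union over the matching bin of every table
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, v), []))
        if not candidates:
            return None
        # Verify true distances on the much smaller candidate set.
        return min(candidates, key=lambda i: sq_dist(self.points[i], v))

rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
index = RandomProjectionLSH(dim=8)
for p in points:
    index.add(p)
hits = [index.query(p) for p in points]
print(hits[:5])
```

The result is approximate: a true nearest neighbour that never shares a bin with the query is simply missed, which is the price paid for the speed.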

LSH has lots of variants:

Spherical LSH for normalized data - project points onto nearest vertex of a randomly rotated regular N-dimensional polytope.
Terasawa, K., & Tanaka, Y. (2007). Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere.
Use generic clustering to create dynamic cells. Creates much better hash functions, but they are very hard to index into quickly.
Nistér, D., Stewénius, H., (2006). Scalable recognition with a vocabulary tree.
Leech-lattice works insanely well in 24D. Some suggest random projections onto 24D and using it as your hash function.

Hamming Distance LSH

There is a nice parallel between using the Binary Feature Descriptors and using Bit-Sampling on {0,1}^p for your "low-dimensional projection" hash function.

01101001101111101101 > 0111

00101100000101010000 > 0000

11000101011111001101 > 0111

01010000110111101010 > 1010

11011100000110101100 > 1011
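The example above can be reproduced in a few lines. The slide does not say which bits were sampled; positions 3, 7, 12 and 17 (counting from 0 at the left) are an assumption picked here because they happen to generate the four-bit keys shown:

```python
from collections import defaultdict

SAMPLED_BITS = (3, 7, 12, 17)  # assumed positions, see the note above

def bit_sample_key(descriptor, positions=SAMPLED_BITS):
    """Hash a bit string by keeping only a few fixed bit positions."""
    return "".join(descriptor[p] for p in positions)

descriptors = [
    "01101001101111101101",
    "00101100000101010000",
    "11000101011111001101",
    "01010000110111101010",
    "11011100000110101100",
]

keys = [bit_sample_key(d) for d in descriptors]
buckets = defaultdict(list)
for d in descriptors:
    buckets[bit_sample_key(d)].append(d)
print(keys)
```

The first and third descriptors land in the same bucket, so only that pair would be checked with a full Hamming distance.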

Back to SfM

Compute feature points and descriptors for every image
Use MatchMiner to propose image pairs that are likely to match
To find all matches, build LSH for image A and query features from image B. (cache LSH for later)
Throw out outlier matches based on geometric constraints and other sanity checks.
Build huge non-linear system of equations to solve...
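The equations in that system are reprojection errors, one pair per observation. The sketch below assumes a simplified camera (rotation matrix R, translation t, focal length f, principal point cx, cy) and leaves out radial distortion; the dict layout and function names are hypothetical:

```python
def reprojection_residual(camera, point, observed):
    """Predicted minus observed image position for one 3D point in one camera."""
    R, t = camera["R"], camera["t"]
    # World coordinates -> camera coordinates.
    Xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Pinhole projection plus principal point offset.
    u = camera["f"] * Xc[0] / Xc[2] + camera["cx"]
    v = camera["f"] * Xc[1] / Xc[2] + camera["cy"]
    return (u - observed[0], v - observed[1])

def total_squared_error(cameras, points, observations):
    """observations is a list of (camera_index, point_index, (u, v))."""
    total = 0.0
    for ci, pi, uv in observations:
        ru, rv = reprojection_residual(cameras[ci], points[pi], uv)
        total += ru * ru + rv * rv
    return total

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [{"R": identity, "t": [0.0, 0.0, 0.0], "f": 100.0, "cx": 50.0, "cy": 50.0}]
points = [(1.0, 2.0, 10.0)]

perfect = total_squared_error(cameras, points, [(0, 0, (60.0, 70.0))])
off_by_one = total_squared_error(cameras, points, [(0, 0, (61.0, 70.0))])
print(perfect, off_by_one)
```

The solver's job is to adjust every camera and every point at once until this total is as small as possible.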

Non-linear solvers

Definitely an entire talk's worth. Short version:

Use Google's Ceres Solver

https://code.google.com/p/ceres-solver/

Generic non-linear solvers don't exploit the special structure of structure-from-motion problems. Key insight is that most variables don't interact: each observation couples only one 3D point with one camera's parameters.

Ceres solver

Big benefit of ceres-solver is that it uses auto-differentiation to figure out your function's partial derivatives, so you don't have to provide them explicitly.

ceres-solver example if there is time

Auto-differentiation: take all instances of float and replace them with a type Jet that tracks the derivative via the chain rule.
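The idea can be imitated in a few lines of Python. This toy Jet is not the Ceres class: it is a sketch that tracks a single derivative and only supports the four basic arithmetic operations:

```python
class Jet:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = float(value)
        self.deriv = float(deriv)

    @staticmethod
    def lift(x):
        return x if isinstance(x, Jet) else Jet(x)  # constants have derivative 0

    def __add__(self, other):
        other = Jet.lift(other)
        return Jet(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = Jet.lift(other)
        return Jet(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Jet.lift(other) - self

    def __mul__(self, other):
        other = Jet.lift(other)
        # Product rule.
        return Jet(self.value * other.value,
                   self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Jet.lift(other)
        # Quotient rule.
        return Jet(self.value / other.value,
                   (self.deriv * other.value - self.value * other.deriv)
                   / (other.value ** 2))

    def __rtruediv__(self, other):
        return Jet.lift(other) / self

def project(f, X, Z):
    return f * X / Z  # the same code works for floats and for Jets

# Seed Z with derivative 1 to ask for d(project)/dZ at Z = 4.
result = project(2.0, 1.0, Jet(4.0, 1.0))
print(result.value, result.deriv)
```

The derivative that comes out matches the hand-derived -f * X / Z^2 without anyone writing that formula down.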

Applications:

http://www.cs.cornell.edu/projects/bigsfm/

open Dropbox/2015_01_10_Stadhuis/

Geolocation:

Geolocation:

PANOPTES

Automatically geotag images by referencing a huge database of previously geotagged images.

geotagged Instagram photos

Thanks

openMVG:

https://github.com/openMVG/openMVG

CERES:

https://code.google.com/p/ceres-solver/

Arqball (summer interns):

http://arqspin.com

Michael Holroyd

http://meekohi.com

Brief introduction to the concepts behind the structure-from-motion algorithm for finding camera pose and sparse 3D geometry.