CS6501 - Feb 2015 

Michael Holroyd, Ph.D.


Research digitizing objects/spaces



Arqball Research


Core problem in computer vision: where are points in space?


Image Formation

How does your camera take a 3D point in space and project it onto a 2D image plane?

2D example on the board for pinhole camera
x = f * X / Z
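
A minimal sketch of the board example in Python, assuming a camera at the origin looking down the Z axis. The function name `project_pinhole` and the sample numbers are illustrative only.

```python
def project_pinhole(X, Z, f):
    """Project a point at lateral offset X and depth Z onto the image line."""
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return f * X / Z

# Doubling the depth halves the image coordinate.
near = project_pinhole(X=2.0, Z=4.0, f=1.0)
far = project_pinhole(X=2.0, Z=8.0, f=1.0)
```

The division by Z is what makes the problem non-linear later on.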


Could try to triangulate every pixel in a camera...

Image Formation

Q: How many "unknowns" are there in a real camera's configuration?

Structure from Motion

What if you don't know where the cameras are located?

5 unknowns for each camera

(x,y,z) position and (θ,φ) rotation

called the camera extrinsics

Structure from Motion

What if you don't know how 3D points (x,y,z) map to 2D points (x,y) in the image?

5 more unknowns for each camera:

focal length (f), principal point (cx,cy), radial distortion (polynomial approximation with 3 parameters)

called the camera intrinsics
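
One way the extrinsics and intrinsics could combine into a single projection function, as a rough sketch. The rotation order (θ about the vertical axis, then φ about the horizontal axis), the sign conventions, the even-powered distortion polynomial, and the dictionary keys are all assumptions made for illustration, not something the slides specify.

```python
import math

def project(point, cam):
    """Map a 3D point to pixel coordinates for one (hypothetical) camera."""
    # Extrinsics: express the point relative to the camera position.
    x = point[0] - cam["x"]
    y = point[1] - cam["y"]
    z = point[2] - cam["z"]
    # Rotate by theta about the vertical (y) axis, then phi about the x axis.
    ct, st = math.cos(cam["theta"]), math.sin(cam["theta"])
    x, z = ct * x - st * z, st * x + ct * z
    cp, sp = math.cos(cam["phi"]), math.sin(cam["phi"])
    y, z = cp * y - sp * z, sp * y + cp * z
    # Pinhole division onto the normalized image plane.
    u, v = x / z, y / z
    # Intrinsics: radial distortion polynomial with 3 parameters.
    r2 = u * u + v * v
    d = 1.0 + cam["k1"] * r2 + cam["k2"] * r2 ** 2 + cam["k3"] * r2 ** 3
    # Intrinsics: scale by focal length, shift by the principal point.
    return (cam["f"] * d * u + cam["cx"], cam["f"] * d * v + cam["cy"])

camera = {"x": 0.0, "y": 0.0, "z": 0.0, "theta": 0.0, "phi": 0.0,
          "f": 100.0, "cx": 320.0, "cy": 240.0,
          "k1": 0.0, "k2": 0.0, "k3": 0.0}
pixel = project((1.0, 2.0, 10.0), camera)
center = project((0.0, 0.0, 5.0), camera)
```

A point on the optical axis lands on the principal point, which is a quick sanity check for any convention you pick.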

Structure from Motion

10 unknowns for each camera...
3 unknowns for each point in space...

2 equations (x,y) each time you see a point in an image

if (10*nCameras + 3*nPoints < 2*nObservations)
you're in business.
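
The counting argument above as a small Python check. The function name and defaults are made up for this sketch, and it is only a count of unknowns against equations, not a guarantee that a particular dataset is solvable.

```python
def is_well_constrained(n_cameras, n_points, n_observations,
                        unknowns_per_camera=10, unknowns_per_point=3):
    """True when there are more equations than unknowns."""
    unknowns = unknowns_per_camera * n_cameras + unknowns_per_point * n_points
    equations = 2 * n_observations  # one (x, y) pair per observation
    return unknowns < equations

# 3 cameras that each see the same 20 points: 90 unknowns, 120 equations.
enough = is_well_constrained(n_cameras=3, n_points=20, n_observations=60)
# 2 cameras that each see the same 5 points: 35 unknowns, 20 equations.
too_few = is_well_constrained(n_cameras=2, n_points=5, n_observations=10)
```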

Feature Detection

So the entire problem boils down to finding corresponding points in multiple images.

Or track features from frame to frame through video

Feature Detection

What makes a "good feature"?

Feature detectors

Gold standard is the Harris Corner Detector:
Moving the window slightly in any direction creates a big difference in the resulting window

Difference of Gaussians:
The feature responds differently to Gaussians of different scales.

FAST:
Don't have time for complicated feature detectors? Test if pixels in a circle around a point are much brighter/darker than the center.

Harris Corner Detector

Key idea: a good feature is one that looks different if the patch is shifted slightly in any direction.

Check explicitly, or check the two eigenvalues of the structure tensor (both must be large). Even BETTER, just check the difference between the determinant and the scaled squared trace: det(M) - k·trace(M)².
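
A minimal sketch of the determinant-versus-trace check in plain Python, assuming the common form det(M) - k·trace(M)² with k = 0.05 (a conventional value, not one given in the slides). Gradients use central differences and the structure tensor is summed over the whole patch with no Gaussian weighting, which real implementations usually add.

```python
def harris_response(img, k=0.05):
    """Harris score for one small grayscale patch given as a list of rows."""
    h, w = len(img), len(img[0])
    sxx = syy = sxy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (img[y][x + 1] - img[y][x - 1]) / 2.0  # horizontal gradient
            iy = (img[y + 1][x] - img[y - 1][x]) / 2.0  # vertical gradient
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy   # product of the two eigenvalues
    trace = sxx + syy             # sum of the two eigenvalues
    return det - k * trace * trace

flat = [[0.0] * 6 for _ in range(6)]
edge = [[1.0 if x >= 3 else 0.0 for x in range(6)] for y in range(6)]
corner = [[1.0 if (x >= 3 and y >= 3) else 0.0 for x in range(6)]
          for y in range(6)]
scores = {name: harris_response(p)
          for name, p in (("flat", flat), ("edge", edge), ("corner", corner))}
```

The corner patch scores positive, the edge patch negative, and the flat patch zero, so no eigenvalue decomposition is needed.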

Difference of Gaussians

Blur the image by two slightly different amounts and subtract.

Basically picks off one "frequency" of the image's spectrum. These can sometimes be used as additional features if a corner detector is inappropriate.
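
A rough sketch of the blur-twice-and-subtract idea, shown on a 1D signal to keep it short (an image version would blur rows and columns the same way). The scale ratio of 1.6, the 3-sigma kernel radius, and the clamped borders are assumptions of this example.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights out to roughly 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    """Convolve with a Gaussian, repeating the edge values at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def difference_of_gaussians(signal, sigma=1.0, ratio=1.6):
    """Blur at two nearby scales and subtract the coarser from the finer."""
    fine = blur(signal, sigma)
    coarse = blur(signal, sigma * ratio)
    return [a - b for a, b in zip(fine, coarse)]

step = [0.0] * 10 + [1.0] * 10
dog_step = difference_of_gaussians(step)
dog_flat = difference_of_gaussians([5.0] * 20)
```

The response is near zero on the constant signal and concentrated around the step, which is the band-pass behavior described above.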

FAST detector

Check in a ring around each point. If several pixels in a row are "different enough" from the center, call it a feature.

Used for real-time applications where Harris is too expensive.
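
A minimal sketch of the ring test, assuming the usual 16-pixel circle of radius 3 and a run of 9 contiguous pixels. The threshold of 20 is an arbitrary value for this example, and the sketch skips the early-exit tricks and non-maximum suppression that real implementations rely on for speed.

```python
# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, in ring order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2),
          (-1, -3)]

def is_fast_corner(img, x, y, threshold=20, run_length=9):
    """True if enough contiguous ring pixels are all brighter or all darker."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # +1 tests "brighter", -1 tests "darker"
        flags = [sign * (p - center) > threshold for p in ring]
        run = best = 0
        for flag in flags + flags:  # doubled so runs can wrap around the ring
            run = run + 1 if flag else 0
            best = max(best, run)
        if best >= run_length:
            return True
    return False

flat = [[50] * 7 for _ in range(7)]
edge = [[100 if x >= 3 else 0 for x in range(7)] for y in range(7)]
corner = [[100 if (x >= 3 and y >= 3) else 0 for x in range(7)]
          for y in range(7)]
results = [is_fast_corner(p, 3, 3) for p in (flat, edge, corner)]
```

Only the corner patch passes: a straight edge through the center gives a run of 7 darker pixels, short of the 9 required.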

Feature Descriptors

How to compare two similar features?

Pixel-by-pixel metrics only work for EXACT matches

Many feature descriptors have been proposed to achieve invariance to scale, rotation, lighting, etc.

Could probably be its own talk...

Scale-invariant feature transform (SIFT):

Compute histograms of gradients and rotate everything into a consistent frame. Slow, but first "effective" descriptor.

Speeded Up Robust Features (SURF):

Approximate gradient responses with fast integer filters.

Can precompute an integral image and compute filter response in O(1) time.
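
A short sketch of why the integral image gives constant-time box filters: once the running sums are stored, any rectangle sum takes four lookups. The function names are illustrative, and this shows only the box-sum building block, not SURF itself.

```python
def integral_image(img):
    """ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img[y][x] for x0 <= x < x1 and y0 <= y < y1, in four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
whole = box_sum(ii, 0, 0, 3, 3)
lower_right = box_sum(ii, 1, 1, 3, 3)
```

The cost of `box_sum` does not depend on the size of the rectangle, which is what makes large box filters cheap.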

Faster and works about as well as SIFT.

Random Binary Descriptors

Make a bunch of pseudo-random pixel comparisons and create a big binary vector.

SIFT and SURF are very big descriptors (typically 128 or 64 floats). The BRIEF descriptor gets comparable performance with 16 bytes.
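
A rough sketch of the random-comparison idea, producing 128 bits (16 bytes) per patch. The uniform sampling, the 9x9 patch size, and the function names are assumptions of this example. The published BRIEF descriptor also smooths the patch first, which is left out here.

```python
import random

def make_test_pairs(n_bits=128, patch_size=9, seed=0):
    """Fixed pseudo-random pairs of pixel positions, shared by every patch."""
    rng = random.Random(seed)

    def pick():
        return rng.randrange(patch_size), rng.randrange(patch_size)

    return [(pick(), pick()) for _ in range(n_bits)]

def describe(patch, pairs):
    """Pack one bit per pixel comparison into a single Python int."""
    bits = 0
    for (x1, y1), (x2, y2) in pairs:
        bits = (bits << 1) | int(patch[y1][x1] < patch[y2][x2])
    return bits

pairs = make_test_pairs()
patch = [[(7 * x + 13 * y) % 32 for x in range(9)] for y in range(9)]
brighter = [[value + 50 for value in row] for row in patch]
descriptor = describe(patch, pairs)
descriptor_brighter = describe(brighter, pairs)
```

Because each bit depends only on which of two pixels is larger, adding a constant brightness offset leaves the descriptor unchanged.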

Random Binary Descriptors

Also much faster to compare two features.

Many chips have built-in XOR and bitcount (popcount) instructions.
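
Comparing two binary descriptors then comes down to a Hamming distance. A minimal sketch, assuming descriptors are stored as Python ints (the example values are arbitrary):

```python
def hamming_distance(a, b):
    """Number of differing bits: XOR the descriptors, then count the ones."""
    return bin(a ^ b).count("1")

first = 0b1011
second = 0b0010
distance = hamming_distance(first, second)
```

In C or C++ the same thing is typically one XOR plus a popcount intrinsic per machine word.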

2010 - BRIEF descriptor was first on the scene, but did not handle rotation invariance.

2011 - ORB (Oriented FAST and Rotated BRIEF) showed how to handle rotation invariance and might be considered state-of-the-art in realtime-ish descriptors.

Random Binary Descriptors

2011 - BRISK (Binary Robust Invariant Scalable Keypoints) adds scale invariance by considering a pyramid of images at different scales. Probably the best carefully tested descriptor out right now.

2012 - FREAK (Fast Retina Keypoint) uses a creative sampling pattern loosely based on spatial encoding in the human retina.

2013+ - Attention has turned toward machine learning approaches to feature descriptors, but no real success yet.

In other words, come up with a clever acronym and design yet another feature descriptor.

Feature Matching

Detected good feature points... have a descriptor for each...


Feature Matching

Naive solution: compare M² features in N² images.

2010 - Image Webs: "bag of words" approach to image matching. Requires post-processing to find matches initially missed. Good for dense image collections.

2012 - MatchMiner: Iteratively add photos. Algorithm for greedily testing images that are likely to match existing data first.

Skip redundant images and prioritize merging disconnected components.

Clusters of related photos:

Feature Matching

Naive solution: compare M² features in N² images.
Classic acceleration structure is a k-d tree:

Feature Matching

k-d trees break down for >20-dimensional data, but this is still what almost everybody uses.

Locality Sensitive Hashing (LSH) has become popular for high-dimensional data. Choose random low-dimensional projections and bin nearby points. Do set-union to find potential nearest neighbors. Verify distances on this smaller set.

LSH has lots of variants:

  • Spherical LSH for normalized data - project points onto the nearest vertex of a randomly rotated regular N-dimensional polytope.
    Terasawa, K., & Tanaka, Y. (2007). Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere. 

  • Use generic clustering to create dynamic cells. Creates much better hash functions, but they are very hard to index into quickly.
    Nistér, D., & Stewénius, H. (2006). Scalable recognition with a vocabulary tree.

  • Leech-lattice works insanely well in 24D. Some suggest random projections onto 24D and using it as your hash function.

Hamming Distance LSH

There is a nice parallel between using the Binary Feature Descriptors and using Bit-Sampling on {0,1}^p for your "low-dimensional projection" hash function.

01101001101111101101 > 0111
00101100000101010000 > 0000
11000101011111001101 > 0111
01010000110111101010 > 1010
11011100000110101100 > 1011
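
A minimal sketch of bit-sampling LSH over the five example descriptors above. The number of tables, the number of sampled bits, and the seeds are arbitrary choices for this example, so the sampled positions will not reproduce the 4-bit keys shown above.

```python
import random
from collections import defaultdict

def make_bit_sampler(n_bits, n_samples, seed):
    """Hash function that keeps only a fixed random subset of bit positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(n_bits), n_samples))

    def hash_fn(bits):
        return "".join(bits[p] for p in positions)

    return hash_fn

def build_tables(descriptors, n_tables=4, n_samples=4):
    """One hash table per sampler, mapping each key to descriptor indices."""
    n_bits = len(descriptors[0])
    tables = []
    for t in range(n_tables):
        hash_fn = make_bit_sampler(n_bits, n_samples, seed=t)
        buckets = defaultdict(list)
        for index, bits in enumerate(descriptors):
            buckets[hash_fn(bits)].append(index)
        tables.append((hash_fn, buckets))
    return tables

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def query(tables, descriptors, q):
    """Union the matching buckets, then verify distances on that small set."""
    candidates = set()
    for hash_fn, buckets in tables:
        candidates.update(buckets.get(hash_fn(q), []))
    if not candidates:
        return None
    return min(candidates, key=lambda i: hamming(descriptors[i], q))

descriptors = ["01101001101111101101", "00101100000101010000",
               "11000101011111001101", "01010000110111101010",
               "11011100000110101100"]
tables = build_tables(descriptors)
match = query(tables, descriptors, "11000101011111001101")
```

An exact duplicate always lands in the same bucket in every table. Near-duplicates collide only with some probability, which is why several tables are used.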

Back to SfM

  1. Compute feature points and descriptors for every image
  2. Use MatchMiner to propose image pairs that are likely to match
  3. To find all matches, build LSH for image A and query features from image B. (cache LSH for later)
  4. Throw out outlier matches based on geometric constraints and other sanity checks.
  5. Build huge non-linear system of equations to solve...

Non-linear solvers

Definitely an entire talk's worth. Short version:
Use Google's Ceres Solver

Generic non-linear solvers don't exploit the special structure of structure from motion problems. Key insight is that most variables don't interact: each observation involves only one 3D point and one camera's parameters.

Ceres solver

Big benefit of ceres-solver is that it uses automatic differentiation to figure out your function's partial derivatives, so you don't have to provide them explicitly.

Auto-differentiation: take all instances of float and replace them with a type Jet that tracks derivatives via the chain rule.
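
A toy illustration of the Jet idea in Python, tracking a value plus a single derivative. This is only a sketch of the concept: the class below is written for this example, carries one partial derivative rather than a vector of them, and is not the Ceres API.

```python
class Jet:
    """A value together with its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Jet) else Jet(float(other))

    def __add__(self, other):
        other = Jet._lift(other)
        return Jet(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):  # product rule
        other = Jet._lift(other)
        return Jet(self.value * other.value,
                   self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

    def __truediv__(self, other):  # quotient rule
        other = Jet._lift(other)
        return Jet(self.value / other.value,
                   (self.deriv * other.value - self.value * other.deriv)
                   / (other.value ** 2))

    def __rtruediv__(self, other):
        return Jet._lift(other) / self

def project(X, Z, f):
    """The pinhole formula x = f * X / Z, written once for floats or Jets."""
    return f * X / Z

# Seed Z with derivative 1 to get dx/dZ at X=2, Z=4, f=1.
wrt_depth = project(2.0, Jet(4.0, 1.0), 1.0)
# Seed X with derivative 1 to get dx/dX at the same point.
wrt_offset = project(Jet(2.0, 1.0), 4.0, 1.0)
```

The function `project` is written once and never mentions derivatives, yet evaluating it on Jets yields dx/dZ = -f·X/Z² and dx/dX = f/Z alongside the value.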


open Dropbox/2015_01_10_Stadhuis/ 




Automatically geotag images by referencing a huge database of previously geotagged images.

geotagged Instagram photos




Arqball (summer interns):

Michael Holroyd

Structure from Motion for CS6501


Brief introduction to the concepts behind the structure-from-motion algorithm for finding camera pose and sparse 3D geometry.
