# Structure-from-Motion

beCraft - April 2014

Michael Holroyd, Ph.D.

## Triangulation

Core problem in computer vision: where are points in space?

## Triangulation

Could try to triangulate every pixel in a camera...

## Structure from Motion

What if you don't know where the cameras are located?

5 unknowns for each camera

(x,y,z) position and (θ,φ) rotation

called the camera extrinsics

## Structure from Motion

What if you don't know how 3D points (x,y,z) map to

2D points (x,y) in the image?

5 more unknowns for each camera:

focal length (f), principal point (cx,cy),

radial distortion (polynomial approximation with 3 parameters)

called the camera intrinsics
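To make the five intrinsics concrete, here is a minimal sketch of how they could map a 3D point (already expressed in the camera's coordinate frame) to a pixel. The function name `project` and the exact form of the distortion polynomial (even powers of the radius with coefficients `k1, k2, k3`) are assumptions for illustration; real pipelines differ in the details.

```python
def project(point_cam, f, cx, cy, k=(0.0, 0.0, 0.0)):
    """Project a 3D point in camera coordinates to pixel coordinates.

    Assumed model: pinhole projection followed by a 3-parameter
    polynomial radial distortion (hypothetical parameter names).
    """
    X, Y, Z = point_cam
    # Perspective division onto the normalized image plane.
    x, y = X / Z, Y / Z
    # Radial distortion as a polynomial in the squared radius.
    r2 = x * x + y * y
    k1, k2, k3 = k
    scale = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    # Scale by the focal length and shift by the principal point.
    return (f * scale * x + cx, f * scale * y + cy)


u, v = project((0.1, -0.2, 2.0), f=800.0, cx=320.0, cy=240.0)
print(u, v)
```

Each observed feature contributes one such (u, v) pair, which is where the "2 equations per observation" count comes from.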

## Structure from Motion

10 unknowns for each camera...
3 unknowns for each point in space...

2 equations (x,y) each time you see a point in an image

Solvable if (10*nCameras + 3*nPoints < 2*nObservations), i.e. more equations than unknowns
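The counting argument can be written out directly. This is only a sketch of the bookkeeping on this slide; the helper name `is_overdetermined` is made up for illustration, and the per-camera and per-point counts are the ones used in this talk.

```python
def is_overdetermined(n_cameras, n_points, n_observations,
                      unknowns_per_camera=10, unknowns_per_point=3):
    """Return True when there are more equations than unknowns."""
    unknowns = unknowns_per_camera * n_cameras + unknowns_per_point * n_points
    equations = 2 * n_observations  # one (x, y) pair per observation
    return unknowns < equations


# 3 cameras that each see all 20 points: 90 unknowns vs 120 equations.
print(is_overdetermined(n_cameras=3, n_points=20, n_observations=60))
# 2 cameras that each see all 20 points: 80 unknowns vs 80 equations.
print(is_overdetermined(n_cameras=2, n_points=20, n_observations=40))
```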

## Feature Detection

So the entire problem boils down to finding corresponding points in multiple images.

## Feature Detection

What makes a "good feature"?

## Feature detectors

Gold standard is the Harris Corner Detector:
Moving the window slightly in any direction creates a big difference in the resulting window

Difference of Gaussians:
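Blur the image with Gaussians at different scales and subtract adjacent levels. Features are points with a strong extremum of that difference across both position and scale.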

FAST:
No time for complicated feature detectors: just test whether the pixels on a circle around a point are much brighter or darker than the center.
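A minimal sketch of the FAST-style circle test, assuming the common setup of 16 pixels on a radius-3 circle and a required run of 9 contiguous brighter-or-darker pixels. The function name and default thresholds are illustrative choices, and this version skips the early-exit tricks that make real implementations fast.

```python
# 16 (dx, dy) offsets on a radius-3 circle, listed in order around the circle.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, threshold=20, run_length=9):
    """img is a 2D grid (list of rows) of intensities; (x, y) must be
    at least 3 pixels away from every border."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (+1, -1):
        # sign=+1 flags pixels much brighter than the center, -1 much darker.
        flags = [sign * (p - center) > threshold for p in ring]
        run = 0
        # Walk the ring twice so runs that wrap past index 0 are counted.
        for flag in flags + flags:
            run = run + 1 if flag else 0
            if run >= run_length:
                return True
    return False


flat = [[50] * 7 for _ in range(7)]
spot = [[0] * 7 for _ in range(7)]
spot[3][3] = 100  # a bright dot on a dark background
print(is_fast_corner(flat, 3, 3), is_fast_corner(spot, 3, 3))
```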

## Feature Descriptors

How to compare two similar features?

Pixel-by-pixel metrics only work for EXACT matches

Many feature descriptors have been proposed to achieve invariance to scale, rotation, lighting, etc.

Could probably be its own talk...

## Scale-invariant feature transform (SIFT):

Compute histograms of gradients and rotate everything into a consistent frame. Slow, but first "effective" descriptor.

## Speeded Up Robust Features (SURF):

Approximate gradient responses with fast integer filters.

Can precompute an integral image and compute

filter response in O(1) time.
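The integral-image trick is easy to sketch with NumPy. The helper names below are made up for illustration; the point is that once the cumulative sums are stored, the sum over any axis-aligned box takes four lookups regardless of the box size.

```python
import numpy as np


def integral_image(img):
    """Cumulative sum table, padded with a zero row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] in O(1) using four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


img = np.arange(25).reshape(5, 5)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 4, 3), img[1:4, 1:3].sum())
```

A box filter response is then a small weighted combination of a few such box sums.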

Faster and works about as well as SIFT.

## Random Binary Descriptors

Make a bunch of pseudo-random pixel comparisons and create a big binary vector.

SIFT and SURF are very big descriptors (typically 128 or 64 floats). The BRIEF descriptor gets comparable performance with 16 bytes.
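A simplified sketch of the idea, not the actual BRIEF implementation: pick a fixed set of pseudo-random pixel pairs once, then set one bit per pair depending on which pixel is brighter. The function names, the uniform sampling of pairs, and the 31x31 patch size are assumptions for illustration, and the patch smoothing that practical versions apply first is omitted.

```python
import numpy as np


def make_test_pairs(n_bits=128, patch_size=31, seed=0):
    """Fixed pseudo-random comparisons; each row is (y1, x1, y2, x2)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_bits, 4))


def binary_descriptor(patch, pairs):
    """One bit per pair: 1 if the first pixel is darker than the second."""
    a = patch[pairs[:, 0], pairs[:, 1]]
    b = patch[pairs[:, 2], pairs[:, 3]]
    bits = (a < b).astype(np.uint8)
    return np.packbits(bits)  # 128 bits pack into 16 bytes


pairs = make_test_pairs()
patch = np.random.default_rng(1).integers(0, 256, size=(31, 31))
desc = binary_descriptor(patch, pairs)
print(desc.nbytes)
```

The same `pairs` must be reused for every patch, otherwise bits from different descriptors are not comparable.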

## Random Binary Descriptors

Also much faster to compare two features.

Many chips have a built-in XOR + bitcount instruction.
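Comparing two binary descriptors is just "XOR, then count the set bits" (the Hamming distance). A minimal pure-Python sketch follows; the function name is illustrative, and a production matcher would use the hardware popcount rather than string counting.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions where two equal-length descriptors differ."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 exactly where the two descriptors disagree.
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # Count the set bits (the software stand-in for a popcount instruction).
    return bin(x).count("1")


print(hamming_distance(b"\x0f", b"\xf0"))
print(hamming_distance(b"\xff\x00", b"\xff\x01"))
```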

2010 - BRIEF descriptor was first on the scene, but did not handle rotation invariance.

2011 - ORB (Oriented FAST and Rotated BRIEF) showed how to handle rotation invariance and might be considered state-of-the-art in realtime-ish descriptors.

## Random Binary Descriptors

2011 - BRISK (Binary Robust Invariant Scalable Keypoints) adds scale invariance by considering a pyramid of images at different scales. Probably the best carefully tested descriptor available right now.

2012 - FREAK (Fast Retina Keypoint) uses a creative sampling pattern loosely based on spatial encoding in the human retina.

2013+ Attention has turned toward machine learning approaches to feature descriptors, but no real success yet.

In other words, come up with a clever acronym and design yet another feature descriptor.

## Feature Matching

Naive solution: compare M² features in N² images.

2010 - Image Webs: "bag of words" approach to image matching. Requires post-processing to find matches initially missed. Good for dense image collections.

2012 - MatchMiner: Iteratively add photos. Algorithm for greedily testing images that are likely to match existing data first.

Skip redundant images and prioritize merging disconnected components.

## Feature Matching

k-d trees really suck for >20 dimension data, but this is still what almost everybody uses.

Locality Sensitive Hashing (LSH) has become popular for high-dimensional data. Choose random low-dimensional projections and bin nearby points. Do set-union to find potential nearest neighbors. Verify distances on this smaller set.

## LSH has lots of variants:

• Spherical LSH for normalized data - project points onto the nearest vertex of a randomly rotated regular N-dimensional polytope.
Terasawa, K., & Tanaka, Y. (2007). Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere.

• Use generic clustering to create dynamic cells. Creates much better hash functions, but they are very hard to index into quickly.
Nistér, D., & Stewénius, H. (2006). Scalable recognition with a vocabulary tree.

• Leech-lattice works insanely well in 24D. Some suggest random projections onto 24D and using it as your hash function.

## Hamming Distance LSH

There is a nice parallel between using binary feature descriptors and using bit-sampling on {0,1}^p as your "low-dimensional projection" hash function.

01101001101111101101 > 0111
00101100000101010000 > 0000
11000101011111001101 > 0111
01010000110111101010 > 1010
11011100000110101100 > 1011
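The example above can be reproduced with a few lines of Python. The slide does not say which bits are sampled; the positions `(3, 7, 12, 17)` (0-based) below are an assumption, chosen because they yield the five 4-bit hashes shown. Descriptors that land in the same bucket become candidate nearest neighbors, to be verified with a full Hamming distance.

```python
from collections import defaultdict

# Assumption: 0-based bit positions that reproduce the hashes in the example.
SAMPLED_POSITIONS = (3, 7, 12, 17)


def bit_sample_hash(bits, positions=SAMPLED_POSITIONS):
    """Hash a bit string by keeping only a fixed subset of its bits."""
    return "".join(bits[p] for p in positions)


descriptors = [
    "01101001101111101101",
    "00101100000101010000",
    "11000101011111001101",
    "01010000110111101010",
    "11011100000110101100",
]

# Descriptors sharing a hash end up in the same bucket.
buckets = defaultdict(list)
for d in descriptors:
    buckets[bit_sample_hash(d)].append(d)

for key, members in buckets.items():
    print(key, members)
```

In practice several hash tables with different sampled positions are used, and the candidate sets from each are unioned before verification.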

## Back to SfM

1. Compute feature points and descriptors for every image
2. Use MatchMiner to propose image pairs that are likely to match
3. To find all matches, build LSH for image A and query features from image B. (cache LSH for later)
4. Throw out outlier matches based on geometric constraints and other sanity checks.
5. Build huge non-linear system of equations to solve...

## Non-linear solvers

Definitely an entire talk's worth. Short version:

Generic non-linear solvers don't exploit the special structure of structure-from-motion problems. The key insight is that most variables don't interact: each observation equation involves only one 3D point and one camera's parameters.
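One way to see this structure is to look at which unknowns each equation touches. The sketch below builds that sparsity pattern using this talk's counts (10 parameters per camera, 3 per point, 2 equations per observation); the function name and the variable ordering (all cameras first, then all points) are assumptions for illustration.

```python
import numpy as np


def jacobian_sparsity(n_cameras, n_points, observations,
                      cam_params=10, point_params=3):
    """Boolean mask of which unknowns each observation equation depends on.

    observations: list of (camera_index, point_index) pairs.
    """
    n_cols = cam_params * n_cameras + point_params * n_points
    pattern = np.zeros((2 * len(observations), n_cols), dtype=bool)
    for i, (cam, pt) in enumerate(observations):
        rows = slice(2 * i, 2 * i + 2)  # the x and y equations
        c0 = cam_params * cam
        pattern[rows, c0:c0 + cam_params] = True      # one camera block
        p0 = cam_params * n_cameras + point_params * pt
        pattern[rows, p0:p0 + point_params] = True    # one point block
    return pattern


# 4 cameras that each see all 50 points.
obs = [(c, p) for c in range(4) for p in range(50)]
pattern = jacobian_sparsity(4, 50, obs)
print(pattern.shape, pattern.mean())
```

Every row has only 13 non-zero entries no matter how many cameras and points there are, so the matrix gets sparser as the problem grows.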

## Applications:

3D scanning people:
`open ~/denseAlexx.ply`

Outdoor spaces:
`open ~/denseArqOffice.ply`

## Thanks

openMVG:
https://github.com/openMVG/openMVG

CERES:

Arqball (computer vision kung-fu):
http://arqspin.com

Knollop (online education search):
http://knollop.com

Michael Holroyd
http://meekohi.com

# Structure from Motion

Brief introduction to the concepts behind the structure-from-motion algorithm for finding camera pose and sparse 3D geometry.
