Core problem in computer vision: where are points in space?
Triangulation
Could try to triangulate every pixel in a camera...
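To make the idea concrete, here is a minimal sketch of two-view triangulation. It assumes each camera already gives you a ray (an origin and a direction) toward the point, and it uses the midpoint method, which is just one simple choice and not necessarily what any particular pipeline does. The function name and argument layout are made up for illustration.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Return the 3D point closest to two rays o + t*d (midpoint method)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [p - q for p, q in zip(o1, o2)]      # vector from origin 2 to origin 1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                    # zero when the rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; cannot triangulate")
    t1 = (b * e - c * d) / denom             # parameter of closest point on ray 1
    t2 = (a * e - b * d) / denom             # parameter of closest point on ray 2
    p1 = [o + t1 * v for o, v in zip(o1, d1)]
    p2 = [o + t2 * v for o, v in zip(o2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]


# Two cameras 4 units apart, both looking at the point (1, 2, 5).
point = triangulate_midpoint((0, 0, 0), (1, 2, 5), (4, 0, 0), (-3, 2, 5))
```

With noisy rays the two closest points no longer coincide, and the midpoint is a cheap compromise between them.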
Structure from Motion
What if you don't know where the cameras are located?
5 unknowns for each camera
(x,y,z) position and (θ,φ) rotation
called the camera extrinsics
Structure from Motion
What if you don't know how 3D points (x,y,z) map to
2D points (x,y) in the image?
5 more unknowns for each camera:
focal length (f), principal point (cx,cy),
radial distortion (polynomial approximation with 3 parameters)
called the camera intrinsics
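A rough sketch of how those intrinsics could turn a 3D point into a pixel. It assumes the point is already in the camera's coordinate frame and that the radial distortion is the common even-powered polynomial in the radius; the slides only say "polynomial approximation with 3 parameters", so the exact form and the names (project, k) are assumptions.

```python
def project(point_cam, f, cx, cy, k=(0.0, 0.0, 0.0)):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    xn, yn = x / z, y / z                    # perspective divide
    r2 = xn * xn + yn * yn                   # squared distance from the optical axis
    k1, k2, k3 = k
    scale = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3   # assumed radial polynomial
    return (f * scale * xn + cx, f * scale * yn + cy)


ideal = project((1, 2, 4), f=800, cx=320, cy=240)
distorted = project((1, 2, 4), f=800, cx=320, cy=240, k=(0.1, 0.0, 0.0))
```

The five intrinsics are exactly the arguments f, cx, cy and the three entries of k.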
Structure from Motion
10 unknowns for each camera...
3 unknowns for each point in space...
2 equations (x,y) each time you see a point in an image
if (10*nCameras + 3*nPoints < 2*nObservations)
you're in business.
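The counting argument is easy to check for a given dataset. A small sketch (the function name is made up) that applies the inequality from the slide as-is:

```python
def is_well_constrained(n_cameras, n_points, n_observations):
    """True when there are more equations than unknowns."""
    unknowns = 10 * n_cameras + 3 * n_points   # camera parameters + 3D positions
    equations = 2 * n_observations             # one (x, y) pair per sighting
    return unknowns < equations


# 3 cameras that each see the same 100 points: 330 unknowns, 600 equations.
enough = is_well_constrained(3, 100, 300)
# 2 cameras that each see the same 10 points: 50 unknowns, 40 equations.
too_few = is_well_constrained(2, 10, 20)
```

Passing the count is necessary, not sufficient: the observations also have to be spread across cameras and points.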
Feature Detection
So the entire problem boils down to finding corresponding points in multiple images.
Feature Detection
What makes a "good feature"?
Feature detectors
Gold standard is the Harris Corner Detector:
Moving the window slightly in any direction creates a big difference in the resulting window
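A minimal, unoptimized sketch of that idea in plain Python. It sums gradient products over a small window and scores the point with the usual det - k*trace^2 formula. The window radius, the constant k = 0.04, and the use of unweighted sums (instead of a Gaussian window) are assumptions chosen for brevity.

```python
def harris_response(img, x, y, radius=1, k=0.04):
    """Harris corner score at (x, y) for a 2D list of grayscale values."""
    sxx = syy = sxy = 0.0
    for v in range(y - radius, y + radius + 1):
        for u in range(x - radius, x + radius + 1):
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0   # horizontal gradient
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0   # vertical gradient
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace


# 9x9 test image: a bright square filling the lower-right quadrant.
img = [[1.0 if (u >= 4 and v >= 4) else 0.0 for u in range(9)] for v in range(9)]
corner = harris_response(img, 4, 4)   # at the corner of the square
edge = harris_response(img, 4, 6)     # on the square's vertical edge
flat = harris_response(img, 2, 2)     # in the dark, featureless region
```

Corners score positive, edges negative, and flat regions near zero, which is why thresholding the score picks out corners.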
Difference of Gaussians:
Blur the image with Gaussians of different scales and subtract. Features are the points where the difference responds strongly.
FAST:
No time for complicated feature detectors: just test whether the pixels in a circle around a point are much brighter or darker than the center.
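A sketch of that segment test. The 16-pixel circle of radius 3, the threshold of 20, and the run length n = 9 are assumptions (common choices, not stated in the slides), and the function name is made up.

```python
# Offsets (dx, dy) of a 16-pixel circle of radius 3, in order around the ring.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, threshold=20, n=9):
    """True if n contiguous ring pixels are all brighter or all darker."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (+1, -1):                     # +1: brighter run, -1: darker run
        flags = [sign * (p - center) > threshold for p in ring]
        run = 0
        for flag in flags + flags:            # doubled list handles wrap-around
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


flat = [[100] * 7 for _ in range(7)]
dot = [row[:] for row in flat]
dot[3][3] = 10                                # dark dot on a bright background
square = [[200 if (u >= 3 and v >= 3) else 0 for u in range(7)] for v in range(7)]
```

On the square's corner, 11 of the 16 ring pixels form one dark run, so it passes with n = 9 but not with a stricter n = 12.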
Feature Descriptors
How to compare two similar features?
Pixel-by-pixel metrics only work for EXACT matches
Many feature descriptors have been proposed to achieve invariance to scale, rotation, lighting, etc.
Could probably be its own talk...
Scale-invariant feature transform (SIFT):
Compute histograms of gradients and rotate everything into a consistent frame. Slow, but first "effective" descriptor.
Speeded Up Robust Features (SURF):
Approximate gradient responses with fast integer filters.
Can precompute an integral image and compute filter responses in O(1) time.
Faster and works about as well as SIFT.
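The integral-image trick is worth seeing once. A minimal sketch (function names are made up): after one pass over the image, the sum over any axis-aligned box costs four lookups, regardless of the box size.

```python
def integral_image(img):
    """ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]   # extra zero row and column
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of img[y][x] for x0 <= x < x1 and y0 <= y < y1, in four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


ii = integral_image([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
whole = box_sum(ii, 0, 0, 3, 3)        # 45
lower_right = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9 = 28
```

A box filter response is then just a few of these sums added and subtracted.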
Random Binary Descriptors
Make a bunch of pseudo-random pixel comparisons and create a big binary vector.
SIFT and SURF are very big descriptors (typically 128 or 64 floats). The BRIEF descriptor gets comparable performance with 16 bytes.
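A toy sketch of a BRIEF-style descriptor, packing 128 pixel comparisons into one 16-byte integer. The uniform sampling pattern, patch size, and function names are assumptions for illustration; the published BRIEF also smooths the patch first, which is skipped here.

```python
import random


def make_pairs(n_bits=128, patch=15, seed=0):
    """Fixed pseudo-random pairs of pixel offsets inside a patch."""
    rng = random.Random(seed)
    half = patch // 2
    def offset():
        return (rng.randint(-half, half), rng.randint(-half, half))
    return [(offset(), offset()) for _ in range(n_bits)]


def describe(img, x, y, pairs):
    """One bit per pair: is the first pixel darker than the second?"""
    bits = 0
    for (ax, ay), (bx, by) in pairs:
        bit = img[y + ay][x + ax] < img[y + by][x + bx]
        bits = (bits << 1) | int(bit)
    return bits


pairs = make_pairs()                          # same pairs for every feature
rng = random.Random(1)
img = [[rng.randint(0, 200) for _ in range(21)] for _ in range(21)]
brighter = [[v + 10 for v in row] for row in img]
desc = describe(img, 10, 10, pairs)
```

Because only the ordering of pixel values matters, adding a constant brightness offset leaves the descriptor unchanged.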
Also much faster to compare two features.
Many chips have a built-in XOR + bitcount instruction.
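In code, comparing two binary descriptors is one XOR followed by a count of the set bits. A sketch with descriptors stored as Python integers (an assumption; any packed bit format works the same way):

```python
def hamming(a, b):
    """Number of bit positions where two binary descriptors differ."""
    return bin(a ^ b).count("1")   # XOR marks differing bits, then count them


distance = hamming(0b0111, 0b1010)   # 0111 XOR 1010 = 1101, so 3 bits differ
```

On Python 3.10+ the same thing can be written as `(a ^ b).bit_count()`, which maps more directly onto the hardware instruction.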
2010 - BRIEF descriptor was first on the scene, but did not handle rotation invariance.
2011 - ORB (Oriented FAST and Rotated BRIEF) showed how to handle rotation invariance and might be considered state-of-the-art in realtime-ish descriptors.
2011 - BRISK (Binary Robust Invariant Scalable Keypoints) adds scale invariance by considering a pyramid of images at different scales. Probably the best carefully tested descriptor available right now.
2012 - FREAK (Fast Retina Keypoint) uses a creative sampling pattern loosely based on spatial encoding in the human retina.
2013+ Attention has turned toward machine learning approaches to feature descriptors, but no real success yet.
In other words, come up with a clever acronym and design yet another feature descriptor.
Feature Matching
Naive solution: compare all M² feature pairs across all N² image pairs.
2010 - Image Webs: "bag of words" approach to image matching. Requires post-processing to find matches initially missed. Good for dense image collections.
2012 - MatchMiner: Iteratively add photos. Algorithm for greedily testing images that are likely to match existing data first.
Skip redundant images and prioritize merging disconnected components.
Feature Matching
Naive solution: compare all M² feature pairs across all N² image pairs.
Classic acceleration structure is a k-d tree:
k-d trees really suck for data with more than about 20 dimensions, but they are still what almost everybody uses.
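For reference, a bare-bones k-d tree with exact nearest-neighbor search. This is a textbook-style sketch with made-up names, not the implementation any particular library uses.

```python
import random


def build_kdtree(points, depth=0):
    """Recursively split the points at the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])             # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to query."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if diff * diff < sq_dist(query, best):    # far side could still be closer
        best = nearest(far, query, best)
    return best


rng = random.Random(1)
pts = [tuple(rng.random() for _ in range(3)) for _ in range(200)]
tree = build_kdtree(pts)
queries = [tuple(rng.random() for _ in range(3)) for _ in range(20)]
```

The pruning test on the last branch is what stops working in high dimensions: almost every splitting plane ends up closer than the current best match, so the search visits most of the tree.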
Locality Sensitive Hashing (LSH) has become popular for high-dimensional data. Choose random low-dimensional projections and bin nearby points. Do set-union to find potential nearest neighbors. Verify distances on this smaller set.
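A small sketch of that recipe. Each table hashes a vector by the signs of a few random projections, a query takes the union of its buckets across tables, and exact distances are checked only on that candidate set. Using projection signs as the binning rule is one assumption among several possible ones, and the class name and parameters are made up.

```python
import random


class RandomProjectionLSH:
    def __init__(self, dim, n_tables=8, n_bits=6, seed=0):
        rng = random.Random(seed)
        # One set of n_bits random directions per hash table.
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(n_bits)] for _ in range(n_tables)]
        self.tables = [{} for _ in range(n_tables)]
        self.points = []

    def _key(self, planes, v):
        # Low-dimensional projection, binned by sign.
        return tuple(sum(a * b for a, b in zip(p, v)) > 0 for p in planes)

    def add(self, v):
        index = len(self.points)
        self.points.append(v)
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._key(planes, v), []).append(index)

    def query(self, v):
        """Index of the closest candidate, or None if every bucket is empty."""
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, v), []))  # set-union
        if not candidates:
            return None
        # Verify true distances only on the candidate set.
        return min(candidates, key=lambda i: sum(
            (a - b) ** 2 for a, b in zip(self.points[i], v)))


rng = random.Random(2)
index = RandomProjectionLSH(dim=16)
data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(100)]
for vec in data:
    index.add(vec)
```

The result is approximate: a true nearest neighbor that never shares a bucket with the query is missed, and more tables trade memory for fewer misses.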
LSH has lots of variants:
Spherical LSH for normalized data - project points onto the nearest vertex of a randomly rotated regular N-dimensional polytope. Terasawa, K., & Tanaka, Y. (2007). Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere.
Use generic clustering to create dynamic cells. Creates much better hash functions, but they are very hard to index into quickly. Nistér, D., & Stewénius, H. (2006). Scalable recognition with a vocabulary tree.
The Leech lattice works insanely well in 24D. Some suggest randomly projecting onto 24D and using the lattice as your hash function.
Hamming Distance LSH
There is a nice parallel between using the Binary Feature Descriptors and using Bit-Sampling on {0,1}^p for your "low-dimensional projection" hash function.
01101001101111101101 -> 0111
00101100000101010000 -> 0000
11000101011111001101 -> 0111
01010000110111101010 -> 1010
11011100000110101100 -> 1011
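The example above can be reproduced in a few lines. The sampled bit positions are not stated in the slides; the ones used here (4th, 8th, 13th, and 18th bit) were inferred from the five examples and should be treated as an assumption.

```python
SAMPLED = (3, 7, 12, 17)   # 0-based positions, inferred from the examples above


def bit_sample(descriptor, positions=SAMPLED):
    """Hash a bit-string descriptor by keeping only a few fixed bits."""
    return "".join(descriptor[i] for i in positions)


descriptors = ["01101001101111101101",
               "00101100000101010000",
               "11000101011111001101",
               "01010000110111101010",
               "11011100000110101100"]

buckets = {}
for d in descriptors:
    buckets.setdefault(bit_sample(d), []).append(d)
```

The first and third descriptors land in the same bucket (0111), so they become candidate matches whose full Hamming distance is then worth checking.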
Back to SfM
Compute feature points and descriptors for every image
Use MatchMiner to propose image pairs that are likely to match
To find all matches, build LSH for image A and query features from image B. (cache LSH for later)
Throw out outlier matches based on geometric constraints and other sanity checks.
Build huge non-linear system of equations to solve...
Non-linear solvers
Definitely an entire talk's worth. Short version:
Use Google's CERES solver
https://code.google.com/p/ceres-solver/
Generic non-linear solvers don't exploit the special structure of structure from motion problems. Key insight is that most variables don't interact: each observation involves only one 3D point and one camera's parameters.
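A back-of-the-envelope sketch of how sparse the problem is, using the 10 parameters per camera and 3 per point from earlier in the talk. The function name is made up and this only counts entries; it says nothing about how any particular solver stores them.

```python
def jacobian_sparsity(n_cameras, n_points, n_observations,
                      cam_params=10, pt_params=3):
    """Return (nonzero entries, total entries) of the reprojection Jacobian."""
    rows = 2 * n_observations                       # x and y residual per sighting
    cols = cam_params * n_cameras + pt_params * n_points
    # Each residual depends on exactly one camera and one 3D point.
    nonzero = rows * (cam_params + pt_params)
    return nonzero, rows * cols


# 10 cameras, 1000 points, each point seen in 3 images.
nonzero, total = jacobian_sparsity(10, 1000, 3000)
fill = nonzero / total                              # well under 1% of entries
```

A solver that knows about this block structure can skip the zeros entirely, which is where most of the speedup over a generic solver comes from.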
Applications:
3D scanning people:
`open ~/denseAlexx.ply`
Outdoor spaces:
`open ~/denseArqOffice.ply`
Geolocation:
Thanks
openMVG:
https://github.com/openMVG/openMVG
CERES:
https://code.google.com/p/ceres-solver/
Arqball (computer vision kung-fu):
http://arqspin.com
Knollop (online education search):
http://knollop.com
Michael Holroyd
http://meekohi.com
Structure-from-Motion beCraft - April 2014 Michael Holroyd, Ph.D.