Jon Cantwell
Stuff & Things. Mostly devtricks slides for synacor.
And Other Probabilistic Data Structure Friends
Find the number of distinct elements in a data stream where elements repeat
e.g. IP addresses passing through a router, unique visitors to a website
Naive solution (remember every element seen) needs memory proportional to the number of distinct elements, so it does not scale well
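For reference, the naive approach can be sketched in a few lines. This is only an illustration: the function name `count_distinct_exact` and the sample addresses are made up for the example, not taken from the slides.

```python
def count_distinct_exact(stream):
    """Exact distinct count: remembers every element it has seen."""
    seen = set()
    for item in stream:
        seen.add(item)
    # Memory grows with the number of distinct elements, not the stream length.
    return len(seen)


# Hypothetical sample data: a few repeating IP addresses.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_distinct_exact(ips))  # 3
```

The answer is exact, but the set holds one entry per distinct element, which is the cost the rest of the deck tries to avoid.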
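Estimate the # of times a coin was flipped from the length of the longest run of heads or tails: a longest run of k suggests on the order of 2^k flips
Hash the incoming data so its bits behave like coin flips, then look for runs of zeroes; a repeated element hashes to the same bits, so repeats do not change the longest run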
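A minimal sketch of the single-estimate version of this idea follows. It assumes SHA-1 truncated to 64 bits as the hash and counts zeroes from the low-order end; both choices, and the names `hash64`, `trailing_zeros`, and `crude_estimate`, are assumptions made for the example.

```python
import hashlib


def hash64(item):
    """Map an item to a 64-bit integer (assumed hash: first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def trailing_zeros(value, width=64):
    """Length of the run of zero bits at the low-order end of value."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def crude_estimate(stream):
    """Guess the distinct count as 2^k, where k is the longest zero run seen."""
    longest = 0
    for item in stream:
        longest = max(longest, trailing_zeros(hash64(item)))
    return 2 ** longest
```

The only state kept is one small integer, but the result is always a power of two and a single lucky hash can throw it off badly, which is why one estimate on its own is not enough.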
A single estimate is noisy, so bucket many estimates and take the harmonic mean of the results
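First N bits of hash determine bucket, the remaining bits are searched for the run
e.g. 01001001011000101 -> bucket 5 (reading from the right, the first 3 bits are 101), then a run of 3 zeroes before the next 1, so store 4: if buckets[5] < 4, buckets[5] = 4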
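The bucketing steps above can be sketched as follows. The sketch reads the hash from the low-order end, which is the reading that reproduces the bucket 5 / value 4 example. The bias constant and the small-range correction follow the published HyperLogLog estimator and are not from the slides; the SHA-1 hash, the default of 10 index bits, and all function names are assumptions.

```python
import hashlib
import math


def hash64(item):
    """Map an item to a 64-bit integer (assumed hash: first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_and_rank(h, index_bits, width=64):
    """Split a hash into (bucket index, zero-run length + 1)."""
    bucket = h & ((1 << index_bits) - 1)   # low-order bits pick the bucket
    rest = h >> index_bits                 # remaining bits hold the run
    if rest == 0:
        return bucket, width - index_bits + 1
    # Position of the lowest set bit, i.e. trailing zeroes + 1.
    return bucket, (rest & -rest).bit_length()


def estimate_distinct(stream, index_bits=10):
    """Estimate the distinct count from per-bucket maxima (index_bits >= 7)."""
    m = 1 << index_bits
    buckets = [0] * m
    for item in stream:
        bucket, rank = bucket_and_rank(hash64(item), index_bits)
        if buckets[bucket] < rank:
            buckets[bucket] = rank
    # Bias constant from the HyperLogLog analysis, valid for m >= 128.
    alpha = 0.7213 / (1 + 1.079 / m)
    # Normalized harmonic mean of the per-bucket estimates 2^rank.
    raw = alpha * m * m / sum(2.0 ** -r for r in buckets)
    empty = buckets.count(0)
    if raw <= 2.5 * m and empty > 0:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / empty)
    return raw


# The example from the slide: 3 index bits, 17-bit hash.
print(bucket_and_rank(0b01001001011000101, 3, width=17))  # (5, 4)
```

With 1024 buckets the state is 1024 small integers regardless of stream size, and the expected relative error is roughly 1.04 / sqrt(1024), about 3%.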