HyperLogLog
Thomas Depierre
@DianaO
Diana Olympos
Twitter :
Github :
When ?
How ?
In BEAM ?
When do I need this ?
- COUNT DISTINCT
- aka cardinality of a set of records
- Naive solution: build the set and size it at the end
- Size(set) = N when Cardinality(set) = N: memory grows with the answer
- Boom memory
- Oh!
- What if my set is distributed ?
Nodes: [1,2,3], [1,4,5,6], [7,8,9] -> Memory boom
Nodes: [1,2,3], [1,4,5,6], [7,8,9,7,7,7] -> Network boom
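A minimal Python sketch of the naive approach, using the three lists from the diagram as hypothetical node contents. The set holds one entry per distinct record, so memory grows with the cardinality, and in the distributed case the per-node answers cannot be combined without moving the sets (or the raw records) to one place.

```python
def naive_count_distinct(records):
    """Build the full set, then size it: memory grows with the cardinality."""
    seen = set()
    for record in records:
        seen.add(record)
    return len(seen)


# Hypothetical contents of three nodes, taken from the diagram.
node_a = [1, 2, 3]
node_b = [1, 4, 5, 6]
node_c = [7, 8, 9, 7, 7, 7]

# Per-node counts cannot simply be summed: the record 1 lives on two nodes.
per_node_sum = sum(naive_count_distinct(n) for n in (node_a, node_b, node_c))

# The exact answer needs every node's set gathered in one place.
exact = len(set(node_a) | set(node_b) | set(node_c))
```

Summing the per-node counts overshoots as soon as one record appears on two nodes, which is why the exact answer forces the sets across the network.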
How ?
Data Sketches
- Just like a real sketch
- Only keep a "shape" of the data
- Depends on the question you ask
- Two steps
- Build a data structure keeping the "shape"
- "add" function
- Build a brain that extracts the information
- "Estimator"
- It is just a complex function
- Usually probabilistic
Draw the rest of the bloody owl
Ex: COUNT
X
add(Y) => X + 1
estimator(X) => X
Shape kept: size
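The COUNT example can be written out as a toy sketch in Python. The class name `SizeSketch` is made up for illustration; the point is only that `add` updates the kept shape and the estimator reads it back.

```python
class SizeSketch:
    """Toy sketch for COUNT: the only shape kept is the size."""

    def __init__(self):
        self.x = 0

    def add(self, _record):
        # The record itself is thrown away; only the running size survives.
        self.x += 1

    def estimate(self):
        # For COUNT the estimator is the identity, and it happens to be exact.
        return self.x


sketch = SizeSketch()
for record in ["a", "b", "a"]:
    sketch.add(record)
```

Note that this counts records, not distinct records: the shape kept decides which questions the sketch can answer.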
HyperLogLog?
- Hashing all the way
- And a lot of bins
- Decide on a precision, p < 64
- 14 for us, it is the most used in the wild
- We get 2**14 bins
- Hash the record into a 64-bit binary
- Please have enough entropy
- Take the first p bits. That is your bin number
- Count the leading zeros of the rest of the bits
- That is the value for the bin
- Take the max of that and the current value of the bin
Hash: 010101010101010101010101010101...
First 14 bits (14, promise): 01010101010101
Rest: 0101010101010101...
Leading zeros: 1
Put Max(1, existing)
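The add step can be sketched in Python as follows. This is an illustration under assumptions, not the code of any particular library: BLAKE2b is an arbitrary choice of 64-bit hash, and the names `hash64`, `split` and `add` are made up. It stores the leading-zero count exactly as described above; many implementations store that count plus one so that an untouched bin can be told apart from a bin that saw zero leading zeros.

```python
import hashlib

P = 14                      # precision
HASH_BITS = 64
REST_BITS = HASH_BITS - P   # bits left after the bin number


def hash64(record: bytes) -> int:
    """64-bit hash of a record (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(hashlib.blake2b(record, digest_size=8).digest(), "big")


def split(h: int):
    """Return (bin number, leading zeros of the remaining bits)."""
    bin_number = h >> REST_BITS                # first P bits
    rest = h & ((1 << REST_BITS) - 1)          # remaining 64 - P bits
    leading_zeros = REST_BITS - rest.bit_length()
    return bin_number, leading_zeros


def add(bins, record: bytes):
    bin_number, zeros = split(hash64(record))
    bins[bin_number] = max(bins[bin_number], zeros)


# The example hash: 01 repeated over 64 bits.
slide_hash = 0x5555555555555555
slide_bin, slide_zeros = split(slide_hash)

bins = [0] * (1 << P)
add(bins, b"some record")
```

For the example hash, `split` picks the bin 01010101010101 and reports one leading zero in the rest.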
Results ?
- Fixed memory size
- 2**P x 6 bits (12 KiB for P = 14)
- Stable relative error
- About 1.04/sqrt(2**P), so roughly 0.8% for P = 14
- the probabilistic estimator becomes... complicated
- don't ask. You don't want to know
- Oh you want to know ?
- You asked for it
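For the curious, here is a hedged Python sketch of one possible estimator: the raw estimator from the original HyperLogLog paper, with a linear-counting fallback for small cardinalities. It is an assumption that this resembles what the talk or the Hyper library uses; real implementations add further bias corrections. That formula expects each bin to hold the leading-zero count plus one, so this sketch stores that. The hash (BLAKE2b) and all names are illustrative choices.

```python
import hashlib
import math

P = 14
M = 1 << P                    # number of bins
REST_BITS = 64 - P


def add(bins, record: bytes):
    h = int.from_bytes(hashlib.blake2b(record, digest_size=8).digest(), "big")
    rest = h & ((1 << REST_BITS) - 1)
    rank = REST_BITS - rest.bit_length() + 1   # leading zeros + 1
    index = h >> REST_BITS
    bins[index] = max(bins[index], rank)


def estimate(bins):
    m = len(bins)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
    # Normalised harmonic mean of 2**bin over all bins.
    raw = alpha * m * m / sum(2.0 ** -b for b in bins)
    empty = bins.count(0)
    if raw <= 2.5 * m and empty > 0:
        # Small cardinalities: fall back to linear counting on the empty bins.
        return m * math.log(m / empty)
    return raw


bins = [0] * M
for i in range(200_000):
    add(bins, str(i).encode())
approx = estimate(bins)
```

With P = 14 the estimate for 200,000 distinct records should typically land within a percent or two of the true value, while the sketch itself never grows beyond its 2**14 bins.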
Distributed ?
- We need to be able to combine the shapes
- It happens that, by picking the shapes carefully, we can
- Ideally while keeping the memory limit
- And with no worse error
- Possible
- We can just combine the bins
- and just put the max of the two bins with the same offset
- same error !
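Combining two sketches can be illustrated in Python. This is a minimal sketch under assumptions: both sides use the same precision and the same hash (BLAKE2b here, an arbitrary choice), and the helper names are made up. Merging bin by bin with max gives exactly the bins that a single node would have built from all the records.

```python
import hashlib

P = 14
REST_BITS = 64 - P


def add(bins, record: bytes):
    h = int.from_bytes(hashlib.blake2b(record, digest_size=8).digest(), "big")
    rest = h & ((1 << REST_BITS) - 1)
    zeros = REST_BITS - rest.bit_length()      # leading zeros of the rest
    index = h >> REST_BITS                     # first P bits
    bins[index] = max(bins[index], zeros)


def build(records):
    bins = [0] * (1 << P)
    for record in records:
        add(bins, record)
    return bins


def merge(left, right):
    """Combine two sketches of the same precision, bin by bin."""
    return [max(a, b) for a, b in zip(left, right)]


# Two hypothetical nodes with one overlapping record.
node_a = [b"1", b"2", b"3"]
node_b = [b"1", b"4", b"5", b"6"]

merged = merge(build(node_a), build(node_b))
single = build(node_a + node_b)
```

Because `merged` and `single` are identical, the merged sketch carries the same error as one built centrally, and only the fixed-size bins ever cross the network.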
In BEAM ?
Hyper !
- Reference implementation
- By GameAnalytics
- Multiple backends
- More or less unmaintained
- Tests not passing on OTP 23+
- Not building with rebar3
- urgh NIFs
- Not on Hex
- No docs
I can fix it !
- Rebar3
- Relatively easy, get rid of NIFs
- Docs
- WIP
- Tests
- Just :rand vs :random right ?
- Weeeeeeellllll .....
Paaaaaaain
- Nope
- The tests were broken all along
- Just a hardcoded seed that passes
- The new random generation exposed the bug
- Aaaargh
- Happens to be fundamental
- Had to rediscover a fundamental property of HLL
- How to reduce precision accurately
- No one else got it right in more than a decade...
Hyper !
- Forked by yours truly
- Rebar3
- Hex
- some docs
- WIP
- https://hex.pm/packages/hyper
- https://github.com/LivewareProblems/hyper
- More coming
- When I have time
Thank You
Questions ?
Hyperloglog in beam
By di4nao