HyperLogLog

Thomas Depierre

@DianaO

Diana Olympos

 

Twitter :

Github :

When ?
How ?

In BEAM ?

When do I need this ?

  • COUNT DISTINCT
    • aka the cardinality of a set of records
  • Naive solution: build the set and take its size at the end
    • Size(set) = N when Cardinality(set) = N: memory grows linearly with the number of distinct records
    • Boom, memory
    • Oh!
  • What if my set is distributed ?

[Figure: three nodes holding [1,2,3], [1,4,5,6] and [7,8,9]. Label: Memory boom]

[Figure: three nodes holding [1,2,3], [1,4,5,6] and [7,8,9,7,7,7]. Label: Network boom]
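
As a baseline, the naive solution can be sketched in Python as below. The node lists are the ones from the slides; the function name `naive_count_distinct` is only illustrative. Every record has to reach the node doing the count, and every distinct record stays in memory there.

```python
def naive_count_distinct(nodes):
    """Exact COUNT DISTINCT: build the full set, then take its size."""
    seen = set()
    for records in nodes:      # every record travels to the node doing the count
        seen.update(records)   # memory grows with the number of distinct records
    return len(seen)

nodes = [[1, 2, 3], [1, 4, 5, 6], [7, 8, 9, 7, 7, 7]]
print(naive_count_distinct(nodes))  # 9
```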

How ?

Data Sketches

  • Just like a real sketch
    • Only keep a "shape" of the data
    • Depends on the question you ask
  • Two steps
    • Build a data structure keeping the "shape"
      • "add" function
    • Build a brain that extracts the information
      • "Estimator"
      • It is just a complex function
      • Usually probabilistic

Draw the rest of the bloody Howl

Ex: COUNT

  • State: X
  • add(Y) => X + 1
  • estimator(X) => X
  • Shape kept: size
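
The COUNT example maps onto the two steps like this. It is a toy sketch in Python; the class name `CountSketch` is made up for illustration. The only state kept is X, and the estimator is the identity.

```python
class CountSketch:
    """Toy data sketch: the only 'shape' kept is the number of records seen."""

    def __init__(self):
        self.x = 0

    def add(self, _record):
        # add(Y) => X + 1: the record itself is never stored
        self.x += 1

    def estimator(self):
        # estimator(X) => X
        return self.x


sketch = CountSketch()
for record in ["a", "b", "a"]:
    sketch.add(record)
print(sketch.estimator())  # 3
```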

HyperLogLog?

  • Hashing all the way
    • And a lot of bins
  • Decide on a precision, p < 64
    • 14 for us, it is the most used in the wild
    • We get 2**14 bins
  • Hash the record into a 64-bit binary
    • Please have enough entropy
  • Take the first p bits. That is your bin number
  • Count the leading zeros of the rest of the bits
    • That is the value of the bin
    • Take the Max of that and the current value of the bin

Example:

  Hash:           010101010101010101010101010101...
  First 14 bits:  01010101010101 (14. Promise)
  Rest:           0101010101010101...
  Leading zeros:  1
  Bin update:     Put Max(1, existing)
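
The steps above can be sketched in Python as follows. Assumptions that are not from the slides: the hash is the first 8 bytes of SHA-256, and the helper names (`hash64`, `split_hash`, `add`) are illustrative. The bin value is the plain leading-zero count, as on the slide; the original HyperLogLog paper stores that count plus one, so that 0 can mean "empty bin".

```python
import hashlib

P = 14                 # precision from the slides
BINS = 1 << P          # 2**14 bins
REST_BITS = 64 - P     # bits left after the bin number


def hash64(record: bytes) -> int:
    """64-bit hash of a record (assumption: first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(record).digest()[:8], "big")


def split_hash(h: int):
    """Return (bin number, leading zeros of the rest) for a 64-bit hash."""
    index = h >> REST_BITS                   # first P bits
    rest = h & ((1 << REST_BITS) - 1)        # remaining 64 - P bits
    zeros = REST_BITS - rest.bit_length()    # leading zeros within REST_BITS
    return index, zeros


def add(bins, record: bytes) -> None:
    """Keep the max of the new value and the current value of the bin."""
    index, zeros = split_hash(hash64(record))
    bins[index] = max(bins[index], zeros)


bins = [0] * BINS
add(bins, b"some record")

# The 0101... hash from the slide: bin 0b01010101010101, one leading zero.
print(split_hash(0x5555555555555555))  # (5461, 1)
```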

Results ?

  • Fixed memory size
    • 2**P x 6 bits (one 6-bit value per bin, about 12 KiB for P = 14)
  • Stable relative error
    • about 1.04/sqrt(2**P)
  • the probabilistic estimator becomes... complicated
    • don't ask. You don't want to know
    • Oh you want to know ?
    • You asked for it
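
For the curious, here is a minimal sketch of the raw estimator from the original HyperLogLog paper (Flajolet et al., 2007), without its small-range and large-range corrections. Assumptions: the hash is SHA-256 truncated to 64 bits, each bin stores the leading-zero count plus one (the paper's convention, so that 0 means an empty bin), and all function names are illustrative. This is not the code of any particular library.

```python
import hashlib
import math

P = 14
M = 1 << P              # number of bins
REST_BITS = 64 - P


def add(bins, record: bytes) -> None:
    h = int.from_bytes(hashlib.sha256(record).digest()[:8], "big")
    rest = h & ((1 << REST_BITS) - 1)
    rank = REST_BITS - rest.bit_length() + 1   # leading zeros + 1
    index = h >> REST_BITS
    bins[index] = max(bins[index], rank)


def raw_estimate(bins) -> float:
    """Normalised harmonic mean of 2**value over all bins."""
    alpha = 0.7213 / (1 + 1.079 / M)           # bias constant, valid for M >= 128
    return alpha * M * M / sum(2.0 ** -v for v in bins)


def relative_error(p: int) -> float:
    """Standard error of the estimate, about 1.04 / sqrt(2**p)."""
    return 1.04 / math.sqrt(1 << p)


bins = [0] * M
for i in range(200_000):
    add(bins, str(i).encode())

print(round(raw_estimate(bins)), "estimated for 200000 distinct records")
print(f"expected relative error: {relative_error(P):.4%}")
```

The raw formula is only reliable when the cardinality is well above the number of bins; the corrections left out here handle the small and very large cases.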

Distributed ?

  • We need to be able to combine the shapes
    • It happens that, by picking the right shapes, we can
    • Ideally while keeping the memory limit
    • And with no worse error
  • Possible
    • We can just combine the bins
      • just put the max of the two bins with the same offset
      • same error !
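
A minimal sketch of that merge in Python, using the node contents from the earlier slides. Assumptions: both sketches use the same precision and the same hash (here SHA-256 truncated to 64 bits), and the helper names are illustrative. Merging per-node sketches gives exactly the bins you would get by feeding every record into a single sketch.

```python
import hashlib

P = 14
REST_BITS = 64 - P


def new_bins():
    return [0] * (1 << P)


def add(bins, record: bytes) -> None:
    h = int.from_bytes(hashlib.sha256(record).digest()[:8], "big")
    rest = h & ((1 << REST_BITS) - 1)
    zeros = REST_BITS - rest.bit_length()      # leading zeros of the rest
    index = h >> REST_BITS                     # first P bits
    bins[index] = max(bins[index], zeros)


def merge(left, right):
    """Combine two sketches: max of the two bins with the same offset."""
    return [max(a, b) for a, b in zip(left, right)]


nodes = [[1, 2, 3], [1, 4, 5, 6], [7, 8, 9, 7, 7, 7]]
merged = new_bins()   # built by merging one sketch per node
single = new_bins()   # built as if all records were on one node
for records in nodes:
    local = new_bins()
    for r in records:
        add(local, str(r).encode())
        add(single, str(r).encode())
    merged = merge(merged, local)

print(merged == single)  # True
```

Only the fixed-size bins cross the network, never the records themselves.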

In BEAM ?

Hyper !

  • Reference implementation
    • By GameAnalytics
    • Multiple backends
  • More or less unmaintained
    • Tests not passing on OTP 23+
    • Not building with rebar3
      • urgh NIFs
    • Not on Hex
    • No docs

I can fix it !

  • Rebar3
    • Relatively easy, get rid of NIFs
  • Docs
    • WIP
  • Tests
    • Just :rand vs :random right ?
    • Weeeeeeellllll .....

Paaaaaaain

  • Nope
    • The tests were broken all along
    • Just a hardcoded seed that happened to pass
    • The new random number generation exposed the bug
  • Aaaargh
    • Happens to be fundamental
    • Had to rediscover a fundamental property of HLL
      • How to reduce precision accurately
    • No one else got it right in more than a decade...

Hyper !

Thank You

Questions ?