Set-specific Probabilistic Data Structures

PyThess Jan 2019

HyperLogLog and Bloom Filters

HyperLogLog

Philippe Flajolet (born 1 December 1948)

Five concepts

Hashing

Ranking

Stochastic Averaging

Harmonic mean

Small and large range correction

[Figure: the hash input space mapped onto the hash output space, with axes labeled 0 to 1 and 0 to 2^64]

Hash in base 2

010101100111100
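
A minimal sketch of this step, assuming SHA-256 truncated to 64 bits stands in for the hash function (the slides do not name one). The helper names `hash64` and `to_bits` are hypothetical.

```python
import hashlib

def hash64(item: str) -> int:
    """Map an item to a 64-bit integer (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def to_bits(value: int, width: int = 64) -> str:
    """Render the hash as a fixed-width base-2 string."""
    return format(value, f"0{width}b")

bits = to_bits(hash64("pythess"))
print(bits)  # a string of 64 zeros and ones
```

Equal inputs always give the same bit string, which is why duplicates do not change the structure built on top of it.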

Ranking

410987 = 4 x 10^5 + change
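
Just as the position of the leading digit gives the order of magnitude of a decimal number, the position of the first 1-bit gives the order of magnitude of a hash. A small sketch, assuming the rank is counted from the left of a fixed-width bit string; the function name `rank` is hypothetical.

```python
def rank(bits: str) -> int:
    """1-based position of the leftmost 1-bit; len(bits) + 1 if all zeros."""
    first_one = bits.find("1")
    return len(bits) + 1 if first_one == -1 else first_one + 1

print(rank("010101100111100"))  # 2
print(rank("000100"))           # 4
```

For uniformly random bits, a rank of k occurs with probability 2^-k, so a large maximum rank hints that many distinct items were hashed.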

Stochastic Averaging
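
One way to picture stochastic averaging: instead of running many hash functions, a single hash is split in two, with a few bits choosing a register and the rest used for ranking. A minimal sketch, assuming the lowest b bits pick the register; `split_hash` is a hypothetical name.

```python
def split_hash(h: int, b: int):
    """Return (register index, remaining bits) for an integer hash."""
    index = h & ((1 << b) - 1)  # the lowest b bits pick one of 2**b registers
    rest = h >> b               # the remaining bits are used for ranking
    return index, rest

index, rest = split_hash(0b10110110, b=4)
print(index, rest)  # 6 11
```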

Harmonic mean

vs

the geometric mean
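
A small illustration of why the choice of mean matters, using made-up per-register estimates in which one register saw an unusually long run of zeros. The values are invented for the example.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical per-register estimates (2 ** rank); the last one is an outlier.
estimates = [4, 8, 8, 1024]

arith = mean(estimates)
geo = geometric_mean(estimates)
harm = harmonic_mean(estimates)

print(arith, geo, harm)
```

The harmonic mean stays close to the typical values, while the geometric mean and especially the arithmetic mean are pulled upward by the single outlier.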

Code

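import hashlib
from math import log

def hash32(item):
    # Assumption: SHA-256 truncated to 32 bits stands in for the hash function.
    # 32 bits matches the 2^32 constant in the large range correction below.
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def hll_estimate(data, b=10):
    m = 2 ** b  # number of registers, with b in [4...16]

    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)

    registers = [0] * m  # initialize m registers to 0

    # Construct the HLL structure
    for item in data:
        h = hash32(item)

        # the rightmost b bits address a register (0-based, as Python lists are)
        register_index = h & (m - 1)

        # run of zeroes starting at bit b+1, plus one
        # (i.e. the 1-based position of the first 1-bit in the remaining bits)
        rest = h >> b
        rank = (rest & -rest).bit_length() if rest else 32 - b + 1

        registers[register_index] = max(registers[register_index], rank)

    # Determine the cardinality
    DV_est = alpha * m ** 2 / sum(2.0 ** -r for r in registers)  # the DV estimate

    if DV_est < 5 / 2 * m:  # small range correction
        # the number of registers equal to zero
        V = registers.count(0)
        if V == 0:  # if none of the registers are empty, use the HLL estimate
            DV = DV_est
        else:
            DV = m * log(m / V)  # i.e. balls and bins correction
    elif DV_est <= 1 / 30 * 2 ** 32:  # intermediate range, no correction
        DV = DV_est
    else:  # large range correction
        DV = -(2 ** 32) * log(1 - DV_est / 2 ** 32)

    return DV

print(round(hll_estimate(range(100_000))))  # an estimate of the 100000 distinct items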
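
For reference, the same estimate can be written as a formula. The symbols are chosen here for illustration: M_j is the value of register j, V the number of registers equal to zero, and the cases are checked in order.

```latex
E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1},
\qquad
DV =
\begin{cases}
m \ln(m / V) & \text{if } E < \tfrac{5}{2} m \text{ and } V > 0, \\
E & \text{if } E \le \tfrac{1}{30} \, 2^{32}, \\
-2^{32} \ln\!\left( 1 - E / 2^{32} \right) & \text{if } E > \tfrac{1}{30} \, 2^{32}.
\end{cases}
```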

DEMO

Bloom Filters

Burton Howard Bloom

HyperLogLog
Cardinality estimation for a set

Bloom Filters
Membership estimation for a set

Bloom Filter result

• Definitely no

• Probably yes
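
The two possible answers can be seen in a minimal Bloom filter sketch. It assumes a plain list of booleans as the bit array and salted SHA-256 as the family of hash functions; the class and method names are hypothetical, not taken from the demo.

```python
import hashlib

class BloomFilter:
    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Assumption: one salted SHA-256 per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # Any unset bit means "definitely no"; all set means "probably yes".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["athens", "thessaloniki"]:
    bf.add(word)

print(bf.might_contain("thessaloniki"))  # True: probably yes
print(bf.might_contain("patras"))        # False unless a false positive occurs
```

Items that were added are always reported as present; only the "yes" answer can be wrong.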

DEMO

An improvement on Bloom Filters

  • Bin Fan
  • David G. Andersen
  • Michael Kaminsky
  • Michael D. Mitzenmacher

Cuckoo Filters

Some differences from Bloom filters:

Ability to add and remove items dynamically

Bounded false positive probability
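
A rough sketch of how adding and removing can work, assuming 8-bit fingerprints, buckets of four slots, and a power-of-two number of buckets so that the alternate bucket can be found with XOR. All names and parameters are illustrative choices, not a reference implementation.

```python
import hashlib
import random

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooFilter:
    def __init__(self, num_buckets: int = 64, bucket_size: int = 4, max_kicks: int = 500):
        assert num_buckets & (num_buckets - 1) == 0, "num_buckets must be a power of two"
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _candidates(self, item: str):
        fp = _h(b"fp:" + item.encode("utf-8")) % 255 + 1  # fingerprint in 1..255
        i1 = _h(item.encode("utf-8")) % self.num_buckets
        return fp, i1, self._alt_index(i1, fp)

    def _alt_index(self, index: int, fp: int) -> int:
        # XOR with the fingerprint's hash; applying it twice returns the start.
        return (index ^ _h(bytes([fp]))) % self.num_buckets

    def add(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            # Evict a random fingerprint and move it to its alternate bucket.
            victim = random.randrange(self.bucket_size)
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # considered full; the last evicted fingerprint is dropped

    def might_contain(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def remove(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.add("thessaloniki")
print(cf.might_contain("thessaloniki"))  # True
cf.remove("thessaloniki")
print(cf.might_contain("thessaloniki"))  # False
```

Because each item leaves a fingerprint in a specific bucket, it can be deleted again, which a plain Bloom filter cannot do.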

END

Set-specific Probabilistic Data Structures: HyperLogLog and Bloom Filters

By sirodoht
