Set-specific Probabilistic Data Structures

SPACE4 — Oct 2021

HyperLogLog and Bloom Filters

HyperLogLog

[Figure: the hash input space mapped onto the hash output space; axis ticks 0, 1 and 0, 2^64]

Hash in base 2: 010101100111100
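To make the "hash in base 2" picture concrete, here is a minimal Python sketch. It assumes BLAKE2b truncated to 8 bytes as the hash function (the slides do not say which hash is used), and `hash64`, `to_base2`, and `leading_zeros` are hypothetical helper names chosen for illustration.

```python
import hashlib


def hash64(item: str) -> int:
    """Map a string to an integer in [0, 2**64). Assumed hash: BLAKE2b, 8-byte digest."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def to_base2(h: int, width: int = 64) -> str:
    """Write a hash value in base 2, zero-padded to a fixed width."""
    return format(h, f"0{width}b")


def leading_zeros(bits: str) -> int:
    """Length of the run of zeros at the start of a bit string."""
    return len(bits) - len(bits.lstrip("0"))


# Equal inputs always produce equal hashes, so duplicates look identical.
for word in ["apple", "banana", "apple"]:
    bits = to_base2(hash64(word))
    print(word, bits, leading_zeros(bits))
```

If the hash mixes well, each bit is equally likely to be 0 or 1, so a run of k leading zeros should show up in roughly one hash out of 2^k. That is the observation HyperLogLog builds on.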

Harmonic mean
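HyperLogLog combines the per-register estimates with a harmonic mean rather than an arithmetic mean, because the harmonic mean is much less sensitive to a single unusually large value. A small sketch with made-up numbers, using the standard library's `statistics` module:

```python
from statistics import harmonic_mean, mean

# Made-up per-register estimates: four agree, one is a large outlier.
estimates = [2, 2, 2, 2, 1024]

arithmetic = mean(estimates)        # pulled far up by the outlier
harmonic = harmonic_mean(estimates) # stays close to the typical value

print("arithmetic mean:", arithmetic)
print("harmonic mean:  ", harmonic)
```

The arithmetic mean lands above 200, while the harmonic mean stays below 3, close to what most of the registers report.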

Code

import math

m = 2 ** b  # number of registers, with b in [4...16]

if m == 16:
    alpha = 0.673
elif m == 32:
    alpha = 0.697
elif m == 64:
    alpha = 0.709
else:
    alpha = 0.7213 / (1 + 1.079 / m)

registers = [0] * m  # initialize m registers to 0

# Construct the HLL structure
for h in hashed(data):

    # index given by the rightmost b bits (0-based, to match the list above)
    register_index = get_register_index(h, b)

    # length of the run of zeroes starting at bit b+1, plus one:
    # the position of the first 1 bit, so a touched register is never 0
    rank = run_of_zeros(h, b) + 1

    registers[register_index] = max(registers[register_index], rank)

# Determine the cardinality
# alpha * m * (harmonic mean of 2**register over all registers)
DV_est = alpha * m ** 2 / sum(2.0 ** -register for register in registers)

if DV_est < 5 / 2 * m:  # small range correction

    # the number of registers equal to zero
    V = count_of_zero_registers(registers)

    if V == 0:  # if none of the registers are empty, use the HLL estimate
        DV = DV_est
    else:
        DV = m * math.log(m / V)  # i.e. balls and bins correction

elif DV_est <= 1 / 30 * 2 ** 32:  # intermediate range, no correction
    DV = DV_est
else:  # large range correction (assumes a 32-bit hash)
    DV = -(2 ** 32) * math.log(1 - DV_est / 2 ** 32)
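The slide's code leaves its helpers undefined, so here is a self-contained sketch of the same steps that can be run as is. It is an illustration under assumptions, not the talk's demo code: the 32-bit hash is taken from the first 4 bytes of SHA-1 (an arbitrary choice), and `hash32` and `estimate_cardinality` are hypothetical names.

```python
import hashlib
import math


def hash32(item: str) -> int:
    """32-bit hash of a string. Assumed choice: first 4 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:4], "big")


def estimate_cardinality(items, b: int = 10) -> float:
    """Estimate the number of distinct items using m = 2**b registers."""
    m = 1 << b
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)

    registers = [0] * m
    width = 32 - b  # bits left over after the register index

    for item in items:
        h = hash32(item)
        index = h & (m - 1)  # rightmost b bits pick the register
        rest = h >> b        # remaining bits, starting at bit b+1
        if rest == 0:
            rank = width + 1  # every remaining bit was zero
        else:
            # 1-based position of the lowest set bit = run of zeros + 1
            rank = (rest & -rest).bit_length()
        registers[index] = max(registers[index], rank)

    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    if raw < 2.5 * m:  # small range correction
        zeros = registers.count(0)
        return m * math.log(m / zeros) if zeros else raw
    if raw <= (1 << 32) / 30:  # intermediate range, no correction
        return raw
    return -(1 << 32) * math.log(1 - raw / (1 << 32))  # large range correction


items = [f"user-{i}" for i in range(10_000)]
print(round(estimate_cardinality(items)))
# Duplicates leave every register unchanged, so the estimate is identical.
print(round(estimate_cardinality(items * 3)))
```

With b = 10 the structure keeps 1024 small registers, and the expected relative error is about 1.04 / sqrt(m), roughly 3%. Raising b trades memory for accuracy.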

DEMO

Bloom Filters

HyperLogLog
Cardinality estimation for a set

Bloom Filters
Membership estimation for a set

Bloom Filter result

• Definitely no

• Probably yes
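The two possible answers fall out of how a Bloom filter stores items. A minimal sketch, assuming a plain list of booleans as the bit array and SHA-256 with a seed prefix to derive several hash positions; the class name, method names, and default sizes are all illustrative choices, not taken from the talk.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch: k hash positions per item over a fixed bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False: at least one bit is unset, so the item was never added.
        # True: all bits are set, possibly by other items (false positive).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("apple")
bf.add("banana")
print(bf.might_contain("apple"))   # added, so always True
print(bf.might_contain("cherry"))  # never added; True only on a false positive
```

A False answer is "definitely no" because adding an item always sets all of its bits. A True answer is only "probably yes" because other items may have set those same bits.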

END

THANK YOU

FOR LISTENING

Set-specific Probabilistic Data Structures Oct 2021

By sirodoht
