Set-specific Probabilistic Data Structures
SPACE4 — Oct 2021
HyperLogLog and Bloom Filters
HyperLogLog
[Figure: hash input space mapped to hash output space; number lines labelled 0 to 1 and 0 to 2^64]
Hash in base 2: 010101100111100
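A minimal sketch of reading a hash in base 2 and splitting it the way HyperLogLog does. The hash function here is an assumption (SHA-256 truncated to 64 bits), as are b = 4 and the helper names `hash64` and `split_hash`, which are illustrative and not from the talk.

```python
import hashlib


def hash64(value: str) -> int:
    """Map a string to a 64-bit integer (assumption: SHA-256 truncated to 8 bytes)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_hash(h: int, b: int):
    """Split a hash into a register index (rightmost b bits) and the remaining bits."""
    index = h & ((1 << b) - 1)
    rest = h >> b
    return index, rest


h = hash64("example")
bits = format(h, "064b")  # the hash written in base 2, zero-padded to 64 digits
index, rest = split_hash(h, b=4)

print(bits)
print(index, format(rest, "060b"))
```

The rightmost b bits choose a register; the remaining bits are the ones scanned for a run of zeros.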
Harmonic mean
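HyperLogLog combines its per-register estimates with a harmonic mean rather than an arithmetic mean, so that one register with an unusually long run of zeros does not dominate the result. A small sketch with made-up register values (the numbers are illustrative only):

```python
from statistics import harmonic_mean, mean

# Hypothetical register values; each register r contributes an estimate of 2**r.
registers = [3, 4, 3, 5, 4, 12]  # the 12 is an outlier
estimates = [2 ** r for r in registers]

arithmetic = mean(estimates)          # pulled far upwards by the outlier
harmonic = harmonic_mean(estimates)   # stays close to the typical values

print(arithmetic)
print(harmonic)
```

The arithmetic mean lands in the hundreds, while the harmonic mean stays near the typical per-register estimates of 8 to 32.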
Code
from math import log

m = 2 ** b  # number of registers, with b in [4...16]

if m == 16:
    alpha = 0.673
elif m == 32:
    alpha = 0.697
elif m == 64:
    alpha = 0.709
else:
    alpha = 0.7213 / (1 + 1.079 / m)

registers = [0] * m  # initialize m registers to 0

# Construct the HLL structure
for h in hashed(data):
    # binary address of the rightmost b bits (0-based, to match the Python list)
    register_index = get_register_index(h, b)
    # length of the run of zeroes starting at bit b+1, plus one:
    # each register stores the position of the first 1-bit
    run_length = run_of_zeros(h, b) + 1
    registers[register_index] = max(registers[register_index], run_length)

# Determine the cardinality
DV_est = alpha * m ** 2 / sum(2 ** -register for register in registers)  # the raw DV estimate

if DV_est < 5 / 2 * m:  # small range correction
    # the number of registers equal to zero
    V = count_of_zero_registers(registers)
    if V == 0:
        # if none of the registers are empty, use the HLL estimate
        DV = DV_est
    else:
        DV = m * log(m / V)  # i.e. balls and bins correction
elif DV_est <= 1 / 30 * 2 ** 32:  # intermediate range, no correction
    DV = DV_est
else:  # large range correction (assumes 32-bit hashes)
    DV = -(2 ** 32) * log(1 - DV_est / 2 ** 32)
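The pseudocode above leaves its helpers undefined, so here is a self-contained sketch that can be run as is. It rests on several assumptions: the hash is SHA-256 truncated to 32 bits, registers are 0-indexed, each register stores the zero-run length plus one (the position of the first 1-bit), and the three range corrections are applied as mutually exclusive branches. The names `hash32` and `hll_estimate` are illustrative.

```python
import hashlib
import math


def hash32(value) -> int:
    """32-bit hash (assumption: SHA-256 truncated to 4 bytes)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def hll_estimate(items, b: int = 10) -> float:
    """Estimate the number of distinct items using m = 2**b registers."""
    m = 1 << b
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)

    registers = [0] * m
    width = 32 - b  # bits left after removing the index bits
    for item in items:
        h = hash32(item)
        index = h & (m - 1)  # rightmost b bits select the register
        rest = h >> b        # remaining bits, scanned from the low end
        if rest == 0:
            rank = width + 1
        else:
            # 1-based position of the lowest set bit = zero-run length + 1
            rank = (rest & -rest).bit_length()
        registers[index] = max(registers[index], rank)

    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    if raw < 2.5 * m:  # small range: fall back to linear counting
        zeros = registers.count(0)
        return m * math.log(m / zeros) if zeros else raw
    if raw <= (1 << 32) / 30:  # intermediate range: no correction
        return raw
    return -(1 << 32) * math.log(1 - raw / (1 << 32))  # large range


print(hll_estimate(range(10_000)))
print(hll_estimate(list(range(1_000)) * 3))  # duplicates do not change the estimate
```

With b = 10 the structure uses 1024 small registers regardless of input size, and the expected relative error is roughly 1.04 / sqrt(m), about 3%.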
DEMO
Bloom Filters
HyperLogLog
Cardinality estimation for a set
Bloom Filters
Membership estimation for a set
• Definitely no
• Probably yes
Bloom Filter result
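The two possible answers can be seen in a minimal Bloom filter sketch. The sizes (1024 bits, 3 hash functions), the salted SHA-256 hashing scheme, and the class and method names are all assumptions chosen for illustration, not details from the talk.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k salted hashes over a fixed-size bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely no"; True means "probably yes".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice")
bf.add("bob")

print(bf.might_contain("alice"))  # an added item always answers True
print(bf.might_contain("carol"))  # never added: False unless all 3 bits collide
```

Bits are only ever set, never cleared, which is why a negative answer is certain while a positive answer can be a false positive.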
END
THANK YOU
FOR LISTENING
Set-specific Probabilistic Data Structures Oct 2021
By sirodoht