PyThess Jan 2019

# HyperLogLog

## Five concepts

• Hashing
• Ranking
• Stochastic averaging
• Harmonic mean
• Small and large range correction

[Figure: the hash input space mapped onto the hash output space, 0 to 2^64; example bit string 010101100111100]
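As a rough illustration of the hashing concept, the sketch below maps arbitrary strings to integers in the 0 to 2^64 output space. The slides do not name a hash function, so the use of SHA-1 truncated to 8 bytes and the helper name `hash64` are assumptions made for this example only.

```python
import hashlib


def hash64(item: str) -> int:
    """Map a string to an integer in [0, 2**64).

    SHA-1 truncated to 8 bytes is an assumed choice, not the talk's.
    """
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


h = hash64("pythess")
bits = format(h, "064b")  # fixed-width binary view of the hash
print(bits)
```

Equal inputs always produce the same bits, which is why duplicates in a stream do not change a HyperLogLog sketch.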

## Ranking

410987 = 4 x 10^5 + change
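The decimal example reads off an order of magnitude from the position of the leading digit. A binary version of the same idea can be sketched as below: the rank of a value is the 1-based position of its leftmost 1-bit. The function name `rank` and the convention for an all-zero value are assumptions for illustration.

```python
def rank(value: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a `width`-bit value.

    An all-zero value gets rank width + 1 (an assumed convention).
    """
    return width - value.bit_length() + 1


print(rank(0b00010110, 8))  # 4: three leading zeros, then the first 1
print(rank(0b10000000, 8))  # 1: the very first bit is set
```

For uniformly random bits, a rank of r shows up with probability 2^-r, so a large observed rank hints that many distinct values have been hashed.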

## Code

```python
b = 10      # example value; any b in [4, 16] works
m = 2 ** b  # number of registers

# bias-correction constant
if m == 16:
    alpha = 0.673
elif m == 32:
    alpha = 0.697
elif m == 64:
    alpha = 0.709
else:
    alpha = 0.7213 / (1 + 1.079 / m)

registers = [0] * m  # initialize m registers to 0
```
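```python
# Construct the HLL structure
# hashed, get_register_index and run_of_zeros are helpers assumed to be defined elsewhere
for h in hashed(data):
    # binary address of the rightmost b bits, in [0, m);
    # Python lists are 0-indexed, so no "1 +" offset is needed
    register_index = get_register_index(h, b)

    # length of the run of zeroes starting at bit b+1
    run_length = run_of_zeros(h, b)

    registers[register_index] = max(registers[register_index], run_length)
```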
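The construction loop relies on three helpers that the slides leave undefined. One possible way to write them is sketched here. The hash function (SHA-1 truncated to 32 bits, chosen to line up with the 2^32 constants in the estimation step), the constant `HASH_BITS`, and the choice to count bits from the right are all assumptions for this example.

```python
import hashlib

HASH_BITS = 32  # assumed hash width


def hashed(data):
    """Yield a HASH_BITS-wide integer hash for every item in data."""
    for item in data:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        yield int.from_bytes(digest[: HASH_BITS // 8], "big")


def get_register_index(h: int, b: int) -> int:
    """Value of the rightmost b bits of h, in [0, 2**b)."""
    return h & ((1 << b) - 1)


def run_of_zeros(h: int, b: int) -> int:
    """Length of the run of zeros starting at bit b+1, counted from the right."""
    rest = h >> b  # drop the b index bits
    if rest == 0:
        return HASH_BITS - b  # every remaining bit is zero
    lowest_one = rest & -rest  # isolates the lowest set bit
    return lowest_one.bit_length() - 1


print(get_register_index(0b10000110, 4))  # 6, the low four bits 0110
print(run_of_zeros(0b10000110, 4))        # 3, the remaining bits are 1000
```

The original HyperLogLog paper stores the position of the first 1-bit, which is this run length plus one, so a register that has been touched is never left at 0.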
```python
from math import log

# Determine the cardinality
DV_est = alpha * m**2 / sum(2.0 ** -register for register in registers)  # the DV estimate

if DV_est < 5 / 2 * m:  # small range correction
    # the number of registers equal to zero
    V = registers.count(0)

    if V == 0:  # if none of the registers are empty, use the HLL estimate
        DV = DV_est
    else:
        DV = m * log(m / V)  # i.e. balls and bins correction
elif DV_est <= 1 / 30 * 2**32:  # intermediate range, no correction
    DV = DV_est
else:  # large range correction
    DV = -(2**32) * log(1 - DV_est / 2**32)
```
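Putting the setup, construction, and estimation steps together, a compact end-to-end version might look like the sketch below. The class name `HyperLogLog`, the 32-bit SHA-1 based hash, and storing the rank as trailing zeros plus one are assumptions of this example rather than details given on the slides.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b: int = 10):
        if not 4 <= b <= 16:
            raise ValueError("b must be between 4 and 16")
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # bias-correction constant, same values as on the slide
        self.alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(
            self.m, 0.7213 / (1 + 1.079 / self.m)
        )

    def add(self, item) -> None:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash (assumed)
        index = h & (self.m - 1)               # rightmost b bits
        rest = h >> self.b                     # remaining 32 - b bits
        width = 32 - self.b
        # rank = trailing zeros of rest, plus one (assumed convention)
        rank = (rest & -rest).bit_length() if rest else width + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        # harmonic-mean based raw estimate
        raw = self.alpha * self.m**2 / sum(2.0 ** -r for r in self.registers)
        if raw < 2.5 * self.m:  # small range correction
            zeros = self.registers.count(0)
            return self.m * math.log(self.m / zeros) if zeros else raw
        if raw <= 2**32 / 30:   # intermediate range, no correction
            return raw
        return -(2**32) * math.log(1 - raw / 2**32)  # large range correction


hll = HyperLogLog(b=10)
for i in range(100_000):
    hll.add(i)
print(round(hll.count()))  # close to 100000
```

With m registers the typical relative error is about 1.04 / sqrt(m), roughly 3% for b = 10, so the printed value varies around the true count rather than matching it exactly.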

# An improvement on Bloom Filters

• Bin Fan
• David G. Andersen
• Michael Kaminsky
• Michael D. Mitzenmacher

# Cuckoo Filters

Some differences from Bloom filters:

• Ability to add and remove items dynamically
• Bounded false positive probability
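To make the add-and-remove point concrete, here is a minimal sketch of a cuckoo filter based on partial-key cuckoo hashing: each item is reduced to a short fingerprint that can live in one of two buckets. The class name, the bucket count, the 8-bit fingerprint, the eviction limit, and the SHA-1 based hashing are all assumptions chosen for illustration, not tuned values.

```python
import hashlib
import random


class CuckooFilter:
    def __init__(self, num_buckets: int = 1024, bucket_size: int = 4,
                 max_kicks: int = 500):
        # a power-of-two bucket count keeps the XOR step below in range
        if num_buckets < 1 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # 8-bit fingerprint in 1..255
        return self._hash(b"fp:" + item.encode("utf-8")) % 255 + 1

    def _alt(self, index: int, fp: int) -> int:
        # the other candidate bucket; applying this twice returns index
        return (index ^ self._hash(bytes([fp]))) % self.num_buckets

    def _indexes(self, item: str, fp: int):
        i1 = self._hash(item.encode("utf-8")) % self.num_buckets
        return i1, self._alt(i1, fp)

    def add(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # both buckets are full: evict a resident fingerprint and relocate it
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # treated as full; the last evicted fingerprint is dropped

    def __contains__(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def remove(self, item: str) -> bool:
        fp = self._fingerprint(item)
        for i in self._indexes(item, fp):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


cf = CuckooFilter()
cf.add("apple")
print("apple" in cf)  # True
cf.remove("apple")
print("apple" in cf)  # False
```

Removal works because the filter stores fingerprints rather than setting shared bits, so deleting one copy of a fingerprint does not disturb other items. Only items that were actually added should be removed, otherwise another item's matching fingerprint could be deleted.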

By sirodoht
