# Agenda

• Why use probabilistic data structures
• 3 problems solved using probabilistic data structures

• 1GB
• 1TB
• 1PB
• 1EB

# Big data

1980

1GB = 250 000 $, 455 kg

https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380

Data are big when we can't compute them with the available resources and methods:

• Too many - Volume
• Too different - Variety
• Too fast - Velocity

# How do we compute data?

Tape, HDD, SSD, Memory, compared by:

• Ease of use (developer): the CPU works only in memory
• Speed
• Cost efficiency

How can we do more in memory?

• Size
• Ease of use
• Speed
• Cost efficiency

# Probabilistic data structures

In exchange for predictable errors:

• extremely scalable
• extremely low memory

# Hash functions

h(a) = b, mapping Domain A to Domain B

h('abc...') = 7987884
h(123) = 'a'
h(1234) = 5

md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;

• Diffusion
• One-way hashing
• Fixed output size
• No limits on input size
• Collision resistance
• Preimage resistance
• Second-preimage resistance
• Can allow an init key

# Cryptographic hash functions

SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2

# Non-Cryptographic hash functions

CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV

https://cyan4973.github.io/xxHash/

| Name | Speed | Type |
|---|---|---|
| xxHash | 5.4 GB/s | Non-cryptographic |
| MurmurHash3a | 2.7 GB/s | Non-cryptographic |
| MD5 | 0.33 GB/s | Cryptographic |
| SHA1 | 0.25 GB/s | Cryptographic |

# PHP

### Good support for cryptographic hash functions

// 52 algorithms supported in PHP 7.3
hash($algo, $str);

password_hash($password,$algo);

# PHP

### Bad support for non-cryptographic hash functions

https://bugs.php.net/bug.php?id=62063

// some are available in hash()
hash($algo, $str);

Custom extensions can be found online
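As a minimal sketch of what works without extensions: `hash()` ships a few non-cryptographic algorithms such as `fnv1a32`, `crc32b` and `joaat`. The helper below maps a string to a slot index, in the style of `h(x) mod m`. The name `bucketIndex` is hypothetical, and a 64-bit PHP build is assumed so that a 32-bit digest always fits in an integer.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: maps $value to a slot in [0, $slots).
// Assumes a 64-bit PHP build, so a 32-bit digest always fits in an int.
function bucketIndex(string $value, int $slots, string $algo = 'fnv1a32'): int
{
    // hash() returns the digest as a hex string; hexdec() turns it into an integer.
    return hexdec(hash($algo, $value)) % $slots;
}

// Print the slot chosen by three algorithms that are bundled with PHP 7.
foreach (['fnv1a32', 'crc32b', 'joaat'] as $algo) {
    printf("%s: %d\n", $algo, bucketIndex('Rome', 10, $algo));
}
```

The slot values depend on the algorithm, so two different algorithms can serve as two independent hash functions.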

# Akamai CDN

https://doi.org/10.1145%2F2805789.2805800

0.1% error

# Bloom Filter

interface BloomFilter
{
    function insert(mixed $element): void;
    function exists(mixed $element): bool;
}
Bit array (10 bits, all set to 0):

0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0

## Insert

h1(Rome) = Murmur3(Rome) mod 10 = 7

h2(Rome) = Fnv1a(Rome) mod 10 = 3

0 1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 1 0 0

## Insert

h1(Berlin) = Murmur3(Berlin) mod 10 = 1

h2(Berlin) = Fnv1a(Berlin) mod 10 = 3

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

## exists?

h1(Rome) = Murmur3(Rome) mod 10 = 7

h2(Rome) = Fnv1a(Rome) mod 10 = 3

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

Bits 7 and 3 are both set: Rome is probably part of the dataset

## exists?

h1(Paris) = Murmur3(Paris) mod 10 = 3

h2(Paris) = Fnv1a(Paris) mod 10 = 6

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

Bit 6 is not set: Paris is not part of the dataset

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

## exists?

of the dataset
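The insert and exists steps above can be sketched in a few lines of PHP. This is an illustration under assumptions, not the code of any library: the class name `SimpleBloomFilter` is hypothetical, elements are restricted to strings, and `crc32b` and `fnv1a32` stand in for the Murmur3 and Fnv1a pair on the slides because both are bundled with `hash()`. A 64-bit PHP build is assumed.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal Bloom filter; not the API of any existing package.
final class SimpleBloomFilter
{
    /** @var int[] one entry per bit, 0 or 1 */
    private $bits;
    /** @var int number of bits in the array */
    private $size;
    /** @var string[] hash() algorithm names, one per hash function */
    private $algos;

    public function __construct(int $size, array $algos = ['crc32b', 'fnv1a32'])
    {
        $this->size = $size;
        $this->algos = $algos;
        $this->bits = array_fill(0, $size, 0);
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $i) {
            $this->bits[$i] = 1;
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $i) {
            if ($this->bits[$i] === 0) {
                return false; // at least one bit is 0: definitely not inserted
            }
        }
        return true; // all bits are 1: probably inserted (false positives are possible)
    }

    /** @return int[] one bit position per hash function */
    private function positions(string $element): array
    {
        $positions = [];
        foreach ($this->algos as $algo) {
            $positions[] = hexdec(hash($algo, $element)) % $this->size;
        }
        return $positions;
    }
}

$filter = new SimpleBloomFilter(1000);
$filter->insert('Rome');
$filter->insert('Berlin');
var_dump($filter->exists('Rome')); // bool(true): inserted elements are always found
```

A `false` answer is always correct, while a `true` answer can be wrong when other elements happen to have set the same bits.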

| Items | List of ~100-char strings, mem (MB) | Bloom filter at 1% error, mem (MB) |
|---|---|---|
| 1 000 | 3 | 0.02 |
| 100 000 | 37 | 0.1 |
| 1 000 000 | 264 | 0.9 |

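# Bit-array size?

With n items and a false-positive probability P:

$$m = -\frac{n \ln P}{(\ln 2)^2}$$

For n = 1000 and P = 0.01:

$$-\frac{1000 \ln 0.01}{(\ln 2)^2} \approx 9585\ \text{bits}$$

~1.2 Kbyte

# How many functions?

$$k = \frac{m}{n} \ln 2$$

$$\frac{9585}{1000} \ln 2 \approx 6.6$$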

https://hur.st/bloomfilter/
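The two formulas translate directly into code. The function name `bloomParameters` is hypothetical, and rounding the bit count up and the hash count to the nearest integer is a choice made for this sketch, so the bit count can differ by one from the rounded figure on the slide.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: sizes a Bloom filter for $n items and a
 * target false-positive probability $p (0 < $p < 1).
 *
 * @return int[] ['bits' => m, 'hashes' => k]
 */
function bloomParameters(int $n, float $p): array
{
    // m = -n * ln(P) / (ln 2)^2, rounded up to a whole number of bits
    $bits = (int) ceil(-$n * log($p) / (log(2) ** 2));
    // k = (m / n) * ln 2, rounded to the nearest whole hash function
    $hashes = (int) round($bits / $n * log(2));

    return ['bits' => $bits, 'hashes' => $hashes];
}

$params = bloomParameters(1000, 0.01);
printf("bits: %d, hash functions: %d\n", $params['bits'], $params['hashes']);
```

For 1000 items at 1% this gives roughly 9.6 bits per item and 7 hash functions.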

# Counting Bloom filter

Approximate count of elements

# Quotient filter

Storage friendly

Allows delete

Fast (only one hash)

Can be resized

Faster

Allows delete

Less memory

Can be resized

# PHP

### rocket-labs/bloom-filter

Murmur3

Redis support

Counting filter support

# bbc.co.uk

### ~2 billion page views

https://www.similarweb.com/website/bbc.co.uk#overview

0.81% error

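# Probabilistic counting

| Element | h(·) | rank(·) |
|---|---|---|
| 'abc' | 10110110 | 1 |
| 'xyz' | 11011000 | 3 |
| 'foo' | 10111011 | 0 |

Here rank(·) is the number of trailing zero bits of the hash.

Bitmap, initially empty:

0 1 2 3 4 5 6 7
0 0 0 0 0 0 0 0

After setting the bit at rank('foo') = 0, rank('xyz') = 3 and rank('abc') = 1:

0 1 2 3 4 5 6 7
1 1 0 1 0 0 0 0

With p the position of the lowest bit that is still 0 (here p = 2):

$$R = 2^p = 2^2 = 4$$

$$n \approx \frac{1}{0.77351} R = \frac{1}{0.77351} \cdot 4 \approx 5.17 \approx 5$$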

Not really correct
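The estimate above can be reproduced with a short sketch. The function names `trailingZeros` and `estimateDistinct` are hypothetical, and the ranks are passed in directly so that the example follows the three hashes from the slide rather than a real hash function.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: rank = number of trailing zero bits (capped at $width - 1).
function trailingZeros(int $hash, int $width = 32): int
{
    $rank = 0;
    while ($rank < $width - 1 && (($hash >> $rank) & 1) === 0) {
        $rank++;
    }
    return $rank;
}

/**
 * Hypothetical helper: sets one bitmap bit per rank, finds the lowest
 * bit p that is still 0 and returns 2^p / 0.77351.
 *
 * @param int[] $ranks
 */
function estimateDistinct(array $ranks, int $bitmapSize = 32): float
{
    $bitmap = array_fill(0, $bitmapSize, 0);
    foreach ($ranks as $rank) {
        $bitmap[$rank] = 1;
    }
    $p = 0;
    while ($p < $bitmapSize && $bitmap[$p] === 1) {
        $p++;
    }
    return (2 ** $p) / 0.77351;
}

// The three 8-bit hashes from the slide.
$ranks = [];
foreach (['10110110', '11011000', '10111011'] as $hashBits) {
    $ranks[] = trailingZeros(bindec($hashBits), 8);
}
printf("estimate: %.2f\n", estimateDistinct($ranks, 8));
```

With only one bitmap the estimate can only take values of the form 2^p / 0.77351, which is why it lands on about 5 for 3 distinct elements.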

Redis version

# LogLog

The first 2 bits of each hash select one of m = 4 buckets:

| Element | h(·) | Bucket | rank(·) |
|---|---|---|---|
| 'abc' | 10110110 | 10 (m_3) | 1 |
| 'foo' | 10111011 | 10 (m_3) | 0 |
| 'xyz' | 11011000 | 11 (m_4) | 3 |

| Bucket | Prefix | Bitmap (positions 0 to 5) | R |
|---|---|---|---|
| m_1 | 00 | 0 0 0 0 0 0 | R_{m1} = 0 (no elements fall in this group) |
| m_2 | 01 | 0 0 0 0 0 0 | R_{m2} = 0 (no elements fall in this group) |
| m_3 | 10 | 1 1 0 0 0 0 | R_{m3} = 2^2 = 4 |
| m_4 | 11 | 0 0 0 1 0 0 | R_{m4} = 0 |

$$n \approx \alpha_m \cdot m \cdot 2^{\frac{1}{m} \sum_j R_{m_j}}$$

$$0.39701 \cdot 4 \cdot 2^{\frac{1}{4} \cdot (0+0+4+0)} \approx 3.18$$

Real value = 3

# HyperLogLog++

Improvements over LogLog

harmonic mean, 64-bit hashes, less memory

standard error ~0.81 %

# PHP

### https://github.com/shabbyrobe/phphll

PHP extension

Port of Redis HyperLogLog

# PHP

### joegreen0991/hyperloglog

Port of Redis HyperLogLog

....

0.01% error

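# Count-Min Sketch

Stream: 4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

Counters array with r rows, one per hash function h1(x), h2(x), and c columns:

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 0 |
| h2(x) | 0 | 0 | 0 | 0 |

1st element (4): h1(4) = 4, h2(4) = 4

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 1 |
| h2(x) | 0 | 0 | 0 | 1 |

2nd element (4): h1(4) = 4, h2(4) = 4

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 2 |
| h2(x) | 0 | 0 | 0 | 2 |

3rd element (4): h1(4) = 4, h2(4) = 4

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 3 |
| h2(x) | 0 | 0 | 0 | 3 |

4th element (4): h1(4) = 4, h2(4) = 4

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 4 |
| h2(x) | 0 | 0 | 0 | 4 |

5th element (2): h1(2) = 4, h2(2) = 1

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 5 |
| h2(x) | 1 | 0 | 0 | 4 |

After the whole stream:

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 8 | 0 | 0 | 10 |
| h2(x) | 3 | 4 | 6 | 7 |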

# Estimate frequency

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 8 | 0 | 0 | 10 |
| h2(x) | 3 | 4 | 6 | 7 |

## f(4) = ?

h1(4) = 4, h2(4) = 4

f(4) = min(10, 7) = 7

## f(2) = ?

h1(2) = 4, h2(2) = 1

f(2) = min(10, 3) = 3

## f(6) = ?

h1(6) = 1, h2(6) = 2

f(6) = min(8, 4) = 4

The element 6 occurs only once in the stream, so this estimate is too high: collisions can only inflate a counter, never reduce it.
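The update and estimate steps can be sketched as a small class. This is an illustration under assumptions: the class name `CountMinSketch` and its methods are hypothetical, items are strings, and the per-row hash functions are derived by prefixing the row number to the item before hashing with `fnv1a32`, which is a shortcut for this sketch rather than a recommended hash family. A 64-bit PHP build is assumed.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal Count-Min Sketch; not the API of any existing package.
final class CountMinSketch
{
    /** @var int number of rows (hash functions) */
    private $rows;
    /** @var int number of columns (counters per row) */
    private $cols;
    /** @var int[][] counters, indexed by [row][column] */
    private $table;

    public function __construct(int $rows, int $cols)
    {
        $this->rows = $rows;
        $this->cols = $cols;
        $this->table = array_fill(0, $rows, array_fill(0, $cols, 0));
    }

    // Increment one counter per row.
    public function add(string $item, int $count = 1): void
    {
        for ($row = 0; $row < $this->rows; $row++) {
            $this->table[$row][$this->column($row, $item)] += $count;
        }
    }

    // The smallest of the item's counters is the least inflated one.
    public function estimate(string $item): int
    {
        $min = PHP_INT_MAX;
        for ($row = 0; $row < $this->rows; $row++) {
            $min = min($min, $this->table[$row][$this->column($row, $item)]);
        }
        return $min;
    }

    // Assumption: salting the item with the row number gives one hash per row.
    private function column(int $row, string $item): int
    {
        return hexdec(hash('fnv1a32', $row . ':' . $item)) % $this->cols;
    }
}

$sketch = new CountMinSketch(4, 136);
foreach (explode(',', '4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2') as $item) {
    $sketch->add($item);
}
printf("f(4) is at most %d\n", $sketch->estimate('4'));
```

An estimate is never below the true frequency, so for this stream `estimate('4')` is at least 7 and `estimate('3')` is at least 6.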

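# Hash functions

(and counter rows)

The number of rows r depends on δ, the probability that an estimate exceeds the overestimation error bound.

$$r = \ln\left(\frac{1}{\delta}\right)$$

Example: δ = 2%

$$\ln\left(\frac{1}{0.02}\right) \approx 4$$

# Counter columns

The number of columns c depends on ε, the overestimation error bound expressed as a fraction of the stream size.

$$c = \frac{e}{\epsilon}$$

$$\frac{2.718}{0.02} \approx 136$$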

Example: 2% error
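Both dimensions can be computed in one place. The function name `countMinDimensions` is hypothetical, and rounding both values up is a choice made for this sketch.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: sizes a Count-Min Sketch.
 * $epsilon is the overestimation bound as a fraction of the stream size,
 * $delta is the probability of exceeding that bound.
 *
 * @return int[] ['rows' => r, 'cols' => c]
 */
function countMinDimensions(float $epsilon, float $delta): array
{
    return [
        'rows' => (int) ceil(log(1 / $delta)), // r = ln(1 / delta)
        'cols' => (int) ceil(M_E / $epsilon),  // c = e / epsilon
    ];
}

$dims = countMinDimensions(0.02, 0.02);
printf("rows: %d, columns: %d\n", $dims['rows'], $dims['cols']);
```

With both values set to 2% the sketch needs 4 rows of 136 counters, 544 counters in total, regardless of how long the stream is.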

# Heavy Hitters

Stream: 4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

## Find 3 Top elements

Top 3 elements are [(4, 7), (3, 6), (2, 3)]

## Heavy hitters for k = 3

An element is a heavy hitter when its frequency reaches the threshold t(x) = N/k:

$$x \in H_k \iff f(x) \geq \frac{N}{k}$$

Example with N = 100, k = 3: f(x) ≥ 100/3 ≈ 33

For the stream above, N = 18, k = 3: f(x) ≥ 18/3 = 6

Heavy hitters are [(4, 7), (3, 6)]

After 1 element:

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 1 |
| h2(x) | 0 | 0 | 0 | 1 |

f(4) = min(1, 1) = 1, t(x) = N/k = 1/3 ≈ 0.33

He = [ (4, 1) ]

After 2 elements:

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 0 | 0 | 0 | 2 |
| h2(x) | 0 | 0 | 0 | 2 |

f(4) = min(2, 2) = 2, t(x) = N/k = 2/3 ≈ 0.67

He = [ (4, 2) ]

After 13 elements:

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 7 | 0 | 0 | 9 |
| h2(x) | 1 | 3 | 6 | 7 |

f(4) = min(9, 7) = 7, t(x) = N/k = 13/3 ≈ 4.33

He = [ (4, 7) ]

After 17 elements:

| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| h1(x) | 7 | 0 | 0 | 10 |
| h2(x) | 2 | 4 | 6 | 6 |

f(3) = min(7, 6) = 6, t(x) = N/k = 17/3 ≈ 5.67

He = [ (4, 7), (3, 6) ]

After the whole stream (N = 18, t(x) = 6):

He = [ (4, 7), (3, 6) ]
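The walkthrough can be sketched as a class that updates the counters, compares the new estimate with the threshold N/k and prunes candidates that fall below it. Everything here is an assumption made for illustration: the class name `HeavyHitters`, the string items, the row-salted `fnv1a32` hashing, and the choice to keep the estimate seen when an item last arrived. A 64-bit PHP build is assumed.

```php
<?php
declare(strict_types=1);

// Hypothetical heavy-hitters tracker on top of a small Count-Min Sketch.
final class HeavyHitters
{
    /** @var int */
    private $k;
    /** @var int */
    private $rows;
    /** @var int */
    private $cols;
    /** @var int[][] Count-Min counters, indexed by [row][column] */
    private $table;
    /** @var int N, the number of elements seen so far */
    private $total = 0;
    /** @var int[] candidate item => estimate at its last arrival */
    private $candidates = [];

    public function __construct(int $k, int $rows = 4, int $cols = 136)
    {
        $this->k = $k;
        $this->rows = $rows;
        $this->cols = $cols;
        $this->table = array_fill(0, $rows, array_fill(0, $cols, 0));
    }

    public function add(string $item): void
    {
        $this->total++;

        // Count-Min update and estimate in one pass over the rows.
        $estimate = PHP_INT_MAX;
        for ($row = 0; $row < $this->rows; $row++) {
            $col = hexdec(hash('fnv1a32', $row . ':' . $item)) % $this->cols;
            $this->table[$row][$col]++;
            $estimate = min($estimate, $this->table[$row][$col]);
        }

        $threshold = $this->total / $this->k; // t(x) = N / k
        if ($estimate >= $threshold) {
            $this->candidates[$item] = $estimate;
        }

        // The threshold grows with N, so older candidates may drop out.
        foreach ($this->candidates as $candidate => $count) {
            if ($count < $threshold) {
                unset($this->candidates[$candidate]);
            }
        }
    }

    /** @return int[] item => estimated frequency, highest first */
    public function result(): array
    {
        $sorted = $this->candidates;
        arsort($sorted);
        return $sorted;
    }
}

$hitters = new HeavyHitters(3);
foreach (explode(',', '4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2') as $item) {
    $hitters->add($item);
}
print_r($hitters->result());
```

For this stream the result always contains 4 and 3; when no counters collide it is exactly the list on the slide, 4 with 7 occurrences and 3 with 6.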

# Count-Min-Log Sketch

Better frequency approximation for low frequency elements

# Count-Mean-Min Sketch

Subtracts the estimated noise from each counter and takes the median; good when
under-estimation is preferred

# PHP

### https://github.com/mrjgreen/CountMinSketch

Not registered on packagist

No unit tests (has benchmark test)

Weird hash function crc32 + md5

Only frequency estimation, no heavy hitters

