# Probabilistic data structures and algorithms for Big Data applications

Asmir Mustafic

PHP CE 2019 - Dresden - Germany

Berlin

# Community

• jms/serializer (maintainer)
• masterminds/html5 (maintainer)
• hautelook/templated-uri-bundle (maintainer)
• goetas-webservices/xsd2php (author)
• goetas-webservices/soap-client (author)
• goetas/twital (author)

• PHP-FIG secretary

# Agenda

• Why use probabilistic data structures
• 3 problems solved using probabilistic data structures

• 1GB
• 1TB
• 1PB
• 1EB

# Big data

1980: 1GB = 250 000 $, 455 kg
https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380

Data are big when we can't compute them with the available resources and methods.

• Too many - Volume
• Too different - Variety
• Too fast - Velocity

# How do we compute data?

Tape → HDD → SSD → Memory

• Ease of use (developer): best in memory — the CPU works only there
• Speed: grows from tape to memory
• Economicity: tape is the cheapest per byte

How can we do more in memory? Size, ease of use, speed, economicity.

# Probabilistic data structures

• extremely scalable
• extremely low memory

In exchange for predictable errors

# Hash functions

h(a) = b, mapping domain A to domain B

h('abc...') = 7987884
h(123) = 'a'
h(1234) = 5

md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;

# Cryptographic hash functions

• Diffusion
• One-way
• Fixed output size
• No limits on input size
• Collision resistant
• Preimage resistance
• Second-preimage resistance
• Can be keyed (init-key)

SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2

# Non-Cryptographic hash functions

CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV

https://cyan4973.github.io/xxHash/

| Name | Speed |
|---|---|
| xxHash | 5.4 GB/s |
| MurmurHash3a | 2.7 GB/s |
| MD5 | 0.33 GB/s |
| SHA1 | 0.25 GB/s |


# PHP

### Good support for cryptographic hash functions

// 52 algos supported in PHP 7.3
hash($algo, $str);

password_hash($password, $algo);

# PHP

### Bad support for non-cryptographic hash functions

https://bugs.php.net/bug.php?id=62063

// some available in hash()
hash($algo, $str);

Custom extensions can be found online

# Akamai CDN

https://doi.org/10.1145%2F2805789.2805800

0.1% error

# Bloom Filter

interface BloomFilter
{
    public function insert(mixed $element): void;
    public function exists(mixed $element): bool;
}
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0

bit array

0 1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 1 0 0

h1(Rome) = Murmur3(Rome) mod 10 = 7

h2(Rome) = Fnv1a(Rome) mod 10 = 3

## Insert

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

h1(Berlin) = Murmur3(Berlin) mod 10 = 1

h2(Berlin) = Fnv1a(Berlin) mod 10 = 3

## Insert

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

h1(Rome) = Murmur3(Rome) mod 10 = 7

h2(Rome) = Fnv1a(Rome) mod 10 = 3

## exists?

Rome is part
of the dataset

0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0

h1(Paris) = Murmur3(Paris) mod 10 = 3

h2(Paris) = Fnv1a(Paris) mod 10 = 6

## exists?

Paris is not part
of the dataset

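The insert/exists walkthrough above can be condensed into a short sketch. This is an illustrative Python version (the slides use Murmur3 and FNV-1a; here a salted md5 stands in for both hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a bit array of size m."""

    def __init__(self, m=10, k=2):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, element):
        # Derive k positions by salting one hash function
        # (stand-in for the Murmur3 / FNV-1a pair in the slides).
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1

    def exists(self, element):
        # "No" is always correct; "yes" may be a false positive.
        return all(self.bits[pos] for pos in self._positions(element))

bf = BloomFilter()
bf.insert("Rome")
bf.insert("Berlin")
```

As in the slides, a lookup of an inserted element always answers "yes"; a lookup of an absent element usually answers "no", but may collide into set bits and produce a false positive.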

Memory usage, PHP list of ~100-char strings vs Bloom filter at 1% error:

| items | list (MB) | bloom 1% (MB) |
|---|---|---|
| 1 000 | 3 | 0.02 |
| 100 000 | 37 | 0.1 |
| 1 000 000 | 264 | 0.9 |

# Bit-array size?

m = - \frac{n \ln{P}}{(\ln{2})^2}

- \frac{1000 \ln{0.01}}{(\ln{2})^2} \approx 9585\ bits

~1.2 Kbyte

# How many hash functions?

k = \frac{m}{n}\ln{2}

\frac{9585}{1000}\ln{2} \approx 6.6

https://hur.st/bloomfilter/
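The two sizing formulas above, computed directly (a small Python helper for illustration):

```python
import math

def bloom_parameters(n, p):
    """Bit-array size m = -n·ln(p)/(ln 2)^2 and hash count k = (m/n)·ln 2
    for n expected items at false-positive rate p."""
    m = -n * math.log(p) / math.log(2) ** 2
    k = (m / n) * math.log(2)
    return m, k

m, k = bloom_parameters(1000, 0.01)
print(round(m))     # ≈ 9585 bits (~1.2 KB)
print(round(k, 1))  # ≈ 6.6 hash functions (round up to 7 in practice)
```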

# Counting Bloom filter

Approximate count of elements

# Quotient filter

• Storage friendly
• Fast (only one hash)
• Less memory
• Allows delete
• Can be resized

# PHP

### rocket-labs/bloom-filter

Murmur3

Redis support

Counting filter support

# bbc.co.uk

### ~2bln page views

https://www.similarweb.com/website/bbc.co.uk#overview

0.81% error

# Probabilistic counting

rank(·) = position of the least significant 1-bit of the hash

h('abc') = 10110110 → rank = 1
h('xyz') = 11011000 → rank = 3
h('foo') = 10111011 → rank = 0

Bit array before and after setting bits 0, 1 and 3:

0 1 2 3 4 5 6 7
0 0 0 0 0 0 0 0

0 1 2 3 4 5 6 7
1 1 0 1 0 0 0 0

R = 2^{p_0}, where p_0 is the position of the first unset bit (here p_0 = 2, so R = 2^2 = 4)

n \approx \frac{1}{0.77351} R

\frac{1}{0.77351} \cdot 4 \approx 5.17 \approx 5

Not really correct
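The rank-and-estimate steps above can be sketched in a few lines (Python for illustration, using the three 8-bit example hashes from the slides):

```python
def rank(h, width=8):
    """Position of the least significant 1-bit; width if h == 0."""
    if h == 0:
        return width
    r = 0
    while h & 1 == 0:
        h >>= 1
        r += 1
    return r

def estimate(bits):
    """First unset position p gives R = 2**p; n ≈ R / 0.77351."""
    p = 0
    while p < len(bits) and bits[p]:
        p += 1
    return 2 ** p / 0.77351

# The three example hashes from the slides:
bits = [0] * 8
for h in (0b10110110, 0b11011000, 0b10111011):
    bits[rank(h)] = 1          # ranks 1, 3, 0
# first unset bit is at p = 2, so R = 4 and n ≈ 4 / 0.77351 ≈ 5.17
```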


# LogLog

Split each hash in two: the first 2 bits select one of m = 4 buckets (00 → m_1, 01 → m_2, 10 → m_3, 11 → m_4); rank(·) of the remaining bits is recorded in that bucket's bit array.

h('abc') = 10110110 → bucket 10 (m_3)
h('foo') = 10111011 → bucket 10 (m_3)
h('xyz') = 11011000 → bucket 11 (m_4)

m_1: 0 0 0 0 0 0 (no elements fall in this bucket)
m_2: 0 0 0 0 0 0 (no elements fall in this bucket)
m_3: 1 1 0 0 0 0
m_4: 0 0 0 1 0 0

R_{m_1} = 0, R_{m_2} = 0, R_{m_3} = 2^2 = 4, R_{m_4} = 0

n \approx a_m \cdot m \cdot 2^{\frac{1}{m} \sum{R_m}}

0.39701 \cdot 4 \cdot 2^{\frac{1}{4} \cdot (0+0+4+0)} = 3.17

Real value = 3
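The bucketed estimate above, sketched in Python for illustration. It reproduces the slides' numbers, including their convention that a bucket contributes R = 0 when its bit 0 is unset:

```python
def trailing_zeros(h, width):
    """rank(·): position of the least significant 1-bit."""
    if h == 0:
        return width
    r = 0
    while h & 1 == 0:
        h >>= 1
        r += 1
    return r

def loglog_estimate(hashes, bucket_bits=2, width=8, alpha=0.39701):
    rest = width - bucket_bits                  # bits left after bucket selection
    m = 2 ** bucket_bits                        # number of buckets
    buckets = [[0] * rest for _ in range(m)]
    for h in hashes:
        b = h >> rest                           # first bits pick the bucket
        buckets[b][trailing_zeros(h & ((1 << rest) - 1), rest)] = 1
    total = 0
    for bits in buckets:
        p = 0
        while p < rest and bits[p]:
            p += 1
        total += 2 ** p if p > 0 else 0         # slides' convention: R = 0 if bit 0 unset
    return alpha * m * 2 ** (total / m)

est = loglog_estimate([0b10110110, 0b11011000, 0b10111011])
# ≈ 3.18 (slides truncate to 3.17; real value = 3)
```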

# HyperLogLog++

Improvements over LogLog

harmonic mean, 64-bit hashes, less memory

standard error ~0.81 %

# PHP

### https://github.com/shabbyrobe/phphll

PHP extension

Port of Redis HyperLogLog

# PHP

### joegreen0991/hyperloglog

Port of Redis HyperLogLog


0.01% error

# Count-Min Sketch

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

c1 c2 c3 c4
0 0 0 0
0 0 0 0

h1(x), h2(x)

h1(x)

h2(x)

counters array

c

r

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4

h2(4) = 4

c1 c2 c3 c4
0 0 0 1
0 0 0 1

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4

h2(4) = 4

c1 c2 c3 c4
0 0 0 2
0 0 0 2

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4

h2(4) = 4

c1 c2 c3 c4
0 0 0 3
0 0 0 3

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4

h2(4) = 4

c1 c2 c3 c4
0 0 0 4
0 0 0 4

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(2) = 4

h2(2) = 1

c1 c2 c3 c4
0 0 0 5
1 0 0 4

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

c1 c2 c3 c4
8 0 0 10
3 4 6 7

h1(x)

h2(x)

# Estimate frequency

h1(4) = 4

h2(4) = 4

## f(4) = ?

c1 c2 c3 c4
8 0 0 10
3 4 6 7

h1(x)

h2(x)

c1 c2 c3 c4
8 0 0 10
3 4 6 7
f(4) = min(10, 7) = 7

h1(x)

h2(x)

h1(2) = 4

h2(2) = 1

## f(2) = ?

c1 c2 c3 c4
8 0 0 10
3 4 6 7

h1(x)

h2(x)

c1 c2 c3 c4
8 0 0 10
3 4 6 7
f(2) = min(10, 3) = 3

h1(x)

h2(x)

h1(6) = 1

h2(6) = 2

## f(6) = ?

c1 c2 c3 c4
8 0 0 10
3 4 6 7

h1(x)

h2(x)

c1 c2 c3 c4
8 0 0 10
3 4 6 7
f(6) = min(8, 4) = 4

h1(x)

h2(x)

# Hash functions (and counter rows)

\delta = probability of overestimating the frequency

r = \ln(\frac{1}{\delta})

\ln(\frac{1}{0.02}) \approx 4

Example: 2% error

# Counter columns

\epsilon = standard error

c = \frac{e}{\epsilon}

\frac{2.718}{0.02} \approx 136

Example: 2% error
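Putting the update rule and the two sizing formulas together, a minimal sketch (Python for illustration; a salted md5 stands in for the r independent hash functions):

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon=0.02, delta=0.02):
        self.c = math.ceil(math.e / epsilon)      # counter columns
        self.r = math.ceil(math.log(1 / delta))   # rows = hash functions
        self.table = [[0] * self.c for _ in range(self.r)]

    def _cols(self, x):
        # One salted hash per row (stand-in for r independent hashes).
        for i in range(self.r):
            digest = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.c

    def add(self, x):
        for row, col in enumerate(self._cols(x)):
            self.table[row][col] += 1

    def frequency(self, x):
        # Minimum over rows: never under-estimates, may over-estimate.
        return min(self.table[row][col] for row, col in enumerate(self._cols(x)))

stream = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
cms = CountMinSketch()
for x in stream:
    cms.add(x)
```

With ε = δ = 2% this allocates 4 rows × 136 columns, matching the calculations above; the true frequencies in the example stream are f(4) = 7, f(3) = 6, f(2) = 3, and the sketch's estimates are at least those values.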

# Heavy Hitters

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

## Find 3 Top elements

Top 3 elements are [(4, 7), (3, 6), (2, 3)]

x \in H_k \iff f(x) \geq t = \frac{N}{k}

Example: N = 100, k = 3 → t = \frac{100}{3} \approx 33

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

## 3 Top elements

Top 3 elements are [(4, 7), (3, 6)]

t = \frac{18}{3} = 6

N = 18, k = 3

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

c1 c2 c3 c4
0 0 0 1
0 0 0 1

He = [ (4, 1) ]

t(x) = \frac{N}{k} = \frac{1}{3} \approx 0.3
f(4) = min(1,1) = 1

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

c1 c2 c3 c4
0 0 0 2
0 0 0 2

He = [ (4, 2) ]

t(x) = \frac{N}{k} = \frac{2}{3} \approx 0.7
f(4) = min(2,2) = 2

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

c1 c2 c3 c4
7 0 0 9
1 3 6 7

He = [ (4, 7) ]

t(x) = \frac{N}{k} = \frac{13}{3} \approx 4.3
f(4) = min(9,7) = 7

h1(x)

h2(x)

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

c1 c2 c3 c4
7 0 0 10
2 4 6 6

He = [ (4, 7), (3, 6) ]

t(x) = \frac{N}{k} = \frac{17}{3} \approx 5.7
f(3) = min(7,6) = 6

h1(x)

h2(x)


4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

He = [ (4, 7), (3, 6) ]
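The moving-threshold loop above can be sketched as follows (Python for illustration; exact counters stand in for the Count-Min frequency estimate to keep the sketch short):

```python
from collections import Counter

def heavy_hitters(stream, k):
    """Keep elements whose estimated frequency reaches N/k, where N is the
    number of items seen so far. Exact counts replace the Count-Min estimate."""
    counts = Counter()
    candidates = {}
    n = 0
    for x in stream:
        n += 1
        counts[x] += 1
        threshold = n / k
        if counts[x] >= threshold:
            candidates[x] = counts[x]
        # evict candidates that fell below the moving threshold
        for y in [y for y, f in candidates.items() if counts[y] < threshold]:
            del candidates[y]
    return sorted(candidates.items(), key=lambda kv: -kv[1])

stream = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
print(heavy_hitters(stream, 3))  # → [(4, 7), (3, 6)]
```

Element 2, with frequency 3, never reaches the final threshold 18/3 = 6, so only 4 and 3 survive, matching the slides.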

# Count-Min-Log Sketch

Better frequency approximation for low frequency elements

# Count-Mean-Min Sketch

subtracts an estimate of the noise (the median of the counters); good when
under-estimation is preferred

# PHP

### https://github.com/mrjgreen/CountMinSketch

Not registered on packagist

No unit tests (has benchmark test)

Weird hash function crc32 + md5

Only frequency estimation, no heavy hitters

# Thank you!

#### Probabilistic data structures and algorithms for Big Data applications - PHPCE 2019

By Asmir Mustafic

# Probabilistic data structures and algorithms for Big Data applications - PHPCE 2019

As the amount of data we produce is continuously growing, we need also more sophisticated methods to elaborate them. Probabilistic data structures are based on different hashing techniques and provide approximate answers with predictable errors. The potential errors are compensated by the incredibly low memory usage, query time and scaling factors. This talk will cover the most common strategies used to solve membership, counting, similarity, frequency and ranking problems in a Big Data context.
