Probabilistic data structures
and algorithms
for Big Data applications

Asmir Mustafic

PHP CE 2019 - Dresden - Germany

Me

Asmir Mustafic


@goetas

Berlin

Community

  • jms/serializer (maintainer)
  • masterminds/html5 (maintainer)
  • hautelook/templated-uri-bundle (maintainer)
  • goetas-webservices/xsd2php (author)
  • goetas-webservices/xsd-reader (author)
  • goetas-webservices/soap-client (author)
  • goetas/twital (author)

 

  • PHP-FIG secretary

Agenda

  • Why use probabilistic data structures
  • 3 problems solved using probabilistic data structures

Big data

  • 1GB
  • 1TB
  • 1PB
  • 1EB

Big data

1980

1 GB = $250,000

 455 kg

https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380

data is big
when we can't process it
with the available resources and methods

Too many - Volume

Too different - Variety

Too fast - Velocity

Big data

How do we compute data?

Storage hierarchy: Tape → HDD → SSD → Memory

  • Ease of use (developer): grows toward memory
  • Speed: grows toward memory, and the CPU works only on memory
  • Cost efficiency: grows toward tape

How can we do more in memory,
given its size, ease of use, speed and cost?

Probabilistic data structures

In exchange for
predictable errors

 

  • extremely scalable

  • extremely low memory

Hash functions

A hash function h maps a value a from domain A to a value b in domain B:

h(a) = b

h('abc...') = 7987884
h(123) = 'a'
h(1234) = 5

md5(string $str): string;

sha1(string $str): string;

crc32(string $str): int;
Cryptographic
hash functions

  • Work factor
  • Sticky state
  • Diffusion
  • One-way hashing
    • Fixed output size
    • No limits on input size
  • Collision resistant
    • preimage resistance
    • second-preimage resistance
  • Can allow an init key

https://password-hashing.net/submissions.html

SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2

Trade some cryptographic properties for performance

Non-Cryptographic
hash functions

CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV

https://cyan4973.github.io/xxHash/

Non-cryptographic vs cryptographic hash function speed:

Name          Speed      Type
xxHash        5.4 GB/s   non-cryptographic
MurmurHash3a  2.7 GB/s   non-cryptographic
MD5           0.33 GB/s  cryptographic
SHA1          0.25 GB/s  cryptographic

PHP

Good support for cryptographic hash functions

// 52 algos supported in PHP 7.3
hash($algo, $str);

// password-hashing optimized
password_hash($password, $algo);

PHP

Bad support for non-cryptographic hash functions

https://bugs.php.net/bug.php?id=62063

// some available in hash()
hash($algo, $str);

Custom extensions can be found online
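For example, a few non-cryptographic algorithms that hash() does support out of the box (availability can vary with the PHP build):

// a few non-cryptographic algorithms available via hash() in PHP 7.3
echo hash('fnv1a32', 'Rome'); // FNV-1a, 32 bit
echo hash('crc32b', 'Rome');  // CRC-32
echo hash('joaat', 'Rome');   // Jenkins one-at-a-time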

Probabilistic data structures

The problem

Membership

Akamai CDN

serves 15–30% of all web traffic

~75% of files are requested only once

Akamai CDN

How to avoid caching one-hit files on disk?

Akamai CDN

Keep a list of requested URLs and cache
only on the 2nd request!

Akamai CDN

URL length: ~100 chars

40M unique URLs in 24h per node

→ ~4 GB just to keep the list

To avoid caching those one-hit files, Akamai uses Bloom filters

Akamai CDN

https://doi.org/10.1145%2F2805789.2805800

using ~68 MB

0.1% error

Bloom Filter

interface BloomFilter
{
    /** @param mixed $element */
    public function insert($element): void;

    /** @param mixed $element */
    public function exists($element): bool;
}
bit array

index:  0 1 2 3 4 5 6 7 8 9
bits:   0 0 0 0 0 0 0 0 0 0

Insert

h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3

index:  0 1 2 3 4 5 6 7 8 9
bits:   0 0 0 1 0 0 0 1 0 0

Insert

h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3

index:  0 1 2 3 4 5 6 7 8 9
bits:   0 1 0 1 0 0 0 1 0 0

exists?

h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3

index:  0 1 2 3 4 5 6 7 8 9
bits:   0 1 0 1 0 0 0 1 0 0

Both bits are set: Rome is part
of the dataset

exists?

h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6

index:  0 1 2 3 4 5 6 7 8 9
bits:   0 1 0 1 0 0 0 1 0 0

Bit 6 is not set: Paris is not part
of the dataset

exists?

h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7

index:  0 1 2 3 4 5 6 7 8 9
bits:   0 1 0 1 0 0 0 1 0 0

Both bits are set: Madrid is reported as part
of the dataset, even though it was never inserted: a false positive!
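To make the walkthrough concrete, here is a minimal sketch of the BloomFilter interface above in plain PHP. The fnv1a32 and crc32b algorithms (both built into hash()) stand in for the Murmur3/Fnv1a pair used on the slides; the class and method names are illustrative, not a real library.

// minimal Bloom filter sketch: 2 hash functions, a plain bit array of size $m
final class SimpleBloomFilter
{
    /** @var int[] */
    private $bits;

    /** @var int */
    private $m;

    public function __construct(int $m = 10)
    {
        $this->m = $m;
        $this->bits = array_fill(0, $m, 0);
    }

    /** @return int[] the two bit positions for $element */
    private function positions(string $element): array
    {
        // fnv1a32 and crc32b stand in for Murmur3 / Fnv1a
        return [
            hexdec(hash('fnv1a32', $element)) % $this->m,
            hexdec(hash('crc32b', $element)) % $this->m,
        ];
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $i) {
            $this->bits[$i] = 1;
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $i) {
            if ($this->bits[$i] === 0) {
                return false; // definitely never inserted
            }
        }
        return true; // probably inserted (false positives possible)
    }
}

$filter = new SimpleBloomFilter();
$filter->insert('Rome');
$filter->insert('Berlin');
var_dump($filter->exists('Rome'));   // bool(true)
var_dump($filter->exists('Paris'));  // bool(false), unless a false positive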

Space efficiency

items        list of ~100-char strings   Bloom filter (1% error)
1 000        3 MB                        0.02 MB
100 000      37 MB                       0.1 MB
1 000 000    264 MB                      0.9 MB

Bit-array size?

m = -\frac{n \ln{P}}{(\ln{2})^2}

-\frac{1000 \ln{0.01}}{(\ln{2})^2} \approx 9585\ bits

~1.2 KB

How many hash functions?

k = \frac{m}{n}\ln{2}

\frac{9585}{1000}\ln{2} \approx 6.6

https://hur.st/bloomfilter/
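The two formulas translate directly to PHP; a small sketch (function name illustrative):

// derive the bit-array size m and hash count k from n items and error P
function bloomParams(int $n, float $p): array
{
    $m = (int) ceil(-($n * log($p)) / (log(2) ** 2)); // bits
    $k = (int) round(($m / $n) * log(2));             // hash functions

    return [$m, $k];
}

[$m, $k] = bloomParams(1000, 0.01); // $m ≈ 9586 bits (~1.2 KB), $k = 7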

Variants

Counting Bloom filter

Approximate count of elements 

Quotient filter

Storage friendly

Allows delete

Fast (only one hash)

Can be resized

Cuckoo filter

Faster

Allows delete

Less memory

Can be resized

PHP

pecl.php.net/package/bloomy

Not maintained

PHP

rocket-labs/bloom-filter

Murmur3

Redis support

Counting filter support

The problem

Cardinality

Unique visitors of a website

bbc.co.uk

558M visits/month (Nov 2018)

3.45 pages / visit

~2 billion page views

https://www.similarweb.com/website/bbc.co.uk#overview

bbc.co.uk

Count unique visitors?

Keep a list of visits

grouped by some unique identifier

bbc.co.uk

10% unique users → ~200M identifiers

Unique identifier is 10 bytes

List size: ~2 GB

bbc.co.uk

HyperLogLog counts unique elements using 12 KB

0.81% error

Probabilistic
 counting

rank(·) = number of trailing zero bits of the hash

h('abc') = 10110110 → rank = 1
h('xyz') = 11011000 → rank = 3
h('foo') = 10111011 → rank = 0
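A sketch of rank() in PHP, assuming (as the examples above suggest) that rank is the number of trailing zero bits:

// rank() = number of trailing zero bits of the hash
function rank(int $hash): int
{
    if ($hash === 0) {
        return 0; // convention for an all-zero hash
    }

    $r = 0;
    while (($hash & 1) === 0) {
        $hash >>= 1;
        $r++;
    }

    return $r;
}

echo rank(0b10110110); // 1
echo rank(0b11011000); // 3
echo rank(0b10111011); // 0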

Flajolet, Martin 1985

Probabilistic counting

Set bit rank(x) of a bitmap for every element x:

index:  0 1 2 3 4 5 6 7
bits:   0 0 0 0 0 0 0 0

rank('foo') = 0
rank('abc') = 1
rank('xyz') = 3

index:  0 1 2 3 4 5 6 7
bits:   1 1 0 1 0 0 0 0

R = 2^2 = 4   (2 = position of the first zero bit)

Probabilistic counting

n \approx \frac{1}{0.77351} R, \quad R = 2^p, \quad p = \text{position of the first zero bit}

\frac{1}{0.77351} \cdot 4 \approx 5.17 \approx 5

Not really correct: the real value is 3

Probabilistic counting

LogLog

Redis version

h('abc') = 10110110
h('xyz') = 11011000
h('foo') = 10111011

The first 2 bits of each hash select one of m = 4 buckets;
the remaining bits are ranked as before:

h('abc') = 10 | 110110 → bucket m_3
h('foo') = 10 | 111011 → bucket m_3
h('xyz') = 11 | 011000 → bucket m_4

bucket 00 (m_1):  0 0 0 0 0 0   ← no elements fall in this group
bucket 01 (m_2):  0 0 0 0 0 0   ← no elements fall in this group
bucket 10 (m_3):  1 1 0 0 0 0
bucket 11 (m_4):  0 0 0 1 0 0

R_{m_1} = 0
R_{m_2} = 0
R_{m_3} = 2^2 = 4
R_{m_4} = 0

n \approx \alpha_m \cdot m \cdot 2^{\frac{1}{m} \sum_j R_{m_j}}

0.39701 \cdot 4 \cdot 2^{\frac{1}{4} \cdot (0+0+4+0)} \approx 3.17

Real value = 3
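The same estimate, spelled out in PHP with the example's numbers:

// the LogLog estimate, computed with the example's numbers
$alpha = 0.39701;     // correction constant α_m
$m = 4;               // number of buckets
$R = [0, 0, 4, 0];    // per-bucket values from the example

echo $alpha * $m * 2 ** (array_sum($R) / $m); // ≈ 3.176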

LogLog

HyperLogLog

HyperLogLog++

Improvements of LogLog

adding bias correction, adjusted constants,
harmonic mean, 64-bit hashes and lower memory usage

standard error ~0.81%

PHP

https://github.com/shabbyrobe/phphll

PHP extension

Port of Redis HyperLogLog

PHP

joegreen0991/hyperloglog

Port of Redis HyperLogLog
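Since both libraries port Redis' HyperLogLog, with a Redis server at hand you can also use it directly. A hedged sketch via the phpredis extension (connection details and key names are illustrative):

// counting unique visitors with Redis' built-in HyperLogLog
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// one ~12 KB HyperLogLog per day
$redis->pfAdd('visitors:2019-10-25', ['user-123']);
$redis->pfAdd('visitors:2019-10-25', ['user-456']);

echo $redis->pfCount('visitors:2019-10-25'); // ≈ unique visitors, ~0.81% error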

The problem

Frequency

Twitter

http://www.internetlivestats.com/twitter-statistics/

6,000 tweets/s

some tweets have hashtags

Twitter 

1) Count tweets for the
most popular tags

2) Compare today's tag counts
with
other periods or regions

Twitter 

Keep sorted lists with hashtag counts ...

Twitter 

Count-Min Sketch
solves the problem using
~200 KB

0.01% error

Count-Min Sketch

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

counters array: r rows (one per hash function) × c columns

        c1 c2 c3 c4
h1(x):   0  0  0  0
h2(x):   0  0  0  0

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4
h2(4) = 4

        c1 c2 c3 c4
h1(x):   0  0  0  1
h2(x):   0  0  0  1

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4
h2(4) = 4

        c1 c2 c3 c4
h1(x):   0  0  0  2
h2(x):   0  0  0  2

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4
h2(4) = 4

        c1 c2 c3 c4
h1(x):   0  0  0  3
h2(x):   0  0  0  3

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(4) = 4
h2(4) = 4

        c1 c2 c3 c4
h1(x):   0  0  0  4
h2(x):   0  0  0  4

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

h1(2) = 4
h2(2) = 1

        c1 c2 c3 c4
h1(x):   0  0  0  5
h2(x):   1  0  0  4

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

Estimate frequency

f(4) = ?

h1(4) = 4
h2(4) = 4

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

f(4) = min(10, 7) = 7

real frequency: 7

f(2) = ?

h1(2) = 4
h2(2) = 1

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

f(2) = min(10, 3) = 3

real frequency: 3

f(6) = ?

h1(6) = 1
h2(6) = 2

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

        c1 c2 c3 c4
h1(x):   8  0  0 10
h2(x):   3  4  6  7

f(6) = min(8, 4) = 4

real frequency: 1 (overestimated)
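A minimal Count-Min Sketch sketch mirroring the 2×4 walkthrough above; fnv1a32/crc32b are illustrative stand-ins for h1/h2, so the exact counter values may differ from the slides.

// minimal Count-Min Sketch: 2 rows (hash functions) × $cols columns
final class CountMinSketch
{
    /** @var int[][] */
    private $counters;

    /** @var int */
    private $cols;

    public function __construct(int $cols = 4)
    {
        $this->cols = $cols;
        $this->counters = [
            array_fill(0, $cols, 0), // row for h1
            array_fill(0, $cols, 0), // row for h2
        ];
    }

    /** @return int[] one column index per row */
    private function columns(string $element): array
    {
        return [
            hexdec(hash('fnv1a32', $element)) % $this->cols,
            hexdec(hash('crc32b', $element)) % $this->cols,
        ];
    }

    public function add(string $element): void
    {
        foreach ($this->columns($element) as $row => $col) {
            $this->counters[$row][$col]++;
        }
    }

    public function frequency(string $element): int
    {
        $estimates = [];
        foreach ($this->columns($element) as $row => $col) {
            $estimates[] = $this->counters[$row][$col];
        }

        return min($estimates); // may overestimate, never underestimates
    }
}

$cms = new CountMinSketch();
foreach ([4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2] as $x) {
    $cms->add((string) $x);
}
echo $cms->frequency('4'); // 7 with ideal hashes; collisions may add more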

Hash functions

The number of hash functions (= counter rows r)
controls the frequency overestimation probability δ:

r = \ln(\frac{1}{\delta})

Example, 2% error: \ln(\frac{1}{0.02}) \approx 4

Counter columns

The number of counter columns c controls the standard error ε:

c = \frac{e}{\epsilon}

Example, 2% error: \frac{2.718}{0.02} \approx 136

Example: 0.1% error and accuracy, ~4 billion unique elements

7 rows (hash functions)

2718 columns

32-bit counters (count up to ~4 billion)

7 × 2718 × 32 / 8 ≈ 76 KB
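The same sizing arithmetic in PHP:

// rows and columns for a 0.1% target error
$delta   = 0.001; // overestimation probability
$epsilon = 0.001; // standard error

$rows = (int) ceil(log(1 / $delta)); // 7
$cols = (int) round(M_E / $epsilon); // 2718

$bytes = $rows * $cols * 32 / 8;     // 76104 bytes ≈ 76 KB (32-bit counters)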

Heavy Hitters

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

Find the 3 top elements

Top 3 elements are [(4, 7), (3, 6), (2, 3)]

An element x is a heavy hitter when its frequency reaches the threshold t(x) = N/k:

x \in H_k \iff f(x) \geq \frac{N}{k}

Example: N = 100, k = 3 → t = \frac{100}{3} \approx 33

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

3 top elements

N = 18, k = 3 → t = \frac{18}{3} = 6

Top 3 elements are [(4, 7), (3, 6)]: only these reach f(x) ≥ 6

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

        c1 c2 c3 c4
h1(x):   0  0  0  1
h2(x):   0  0  0  1

f(4) = min(1, 1) = 1
t(x) = \frac{N}{k} = \frac{1}{3} \approx 0.33

He = [ (4, 1) ]

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

        c1 c2 c3 c4
h1(x):   0  0  0  2
h2(x):   0  0  0  2

f(4) = min(2, 2) = 2
t(x) = \frac{N}{k} = \frac{2}{3} \approx 0.67

He = [ (4, 2) ]

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

        c1 c2 c3 c4
h1(x):   7  0  0  9
h2(x):   1  3  6  7

f(4) = min(9, 7) = 7
t(x) = \frac{N}{k} = \frac{13}{3} \approx 4.3

He = [ (4, 7) ]

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

        c1 c2 c3 c4
h1(x):   7  0  0 10
h2(x):   2  4  6  6

f(3) = min(7, 6) = 6
t(x) = \frac{N}{k} = \frac{17}{3} \approx 5.7

He = [ (4, 7), (3, 6) ]

4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2

Final heavy hitters: He = [ (4, 7), (3, 6) ]
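Reusing the CountMinSketch sketch from above, a hedged sketch of heavy-hitter tracking: after each insert, re-check the candidate set against the threshold N/k.

$cms = new CountMinSketch();
$k = 3;       // we want the top-3 candidates
$n = 0;       // elements seen so far
$heavy = [];  // element => estimate taken when the element was last seen

foreach ([4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2] as $x) {
    $cms->add((string) $x);
    $n++;
    $heavy[$x] = $cms->frequency((string) $x);

    // evict every candidate that fell below the current threshold N/k
    foreach ($heavy as $element => $estimate) {
        if ($estimate < $n / $k) {
            unset($heavy[$element]);
        }
    }
}

arsort($heavy);
// with the slides' ideal hash functions: [4 => 7, 3 => 6]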

Variants

Count-Min-Log Sketch

Better frequency approximation for low-frequency elements

Count-Mean-Min Sketch

subtracts the estimated noise (median); good when
under-estimation is preferred

PHP

https://github.com/mrjgreen/CountMinSketch

Not registered on packagist

No unit tests (has benchmark test)

Weird hash function (crc32 + md5)

Only frequency estimation, no heavy hitters

What about PHP and big data?

Probabilistic data structures
reside in memory

...but PHP frees all memory at the end of each request

PHP gets faster with each release

PHP daemons
have started to emerge:

swoole, amp, react

General knowledge of advanced data structures

helps you decide
when the moment is right to use them

Thank you!

Probabilistic data structures and algorithms for Big Data applications - PHPCE 2019

By Asmir Mustafic

As the amount of data we produce keeps growing, we also need more sophisticated methods to process it. Probabilistic data structures are based on different hashing techniques and provide approximate answers with predictable error rates. The potential errors are compensated for by incredibly low memory usage, query time and scaling factors. This talk covers the most common strategies used to solve membership, counting, similarity, frequency and ranking problems in a Big Data context.
