Probabilistic data structures
and algorithms
for Big Data applications
Asmir Mustafic
PHP CE 2019  Dresden  Germany
Me
Asmir Mustafic
@goetas
 Twitter: @goetas_asmir
 Github: @goetas
 LinkedIn: @goetas
 WWW: goetas.com
Berlin
Community
 jms/serializer (maintainer)
 masterminds/html5 (maintainer)
 hautelook/templated-uri-bundle (maintainer)
 goetas-webservices/xsd2php (author)
 goetas-webservices/xsd-reader (author)
 goetas-webservices/soap-client (author)
 goetas/twital (author)
 PHP-FIG secretary
Agenda
 Why use probabilistic data structures
 3 problems solved using probabilistic data structures
Big data
 1GB
 1TB
 1PB
 1EB
Big data
1980
1GB = 250 000 $
455 kg
https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380
data are big
when we can't compute them
with the available resources and methods
Too many  Volume
Too different  Variety
Too fast  Velocity
Big data
How do we compute data?
Tape
HDD
SSD
Memory
Ease of use (developer)
CPU works only here
Tape
HDD
SSD
Memory
Speed
Tape
HDD
SSD
Memory
Cost efficiency
Tape
HDD
SSD
Memory
How can we do more here?
Size
Ease of use
Speed
Cost efficiency
Probabilistic data structures
In exchange for
predictable errors

extremely scalable

extremely low memory
Hash functions
Domain A
Domain B
h('abc...') = 7987884
h(123) = 'a'
h(1234) = 5
md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;

Work factor

Sticky state

Diffusion
 One-way hashing
 Fixed output size
 No limits on input size
 Collision resistant
 Preimage resistance
 Second-preimage resistance
 Can allow an init key
Cryptographic
hash functions
Cryptographic
hash functions
https://password-hashing.net/submissions.html
SHA1 SHA256 SHA512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE256 BLAKE512 HAVAL Argon2
Trade some cryptographic properties for performance
Non-cryptographic
hash functions
Non-cryptographic
hash functions
CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV
https://cyan4973.github.io/xxHash/
Non-cryptographic vs cryptographic hash function speed
Name          Speed      Type
xxHash        5.4 GB/s   Non-cryptographic
MurmurHash3a  2.7 GB/s   Non-cryptographic
MD5           0.33 GB/s  Cryptographic
SHA1          0.25 GB/s  Cryptographic
PHP
PHP
Good support for cryptographic hash functions
// 52 algorithms supported in PHP 7.3
hash($algo, $str);
// optimized for password hashing
password_hash($password, $algo);
PHP
Poor support for non-cryptographic hash functions
https://bugs.php.net/bug.php?id=62063
// some are available in hash()
hash($algo, $str);
Custom extensions can be found online
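As a rough illustration of the non-cryptographic algorithms reachable through hash(), the sketch below maps a string to a bucket index. The algorithm names fnv1a32 and crc32b are bundled with PHP's hash extension; the helper bucketOf is hypothetical and assumes a 64-bit PHP build.

```php
<?php
// Hypothetical helper: map any string to a bucket in [0, $buckets).
// Assumes a 64-bit PHP build, so hexdec() of 8 hex digits fits in an int.
function bucketOf(string $item, int $buckets, string $algo = 'fnv1a32'): int
{
    // hash() takes the algorithm name first, then the data.
    $hex = hash($algo, $item);   // 8 hex characters for 32-bit algorithms
    return hexdec($hex) % $buckets;
}

echo bucketOf('Rome', 10), PHP_EOL;            // bucket from FNV-1a
echo bucketOf('Rome', 10, 'crc32b'), PHP_EOL;  // bucket from CRC32
```

The same input always lands in the same bucket, which is the only property the structures in the rest of the talk rely on.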
Probabilistic data structures
The problem
Membership
Akamai CDN
serves 15-30% of all web traffic
~75% of files are requested only once
Akamai CDN
How to avoid caching one-hit requests on disk?
Akamai CDN
Keeping a list of URLs and cache
only on the 2nd request!
Akamai CDN
40M unique URLs in 24h per node
URL length ~100 chars
4 GB
to avoid caching those one-hit files Akamai uses Bloom filters
Akamai CDN
https://doi.org/10.1145%2F2805789.2805800
using ~ 68 MB
0.1% error
Bloom Filter
interface BloomFilter
{
function insert(mixed $element): void;
function exists(mixed $element): bool;
}
0  1  2  3  4  5  6  7  8  9 

0  0  0  0  0  0  0  0  0  0 
bit array
0  1  2  3  4  5  6  7  8  9 

0  0  0  1  0  0  0  1  0  0 
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
Insert
0  1  2  3  4  5  6  7  8  9 

0  1  0  1  0  0  0  1  0  0 
h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3
Insert
0  1  2  3  4  5  6  7  8  9 

0  1  0  1  0  0  0  1  0  0 
0  1  2  3  4  5  6  7  8  9 

0  1  0  1  0  0  0  1  0  0 
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
exists?
Rome is probably part
of the dataset
0  1  2  3  4  5  6  7  8  9 

0  1  0  1  0  0  0  1  0  0 
h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6
exists?
Paris is definitely not part
of the dataset (bit 6 is 0)
0  1  2  3  4  5  6  7  8  9 

0  1  0  1  0  0  0  1  0  0 
h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7
exists?
Madrid is reported as part
of the dataset (false positive: Madrid was never inserted)
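The insert and exists steps above can be sketched in a few lines of PHP. This is a minimal sketch, not a library implementation: the class name SimpleBloomFilter is hypothetical, elements are restricted to strings, and the two hash functions (FNV-1a and CRC32) are assumptions chosen because PHP ships them, whereas the slides use Murmur3 and FNV-1a.

```php
<?php
// Minimal Bloom filter sketch with two hash functions.
// Assumptions: FNV-1a and CRC32 as hashes, string elements, 64-bit PHP.
final class SimpleBloomFilter
{
    private string $bits;   // packed bit array, 8 bits per byte
    private int $size;      // number of bits

    public function __construct(int $size)
    {
        $this->size = $size;
        $this->bits = str_repeat("\0", intdiv($size + 7, 8));
    }

    /** @return int[] the bit positions selected by each hash function */
    private function positions(string $element): array
    {
        return [
            hexdec(hash('fnv1a32', $element)) % $this->size,
            crc32($element) % $this->size,
        ];
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $i) {
            $byte = intdiv($i, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($i % 8)));
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $i) {
            if ((ord($this->bits[intdiv($i, 8)]) & (1 << ($i % 8))) === 0) {
                return false;   // at least one bit is 0: definitely not inserted
            }
        }
        return true;            // all bits are 1: probably inserted
    }
}

$filter = new SimpleBloomFilter(10000);
$filter->insert('Rome');
$filter->insert('Berlin');
var_dump($filter->exists('Rome'));   // bool(true): inserted elements are always found
```

A negative answer is always correct; a positive answer can be wrong when other elements happened to set the same bits.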
Space efficiency
items      list of ~100-char strings, mem (MB)  Bloom filter 1% error, mem (MB)
1 000      3                                    0.02
100 000    37                                   0.1
1 000 000  264                                  0.9
Bit array size?
~1.2 KB
How many functions?
https://hur.st/bloomfilter/
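Both questions have standard closed-form answers: for n items and a target false-positive rate p, the bit array needs about m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name bloomParameters below is hypothetical; the item counts and error rates are the ones quoted earlier in the slides, and the ~1.2 KB case is assumed to be 1 000 items at 1% error.

```php
<?php
// Standard Bloom filter sizing formulas:
//   m = -n * ln(p) / (ln 2)^2   bits
//   k = (m / n) * ln 2          hash functions
function bloomParameters(int $items, float $errorRate): array
{
    $bits = (int) ceil(-$items * log($errorRate) / (M_LN2 ** 2));
    $hashes = max(1, (int) round($bits / $items * M_LN2));
    return ['bits' => $bits, 'bytes' => intdiv($bits + 7, 8), 'hashes' => $hashes];
}

print_r(bloomParameters(1000, 0.01));        // roughly 1.2 KB and 7 hash functions
print_r(bloomParameters(40000000, 0.001));   // the Akamai-sized example, roughly 68 MiB
```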
Variants
Counting Bloom filter
Approximate count of elements
Quotient filter
Storage friendly
Allows delete
Fast (only one hash)
Can be resized
Cuckoo filter
Faster
Allows delete
Less memory
Can be resized
PHP
PHP
pecl.php.net/package/bloomy
Not maintained
PHP
rocketlabs/bloomfilter
Murmur3
Redis support
Counting filter support
The problem
Cardinality
Unique visitors of a website
bbc.co.uk
558M visits/month (Nov 2018)
3.45 pages / visit
~2 billion page views
https://www.similarweb.com/website/bbc.co.uk#overview
bbc.co.uk
Count unique visitors?
Keeping a list of visits
grouped by some unique identifier
bbc.co.uk
10% unique users
Unique identifier is 10 bytes
List size ~ 2 GB
bbc.co.uk
HyperLogLog counts unique elements using 12 KB
0.81% error
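The two figures are consistent with each other under one assumption: 2^14 registers of 6 bits each, which is the configuration Redis uses. The standard error of HyperLogLog is approximately 1.04 / sqrt(m) for m registers. A quick check of the arithmetic:

```php
<?php
// Assumption: 2^14 registers of 6 bits each (the Redis configuration).
$registers = 2 ** 14;                       // 16384 registers
$bytes = intdiv($registers * 6, 8);         // 12288 bytes = 12 KB
$standardError = 1.04 / sqrt($registers);   // 1.04 / 128

printf("%d bytes, %.2f%% standard error\n", $bytes, $standardError * 100);
```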
Probabilistic
counting
rank = number of trailing zero bits of the hash
h('abc') = 10110110 ⇒ 1
h('xyz') = 11011000 ⇒ 3
h('foo') = 10111011 ⇒ 0
Flajolet, Martin 1985
Probabilistic counting
0  1  2  3  4  5  6  7 

0  0  0  0  0  0  0  0 
0  1  2  3  4  5  6  7 

1  1  0  1  0  0  0  0 
rank('foo') = 0
Probabilistic counting
rank('xyz') = 3
rank('abc') = 1
0  1  2  3  4  5  6  7 

1  1  0  1  0  0  0  0 
Probabilistic counting
Not really correct
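To see why the result is "not really correct", the example can be replayed in code. This is a sketch of the Flajolet and Martin estimator using the three 8-bit hash values from the slides as literals; the function name trailingZeros is hypothetical, and 0.77351 is the correction constant from the original paper.

```php
<?php
// Probabilistic counting replayed on the three 8-bit hash values from the slides.
function trailingZeros(int $hash, int $width = 8): int
{
    for ($r = 0; $r < $width; $r++) {
        if (($hash >> $r) & 1) {
            return $r;
        }
    }
    return $width;   // every bit is zero
}

$hashes = ['abc' => 0b10110110, 'xyz' => 0b11011000, 'foo' => 0b10111011];

$bitmap = array_fill(0, 8, 0);
foreach ($hashes as $hash) {
    $r = trailingZeros($hash);
    if ($r < 8) {
        $bitmap[$r] = 1;   // remember that this rank was observed
    }
}

// R = index of the first 0 in the bitmap; estimate = 2^R / 0.77351
$R = array_search(0, $bitmap, true);
$estimate = (2 ** $R) / 0.77351;

echo implode(' ', $bitmap), PHP_EOL;   // 1 1 0 1 0 0 0 0
printf("estimate = %.2f\n", $estimate);
```

With only one bitmap the estimate can only take values of the form 2^R / 0.77351, so for 3 real elements it lands on roughly 5. Averaging over many groups, as the following slides do, reduces this variance.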
Probabilistic counting
LogLog
Redis version
h('abc') = 10110110
h('xyz') = 11011000
h('foo') = 10111011
LogLog
The first 2 bits of the hash select the group; the remaining 6 bits are ranked
group  0  1  2  3  4  5
00     0  0  0  0  0  0   No elements fall in this group
01     0  0  0  0  0  0   No elements fall in this group
10     1  1  0  0  0  0   h('abc') = 10110110, h('foo') = 10111011
11     0  0  0  1  0  0   h('xyz') = 11011000
Real value = 3
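The grouping step can be replayed on the same three hashes. This sketch follows the slides' convention (first 2 bits pick the group, rank = trailing zeros of the remaining 6 bits) and keeps a small bitmap per group as the slides draw it; LogLog itself only needs the largest rank seen per group.

```php
<?php
// Replays the grouping step on the 8-bit hashes from the slides.
$hashes = ['abc' => '10110110', 'foo' => '10111011', 'xyz' => '11011000'];

$groups = [];
foreach (['00', '01', '10', '11'] as $prefix) {
    $groups[$prefix] = array_fill(0, 6, 0);
}

foreach ($hashes as $bits) {
    $prefix = substr($bits, 0, 2);   // first 2 bits: which group
    $rest = substr($bits, 2);        // remaining 6 bits: ranked
    $rank = strlen($rest) - strlen(rtrim($rest, '0'));   // trailing zeros
    if ($rank < 6) {
        $groups[$prefix][$rank] = 1;
    }
}

foreach ($groups as $prefix => $bitmap) {
    printf("%s: %s\n", $prefix, implode(' ', $bitmap));
}
```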
LogLog
HyperLogLog
HyperLogLog++
Improvements over LogLog
adding bias correction, adjusted constants
harmonic mean, 64-bit hashes, less memory
standard error ~0.81 %
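A simplified HyperLogLog fits in a short class. This is a sketch under stated assumptions, not a port of any library: the class name SimpleHyperLogLog is hypothetical, it uses a 32-bit SHA-1 prefix as hash, 2^10 registers by default (about 3% standard error instead of 0.81%), the plain harmonic-mean estimator with the small-range correction, and none of the HyperLogLog++ refinements.

```php
<?php
// Simplified HyperLogLog sketch (32-bit hashes, no HyperLogLog++ bias tables).
// Assumptions: SHA-1 prefix as hash function, 64-bit PHP build, p >= 7.
final class SimpleHyperLogLog
{
    private int $p;             // number of index bits
    private int $m;             // number of registers = 2^p
    private array $registers;   // largest rank seen per register

    public function __construct(int $p = 10)
    {
        $this->p = $p;
        $this->m = 1 << $p;
        $this->registers = array_fill(0, $this->m, 0);
    }

    public function add(string $element): void
    {
        $hash = hexdec(substr(sha1($element), 0, 8));     // 32-bit value
        $index = $hash >> (32 - $this->p);                // first p bits
        $rest = $hash & ((1 << (32 - $this->p)) - 1);     // remaining bits
        // rank = position of the leftmost 1-bit in the remaining bits
        $rank = 1;
        for ($bit = 31 - $this->p; $bit >= 0 && (($rest >> $bit) & 1) === 0; $bit--) {
            $rank++;
        }
        $this->registers[$index] = max($this->registers[$index], $rank);
    }

    public function count(): float
    {
        $alpha = 0.7213 / (1 + 1.079 / $this->m);         // constant for m >= 128
        $sum = 0.0;
        $zeros = 0;
        foreach ($this->registers as $value) {
            $sum += 2 ** (-$value);
            $zeros += ($value === 0) ? 1 : 0;
        }
        $estimate = $alpha * ($this->m ** 2) / $sum;      // harmonic mean
        if ($estimate <= 2.5 * $this->m && $zeros > 0) {
            $estimate = $this->m * log($this->m / $zeros); // small-range correction
        }
        return $estimate;
    }
}

$hll = new SimpleHyperLogLog();
foreach (['abc', 'xyz', 'foo', 'abc'] as $element) {
    $hll->add($element);
}
printf("about %.1f unique elements\n", $hll->count());
```

Adding the same element twice never changes the registers, which is why the structure counts unique elements rather than occurrences.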
PHP
PHP
https://github.com/shabbyrobe/phphll
PHP extension
Port of Redis HyperLogLog
PHP
joegreen0991/hyperloglog
Port of Redis HyperLogLog
The problem
Frequency
http://www.internetlivestats.com/twitter-statistics/
6000 tweets/s
some tweets have hashtags
1) Count tweets for the most popular tags
2) Compare today's tag count with other periods or regions
Keep sorted lists with hashtag counts
....
Count-Min Sketch
solves the problem using
~ 200 KB
0.01% error
Count-Min Sketch
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1  c2  c3  c4 

0  0  0  0 
0  0  0  0 
h1(x), h2(x)
h1(x)
h2(x)
counters array
c
r
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1  c2  c3  c4 

0  0  0  1 
0  0  0  1 
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1  c2  c3  c4 

0  0  0  2 
0  0  0  2 
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1  c2  c3  c4 

0  0  0  3 
0  0  0  3 
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1  c2  c3  c4 

0  0  0  4 
0  0  0  4 
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(2) = 4
h2(2) = 1
c1  c2  c3  c4 

0  0  0  5 
1  0  0  4 
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
h1(x)
h2(x)
Estimate frequency
h1(4) = 4
h2(4) = 4
f(4) = ?
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
h1(x)
h2(x)
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
real frequency 7
h1(x)
h2(x)
h1(2) = 4
h2(2) = 1
f(2) = ?
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
h1(x)
h2(x)
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
real frequency 3
h1(x)
h2(x)
h1(6) = 1
h2(6) = 2
f(6) = ?
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
h1(x)
h2(x)
c1  c2  c3  c4 

8  0  0  10 
3  4  6  7 
real frequency 1
h1(x)
h2(x)
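The update and estimate steps walked through above can be sketched as a small class. The class name SimpleCountMinSketch is hypothetical, and deriving one hash function per row from a seeded MD5 is an assumption made for brevity; a production version would use a fast non-cryptographic hash. Because the hash functions differ from those on the slides, the individual counter values will differ too.

```php
<?php
// Minimal Count-Min Sketch: one row of counters per hash function.
final class SimpleCountMinSketch
{
    private int $rows;
    private int $columns;
    private array $counters;

    public function __construct(int $rows, int $columns)
    {
        $this->rows = $rows;
        $this->columns = $columns;
        $this->counters = array_fill(0, $rows, array_fill(0, $columns, 0));
    }

    private function column(int $row, string $element): int
    {
        // One hash function per row, derived by prefixing the row number.
        return hexdec(substr(md5($row . ':' . $element), 0, 8)) % $this->columns;
    }

    public function add(string $element): void
    {
        for ($row = 0; $row < $this->rows; $row++) {
            $this->counters[$row][$this->column($row, $element)]++;
        }
    }

    public function estimate(string $element): int
    {
        $min = PHP_INT_MAX;
        for ($row = 0; $row < $this->rows; $row++) {
            $min = min($min, $this->counters[$row][$this->column($row, $element)]);
        }
        return $min;   // never below the real frequency, possibly above it
    }
}

$stream = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2];
$sketch = new SimpleCountMinSketch(2, 4);   // 2 rows x 4 columns, as on the slides
foreach ($stream as $value) {
    $sketch->add((string) $value);
}
echo $sketch->estimate('4'), PHP_EOL;   // at least 7, the real frequency
```

Taking the minimum across rows picks the counter least polluted by collisions, which is why the estimate can overshoot but never undershoot.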
Counter columns
= the frequency overestimation error (ε)
Example: 2% error
Hash functions
(and counter rows)
= the probability (δ) that an estimate exceeds that error
Example: 2%
~4 billion elements
0.1% overestimation error, 0.1% probability of exceeding it
7 rows (hash functions)
2718 columns
32-bit counters (count up to ~4 billion)
7 x 2718 x 32 / 8 = 76 KB
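The dimensions above follow from the usual Count-Min sizing rules: about e / ε columns and ln(1 / δ) rows, where ε is the overestimation error and δ the probability of exceeding it. The function name countMinSize is hypothetical. Note that round() is used for the columns only to reproduce the slide's 2718; many descriptions round up instead, which would give 2719.

```php
<?php
// Count-Min sizing: columns ~ e / epsilon, rows ~ ln(1 / delta).
function countMinSize(float $epsilon, float $delta, int $counterBits = 32): array
{
    $columns = (int) round(M_E / $epsilon);   // assumption: rounded to match the slide
    $rows = (int) ceil(log(1 / $delta));
    return [
        'rows' => $rows,
        'columns' => $columns,
        'bytes' => intdiv($rows * $columns * $counterBits, 8),
    ];
}

print_r(countMinSize(0.001, 0.001));   // 7 rows, 2718 columns, 76104 bytes
```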
Heavy Hitters
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
Find the top 3 elements
Top 3 elements are [(4, 7), (3, 6), (2, 3)]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
3 Top elements
Top 3 elements are [(4, 7), (3, 6)]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1  c2  c3  c4 

0  0  0  1 
0  0  0  1 
He = [ (4, 1) ]
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1  c2  c3  c4 

0  0  0  2 
0  0  0  2 
He = [ (4, 2) ]
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1  c2  c3  c4 

7  0  0  9 
1  3  6  7 
He = [ (4, 7) ]
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1  c2  c3  c4 

7  0  0  10 
2  4  6  6 
He = [ (4, 7), (3, 6) ]
h1(x)
h2(x)
He = [ (4, 7) ]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
He = [ (4, 7), (3, 6) ]
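One way to combine the sketch with a bounded candidate list (the "He" list on the slides) is shown below for the top-k variant. This is a sketch under assumptions: the function names topK and cmsColumn and the constants are hypothetical, the hashing is a seeded MD5 chosen for brevity, and a sorted array stands in for a real heap.

```php
<?php
// Top-k heavy hitters on top of Count-Min estimates.
const CMS_ROWS = 4;
const CMS_COLUMNS = 64;

function cmsColumn(int $row, string $element): int
{
    return hexdec(substr(md5($row . ':' . $element), 0, 8)) % CMS_COLUMNS;
}

function topK(array $stream, int $k): array
{
    $counters = array_fill(0, CMS_ROWS, array_fill(0, CMS_COLUMNS, 0));
    $candidates = [];   // element => estimated frequency, at most $k entries

    foreach ($stream as $element) {
        $element = (string) $element;
        // Update every row and keep the smallest counter as the estimate.
        $estimate = PHP_INT_MAX;
        for ($row = 0; $row < CMS_ROWS; $row++) {
            $column = cmsColumn($row, $element);
            $estimate = min($estimate, ++$counters[$row][$column]);
        }
        $candidates[$element] = $estimate;
        if (count($candidates) > $k) {
            arsort($candidates);                                  // highest first
            $candidates = array_slice($candidates, 0, $k, true);  // drop the rest
        }
    }
    arsort($candidates);
    return $candidates;
}

print_r(topK([4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2], 3));
```

An element that is evicted early can still re-enter the list later with a correct estimate, because the sketch keeps counting every element whether or not it is currently a candidate.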
Variants
Count-Min-Log Sketch
Better frequency approximation for low frequency elements
Count-Mean-Min Sketch
subtracts the estimated noise and takes the median, good when
underestimation is preferred
PHP
PHP
https://github.com/mrjgreen/CountMinSketch
Not registered on packagist
No unit tests (has benchmark test)
Weird hash function crc32 + md5
Only frequency estimation, no heavy hitters
What about PHP and big data?
Probabilistic data structures
reside in memory
...but PHP cleans the memory on each request
PHP is faster on each release
PHP daemons
started to emerge
swoole, amp, react
General knowledge of advanced data structures
helps you make decisions
when the right moment comes
Thank you!
 Twitter: @goetas_asmir
 Github: @goetas
 LinkedIn: @goetas
 WWW: goetas.com
Probabilistic data structures and algorithms for Big Data applications  PHPCE 2019
By Asmir Mustafic
As the amount of data we produce keeps growing, we also need more sophisticated methods to process it. Probabilistic data structures are based on different hashing techniques and provide approximate answers with predictable errors. The potential errors are compensated for by the very low memory usage, query time and scaling factors. This talk covers the most common strategies used to solve membership, counting, similarity, frequency and ranking problems in a Big Data context.