Probabilistic data structures
and algorithms
for Big Data applications
Asmir Mustafic
PHP CE 2019 - Dresden - Germany
Me
Asmir Mustafic
Me
@goetas
- Twitter: @goetas_asmir
- Github: @goetas
- LinkedIn: @goetas
- WWW: goetas.com
Berlin
Community
- jms/serializer (maintainer)
- masterminds/html5 (maintainer)
- hautelook/templated-uri-bundle (maintainer)
- goetas-webservices/xsd2php (author)
- goetas-webservices/xsd-reader (author)
- goetas-webservices/soap-client (author)
- goetas/twital (author)
- PHP-FIG secretary
Agenda
- Why use probabilistic data structures
- 3 problems solved using probabilistic data structures
Big data
- 1GB
- 1TB
- 1PB
- 1EB
Big data
1980
1GB = 250 000 $
455 kg
https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380
Data is big
when we can't process it
with the available resources and methods
Too many - Volume
Too different - Variety
Too fast - Velocity
Big data
How do we compute data?
Tape → HDD → SSD → Memory

The same storage hierarchy, compared along different axes:
- Ease of use (developer): the CPU works only with Memory
- Speed
- Cost efficiency

How can we do more in Memory?
- Size
- Ease of use
- Speed
- Cost efficiency
Probabilistic data structures
In exchange for predictable errors:
- extremely scalable
- extremely low memory usage
Hash functions
Domain A
Domain B
h('abc...') = 7987884
h(123) = 'a'
h(1234) = 5
md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;
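A minimal sketch of the fixed-output-size property using PHP's built-in functions (the inputs are just placeholders): no matter how long the input is, the output length stays the same.

$inputs = ['Rome', 'https://example.com/a/very/long/path?with=query&and=more'];

foreach ($inputs as $input) {
    // md5() always returns 32 hex characters, crc32() always returns a 32-bit integer
    printf("%s\n  md5:   %s\n  crc32: %u\n", $input, md5($input), crc32($input));
}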
- Work factor
- Sticky state
- Diffusion
- One-way hashing
- Fixed output size
- No limits on input size
- Collision resistance
- Preimage resistance
- Second-preimage resistance
- Can allow an init key
Cryptographic
hash functions
Cryptographic
hash functions
https://password-hashing.net/submissions.html
SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2
Trade some cryptographic properties for performance
Non-Cryptographic
hash functions
Non-Cryptographic
hash functions
CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV
https://cyan4973.github.io/xxHash/
Name | Speed |
---|---|
xxHash | 5.4 GB/s |
MurmurHash3a | 2.7 GB/s |
MD5 | 0.33 GB/s |
SHA1 | 0.25 GB/s |
Non-cryptographic vs. cryptographic hash function speed
PHP
PHP
Good support for cryptographic hash functions
// 52 algorithms supported in PHP 7.3
hash($algo, $str);
// optimized for password hashing
password_hash($password, $algo);
PHP
Bad support for non-cryptographic hash functions
https://bugs.php.net/bug.php?id=62063
// some available in hash()
hash($algo, $str);
Custom extensions can be found online
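That said, a few non-cryptographic algorithms do ship with ext/hash; a minimal sketch (the URL is a placeholder):

// all of these appear in hash_algos() on a stock PHP build
foreach (['crc32b', 'fnv1a64', 'joaat'] as $algo) {
    printf("%-8s %s\n", $algo, hash($algo, 'https://example.com/some/url'));
}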
Probabilistic data structures
The problem
Membership
Akamai CDN
serves 15-30 % of all web traffic
~75% of files are requested only once
Akamai CDN
How can we avoid caching one-hit files on disk?
Akamai CDN
Keep a list of URLs and cache
only on the 2nd request!
Akamai CDN
URL length ~100 chars
40M unique URLs in 24h per node
~4 GB just to keep the list
To avoid caching those one-hit files, Akamai uses Bloom filters
Akamai CDN
https://doi.org/10.1145%2F2805789.2805800
using ~ 68 MB
0.1% error
Bloom Filter
interface BloomFilter
{
function insert(mixed $element): void;
function exists(mixed $element): bool;
}
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
bit array
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
Insert
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3
Insert
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
exists?
Rome is part
of the dataset
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6
exists?
Paris is not part
of the dataset
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7
exists?
Madrid appears to be part
of the dataset (a false positive: it was never inserted)
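A minimal sketch of the walkthrough above in PHP. The class name is illustrative, and crc32b / fnv1a64 from ext/hash stand in for Murmur3 and Fnv1a:

final class SimpleBloomFilter
{
    /** @var int[] plain PHP array used as a bit array */
    private array $bits;

    public function __construct(private int $size)
    {
        $this->bits = array_fill(0, $size, 0);
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $i) {
            $this->bits[$i] = 1;
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $i) {
            if ($this->bits[$i] === 0) {
                return false; // definitely not in the set
            }
        }
        return true; // probably in the set (false positives possible)
    }

    /** @return int[] one bit position per hash function */
    private function positions(string $element): array
    {
        return [
            hexdec(hash('crc32b', $element)) % $this->size,
            hexdec(substr(hash('fnv1a64', $element), 0, 8)) % $this->size,
        ];
    }
}

$filter = new SimpleBloomFilter(10);
$filter->insert('Rome');
$filter->insert('Berlin');
var_dump($filter->exists('Rome'));  // true
var_dump($filter->exists('Paris')); // usually false, but false positives are possible

With only 10 bits and 2 hash functions the false-positive rate is high, exactly as the Madrid example shows; real filters use far larger bit arrays and more hash functions.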
items | list of ~100-char strings (MB) | Bloom filter, 1% error (MB) |
---|---|---|
1 000 | 3 | 0.02 |
100 000 | 37 | 0.1 |
1 000 000 | 264 | 0.9 |
Space efficiency
Bit-array size?
~1.2Kbyte
How many functions?
https://hur.st/bloomfilter/
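The sizing formulas behind calculators like the one above: m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. A quick check, assuming the ~1.2 KB figure refers to 1 000 items at a 1% false-positive rate:

$n = 1000;   // expected items (assumption)
$p = 0.01;   // target false-positive rate (assumption)

$m = (int) ceil(-$n * log($p) / (log(2) ** 2)); // bits in the array
$k = (int) round($m / $n * log(2));             // number of hash functions

printf("m = %d bits (~%.1f KB), k = %d\n", $m, $m / 8 / 1024, $k);
// m ≈ 9 585 bits ≈ 1.2 KB, k ≈ 7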
Variants
Counting Bloom filter
Approximate count of elements
Quotient filter
Storage friendly
Allows delete
Fast (only one hash)
Can be resized
Cuckoo filter
Faster
Allows delete
Less memory
Can be resized
PHP
PHP
pecl.php.net/package/bloomy
Not maintained
PHP
rocket-labs/bloom-filter
Murmur3
Redis support
Counting filter support
The problem
Cardinality
Unique visitors of a website
bbc.co.uk
558M visits/month (Nov 2018)
3.45 pages / visit
~2bln page views
https://www.similarweb.com/website/bbc.co.uk#overview
bbc.co.uk
Count unique visitors?
Keep a list of visits
grouped by some unique identifier
bbc.co.uk
10% unique users
Unique identifier is 10 bytes
List size ~ 2 GB
bbc.co.uk
HyperLogLog counts unique elements using 12 KB
0.81% error
Probabilistic
counting
h('abc') = 10110110 ⇒ 1
h('xyz') = 11011000 ⇒ 3
h('foo') = 10111011 ⇒ 0
(⇒ rank: the number of trailing zero bits)
Flajolet, Martin 1985
Probabilistic counting
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
Probabilistic counting
rank('foo') = 0
rank('abc') = 1
rank('xyz') = 3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
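A minimal sketch of the rank used above (the number of trailing zero bits of the hash), assuming a 32-bit non-cryptographic hash; the function name is illustrative:

function rank(string $element): int
{
    $hash = hexdec(hash('crc32b', $element)); // any 32-bit hash works here
    if ($hash === 0) {
        return 32; // all bits are zero
    }
    $trailingZeros = 0;
    while (($hash & 1) === 0) {
        $hash >>= 1;
        $trailingZeros++;
    }
    return $trailingZeros;
}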
Probabilistic counting
Not really correct
Probabilistic counting
LogLog
Redis version
h('abc') = 10110110
h('xyz') = 11011000
h('foo') = 10111011
LogLog
The first 2 bits of each hash select a bucket; the rank of the remaining bits is recorded in that bucket.

bucket | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
00 | 0 | 0 | 0 | 0 | 0 | 0 |
01 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 1 | 1 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 1 | 0 | 0 |

No elements fall in buckets 00 and 01.
h('abc') = 10110110
h('foo') = 10111011
h('xyz') = 11011000
Real value = 3
LogLog
HyperLogLog
HyperLogLog++
Improvements over LogLog:
bias correction, adjusted constants,
harmonic mean, 64-bit hashes, less memory
standard error ~0.81 %
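A minimal sketch of the register update shared by the LogLog family, assuming a 32-bit hash and the rank definition from above; the function name is illustrative, and the bias correction and estimation formula are left out:

$b = 4;                                 // first 4 bits pick one of 2^4 = 16 registers
$registers = array_fill(0, 1 << $b, 0);

function observe(array &$registers, int $b, string $element): void
{
    $hash   = hexdec(hash('crc32b', $element)); // 32-bit hash
    $bucket = $hash >> (32 - $b);               // first b bits select the register
    $rest   = $hash & ((1 << (32 - $b)) - 1);   // remaining bits

    $rank = 0;                                  // trailing zero bits of the rest
    while ($rest > 0 && ($rest & 1) === 0) {
        $rest >>= 1;
        $rank++;
    }

    // each register only remembers the largest rank it has seen
    $registers[$bucket] = max($registers[$bucket], $rank);
}

observe($registers, $b, 'visitor-42');          // hypothetical unique-visitor id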
PHP
PHP
https://github.com/shabbyrobe/phphll
PHP extension
Port of Redis HyperLogLog
PHP
joegreen0991/hyperloglog
Port of Redis HyperLogLog
The problem
Frequency
http://www.internetlivestats.com/twitter-statistics/
6,000 tweets/s
some tweets have hashtags
1) Count tweets for
most popular tags
2) Compare today's tag-count
with
other periods or regions
Keep sorted lists with hashtag counts ...
Count-Min Sketch
solves the problem using
~ 200 KB
0.01% error
Count-Min Sketch
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
Two hash functions h1(x), h2(x); one row of counters per hash function (r rows × c columns):

 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 0 |
h2(x) | 0 | 0 | 0 | 0 |
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 1 |
h2(x) | 0 | 0 | 0 | 1 |
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 2 |
h2(x) | 0 | 0 | 0 | 2 |
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 3 |
h2(x) | 0 | 0 | 0 | 3 |
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 4 |
h2(x) | 0 | 0 | 0 | 4 |
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(2) = 4
h2(2) = 1
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 5 |
h2(x) | 1 | 0 | 0 | 4 |
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 3 | 4 | 6 | 7 |
Estimate frequency
h1(4) = 4
h2(4) = 4
f(4) = ?
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 3 | 4 | 6 | 7 |
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 1 | 4 | 6 | 7 |
f(4) = min(10, 7) = 7
real frequency 7
h1(2) = 4
h2(2) = 1
f(2) = ?
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 3 | 4 | 6 | 7 |
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 3 | 4 | 6 | 7 |
f(2) = min(10, 3) = 3
real frequency 3
h1(6) = 1
h2(6) = 2
f(6) = ?
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 3 | 4 | 6 | 7 |
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 8 | 0 | 0 | 10 |
h2(x) | 3 | 4 | 6 | 7 |
f(6) = min(8, 4) = 4
real frequency 1 (overestimated because of collisions)
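A minimal sketch of the insert and estimate steps above; the class name is illustrative, and crc32b / fnv1a64 stand in for h1 and h2:

final class SimpleCountMinSketch
{
    /** @var int[][] one row of counters per hash function */
    private array $counters;

    public function __construct(private int $columns)
    {
        $this->counters = array_fill(0, 2, array_fill(0, $columns, 0));
    }

    public function add(string $element): void
    {
        foreach ($this->positions($element) as $row => $column) {
            $this->counters[$row][$column]++;
        }
    }

    public function estimate(string $element): int
    {
        // the estimate is the minimum over all rows: it never underestimates
        $min = PHP_INT_MAX;
        foreach ($this->positions($element) as $row => $column) {
            $min = min($min, $this->counters[$row][$column]);
        }
        return $min;
    }

    /** @return int[] one column index per row */
    private function positions(string $element): array
    {
        return [
            hexdec(hash('crc32b', $element)) % $this->columns,
            hexdec(substr(hash('fnv1a64', $element), 0, 8)) % $this->columns,
        ];
    }
}

$sketch = new SimpleCountMinSketch(4);
foreach ([4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2] as $n) {
    $sketch->add((string) $n);
}
echo $sketch->estimate('4'); // >= 7: may overestimate, never underestimates

Collisions can only add to a counter, which is why f(6) comes out too high in the walkthrough while f(4) and f(2) happen to be exact.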
Hash functions (counter rows)
= probability that a frequency is overestimated beyond the error bound
Example: 2%

Counter columns
= standard error of the estimates
Example: 2%
~4 billion unique elements, 0.1% error and 0.1% error probability:
- 7 rows (hash functions)
- 2718 columns
- 32-bit counters (~4 billion per counter)
- 7 × 2718 × 32 / 8 ≈ 76 KB
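A quick check of those numbers, assuming the usual parameterization width = ⌈e/ε⌉ columns and depth = ⌈ln(1/δ)⌉ rows:

$epsilon = 0.001; // 0.1% relative error
$delta   = 0.001; // 0.1% probability of exceeding it

$columns = (int) ceil(M_E / $epsilon);  // ≈ 2718
$rows    = (int) ceil(log(1 / $delta)); // ≈ 7
$bytes   = $rows * $columns * 32 / 8;   // 32-bit counters

printf("%d rows x %d columns = ~%d KB\n", $rows, $columns, $bytes / 1000);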
Heavy Hitters
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
Find 3 Top elements
Top 3 elements are [(4, 7), (3, 6), (2, 3)]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
Heavy hitters: only elements above a frequency threshold
Top elements are [(4, 7), (3, 6)]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 1 |
h2(x) | 0 | 0 | 0 | 1 |
He = [ (4, 1) ]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 0 | 0 | 0 | 2 |
h2(x) | 0 | 0 | 0 | 2 |
He = [ (4, 2) ]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 7 | 0 | 0 | 9 |
h2(x) | 1 | 3 | 6 | 7 |
He = [ (4, 7) ]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
 | c1 | c2 | c3 | c4 |
---|---|---|---|---|
h1(x) | 7 | 0 | 0 | 10 |
h2(x) | 2 | 4 | 6 | 6 |
He = [ (4, 7), (3, 6) ]
He = [ (4, 7) ]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
He = [ (4, 7), (3, 6) ]
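A minimal sketch of that loop, reusing the SimpleCountMinSketch sketch from earlier and keeping a small map He of elements whose estimated frequency reaches a threshold (the threshold value is an assumption):

$stream    = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2];
$threshold = 6;                       // keep only the "heavy" elements
$sketch    = new SimpleCountMinSketch(4);
$he        = [];

foreach ($stream as $element) {
    $key = (string) $element;
    $sketch->add($key);
    $estimate = $sketch->estimate($key);
    if ($estimate >= $threshold) {
        $he[$key] = $estimate;        // update the running top list
    }
}

arsort($he);
print_r($he); // roughly [4 => 7, 3 => 6]; estimates may be slightly high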
Variants
Count-Min-Log Sketch
Better frequency approximation for low frequency elements
Count-Mean-Min Sketch
subtracts the median (estimated noise); good when
some under-estimation is acceptable
PHP
PHP
https://github.com/mrjgreen/CountMinSketch
Not registered on packagist
No unit tests (has benchmark test)
Odd hash-function choice (crc32 + md5)
Only frequency estimation, no heavy hitters
What about PHP and big data?
Probabilistic data structures
reside in memory
...but PHP clears its memory after each request
PHP gets faster with every release
PHP daemons have started to emerge:
Swoole, Amp, ReactPHP
General knowledge of advanced data structures
helps you decide
when the right moment is to use them
Thank you!
- Twitter: @goetas_asmir
- Github: @goetas
- LinkedIn: @goetas
- WWW: goetas.com
Probabilistic data structures and algorithms for Big Data applications - PHPCE 2019
By Asmir Mustafic
As the amount of data we produce keeps growing, we also need more sophisticated methods to process it. Probabilistic data structures are based on different hashing techniques and provide approximate answers with predictable errors. The potential errors are compensated by incredibly low memory usage, query time and scaling factors. This talk covers the most common strategies used to solve membership, counting, similarity, frequency and ranking problems in a Big Data context.