Asmir Mustafic
PHP CE 2019 - Dresden - Germany
Berlin
1980
1GB = 250 000 $
455 kg
https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380
Data are big when we can't compute them with the available resources and methods
Too many - Volume
Too different - Variety
Too fast - Velocity
Ease of use (developer)
CPU works only here
Speed
Cost efficiency
How can we do more here?
Size
Ease of use
Speed
Cost efficiency
Domain A
Domain B
h('abc...') = 7987884
h(123) = 'a'
h(1234) = 5
md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;
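A minimal sketch of the three built-in functions above, showing that a hash is deterministic and can be folded onto a small domain. The helper name `bucket` is hypothetical, and the modulo assumes a 64-bit PHP build where `crc32()` is never negative.

```php
<?php
declare(strict_types=1);

// The same input always produces the same output.
$a = md5('abc');   // 32 hex characters (128 bits)
$b = sha1('abc');  // 40 hex characters (160 bits)
$c = crc32('abc'); // integer checksum (32 bits)

// Hypothetical helper: map any string onto a small domain of buckets.
// Assumes a 64-bit build, where crc32() returns a non-negative integer.
function bucket(string $value, int $buckets): int
{
    return crc32($value) % $buckets;
}

$position = bucket('abc', 10); // always the same bucket for 'abc'
```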
https://password-hashing.net/submissions.html
SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2
CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV
https://cyan4973.github.io/xxHash/
Name | Speed |
---|---|
xxHash | 5.4 GB/s |
MurmurHash3a | 2.7 GB/s |
MD5 | 0.33 GB/s |
SHA1 | 0.25 GB/s |
Non-cryptographic
Cryptographic
// 52 algorithms supported in PHP 7.3
hash($algo, $str);
// password-hashing optimized
password_hash($password, $algo);
https://bugs.php.net/bug.php?id=62063
// some available in hash()
hash($algo, $str);
Custom extensions can be found online
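A short sketch of the calls above in use. The input strings are made up for illustration; the exact list returned by `hash_algos()` depends on the PHP build.

```php
<?php
declare(strict_types=1);

// hash() takes the algorithm name first, then the data.
$fast   = hash('fnv1a32', 'Rome'); // non-cryptographic, 8 hex characters
$check  = hash('crc32b', 'Rome');  // non-cryptographic, 8 hex characters
$secure = hash('sha256', 'Rome');  // cryptographic, 64 hex characters

// The algorithms actually available on this build.
$available = hash_algos();

// Passwords get a deliberately slow, salted hash instead.
$stored = password_hash('s3cret', PASSWORD_DEFAULT);
$ok     = password_verify('s3cret', $stored);
$bad    = password_verify('wrong', $stored);
```

`password_hash()` produces a different string on every call because of the random salt, so stored hashes are compared with `password_verify()`, never with `===`.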
https://doi.org/10.1145%2F2805789.2805800
0.1% error
interface BloomFilter
{
    public function insert(mixed $element): void;
    public function exists(mixed $element): bool;
}
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
bit array
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
Rome is part
of the dataset
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6
Paris is not part
of the dataset
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7
Madrid is reported as part
of the dataset (false positive: it was never inserted)
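The walkthrough above can be sketched in a few lines of PHP. This is an illustration under assumptions, not a production filter: the class name `SimpleBloomFilter` is hypothetical, elements are restricted to strings, and `crc32b` stands in for Murmur3 because `hash()` only offers Murmur3 from PHP 8.1. The positions it computes will therefore differ from the ones on the slides.

```php
<?php
declare(strict_types=1);

final class SimpleBloomFilter
{
    /** @var int[] bit array: one 0/1 entry per position */
    private array $bits;
    private int $size;

    public function __construct(int $size)
    {
        $this->size = $size;
        $this->bits = array_fill(0, $size, 0);
    }

    /** @return int[] positions derived from two different hash functions */
    private function positions(string $element): array
    {
        // Assumes a 64-bit build, where hexdec() of 8 hex digits is an int.
        return [
            hexdec(hash('crc32b', $element)) % $this->size,
            hexdec(hash('fnv1a32', $element)) % $this->size,
        ];
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $position) {
            $this->bits[$position] = 1;
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $position) {
            if ($this->bits[$position] === 0) {
                return false; // definitely never inserted
            }
        }
        return true; // probably inserted, false positives are possible
    }
}

$empty = new SimpleBloomFilter(1000);

$filter = new SimpleBloomFilter(1000);
$filter->insert('Rome');
$filter->insert('Berlin');
```

A `false` answer is always correct; a `true` answer only means "probably", as the Madrid lookup shows.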
items | list of ~100-char strings, mem (MB) | Bloom filter at 1% error, mem (MB) |
---|---|---|
1 000 | 3 | 0.02 |
100 000 | 37 | 0.1 |
1 000 000 | 264 | 0.9 |
~1.2 KB
https://hur.st/bloomfilter/
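Calculators like the one linked above rely on the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A minimal sketch follows; the function name `bloomParameters` is hypothetical, and the result covers only the raw bit array, not any PHP object overhead.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: optimal bit count and hash count for a Bloom filter
 * holding $items elements at the given false positive rate.
 */
function bloomParameters(int $items, float $falsePositiveRate): array
{
    $bits   = (int) ceil(-$items * log($falsePositiveRate) / (M_LN2 ** 2));
    $hashes = (int) max(1, round($bits / $items * M_LN2));

    return [
        'bits'   => $bits,
        'bytes'  => (int) ceil($bits / 8),
        'hashes' => $hashes,
    ];
}

// 1 000 items at 1% error need roughly 1.2 KB of bits.
$parameters = bloomParameters(1000, 0.01);
```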
Approximate count of elements
Storage friendly
Allows delete
Fast (only one hash)
Can be resized
Faster
Allows delete
Less memory
Can be resized
Murmur3
Redis support
Counting filter support
https://www.similarweb.com/website/bbc.co.uk#overview
0.81% error
h('abc') = 10110110 ⇒ 1
h('xyz') = 11011000 ⇒ 3
h('foo') = 10111011 ⇒ 0
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
rank('foo') = 0
rank('xyz') = 3
rank('abc') = 1
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
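The rank used above is the number of trailing zero bits in the hash. The sketch below rebuilds the bitmap from the three example hashes and then applies a Flajolet-Martin style estimate, 2^R / 0.77351, where R is the first unset position. The function name `trailingZeroRank` is hypothetical, and the hash values are hard-coded from the slides rather than computed.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: rank = number of trailing zero bits.
function trailingZeroRank(string $bits): int
{
    return strlen($bits) - strlen(rtrim($bits, '0'));
}

// Hash values taken from the example above.
$hashes = ['abc' => '10110110', 'xyz' => '11011000', 'foo' => '10111011'];

$bitmap = array_fill(0, 8, 0);
foreach ($hashes as $bits) {
    $bitmap[trailingZeroRank($bits)] = 1;
}

// R is the lowest position that is still 0.
$firstZero = array_search(0, $bitmap, true);

// Flajolet-Martin style estimate of the number of distinct elements.
$estimate = (2 ** $firstZero) / 0.77351;
```

With only three elements and a single bitmap the estimate is rough (about 5 against a real value of 3), which is the motivation for splitting the hashes into buckets and averaging.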
Not really correct
Redis version
h('abc') = 10110110
h('xyz') = 11011000
h('foo') = 10111011
Bucket 00: no elements fall in this group
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
Bucket 01: no elements fall in this group
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
Bucket 10
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0 |
Bucket 11
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 |
h('abc') = 10110110
h('foo') = 10111011
h('xyz') = 11011000
Real value = 3
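A sketch of the bucketing step, assuming the first two bits of each hash select one of four buckets and the remaining six bits provide the rank (trailing zeros). The hash values are hard-coded from the example; the variable names are illustrative.

```php
<?php
declare(strict_types=1);

// Hash values taken from the example above.
$hashes = ['abc' => '10110110', 'foo' => '10111011', 'xyz' => '11011000'];

// Four buckets (prefixes 00, 01, 10, 11), each with a 6-position bitmap.
$buckets = array_fill(0, 4, array_fill(0, 6, 0));

foreach ($hashes as $bits) {
    $bucket = bindec(substr($bits, 0, 2)); // first two bits pick the bucket
    $rest   = substr($bits, 2);            // remaining six bits
    $rank   = strlen($rest) - strlen(rtrim($rest, '0'));
    $rank   = min($rank, strlen($rest) - 1); // keep an all-zero suffix in range

    $buckets[$bucket][$rank] = 1;
}
```

Here 'abc' and 'foo' both land in bucket 10 and 'xyz' lands in bucket 11; buckets 00 and 01 stay empty.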
Improvements of LogLog
adding bias correction, adjusted constants
harmonic mean, 64bit, less memory
standard error ~0.81 %
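The ~0.81 % figure follows from the usual HyperLogLog standard error, 1.04 / sqrt(m), with m = 16384 registers as used by Redis. A minimal sketch, with a hypothetical function name:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: HyperLogLog standard error for m registers.
function hyperLogLogStandardError(int $registers): float
{
    return 1.04 / sqrt($registers);
}

$redisError = hyperLogLogStandardError(16384); // about 0.0081, i.e. ~0.81 %
```

Quadrupling the number of registers halves the error, so accuracy is bought with memory.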
PHP extension
Port of Redis HyperLogLog
Port of Redis HyperLogLog
http://www.internetlivestats.com/twitter-statistics/
....
0.01% error
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
h1(x), h2(x)
h1(x)
h2(x)
counters array
c
r
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 2 |
0 | 0 | 0 | 2 |
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 3 |
0 | 0 | 0 | 3 |
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(4) = 4
h2(4) = 4
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 4 |
0 | 0 | 0 | 4 |
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
h1(2) = 4
h2(2) = 1
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 5 |
1 | 0 | 0 | 4 |
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
h1(4) = 4
h2(4) = 4
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
h1(2) = 4
h2(2) = 1
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
h1(6) = 1
h2(6) = 2
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
c1 | c2 | c3 | c4 |
---|---|---|---|
8 | 0 | 0 | 10 |
3 | 4 | 6 | 7 |
h1(x)
h2(x)
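The update and query steps walked through above can be sketched as a small Count-Min Sketch. The class name and the hashing scheme are assumptions: each row salts `fnv1a32` with its row number to get a different hash function, so the counter positions will not match the slides.

```php
<?php
declare(strict_types=1);

final class CountMinSketch
{
    /** @var int[][] counters array: one row per hash function */
    private array $counters;
    private int $columns;

    public function __construct(int $rows, int $columns)
    {
        $this->columns  = $columns;
        $this->counters = array_fill(0, $rows, array_fill(0, $columns, 0));
    }

    // Assumption: salting with the row number yields one hash per row.
    private function column(int $row, string $element): int
    {
        return hexdec(hash('fnv1a32', $row . ':' . $element)) % $this->columns;
    }

    public function add(string $element): void
    {
        foreach (array_keys($this->counters) as $row) {
            $this->counters[$row][$this->column($row, $element)]++;
        }
    }

    public function estimate(string $element): int
    {
        $minimum = PHP_INT_MAX;
        foreach ($this->counters as $row => $cells) {
            $minimum = min($minimum, $cells[$this->column($row, $element)]);
        }
        return $minimum; // never below the real frequency
    }
}

$sketch = new CountMinSketch(2, 4);
foreach ([4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2] as $value) {
    $sketch->add((string) $value);
}
```

Taking the minimum across rows limits the damage of collisions: an estimate can be too high, but it is never lower than the true count.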
(and counter rows)
= the frequency overestimation error
Example: 2% error
= standard error
Example: 2% error
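Assuming the two parameters above correspond to the usual Count-Min Sketch ε (overestimation bound, as a fraction of the stream length) and δ (probability of exceeding that bound), the standard sizing is ⌈e / ε⌉ columns and ⌈ln(1 / δ)⌉ rows. The function name is hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: counters array dimensions for given epsilon and delta.
function countMinDimensions(float $epsilon, float $delta): array
{
    return [
        'columns' => (int) ceil(M_E / $epsilon),
        'rows'    => (int) ceil(log(1 / $delta)),
    ];
}

// 2% for both parameters.
$dimensions = countMinDimensions(0.02, 0.02);
```

With 2 % for both values this gives 136 columns and 4 rows, independent of how many elements flow through the sketch.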
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
Top 3 elements are [(4, 7), (3, 6), (2, 3)]
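For a stream this small the exact answer is easy to compute and serves as a baseline for the approximate structures. A sketch, with a hypothetical function name:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: exact top-k by counting every element.
function topK(array $stream, int $k): array
{
    $counts = array_count_values($stream);
    arsort($counts); // highest frequency first

    return array_slice($counts, 0, $k, true); // keep element => count keys
}

$stream = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2];
$top = topK($stream, 3); // [4 => 7, 3 => 6, 2 => 3]
```

The exact approach keeps one counter per distinct element, which is what becomes too expensive on large streams.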
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
Top 3 elements are [(4, 7), (3, 6)]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
He = [ (4, 1) ]
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1 | c2 | c3 | c4 |
---|---|---|---|
0 | 0 | 0 | 2 |
0 | 0 | 0 | 2 |
He = [ (4, 2) ]
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1 | c2 | c3 | c4 |
---|---|---|---|
7 | 0 | 0 | 9 |
1 | 3 | 6 | 7 |
He = [ (4, 7) ]
h1(x)
h2(x)
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
c1 | c2 | c3 | c4 |
---|---|---|---|
7 | 0 | 0 | 10 |
2 | 4 | 6 | 6 |
He = [ (4, 7), (3, 6) ]
h1(x)
h2(x)
He = [ (4, 7) ]
4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2
He = [ (4, 7), (3, 6) ]
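A rough sketch of heavy hitters on top of a Count-Min Sketch, under several assumptions: the threshold fraction `$phi` is made up (the slides do not state one), a plain array stands in for the heap `He`, and rows are hashed with a salted `fnv1a32`. An element becomes a candidate when its estimated frequency reaches `$phi` times the number of elements seen so far.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: elements whose estimated frequency is >= phi * n.
function heavyHitters(array $stream, float $phi, int $rows = 4, int $columns = 64): array
{
    $counters   = array_fill(0, $rows, array_fill(0, $columns, 0));
    $candidates = []; // element => last estimate (stands in for the heap)
    $seen       = 0;

    foreach ($stream as $element) {
        $seen++;
        $estimate = PHP_INT_MAX;
        for ($row = 0; $row < $rows; $row++) {
            // Assumption: salting with the row number gives one hash per row.
            $column = hexdec(hash('fnv1a32', $row . ':' . $element)) % $columns;
            $counters[$row][$column]++;
            $estimate = min($estimate, $counters[$row][$column]);
        }
        if ($estimate >= $phi * $seen) {
            $candidates[$element] = $estimate;
        }
    }

    // Drop candidates that fell below the final threshold.
    return array_filter(
        $candidates,
        static fn (int $estimate): bool => $estimate >= $phi * $seen
    );
}

$stream = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2];
$heavy  = heavyHitters($stream, 0.3);
```

Because the sketch can only overestimate, true heavy hitters such as 4 and 3 are never missed; a collision may occasionally let an extra element through.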
Better frequency approximation for low frequency elements
subtracts the median, good when
under-estimation is preferred
Not registered on packagist
No unit tests (has benchmark test)
Weird hash function crc32 + md5
Only frequency estimation, no heavy hitters