Asmir Mustafic
Berlin PHP user group - February 2019
Berlin
1980
1GB = 250 000 $
455 kg
https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380
data are big
when we can't compute them
with the available resources and methods
Too many - Volume
Too different - Variety
Too fast - Velocity
Ease of use (developer)
CPU works only here
Speed
Cost
How can we do more here?
Size
Ease of use (developer)
Speed
Domain A
Domain B
h('abc') = 7987884
md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;
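A minimal sketch of what these three built-ins return for the same input. The value 7987884 above is only illustrative; real outputs depend on the algorithm. The helper `bucket()` and the slot count of 10 are assumptions added for illustration, not part of the talk.

```php
<?php
// The three built-in hash functions listed above, applied to the same input.
$md5   = md5('abc');   // 32 hex characters (128 bits)
$sha1  = sha1('abc');  // 40 hex characters (160 bits)
$crc32 = crc32('abc'); // integer checksum (32 bits)

// Hypothetical helper: map any string to one of $buckets slots.
// Assumes a 64-bit PHP build, where crc32() is never negative.
function bucket(string $value, int $buckets): int
{
    return crc32($value) % $buckets;
}

printf(
    "md5:   %s\nsha1:  %s\ncrc32: %d\nslot:  %d\n",
    $md5,
    $sha1,
    $crc32,
    bucket('abc', 10)
);
```

The same input always lands in the same slot, which is the property the later slides rely on.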
https://password-hashing.net/submissions.html
SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2
CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV
https://cyan4973.github.io/xxHash/
Name | Speed |
---|---|
xxHash | 5.4 GB/s |
MurmurHash3a | 2.7 GB/s |
MD5 | 0.33 GB/s |
SHA1 | 0.25 GB/s |
Non-cryptographic
Cryptographic
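A rough way to reproduce this kind of comparison locally. This is a sketch, not the benchmark behind the table: xxHash and MurmurHash3 are not exposed by `hash()` in PHP 7.3, so `fnv1a32` and `crc32b` stand in as non-cryptographic examples, and the function name, payload size and round count are arbitrary assumptions. Results depend on the machine and the PHP build, and the ranking may differ from the table.

```php
<?php
// Rough micro-benchmark: hash the same payload repeatedly with several
// algorithms and report throughput in MB/s.
function throughputMbPerSec(string $algo, string $payload, int $rounds): float
{
    $start = hrtime(true); // nanoseconds, available since PHP 7.3
    for ($i = 0; $i < $rounds; $i++) {
        hash($algo, $payload);
    }
    $seconds = (hrtime(true) - $start) / 1e9;

    // Guard against a zero duration on very fast runs.
    return (strlen($payload) * $rounds) / (1024 * 1024) / max($seconds, 1e-9);
}

$payload = str_repeat('x', 1024 * 1024); // 1 MiB of data
$results = [];
foreach (['fnv1a32', 'crc32b', 'md5', 'sha1'] as $algo) {
    $results[$algo] = throughputMbPerSec($algo, $payload, 20);
    printf("%-8s %10.1f MB/s\n", $algo, $results[$algo]);
}
```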
// 52 algorithms supported in PHP 7.3
hash($algo, $str);
// optimized for password hashing
password_hash($password, $algo);
https://bugs.php.net/bug.php?id=62063
// some are available in hash()
hash($algo, $str);
Custom extensions can be found online
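To see which non-cryptographic algorithms a given PHP build offers, `hash_algos()` can be queried at runtime. A small sketch; the list in `$wanted` is an assumption chosen for illustration, and `murmur3a` and `xxh64` are only expected on newer PHP versions (8.1 and later), so on PHP 7.3 they should be reported as missing.

```php
<?php
// List every algorithm this PHP build exposes through hash().
$available = hash_algos();

// Non-cryptographic candidates; which ones exist depends on the PHP version.
$wanted = ['fnv1a32', 'fnv1a64', 'crc32b', 'joaat', 'murmur3a', 'xxh64'];

foreach ($wanted as $algo) {
    if (in_array($algo, $available, true)) {
        // Note the argument order: algorithm first, data second.
        printf("%-9s %s\n", $algo, hash($algo, 'Berlin'));
    } else {
        printf("%-9s not available in this build\n", $algo);
    }
}
```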
https://doi.org/10.1145%2F2805789.2805800
0.1% error
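interface BloomFilter
{
    // Adds an element to the filter.
    public function insert(string $element): void;

    // false: definitely never inserted; true: probably inserted.
    public function exists(string $element): bool;
}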
exists() === true
Element was PROBABLY inserted
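One possible implementation of this interface, as a sketch under stated assumptions. The class name `StringBloomFilter`, the packed binary string used as the bit array, and the choice of `crc32b` and `fnv1a32` as the two hash functions are all assumptions; they are picked because both are available through `hash()` in PHP 7.3. The interface is repeated so the snippet runs on its own.

```php
<?php
// Interface repeated from the slide so the sketch is self-contained.
interface BloomFilter
{
    public function insert(string $element): void;
    public function exists(string $element): bool;
}

final class StringBloomFilter implements BloomFilter
{
    private $bits;  // binary string used as a packed bit array
    private $size;  // number of bits in the array
    private $algos; // one hash() algorithm name per hash function

    public function __construct(int $size, array $algos = ['crc32b', 'fnv1a32'])
    {
        $this->size  = $size;
        $this->algos = $algos;
        $this->bits  = str_repeat("\0", (int) ceil($size / 8));
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $pos) {
            $byte = intdiv($pos, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos % 8)));
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $pos) {
            if ((ord($this->bits[intdiv($pos, 8)]) & (1 << ($pos % 8))) === 0) {
                return false; // one unset bit: definitely never inserted
            }
        }

        return true; // all bits set: probably inserted
    }

    // Maps the element to one bit position per configured algorithm.
    private function positions(string $element): array
    {
        $positions = [];
        foreach ($this->algos as $algo) {
            // 32-bit hex digest to integer; assumes a 64-bit PHP build.
            $positions[] = hexdec(hash($algo, $element)) % $this->size;
        }

        return $positions;
    }
}

$filter = new StringBloomFilter(1024);
$filter->insert('Rome');
$filter->insert('Berlin');
var_dump($filter->exists('Rome')); // bool(true): inserted elements are always found
```

A `false` answer is always correct; a `true` answer can be wrong when other elements happen to have set the same bits.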
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
bit array
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
Rome is probably part of the dataset
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6
Paris is definitely not part of the dataset (bit 6 is 0)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7
Madrid is reported as part of the dataset, although it was never inserted (false positive)
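The walk-through above can be replayed in a few lines. In this sketch the bit positions are copied from the slides instead of being computed, so Murmur3 and FNV-1a are not actually invoked; the variable and closure names are assumptions for illustration.

```php
<?php
// Hash positions copied from the slides above (already reduced mod 10).
$positions = [
    'Rome'   => [7, 3],
    'Berlin' => [1, 3],
    'Paris'  => [3, 6],
    'Madrid' => [1, 7],
];

$bits = array_fill(0, 10, 0); // the 10-slot bit array, all zeros

// Set the bit at every position of the given city.
$insert = function (string $city) use (&$bits, $positions): void {
    foreach ($positions[$city] as $pos) {
        $bits[$pos] = 1;
    }
};

// True only if every position of the given city is already set.
$exists = function (string $city) use (&$bits, $positions): bool {
    foreach ($positions[$city] as $pos) {
        if ($bits[$pos] === 0) {
            return false;
        }
    }

    return true;
};

$insert('Rome');
$insert('Berlin');

echo implode(' ', $bits), "\n"; // 0 1 0 1 0 0 0 1 0 0
var_dump($exists('Rome'));      // bool(true)
var_dump($exists('Paris'));     // bool(false): bit 6 is still 0
var_dump($exists('Madrid'));    // bool(true): false positive, never inserted
```

Madrid collides with bit 1 from Berlin and bit 7 from Rome, which is exactly why a positive answer only means "probably".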
~1.2 KB
https://hur.st/bloomfilter/
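Calculators like the one linked above are based on the standard Bloom filter sizing formulas: m = -n · ln(p) / (ln 2)² bits and k = (m / n) · ln 2 hash functions, for n items and false-positive rate p. A sketch in PHP; the function name `bloomParameters()` and the inputs of 1 000 items at 1% error are assumptions, chosen because they give a size close to the ~1.2 KB figure.

```php
<?php
// Standard Bloom filter sizing:
//   bits   m = -n * ln(p) / (ln 2)^2
//   hashes k = (m / n) * ln 2
function bloomParameters(int $items, float $errorRate): array
{
    $bits   = (int) ceil(-$items * log($errorRate) / (M_LN2 ** 2));
    $hashes = (int) round($bits / $items * M_LN2);

    return [
        'bits'   => $bits,
        'bytes'  => (int) ceil($bits / 8),
        'hashes' => $hashes,
    ];
}

// Assumed inputs: 1 000 items, 1% false-positive rate.
$params = bloomParameters(1000, 0.01);
printf(
    "%d bits (~%d bytes), %d hash functions\n",
    $params['bits'],
    $params['bytes'],
    $params['hashes']
);
```

Lowering the error rate or raising the item count increases the size logarithmically in p and linearly in n.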
items | list of ~100-char strings, mem (MB) | bloom 1%, mem (MB) |
---|---|---|
1 000 | 3 | 0.02 |
100 000 | 37 | 0.1 |
1 000 000 | 264 | 0.9 |
Approximate count of elements
Storage friendly
Allows delete
Fast (only one hash)
Can be resized
Faster
Allows delete
Less memory
Can be resized
Murmur3
Redis support
Counting filter support