Bloom filters,
PHP and Big data
Asmir Mustafic
Berlin PHP user group - February 2019
Me
Asmir Mustafic
@goetas
- Twitter: @goetas_asmir
- Github: @goetas
- LinkedIn: @goetas
- WWW: goetas.com
Me
Software Developer
Berlin
Open source
- jms/serializer (contributor/maintainer)
- masterminds/html5 (contributor/maintainer)
- hautelook/templated-uri-bundle (contributor/maintainer)
- goetas-webservices/xsd2php (author)
- goetas-webservices/xsd-reader (author)
- goetas-webservices/soap-client (author)
- goetas/twital (author)
- many others...
Big data
- 1GB
- 1TB
- 1PB
- 1EB
Big data
1980
1GB = 250 000 $
455 kg
https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380
Big data
data are big
when we can't compute them
with the available resources and methods
Too many - Volume
Too different - Variety
Too fast - Velocity
How do we compute data?
Tape, HDD, SSD, Memory
Compared by: ease of use (developer), speed, cost, size
CPU works only here (memory)
How can we do more here?
Probabilistic data structures
In exchange for
predictable errors
- extremely scalable
- extremely low memory
Bloom Filters
Hash functions
Domain A
Domain B
h('abc') = 7987884
md5(string $str): string;
sha1(string $str): string;
crc32(string $str): int;
We need
Non-Cryptographic
hash functions
- Work factor
- Sticky state
- Diffusion
- One Way hashing
- Fixed output size
- No limits on input size
- Collision resistant
- preimage resistance
- second-preimage resistance
- Can allow init-key
Cryptographic
hash functions
https://password-hashing.net/submissions.html
SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2
Trade some cryptographic properties for performance
Non-Cryptographic
hash functions
CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV
https://cyan4973.github.io/xxHash/
Non-Cryptographic
hash functions speed
Name | Speed | Type |
---|---|---|
xxHash | 5.4 GB/s | Non-cryptographic |
MurmurHash3a | 2.7 GB/s | Non-cryptographic |
MD5 | 0.33 GB/s | Cryptographic |
SHA1 | 0.25 GB/s | Cryptographic |
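A rough way to get a feel for such speed differences locally is a small timing loop. The sketch below is only an assumption about how one might measure it: `benchmarkHashes()` is a hypothetical helper, the algorithm names are those exposed by PHP's `hash()`, and the numbers will vary by machine and PHP build, so they should not be expected to match the figures quoted from the xxHash page.

```php
<?php
declare(strict_types=1);

/**
 * Hashes the same payload repeatedly with each algorithm and
 * returns the measured throughput in MB/s, keyed by algorithm name.
 * Hypothetical helper, for illustration only.
 */
function benchmarkHashes(array $algos, int $payloadBytes = 1000000, int $rounds = 20): array
{
    $payload = random_bytes($payloadBytes);
    $available = hash_algos();
    $results = [];

    foreach ($algos as $algo) {
        if (!in_array($algo, $available, true)) {
            continue; // skip algorithms this PHP build does not ship
        }
        $start = microtime(true);
        for ($i = 0; $i < $rounds; $i++) {
            hash($algo, $payload);
        }
        $elapsed = max(microtime(true) - $start, 1e-9); // avoid division by zero
        $results[$algo] = ($payloadBytes * $rounds / 1e6) / $elapsed;
    }

    return $results;
}

foreach (benchmarkHashes(['md5', 'sha1', 'crc32b', 'fnv1a32']) as $algo => $mbPerSecond) {
    printf("%-8s %10.1f MB/s\n", $algo, $mbPerSecond);
}
```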
PHP
Good support for cryptographic hash functions
// 52 algorithms supported in PHP 7.3
hash($algo, $str);
// password-hashing optimized
password_hash($password, $algo);
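As a short illustration of the two calls above, the following sketch hashes a string with SHA-256 and stores and verifies a password. The input values are made up for the example; `PASSWORD_DEFAULT` lets PHP choose the password algorithm.

```php
<?php
declare(strict_types=1);

// General-purpose cryptographic digest: the algorithm name comes first, the data second.
$digest = hash('sha256', 'abc');
echo $digest, PHP_EOL;

// Password hashing: deliberately slow and salted, so the stored string differs on every call.
$stored = password_hash('s3cret', PASSWORD_DEFAULT);

var_dump(password_verify('s3cret', $stored)); // bool(true)
var_dump(password_verify('wrong', $stored));  // bool(false)
```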
PHP
Bad support for non-cryptographic hash functions
https://bugs.php.net/bug.php?id=62063
// some are available in hash()
hash($algo, $str);
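A few non-cryptographic algorithms that `hash()` does expose are `crc32b`, `fnv1a32` and `joaat`; Murmur3 and xxHash only became available in `hash()` with PHP 8.1. The sketch below shows one possible way to turn such a hash into a position in a fixed-size array. `bitIndex()` is a hypothetical helper, and the code assumes a 64-bit PHP build so that `hexdec()` returns an integer.

```php
<?php
declare(strict_types=1);

/**
 * Maps an element to a slot in [0, $size) using one of the
 * non-cryptographic algorithms shipped with hash().
 * Hypothetical helper, for illustration only.
 */
function bitIndex(string $element, string $algo, int $size): int
{
    // hash() returns a hex string; hexdec() converts it to a number.
    return hexdec(hash($algo, $element)) % $size;
}

foreach (['crc32b', 'fnv1a32', 'joaat'] as $algo) {
    printf("%-8s %s -> slot %d\n", $algo, hash($algo, 'Rome'), bitIndex('Rome', $algo, 10));
}
```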
Custom extensions can be found online
The problem
Membership
Akamai CDN
serves 15-30 % of all web traffic
~75% of files are requested only once
Akamai CDN
How to avoid caching on-disk one-hit requests?
Akamai CDN
Keeping a list of URLs and cache
only on the 2nd request!
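The "cache only on the 2nd request" idea can be sketched with an exact list of seen URLs. This is a naive illustration, not Akamai's implementation: `SecondHitCachePolicy` is a hypothetical class, and it keeps every URL in memory, which is exactly the cost the following slides try to avoid.

```php
<?php
declare(strict_types=1);

/**
 * Naive cache-on-second-hit policy backed by an exact set of URLs.
 * Hypothetical class, for illustration only.
 */
final class SecondHitCachePolicy
{
    /** @var array<string, true> URLs requested at least once */
    private array $seen = [];

    public function shouldCache(string $url): bool
    {
        if (isset($this->seen[$url])) {
            return true; // second (or later) request: worth caching
        }
        $this->seen[$url] = true; // first request: only remember the URL

        return false;
    }
}

$policy = new SecondHitCachePolicy();
var_dump($policy->shouldCache('/logo.png')); // bool(false)
var_dump($policy->shouldCache('/logo.png')); // bool(true)
```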
Akamai CDN
40M unique URLs in 24h per node
URL length ~100 chars
~4 GB to keep them as a plain list
To avoid caching those one-hit files, Akamai uses Bloom filters
Akamai CDN
https://doi.org/10.1145/2805789.2805800
using ~68 MB
with a 0.1% error rate
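The two memory figures can be reproduced with a little arithmetic. The sketch below assumes the standard Bloom filter sizing formula with an optimal number of hash functions, m = -n·ln(p) / (ln 2)², which the slides do not state explicitly; the function names are hypothetical.

```php
<?php
declare(strict_types=1);

/** Bytes needed to keep every URL as a plain string (ignoring PHP overhead). */
function plainListBytes(int $items, int $averageLength): int
{
    return $items * $averageLength;
}

/** Bytes needed by a Bloom filter sized for the given false-positive rate. */
function bloomFilterBytes(int $items, float $errorRate): int
{
    $bits = -$items * log($errorRate) / (log(2) ** 2);

    return (int) ceil($bits / 8);
}

$urls = 40000000; // 40M unique URLs per node in 24h

printf("plain list:   %.1f GB\n", plainListBytes($urls, 100) / 1e9);
printf("bloom filter: %.1f MiB\n", bloomFilterBytes($urls, 0.001) / (1024 * 1024));
```

Under these assumptions the list needs about 4 GB while the filter stays below 70 MiB, which is in line with the ~68 MB quoted on the slide.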
Bloom Filter
interface BloomFilter
{
    public function insert(string $element): void;
    public function exists(string $element): bool;
}
- exists() === false
Element was definitely not inserted
- exists() === true
Element was PROBABLY inserted
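A minimal implementation of this interface could look like the sketch below. It is an assumption-laden illustration rather than a production filter: `StringBloomFilter` is a hypothetical class, the bit array is stored in a PHP string, and `crc32b` and `fnv1a32` stand in for the hash functions because they ship with `hash()` (a 64-bit PHP build is assumed).

```php
<?php
declare(strict_types=1);

interface BloomFilter
{
    public function insert(string $element): void;
    public function exists(string $element): bool;
}

/** Hypothetical minimal Bloom filter; the bit array lives in a byte string. */
final class StringBloomFilter implements BloomFilter
{
    private string $bits;
    private int $size;
    /** @var string[] hash() algorithm names, one per hash function */
    private array $algos;

    public function __construct(int $size, array $algos = ['crc32b', 'fnv1a32'])
    {
        $this->size = $size;
        $this->algos = $algos;
        $this->bits = str_repeat("\0", intdiv($size + 7, 8)); // all bits start at 0
    }

    public function insert(string $element): void
    {
        foreach ($this->indexes($element) as $i) {
            $byte = intdiv($i, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($i % 8)));
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->indexes($element) as $i) {
            if ((ord($this->bits[intdiv($i, 8)]) & (1 << ($i % 8))) === 0) {
                return false; // one unset bit proves the element was never inserted
            }
        }

        return true; // all bits set: probably inserted
    }

    /** @return int[] one bit position per hash function */
    private function indexes(string $element): array
    {
        $indexes = [];
        foreach ($this->algos as $algo) {
            $indexes[] = hexdec(hash($algo, $element)) % $this->size;
        }

        return $indexes;
    }
}

$filter = new StringBloomFilter(1000);
$filter->insert('Rome');
var_dump($filter->exists('Rome')); // bool(true)
```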
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
bit array
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
Insert
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3
Insert
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3
exists?
Rome is probably part
of the dataset
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6
exists?
Paris is definitely not part
of the dataset (bit 6 is 0)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7
exists?
Madrid looks like part
of the dataset (false positive: it was never inserted)
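The whole walk-through can be replayed in a few lines. In this sketch the bit positions are copied from the slides rather than recomputed with Murmur3 and FNV-1a, and `probablyContains()` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/** True when every position is set, i.e. the element was PROBABLY inserted. */
function probablyContains(array $bits, array $indexes): bool
{
    foreach ($indexes as $i) {
        if ($bits[$i] === 0) {
            return false;
        }
    }

    return true;
}

// [h1 mod 10, h2 mod 10] as given on the slides, not real hash outputs.
$positions = [
    'Rome'   => [7, 3],
    'Berlin' => [1, 3],
    'Paris'  => [3, 6],
    'Madrid' => [1, 7],
];

$bits = array_fill(0, 10, 0);

// Insert only Rome and Berlin.
foreach (['Rome', 'Berlin'] as $city) {
    foreach ($positions[$city] as $i) {
        $bits[$i] = 1;
    }
}

echo implode(' ', $bits), PHP_EOL;                        // 0 1 0 1 0 0 0 1 0 0
var_dump(probablyContains($bits, $positions['Rome']));   // bool(true)
var_dump(probablyContains($bits, $positions['Paris']));  // bool(false), bit 6 is 0
var_dump(probablyContains($bits, $positions['Madrid'])); // bool(true), a false positive
```

Madrid shares bit 1 with Berlin and bit 7 with Rome, which is why the filter cannot tell it apart from an inserted element.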
Bit-array size?
~1.2Kbyte
How many functions?
https://hur.st/bloomfilter/
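Both questions have closed-form answers in the standard Bloom filter analysis, which is what calculators such as the one linked above are generally based on. The sketch below assumes those textbook formulas (m = -n·ln(p) / (ln 2)², k = (m/n)·ln 2) and uses hypothetical function names; 1 000 items at a 1% error rate is an assumed input that happens to land at roughly 1.2 KB.

```php
<?php
declare(strict_types=1);

/** Bits needed for $items elements at false-positive rate $errorRate. */
function optimalBits(int $items, float $errorRate): int
{
    return (int) ceil(-$items * log($errorRate) / (log(2) ** 2));
}

/** Number of hash functions that minimises the error for the given size. */
function optimalHashCount(int $bits, int $items): int
{
    return max(1, (int) round($bits / $items * log(2)));
}

/** Expected false-positive rate after inserting $items elements. */
function falsePositiveRate(int $bits, int $items, int $hashCount): float
{
    return (1 - exp(-$hashCount * $items / $bits)) ** $hashCount;
}

$bits = optimalBits(1000, 0.01);
$hashCount = optimalHashCount($bits, 1000);

// prints: 9586 bits (~1.2 KB), 7 hash functions
printf("%d bits (~%.1f KB), %d hash functions\n", $bits, $bits / 8 / 1000, $hashCount);
```

With the same formula, the 10-bit, two-function toy filter holding two cities has an expected false-positive rate of roughly 11%, so a collision like the Madrid one is not surprising.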
items | list of ~100-char strings, mem (MB) | Bloom filter at 1%, mem (MB) |
---|---|---|
1 000 | 3 | 0.02 |
100 000 | 37 | 0.1 |
1 000 000 | 264 | 0.9 |
Space efficiency
Variants
Counting Bloom filter
- Approximate count of elements
Quotient filter
- Storage friendly
- Allows delete
- Fast (only one hash)
- Can be resized
Cuckoo filter
- Faster
- Allows delete
- Less memory
- Can be resized
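To make the first variant more concrete, here is a minimal sketch of a counting Bloom filter: each slot holds a counter instead of a single bit, so an approximate count can be read back and elements can be removed again. `CountingBloomFilter` is a hypothetical class, the `crc32b` and `fnv1a32` algorithms are an assumption, and counts can be overestimated when elements collide.

```php
<?php
declare(strict_types=1);

/** Hypothetical counting Bloom filter: counters instead of bits. */
final class CountingBloomFilter
{
    /** @var int[] */
    private array $counters;
    private int $size;
    /** @var string[] */
    private array $algos;

    public function __construct(int $size, array $algos = ['crc32b', 'fnv1a32'])
    {
        $this->size = $size;
        $this->algos = $algos;
        $this->counters = array_fill(0, $size, 0);
    }

    public function insert(string $element): void
    {
        foreach ($this->indexes($element) as $i) {
            $this->counters[$i]++;
        }
    }

    /** Approximate number of insertions: never too low, possibly too high. */
    public function count(string $element): int
    {
        $counts = [];
        foreach ($this->indexes($element) as $i) {
            $counts[] = $this->counters[$i];
        }

        return min($counts);
    }

    public function remove(string $element): void
    {
        if ($this->count($element) === 0) {
            return; // never inserted, nothing to undo
        }
        foreach ($this->indexes($element) as $i) {
            $this->counters[$i]--;
        }
    }

    /** @return int[] distinct counter positions for the element */
    private function indexes(string $element): array
    {
        $indexes = [];
        foreach ($this->algos as $algo) {
            $indexes[] = hexdec(hash($algo, $element)) % $this->size;
        }

        return array_unique($indexes);
    }
}

$filter = new CountingBloomFilter(1000);
$filter->insert('Rome');
$filter->insert('Rome');
echo $filter->count('Rome'), PHP_EOL; // 2
$filter->remove('Rome');
echo $filter->count('Rome'), PHP_EOL; // 1
```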
PHP
pecl.php.net/package/bloomy
Not maintained
PHP
rocket-labs/bloom-filter
Murmur3
Redis support
Counting filter support
PHP
and
BIG DATA
?
Thank you!
Discover what Bloom filters are and how they can be used for big data computation