Bloom filters,
PHP and Big data 

Asmir Mustafic

Berlin PHP user group - February 2019

Me

Asmir Mustafic

@goetas

Software Developer

Berlin

Open source

  • jms/serializer (contributor/maintainer)
  • masterminds/html5 (contributor/maintainer)
  • hautelook/templated-uri-bundle (contributor/maintainer)
  • goetas-webservices/xsd2php (author)
  • goetas-webservices/xsd-reader (author)
  • goetas-webservices/soap-client (author)
  • goetas/twital (author)
  • many others...

Big data

  • 1GB
  • 1TB
  • 1PB
  • 1EB

1980

1GB = 250 000 $

 455 kg

https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives#IBM_3380

data are big
when we can't compute them
with the available resources and methods

Too many - Volume

Too different - Variety

Too fast - Velocity

How do we compute data?

Tape, HDD, SSD, Memory

  • Ease of use (developer), speed and cost all grow from tape to memory
  • The CPU works only on data in memory

How can we do more in memory?

  • Size
  • Ease of use (developer)
  • Speed

Probabilistic data structures

In exchange for
predictable errors

  • extremely scalable
  • extremely low memory

Bloom Filters

Hash functions

h(x) = y

h maps Domain A to Domain B

h('abc') = 7987884

md5(string $str): string;

sha1(string $str): string;

crc32(string $str): int;
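
A minimal sketch of what these three built-in functions have in common: the output has a fixed size, whatever the length of the input. The two sample inputs are arbitrary.

```php
<?php
declare(strict_types=1);

// Two inputs of very different length
$inputs = ['abc', str_repeat('abc', 1000)];

foreach ($inputs as $input) {
    printf(
        "input length %4d: md5 %d hex chars, sha1 %d hex chars, crc32 %d\n",
        strlen($input),
        strlen(md5($input)),  // always 32 hex characters (128 bits)
        strlen(sha1($input)), // always 40 hex characters (160 bits)
        crc32($input)         // always a 32-bit checksum, returned as an int
    );
}
```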

We need
Non-Cryptographic

hash functions

Cryptographic
hash functions

  • Work factor
  • Sticky state
  • Diffusion
  • One-way hashing
    • Fixed output size
    • No limits on input size
  • Collision resistant
    • preimage resistance
    • second-preimage resistance
  • Can allow init-key

https://password-hashing.net/submissions.html

SHA-1 SHA-256 SHA-512 MD2 MD4 MD5 MD6 RadioGatún Whirlpool Tiger BLAKE-256 BLAKE-512 HAVAL Argon2

Non-Cryptographic
hash functions

Trade some cryptographic properties for performance

CityHash FarmHash MetroHash SpookyHash xxHash MurmurHash JenkinsHash FNV

https://cyan4973.github.io/xxHash/

Non-Cryptographic
hash functions speed

Name           Type               Speed
xxHash         non-cryptographic  5.4 GB/s
MurmurHash3a   non-cryptographic  2.7 GB/s
MD5            cryptographic      0.33 GB/s
SHA1           cryptographic      0.25 GB/s

PHP

Good support for cryptographic hash functions

// 52 algorithms supported in PHP 7.3, see hash_algos()
hash($algo, $str);

// optimized for password hashing
password_hash($password, $algo);

Bad support for non-cryptographic hash functions

https://bugs.php.net/bug.php?id=62063

// some are available in hash(), e.g. crc32b, fnv1a32, joaat
hash($algo, $str);

Custom extensions can be found online
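
As a rough illustration of using the non-cryptographic algorithms that hash() does ship, the sketch below turns a digest into a position in a small bit array. The helper name bitPosition() and the array size of 10 are assumptions made for this example, and it assumes a 64-bit PHP build.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: map an element to a position in a bit array of $size bits.
 * Assumes a 64-bit PHP build, where a 32-bit digest fits in an int.
 */
function bitPosition(string $element, string $algo, int $size): int
{
    // hash() returns the digest as a hex string (8 hex characters for 32-bit digests)
    return hexdec(hash($algo, $element)) % $size;
}

// Only use the algorithms this PHP build actually provides
$wanted = ['crc32b', 'fnv1a32', 'joaat'];
$available = array_intersect($wanted, hash_algos());

foreach ($available as $algo) {
    printf("%-8s %s -> position %d\n", $algo, hash($algo, 'Rome'), bitPosition('Rome', $algo, 10));
}
```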

The problem

Membership

Akamai CDN

serves 15-30 % of all web traffic

~75% of files are requested only once

How to avoid caching one-hit requests on disk?

Keep a list of requested URLs and cache a file
only on its 2nd request!
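
The rule can be sketched as follows. The class name SecondHitPolicy is hypothetical, and a plain PHP array stands in for the list of seen URLs; this is an illustration of the idea, not Akamai's implementation.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: decides whether a requested URL should be written to the disk cache. */
final class SecondHitPolicy
{
    /** @var array<string, true> URLs requested at least once */
    private $seen = [];

    public function shouldCache(string $url): bool
    {
        if (isset($this->seen[$url])) {
            return true;          // 2nd (or later) request: worth caching
        }

        $this->seen[$url] = true; // 1st request: only remember the URL
        return false;
    }
}

$policy = new SecondHitPolicy();
var_dump($policy->shouldCache('/logo.png')); // bool(false)
var_dump($policy->shouldCache('/logo.png')); // bool(true)
```

The exact list of URLs is the part that grows with traffic, which is why its memory footprint matters.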

40M unique URLs in 24h per node

URL length ~100 chars

40M x 100 bytes = 4 GB

To avoid caching those one-hit files, Akamai uses Bloom filters

https://doi.org/10.1145%2F2805789.2805800

using ~ 68 MB

0.1% error
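
A back-of-the-envelope check of these figures. It assumes the standard Bloom filter sizing formula m = -n ln P / (ln 2)^2 and reads "MB" as MiB; whether the cited paper sized its filter exactly this way is an assumption.

```php
<?php
declare(strict_types=1);

$urls = 40000000;   // unique URLs in 24h per node (from the slide)
$urlLength = 100;   // approximate URL length in bytes (from the slide)
$errorRate = 0.001; // 0.1% false positives (from the slide)

// Plain list: one copy of every URL, ignoring any per-entry overhead
$listBytes = $urls * $urlLength;

// Bloom filter: bits needed for the same number of URLs at the target error rate
$bloomBits = (int) ceil(-$urls * log($errorRate) / (log(2) ** 2));
$bloomMiB = $bloomBits / 8 / 1024 / 1024;

printf("plain list:   %.1f GB\n", $listBytes / 1e9);
printf("bloom filter: %.1f MiB\n", $bloomMiB);
```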

Bloom Filter

interface BloomFilter 
{

  function insert(string $element): void;

  function exists(string $element): bool;

}

  • exists() === false
    Element was definitely not inserted

  • exists() === true
    Element was PROBABLY inserted
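
One possible implementation of this interface, as a sketch only. The class name PackedBloomFilter is hypothetical. Instead of two independent functions such as Murmur3 and Fnv1a, it derives all positions from crc32 and fnv1a32 by double hashing, a swapped-in technique chosen because both are available in stock PHP. It assumes a 64-bit PHP build.

```php
<?php
declare(strict_types=1);

interface BloomFilter
{
    public function insert(string $element): void;

    public function exists(string $element): bool;
}

// Hypothetical implementation: bits are packed into a binary string, 8 per byte.
final class PackedBloomFilter implements BloomFilter
{
    private $size;   // m: number of bits
    private $hashes; // k: positions set per element
    private $bits;   // binary string holding the bit array

    public function __construct(int $size, int $hashes)
    {
        $this->size = $size;
        $this->hashes = $hashes;
        $this->bits = str_repeat("\0", intdiv($size + 7, 8));
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $position) {
            $byte = intdiv($position, 8);
            $mask = 1 << ($position % 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | $mask);
        }
    }

    public function exists(string $element): bool
    {
        foreach ($this->positions($element) as $position) {
            $mask = 1 << ($position % 8);
            if ((ord($this->bits[intdiv($position, 8)]) & $mask) === 0) {
                return false; // a bit is missing: the element was never inserted
            }
        }

        return true; // all bits set: the element was probably inserted
    }

    /** @return int[] k positions in the range 0 .. size - 1 */
    private function positions(string $element): array
    {
        // Double hashing: position_i = (h1 + i * h2) mod m
        $h1 = crc32($element);
        $h2 = hexdec(hash('fnv1a32', $element));
        $positions = [];
        for ($i = 0; $i < $this->hashes; $i++) {
            $positions[] = ($h1 + $i * $h2) % $this->size;
        }

        return $positions;
    }
}

$filter = new PackedBloomFilter(9585, 7);
$filter->insert('Rome');
$filter->insert('Berlin');

var_dump($filter->exists('Rome'));  // bool(true): inserted elements are always found
var_dump($filter->exists('Paris')); // false unless all 7 positions hit bits that are already set
```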

bit array

index  0 1 2 3 4 5 6 7 8 9
bit    0 0 0 0 0 0 0 0 0 0

Insert

h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3

index  0 1 2 3 4 5 6 7 8 9
bit    0 0 0 1 0 0 0 1 0 0

Insert

h1(Berlin) = Murmur3(Berlin) mod 10 = 1
h2(Berlin) = Fnv1a(Berlin) mod 10 = 3

index  0 1 2 3 4 5 6 7 8 9
bit    0 1 0 1 0 0 0 1 0 0

exists?

h1(Rome) = Murmur3(Rome) mod 10 = 7
h2(Rome) = Fnv1a(Rome) mod 10 = 3

index  0 1 2 3 4 5 6 7 8 9
bit    0 1 0 1 0 0 0 1 0 0

Rome is part of the dataset

exists?

h1(Paris) = Murmur3(Paris) mod 10 = 3
h2(Paris) = Fnv1a(Paris) mod 10 = 6

index  0 1 2 3 4 5 6 7 8 9
bit    0 1 0 1 0 0 0 1 0 0

Paris is not part of the dataset (bit 6 is 0)

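exists?

h1(Madrid) = Murmur3(Madrid) mod 10 = 1
h2(Madrid) = Fnv1a(Madrid) mod 10 = 7

index  0 1 2 3 4 5 6 7 8 9
bit    0 1 0 1 0 0 0 1 0 0

Madrid is reported as part of the dataset
(false positive: Madrid was never inserted, but bits 1 and 7 were set by Berlin and Rome)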
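
The walkthrough can be replayed with a toy script. The positions are copied from the slides rather than recomputed with real Murmur3 and Fnv1a implementations, and the helper name probablyInserted() is made up for this sketch.

```php
<?php
declare(strict_types=1);

// Positions as given on the slides (not recomputed from real hash functions)
$positions = [
    'Rome'   => [7, 3],
    'Berlin' => [1, 3],
    'Paris'  => [3, 6],
    'Madrid' => [1, 7],
];

/** True when every position of the element is set in the bit array. */
function probablyInserted(array $bits, array $elementPositions): bool
{
    foreach ($elementPositions as $position) {
        if ($bits[$position] === 0) {
            return false;
        }
    }

    return true;
}

// Start with 10 zero bits, then insert Rome and Berlin only
$bits = array_fill(0, 10, 0);
foreach (['Rome', 'Berlin'] as $city) {
    foreach ($positions[$city] as $position) {
        $bits[$position] = 1;
    }
}

echo implode(' ', $bits), "\n"; // 0 1 0 1 0 0 0 1 0 0

foreach (['Rome', 'Paris', 'Madrid'] as $city) {
    printf("%-6s %s\n", $city, probablyInserted($bits, $positions[$city]) ? 'probably inserted' : 'not inserted');
}
```

Madrid comes out as "probably inserted" although only Rome and Berlin were added: this is the predictable error the filter trades for its small size.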

Bit-array size?

(n = expected number of items, P = false-positive probability, m = bits, k = hash functions)

m = - \frac{n \ln{P}}{(\ln{2})^2}

- \frac{1000 \ln{0.01}}{(\ln{2})^2} \approx 9585\ bits

~1.2 KB

How many functions?

k = \frac{m}{n}\ln{2}

\frac{9585}{1000}\ln{2} \approx 6.6
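
Both formulas fit in a few lines of PHP. The function name bloomParameters() is hypothetical. The sketch rounds m up and k to the nearest integer, then re-checks the resulting error rate with the usual approximation (1 - e^(-kn/m))^k, which is not on the slides.

```php
<?php
declare(strict_types=1);

/**
 * @param int   $n expected number of elements
 * @param float $p target false-positive probability, 0 < p < 1
 */
function bloomParameters(int $n, float $p): array
{
    $m = (int) ceil(-$n * log($p) / (log(2) ** 2)); // m = -n ln P / (ln 2)^2, rounded up
    $k = max(1, (int) round($m / $n * log(2)));     // k = (m / n) ln 2, rounded

    // Approximate error rate once m and k are whole numbers
    $rate = (1 - exp(-$k * $n / $m)) ** $k;

    return ['bits' => $m, 'hashes' => $k, 'expectedErrorRate' => $rate];
}

// The example from the slides: 1000 items, 1% error
print_r(bloomParameters(1000, 0.01));
```

For the slide's example this gives roughly 9585 bits and 7 hash functions, with an expected error rate close to the 1% target.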

https://hur.st/bloomfilter/

Space efficiency

items       list of ~100-char strings, mem (MB)   Bloom filter 1% error, mem (MB)
1 000       3                                     0.02
100 000     37                                    0.1
1 000 000   264                                   0.9

Variants

Counting Bloom filter

Approximate count of elements 
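
A rough sketch of the idea: every position holds a counter instead of a single bit, so the smallest counter among an element's positions bounds how often it was inserted, and elements can be removed again. The class name and the double-hashing scheme based on crc32 and fnv1a32 are assumptions of this example, which also assumes a 64-bit PHP build.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of a counting Bloom filter
final class CountingBloomFilter
{
    private $size;     // number of counters
    private $hashes;   // positions touched per element
    private $counters; // int[]: one counter per position

    public function __construct(int $size, int $hashes)
    {
        $this->size = $size;
        $this->hashes = $hashes;
        $this->counters = array_fill(0, $size, 0);
    }

    public function insert(string $element): void
    {
        foreach ($this->positions($element) as $position) {
            $this->counters[$position]++;
        }
    }

    /** Upper bound on how many times $element was inserted; 0 means never. */
    public function count(string $element): int
    {
        $counts = [];
        foreach ($this->positions($element) as $position) {
            $counts[] = $this->counters[$position];
        }

        return min($counts);
    }

    /** Only call this for elements that were really inserted, or other counts get corrupted. */
    public function remove(string $element): void
    {
        if ($this->count($element) === 0) {
            return;
        }
        foreach ($this->positions($element) as $position) {
            $this->counters[$position]--;
        }
    }

    /** @return int[] distinct positions in the range 0 .. size - 1 */
    private function positions(string $element): array
    {
        $h1 = crc32($element);
        $h2 = hexdec(hash('fnv1a32', $element));
        $positions = [];
        for ($i = 0; $i < $this->hashes; $i++) {
            $positions[] = ($h1 + $i * $h2) % $this->size;
        }

        return array_unique($positions);
    }
}

$filter = new CountingBloomFilter(1000, 4);
$filter->insert('Rome');
$filter->insert('Rome');
$filter->insert('Rome');
$filter->remove('Rome');

var_dump($filter->count('Rome')); // int(2)
```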

Quotient filter

  • Storage friendly
  • Allows delete
  • Fast (only one hash)
  • Can be resized

Cuckoo filter

  • Faster
  • Allows delete
  • Less memory
  • Can be resized

PHP

pecl.php.net/package/bloomy

Not maintained

rocket-labs/bloom-filter

Murmur3

Redis support

Counting filter support

PHP and BIG DATA?

Thank you!

Bloom filters, PHP and Big data - Berlin PHP user group 2019

By Asmir Mustafic

Discover what Bloom filters are and how they can be used for big data computation