We have a list of URLs known to be malicious, but it's huge - let's say 100 MB
Checking a URL against everything in that list would take obnoxiously long ("but there's a way around that" - hold that thought!)
Heck, even just downloading the 100 MB list would be kinda gross :(
But querying Google's IsThisWebsiteMalicious service on every URL is obviously not good either (a network round trip per page, and Google gets to see every URL you visit)...
BLOOM FILTERS
Relax the problem constraints - what if a query can return "no" or "maybe"?
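Hash each value with N hash functions - each one picks a position in a bit array
For practical purposes, can cheat w/ just 1 underlying hash (e.g. salt it N different ways, or slice one wide digest into pieces)
Insert: set those N bits! Query: any of them 0 means "no", all of them 1 means "maybe"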
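A minimal sketch of the hash-and-insert idea above, in Python. Everything here is an assumption made for illustration, not something from the talk: the class name `BloomFilter`, the method names, the choice of SHA-256 as the single underlying hash, and the double-hashing trick used to stretch one digest into N positions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: answers "no" or "maybe", never a definite "yes"."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Bit array packed into bytes, all zeros to start with.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # The "cheat": one underlying hash (SHA-256, an assumption), split
        # into two 64-bit halves and combined as h1 + i*h2 to get N positions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Insert: set the bit at each of the N positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Any bit still 0 -> definitely "no"; all bits 1 -> "maybe".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical usage with made-up URLs.
bad_urls = BloomFilter(num_bits=1000, num_hashes=7)
bad_urls.add("http://evil.example/malware")
print(bad_urls.might_contain("http://evil.example/malware"))  # True ("maybe")
```

A "maybe" would still need a second check against the real list or service; a "no" needs nothing further, and that's the common case.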
BLOOM FILTERS (cont'd)
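Tradeoff between false positive rate and size of underlying bit array - 1% error rate requires only around 9.6 bits per element (no matter how big the elements are), and each further 10x reduction (e.g. down to 0.1%) only needs about 4.8 more
add and test are both constant-time (N hash computations, regardless of how many elements are stored) - unlike any other constant-space set data structure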
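The bits-per-element numbers above fall out of the standard sizing formula for an optimally tuned Bloom filter, bits per element = -ln(p) / (ln 2)^2, where p is the target false positive rate. A quick sketch to check the arithmetic (the function names are made up for this example):

```python
import math


def bits_per_element(false_positive_rate: float) -> float:
    """Bits of bit array needed per stored element, assuming the optimal
    number of hash functions is used."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def optimal_num_hashes(false_positive_rate: float) -> float:
    """Optimal number of hash functions (before rounding): -log2(p)."""
    return -math.log2(false_positive_rate)


one_percent = bits_per_element(0.01)
tenth_percent = bits_per_element(0.001)
print(round(one_percent, 2))                  # 9.59
print(round(tenth_percent - one_percent, 2))  # 4.79
print(round(optimal_num_hashes(0.01), 2))     # 6.64
```

So a 1% filter wants about 7 hash functions, and the cost per element doesn't depend on how long the URLs are.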