Counting things is difficult,
especially when you have a lot of things to count.
The number of items in a set is its cardinality.
A list can have duplicate entries, a set cannot.
MongoDB IDs are 12-byte ObjectIds, usually passed around as 24-character hex strings (24 bytes):
56def5418e80271df973a9a7
We get about 3,000 requests per minute, each carrying IDs for accounts, events, campaigns, segments, users, audiences, creatives, landing pages, and so on.
~43 Million ID events per day
Just the IDs come to 24 bytes * 43 Million ≈ 1 GB per day.
After 1 year we'd need ~365 GB of RAM, roughly $10k/month in counting servers.
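For scale, here's what exact counting looks like: keep every distinct ID in a set and report its size. The 1 GB/day above is just the raw keys; a real in-memory set adds per-entry overhead on top. (A hedged sketch; the IDs are illustrative.)

```python
# Exact distinct counting: memory grows linearly with cardinality.
seen: set[str] = set()

def observe(object_id: str) -> None:
    seen.add(object_id)    # duplicates are absorbed by the set

observe("56def5418e80271df973a9a7")
observe("56def5418e80271df973a9a7")   # duplicate, counted once
print(len(seen))                      # -> 1
```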
Make a list of 0 bits, as long as the maximum cardinality of your set (let's say 43 million for us).
Create a function that maps each ID to a unique number between 0 and 43 million.
When you encounter an ID, set the n-th bit to 1 (see the sketch below).
43 Million * 1 bit = 5.375 MB (instead of 1 GB per day)
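A minimal sketch of the idea, assuming we hash IDs into a fixed range (the hash choice below is illustrative; for exact counts the mapping would have to be collision-free):

```python
import hashlib

M = 43_000_000                   # one bit per possible distinct ID
bitmap = bytearray(M // 8 + 1)   # ~5.375 MB of zero bits

def index_of(object_id: str) -> int:
    # Map an ID to a number in [0, M). Any stable hash works for a
    # sketch; exactness requires a collision-free mapping.
    digest = hashlib.md5(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % M

def observe(object_id: str) -> None:
    i = index_of(object_id)
    bitmap[i // 8] |= 1 << (i % 8)   # set the i-th bit to 1

def cardinality() -> int:
    # Count the set bits.
    return sum(bin(b).count("1") for b in bitmap)

observe("56def5418e80271df973a9a7")
observe("56def5418e80271df973a9a7")   # duplicate: same bit, still 1
print(cardinality())                  # -> 1
```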
Problems: you have to know the maximum cardinality up front, the bitmap grows linearly with it, and the ID-to-bit mapping must be collision-free to stay exact.
"A Linear-Time Probabilistic Counting Algorithm for Database Applications "
Whang, et. al. 1990
Start with a much smaller bitmap (smaller than the expected cardinality); collisions are now expected, and are corrected for statistically.
m = size of the mask (bits in the bitmap)
w = weight of the mask (number of bits set to 1)
n = estimate of the cardinality
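From those definitions, the estimator in Whang et al. is n = -m * ln((m - w) / m), i.e. it back-solves the expected fraction of still-empty bits. A hedged sketch (the hash and the value of m are illustrative):

```python
import hashlib
import math

m = 1_000_000                    # far smaller than the expected cardinality
bitmap = bytearray(m // 8 + 1)

def observe(object_id: str) -> None:
    digest = hashlib.md5(object_id.encode()).digest()
    i = int.from_bytes(digest[:8], "big") % m
    bitmap[i // 8] |= 1 << (i % 8)

def estimate() -> float:
    w = sum(bin(b).count("1") for b in bitmap)   # weight: bits set to 1
    if w == m:
        raise ValueError("bitmap saturated; m was too small")
    return -m * math.log((m - w) / m)            # linear counting estimate

for i in range(500_000):
    observe(f"id-{i}")
print(round(estimate()))   # close to 500000, off by a fraction of a percent
```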
This is the first counter we've seen that is not perfectly accurate: it trades exactness for space.
Counting the distinct words used in all of Shakespeare's works
This one is cray (the paper is ~50 pages long)
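Very roughly, the trick this family of counters builds on (the intuition only, not the paper's exact algorithm): hash each item and remember the longest run of leading zero bits you've seen; a run of k zeros shows up about once per 2^k distinct hashes, so 2^max is an order-of-magnitude estimate of the cardinality. A hedged sketch:

```python
import hashlib

max_zeros = 0

def observe(item: str) -> None:
    global max_zeros
    v = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
    max_zeros = max(max_zeros, 64 - v.bit_length())   # leading zero bits

for i in range(100_000):
    observe(f"word-{i}")
print(2 ** max_zeros)   # right order of magnitude, but very high variance
```

Averaging many such estimates over hash-partitioned buckets is what tames the variance, which is where LogLog/HyperLogLog go next.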
Bitmaps and Linear Counters still carry information about individual members of the set
That can be improved!
(Data -> hashed -> number -> the bit at that index is set to 1)
*Not actually true, since we're using HLLs with a fixed 0.8% error, which are ~16 KB and not 512 bytes
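Sanity-checking that footnote with the standard HLL error formula (relative error ≈ 1.04 / sqrt(m) for m registers):

1.04 / sqrt(2^14) = 1.04 / 128 ≈ 0.81% error
2^14 registers * 1 byte = 16,384 bytes ≈ 16 KB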