2019/2020
Université de Paris, IRIF
Inspired by Advanced Algorithms and Graph Mining by Andrea Marino (University of Florence)
296 volunteers (the starting population) were asked to dispatch a message to a specific individual (the target person, a stockbroker living in a suburb of Boston and working in Boston)
The message could not be sent directly to the target person (unless the sender knew him personally); it could only be mailed to a personal acquaintance deemed more likely than the sender to know the target person
Starting population
100 living in Boston
100 Nebraska stockholders (i.e., people living far from the target but connected to his profession)
96 were Nebraska inhabitants chosen at random
By Ageev Andrew - Own Work based on Image:Map of USA showing state names.png
CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22945395
Milgram was measuring the average length of a (greedy) routing path on a social network
Upper bound on the average distance
People involved in the experiment were not necessarily sending the letter to an acquaintance on a shortest path to the destination
First world-scale social network experiment
Entire Facebook network of active users (\(\approx 721\) million users, \(\approx 69\) billion friendship links)
Given an unweighted graph \(G = (V,E)\) (strongly) connected
Distance \(d(u,v)\): number of edges in shortest path from \(u\) to \(v\)
Neighbourhood function of node \(u\)
\[N_h(u) = \{v\in V : d(u,v)=h\}\]
We denote \(N_1(u)\) also as \(N(u)\)
Distance distribution
\[N_h = \frac{|\{(u, v)\in V \times V : d(u,v)=h\}|}{n(n-1)}=\sum_{u\in V}\frac{|N_h(u)|}{n(n-1)}\]
Average distance
\[\sum_{u,v\in V}\frac{d(u,v)}{n(n-1)}=\sum_{h>0}(h\cdot N_h)\]
Approximating \(N_h\) implies approximating average distance
Example
Given a random \(U\subseteq V\) approximate\[N_h = \frac{|\{(u, v)\in V \times V : d(u,v)=h\}|}{n(n-1)}\]by\[N_h(U)=\frac{|\{(u,v) \in U\times V: d(u,v)=h\}|}{|U| \, (n-1)}\]
Algorithm
Select a random sample of \(\kappa\) vertices obtaining a multiset \(U = \{u_1, u_2, \ldots, u_{\kappa}\} \subseteq V\)
For \(i=1,2, \ldots, \kappa\), compute distances \(d(u_i,v)\) for all \(v \in V\)
BFS starting from \(u_i\)
Return the approximation \(N_h(U)\)
Running time: \(O(\kappa\cdot m)\) where \(m\) is the number of edges
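The sampling algorithm above can be sketched in Python (a minimal illustration, not the course's reference code; the graph is assumed to be given as adjacency lists, and all names are hypothetical):

```python
from collections import deque, Counter
import random

def bfs_distances(adj, s):
    """Hop distances from s, for an unweighted graph given as adjacency lists."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def estimate_distance_distribution(adj, kappa, rng=random):
    """N_h(U) for a random multiset U of kappa vertices (one BFS per sample)."""
    n = len(adj)
    nodes = list(adj)
    counts = Counter()
    for _ in range(kappa):
        u = rng.choice(nodes)          # sampled with replacement: U is a multiset
        for d in bfs_distances(adj, u).values():
            if d > 0:
                counts[d] += 1
    # N_h(U) = |{(u, v) in U x V : d(u, v) = h}| / (kappa * (n - 1))
    return {h: c / (kappa * (n - 1)) for h, c in counts.items()}
```

Since the graph is connected, the estimated values sum to \(1\), exactly as the true distance distribution does.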
Unbiased
Since \(N_h(\{u_i\}) = \frac{|\{(u_i,v) : v \in V \wedge d(u_i,v)=h\}|}{n-1}\), we have that\[N_h(U)=\frac{|\{(u_i,v) : u_i \in U\wedge v\in V \wedge d(u_i,v)=h\}|}{\kappa(n-1)} = \frac{\sum_{i=1}^{\kappa} N_h(\{u_i\})}{\kappa}\]
If vertex \(u_i\) is randomly chosen in \(V\), then \(E[N_h(\{ u_i\})] = N_h\)
Indeed, \[E[N_h(\{ u_i\})] = \frac{1}{n}\sum_{v\in V}N_h(\{v\}) = N_h(V) = N_h\]
Hence, if all elements of \(U\) are randomly chosen, we have that, by the linearity of the expectation,
\[E[N_h(U)]=E\left[\frac{\sum_{i=1}^{\kappa} N_h(\{u_i\})}{\kappa}\right]=\frac{\sum_{i=1}^{\kappa} E[N_h(\{u_i\})]}{\kappa}=N_h\]
Concentration
Hoeffding bound
If \(X_1, X_2, \ldots, X_k\) are independent random variables such that \(\mu=E[\sum X_i/k]\) and, for each \(i\), \(0 \leq X_i \leq 1\), then, for any \(t\geq0\)\[Pr\left\{ \left|\frac{\sum_{i=1}^k X_i}{k}-\mu \right|\geq t \right\}\leq 2e^{-2kt^2}\]
Markov's inequality
If \(X\) is a nonnegative random variable and \(v > 0\), then\[Pr\left\{X\geq v \right\}\leq \frac{E[X]}{v}\]
Proof
We want to prove that\[vPr\left\{X\geq v \right\}\leq E[X]\]
From the definition of expected value (for a discrete variable)\[E[X]=\sum_{u<v}uPr\left\{ X=u\right\}+\sum_{u\geq v}uPr\left\{X=u\right\}\]
Since \(X\) is nonnegative\[E[X]\geq\sum_{u\geq v}uPr\left\{X=u\right\}\geq v\sum_{u\geq v}Pr\left\{X=u\right\}=vPr\left\{X\geq v\right\}\]
Chernoff's bounding method
Let \(X\) be any real-valued random variable. Then, for all \(t>0\)\[Pr\{X\geq t\}\leq \inf_{s>0}e^{-st}E[e^{sX}]\]
Proof (using Markov's inequality) \[Pr\{X\geq t\}=Pr\{sX\geq st\}=Pr\{e^{sX}\geq e^{st}\}\leq e^{-st}E[e^{sX}]\]
Hoeffding's lemma
If \(X\) is a real random variable such that \(E[X]=0\) and \(a \leq X \leq b\), then, for any \(s>0\)\[E[e^{sX}]\leq e^{s^2(b-a)^2/8}\]
Proof
Since \(f(x)=e^{sx}\) is convex\[e^{sx}\leq\frac{x-a}{b-a}e^{sb}+\frac{b-x}{b-a}e^{sa}\]
Since \(E[X]=0\)\[E[e^{sX}]\leq \frac{b}{b-a}e^{sa}-\frac{a}{b-a}e^{sb}\]
Let \(p=-\frac{a}{b-a}>0\). Then\[E[e^{sX}]\leq (1-p)e^{sa}+pe^{sb}=e^{sa}(1-p+pe^{s(b-a)})=(1-p+pe^{s(b-a)})e^{-sp(b-a)}\]
Let \(\varphi(u)=-pu+\log(1-p+pe^{u})\) with \(u=s(b-a)\). Then\[E[e^{sX}]\leq e^{\varphi(u)}\]
By calculus (Taylor's theorem)\[\varphi(u)\leq \frac{u^2}{8}=\frac{1}{8}s^2(b-a)^2\]
Hoeffding bound
If \(X_1, X_2, \ldots, X_k\) are independent random variables such that, for each \(i\), \(0 \leq X_i \leq 1\), and \(S_k=\sum X_i\), then, for any \(t>0\)\[Pr\left\{S_k-E[S_k]\geq t \right\}\leq e^{-2t^2/k}\]
Proof
By applying Chernoff's bounding method\[Pr\left\{S_k-E[S_k]\geq t \right\}\leq \min_{s>0}e^{-st}E\left[e^{s(S_k-E[S_k])}\right]\]
Since \(X_i\) are independent\[e^{-st}E\left[e^{s(S_k-E[S_k])}\right]=e^{-st}\prod E\left[e^{s(X_i-E[X_i])}\right]\]
By applying Hoeffding's lemma\[Pr\left\{S_k-E[S_k]\geq t \right\}\leq \min_{s>0}e^{-st+ks^2/8}\]
Minimum at \(s=4t/k\) and bound follows
Concentration
Hoeffding bound
If \(X_1, X_2, \ldots, X_k\) are independent random variables such that \(\mu=E[\sum X_i/k]\) and, for each \(i\), \(0 \leq X_i \leq 1\), then, for any \(t\geq0\)\[Pr\left\{ \left|\frac{\sum_{i=1}^k X_i}{k}-\mu \right|\geq t \right\}\leq 2e^{-2kt^2}\]
In our case
\(k=\kappa\), \(X_i=N_h(\{u_i\})\), and \(\mu=N_h\)
By definition, \(0\leq N_h(\{u_i\})\leq 1\)
Hence\[Pr\left\{ \left|\frac{\sum_{i=1}^\kappa N_h(\{u_i\})}{\kappa}-\mu \right|\geq t \right\}\leq 2e^{-2\kappa t^2}\]
Choosing \(\kappa=\frac{\alpha}{2t^2}\ln n\) with \(\alpha>0\), probability bounded by \(\frac{2}{n^\alpha}\)
Such a bound is said to hold with high probability
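For concreteness, the sample size prescribed by the bound can be computed by a small helper (an illustration, not part of the original analysis; the names are hypothetical):

```python
import math

def sample_size(n, t, alpha):
    """Smallest integer kappa >= alpha * ln(n) / (2 * t**2), so that the
    failure probability 2 * exp(-2 * kappa * t**2) is at most 2 / n**alpha."""
    return math.ceil(alpha * math.log(n) / (2 * t * t))

# e.g. n = 10^6 vertices, additive error t = 0.1, alpha = 1
kappa = sample_size(10**6, 0.1, 1)
```

Note how mild the dependence on \(n\) is: a million-node graph needs only a few hundred BFS runs for a constant additive error.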
The previous analysis can be easily extended to the case of weighted (strongly) connected graphs
Use of Dijkstra's algorithm
Running time \(O(\kappa(m + n\log n))=O(t^{-2} (m\log n + n\log^2 n))\)
In a similar way, we can compute
An approximation of the average distance
An approximation of the \(\alpha\)-diameter, which is defined as the minimum \(h\) for which \(\sum_{i=1}^h N_i \geq \alpha\)
It suffices to repeat the analysis with respect to \(\sum_{i=1}^h N_i\)
Given an unweighted graph \(G = (V,E)\) (strongly) connected
Ball function of node \(u\)\[B_h(u) = \{v\in V : d(u,v)\leq h\}\]
Note that \(B_1(u)\) contains \(u\), unlike \(N_1(u)\)
Cumulative distance distribution
\[B_h = \frac{|\{(u, v)\in V \times V : d(u,v)\leq h\}|}{n^2}=\sum_{u\in V}\frac{|B_h(u)|}{n^2}\]
Note that\[N_h=\frac{n^2}{n(n-1)}(B_h-B_{h-1})\]
Recursive definition of ball function\[B_h(u) = \left\{\begin{array}{ll}\{u\}&\mathrm{if\ }h=0\\B_{h-1}(u)\cup\bigcup_{v\in N(u)}B_{h-1}(v)&\mathrm{otherwise}\end{array}\right.\]
This gives a dynamic programming algorithm for computing \(B_h\)
For any \(u\), \(B_0(u)=\{u\}\)
For \(h \in 1\ldots D\)
For any \(u\), \(B_h(u)=B_{h-1}(u)\cup\bigcup_{v\in N(u)} B_{h-1}(v)\)
Output \(\sum_{u\in V}|B_h(u)|/n^2=B_h\)
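The dynamic program can be sketched with explicit Python sets (a minimal illustration; keeping every ball explicitly is exactly the memory blow-up that motivates sketches). The second function recovers \(N_h\) via the relation \(N_h=\frac{n^2}{n(n-1)}(B_h-B_{h-1})\):

```python
def ball_distribution(adj, D):
    """Exact B_h for h = 0..D via the recursive ball definition,
    keeping every ball as an explicit set (Theta(n^2) memory in the worst case)."""
    n = len(adj)
    balls = {u: {u} for u in adj}                      # B_0(u) = {u}
    B = [sum(len(b) for b in balls.values()) / n**2]   # B_0
    for _ in range(D):
        # B_h(u) = B_{h-1}(u) united with B_{h-1}(v) for every neighbour v
        balls = {u: balls[u].union(*(balls[v] for v in adj[u])) for u in adj}
        B.append(sum(len(b) for b in balls.values()) / n**2)
    return B

def distance_distribution(adj, D):
    """N_h recovered from the cumulative distribution."""
    n = len(adj)
    B = ball_distribution(adj, D)
    return [n * n / (n * (n - 1)) * (B[h] - B[h - 1]) for h in range(1, D + 1)]
```

On a path on four vertices, for instance, \(B_0,\ldots,B_3 = 4/16, 10/16, 14/16, 16/16\) and \(N_1,N_2,N_3 = 6/12, 4/12, 2/12\).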
Complexity: each of the \(D\) rounds performs a set union along every edge, and each union may involve up to \(n\) elements, giving \(O(Dmn)\) time and \(O(n^2)\) space
A sketch \(S(A)\) is a compressed form of representation for a given set \(A\subseteq U\) providing the following operations
\(\mathtt{init}(S(A))\) How a sketch \(S(A)\) for \(A\) is initialized
\(\mathtt{update}(S(A),u)\) How a sketch \(S(A)\) for \(A\) modifies when an element \(u\) is added to \(A\)
\(\mathtt{merge}(S(A),S(B))\) Given two sketches for \(A\) and \(B\), provide a sketch for \(A\cup B\)
\(\mathtt{size}(S(A))\) Estimate the number of distinct elements of \(A\)
For any \(u\), \(\mathtt{init}(S(B_0(u)))\)
For any \(u\), \(\mathtt{update}(S(B_0(u)),u)\)
For \(h \in 1\ldots D\)
For any \(u\), obtain \(S(B_h(u))\) by merging \(S(B_{h-1}(u))\) with \(S(B_{h-1}(v))\) for every \(v\in N(u)\)
Output \(\sum_{u\in V}\mathtt{size}(S(B_h(u)))/n^2\)
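The four operations are all the dynamic program needs, as the following sketch shows (an illustration with hypothetical names; here the "sketch" is an exact set, so the output matches the exact algorithm, while a real run would plug in a k-min, bottom-k, or loglog sketch):

```python
class SetSketch:
    """Exact 'sketch' backed by a plain set, just to illustrate the interface."""
    def __init__(self):                       # init
        self.s = set()
    def update(self, u):                      # update
        self.s.add(u)
    def merge(self, other):                   # merge
        out = SetSketch()
        out.s = self.s | other.s
        return out
    def size(self):                           # size
        return len(self.s)

def sketch_cumulative_distribution(adj, D, Sketch=SetSketch):
    """The dynamic program on balls, rewritten over the four sketch operations."""
    n = len(adj)
    sk = {}
    for u in adj:
        sk[u] = Sketch()                      # init(S(B_0(u)))
        sk[u].update(u)                       # update(S(B_0(u)), u)
    B = [sum(s.size() for s in sk.values()) / n**2]
    for _ in range(D):
        new = {}
        for u in adj:
            s = sk[u]
            for v in adj[u]:                  # merge over the neighbourhood
                s = s.merge(sk[v])
            new[u] = s
        sk = new
        B.append(sum(s.size() for s in sk.values()) / n**2)
    return B
```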
Complexity (if the sketches have logarithmic size): \(O(Dm\log n)\) time and \(O(n\log n)\) space
\(k\)-min sketch
Includes the item of smallest rank in each of \(k\) independent permutations
Let \(r_1,r_2,\ldots,r_k\) be ranking functions \[r_i:U\to \{1/n,2/n,3/n,\ldots, 1\}\] where \(n=|U|\)
Hence, a sketch \(S(A)\) is a sequence of exactly \(k\) entries \(a_1,\ldots,a_k\), where each entry can be an element of \(A\) or \(\bot\)
Setting \(r_i(\bot)=\infty\)
To deal with the initialisation of the sketch
Motivation
Let \(X\) be the minimum rank of the elements in \(A\)
If the ranks are uniformly distributed then the probability that \(X\) is less than \(t\) is\[Pr[X\leq t]=1-(1-t)^{|A|}\]
Hence, the density of \(X\) is\[|A|(1-t)^{|A|-1}\]
Average minimum\[E[X]=\int_0^1 x\cdot |A|(1-x)^{|A|-1}\ dx= \left[-\frac{(1-x)^{|A|}(1+|A|\cdot x)}{(1+|A|)}\right]^1_0\]\[=\frac{1}{1+|A|}\]
Then\[|A|=\frac{1}{E[X]}-1\]
Operations of \(k\)-min sketch
\(\mathtt{init}(S(A))\) \(a_i=\bot\) for \(i=1,\ldots,k\)
\(\mathtt{update}(S(A),u)\) for every \(i\), with \(1\leq i\leq k\), if \(r_i(a_i)>r_i(u)\), \(a_i\) is replaced with \(u\)
\(\mathtt{merge}(S(A),S(B))\) return \(\{c_1,\ldots,c_k\}\) such that \(c_i=\mathrm{arg}\,\mathrm{min}_{x\in\{a_i,b_i\}}r_i(x)\)
\(\mathtt{size}(S(A))\) return \(\frac{k}{\sum_{i=1}^{k} r_i(a_i)}-1\)
Choosing \(k = O(\epsilon^{-2} \log n)\) the relative error is bounded by \(\epsilon\) w.h.p.
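A possible Python sketch of the four operations (an illustration under two simplifying assumptions: hash-based rankings stand in for truly random permutations, and the sketch stores the minimum ranks rather than the elements \(a_i\) themselves, which is enough for merge and size):

```python
import hashlib

def kmin_rank(i, u):
    """Hash-based stand-in for the i-th random ranking r_i, values in (0, 1]."""
    h = hashlib.sha256(f"{i}:{u}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / 2**64

class KMinSketch:
    def __init__(self, k):                    # init: a_i = bottom for every i
        self.k = k
        self.ranks = [float("inf")] * k       # r_i(bottom) = infinity
    def update(self, u):                      # keep the smallest rank per ranking
        for i in range(self.k):
            self.ranks[i] = min(self.ranks[i], kmin_rank(i, u))
    def merge(self, other):                   # componentwise minimum
        out = KMinSketch(self.k)
        out.ranks = [min(a, b) for a, b in zip(self.ranks, other.ranks)]
        return out
    def size(self):                           # |A| is about k / sum(min ranks) - 1
        return self.k / sum(self.ranks) - 1
```

Merging two sketches gives exactly the sketch of the union, which is what the dynamic programming algorithm on balls relies on.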
Bottom \(k\) sketch
Given
A ranking (i.e., a bijective function) \(r:U\to \{1/n,2/n,\ldots, 1\}\)
A subset \(A\) of \(U\)
We denote as
\(H_k(A)\) the first \(k\) elements of \(A\) according to \(r\)
\(k_{th}(A)\) the rank of the \(k\)-th element of \(A\) according to \(r\)
A bottom-\(k\) sketch includes the \(k\) items with smallest rank in a single permutation, that is, \(H_k(A)\)
In this case a sketch is simply a subset of the set
Motivation
The idea behind the estimator is the following proportion, which holds on average\[|A|:1=(k-1):k_{th}(A)\]
Averaging over all possible rankings\[|A|=E\left[\frac{k-1}{k_{th}(A)}\right]\]
Operations of bottom-\(k\) sketch
\(\mathtt{init}(S(A))\) empty set
\(\mathtt{update}(S(A),u)\) if \(|S(A)|<k\), add \(u\) to \(S(A)\), otherwise, if \(r(\max(S(A)))>r(u)\), replace \(\max(S(A))\) with \(u\)
\(\mathtt{merge}(S(A),S(B))\) return the first \(k\) elements of the set \(S(A)\cup S(B)\), that is \(H_k(S(A)\cup S(B))\)
\(\mathtt{size}(S(A))\) return \((k-1)/k_{th}(S(A))\)
The same guarantee of \(k\)-min applies
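A possible Python sketch of the bottom-\(k\) operations (again an illustration: a hash-based ranking stands in for the random permutation, only ranks are stored, and when fewer than \(k\) elements have been seen the sketch simply returns the exact count, a convention not spelled out in the slides):

```python
import hashlib

def bk_rank(u):
    """Hash-based stand-in for the single random ranking r, values in (0, 1]."""
    h = hashlib.sha256(str(u).encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / 2**64

class BottomKSketch:
    def __init__(self, k):                    # init: empty sketch
        self.k = k
        self.ranks = set()                    # ranks of the (at most k) kept items
    def update(self, u):
        self.ranks.add(bk_rank(u))            # duplicates collapse: same rank
        if len(self.ranks) > self.k:
            self.ranks.discard(max(self.ranks))
    def merge(self, other):                   # H_k(S(A) union S(B))
        out = BottomKSketch(self.k)
        out.ranks = set(sorted(self.ranks | other.ranks)[: self.k])
        return out
    def size(self):                           # (k - 1) / (k-th smallest rank)
        if len(self.ranks) < self.k:
            return len(self.ranks)            # fewer than k items seen: exact count
        return (self.k - 1) / max(self.ranks)
```

Compared with \(k\)-min, a single ranking function suffices, at the price of keeping \(k\) ranks in sorted order.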
Loglog counter
A sketch \(S(A)\) is a sequence of exactly \(k\) binary strings \(a_1,\ldots,a_k\), each of length \(m\)
Given \(k\) partition functions \(p_i:U\to \{1,2,\ldots, m\}\), for each \(j\), with \(1\leq j\leq m\) the bit \(a_{i,j}=1\) if there exists an element in \(A\) that is mapped to \(j\) according to \(p_i\), \(0\) otherwise
Each partition function is such that, at random, \(1/2\) of the elements are mapped to \(1\), \(1/4\) of the elements are mapped to \(2\), \(\ldots\), \(1/2^j\) of the elements are mapped to \(j\)
Operations of the loglog counter
\(\mathtt{init}(S(A))\) \(a_{i,j}=0\) for any \(i,j\)
\(\mathtt{update}(S(A),u)\) for any \(i\), if \(p_i(u)=j\), set \(a_{i,j}\) to \(1\)
\(\mathtt{merge}(S(A),S(B))\) return \(\{c_1,\ldots,c_k\}\) where \(c_i\) is the bitwise OR of \(a_i\) and \(b_i\)
\(\mathtt{size}(S(A))\) let \(b\) be the average position of the lowest zero bit in \(a_1,\ldots, a_k\); return \(2^b/0.77351\)
Used to compute the average distance in the Facebook graph
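A possible Python sketch of the loglog counter (an illustration under stated assumptions: hash-based partition functions with geometric positions starting from \(0\) and capped at \(m-1\); \(0.77351\) is the usual Flajolet-Martin correction constant):

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def ll_position(i, u):
    """p_i(u): geometric position with Pr[p_i(u) = j] = 2^-(j+1), from hash bits."""
    h = int.from_bytes(hashlib.sha256(f"{i}:{u}".encode()).digest(), "big")
    j = 0
    while j < 63 and not (h >> j) & 1:
        j += 1
    return j                                  # index of the lowest set bit, capped

class LogLogCounter:
    def __init__(self, k, m=64):              # init: a_{i,j} = 0 for all i, j
        self.k, self.m = k, m
        self.bits = [[0] * m for _ in range(k)]
    def update(self, u):
        for i in range(self.k):
            self.bits[i][ll_position(i, u)] = 1
    def merge(self, other):                   # bitwise OR, row by row
        out = LogLogCounter(self.k, self.m)
        out.bits = [[x | y for x, y in zip(a, b)]
                    for a, b in zip(self.bits, other.bits)]
        return out
    def size(self):
        # b = average position of the lowest zero bit over the k bitmaps
        b = sum(a.index(0) if 0 in a else self.m for a in self.bits) / self.k
        return 2**b / PHI
```

Each bitmap costs only \(m\) bits, which is why counters of this family scale to graphs the size of Facebook's.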