GM2

Graph mining

2019/2020

Pierluigi Crescenzi

Université de Paris, IRIF

Inspired by Advanced Algorithms and Graph Mining by Andrea Marino (University of Florence)

Distance distribution computation
Sampling and sketch techniques

296 volunteers (the starting population) were asked to dispatch a message to a specific individual (the target person, a stockholder living and working in a suburb of Boston)
The message could not be sent directly to the target person (unless the sender knew him personally); it could only be mailed to a personal acquaintance who was more likely than the sender to know the target person
Starting population
- 100 people living in Boston
- 100 Nebraska stockholders (i.e., people living far from the target but sharing his profession)
- 96 Nebraska inhabitants chosen at random

[Figure: map of the USA. By Ageev Andrew, own work based on Image:Map of USA showing state names.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22945395]

Milgram's experiment

pierluigi.crescenzi@irif.fr

GM

#02

Results
- Only 64 chains (22%) were completed (i.e., they reached the target)
- The average number of intermediaries in these chains was 5.2
- There was a marked difference between the Boston group (4.4) and the rest of the starting population, whereas the difference between the two other subpopulations was not statistically significant
- The random group from Nebraska (far away) needed 5.7 intermediaries on average (i.e., rounding up, six degrees of separation)
Conclusion
- The average path length is small, much smaller than expected

Milgram was measuring the average length of a (greedy) routing path on a social network
- Upper bound on the average distance
- People involved in the experiment were not necessarily sending the letter to an acquaintance on a shortest path to the destination
First world-scale social network experiment
- Entire Facebook network of active users (\(\approx 721\) million users, \(\approx 69\) billion friendship links)

Given an unweighted, (strongly) connected graph \(G = (V,E)\) with \(n=|V|\) nodes
- Distance \(d(u,v)\): number of edges in a shortest path from \(u\) to \(v\)
- Neighbourhood function of node \(u\)
  \[N_h(u) = \{v\in V : d(u,v)=h\}\]
  - We denote \(N_1(u)\) also as \(N(u)\)
- Distance distribution
  \[N_h = \frac{|\{(u, v)\in V \times V : d(u,v)=h\}|}{n(n-1)}=\sum_{u\in V}\frac{|N_h(u)|}{n(n-1)}\]
- Average distance
  \[\sum_{u,v\in V}\frac{d(u,v)}{n(n-1)}=\sum_{h>0}(h\cdot N_h)\]
  - Approximating \(N_h\) implies approximating the average distance
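
As a point of reference for the approximation techniques that follow, here is a minimal sketch of the exact computation: one BFS per node, then the counts are normalised by \(n(n-1)\). The adjacency-list dictionary and the function names are assumptions made for illustration, and the graph is assumed to be (strongly) connected.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Return {v: d(source, v)} for every node v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_distribution(adj):
    """Exact N_h: fraction of ordered pairs (u, v), u != v, with d(u, v) = h."""
    n = len(adj)
    pairs_at = Counter()
    for u in adj:                      # one BFS per node
        for d in bfs_distances(adj, u).values():
            if d > 0:
                pairs_at[d] += 1
    return {h: pairs_at[h] / (n * (n - 1)) for h in sorted(pairs_at)}

def average_distance(distribution):
    """Average distance as the sum over h of h * N_h."""
    return sum(h * p for h, p in distribution.items())

# Hypothetical toy input: the path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
N = distance_distribution(path)
print(N, average_distance(N))
```

On this path, 6 of the 12 ordered pairs are at distance 1, 4 at distance 2 and 2 at distance 3, so the average distance is 5/3. The cost is \(n\) BFS visits, that is \(O(nm)\) time, which is what sampling tries to avoid.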

Definitions

Example

\(N_1=\frac{|\{(u, v)\in V \times V : d(u,v)=1\}|}{n(n-1)}=\frac{22}{72}\)
\(N_2=\frac{|\{(u, v)\in V \times V : d(u,v)=2\}|}{n(n-1)}=\frac{26}{72}\)
\(N_3=\frac{|\{(u, v)\in V \times V : d(u,v)=3\}|}{n(n-1)}=\frac{20}{72}\)
\(N_4=\frac{|\{(u, v)\in V \times V : d(u,v)=4\}|}{n(n-1)}=\frac{4}{72}\)
Average distance\[\frac{1\cdot22+2\cdot26+3\cdot20+4\cdot4}{72}=\frac{150}{72}\approx2.08\]

Given a random \(U\subseteq V\) approximate\[N_h = \frac{|\{(u, v)\in V \times V : d(u,v)=h\}|}{n(n-1)}\]by\[N_h(U)=\frac{|\{(u,v) \in U\times V: d(u,v)=h\}|}{|U| \, (n-1)}\]
Algorithm
- Select \(\kappa\) vertices uniformly at random, with repetitions allowed, obtaining a multiset \(U = \{u_1, u_2, \ldots, u_{\kappa}\} \subseteq V\)
- For \(i=1,2, \ldots, \kappa\), compute distances \(d(u_i,v)\) for all \(v \in V\)
  - BFS starting from \(u_i\)
- Return the approximation \(N_h(U)\)
Running time: \(O(\kappa\cdot m)\) where \(m\) is the number of edges
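
The algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is an adjacency-list dictionary, it is (strongly) connected, and the names `sampled_distance_distribution` and `rng` are hypothetical.

```python
import random
from collections import Counter, deque

def bfs_distances(adj, source):
    """Return {v: d(source, v)} for every node v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_distance_distribution(adj, kappa, rng=None):
    """Estimate N_h by N_h(U), with U a multiset of kappa random nodes."""
    rng = rng or random.Random()
    nodes = list(adj)
    n = len(nodes)
    counts = Counter()
    for _ in range(kappa):
        u = rng.choice(nodes)          # uniform, with replacement
        for d in bfs_distances(adj, u).values():
            if d > 0:
                counts[d] += 1
    return {h: counts[h] / (kappa * (n - 1)) for h in sorted(counts)}

# Hypothetical toy input: a cycle on 10 nodes. Every node sees the same
# distance profile, so here the estimate is exact for any sample.
cycle = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
estimate = sampled_distance_distribution(cycle, kappa=3, rng=random.Random(0))
print(estimate)
```

The cycle is chosen only because the result can be checked by hand: \(N_h = 2/9\) for \(h=1,\ldots,4\) and \(N_5 = 1/9\). On general graphs the estimate depends on the sample.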

Sampling

Unbiased
- Since \(N_h(\{u_i\}) = \frac{|\{(u_i,v) : v \in V \wedge d(u_i,v)=h\}|}{n-1}\), we have that\[N_h(U)=\frac{|\{(u_i,v) : u_i \in U\wedge v\in V \wedge d(u_i,v)=h\}|}{\kappa(n-1)} = \frac{\sum_{i=1}^{\kappa} N_h(\{u_i\})}{\kappa}\]
- If vertex \(u_i\) is chosen uniformly at random in \(V\), then \(E[N_h(\{ u_i\})] = N_h\)
  - Indeed, \[E[N_h(\{ u_i\})] = \frac{1}{n}\sum_{v\in V}N_h(\{v\}) = N_h(V) = N_h\]
- Hence, if all elements of \(U\) are randomly chosen, we have that, by the linearity of the expectation,
  \[E[N_h(U)]=E\left[\frac{\sum_{i=1}^{\kappa} N_h(\{u_i\})}{\kappa}\right]=\frac{\sum_{i=1}^{\kappa} E[N_h(\{u_i\})]}{\kappa}=N_h\]

Concentration
- Hoeffding bound
  
  If \(X_1, X_2, \ldots, X_k\) are independent random variables such that \(\mu=E[\sum X_i/k]\) and, for each \(i\), \(0 \leq X_i \leq 1\), then, for any \(t\geq0\)\[Pr\left\{ \left|\frac{\sum_{i=1}^k X_i}{k}-\mu \right|\geq t \right\}\leq 2e^{-2kt^2}\]

Markov's inequality
- If \(X\) is a nonnegative random variable and \(v > 0\), then\[Pr\left\{X\geq v \right\}\leq \frac{E[X]}{v}\]
- Proof
  - We want to prove that\[vPr\left\{X\geq v \right\}\leq E[X]\]
  - From definition of expected value\[E[X]=\sum_{u<v}uPr\left\{ X=u\right\}+\sum_{u\geq v}uPr\left\{X=u\right\}\]
  - Since \(X\) is nonnegative\[E[X]\geq\sum_{u\geq v}uPr\left\{X=u\right\}\geq v\sum_{u\geq v}Pr\left\{X=u\right\}=vPr\left\{X\geq v\right\}\]

Hoeffding bound

Chernoff's bounding method
- Let \(X\) be any real-valued random variable. Then, for all \(t>0\)\[Pr\{X\geq t\}\leq \inf_{s>0}e^{-st}E[e^{sX}]\]
- Proof (using Markov's inequality): for every \(s>0\)\[Pr\{X\geq t\}=Pr\{sX\geq st\}=Pr\{e^{sX}\geq e^{st}\}\leq e^{-st}E[e^{sX}]\]Since this holds for every \(s>0\), it also holds for the infimum

Hoeffding's lemma
- If \(X\) is a real random variable such that \(E[X]=0\) and \(a \leq X \leq b\), then, for any \(s>0\)\[E[e^{sX}]\leq e^{s^2(b-a)^2/8}\]
- Proof
  - Since \(f(x)=e^{sx}\) is convex\[e^{sx}\leq\frac{x-a}{b-a}e^{sb}+\frac{b-x}{b-a}e^{sa}\]
  - Since \(E[X]=0\)\[E[e^{sX}]\leq \frac{b}{b-a}e^{sa}-\frac{a}{b-a}e^{sb}\]
  - Let \(p=-\frac{a}{b-a}>0\). Then\[E[e^{sX}]\leq (1-p)e^{sa}+pe^{sb}=e^{sa}(1-p+pe^{s(b-a)})=(1-p+pe^{s(b-a)})e^{-sp(b-a)}\]
  - Let \(\varphi(u)=-pu+\log(1-p+pe^{u})\) with \(u=s(b-a)\). Then\[E[e^{sX}]\leq e^{\varphi(u)}\]
  - By calculus (Taylor's theorem, since \(\varphi(0)=\varphi'(0)=0\) and \(\varphi''(u)\leq 1/4\))\[\varphi(u)\leq \frac{u^2}{8}=\frac{1}{8}s^2(b-a)^2\]

Hoeffding bound
- Let \(X_1, X_2, \ldots, X_k\) be independent random variables such that, for each \(i\), \(0 \leq X_i \leq 1\). If \(S_k=\sum X_i\), then, for any \(t>0\)\[Pr\left\{S_k-E[S_k]\geq t \right\}\leq e^{-2t^2/k}\]
- Proof
  - By applying Chernoff's bounding method\[Pr\left\{S_k-E[S_k]\geq t \right\}\leq \min_{s>0}e^{-st}E\left[e^{s(S_k-E[S_k])}\right]\]
  - Since \(X_i\) are independent\[e^{-st}E\left[e^{s(S_k-E[S_k])}\right]=e^{-st}\prod E\left[e^{s(X_i-E[X_i])}\right]\]
  - By applying Hoeffding's lemma\[Pr\left\{S_k-E[S_k]\geq t \right\}\leq \min_{s>0}e^{-st+ks^2/8}\]
  - Minimum at \(s=4t/k\) and bound follows
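
The last step of the proof is a routine calculation, spelled out here as a worked check:

\[\frac{d}{ds}\left(-st+\frac{ks^2}{8}\right)=-t+\frac{ks}{4}=0\iff s=\frac{4t}{k},\qquad -\frac{4t}{k}\,t+\frac{k}{8}\left(\frac{4t}{k}\right)^2=-\frac{2t^2}{k}\]

Replacing \(t\) by \(kt\) gives \(Pr\{S_k/k-E[S_k]/k\geq t\}\leq e^{-2kt^2}\). Applying the same argument to the variables \(1-X_i\) bounds the lower tail, and the union of the two events yields the factor 2 in the two-sided bound \(2e^{-2kt^2}\) on the mean.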

Concentration
- Hoeffding bound
  
  If \(X_1, X_2, \ldots, X_k\) are independent random variables such that \(\mu=E[\sum X_i/k]\) and, for each \(i\), \(0 \leq X_i \leq 1\), then, for any \(t\geq0\)\[Pr\left\{ \left|\frac{\sum_{i=1}^k X_i}{k}-\mu \right|\geq t \right\}\leq 2e^{-2kt^2}\]
- In our case
  - \(k=\kappa\), \(X_i=N_h(\{u_i\})\), and \(\mu=N_h\)
    - By definition, \(0\leq N_h(\{u_i\})\leq 1\)
  - Hence\[Pr\left\{ \left|\frac{\sum_{i=1}^\kappa N_h(\{u_i\})}{\kappa}-N_h \right|\geq t \right\}\leq 2e^{-2\kappa t^2}\]
  - Choosing \(\kappa=\frac{\alpha}{2t^2}\ln n\) with \(\alpha>0\), the probability of an absolute error of at least \(t\) is bounded by \(\frac{2}{n^\alpha}\)
    - That is, \(|N_h(U)-N_h|<t\) with high probability
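
To get a feeling for the numbers, the choice of \(\kappa\) can be evaluated directly. The helper names below are hypothetical; the formulas are the ones stated above, with \(\kappa\) rounded up to an integer.

```python
import math

def sample_size(n, t, alpha=1.0):
    """Smallest integer kappa >= alpha * ln(n) / (2 * t^2)."""
    return math.ceil(alpha * math.log(n) / (2 * t * t))

def failure_bound(kappa, t):
    """Two-sided Hoeffding bound 2 * exp(-2 * kappa * t^2)."""
    return 2 * math.exp(-2 * kappa * t * t)

n, t, alpha = 10**6, 0.05, 1.0
kappa = sample_size(n, t, alpha)
print(kappa, failure_bound(kappa, t), 2 / n**alpha)
```

For \(n=10^6\), \(t=0.05\) and \(\alpha=1\) this gives \(\kappa=2764\) BFS visits instead of one million, and the failure probability stays below \(2/n\).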

The previous analysis can be easily extended to the case of weighted (strongly) connected graphs
- Use of Dijkstra's algorithm
  - Running time \(O(\kappa(m + n\log n))=O(t^{-2} (m\log n + n\log^2 n))\)
In a similar way, we can compute
- An approximation of the average distance
- An approximation of the \(\alpha\)-diameter, which is defined as the minimum \(h\) for which \(\sum_{i=1}^h N_i \geq \alpha\)
  - It suffices to repeat the analysis with respect to \(\sum_{i=1}^h N_i\)
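
As an illustration, the \(\alpha\)-diameter can be read off any distance distribution, exact or estimated, by accumulating the values \(N_i\). The function name and the small floating-point tolerance are assumptions of this sketch; the input is the distribution of the 9-node example given earlier.

```python
def alpha_diameter(distribution, alpha):
    """Smallest h such that N_1 + ... + N_h >= alpha."""
    cumulative = 0.0
    for h in sorted(distribution):
        cumulative += distribution[h]
        if cumulative >= alpha - 1e-12:   # tolerance for floating-point sums
            return h
    return max(distribution)              # alpha above the total mass

# Distance distribution of the example: N_1, ..., N_4 = 22/72, 26/72, 20/72, 4/72.
N = {1: 22 / 72, 2: 26 / 72, 3: 20 / 72, 4: 4 / 72}
print(alpha_diameter(N, 0.5), alpha_diameter(N, 0.9), alpha_diameter(N, 1.0))
```

Here \(N_1+N_2=48/72\approx 0.67\) and \(N_1+N_2+N_3=68/72\approx 0.94\), so the 0.5-diameter is 2, the 0.9-diameter is 3, and the 1-diameter coincides with the diameter, 4.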

Given an unweighted, (strongly) connected graph \(G = (V,E)\)
- Ball function of node \(u\)\[B_h(u) = \{v\in V : d(u,v)\leq h\}\]
  - Note that \(B_h(u)\) contains \(u\) itself for every \(h\geq 0\)
- Cumulative distance distribution
  \[B_h = \frac{|\{(u, v)\in V \times V : d(u,v)\leq h\}|}{n^2}=\sum_{u\in V}\frac{|B_h(u)|}{n^2}\]
  - Note that, for \(h\geq 1\),\[N_h=\frac{n^2}{n(n-1)}(B_h-B_{h-1})\]

Recursive definition of ball function\[B_h(u) = \left\{\begin{array}{ll}\{u\}&\mathrm{if\ }h=0\\B_{h-1}(u)\cup\bigcup_{v\in N(u)}B_{h-1}(v)&\mathrm{otherwise}\end{array}\right.\]
This gives a dynamic programming algorithm for computing \(B_h\)
- For any \(u\), \(B_0(u)=\{u\}\)
- For \(h = 1, \ldots, D\) (\(D\) being the diameter of \(G\))
  - For any \(u\), \(B_h(u)=B_{h-1}(u)\cup\bigcup_{v\in N(u)} B_{h-1}(v)\)
  - Output \(\sum_{u\in V}|B_h(u)|/n^2=B_h\)
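
The dynamic programming scheme above translates almost literally into code. The following is a minimal sketch that stores every ball explicitly as a Python set; the adjacency-list dictionary and the parameter `max_h` (playing the role of \(D\)) are assumptions made for illustration.

```python
def cumulative_distribution(adj, max_h):
    """Return [B_0, B_1, ..., B_max_h] via the recursive definition of balls."""
    n = len(adj)
    balls = {u: {u} for u in adj}                  # B_0(u) = {u}
    result = [sum(len(b) for b in balls.values()) / n**2]
    for _ in range(max_h):
        new_balls = {}
        for u in adj:
            ball = set(balls[u])                   # start from B_{h-1}(u)
            for v in adj[u]:
                ball |= balls[v]                   # add B_{h-1}(v), v in N(u)
            new_balls[u] = ball
        balls = new_balls
        result.append(sum(len(b) for b in balls.values()) / n**2)
    return result

# Hypothetical toy input: the path 0 - 1 - 2 - 3 (diameter 3).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
B = cumulative_distribution(path, max_h=3)
print(B)
```

On the path, the balls of radius 1 have sizes 2, 3, 3, 2, hence \(B_1 = 10/16\). The identity \(N_h=\frac{n^2}{n(n-1)}(B_h-B_{h-1})\) then gives \(N_1=\frac{16}{12}\cdot\frac{6}{16}=\frac{1}{2}\). Keeping all the sets is what makes the space quadratic.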

First three steps for \(v_1\)
- \(h=0\)
  - \(B_0(v_1)=\{v_1\}\)
  - \(B_0(v_2)=\{v_2\}\)
  - \(B_0(v_3)=\{v_3\}\)
- \(h=1\)
  - \(B_1(v_1)=\{v_1,v_2,v_3\}\)
  - \(B_1(v_2)=\{v_1,v_2,v_3,v_4,v_7\}\)
  - \(B_1(v_3)=\{v_1,v_2,v_3,v_5\}\)
- \(h=2\)
  - \(B_2(v_1)=\{v_1,v_2,v_3\}\cup\{v_1,v_2,v_3,v_4,v_7\}\cup\{v_1,v_2,v_3,v_5\}=\{v_1,v_2,v_3,v_4,v_5,v_7\}\)

Complexity
- Time: \(O(n^2\log n)\)
- Space: \(O(n^2)\)
Can we do better?
- Idea
  - For any \(u\), maintain only a limited amount of information (of constant or almost constant size) about the sets \(B_{h}(u)\)
  - Given approximations of \(A\) and \(B\), we can compute an approximation of the union \(A\cup B\)

A sketch \(S(A)\) is a compressed representation of a given set \(A\subseteq U\) that provides the following operations
- \(\mathtt{init}(S(A))\): how a sketch \(S(A)\) for \(A\) is initialized
- \(\mathtt{update}(S(A),u)\): how the sketch \(S(A)\) changes when an element \(u\) is added to \(A\)
- \(\mathtt{merge}(S(A),S(B))\): given two sketches for \(A\) and \(B\), provide a sketch for \(A\cup B\)
- \(\mathtt{size}(S(A))\): estimate the number of distinct elements of \(A\)

Sketches

For any \(u\), \(\mathtt{init}(S(B_0(u)))\)
For any \(u\), \(\mathtt{update}(S(B_0(u)),u)\)
For \(h \in 1\ldots D\)
- For any \(u\)
  - \(S(B_h(u))=S(B_{h-1}(u))\)
  - For any edge \((u,v)\), \(S(B_h(u))=\mathtt{merge}(S(B_h(u)), S(B_{h-1}(v)))\)
- Output \(\sum_{u\in V}\mathtt{size}(S(B_h(u)))/n^2\)
Complexity (if size of sketches logarithmic)
- Time: \(\tilde{O}(n)\)
- Space: \(\tilde{O}(n)\)
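
The loop above only needs the three operations \(\mathtt{update}\), \(\mathtt{merge}\) and \(\mathtt{size}\), so it can be written once and reused with any sketch. The code below is a sketch of that driver under stated assumptions: `ExactSet` is a hypothetical stand-in that stores the whole set (so it saves no space) and is there only to check the loop; an actual sketch class with the same three methods would be passed as `make_sketch`.

```python
class ExactSet:
    """Stand-in 'sketch' that keeps the full set; only for checking the loop."""
    def __init__(self):
        self.items = set()

    def update(self, u):
        self.items.add(u)

    def merge(self, other):
        merged = ExactSet()
        merged.items = self.items | other.items
        return merged                      # a new object, inputs untouched

    def size(self):
        return len(self.items)

def sketched_cumulative_distribution(adj, max_h, make_sketch=ExactSet):
    """Return the estimates of B_1, ..., B_max_h computed through sketches."""
    n = len(adj)
    prev = {}
    for u in adj:
        prev[u] = make_sketch()            # init(S(B_0(u)))
        prev[u].update(u)                  # update(S(B_0(u)), u)
    result = []
    for _ in range(max_h):
        cur = {}
        for u in adj:
            s = prev[u]                    # S(B_h(u)) starts as S(B_{h-1}(u))
            for v in adj[u]:               # every edge (u, v)
                s = s.merge(prev[v])
            cur[u] = s
        prev = cur
        result.append(sum(s.size() for s in prev.values()) / n**2)
    return result

# Hypothetical toy input: the path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sketched_cumulative_distribution(path, max_h=3))
```

Only the sketches of the previous and of the current iteration are kept, so the memory is two sketches per node. With the exact stand-in the output coincides with the true values \(B_1, B_2, B_3\) of the path.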

\(k\)-min sketch
- Includes the item of smallest rank in each of \(k\) independent permutations
- Let \(r_1,r_2,\ldots,r_k\) be ranking functions \[r_i:U\to \{1/n,2/n,3/n,\ldots, 1\}\] where \(n=|U|\)
Hence, a sketch \(S(A)\) is a sequence of exactly \(k\) entries \(a_1,\ldots,a_k\), where each entry can be an element of \(A\) or \(\bot\)
- Setting \(r_i(\bot)=\infty\)
  - To deal with the initialisation of the sketch

Motivation
- Let \(X\) be the minimum rank of the elements in \(A\)
- If the ranks are uniformly distributed in \([0,1]\), then the probability that \(X\) is at most \(t\) is\[Pr[X\leq t]=1-(1-t)^{|A|}\]
- Hence, the density of \(X\) is\[f_X(t)=|A|(1-t)^{|A|-1}\]
- Average minimum\[E[X]=\int_0^1 x\cdot |A|(1-x)^{|A|-1}\ dx= \left[-\frac{(1-x)^{|A|}(1+|A|\cdot x)}{(1+|A|)}\right]^1_0\]\[=\frac{1}{1+|A|}\]
- Then\[|A|=\frac{1}{E[X]}-1\]

Operations of \(k\)-min sketch
- \(\mathtt{init}(S(A))\) \(a_i=\bot\) for \(i=1,\ldots,k\)
- \(\mathtt{update}(S(A),u)\) for every \(i\), with \(1\leq i\leq k\), if \(r_i(a_i)>r_i(u)\), \(a_i\) is replaced with \(u\)
- \(\mathtt{merge}(S(A),S(B))\) return \(\{c_1,\ldots,c_k\}\) such that \(c_i=\mathrm{arg}\,\mathrm{min}_{x\in\{a_i,b_i\}}r_i(x)\)
- \(\mathtt{size}(S(A))\) return \(\frac{k}{\sum_{i=1}^{k}r_i(a_i)}-1\)
Choosing \(k = O(\epsilon^{-2} \log n)\) the relative error is bounded by \(\epsilon\) w.h.p.
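
The four operations can be sketched as a small Python class. Two assumptions are made here and are not part of the definition above: each ranking \(r_i\) is simulated by hashing the pair \((i,x)\) to a number in \((0,1]\) instead of drawing a random permutation, and an empty sketch reports size 0. `None` plays the role of \(\bot\).

```python
import hashlib

def rank(i, x):
    """Rank of x under the i-th ranking (assumption: simulated by a hash)."""
    digest = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64

class KMinSketch:
    def __init__(self, k):
        self.k = k
        self.entries = [None] * k          # init: every a_i is the bottom element

    def _rank(self, i, x):
        return float("inf") if x is None else rank(i, x)

    def update(self, u):
        for i in range(self.k):
            if self._rank(i, self.entries[i]) > rank(i, u):
                self.entries[i] = u        # u has smaller rank under r_i

    def merge(self, other):
        merged = KMinSketch(self.k)
        merged.entries = [min(a, b, key=lambda x: self._rank(i, x))
                          for i, (a, b) in enumerate(zip(self.entries, other.entries))]
        return merged

    def size(self):
        if self.entries[0] is None:        # nothing inserted yet
            return 0.0
        total = sum(self._rank(i, a) for i, a in enumerate(self.entries))
        return self.k / total - 1

# Two overlapping sets with 600 distinct elements in total.
a, b, direct = KMinSketch(200), KMinSketch(200), KMinSketch(200)
for x in range(400):
    a.update(x)
for x in range(200, 600):
    b.update(x)
for x in range(600):
    direct.update(x)
merged = a.merge(b)
print(merged.entries == direct.entries, round(merged.size()))
```

Merging the sketches of the two sets gives the same entries as sketching the union directly, which is the property the neighbourhood computation relies on. The printed estimate should be close to 600, with an error that shrinks as \(k\) grows.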

Bottom-\(k\) sketch
- Given
  - A ranking (i.e., a bijective function) \(r:U\to \{1/n,2/n,\ldots, 1\}\)
  - A subset \(A\) of \(U\)
- We denote as
  - \(H_k(A)\) the \(k\) elements of \(A\) with smallest rank according to \(r\)
  - \(k_{th}(A)\) the rank of the \(k\)-th smallest element of \(A\) according to \(r\)
- A bottom-\(k\) sketch includes the \(k\) items with smallest rank in a single permutation, that is, \(H_k(A)\)
  - In this case a sketch is simply a subset of \(A\)

Motivation
- The estimator is based on the following proportion, which holds on average\[|A|:1=(k-1):k_{th}(A)\]
- Averaging over all possible rankings\[|A|=E\left[\frac{k-1}{k_{th}(A)}\right]\]

Operations of bottom-\(k\) sketch
- \(\mathtt{init}(S(A))\) empty set
- \(\mathtt{update}(S(A),u)\) if \(|S(A)|<k\), add \(u\) to \(S(A)\), otherwise, if \(r(\max(S(A)))>r(u)\), replace \(\max(S(A))\) with \(u\)
- \(\mathtt{merge}(S(A),S(B))\) return the first \(k\) elements of the set \(S(A)\cup S(B)\), that is \(H_k(S(A)\cup S(B))\)
- \(\mathtt{size}(S(A))\) return \((k-1)/k_{th}(S(A))\)
The same guarantee as for the \(k\)-min sketch applies
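
A possible Python rendering of these operations is given below, as a sketch rather than a tuned implementation. The assumptions are: the ranking \(r\) is simulated by a hash into \((0,1]\), the sketch is stored as a plain set (a heap would make updates cheaper), and a sketch holding fewer than \(k\) items reports its exact size, since in that case it contains all of \(A\).

```python
import hashlib

def rank(x):
    """Rank of x in (0, 1] (assumption: a hash simulates the random ranking)."""
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64

class BottomKSketch:
    def __init__(self, k):
        self.k = k
        self.items = set()                 # init: empty set

    def update(self, u):
        if u in self.items:
            return
        if len(self.items) < self.k:
            self.items.add(u)
            return
        worst = max(self.items, key=rank)  # item of largest rank in the sketch
        if rank(worst) > rank(u):
            self.items.remove(worst)
            self.items.add(u)

    def merge(self, other):
        merged = BottomKSketch(self.k)
        union = self.items | other.items
        merged.items = set(sorted(union, key=rank)[:self.k])   # H_k of the union
        return merged

    def size(self):
        if len(self.items) < self.k:       # the sketch holds the whole set
            return float(len(self.items))
        kth = max(rank(x) for x in self.items)
        return (self.k - 1) / kth

# Two overlapping sets with 600 distinct elements in total.
a, b, direct = BottomKSketch(200), BottomKSketch(200), BottomKSketch(200)
for x in range(400):
    a.update(x)
for x in range(200, 600):
    b.update(x)
for x in range(600):
    direct.update(x)
merged = a.merge(b)
print(merged.items == direct.items, round(merged.size()))
```

Compared with the \(k\)-min sketch, a single ranking is evaluated per element instead of \(k\), while the merged sketch still coincides with the sketch of the union.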

Loglog counter
- A sketch \(S(A)\) is a sequence of exactly \(k\) binary strings \(a_1,\ldots,a_k\), each of length \(m\)
- Given \(k\) partition functions \(p_i:U\to \{1,2,\ldots, m\}\), for each \(j\) with \(1\leq j\leq m\), the bit \(a_{i,j}=1\) if there exists an element of \(A\) that is mapped to \(j\) by \(p_i\), and \(0\) otherwise
- Each partition function maps, at random, \(1/2\) of the elements to 1, \(1/4\) of the elements to 2, \(\ldots\), \(1/2^j\) of the elements to \(j\)

Operations of loglog counter
- \(\mathtt{init}(S(A))\) \(a_{i,j}=0\) for any \(i,j\)
- \(\mathtt{update}(S(A),u)\) for any \(i\), if \(p_i(u)=j\), set \(a_{i,j}\) to \(1\)
- \(\mathtt{merge}(S(A),S(B))\) return \(\{c_1,\ldots,c_k\}\) where \(c_i\) is the bitwise OR of \(a_i\) and \(b_i\)
- \(\mathtt{size}(S(A))\) let \(b\) be the average position of the lowest zero bit in \(a_1,\ldots, a_k\), return \(2^b/0.77351\)
Used to compute the average distance in the Facebook graph
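
The bitmask counter described above can be sketched as follows. Several choices are assumptions of this illustration rather than part of the slides: each \(p_i\) is derived from the trailing zero bits of a hash of \((i,x)\), each string \(a_i\) is stored as a Python integer, and the position of the lowest zero bit is counted from 0 (that is, \(b\) is the average number of consecutive ones at the start of each string), which is the convention under which the constant 0.77351 is normally used.

```python
import hashlib

def bucket(i, x, m):
    """p_i(x): position j in 1..m, chosen with probability about 2^-j."""
    digest = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    j = 1
    while h & 1 == 0 and j < m:            # count trailing zero bits of the hash
        h >>= 1
        j += 1
    return j

class BitmaskCounter:
    def __init__(self, k, m=32):
        self.k, self.m = k, m
        self.masks = [0] * k               # init: bit j-1 of masks[i] is a_{i,j}

    def update(self, u):
        for i in range(self.k):
            self.masks[i] |= 1 << (bucket(i, u, self.m) - 1)

    def merge(self, other):
        merged = BitmaskCounter(self.k, self.m)
        merged.masks = [a | b for a, b in zip(self.masks, other.masks)]
        return merged

    def size(self):
        def lowest_zero(mask):             # 0-based index of the lowest zero bit
            r = 0
            while mask & 1:
                mask >>= 1
                r += 1
            return r
        b = sum(lowest_zero(a) for a in self.masks) / self.k
        return 2**b / 0.77351

# Two overlapping sets with 1000 distinct elements in total.
a, b, direct = BitmaskCounter(128), BitmaskCounter(128), BitmaskCounter(128)
for x in range(700):
    a.update(x)
for x in range(300, 1000):
    b.update(x)
for x in range(1000):
    direct.update(x)
merged = a.merge(b)
print(merged.masks == direct.masks, round(merged.size()))
```

Each string needs only about \(\log_2 n\) bits, and merging is a bitwise OR, which makes this representation attractive on very large graphs. The estimate is unreliable for very small sets (an empty counter already reports about 1.3), so it is meant for large cardinalities.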
