Graph mining

2019/2020

Pierluigi Crescenzi

Université de Paris, IRIF

Inspired by Advanced Algorithms and Graph Mining by Andrea Marino (University of Florence)

Diameter Computation

Heuristics and Lower Bounds

Diameter computation

Small world

pierluigi.crescenzi@irif.fr

GM

#01

  • The small average distance observed in complex networks is referred to as the small-world effect
  • The average distance is roughly constant (about six)
  • If the network has \(n\) nodes, the diameter is at most of the order of magnitude of \(\log(n)\)

By Ageev Andrew - Own Work based on Image:Map of USA showing state names.png

CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22945395

Diameter computation

Small world

  • The average distance in Facebook (721.1 million nodes and 68.7 billion edges) is 4.7 and the diameter is 41
  • The average distance has been computed with the HyperANF tool
  • The diameter has been computed with iFUB

Diameter computation

Definitions

  • Given an unweighted, (strongly) connected graph \(G = (V,E)\)

    • Distance
      The distance \(d(u,v)\) is the number of edges along a shortest path from \(u\) to \(v\)

    • Diameter
      \(D = \max_{u,v \in V} d(u,v)\)

Diameter computation

Definitions: undirected graphs

  • Eccentricity of a node \(u\): \(\mathrm{ecc}(u)=\max_{v\in V}d(u,v)\)

    • Diameter: \(D=\max_{u\in V}\mathrm{ecc}(u)\)

  • Frontier \(F_i(u)\) of a node \(u\): set of nodes at distance \(i\) from \(u\)

    • Nodes at level \(i\) of BFS tree

Diameter computation

Definitions: directed graphs

  • Forward eccentricity of a node \(u\): \(\mathrm{ecc}_F(u)=\max_{v\in V}d(u,v)\)

  • Backward eccentricity of a node \(u\): \(\mathrm{ecc}_B(u)=\max_{v\in V}d(v,u)\)

    • Diameter: \(D=\max_{u\in V}\{\mathrm{ecc}_F(u),\mathrm{ecc}_B(u)\}\)

Diameter computation

Definitions: directed graphs

  • Forward frontier \(F^F_i(u)\) of a node \(u\): nodes at level \(i\) of forward BFS tree from \(u\)
  • Backward frontier \(F^B_i(u)\) of a node \(u\): nodes at level \(i\) of backward BFS tree to \(u\)

[Figure: forward BFS tree and backward BFS tree]
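
The forward and backward frontiers can be sketched in a few lines of Python. This is only an illustration, assuming the graph is stored as a dictionary of out-neighbour lists; the helper names `bfs_levels` and `reverse` and the tiny example graph are hypothetical. A backward BFS is simulated by a forward BFS on the reversed graph.

```python
def bfs_levels(adj, source):
    """Return the frontiers F_0, F_1, ... of a BFS from source as a list of lists."""
    seen = {source}
    levels = [[source]]
    while True:
        frontier = []
        for v in levels[-1]:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        if not frontier:
            return levels
        levels.append(frontier)

def reverse(adj):
    """Return the graph with every arc reversed."""
    rev = {v: [] for v in adj}
    for v, outs in adj.items():
        for w in outs:
            rev[w].append(v)
    return rev

# Hypothetical strongly connected digraph: 0->1, 0->2, 1->2, 2->0
g = {0: [1, 2], 1: [2], 2: [0]}
forward = bfs_levels(g, 0)             # forward frontiers of node 0
backward = bfs_levels(reverse(g), 0)   # backward frontiers of node 0
ecc_f, ecc_b = len(forward) - 1, len(backward) - 1
print(forward, backward, ecc_f, ecc_b)
```

In this example the forward eccentricity of node 0 is 1, while its backward eccentricity is 2, because node 1 needs two arcs to reach node 0.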

Diameter computation

State of the art: positive results

  • \(G=(V,E)\) with \(n=|V|\) and \(m=|E|\)
  • Textbook algorithm
    • Perform \(n\) BFSes and return maximum eccentricity
      • A BFS from \(u\) returns all the distances from \(u\) and takes \(O(m)\) time
    • Complexity \(O(nm)\): too expensive
  • Several other approaches for all pairs shortest path problem
    • \(O(n^{(3+\omega)/2}\log n)\) where \(\omega\) is the exponent of the matrix multiplication
    • Still too expensive
  • Empirically finding lower bound \(L\) and upper bound \(U\)
    • That is, \(L \leq D \leq U\)
    • \(D\) found, when \(L=U\)
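
As a point of reference, the textbook algorithm can be sketched as follows. This is a minimal illustration, assuming a connected undirected graph stored as a dictionary of adjacency lists; the function names and the example path are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict mapping every node reachable from source to its distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def textbook_diameter(adj):
    """Run one BFS per node and return the maximum eccentricity: O(nm) time."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)

# Hypothetical example: the path 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(textbook_diameter(path))  # 3
```

Every node triggers a full BFS, which is exactly the cost that the bounds \(L\) and \(U\) try to avoid.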
Diameter computation

Negative results

  • Quadratic algorithms are not feasible
  • Look for the "hardest" problems solvable in quadratic time
    • Approach similar to NP-completeness
    • Definition of a specific reducibility
      • Preserving subquadratic solvability
  • Hardness relative to a complexity hypothesis
    • Similar to P vs NP
    • SETH: for every \(\epsilon > 0\) there is a \(k\) such that \(k\)-SAT with \(n\) variables cannot be solved in time \(O((2-\epsilon)^n)\)
      • Quadratic time solvable version (\(k\)-SAT*)

Diameter computation

Negative results

  • Quasi-linear reducibility: \(\mathcal{P} \leq_{ql}\mathcal{Q}\)
    • An instance \(I\) of \(\mathcal{P}\) is mapped to an instance \(\Phi(I)\) of \(\mathcal{Q}\)
      • Computable in time \(\tilde{O}(|I|)\)
    • \(I\) and \(\Phi(I)\) have the same output
      • Linear time computable output mapping
  • Fact: if \(\mathcal{P} \leq_{ql}\mathcal{Q}\) and \(\mathcal{Q}\) is solvable in time \(\tilde{O}(n^{2-\epsilon})\), then \(\mathcal{P}\) is solvable in time \(\tilde{O}(n^{2-\epsilon})\)

Diameter computation

Negative results

  • \(k\)-SAT*
    • Input
      • Two sets of variables \(\{x_i\}\) and \(\{y_i\}\) of equal size
      • A set of clauses \(C\)
      • All possible assignments to the \(x_i\)
      • All possible assignments to the \(y_i\)
    • Output
      • True iff \(C\) is satisfiable
  • Fact: an algorithm for \(k\)-SAT* running in time \(O(N^{2-\epsilon})\) on inputs of size \(N\) implies an \(O(2^{\frac{n}{2}(2-\epsilon)})=O((2^{\frac{2-\epsilon}{2}})^n)\) algorithm for \(k\)-SAT with \(n\) variables
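
The arithmetic behind this fact can be spelled out as follows. This is only a sketch, assuming that the \(n\) variables of a \(k\)-SAT formula are split into two halves, so that the corresponding \(k\)-SAT* instance lists \(2^{n/2}\) assignments per half and has size roughly \(N = 2^{n/2}\) (the symbol \(N\) is used here only for illustration, and polynomial factors are ignored).

```latex
\[
N^{2-\epsilon}
  = \left(2^{n/2}\right)^{2-\epsilon}
  = 2^{\frac{n}{2}(2-\epsilon)}
  = \left(2^{1-\epsilon/2}\right)^{n},
\qquad
2^{1-\epsilon/2} < 2 \ \text{ for every } \epsilon > 0
\]
```

Since the base is strictly smaller than 2, a subquadratic algorithm for \(k\)-SAT* would contradict SETH.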

Diameter computation

Negative results

  • The reduction web

Diameter computation

Negative results

  • From \(k\)-two disjoint sets to diameter
    • Input
      • A set of items \(X\)
      • A collection \(C\) of subsets of \(X\), with \(|X|\leq \log^k(|C|)\)
    • Output
      • True iff \(C\) contains two disjoint sets
    • Reduction
      • A clique of \(|X|\) nodes (one per item)
      • An independent set of \(|C|\) nodes (one per set), each adjacent to the nodes of its items
      • Two sets that do not intersect: distance 3
      • Two sets that intersect: distance 2
      • \(C\) contains two disjoint sets \(\leftrightarrow\) the diameter is 3
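
The reduction can be tried out on small instances with the sketch below. It assumes that every set is non-empty and that each set node is linked to the nodes of the items it contains; this edge rule is inferred from the stated distances rather than spelled out on the slide. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def reduction_graph(items, collection):
    """Clique on the items, one node per set, and set-item membership edges."""
    adj = {("item", x): set() for x in items}
    for x, y in combinations(items, 2):
        adj[("item", x)].add(("item", y))
        adj[("item", y)].add(("item", x))
    for idx, subset in enumerate(collection):
        adj[("set", idx)] = set()
        for x in subset:
            adj[("set", idx)].add(("item", x))
            adj[("item", x)].add(("set", idx))
    return adj

def diameter(adj):
    """Textbook diameter: one BFS per node."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

def has_two_disjoint_sets(items, collection):
    """Answer the two-disjoint-sets question through the diameter of the graph."""
    return diameter(reduction_graph(items, collection)) == 3

print(has_two_disjoint_sets([1, 2, 3, 4], [{1, 2}, {3, 4}, {2, 3}]))  # True
print(has_two_disjoint_sets([1, 2, 3, 4], [{1, 2}, {2, 3}, {1, 3}]))  # False
```

The graph has \(|X|+|C|\) nodes, so a subquadratic diameter algorithm would give a subquadratic algorithm for the two-disjoint-sets problem.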

Diameter computation

Lower and upper bounds: undirected graphs

  • By using one BFS tree

  • Lower bound: eccentricity of the root (height of the BFS tree)

    • Example: 3

  • Upper bound: twice the eccentricity

    • Example: 6 (every node can reach any other node by going to \(v_1\) in \(\leq 3\) edges and then to the destination in \(\leq 3\) edges)

    • Fact: \(x\in F_i(u)\) and \(y\in F_j(u)\) implies \(d(x,y)\leq i+j\)

  • Bounds can be obtained by sampling nodes, but very often \(L < D < U\)

    • In the example the diameter is 4: \(d(v_{7},v_{8})=4\)
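
A minimal sketch of these single-BFS bounds, assuming a connected undirected graph given as a dictionary of adjacency lists. The function names and the example path are hypothetical; the path is chosen so that the numbers match the 3, 6 and 4 of the slide example.

```python
from collections import deque

def eccentricity(adj, u):
    """Height of the BFS tree rooted at u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

def single_bfs_bounds(adj, u):
    """Return (L, U) with L = ecc(u) <= D <= 2 * ecc(u) = U."""
    ecc = eccentricity(adj, u)
    return ecc, 2 * ecc

# Hypothetical example: the path 0 - 1 - 2 - 3 - 4, BFS from node 1
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(single_bfs_bounds(path, 1))  # (3, 6), while the diameter is 4
```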

Diameter computation

Lower and upper bounds: directed graphs

  • By using one forward BFS (fBFS) tree and one backward BFS (bBFS) tree

  • Lower bound: maximum between \(\mathrm{ecc}_F(u)\) (height of the fBFS tree) and \(\mathrm{ecc}_B(u)\) (height of the bBFS tree)

    • Example: 5

  • Upper bound: \(\mathrm{ecc}_F(u) + \mathrm{ecc}_B(u)\)

    • Example: 9 (every node can reach any other node by going to \(v_1\) in \(\leq 5\) edges and then to the destination in \(\leq 4\) edges)

    • Fact: \(x\in F_i^B(u)\) and \(y\in F_j^F(u)\) implies \(d(x,y)\leq i+j\)

  • Bounds can be obtained by sampling nodes, but very often \(L < D < U\)

    • In the example the diameter is 7: \(d(v_{10},v_{12})=7\)
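
The directed bounds can be sketched in the same way, assuming a strongly connected digraph stored as a dictionary of out-neighbour lists. The backward BFS is run as a forward BFS on the reversed graph; the names and the small example are hypothetical.

```python
from collections import deque

def eccentricity(adj, u):
    """Height of the BFS tree rooted at u in the graph adj."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

def reverse(adj):
    """Return the graph with every arc reversed."""
    rev = {v: [] for v in adj}
    for v, outs in adj.items():
        for w in outs:
            rev[w].append(v)
    return rev

def directed_bounds(adj, u):
    """Return (L, U) with L = max(ecc_F, ecc_B) and U = ecc_F + ecc_B."""
    ecc_f = eccentricity(adj, u)
    ecc_b = eccentricity(reverse(adj), u)
    return max(ecc_f, ecc_b), ecc_f + ecc_b

# Hypothetical digraph: 0->1, 0->2, 1->2, 2->0 (its diameter is 2)
g = {0: [1, 2], 1: [2], 2: [0]}
print(directed_bounds(g, 0))  # (2, 3)
```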

Diameter computation

Lower bounds: (undirected) 2-sweep

  • Run a BFS from a (random) node \(r\): let \(a\) be the farthest node

  • Run a BFS from \(a\): let \(b\) be the farthest node

  • Return \(d(a,b)\)

  • Experiments (\(r\) node of maximum degree)
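
The three steps of the undirected 2-sweep can be sketched as follows, assuming a connected graph stored as a dictionary of adjacency lists. Ties between equally far nodes are broken by BFS order, which is an arbitrary choice of this sketch; the names are hypothetical.

```python
from collections import deque

def farthest(adj, source):
    """Return (node, distance) for a node at maximum distance from source."""
    dist = {source: 0}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()  # nodes leave the queue in non-decreasing distance
        for w in adj[last]:
            if w not in dist:
                dist[w] = dist[last] + 1
                queue.append(w)
    return last, dist[last]

def two_sweep(adj, r):
    """Lower bound on the diameter: d(a, b) with a farthest from r, b farthest from a."""
    a, _ = farthest(adj, r)
    _, lower_bound = farthest(adj, a)
    return lower_bound

# Hypothetical example: the path 0 - 1 - 2 - 3 - 4, starting from the middle node
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(two_sweep(path, 2))  # 4
```

Starting from the centre of the path, a single BFS would only give the lower bound 2, while the second sweep reaches the exact diameter.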

Diameter computation

Lower bounds: (directed) 2-sweep

  • Run a fBFS and a bBFS from a (random) node \(r\): let \(a_1\) and \(a_2\) be the respective farthest nodes

  • Run a bBFS from \(a_1\) and a fBFS from \(a_2\): let \(b_1\) and \(b_2\) be the respective farthest nodes

  • Return \(\max\{d(b_1,a_1),d(a_2,b_2)\}\)

  • Experiments (\(r\) node of maximum in or out degree)
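
A possible sketch of the directed 2-sweep, assuming a strongly connected digraph stored as a dictionary of out-neighbour lists. Backward BFSes are run on the reversed graph, and the helper names are hypothetical.

```python
from collections import deque

def farthest(adj, source):
    """Return (node, distance) for a node at maximum distance from source in adj."""
    dist = {source: 0}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()
        for w in adj[last]:
            if w not in dist:
                dist[w] = dist[last] + 1
                queue.append(w)
    return last, dist[last]

def reverse(adj):
    """Return the graph with every arc reversed."""
    rev = {v: [] for v in adj}
    for v, outs in adj.items():
        for w in outs:
            rev[w].append(v)
    return rev

def directed_two_sweep(adj, r):
    """Lower bound max{d(b1, a1), d(a2, b2)} on the directed diameter."""
    rev = reverse(adj)
    a1, _ = farthest(adj, r)        # fBFS from r
    a2, _ = farthest(rev, r)        # bBFS from r
    _, d_b1_a1 = farthest(rev, a1)  # bBFS from a1 gives ecc_B(a1)
    _, d_a2_b2 = farthest(adj, a2)  # fBFS from a2 gives ecc_F(a2)
    return max(d_b1_a1, d_a2_b2)

# Hypothetical digraph: 0->1, 0->2, 1->2, 2->0 (its diameter is 2)
g = {0: [1, 2], 1: [2], 2: [0]}
print(directed_two_sweep(g, 0))  # 2
```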

Diameter computation

Lower bounds: bad case for 2-sweep

  • Modified grid with \(k\) rows and \(1+3k/2\) columns
  • The algorithm can return \(k\)
  • The diameter is \(3k/2\)

Diameter computation

Lower bounds: 4-sweep

  • Apply 2-sweep
  • Pick the middle vertex of the returned path
  • Apply 2-sweep again, starting from that vertex
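
One way to sketch 4-sweep is shown below, assuming a connected undirected graph stored as a dictionary of adjacency lists. Returning the larger of the two 2-sweep values is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import deque

def bfs(adj, source):
    """Return (farthest node, parent map, distance map) of a BFS from source."""
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()
        for w in adj[last]:
            if w not in dist:
                dist[w] = dist[last] + 1
                parent[w] = last
                queue.append(w)
    return last, parent, dist

def sweep_with_middle(adj, r):
    """Run 2-sweep from r; return (middle vertex of the a-b path, d(a, b))."""
    a, _, _ = bfs(adj, r)
    b, parent, dist = bfs(adj, a)
    v = b
    for _ in range(dist[b] // 2):  # walk half of the path back towards a
        v = parent[v]
    return v, dist[b]

def four_sweep(adj, r):
    """Two rounds of 2-sweep; the second starts from the middle of the first path."""
    middle, first = sweep_with_middle(adj, r)
    _, second = sweep_with_middle(adj, middle)
    return max(first, second)

# Hypothetical example: the path 0 - 1 - 2 - 3 - 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(four_sweep(path, 0))  # 4
```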

Diameter computation

Lower bounds: bad case for 4-sweep

  • The diameter is \(\max\{ x-1, y, z-1 \}\)
  • By choosing \(x\), \(y\), and \(z\) appropriately, diameter is equal to \(z-1\), while 4-sweep can return \(x-1>(z+1)/2\)

Diameter computation

Exact computation (undirected graphs)

  • The textbook algorithm runs a BFS from every node and returns the maximum eccentricity found

  • Idea

    • Perform the BFSes one after the other specifying the order in which they have to be executed

    • While doing this

      • Refine the lower bound (maximum eccentricity)

      • Upper bound eccentricities of remaining nodes

      • Stop when the remaining nodes cannot have eccentricity higher than our lower bound

    • Good order can be inferred looking at some properties of BFS trees

Diameter computation

Exact computation (undirected graphs)

  • Main observation

    • For any \(1\leq i< \mathrm{ecc}(u)\)  and \(1 \leq k < i\), and for any \(x\in F_{i-k}(u)\) such that \(\mathrm{ecc}(x)>2(i-1)\), there exists \(y\in F_j(u)\) such that \(d(x,y)=\mathrm{ecc}(x)\) with \(j \geq i\)

\(\mathrm{ecc}(x)>2(i-1)\Rightarrow\exists y[\mathrm{ecc}(y)\geq\mathrm{ecc}(x)]\)

  • Proof

    • Since \(\mathrm{ecc}(x)>2(i-1)\), then there exists \(y_x\) whose distance from \(x\) is equal to \(\mathrm{ecc}(x)\) and, hence, greater than \(2(i-1)\)

    • If \(y_x\) were in \(F_j(u)\) with \(j < i\), then \[d(x,y_x)\leq d(x,u)+d(u,y_x)\leq (i-k)+(i-1)\leq 2(i-1)\]

    • Contradiction: hence, \(y_x\in F_j(u)\) with \(j\geq i\)

  • Corollary: if \(lb\) is the maximum among all the eccentricities of the nodes in or below level \(i\), then the eccentricities of all other nodes are bounded by \(\max\{lb,2(i-1)\}\)

Diameter computation

Exact computation (undirected graphs)

  • Notation: \(B_{i}(u)=\max_{v\in F_i(u)}\mathrm{ecc}(v)\)

  • The algorithm (bottom-up)

    • Given a node \(u\) and its BFS tree

      • Set \(i=\mathrm{ecc}(u)\) and \(M=B_{i}(u)\)

      • If \(M > 2(i-1)\), then return \(M\); otherwise set \(i=i-1\) and \(M=\max\{M,B_{i}(u)\}\), and repeat this step
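
The bottom-up procedure can be sketched as follows. This is a simplified illustration of the steps listed above, not the tuned iFUB implementation; it assumes a connected undirected graph stored as a dictionary of adjacency lists, and it also counts the BFSes (including the initial one from \(u\)). All names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict mapping every node to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def ifub(adj, u):
    """Return (diameter, number of BFSes), visiting the BFS levels of u bottom-up."""
    levels = {}
    for v, d in bfs_distances(adj, u).items():
        levels.setdefault(d, []).append(v)
    bfs_count = 1
    i = max(levels)  # i = ecc(u)
    best = 0         # M: largest eccentricity found so far
    while True:
        for v in levels[i]:  # compute B_i(u) and fold it into M
            best = max(best, max(bfs_distances(adj, v).values()))
            bfs_count += 1
        if best > 2 * (i - 1):  # no node above level i can do better
            return best, bfs_count
        i -= 1

# Hypothetical example: the path 0 - 1 - 2 - 3 - 4, starting from the middle node
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(ifub(path, 2))  # (4, 3)
```

On the path, the two deepest nodes already have eccentricity \(4 > 2\), so the procedure stops after the first level.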

Diameter computation

Exact computation (undirected graphs)

  • \(n\): number of nodes
  • \(v\): (average) number of executed BFSes

Diameter computation

Exact computation (undirected graphs)

  • Bad case for iFUB
    • A cycle with \(n\) nodes (\(n\) odd) has diameter \(\frac{n-1}{2}\), and every node has the same BFS tree
    • The loop stops the first time that \(2(i-1)<\frac{n-1}{2}\), that is, \(i<\frac{n+3}{4}\)
    • When \(\frac{n+3}{4}\) is an integer, the total number of iterations is equal to \(\frac{n-1}{2}-\frac{n+3}{4}+2=\frac{n+3}{4}\)
    • Each level contains two nodes, so the number of BFSes is \(\frac{n+3}{2}\)
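
As a worked instance of these formulas, take \(n=9\) (a value chosen here only for illustration, so that \(\frac{n+3}{4}\) is an integer):

```latex
\[
D = \frac{9-1}{2} = 4, \qquad
2(i-1) < 4 \iff i < 3 = \frac{9+3}{4}, \qquad
\text{levels visited: } i = 4, 3, 2
\]
\[
\text{iterations} = 4 - 3 + 2 = 3 = \frac{9+3}{4}, \qquad
\text{BFSes} = 2 \cdot 3 = 6 = \frac{9+3}{2}
\]
```

So on a cycle the number of BFSes grows linearly with \(n\), in contrast with the handful of BFSes typically needed on real-world graphs.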

Diameter computation

Exact computation (directed graphs)

  • Notation: \((X,Y,a,b)\in\{(B,F,x,y),(F,B,t,z)\}\)

  • Main observation

    • For any \(1\leq i< \mathrm{ecc}_X(u)\) and \(1 \leq k < i\), and for any \(a\in F^X_{i-k}(u)\) such that \(\mathrm{ecc}_Y(a)>2(i-1)\), there exists \(b\in F^Y_j(u)\) such that \(\mathrm{ecc}_X(b)\geq\mathrm{ecc}_Y(a)\) with \(j \geq i\)

  • Corollary: if \(lb\) is the maximum among all the \(\mathrm{ecc}_B\) of the nodes in or below level \(i\) of the fBFS tree and all the \(\mathrm{ecc}_F\) of the nodes in or below level \(i\) of the bBFS tree, then the forward and backward eccentricities of all other nodes are bounded by \(\max\{lb,2(i-1)\}\)

  • Further notation:

\[B_j^F(u)=\left\{ \begin{array}{ll} \max_{x\in F_j^F(u)} \mathrm{ecc}_B(x) & \textrm{if }j\leq \mathrm{ecc}_F(u),\\ 0 & \textrm{otherwise} \end{array} \right.\]
\[B_j^B(u)=\left\{ \begin{array}{ll} \max_{x\in F_j^B(u)} \mathrm{ecc}_F(x) & \textrm{if }j\leq \mathrm{ecc}_B(u),\\ 0 & \textrm{otherwise} \end{array} \right.\]

Diameter computation

Exact computation (directed graphs)

  • Example
  • \(i=lb=\max\{\mathrm{ecc}_F(v_1), \mathrm{ecc}_B(v_1)\} = \max\{4, 5\} = 5\), and \(ub=2i = 10\)

  • Since \(ub>lb\), the algorithm enters the while loop with \(i=5\)

  • \(B_5^F(u)=0\) (since \(5>\mathrm{ecc}_F(u)\)) and \(B_5^B(u)=\mathrm{ecc}_F(v_{10})=7\): since \(7 < 8 = 2(i-1)\), the algorithm enters the else branch and sets \(lb\) equal to 7 and \(ub\) equal to 8

  • Since \(ub>lb\), the algorithm enters the while loop again with \(i=4\)

  • \(B_4^F(u)=\mathrm{ecc}_B(v_{10})=6\) and \(B_4^B(u)=\mathrm{ecc}_F(v_{8})=6\): since \(\max\{lb, B_4^B(u), B_4^F(u)\} = 7 > 6 = 2(i-1)\), the algorithm enters the if branch and returns the value 7, which is the correct diameter
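
The loop traced in this example can be sketched as follows. The code is reconstructed from the trace above (initial \(lb\) and \(ub\), the if/else test against \(2(i-1)\), and the updates of \(lb\) and \(ub\)), so it should be read as an assumption about the structure of diFUB rather than as its reference implementation. It assumes a strongly connected digraph stored as a dictionary of out-neighbour lists; the helper names are hypothetical.

```python
def bfs_levels(adj, source):
    """Return the BFS frontiers of source as a list of lists."""
    seen, levels = {source}, [[source]]
    while True:
        frontier = []
        for v in levels[-1]:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        if not frontier:
            return levels
        levels.append(frontier)

def reverse(adj):
    """Return the graph with every arc reversed."""
    rev = {v: [] for v in adj}
    for v, outs in adj.items():
        for w in outs:
            rev[w].append(v)
    return rev

def ecc(adj, v):
    """Eccentricity of v in adj (use the reversed graph for ecc_B)."""
    return len(bfs_levels(adj, v)) - 1

def difub(adj, u):
    """Directed diameter, visiting forward and backward levels of u bottom-up."""
    rev = reverse(adj)
    fwd, bwd = bfs_levels(adj, u), bfs_levels(rev, u)
    i = lb = max(len(fwd), len(bwd)) - 1
    ub = 2 * i
    while ub > lb:
        # B_i^F and B_i^B are 0 when level i does not exist
        b_f = max(ecc(rev, x) for x in fwd[i]) if i < len(fwd) else 0
        b_b = max(ecc(adj, x) for x in bwd[i]) if i < len(bwd) else 0
        best = max(lb, b_f, b_b)
        if best > 2 * (i - 1):
            return best
        lb, ub = best, 2 * (i - 1)
        i -= 1
    return lb

# Hypothetical digraph: the directed cycle 0->1->2->3->0 (its diameter is 3)
cycle4 = {0: [1], 1: [2], 2: [3], 3: [0]}
print(difub(cycle4, 0))  # 3
```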

Diameter computation

Exact computation (directed graphs)

  • For any graph with more than 10000 nodes, diFUB  performs \(0.001\%n\) visits instead of \(n\)

Diameter computation

Final considerations

  • Suitable properties of the starting node \(u\)

    • (1) \(u\) should be a node with minimum eccentricity (this minimum is called the radius \(R\))

    • (2) \(F_{\mathrm{ecc}(u)}(u)\) should contain a constant number of nodes

  • If you can find a node \(u\) satisfying (1) and \(R=D/2\), the algorithm stops after one iteration

    • A high-degree node is very often a good choice

    • If the lower bound path returned by 2-sweep is tight and \(R=D/2\), the node in the middle of this path makes the algorithm stop after one iteration

  • In real-world graphs, almost always \(R=D/2\) (the minimum possible, maximum heterogeneity), and (2) is true if \(u\) is central

Diameter computation

Final considerations

  • diFUB can be generalized to weighted graphs

    • Use Dijkstra's algorithm instead of BFS and sort the nodes according to their distance from \(u\)

    • It works well, but not for road networks

  • Further optimizations allow us to do better than this and also to compute the diameter of weakly connected graphs

  • It is possible to prove that, for some random graph generation models (with a fixed power-law degree distribution), the number of BFSes is almost constant

Diameter computation

Back to theory

  • Average case complexity
    • Very hard and technical
    • Many models
    • Are models realistic?
    • Which properties are used?
  • Axiomatic framework
    • Define properties
    • Deduce probabilistic analyses from the properties
    • Prove that random graphs satisfy the properties
    • Show empirically that real-world graphs satisfy the properties

Diameter computation

Back to theory

  • The models
    • Erdős–Rényi model: connect each pair of nodes with probability \(p\)
      • Not realistic (all nodes are "equal")
      • Heuristics are not efficient on this model
    • Random graph with prescribed degree distribution \(w(u)\)
      • Configuration model
        • Each node \(u\) has \(w(u)\) half-edges
        • Half-edges are paired at random
      • Rank-1 Inhomogeneous Random Graphs (e.g. Chung-Lu)
        • The edge between \(u\) and \(v\) exists with probability \(w(u)w(v)/M\)
          • \(M\): sum of all \(w(u)\)
  • Power law degree distribution \[|\{v\in V:\mathrm{deg}(v)=d\}| \approx nd^{-\beta}\]
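
A Chung-Lu style generator can be sketched as follows. This is a naive quadratic-time illustration, assuming positive weights; capping the probability at 1 is a choice of this sketch for weights where \(w(u)w(v) > M\). The function name and parameters are hypothetical.

```python
import random

def chung_lu(weights, seed=None):
    """Sample an undirected graph where edge {u, v} appears with
    probability min(1, w(u) * w(v) / M), M being the sum of all weights."""
    rng = random.Random(seed)
    total = sum(weights)  # M in the slide notation, assumed positive
    n = len(weights)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = min(1.0, weights[u] * weights[v] / total)
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

# Hypothetical weights: one hub and several low-weight nodes
graph = chung_lu([4, 2, 2, 1, 1, 1, 1], seed=0)
print({u: sorted(neigh) for u, neigh in graph.items()})
```

The expected degree of a node \(u\) is close to \(w(u)\) as long as the cap is never reached, which is why the weights can be drawn from a power law to mimic real-world degree sequences.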

Diameter computation

Back to theory

  • Some definitions
    • \(\gamma^l(s)=|\{v\in V:d(s,v)=l\}|\)
    • \(\tau_s(n^x) = \min\{l:\gamma^l(s)> n^x\}\)
    • \(T(d \rightarrow n^x) = \mathrm{\ avg}_{\mathrm{deg}(s)=d}\tau_s(n^x)\)

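[Figure: BFS levels around \(s\), of sizes \(\gamma^1(s), \gamma^2(s), \ldots\), up to the first level \(\tau_s(n^x)\) containing more than \(n^x\) nodes]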
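
The quantities \(\gamma^l(s)\) and \(\tau_s(n^x)\) can be computed with a single BFS, as in the sketch below. It assumes a graph stored as a dictionary of adjacency lists; returning `None` when no level is large enough is a convention of this sketch, and the names and the example tree are hypothetical.

```python
def level_sizes(adj, s):
    """Return [gamma^0(s), gamma^1(s), ...]: the sizes of the BFS levels of s."""
    seen, frontier, sizes = {s}, [s], []
    while frontier:
        sizes.append(len(frontier))
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return sizes

def tau(adj, s, threshold):
    """Smallest l with gamma^l(s) > threshold, or None if no such level exists."""
    for l, size in enumerate(level_sizes(adj, s)):
        if size > threshold:
            return l
    return None

# Hypothetical example: complete binary tree with 15 nodes rooted at 0
tree = {v: [] for v in range(15)}
for v in range(1, 15):
    parent = (v - 1) // 2
    tree[v].append(parent)
    tree[parent].append(v)

print(level_sizes(tree, 0))     # [1, 2, 4, 8]
print(tau(tree, 0, 15 ** 0.5))  # 2, since gamma^2 = 4 exceeds sqrt(15)
```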

Diameter computation

Back to theory

  • Property 1 \[|\{s\in V:\tau_s(n^x)\geq T(\mathrm{deg}(s)\rightarrow n^x)+l\}|\approx\frac{n}{c^l}\]

Diameter computation

Back to theory

  • Property 2 \[d(s,t) \approx \tau_s(n^x)+\tau_t(n^{1-x})-1\]

Diameter computation

Back to theory

  • Estimate of the diameter
    • By properties 2 and 3 \[\mathrm{ecc}(s) = \max_{t\in V}d(s,t) \approx \tau_s(\sqrt{n})+\max_{t\in V}\tau_t(\sqrt{n})\]
    • By property 1 \[|\{t\in V:\tau_t(\sqrt{n})\geq T(\mathrm{deg}(t)\rightarrow \sqrt{n})+l\}|\approx\frac{n}{c^l}\]
    • Hence \[\max_{t\in V}\tau_t(\sqrt{n}) \approx T(1\rightarrow \sqrt{n})+\log_c(n)\]
    • Hence \[\mathrm{ecc}(s) \approx \tau_s(\sqrt{n}) + T(1\rightarrow \sqrt{n})+\log_c(n)\]
    • Hence \[D = \max_{s\in V}\mathrm{ecc}(s) \approx 2T(1\rightarrow \sqrt{n})+2\log_c(n)\]

Diameter computation

Back to theory

  • Sampling versus 2-sweep
  • Sampling returns the eccentricity of a random node \(s\) \[\mathrm{ecc}(s) \approx \tau_s(\sqrt{n}) + T(1\rightarrow \sqrt{n})+\log_c(n) \approx 2T(1\rightarrow \sqrt{n})+\log_c(n)\]
  • 2-sweep
    • Returns the eccentricity of the node \(t\) farthest from a random node \(s\) \[\mathrm{ecc}(t) \approx \tau_t(\sqrt{n}) + T(1\rightarrow \sqrt{n})+\log_c(n) \approx \max_{t\in V}\tau_t(\sqrt{n}) + T(1\rightarrow \sqrt{n})+\log_c(n) \approx 2T(1\rightarrow \sqrt{n})+2\log_c(n) \approx D\]