A (large) graph mining roundtrip

From practice to theory and back

(two times)

Pierluigi Crescenzi

Leiden, September 23, 2016

Leiden Networks Day

The beginning of the story

  • Graph G=(V,E)
    • Undirected
    • Connected
  • Distance d(u,v): number of edges in a shortest path from u to v
  • Diameter: maximum distance
    • Maximum eccentricity over all nodes
      • e(v)=max_w d(v,w)

Our "toy" network

IMDB graph: edge between two actors if they played in the same movie

Algorithms for diameter

O(|V||E|): breadth-first search from each node
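
As a point of reference, the textbook approach can be sketched as follows. This is a minimal illustration, assuming the graph is connected and given as a dictionary of adjacency lists; the names `bfs_distances`, `diameter_all_bfs` and the toy graph are hypothetical and not taken from the talk.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter_all_bfs(adj):
    """Textbook diameter: one BFS per node, O(|V||E|) overall."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


# Hypothetical toy graph: the path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(diameter_all_bfs(path))  # 3
```

Each BFS costs O(|E|) on a connected graph, which is what makes this approach impractical on networks with millions of nodes.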

Can we do better?

Into the square

  • Quadratic algorithms are not feasible
  • Look for "hardest" quadratic time solvable problems
    • Approach similar to NP-completeness
    • Definition of specific reducibility
      • Preserving subquadratic solvability
  • Hardness relative to complexity hypothesis
    • Similar to P vs NP
    • SETH: for every ε>0 there is a k such that k-SAT cannot be solved in time O(2^{(1-ε)n})
      • Quadratic time solvable version (k-SAT*)

Quasi-linear reducibility

\mathcal{P} \leq_{ql} \mathcal{Q}
I \text{ instance of } \mathcal{P} \rightarrow \Phi(I) \text{ instance of } \mathcal{Q}
\text{Computable in time } \tilde{O}(|I|)
I \text{ and } \Phi(I) \text{ same output}
\text{Linear time computable output mapping}
\mathcal{P} \leq_{ql} \mathcal{Q} \text{ and } \mathcal{Q} \text{ is solvable in time } \tilde{O}(n^{2-\epsilon})
\Rightarrow \mathcal{P} \text{ is solvable in time } \tilde{O}(n^{2-\epsilon})
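
The reason such a reduction preserves subquadratic solvability can be spelled out in a few lines. The sketch below assumes that the transformed instance has size quasi-linear in |I| (which follows from the running time of the transformation) and uses a hypothetical algorithm A_Q for Q and a hypothetical constant c for the polylogarithmic factor.

```latex
% Sketch under the stated assumptions; A_Q and c are illustrative names.
\begin{align*}
T_{\mathcal{P}}(|I|)
  &\leq \underbrace{\tilde{O}(|I|)}_{\text{compute } \Phi(I)}
   + \underbrace{\tilde{O}\big(|\Phi(I)|^{2-\epsilon}\big)}_{\text{run } A_{\mathcal{Q}} \text{ on } \Phi(I)}
   + \underbrace{O(|I|)}_{\text{map the output back}} \\
  &= \tilde{O}(|I|) + \tilde{O}\big((|I|\log^{c}|I|)^{2-\epsilon}\big) \\
  &= \tilde{O}(|I|^{2-\epsilon})
\end{align*}
```

The polylogarithmic blow-up of the instance is absorbed by the tilde notation, which is why quasi-linear (rather than strictly linear) transformations are enough.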

k-SAT*

  • Input
    • \text{Two sets of } n \text{ variables } \{x_i\}, \{y_i\}
    • \text{Set of clauses } C
    • \text{Possible assignments to } x_i
    • \text{Possible assignments to } y_i
  • \text{Output: true if } C \text{ satisfiable}
  • O(n^{2-\epsilon}) \text{ algorithm for } k\text{-SAT}^* \Rightarrow O(2^{\frac{n}{2}(2-\epsilon)}) = O((2^{\frac{2-\epsilon}{2}})^n) \text{ algorithm for SAT}
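
To see why this version is solvable in quadratic time, note that the two lists of assignments are part of the input, so trying every pair is quadratic in the input size. The following is a minimal sketch of that baseline, assuming a clause is encoded as a list of (variable, wanted value) pairs; the encoding and all function names are hypothetical.

```python
from itertools import product


def all_assignments(variables):
    """Enumerate every truth assignment to the given variables as a dict."""
    return [dict(zip(variables, values))
            for values in product([False, True], repeat=len(variables))]


def satisfies(clauses, assignment):
    """True if every clause contains at least one literal made true by the assignment."""
    return all(any(assignment[var] == wanted for var, wanted in clause)
               for clause in clauses)


def k_sat_star(clauses, x_assignments, y_assignments):
    """Quadratic baseline: try every pair of listed assignments."""
    return any(satisfies(clauses, {**ax, **ay})
               for ax in x_assignments
               for ay in y_assignments)


# Hypothetical instance: (x1 or y1) and (not x1 or not y1).
clauses = [[("x1", True), ("y1", True)], [("x1", False), ("y1", False)]]
xs, ys = all_assignments(["x1"]), all_assignments(["y1"])
print(k_sat_star(clauses, xs, ys))  # True
```

An algorithm beating this pairwise loop by a polynomial factor would, by the calculation on the slide, give a faster-than-brute-force algorithm for SAT.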

The reduction web

From disjoint sets to diameter

  • Input
    • \text{Set of items } X
    • \text{Collection } C \text{ of subsets of } X
  • \text{Output: true if } C \text{ has two disjoint sets}
  • Reduction
    • \text{Clique of } |X| \text{ nodes}
    • \text{Independent set of } |C| \text{ nodes}
    • \text{Two sets that intersect: distance } 2
    • \text{Two sets that do not intersect: distance } 3
    • \text{Disjoint sets} \leftrightarrow \text{diameter is } 3
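
The reduction can be checked on small instances. The sketch below assumes that each set node is joined to the nodes of the items it contains (the slide text does not spell these edges out, so this is an assumption) and that every set is nonempty; the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def reduction_graph(items, collection):
    """Clique on the items, one node per set, and set-item membership edges."""
    adj = {("item", x): set() for x in items}
    adj.update({("set", i): set() for i in range(len(collection))})
    for x, y in combinations(items, 2):       # clique of |X| nodes
        adj[("item", x)].add(("item", y))
        adj[("item", y)].add(("item", x))
    for i, subset in enumerate(collection):   # set nodes are never joined to each other
        for x in subset:
            adj[("set", i)].add(("item", x))
            adj[("item", x)].add(("set", i))
    return adj


def diameter(adj):
    """Diameter by BFS from every node (fine for a tiny illustration)."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best


def has_disjoint_pair(collection):
    """Direct quadratic check, used here only to validate the reduction."""
    return any(not (a & b) for a, b in combinations(collection, 2))


items = [1, 2, 3, 4]
for collection in ([{1, 2}, {2, 3}, {1, 3, 4}], [{1, 2}, {3, 4}]):
    print(has_disjoint_pair(collection), diameter(reduction_graph(items, collection)))
# False 2
# True 3
```

Two set nodes sharing an item are at distance 2 through that item, while two disjoint sets need one clique edge in between, giving distance 3.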

Lower bound on the diameter

The 2-sweep heuristic

  • v_1
  • v_2 \text{ maximizes } d(v_2,v_1)
  • \text{Max eccentricity of } v_1, v_2: \text{ good lower bound on diameter}
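
A minimal sketch of the 2-sweep idea, assuming a connected undirected graph stored as adjacency lists; the starting node is passed in by the caller and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def two_sweep(adj, v1):
    """Lower bound on the diameter from two BFSes: from v1 and from a farthest node v2."""
    d1 = bfs_distances(adj, v1)
    v2 = max(d1, key=d1.get)          # a node maximizing d(v2, v1)
    d2 = bfs_distances(adj, v2)
    return max(max(d1.values()), max(d2.values()))


# Hypothetical toy graph: the path a - b - c - d - e, started from the middle.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(two_sweep(path, "c"))  # 4
```

The result is always a valid lower bound because it is the eccentricity of an actual node; on this toy path it happens to equal the diameter.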

Lower bound on the diameter

The sumsweep heuristic

  • v_1
  • v_2 \text{ maximizes } d(v_2,v_1)
  • v_3 \text{ maximizes } d(v_3,v_1)+d(v_3,v_2)
  • v_4 \text{ maximizes } d(v_4,v_1)+d(v_4,v_2)+d(v_4,v_3)
  • \text{Max eccentricity of } v_1, v_2, v_3, v_4: \text{ better lower bound on diameter}
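
The same idea generalizes to k sources by keeping a running sum of distances. The following sketch assumes a connected undirected graph and picks each new source among the nodes not yet used; the parameter `k` and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sum_sweep_lower_bound(adj, v1, k=4):
    """Lower bound on the diameter from up to k BFSes chosen by sum of distances."""
    total = {w: 0 for w in adj}   # sum of distances to the sources used so far
    used = set()
    best = 0
    v = v1
    for _ in range(min(k, len(adj))):
        dist = bfs_distances(adj, v)
        used.add(v)
        best = max(best, max(dist.values()))      # eccentricity of v
        for w, d in dist.items():
            total[w] += d
        candidates = [w for w in adj if w not in used]
        if not candidates:
            break
        v = max(candidates, key=total.get)         # next source: largest distance sum
    return best


# Hypothetical toy graph: the path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(sum_sweep_lower_bound(path, "c"))  # 4
```

With `k=2` this reduces to the 2-sweep choice, since the node with the largest distance sum after one BFS is simply a farthest node.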

Bounds on node eccentricities

  • \text{If BFS from } v \text{ done}
    • \underbrace{d(v,w)}_{L_v(w)} \leq ecc(w) \leq \underbrace{d(v,w)+ecc(v)}_{U_v(w)}
  • U_v(w) \text{ can be improved}
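
Both bounds come for free from a single BFS. A minimal sketch, assuming a connected undirected graph; the function names are hypothetical, and the upper bound is the plain triangle-inequality one, without the improvement mentioned on the slide.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def eccentricity_bounds(adj, v):
    """Bounds on ecc(w) for every w, derived from one BFS rooted at v."""
    dist = bfs_distances(adj, v)
    ecc_v = max(dist.values())
    lower = dict(dist)                                 # L_v(w) = d(v, w)
    upper = {w: d + ecc_v for w, d in dist.items()}    # U_v(w) = d(v, w) + ecc(v)
    return lower, upper


# Hypothetical toy graph: the path a - b - c - d - e, with the BFS rooted at c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
lower, upper = eccentricity_bounds(path, "c")
print(lower["a"], upper["a"])  # 2 4
```

The upper bound holds because any node reachable from v within ecc(v) steps is reachable from w within d(v,w)+ecc(v) steps.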

Exact value of diameter

  • \text{Vectors } e_L, e_U: \text{ lower and upper bounds}
  • \text{Vector } S: \text{ sum of distances to already explored nodes}
  • \text{Start with } e_L=0, e_U=\infty \text{ and sumsweep of } k \text{ nodes}
  • \text{At each step, \textbf{cleverly} choose next } v
  • \text{At each BFS from } v:
    • e_L(w)=\max(e_L(w),L_v(w))
    • e_U(w)=\min(e_U(w),U_v(w))
    • S(w)=d(v,w)+S(w)
  • \text{Update } e_L, e_U: e_L(w)=e_U(w) \Rightarrow e(w)=e_L(w)
  • \text{Terminate when } \max e(v) \geq \max(e_U(w))

Choosing the next vertex u

Alternate
  • \text{Minimize } e_L(u)
    • \text{Ties broken by minimizing } S(u)
    • Should improve upper bounds
  • \text{Maximize } e_U(u)
    • \text{Ties broken by maximizing } S(u)
    • Should improve lower bounds
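
Putting the bounds and the alternating choice together gives a loop that stops as soon as no unexplored node can exceed the best eccentricity found. The sketch below is a simplified illustration, not the authors' implementation: it assumes a connected undirected graph, starts from a single caller-supplied node instead of an initial sumsweep of k nodes, and uses hypothetical names throughout.

```python
from collections import deque
from math import inf


def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def exact_diameter(adj, start):
    """Exact diameter by refining eccentricity bounds until they rule out every node."""
    e_low = {w: 0 for w in adj}
    e_up = {w: inf for w in adj}
    total = {w: 0 for w in adj}   # S(w): sum of distances to explored sources
    explored = set()
    v, pick_max_upper = start, True
    while True:
        dist = bfs_distances(adj, v)
        ecc_v = max(dist.values())
        explored.add(v)
        for w, d in dist.items():
            e_low[w] = max(e_low[w], d)          # L_v(w) = d(v, w)
            e_up[w] = min(e_up[w], d + ecc_v)    # U_v(w) = d(v, w) + ecc(v)
            total[w] += d
        e_low[v] = e_up[v] = ecc_v               # eccentricity of v is now exact
        best = max(e_low.values())               # largest eccentricity found so far
        todo = [w for w in adj if w not in explored]
        if all(e_up[w] <= best for w in todo):
            return best
        if pick_max_upper:   # peripheral candidate: should improve lower bounds
            v = max(todo, key=lambda w: (e_up[w], total[w]))
        else:                # central candidate: should improve upper bounds
            v = min(todo, key=lambda w: (e_low[w], total[w]))
        pick_max_upper = not pick_max_upper


# Hypothetical toy graph: the path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(exact_diameter(path, "c"))  # 4
```

On this toy path the loop stops after two BFSes instead of five. In the worst case it still explores every node, which matches the remark that the theoretical bound is no better than the textbook one.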

Performance

In theory, as bad as the worst case, but...

Why?

Average case complexity

  • Very hard and technical
  • Many models
  • Are models realistic?
  • Which properties are used?
  • Axiomatic framework
    • Define axioms
    • Deduce probabilistic analyses from the axioms
    • Prove that random graphs satisfy the axioms
    • Show empirically that real-world graphs satisfy the axioms

The models

  • Erdős-Rényi model
    • Not realistic (all nodes are "equal")
    • Heuristics are not efficient on this model
  • Random graph with prescribed degree distribution
    • Configuration model
    • Chung-Lu model
    • Norros-Reittu model
  • Power law degree distribution
|\{v\in V:\mathrm{deg}(v)=d\}| \approx nd^{-\beta}

The axioms

  • Some definitions
\gamma^l(s)=|\{v\in V:d(s,v)=l\}|
\tau_s(n^x) = \min\{l:\gamma^l(s)\geq n^x\}
T(d \rightarrow n^x) = \mathrm{avg}_{\mathrm{deg}(s)=d}\,\tau_s(n^x)
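
These three quantities can be measured directly with BFS. A minimal sketch, assuming an undirected graph stored as adjacency lists and a caller-supplied numeric threshold standing in for n^x; the function names are hypothetical, and nodes whose layers never reach the threshold are left out of the average.

```python
from collections import deque


def layer_sizes(adj, s):
    """sizes[l] = number of nodes at distance exactly l from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    sizes = [0] * (max(dist.values()) + 1)
    for d in dist.values():
        sizes[d] += 1
    return sizes


def tau(adj, s, threshold):
    """Smallest l whose layer around s has at least `threshold` nodes, else None."""
    for l, size in enumerate(layer_sizes(adj, s)):
        if size >= threshold:
            return l
    return None


def average_tau(adj, degree, threshold):
    """Average of tau over the nodes of the given degree (None if there are none)."""
    values = [tau(adj, s, threshold) for s in adj if len(adj[s]) == degree]
    values = [t for t in values if t is not None]
    return sum(values) / len(values) if values else None


# Hypothetical toy graph: a star with center c and four leaves.
star = {"c": ["l1", "l2", "l3", "l4"],
        "l1": ["c"], "l2": ["c"], "l3": ["c"], "l4": ["c"]}
print(tau(star, "c", 3), tau(star, "l1", 3), average_tau(star, 1, 3))  # 1 2 2.0
```

The high-degree center reaches a large layer one step earlier than the leaves, which is the kind of degree dependence that T(d → n^x) is meant to capture.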

...

Axiom 1

|\{s\in V:\tau_s(n^x)\geq T(\mathrm{deg}(s)\rightarrow n^x)+l\}|\approx\frac{n}{c^l}

[Figure: layers \gamma^1(s), \gamma^2(s), ... around s, with labels \tau_s(n^x) and n^x]

Axiom 2

d(s,t) \approx \tau_s(n^x)+\tau_t(n^{1-x})-1

The sum-sweep heuristic

  • \beta>3: \leq n^{1+\frac{C}{C+\frac{\beta-1}{\beta-3}}}
  • 2<\beta<3: n^{1+o(1)}
  • 1<\beta<2: \leq mn^{1-\frac{2-\beta}{\beta-1}\left(\left\lfloor\frac{\beta-1}{2-\beta}-\frac{3}{2}\right\rfloor-\frac{1}{2}\right)}
  • C=\frac{2d_{\mathrm{avg}}(n)}{D-d_{\mathrm{avg}}(n)} \text{ is constant}

The same story for...

  • Hyperbolicity
  • Closeness centrality
  • Betweenness centrality

Computing closeness top-k

  • \text{Definition: } c(v)=\frac{n-1}{\sum_{w \in V-\{v\}}d(v,w)}
  • \text{In theory: complexity } \Theta(n^2)
  • In practice
    • \text{BFSCut returns 0 if } v \text{ is not among top } k, c(v) \text{ otherwise}
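
The definition translates into one BFS per node. A minimal sketch, assuming a connected undirected graph with at least two nodes; the function name and toy graph are hypothetical.

```python
from collections import deque


def closeness(adj, v):
    """c(v) = (n - 1) / (sum of distances from v to every other node)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    farness = sum(dist.values())
    return (len(adj) - 1) / farness


# Hypothetical toy graph: the path a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(path, "b"))  # 1.0
```

Ranking all nodes this way needs one full BFS per node, which is the cost the cut-off technique tries to avoid.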

How to cut the BFS

\text{If } \gamma_d(v)=|\Gamma_d(v)|: f(v) \geq f_d(v) + (d+1)\gamma_{d+1}(v)+(d+2)(r(v)-n_{d+1}(v))
\text{Since } n_{d+1}(v)=\gamma_{d+1}(v)+n_{d}(v): f(v) \geq f_d(v) - \gamma_{d+1}(v)+(d+2)(r(v)-n_{d}(v))
\text{Since } \gamma_{d+1}(v)\leq\tilde{\gamma}_{d+1}(v): f(v) \geq f_d(v) - \tilde{\gamma}_{d+1}(v)+(d+2)(r(v)-n_{d}(v))
  • Everything is known
    • If graph not connected, work on components
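
The last inequality can be evaluated after every BFS level, which gives a simple pruning rule. The sketch below is an illustration under stated assumptions, not the published implementation: the graph is connected and undirected (so r(v) = n), the estimate of the next level is the sum of deg(u) - 1 over the current level (deg(v) at level 0), nodes are tried in decreasing degree order, and the helper returns the farness or None rather than the closeness or 0. All names are hypothetical.

```python
from math import inf


def bfs_cut(adj, v, cutoff):
    """Farness f(v), or None once a lower bound on f(v) reaches `cutoff`."""
    n = len(adj)
    seen, level = {v}, [v]
    d, f_d = 0, 0                      # depth reached and sum of distances within it
    while level and len(seen) < n:
        # Upper estimate of the next level: every node except the root spends
        # one edge on its parent.
        gamma_tilde = sum(len(adj[u]) for u in level) - (len(level) if d > 0 else 0)
        if f_d - gamma_tilde + (d + 2) * (n - len(seen)) >= cutoff:
            return None                # v cannot beat the current k-th best
        next_level = []
        for u in level:
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    next_level.append(w)
        d += 1
        f_d += d * len(next_level)
        level = next_level
    return f_d if f_d < cutoff else None


def top_k_closeness(adj, k):
    """The k nodes of highest closeness, as (node, closeness) pairs."""
    n = len(adj)
    best = []                          # at most k (farness, node) pairs, smallest first
    for v in sorted(adj, key=lambda u: -len(adj[u])):
        cutoff = best[-1][0] if len(best) == k else inf
        f = bfs_cut(adj, v, cutoff)
        if f is not None:
            best.append((f, v))
            best.sort(key=lambda pair: pair[0])
            del best[k:]
    return [(v, (n - 1) / f) for f, v in best]


# Hypothetical toy graph: the path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(top_k_closeness(path, 1))
```

On this toy path only two of the five searches run to completion; the other three are cut before expanding a single level, because the bound already matches the best farness found.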

To conclude the story...

  • Total running time: 37 minutes!

Semels ('40)

Corrado ('45)

Flowers ('50-'80)

Welles ('85-'90)

Lee ('95-'00)

Hitler ('05-'10)

Madsen ('14)

Thanks to...

This is my story: dozens of researchers have similar stories (references)

References

  • Michele Borassi, Pierluigi Crescenzi, Michel Habib: Into the Square: On the Complexity of Some Quadratic-time Solvable Problems. Electr. Notes Theor. Comput. Sci. 322: 51-67 (2016)
  • Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, Henning Meyerhenke: Computing Top-k Closeness Centrality Faster in Unweighted Graphs. ALENEX 2016: 68-80
  • Michele Borassi, Pierluigi Crescenzi, Luca Trevisan: An Axiomatic and an Average-Case Analysis of Algorithms and Heuristics for Metric Properties of Graphs. CoRR abs/1604.01445 (2016)
  • Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino, Frank W. Takes: Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs: With an application to the six degrees of separation games. Theor. Comput. Sci. 586: 59-80 (2015)
  • Michele Borassi, David Coudert, Pierluigi Crescenzi, Andrea Marino: On Computing the Hyperbolicity of Real-World Graphs. ESA 2015: 215-226
  • Pilu Crescenzi, Roberto Grossi, Michel Habib, Leonardo Lanzi, Andrea Marino: On computing the diameter of real-world undirected graphs. Theor. Comput. Sci. 514: 84-95 (2013)
  • Pierluigi Crescenzi, Roberto Grossi, Leonardo Lanzi, Andrea Marino: On Computing the Diameter of Real-World Directed (Weighted) Graphs. SEA 2012: 99-110
  • Pierluigi Crescenzi, Roberto Grossi, Leonardo Lanzi, Andrea Marino: A Comparison of Three Algorithms for Approximating the Distance Distribution in Real-World Graphs. TAPAS 2011: 92-103
  • Pierluigi Crescenzi, Roberto Grossi, Claudio Imbrenda, Leonardo Lanzi, Andrea Marino: Finding the Diameter in Real-World Graphs - Experimentally Turning a Lower Bound into an Upper Bound. ESA (1) 2010: 302-313
