A (large) graph mining roundtrip

From practice to theory and back

(two times)

Pierluigi Crescenzi

Leiden, September 23, 2016

Leiden Networks Day

The beginning of the story

• Graph G=(V,E)
• Undirected
• Distance d(u,v): number of edges in a shortest path from u to v
• Connected
• Diameter: maximum distance between two nodes
• Equivalently, maximum eccentricity over all nodes
• Eccentricity: $e(v)=\max_{w \in V} d(v,w)$

Our "toy" network

IMDb graph: an edge between two actors if they played in the same movie

Algorithms for diameter

O(|V||E|): breadth-first search from each node
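The textbook approach above can be sketched in a few lines of Python. This is a minimal illustration, assuming a connected undirected graph stored as a dictionary of adjacency lists; the names `bfs_distances`, `eccentricity`, and `diameter_all_bfs` are hypothetical and not taken from the talk.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricity(adj, v):
    """e(v): largest distance from v (graph assumed connected)."""
    return max(bfs_distances(adj, v).values())

def diameter_all_bfs(adj):
    """One BFS per node, O(|V||E|) overall."""
    return max(eccentricity(adj, v) for v in adj)

# Path 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter_all_bfs(path))
```

Each BFS costs O(|E|) on a connected graph, which is where the O(|V||E|) total comes from.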

Can we do better?

Into the square

• Quadratic algorithms are not feasible
• Look for "hardest" quadratic time solvable problems
• Approach similar to NP-completeness
• Definition of specific reducibility
• Hardness relative to complexity hypothesis
• Similar to P vs NP
• SETH: for every $\epsilon>0$ there is a $k$ such that $k$-SAT cannot be solved in time $O(2^{(1-\epsilon)n})$
• Quadratic time solvable version (k-SAT*)

Quasi-linear reducibility

• $\mathcal{P} \leq_{ql}\mathcal{Q}$
• $I$ instance of $\mathcal{P}$ $\rightarrow$ $\Phi(I)$ instance of $\mathcal{Q}$
• $\Phi(I)$ computable in time $\tilde{O}(|I|)$
• $I$ and $\Phi(I)$: same output
• Linear time computable output mapping
• $\mathcal{P} \leq_{ql}\mathcal{Q}$ and $\mathcal{Q}$ is solvable in time $\tilde{O}(n^{2-\epsilon})$ $\Rightarrow$ $\mathcal{P}$ is solvable in time $\tilde{O}(n^{2-\epsilon})$
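A short worked version of the last implication, under the assumption that the size of the mapped instance is bounded by the time needed to write it down (the symbol $n=|I|$ is introduced here for illustration):

$|\Phi(I)| = \tilde{O}(n) \mathrm{\ since\ the\ mapping\ runs\ in\ time\ } \tilde{O}(n)$
$\mathrm{Solving\ } \mathcal{Q} \mathrm{\ on\ } \Phi(I): \tilde{O}(|\Phi(I)|^{2-\epsilon}) = \tilde{O}(n^{2-\epsilon})$
$\mathrm{Total:\ } \tilde{O}(n) + \tilde{O}(n^{2-\epsilon}) + O(n) = \tilde{O}(n^{2-\epsilon})$

The polylogarithmic factors hidden in $\tilde{O}$ stay polylogarithmic when raised to the power $2-\epsilon$, which is why the reduction may be quasi-linear rather than strictly linear.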

k-SAT*

• Input: two sets of $n$ variables $\{x_i\},\{y_i\}$, the possible assignments to the $x_i$, the possible assignments to the $y_i$, and a set of clauses $C$
• Output: true if $C$ is satisfiable
• $O(n^{2-\epsilon})$ algorithm for $k$-SAT$^*$ $\Rightarrow$ $O(2^{\frac{n}{2}(2-\epsilon)})=O((2^{\frac{2-\epsilon}{2}})^n)$ algorithm for SAT

From disjoint sets to diameter

• Input: set of items $X$ and collection $C$ of subsets of $X$
• Output: true if $C$ has two disjoint sets
• Reduction: clique of $|X|$ nodes (the items), independent set of $|C|$ nodes (the sets), each set node adjacent to its items
• Two sets that intersect: distance 2
• Two sets that do not intersect: distance 3
• Disjoint sets $\leftrightarrow$ diameter is 3
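The reduction can be tried out on small inputs. The sketch below is only an illustration under stated assumptions: every set is non-empty, there are at least two sets, and each set node is joined to the nodes of its items (the edges that make the distances 2 and 3 work out). The function names and the `("item", x)` / `("set", i)` node labels are hypothetical.

```python
from collections import deque
from itertools import combinations

def build_graph(items, collection):
    """Clique on the items, one node per set, set node adjacent to its items."""
    adj = {("item", x): set() for x in items}
    for x, y in combinations(items, 2):
        adj[("item", x)].add(("item", y))
        adj[("item", y)].add(("item", x))
    for i, subset in enumerate(collection):
        adj[("set", i)] = set()
        for x in subset:
            adj[("set", i)].add(("item", x))
            adj[("item", x)].add(("set", i))
    return adj

def diameter(adj):
    """Plain diameter via one BFS per node (graph assumed connected)."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

def has_two_disjoint_sets(items, collection):
    """Two set nodes are at distance 3 exactly when the sets are disjoint."""
    return diameter(build_graph(items, collection)) == 3

print(has_two_disjoint_sets([1, 2, 3, 4], [{1, 2}, {2, 3}, {3, 4}]))
```

The graph has $|X|+|C|$ nodes, so a subquadratic diameter algorithm would give a subquadratic algorithm for the disjoint sets problem on instances where the clique does not dominate the input size.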

Lower bound on the diameter

The 2-sweep heuristic

• $v_1$: starting node
• $v_2$ maximizes $d(v_2,v_1)$
• Max eccentricity of $v_1, v_2$: good lower bound on diameter
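A minimal sketch of the two sweeps, assuming a connected undirected graph given as a dictionary of adjacency lists; `two_sweep` and `bfs_distances` are hypothetical names, and ties for the farthest node are broken arbitrarily.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def two_sweep(adj, v1):
    """Lower bound on the diameter from two BFSs."""
    d1 = bfs_distances(adj, v1)
    v2 = max(d1, key=d1.get)          # a node farthest from v1
    d2 = bfs_distances(adj, v2)
    # eccentricities of v1 and v2; the larger one is the bound
    return max(max(d1.values()), max(d2.values()))

# Path 0 - 1 - 2 - 3 - 4, starting from the middle node
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(two_sweep(path, 2))
```

The result is always a valid lower bound because it is the eccentricity of an actual node; on trees it is known to equal the diameter.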

Lower bound on the diameter

The sumsweep heuristic

• $v_1$: starting node
• $v_2$ maximizes $d(v_2,v_1)$
• $v_3$ maximizes $d(v_3,v_1)+d(v_3,v_2)$
• $v_4$ maximizes $d(v_4,v_1)+d(v_4,v_2)+d(v_4,v_3)$
• Max eccentricity of $v_1, v_2, v_3, v_4$: better lower bound on diameter
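The selection rule generalizes to any number of sweeps. The sketch below assumes a connected undirected graph as a dictionary of adjacency lists; `sum_sweep` and its parameter `k` (number of BFSs, 4 in the slide) are hypothetical names chosen for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def sum_sweep(adj, v1, k=4):
    """Lower bound on the diameter from k BFSs chosen by sum of distances."""
    total = {v: 0 for v in adj}   # sum of distances to the nodes swept so far
    swept = set()
    lower = 0
    v = v1
    for _ in range(k):
        dist = bfs_distances(adj, v)
        swept.add(v)
        lower = max(lower, max(dist.values()))
        for w, d in dist.items():
            total[w] += d
        candidates = [w for w in adj if w not in swept]
        if not candidates:
            break
        # next node: farthest, in total, from all nodes swept so far
        v = max(candidates, key=total.get)
    return lower

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sum_sweep(path, 2))
```

With `k=2` this reduces to the 2-sweep rule, since after the first BFS the sum of distances is just the distance from $v_1$.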

Bounds on node eccentricities

• If BFS from $v$ done: $\underbrace{d(v,w)}_{L_v(w)} \leq ecc(w) \leq \underbrace{d(v,w)+ecc(v)}_{U_v(w)}$
• $U_v(w)$ can be improved
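Both bounds come for free from a single BFS. The following sketch, assuming a connected undirected graph as a dictionary of adjacency lists, computes $L_v(w)$ and $U_v(w)$ for every $w$; `eccentricity_bounds` is a hypothetical name.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricity_bounds(adj, v):
    """Return (L_v, U_v): bounds on ecc(w) for every w, from one BFS at v."""
    dist = bfs_distances(adj, v)
    ecc_v = max(dist.values())
    lower = dict(dist)                                  # L_v(w) = d(v, w)
    upper = {w: d + ecc_v for w, d in dist.items()}     # U_v(w) = d(v, w) + ecc(v)
    return lower, upper

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
lower, upper = eccentricity_bounds(path, 2)
print(lower, upper)
```

The upper bound is the triangle inequality: any node $z$ satisfies $d(w,z) \leq d(w,v) + d(v,z) \leq d(v,w) + ecc(v)$.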

Exact value of diameter

• Vectors $e_L,e_U$: lower and upper bounds
• Vector $S$: sum of distances to already explored nodes
• Start with $e_L=0$, $e_U=\infty$ and sumsweep of $k$ nodes
• At each step, **cleverly** choose next $v$
• At each BFS from $v$:
$e_L(w)=\max(e_L(w),L_v(w))$
$e_U(w)=\min(e_U(w),U_v(w))$
$S(w)=d(v,w)+S(w)$
• Update $e_L, e_U$: $e_L(w)=e_U(w)\Rightarrow e(w) = e_L(w)$
• Terminate when $\max_v e(v) \geq \max_w e_U(w)$

Choosing the next vertex u

• Alternate
• Minimize $e_L(u)$, ties solved by minimizing $S(u)$: should improve upper bounds
• Maximize $e_U(u)$, ties solved by maximizing $S(u)$: should improve lower bounds
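Putting the bounds and the alternating selection rule together gives a simplified exact diameter computation. The sketch below is not the authors' implementation: it assumes a connected undirected graph as a dictionary of adjacency lists, skips the initial sumsweep of $k$ nodes, and uses hypothetical names (`exact_diameter`, `open_nodes`).

```python
from collections import deque
import math

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def exact_diameter(adj, start):
    """Return (diameter, number of BFSs performed)."""
    e_low = {v: 0 for v in adj}
    e_up = {v: math.inf for v in adj}
    total = {v: 0 for v in adj}       # the vector S
    done = set()
    best = 0                          # largest eccentricity computed so far
    bfs_count = 0
    v = start
    pick_max_upper = True
    while True:
        dist = bfs_distances(adj, v)
        bfs_count += 1
        ecc_v = max(dist.values())
        done.add(v)
        best = max(best, ecc_v)
        for w, d in dist.items():
            e_low[w] = max(e_low[w], d)            # L_v(w)
            e_up[w] = min(e_up[w], d + ecc_v)      # U_v(w)
            total[w] += d
        e_low[v] = e_up[v] = ecc_v
        unvisited = [w for w in adj if w not in done]
        # nodes whose eccentricity could still exceed the current best
        open_nodes = [w for w in unvisited if e_up[w] > best]
        if not open_nodes:
            return best, bfs_count
        if pick_max_upper:
            v = max(open_nodes, key=lambda w: (e_up[w], total[w]))
        else:
            v = min(unvisited, key=lambda w: (e_low[w], total[w]))
        pick_max_upper = not pick_max_upper

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(exact_diameter(path, 2))
```

In the worst case every node is visited, which matches the quadratic bound; the point of the selection rule is that on many real-world graphs the loop stops after a handful of BFSs.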

Performance

In theory, as bad as the worst case, but...

Why?

Average case complexity

• Very hard and technical
• Many models
• Are models realistic?
• Which properties are used?
• Axiomatic framework
• Define axioms
• Deduce probabilistic analyses from the axioms
• Prove that random graphs satisfy the axioms
• Show empirically that real-world graphs satisfy the axioms

The models

• Erdős-Rényi model
• Not realistic (all nodes are "equal")
• Heuristics are not efficient on this model
• Random graph with prescribed degree distribution
• Configuration model
• Chung-Lu model
• Norros-Reittu model
• Power law degree distribution: $|\{v\in V:\mathrm{deg}(v)=d\}| \approx nd^{-\beta}$

The axioms

• Some definitions
$\gamma^l(s)=|\{v\in V:d(s,v)=l\}|$
$\tau_s(n^x) = \min\{l:\gamma^l(s)\geq n^x\}$
$T(d \rightarrow n^x) = \mathrm{avg}_{\mathrm{deg}(s)=d}\,\tau_s(n^x)$
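These three quantities are easy to measure on a concrete graph. The sketch below assumes an undirected graph as a dictionary of adjacency lists and takes the threshold $n^x$ as a plain number; the names `level_sizes`, `tau`, and `average_tau` are hypothetical, and nodes that never reach the threshold are left out of the average, which is a choice made here and not stated in the slides.

```python
from collections import deque

def level_sizes(adj, s):
    """gamma[l] = number of nodes at distance exactly l from s."""
    dist = {s: 0}
    queue = deque([s])
    gamma = [1]
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                if dist[w] == len(gamma):
                    gamma.append(0)
                gamma[dist[w]] += 1
                queue.append(w)
    return gamma

def tau(adj, s, threshold):
    """Smallest l with gamma^l(s) >= threshold, or None if never reached."""
    for l, size in enumerate(level_sizes(adj, s)):
        if size >= threshold:
            return l
    return None

def average_tau(adj, degree, threshold):
    """T(degree -> threshold): average tau over nodes of the given degree."""
    values = [tau(adj, s, threshold) for s in adj if len(adj[s]) == degree]
    values = [t for t in values if t is not None]
    return sum(values) / len(values) if values else None

# Star with center 0 and four leaves
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(level_sizes(star, 1), tau(star, 1, 3), average_tau(star, 1, 3))
```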

Axiom 1

[Figure: BFS from $s$ with level sizes $\gamma^1(s)$, $\gamma^2(s)$, ... until a level of size $n^x$ is reached at distance $\tau_s(n^x)$]

$|\{s\in V:\tau_s(n^x)\geq T(\mathrm{deg}(s)\rightarrow n^x)+l\}|\approx\frac{n}{c^l}$

Axiom 2

$d(s,t) \approx \tau_s(n^x)+\tau_t(n^{1-x})-1$

The sum-sweep heuristic

• $\beta>3$: $\leq n^{1+\frac{C}{C+\frac{\beta-1}{\beta-3}}}$, where $C=\frac{2d_{\mathrm{avg}}(n)}{D-d_{\mathrm{avg}}(n)}$ is constant
• $2<\beta<3$: $n^{1+o(1)}$
• $1<\beta<2$: $\leq mn^{1-\frac{2-\beta}{\beta-1}\left(\left\lfloor\frac{\beta-1}{2-\beta}-\frac{3}{2}\right\rfloor-\frac{1}{2}\right)}$

The same story for...

• Hyperbolicity
• Closeness centrality
• Betweenness centrality

Computing closeness top-k

• Definition: $c(v)=\frac{n-1}{\sum_{w \in V-\{v\}}d(v,w)}$
• In theory: complexity $\Theta(n^2)$
• In practice: BFSCut returns 0 if $v$ is not among top $k$, $c(v)$ otherwise

How to cut the BFS

• Let $\gamma_d(v)=|\Gamma_d(v)|$: $f(v) \geq f_d(v) + (d+1)\gamma_{d+1}(v)+(d+2)(r(v)-n_{d+1}(v))$
• Since $n_{d+1}(v)=\gamma_{d+1}(v)+n_{d}(v)$: $f(v) \geq f_d(v) - \gamma_{d+1}(v)+(d+2)(r(v)-n_{d}(v))$
• Since $\gamma_{d+1}(v)\leq\tilde{\gamma}_{d+1}(v)$: $f(v) \geq f_d(v) - \tilde{\gamma}_{d+1}(v)+(d+2)(r(v)-n_{d}(v))$
• Everything is known
• If graph not connected, work on components
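The last inequality can be turned into a cut rule. The sketch below is a simplified illustration, not the published algorithm: it assumes a connected undirected graph as a dictionary of adjacency lists (so $r(v)=n$), counts $v$ itself in $n_d(v)$, and takes $\tilde{\gamma}_{d+1}(v)$ to be the sum of the degrees of the nodes at level $d$. The names `bfs_cut` and `top_k_closeness` and the threshold parameter `x` are hypothetical.

```python
def bfs_cut(adj, v, x):
    """Return c(v), or 0.0 as soon as the farness bound proves c(v) <= x."""
    n = len(adj)
    dist = {v: 0}
    level = [v]
    d = 0
    f_d = 0      # sum of distances to nodes within distance d
    n_d = 1      # number of nodes within distance d, v included
    while level:
        # upper bound on the number of nodes at distance d + 1
        gamma_tilde = sum(len(adj[u]) for u in level)
        f_lower = f_d - gamma_tilde + (d + 2) * (n - n_d)
        if x > 0 and f_lower > 0 and (n - 1) / f_lower <= x:
            return 0.0                      # v cannot beat the threshold
        nxt = []
        for u in level:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    nxt.append(w)
        d += 1
        f_d += d * len(nxt)
        n_d += len(nxt)
        level = nxt
    return (n - 1) / f_d

def top_k_closeness(adj, k):
    """Return up to k (closeness, node) pairs, best first."""
    top = []
    for v in adj:
        # current k-th best closeness, or 0 while fewer than k are known
        x = top[-1][0] if len(top) == k else 0.0
        c = bfs_cut(adj, v, x)
        if c > x:
            top.append((c, v))
            top.sort(key=lambda pair: -pair[0])
            top = top[:k]
    return top

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(top_k_closeness(path, 1))
```

A cut BFS stops as soon as the optimistic closeness $(n-1)/f_{\mathrm{lower}}$ drops to the current $k$-th best value, so most nodes are abandoned after a few levels once good candidates have been found.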

To conclude the story...

• Total running time: 37 minutes!

Semels ('40)

Flowers ('50-'80)

Welles ('85-'90)

Lee ('95-'00)

Hitler ('05-'10)

Thanks to...

This is my story: dozens of researchers have similar stories (references)

• Michele Borassi, Pierluigi Crescenzi, Michel Habib: Into the Square: On the Complexity of Some Quadratic-time Solvable Problems. Electr. Notes Theor. Comput. Sci. 322: 51-67 (2016)
• Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, Henning Meyerhenke: Computing Top-k Closeness Centrality Faster in Unweighted Graphs. ALENEX 2016: 68-80
• Michele Borassi, Pierluigi Crescenzi, Luca Trevisan: An Axiomatic and an Average-Case Analysis of Algorithms and Heuristics for Metric Properties of Graphs. CoRR abs/1604.01445 (2016)
• Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino, Frank W. Takes: Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs: With an application to the six degrees of separation games. Theor. Comput. Sci. 586: 59-80 (2015)
• Michele Borassi, David Coudert, Pierluigi Crescenzi, Andrea Marino: On Computing the Hyperbolicity of Real-World Graphs. ESA 2015: 215-226
• Pilu Crescenzi, Roberto Grossi, Michel Habib, Leonardo Lanzi, Andrea Marino: On computing the diameter of real-world undirected graphs. Theor. Comput. Sci. 514: 84-95 (2013)
• Pierluigi Crescenzi, Roberto Grossi, Leonardo Lanzi, Andrea Marino: On Computing the Diameter of Real-World Directed (Weighted) Graphs. SEA 2012: 99-110
• Pierluigi Crescenzi, Roberto Grossi, Leonardo Lanzi, Andrea Marino: A Comparison of Three Algorithms for Approximating the Distance Distribution in Real-World Graphs. TAPAS 2011: 92-103
• Pierluigi Crescenzi, Roberto Grossi, Claudio Imbrenda, Leonardo Lanzi, Andrea Marino: Finding the Diameter in Real-World Graphs - Experimentally Turning a Lower Bound into an Upper Bound. ESA (1) 2010: 302-313

