A (large) graph mining roundtrip

From practice to theory and back

(two times)

Pierluigi Crescenzi

Leiden, September 23, 2016

Leiden Networks Day

The beginning of the story

  • Graph G=(V,E)
    • Undirected
    • Connected
  • Distance d(u,v): number of edges in a shortest path from u to v
  • Diameter: maximum distance
    • Maximum eccentricity of all nodes
      • e(v)=max_w d(v,w)

Our "toy" network

IMDB graph: an edge between two actors if they played in the same movie

Algorithms for diameter

O(|V||E|): breadth-first search from each node
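A minimal sketch of this textbook approach, assuming a connected, undirected, unweighted graph stored as a dictionary of adjacency lists (the names `bfs_distances` and `diameter_all_bfs` are illustrative, not taken from the talk):

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict mapping each reachable node to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter_all_bfs(adj):
    """One BFS per node: O(|V||E|) overall on a connected graph."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

# Path on four nodes: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter_all_bfs(path))  # 3
```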

Can we do better?

Into the square

  • Quadratic algorithms are not feasible
  • Look for "hardest" quadratic time solvable problems
    • Approach similar to NP-completeness
    • Definition of specific reducibility
      • Preserving subquadratic solvability
  • Hardness relative to complexity hypothesis
    • Similar to P vs NP
    • SETH: for every ε>0 there is a k such that k-SAT is not solvable in time O((2-ε)^n)
      • Quadratic time solvable version (k-SAT*)

Quasi-linear reducibility

\mathcal{P} \leq_{ql}\mathcal{Q}
I \mathrm{\ instance\ of\ }\mathcal{P} \rightarrow \Phi(I)\mathrm{\ instance\ of\ }\mathcal{Q}
\Phi \mathrm{\ computable\ in\ time\ }\tilde{O}(|I|)
I \mathrm{\ and\ } \Phi(I) \mathrm{\ same\ output}
\mathrm{Linear\ time\ computable\ output\ mapping}
\mathcal{P} \leq_{ql}\mathcal{Q} \mathrm{\ and \ } \mathcal{Q} \mathrm{\ is\ solvable\ in\ time \ } \tilde{O}(n^{2-\epsilon}) \Rightarrow \mathcal{P} \mathrm{\ is\ solvable\ in\ time \ } \tilde{O}(n^{2-\epsilon})

k-SAT*

\mathrm{Input}
\mathrm{Two\ sets \ of\ } n \mathrm{\ variables\ }\{x_i\},\{y_i\}
\mathrm{Set\ of\ clauses\ } C
\mathrm{Possible\ assignments\ to \ } x_i
\mathrm{Possible\ assignments\ to \ } y_i
\mathrm{Output:\ true\ if\ }C\mathrm{\ satisfiable}
O(n^{2-\epsilon})\mathrm{\ algorithm\ for\ } k\text{-}\mathrm{SAT}^* \Rightarrow O(2^{\frac{n}{2}(2-\epsilon)})=O((2^{\frac{2-\epsilon}{2}})^n)\mathrm{\ algorithm\ for\ } \mathrm{SAT}

The reduction web

From disjoint sets to diameter

\mathrm{Input}
\mathrm{Set\ of\ items\ } X
\mathrm{Collection\ } C \mathrm{\ of\ subsets\ of\ } X
\mathrm{Output:\ true\ if\ }C\mathrm{\ has\ two\ disjoint\ sets}
\mathrm{Reduction}
\mathrm{Clique\ of\ }|X|\mathrm{\ nodes}
\mathrm{Independent\ set\ of\ }|C|\mathrm{\ nodes}
\mathrm{Two\ sets\ that\ intersect:\ distance\ }2
\mathrm{Two\ sets\ that\ do\ not\ intersect:\ distance\ }3
\mathrm{Disjoint\ sets\ } \leftrightarrow \mathrm{\ diameter\ is\ } 3
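The reduction can be checked on small instances with the sketch below. It assumes every set in the collection is non-empty and that each set node is linked to the item nodes it contains (the slide does not spell out these edges, so this is an assumption). The function names are illustrative, and the diameter is computed naively, so this only demonstrates correctness of the construction, not any running-time gain.

```python
from collections import deque
from itertools import combinations

def reduction_graph(items, collection):
    """Clique on the items, one node per set, and set-item edges for membership."""
    adj = {("item", x): set() for x in items}
    for x, y in combinations(list(items), 2):
        adj[("item", x)].add(("item", y))
        adj[("item", y)].add(("item", x))
    for i, subset in enumerate(collection):
        adj[("set", i)] = set()  # set nodes are never linked to each other
        for x in subset:
            adj[("set", i)].add(("item", x))
            adj[("item", x)].add(("set", i))
    return adj

def eccentricity(adj, source):
    """Largest BFS distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def has_two_disjoint_sets(items, collection):
    """Two disjoint sets exist exactly when the diameter of the graph is 3."""
    adj = reduction_graph(items, collection)
    return max(eccentricity(adj, v) for v in adj) == 3

print(has_two_disjoint_sets([1, 2, 3, 4], [{1, 2}, {2, 3}, {3, 4}]))  # True
print(has_two_disjoint_sets([1, 2, 3], [{1, 2}, {2, 3}, {1, 3}]))     # False
```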

Lower bound on the diameter

The 2-sweep heuristic

v_1
v_2 \mathrm{ \ maximizes\ } d(v_2,v_1)
\mathrm{Max\ eccentricity \ of\ } v_1, v_2: \mathrm{good\ lower\ bound\ on\ diameter}
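A minimal sketch of the 2-sweep, assuming a connected graph given as a dictionary of adjacency lists and an arbitrary starting node `v1` (function names are illustrative):

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict mapping each reachable node to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def two_sweep(adj, v1):
    """Two BFSes: from v1, then from a node v2 farthest from v1."""
    dist1 = bfs_distances(adj, v1)
    v2 = max(dist1, key=dist1.get)        # v2 maximizes d(v2, v1)
    dist2 = bfs_distances(adj, v2)
    # The larger of the two eccentricities is a lower bound on the diameter.
    return max(max(dist1.values()), max(dist2.values()))

# Path 0 - 1 - 2 - 3 - 4, starting from the middle node
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(two_sweep(path, 2))  # 4
```

On this path the bound equals the diameter; in general it is only a lower bound.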

Lower bound on the diameter

The sumsweep heuristic

v_1
v_2 \mathrm{ \ maximizes\ } d(v_2,v_1)
v_3 \mathrm{ \ maximizes\ } d(v_3,v_1)+d(v_3,v_2)
v_4 \mathrm{ \ maximizes\ } d(v_4,v_1)+d(v_4,v_2)+d(v_4,v_3)
\mathrm{Max\ eccentricity \ of\ } v_1, v_2, v_3, v_4: \mathrm{better\ lower\ bound\ on\ diameter}
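The same idea extends to k sweeps. The sketch below assumes a connected graph as a dictionary of adjacency lists; the parameter `k` and the function names are illustrative. Each new node maximizes the sum of its distances to the nodes already chosen.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict mapping each reachable node to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def sum_sweep(adj, v1, k=4):
    """Return (lower bound on the diameter, list of nodes a BFS was run from)."""
    total = {w: 0 for w in adj}           # sum of distances to the chosen nodes
    chosen, lower_bound, v = [], 0, v1
    for _ in range(min(k, len(adj))):
        chosen.append(v)
        dist = bfs_distances(adj, v)
        lower_bound = max(lower_bound, max(dist.values()))
        for w, d in dist.items():
            total[w] += d
        remaining = [w for w in adj if w not in chosen]
        if not remaining:
            break
        v = max(remaining, key=total.get)  # next node maximizes the distance sum
    return lower_bound, chosen

# Path 0 - 1 - 2 - 3 - 4 - 5, starting from node 2
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
bound, nodes = sum_sweep(path, 2)
print(bound, nodes)
```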

Bounds on node eccentricities

\mathrm{If\ BFS\ from\ } v \mathrm{\ done}
L_v(w) = d(v,w) \leq e(w) \leq d(v,w)+e(v) = U_v(w)
U_v(w) \mathrm{\ can\ be\ improved}
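A minimal sketch of how a single BFS yields these bounds for every node, assuming a connected graph as a dictionary of adjacency lists (the function names are illustrative). It uses the plain upper bound d(v,w)+e(v), not the improved one.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict mapping each reachable node to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricity_bounds(adj, v):
    """Lower and upper bounds on e(w) for every w, from one BFS rooted at v."""
    dist = bfs_distances(adj, v)
    ecc_v = max(dist.values())
    lower = {w: dist[w] for w in adj}            # d(v,w) <= e(w)
    upper = {w: dist[w] + ecc_v for w in adj}    # e(w) <= d(v,w) + e(v)
    return lower, upper

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
lower, upper = eccentricity_bounds(path, 1)
print(lower[4], upper[4])  # 3 6
```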

Exact value of diameter

\mathrm{Vectors\ } e_L,e_U: \mathrm{\ lower\ and\ upper\ bounds}
\mathrm{Vector\ } S: \mathrm{\ sum\ of\ distances\ to\ already\ explored\ nodes}
\mathrm{Start\ with\ } e_L=0, e_U=\infty \mathrm{\ and\ sumsweep\ of\ } k \mathrm{\ nodes}
\mathrm{At\ each\ step,\ \mathbf{cleverly}\ choose\ next\ } v
\mathrm{At\ each\ BFS\ from\ } v:
e_L(w)=\max(e_L(w),L_v(w))
e_U(w)=\min(e_U(w),U_v(w))
S(w)=d(v,w)+S(w)
\mathrm{Update\ } e_L, e_U : e_L(w)=e_U(w)\Rightarrow e(w) = e_L(w)
\mathrm{Terminate\ when\ } \max_v e(v) \geq \max_w e_U(w)

Choosing the next vertex u

\mathrm{Alternate}
\mathrm{Minimize\ } e_L(u) \mathrm{\ (should\ improve\ upper\ bounds)}
\mathrm{Ties\ solved\ by\ minimizing\ } S(u)
\mathrm{Maximize\ } e_U(u) \mathrm{\ (should\ improve\ lower\ bounds)}
\mathrm{Ties\ solved\ by\ maximizing\ } S(u)
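The bound updates and the alternating choice of the next node can be combined as in the sketch below. It is a simplified illustration, not the authors' implementation. It assumes a connected graph as a dictionary of adjacency lists, starts from a single node instead of an initial sumsweep, uses the plain bounds d(v,w) and d(v,w)+e(v), and stops once no node with an undetermined eccentricity has an upper bound above the largest eccentricity already known.

```python
from collections import deque
from math import inf

def bfs_distances(adj, source):
    """Return a dict mapping each reachable node to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def exact_diameter(adj, start):
    """Return (diameter, number of BFSes performed)."""
    e_low = {w: 0 for w in adj}
    e_up = {w: inf for w in adj}
    S = {w: 0 for w in adj}               # sum of distances to explored nodes
    v, bfs_count, minimize_lower = start, 0, True
    while True:
        dist = bfs_distances(adj, v)
        ecc_v = max(dist.values())
        bfs_count += 1
        for w in adj:
            e_low[w] = max(e_low[w], dist[w])
            e_up[w] = min(e_up[w], dist[w] + ecc_v)
            S[w] += dist[w]
        e_low[v] = e_up[v] = ecc_v        # e(v) is now known exactly
        # Largest eccentricity that is already determined (bounds coincide).
        known = max(e_low[w] for w in adj if e_low[w] == e_up[w])
        candidates = [w for w in adj if e_low[w] < e_up[w] and e_up[w] > known]
        if not candidates:
            return known, bfs_count
        if minimize_lower:                # likely central node: tightens upper bounds
            v = min(candidates, key=lambda w: (e_low[w], S[w]))
        else:                             # likely peripheral node: raises lower bounds
            v = max(candidates, key=lambda w: (e_up[w], S[w]))
        minimize_lower = not minimize_lower

# Star with center 0 and five leaves, starting from a leaf
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(exact_diameter(star, 1))  # (2, 2)
```

On the star, two BFSes suffice: the second one, from the center, makes the bounds of all leaves coincide.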

Performance

In theory, as bad as the worst case, but...

Why?

Average case complexity

  • Very hard and technical
  • Many models
  • Are models realistic?
  • Which properties are used?
  • Axiomatic framework
    • Define axioms
    • Deduce probabilistic analyses from the axioms
    • Prove that random graphs satisfy the axioms
    • Show empirically that real-world graphs satisfy the axioms

The models

  • Erdős-Rényi model
    • Not realistic (all nodes are "equal")
    • Heuristics are not efficient on this model
  • Random graph with prescribed degree distribution
    • Configuration model
    • Chung-Lu model
    • Norros-Reittu model
  • Power law degree distribution
|\{v\in V:\mathrm{deg}(v)=d\}| \approx nd^{-\beta}

The axioms

  • Some definitions
\gamma^l(s)=|\{v\in V:d(s,v)=l\}|
\tau_s(n^x) = \min\{l:\gamma^l(s)\geq n^x\}
T(d \rightarrow n^x) = \mathrm{\ avg}_{\mathrm{deg}(s)=d}\tau_s(n^x)
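To make the first two definitions concrete, the sketch below computes the level sizes γ^l(s) and the first level whose size reaches a given threshold. It assumes a graph as a dictionary of adjacency lists; the function names and the convention of returning `None` when no level is large enough are illustrative choices.

```python
from collections import deque

def level_sizes(adj, s):
    """sizes[l] is the number of nodes at distance exactly l from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    sizes = [0] * (max(dist.values()) + 1)
    for d in dist.values():
        sizes[d] += 1
    return sizes

def tau(adj, s, threshold):
    """Smallest l whose level has at least `threshold` nodes, or None."""
    for l, size in enumerate(level_sizes(adj, s)):
        if size >= threshold:
            return l
    return None

# Star with center 0 and five leaves
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(level_sizes(star, 1))  # [1, 1, 4]
print(tau(star, 1, 3))       # 2
```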

...

Axiom 1

[Figure: node s with its neighborhoods \gamma^1(s), \gamma^2(s), ..., reaching size n^x at distance \tau_s(n^x)]

|\{s\in V:\tau_s(n^x)\geq T(\mathrm{deg}(s)\rightarrow n^x)+l\}|\approx\frac{n}{c^l}

Axiom 2

d(s,t) \approx \tau_s(n^x)+\tau_t(n^{1-x})-1

The sum-sweep heuristic

\beta>3:\ \leq n^{1+\frac{C}{C+\frac{\beta-1}{\beta-3}}} \mathrm{,\ where\ } C=\frac{2d_{\mathrm{avg}}(n)}{D-d_{\mathrm{avg}}(n)}\mathrm{\ is\ constant}
2<\beta<3:\ n^{1+o(1)}
1<\beta<2:\ \leq mn^{1-\frac{2-\beta}{\beta-1}\left(\left\lfloor\frac{\beta-1}{2-\beta}-\frac{3}{2}\right\rfloor-\frac{1}{2}\right)}

The same story for...

  • Hyperbolicity
  • Closeness centrality
  • Betweenness centrality

Computing closeness top-k

\mathrm{Definition:\ }c(v)=\frac{n-1}{\sum_{w \in V-\{v\}}d(v,w)}
\mathrm{In\ theory\ complexity\ }\Theta(n^2)
\mathrm{In\ practice}
\mathrm{BFSCut\ returns\ 0\ if\ } v \mathrm{\ is\ not\ among\ top\ } k, c(v)\mathrm{\ otherwise}

How to cut the BFS

\mathrm{With\ } \gamma_d(v)=|\Gamma_d(v)|: f(v) \geq f_d(v) + (d+1)\gamma_{d+1}(v)+(d+2)(r(v)-n_{d+1}(v))
\mathrm{Since\ } n_{d+1}(v)=\gamma_{d+1}(v)+n_{d}(v): f(v) \geq f_d(v) - \gamma_{d+1}(v)+(d+2)(r(v)-n_{d}(v))
\mathrm{Since\ } \gamma_{d+1}(v)\leq\tilde{\gamma}_{d+1}(v): f(v) \geq f_d(v) - \tilde{\gamma}_{d+1}(v)+(d+2)(r(v)-n_{d}(v))
  • Everything is known
    • If graph not connected, work on components
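The sketch below shows how the last inequality can be used to cut a BFS early. It reads the symbols as follows, which is an assumption since the slides do not define them: f(v) is the sum of distances from v, f_d(v) the same sum restricted to nodes within distance d, n_d(v) the number of nodes within distance d (including v), r(v) the number of reachable nodes, and the bound on the size of level d+1 is taken as the sum of the degrees of the nodes at level d. The graph is assumed connected with at least two nodes, so r(v) = n. Function names are illustrative, and a cut is signalled with `None` instead of 0.

```python
import heapq

def bfs_cut(adj, v, x):
    """Return the sum of distances from v, or None once it provably exceeds x."""
    n = len(adj)
    seen = {v}
    level, d = [v], 0
    f_d, n_d = 0, 1
    while level:
        gamma_tilde = sum(len(adj[u]) for u in level)   # bound on next level size
        if f_d - gamma_tilde + (d + 2) * (n - n_d) > x:
            return None                                 # cut: v cannot be in the top k
        next_level = []
        for u in level:
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    next_level.append(w)
        d += 1
        f_d += d * len(next_level)
        n_d += len(next_level)
        level = next_level
    return f_d

def top_k_closeness(adj, k):
    """Return the k largest closeness values as (closeness, node) pairs."""
    n = len(adj)
    best = []                         # max-heap on distance sums via negation
    for v in adj:
        x = -best[0][0] if len(best) == k else float("inf")
        f = bfs_cut(adj, v, x)
        if f is None:
            continue
        if len(best) < k:
            heapq.heappush(best, (-f, v))
        elif f < x:
            heapq.heapreplace(best, (-f, v))
    return sorted([((n - 1) / -neg_f, v) for neg_f, v in best], reverse=True)

# Path 0 - 1 - 2 - 3 - 4: the middle node is the most central one
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(top_k_closeness(path, 2))
```

The threshold `x` is the k-th smallest distance sum found so far, so a node is cut only when its closeness cannot beat the current k-th best value.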

To conclude the story...

  • Total running time: 37 minutes!

Semels ('40)

Corrado ('45)

Flowers ('50-'80)

Welles ('85-'90)

Lee ('95-'00)

Hitler ('05-'10)

Madsen ('14)

Thanks to...

This is my story: dozens of researchers have similar stories (references)

  • Michele Borassi, Pierluigi Crescenzi, Michel Habib: Into the Square: On the Complexity of Some Quadratic-time Solvable Problems. Electr. Notes Theor. Comput. Sci. 322: 51-67 (2016)
  • Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, Henning Meyerhenke: Computing Top-k Closeness Centrality Faster in Unweighted Graphs. ALENEX 2016: 68-80
  • Michele Borassi, Pierluigi Crescenzi, Luca Trevisan: An Axiomatic and an Average-Case Analysis of Algorithms and Heuristics for Metric Properties of Graphs. CoRR abs/1604.01445 (2016)
  • Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino, Frank W. Takes: Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs: With an application to the six degrees of separation games. Theor. Comput. Sci. 586: 59-80 (2015)
  • Michele Borassi, David Coudert, Pierluigi Crescenzi, Andrea Marino: On Computing the Hyperbolicity of Real-World Graphs. ESA 2015: 215-226
  • Pilu Crescenzi, Roberto Grossi, Michel Habib, Leonardo Lanzi, Andrea Marino: On computing the diameter of real-world undirected graphs. Theor. Comput. Sci. 514: 84-95 (2013)
  • Pierluigi Crescenzi, Roberto Grossi, Leonardo Lanzi, Andrea Marino: On Computing the Diameter of Real-World Directed (Weighted) Graphs. SEA 2012: 99-110
  • Pierluigi Crescenzi, Roberto Grossi, Leonardo Lanzi, Andrea Marino: A Comparison of Three Algorithms for Approximating the Distance Distribution in Real-World Graphs. TAPAS 2011: 92-103
  • Pierluigi Crescenzi, Roberto Grossi, Claudio Imbrenda, Leonardo Lanzi, Andrea Marino: Finding the Diameter in Real-World Graphs - Experimentally Turning a Lower Bound into an Upper Bound. ESA (1) 2010: 302-313

References
