On Community Detection, a.k.a. Graph Clustering
What is a graph?
Edge or Connection
Vertex or Node
Our Problem: Finding communities in graphs
All possible combinations:
n= 20 more than
A generic clustering problem :
combinations
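To get a feel for how quickly the search space grows, the sketch below counts two kinds of groupings of n labelled nodes: splits into two non-empty groups, and partitions into any number of groups (the Bell number). Which of the two the slide counts is an assumption, and the function names are illustrative only.

```python
def count_bipartitions(n: int) -> int:
    """Ways to split n labelled nodes into two non-empty groups."""
    return 2 ** (n - 1) - 1


def bell_number(n: int) -> int:
    """Ways to partition n labelled nodes into any number of groups.

    Uses the Bell triangle: each row starts with the last entry of the
    previous row, and every next entry adds the entry above-left.
    """
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


for n in (5, 10, 20):
    print(n, count_bipartitions(n), bell_number(n))
```

Both counts grow far too fast for exhaustive search, which is what motivates the relaxations that follow.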
Relaxation, a.k.a. approximation
Difficult problem
Relaxed problem
Must hold:
Convex relaxation
In a convex problem, any optimum that is found is a global optimum
Community detection in many fields:
Graph theory 1: linear algebra representation
weighted edge
directed edge
self-loop
W := adjacency matrix
Example of adjacency matrix of a graph with communities
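As a small stand-in for the pictured example, the sketch below builds the adjacency matrix W of a hypothetical 6-node undirected, unweighted graph: two triangles joined by a single edge. The edge list is an assumption made for illustration. With nodes ordered by community, W shows dense diagonal blocks and sparse off-diagonal blocks.

```python
import numpy as np

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridged by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

W = np.zeros((n, n), dtype=int)
for i, j in edges:
    W[i, j] = 1
    W[j, i] = 1  # undirected graph: W is symmetric

print(W)
```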
Graph theory 2: basic concepts
Path-length 2:
Path-length 3:
Graph theory 3: counting walks of length 2
on undirected unweighted graphs
Graph theory 3: counting walks of length n
on undirected unweighted graphs
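For an undirected, unweighted graph, entry (i, j) of the n-th power of W is the number of walks of length n from node i to node j. A minimal sketch on a hypothetical 3-node path graph (the graph and the helper name `count_walks` are illustrative):

```python
import numpy as np


def count_walks(W, length):
    """Entry (i, j) of the result is the number of walks of `length` from i to j."""
    return np.linalg.matrix_power(W, length)


# Hypothetical path graph 0 - 1 - 2.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

walks2 = count_walks(W, 2)
walks3 = count_walks(W, 3)
print(walks2)
print(walks3)
```

The diagonal of the squared matrix recovers the node degrees, since every walk of length 2 from a node back to itself goes out and back along one edge.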
Graph theory 4: Centrality, Communicability and Betweenness
Which vertices/edges are important?
Centrality: Importance of a node
Communicability: well-connectedness between 2 nodes
Betweenness: How much information flows through a node or edge
well-communicated
high centrality
high betweenness
Graph theory 4: Centrality, Communicability and Betweenness
Centrality node i:
Communicability node i and j:
Betweenness node r:
We can define in terms of walks
down-weighting parameter
walks of length n
Graph theory 4: Centrality, Communicability and Betweenness
Special case:
Then:
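The slide formulas are not reproduced here, so the following is only a sketch under a stated assumption: walks of length n are down-weighted by 1/n!, in which case the weighted sum of walk counts equals the matrix exponential of W. Under that assumption the diagonal entries act as a centrality score and the off-diagonal entries as a communicability score. Betweenness is not covered, and all names are illustrative.

```python
import math

import numpy as np


def weighted_walk_sum(W, weights):
    """Return sum_n weights[n] * W^n; (W^n)[i, j] counts walks of length n."""
    total = np.zeros(W.shape, dtype=float)
    power = np.eye(W.shape[0])  # W^0
    for c in weights:
        total += c * power
        power = power @ W
    return total


def expm_symmetric(W):
    """Matrix exponential of a symmetric W via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(W)
    return (eigvecs * np.exp(eigvals)) @ eigvecs.T


# Hypothetical star graph: node 0 is linked to nodes 1, 2 and 3.
W = np.zeros((4, 4))
W[0, 1:] = 1
W[1:, 0] = 1

# Assumed special case: down-weight walks of length n by 1/n!.
factorial_weights = [1 / math.factorial(n) for n in range(30)]
G_series = weighted_walk_sum(W, factorial_weights)
G = expm_symmetric(W)

centrality = np.diag(G)  # centrality of node i: G[i, i]
print(centrality)        # communicability of i and j: G[i, j]
```

In this toy graph the hub gets the highest centrality, and a hub-leaf pair communicates better than a leaf-leaf pair.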
Graph theory 5: the graph-Laplacian
Very nice properties:
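The sketch below checks some standard properties of the unnormalized Laplacian L = D - W on a hypothetical two-triangle graph. Whether these are exactly the properties listed on the slide, and whether the slide uses the unnormalized form, are assumptions. The properties checked are symmetry, zero row sums, non-negative eigenvalues, and the identity that x^T L x equals the sum over edges of w_ij (x_i - x_j)^2.

```python
import numpy as np


def graph_laplacian(W):
    """L = D - W, with D the diagonal matrix of (weighted) degrees."""
    return np.diag(W.sum(axis=1)) - W


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridged by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1

L = graph_laplacian(W)
x = np.array([1, 1, 1, -1, -1, -1])  # +1 / -1 community labels

row_sums = L @ np.ones(6)             # all zeros: the ones vector has eigenvalue 0
min_eig = np.linalg.eigvalsh(L).min() # >= 0 up to round-off: L is PSD
quad = x @ L @ x                      # quadratic form
edge_sum = sum(W[i, j] * (x[i] - x[j]) ** 2 for i, j in edges)
print(quad, edge_sum)
```

With labels in {+1, -1}, each cut edge contributes 4 to the quadratic form, so minimizing x^T L x minimizes the number of edges between the two groups.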
Back to our problem: Community detection
In terms of the graph-Laplacian
has a trivial solution
We have to introduce a balancing constraint
but it becomes difficult to solve...
Spectral Relaxation
Expanded feasible set for
but still, non-convex...
Why is this non-convex relaxation good?
Eigendecomposition and :
From properties of L:
The second eigenvector is the solution for the relaxed problem!
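A minimal sketch of the resulting procedure, assuming the unnormalized Laplacian and a simple sign threshold to turn the second eigenvector (the eigenvector of the second-smallest eigenvalue) into two communities. The toy graph and the function name are illustrative.

```python
import numpy as np


def spectral_bipartition(W):
    """Split a graph in two using the sign of the second Laplacian eigenvector."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    second = eigvecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue
    labels = np.where(second >= 0, 1, -1)
    return labels, second


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridged by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1

labels, second = spectral_bipartition(W)
print(labels)
```

The overall sign of an eigenvector is arbitrary, so only the grouping matters, not which side is labelled +1.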
Orthogonal matrix
Semidefinite relaxation (SDR)
change of variables
equivalent
relaxed problem
Convex problem!
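For reference, one common way to write these steps is given below. It assumes the balanced two-community problem with labels x in {-1, 1}^n and uses Y for the lifted variable, matching the later slides. The exact constraints on the original slide may differ.

```latex
\begin{aligned}
\text{difficult problem:}\quad
  & \min_{x \in \{-1,1\}^n} \; x^\top L x
    \quad \text{s.t. } \mathbf{1}^\top x = 0 \\
\text{change of variables:}\quad
  & Y = x x^\top \;\Rightarrow\; x^\top L x = \operatorname{tr}(L Y) \\
\text{equivalent:}\quad
  & \min_{Y} \; \operatorname{tr}(L Y)
    \quad \text{s.t. } Y_{ii} = 1,\;
    \mathbf{1}^\top Y \mathbf{1} = 0,\;
    Y \succeq 0,\; \operatorname{rank}(Y) = 1 \\
\text{relaxed problem:}\quad
  & \min_{Y} \; \operatorname{tr}(L Y)
    \quad \text{s.t. } Y_{ii} = 1,\;
    \mathbf{1}^\top Y \mathbf{1} = 0,\;
    Y \succeq 0
\end{aligned}
```

Dropping the rank constraint leaves a linear objective over an intersection of linear constraints and the positive semidefinite cone, which is why the relaxed problem is convex.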
Extracting solutions from SDR 1:
Low-rank approximation + k-means
Low-rank approximation of Y
An optimal Y
Ordered spectrum of optimal Y
K-means with the rows of V as features
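The steps above can be sketched as follows. The matrix Y here is a synthetic stand-in for an SDR solution (a rank-one label matrix plus small noise), not the output of a solver, and the k-means routine is a bare-bones version with a deterministic farthest-point initialization chosen for this illustration.

```python
import numpy as np


def top_k_embedding(Y, k):
    """Rows of V: the k leading eigenvectors of Y, scaled by sqrt(eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(Y)       # ascending order
    idx = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0.0, None))


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


# Synthetic stand-in for an optimal Y: x x^T plus small symmetric noise.
rng = np.random.default_rng(0)
x_true = np.array([1, 1, 1, -1, -1, -1])
noise = 0.05 * rng.standard_normal((6, 6))
Y = np.outer(x_true, x_true) + (noise + noise.T) / 2

V = top_k_embedding(Y, k=2)   # low-rank approximation: Y is close to V V^T
labels = kmeans(V, k=2)       # each row of V is one node's feature vector
print(labels)
```

The ordered spectrum of Y indicates how many eigenvectors to keep: here one eigenvalue dominates, so the leading column of V already separates the two groups.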
Extracting solutions from SDR 2: Randomization
X as a random variable
Stochastic Optimization Problem
equivalent to SDR
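A sketch of one standard randomization scheme, assumed here rather than taken from the slides: draw Gaussian vectors with covariance Y, round each to +1 / -1 labels by sign, discard samples that violate the balance constraint (taken to be a zero label sum), and keep the one with the smallest cut cost x^T L x. The Y used is a synthetic rank-one stand-in, not a solver output.

```python
import numpy as np


def randomized_rounding(Y, L, n_samples=100, seed=0):
    """Sample xi ~ N(0, Y), round to signs, keep the best balanced candidate."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    # Factor Y = A A^T via the eigendecomposition (works for PSD, not only PD).
    eigvals, eigvecs = np.linalg.eigh(Y)
    A = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    best_x, best_cost = None, np.inf
    for _ in range(n_samples):
        xi = A @ rng.standard_normal(n)   # xi has covariance Y
        x = np.where(xi >= 0, 1, -1)
        if x.sum() != 0:                  # assumed balance constraint
            continue
        cost = float(x @ L @ x)
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridged by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1
L = np.diag(W.sum(axis=1)) - W

x_true = np.array([1, 1, 1, -1, -1, -1])
Y = np.outer(x_true, x_true).astype(float)  # stand-in for an SDR solution

best_x, best_cost = randomized_rounding(Y, L)
print(best_x, best_cost)
```

If no sample satisfies the balance constraint, the function returns `None` for the labels, so a practical version would need a fallback or a tolerance.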
Augmented Adjacency Matrix 1: the idea
Recall
should :
e.g.
Augmented adjacency matrix 2: Communicability
Augmented adjacency matrix 3: Distance
Synthetic data: Stochastic Block Model (SBM)
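A minimal sketch of an SBM sampler, assuming the simplest version of the model: nodes in the same block are linked with probability `p_in` and nodes in different blocks with probability `p_out`. The parameter names and values are illustrative, not those of the experiments.

```python
import numpy as np


def sample_sbm(sizes, p_in, p_out, seed=0):
    """Sample an undirected, unweighted SBM graph; returns (W, labels)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    # Draw each pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    W = (upper | upper.T).astype(int)
    return W, labels


W, labels = sample_sbm([20, 20], p_in=0.8, p_out=0.05)
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
print("within-block density:", W[same & off_diag].mean())
print("between-block density:", W[~same].mean())
```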
Synthetic data: degree-corrected (DC) SBM
Results on synthetic data
Experiments on real datasets
Zachary Karate club
Bottlenose Dolphins network
Results on real datasets
Silhouette index:
Modularity:
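For reference, a sketch of the modularity score in its usual Newman form, Q = (1/2m) sum_ij (W_ij - k_i k_j / 2m) delta(c_i, c_j), which is assumed to be the definition used here. The toy graph is hypothetical.

```python
import numpy as np


def modularity(W, labels):
    """Newman modularity of a partition of an undirected graph."""
    labels = np.asarray(labels)
    degrees = W.sum(axis=1)
    two_m = degrees.sum()                          # twice the number of edges
    same = labels[:, None] == labels[None, :]      # delta(c_i, c_j)
    B = W - np.outer(degrees, degrees) / two_m     # observed minus expected
    return (B * same).sum() / two_m


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridged by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1

q_split = modularity(W, [0, 0, 0, 1, 1, 1])  # the two triangles
q_single = modularity(W, [0, 0, 0, 0, 0, 0]) # everything in one community
print(q_split, q_single)
```

Higher values mean more within-community edges than expected from the degrees alone; putting every node in one community always scores zero.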
Conclusions