Convex Relaxation Techniques
On community detection, a.k.a. graph clustering
What is a graph?
Edge or Connection
Vertex or Node
Our Problem: Finding communities in graphs
All possible combinations:
n= 20 more than
A generic clustering problem :
combinations
Relaxation, a.k.a. approximation
Difficult problem
Relaxed problem
Must hold:
Convex relaxation
In a convex problem, any local optimum that is found is also a global optimum
Community detection in many fields:
- in biology, to find groups of proteins with similar functions that help explain biological processes
- in social science, to find groups that share traits, e.g. potential research collaborations
- in political science, to find groups with a similar ideology
- in ecology, to find species
- and many more...
Graph theory 1: linear algebra representation
weighted edge
directed edge
self-loop
W := adjacency matrix
Example of adjacency matrix of a graph with communities
Graph theory 2: basic concepts
- min-Distance:
Path-length 2:
Path-length 3:
- Node degree: number of adjacent nodes
- Path: ordered sequence of edges that joins two nodes
Graph theory 3: counting walks of length-2
on undirected unweighted graphs
Graph theory 3: counting walks of length-n
on undirected unweighted graphs
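The walk-counting idea on these slides can be checked numerically. The sketch below relies on the standard fact that, for an undirected unweighted graph with adjacency matrix W, entry (i, j) of W^n counts the walks of length n from i to j. The helper name `count_walks` and the 3-node path graph are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def count_walks(W, n):
    """Matrix whose (i, j) entry counts walks of length n from node i to node j."""
    return np.linalg.matrix_power(np.asarray(W, dtype=np.int64), n)

# Illustrative path graph 0 - 1 - 2 (undirected, unweighted).
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

walks2 = count_walks(W, 2)   # walks of length 2
walks3 = count_walks(W, 3)   # walks of length 3

# The diagonal of W^2 equals the node degrees: each edge can be walked out and back.
degrees = W.sum(axis=1)
```

Here `walks2[0, 2]` is 1 (the single walk 0 → 1 → 2) and `walks3[0, 1]` is 2 (0 → 1 → 0 → 1 and 0 → 1 → 2 → 1).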
Graph theory 4: Centrality, Communicability and Betweenness
Which vertices/edges are important?
- Centrality: importance of a node
- Communicability: well-connectedness between two nodes
- Betweenness: how much information flows through a node or edge
well-communicated
high centrality
high betweenness
Graph theory 4: Centrality, Communicability and Betweenness
- Centrality of node i:
- Communicability between nodes i and j:
- Betweenness of node r:
We can define these in terms of walks
down-weighting parameter
walks of length n
Graph theory 4: Centrality, Communicability and Betweenness
Special case:
Then..
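The formulas on these slides did not survive, so the following is only a sketch under a stated assumption: walks of length n are down-weighted by 1/n!, which is one common choice. Under that assumption the weighted sum of walk counts is the matrix exponential of W; its off-diagonal entries are often used as communicability between two nodes and its diagonal as a centrality score. The function names are hypothetical.

```python
import numpy as np

def weighted_walk_sum(W, n_max=30):
    """Truncated sum over n of W^n / n! (factorial down-weighting is an assumption)."""
    W = np.asarray(W, dtype=float)
    term = np.eye(W.shape[0])        # n = 0 term: W^0 / 0!
    total = term.copy()
    for n in range(1, n_max + 1):
        term = term @ W / n          # turns W^(n-1)/(n-1)! into W^n/n!
        total += term
    return total

def expm_symmetric(W):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(np.asarray(W, dtype=float))
    return (evecs * np.exp(evals)) @ evecs.T

# Illustrative path graph 0 - 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

G = weighted_walk_sum(W)             # G[i, j]: communicability under this weighting
centrality = np.diag(G)              # closed walks starting and ending at each node
```

On this path graph the adjacent pair (0, 1) gets a higher score than the end-to-end pair (0, 3), and the inner nodes score higher on the diagonal than the end nodes.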
Graph theory 5: the graph-Laplacian
Very nice properties:
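- For any vector $x$: $x^\top L x = \tfrac{1}{2} \sum_{i,j} W_{ij} (x_i - x_j)^2$
- $L$ is symmetric positive semidefinite: $L = L^\top$, $L \succeq 0$
- $\mathbf{1}$ (the all-ones vector) is always an eigenvector with eigenvalue 0: $L \mathbf{1} = 0$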
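These properties can be checked numerically. The sketch assumes the unnormalized Laplacian L = D - W, where D is the diagonal degree matrix and W the adjacency matrix; the slide's own definition is not recoverable, so treat that choice and the small example graph as assumptions.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W (assumed definition)."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

# Illustrative graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(W)
n = W.shape[0]

# Quadratic form versus the pairwise sum of squared differences.
rng = np.random.default_rng(0)
x = rng.normal(size=n)
quad = x @ L @ x
pairwise = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))

eigenvalues = np.linalg.eigvalsh(L)   # ascending; all >= 0 up to round-off
ones = np.ones(n)                     # L @ ones is the zero vector
```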
Back to our problem: Community detection
In terms of the graph-Laplacian
has a trivial solution
We have to introduce a balancing constraint
but it becomes difficult to solve...
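To see both the trivial solution and the cost of the balancing constraint, here is a brute-force sketch. It assumes the usual encoding (not confirmed by the slides): a partition is a vector x with entries in {-1, +1}, the cut weight is x^T L x / 4 with L = D - W, and balance means the entries of x sum to zero. The function names and the example graph are hypothetical.

```python
import itertools
import numpy as np

def laplacian(W):
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def cut_value(L, x):
    """Weight of edges crossing the partition encoded by x in {-1, +1}^n."""
    return float(x @ L @ x) / 4.0

def brute_force_bisection(W):
    """Enumerate all 2^n sign vectors and keep the best balanced one."""
    L = laplacian(W)
    n = L.shape[0]
    best_x, best_cut = None, np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        x = np.array(signs)
        if x.sum() != 0:              # balancing constraint: equal-sized groups
            continue
        c = cut_value(L, x)
        if c < best_cut:
            best_x, best_cut = x, c
    return best_x, best_cut

# Illustrative graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

trivial_cut = cut_value(laplacian(W), np.ones(6))   # everything in one group
best_x, best_cut = brute_force_bisection(W)
```

Without the balance constraint, putting every node in one group gives a cut of 0. With it, the search has to visit exponentially many sign vectors, which is what motivates the relaxations that follow.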
Spectral Relaxation
Expanded feasible set for
but still, non-convex...
Why is this non-convex relaxation good?
Eigendecomposition and :
From properties of L:
The second eigenvector is the solution for the relaxed problem!
Orthogonal matrix
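A minimal sketch of the spectral recipe, assuming L = D - W and rounding by the sign of the second eigenvector (often called the Fiedler vector). The rounding rule and the function name are assumptions; the slides only state that the second eigenvector solves the relaxed problem.

```python
import numpy as np

def spectral_bisection(W):
    """Split nodes by the sign of the eigenvector of the 2nd-smallest eigenvalue of L."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = evecs[:, 1]                 # second eigenvector
    labels = (fiedler > 0).astype(int)    # sign rounding (an assumed choice)
    return labels, fiedler, evals

# Illustrative graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

labels, fiedler, evals = spectral_bisection(W)
```

The overall sign of an eigenvector is arbitrary, so which triangle gets label 0 may differ between runs or platforms; only the grouping is meaningful.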
Semidefinite relaxation (SDR)
change of variables
equivalent
relaxed problem
Convex problem!
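The equations for this slide were lost, so the chain below is a reconstruction under assumptions: the original problem is the balanced ±1 cut written with the Laplacian L, and the lifted variable is called Y (the symbol used on the later extraction slides).

```latex
\begin{aligned}
\text{original:}\quad
  & \min_{x \in \mathbb{R}^n} \; x^\top L x
    \quad \text{s.t.}\quad x_i^2 = 1 \;\; \forall i, \qquad \mathbf{1}^\top x = 0 \\[4pt]
\text{change of variables:}\quad
  & Y = x x^\top
    \;\Longrightarrow\;
    x^\top L x = \operatorname{tr}(L Y), \quad
    x_i^2 = Y_{ii}, \quad
    (\mathbf{1}^\top x)^2 = \mathbf{1}^\top Y \mathbf{1} \\[4pt]
\text{equivalent:}\quad
  & \min_{Y} \; \operatorname{tr}(L Y)
    \quad \text{s.t.}\quad Y_{ii} = 1, \;\; \mathbf{1}^\top Y \mathbf{1} = 0, \;\;
    Y \succeq 0, \;\; \operatorname{rank}(Y) = 1 \\[4pt]
\text{relaxed:}\quad
  & \min_{Y} \; \operatorname{tr}(L Y)
    \quad \text{s.t.}\quad Y_{ii} = 1, \;\; \mathbf{1}^\top Y \mathbf{1} = 0, \;\;
    Y \succeq 0
\end{aligned}
```

Dropping the rank constraint is the only relaxation step. What remains is a linear objective over an intersection of linear constraints and the positive semidefinite cone, which is why the relaxed problem is convex.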
Extracting solutions from SDR 1:
Low rank approximation + k-means
Low-rank approximation of Y
An optimal Y
Ordered spectrum of optimal Y
K-means with V rows as features
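The extraction step can be sketched as follows: keep the leading eigenpairs of Y, use the rows of the scaled eigenvector matrix V as node features, and run k-means on them. The stand-in matrix Y below is hypothetical (a noisy rank-one matrix, not an actual SDR solution), and the small k-means with farthest-point initialisation is an illustrative choice rather than the author's implementation.

```python
import numpy as np

def low_rank_embedding(Y, rank):
    """Rows of V * sqrt(lambda) for the `rank` largest eigenpairs of symmetric Y."""
    evals, evecs = np.linalg.eigh(Y)                  # ascending order
    top = np.argsort(evals)[::-1][:rank]
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))

def kmeans(points, k, n_iter=50):
    """Lloyd iterations with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical stand-in for an optimal Y: rank-one community structure plus noise.
x0 = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
Y = 0.9 * np.outer(x0, x0) + 0.1 * np.eye(6)

V = low_rank_embedding(Y, rank=2)    # one feature row per node
labels = kmeans(V, k=2)
```

Looking at the ordered spectrum of Y first, as the slide suggests, is a reasonable way to pick `rank`: here one eigenvalue dominates, so even `rank=1` would separate the two groups.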
Extracting solutions from SDR 2: Randomization
X as a random variable
Stochastic Optimization Problem
equivalent to SDR
1. Sample
2. Make e.g.
3. Reject unbalanced samples
4. Evaluate the objective
5. Repeat from 1
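The loop above can be sketched with Gaussian randomization. Several details are assumptions because the slide's formulas are missing: samples are drawn from a zero-mean Gaussian with covariance Y, they are made feasible by taking signs, and the objective is x^T L x with L = D - W. The matrix Y below is a hypothetical stand-in, not the output of an SDP solver.

```python
import numpy as np

def randomized_rounding(Y, L, n_samples=200, seed=0):
    """Keep the best balanced sign vector among Gaussian samples with covariance Y."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    best_x, best_val = None, np.inf
    for _ in range(n_samples):
        xi = rng.multivariate_normal(np.zeros(n), Y)   # sample
        x = np.where(xi >= 0, 1.0, -1.0)               # make feasible: entries in {-1, +1}
        if x.sum() != 0:                               # reject unbalanced samples
            continue
        val = float(x @ L @ x)                         # evaluate the objective
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val                            # best_x is None if all were rejected

# Illustrative graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Hypothetical stand-in for the SDR solution.
x0 = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
Y = 0.9 * np.outer(x0, x0) + 0.1 * np.eye(6)

best_x, best_val = randomized_rounding(Y, L)
```

Since x^T L x equals four times the cut weight for a ±1 vector, the best value found on this graph corresponds to cutting only the bridge edge 2-3.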
Augmented Adjacency Matrix 1: the idea
Recall
should :
- Encourage pairing together alike nodes
- Discourage pairing together dissimilar nodes
e.g
Augmented adjacency matrix 2: Communicability
Augmented adjacency matrix 3: Distance
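For the distance variant, a sketch of one possible construction: compute shortest-path (hop) distances with breadth-first search, then turn them into pairwise weights. The slide does not show how distances become weights, so the 1/(1 + d) similarity below is purely a hypothetical choice for illustration, as are the function names.

```python
from collections import deque
import numpy as np

def hop_distances(W):
    """All-pairs shortest-path lengths (in edges) via BFS; inf if unreachable."""
    W = np.asarray(W)
    n = W.shape[0]
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(W[u]):
                if np.isinf(dist[s, v]):
                    dist[s, v] = dist[s, u] + 1.0
                    queue.append(v)
    return dist

def distance_augmented(W):
    """Hypothetical augmented weights: closer nodes get larger weight, 1 / (1 + d)."""
    S = 1.0 / (1.0 + hop_distances(W))
    np.fill_diagonal(S, 0.0)          # no self-loops
    return S

# Illustrative path graph 0 - 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

D = hop_distances(W)
S = distance_augmented(W)
```

Unlike W, the matrix S is dense: every reachable pair gets a weight, larger for nearby nodes and smaller for distant ones.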
Synthetic data: Stochastic Block Model (SBM)
Synthetic data: degree-corrected (DC) SBM
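A minimal sampler for both synthetic models, under the usual definitions: in an SBM, two nodes are linked with probability `p_in` if they share a community and `p_out` otherwise; the degree-corrected variant multiplies that probability by per-node factors `theta[i] * theta[j]`. The parameter names, the clipping to [0, 1], and the Bernoulli form of the degree correction are assumptions, not the experimental setup used in the slides.

```python
import numpy as np

def sample_sbm(sizes, p_in, p_out, theta=None, seed=0):
    """Sample an undirected, unweighted (DC-)SBM graph; returns (W, labels)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out).astype(float)
    if theta is not None:                              # degree correction
        theta = np.asarray(theta, dtype=float)
        probs = np.clip(np.outer(theta, theta) * probs, 0.0, 1.0)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # sample each pair once
    W = (upper | upper.T).astype(int)                  # symmetrize, no self-loops
    return W, labels

# Extreme setting used only as a sanity check: fully connected blocks, no cross edges.
W_blocks, labels_blocks = sample_sbm([3, 2], p_in=1.0, p_out=0.0)

# A noisier graph with two communities of 20 nodes each.
W_sbm, labels_sbm = sample_sbm([20, 20], p_in=0.6, p_out=0.05, seed=1)
```

Moving `p_out` closer to `p_in` makes the communities harder to recover, which is the usual knob for stress-testing a clustering method.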
Results on synthetic data
Experiments on real datasets
Zachary Karate club
Bottlenose Dolphins network
Results on real datasets
Silhouette index:
Modularity:
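The modularity formula is missing from the slide. The sketch below assumes the standard Newman-Girvan definition, Q = (1 / 2m) Σ_ij (W_ij - k_i k_j / 2m) δ(c_i, c_j), where k_i is the degree of node i, 2m is the sum of all degrees, and c_i is the community of node i.

```python
import numpy as np

def modularity(W, labels):
    """Newman-Girvan modularity of a partition of an undirected graph (assumed definition)."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    k = W.sum(axis=1)                            # node degrees
    two_m = k.sum()                              # twice the total edge weight
    same = labels[:, None] == labels[None, :]    # delta(c_i, c_j)
    return float(((W - np.outer(k, k) / two_m) * same).sum() / two_m)

# Illustrative graph: two disjoint triangles {0,1,2} and {3,4,5}.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0

q_split = modularity(W, [0, 0, 0, 1, 1, 1])      # one community per triangle
q_single = modularity(W, [0, 0, 0, 0, 0, 0])     # everything in one community
```

For this graph the natural split scores 0.5, while lumping all nodes together scores 0.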
Conclusions
- SDP relaxations can approximate hard clustering problems, making them computationally feasible while retaining high performance
- Different definitions of the node connections can enhance separability, e.g. communicability or distance
- Different metrics lead to different partitions. There is no universal definition of a community.
sdp-programming
By Arturo Arranz