Convex Relaxation Techniques
On Community Detection, a.k.a. Graph Clustering
What is a graph?

Edge or Connection
Vertex or Node
Our Problem: Finding communities in graphs


All possible combinations:
n= 20 more than
A generic clustering problem :
combinations
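To get a feel for how fast the search space grows, here is a minimal sketch that counts candidate clusterings of n labelled nodes. Which count the slide refers to is not known; the two functions below (hypothetical names) cover two common readings: splits into two non-empty groups, and partitions into any number of groups (Bell numbers).

```python
def two_group_splits(n):
    """Number of ways to split n labelled nodes into two non-empty groups."""
    return 2 ** (n - 1) - 1


def bell_number(n):
    """Number of partitions of n labelled nodes into any number of groups.

    Computed with the Bell triangle: each row starts with the last entry
    of the previous row, and each next entry adds the entry above-left.
    """
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


for n in (5, 10, 20):
    print(n, two_group_splits(n), bell_number(n))
```

Either way, enumerating every candidate quickly becomes impractical, which motivates relaxing the problem.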



Relaxation, a.k.a. approximation
Difficult problem
Relaxed problem
Must hold:
Convex relaxation

In convex problems, any local optimum that is found is also a global optimum
Community detection in many fields:
- in biology, to find groups of proteins with similar functionality and so explain biological processes
- in social science, to find groups that share traits, e.g. potential research collaborations
- in political science, to find groups with a similar ideology
- in ecology, to find species
- and many more...



Graph theory 1: linear algebra representation




weighted edge
directed edge
self-loop
W := adjacency matrix
Example of adjacency matrix of a graph with communities
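A minimal sketch of building W from an edge list. The toy graph (two triangles joined by one bridge edge) and the helper name `adjacency_matrix` are illustrative assumptions, not taken from the slides.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (i, j, weight) triples.

    A weighted edge stores its weight, a directed edge fills only W[i][j],
    and a self-loop is an edge with i == j (it lands on the diagonal).
    """
    W = [[0] * n for _ in range(n)]
    for i, j, weight in edges:
        W[i][j] = weight
        if not directed:
            W[j][i] = weight
    return W


# Two communities {0, 1, 2} and {3, 4, 5}, linked by the single edge (2, 3).
EDGES = [(0, 1, 1), (0, 2, 1), (1, 2, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1),
         (2, 3, 1)]
W = adjacency_matrix(6, EDGES)
for row in W:
    print(row)
```

With nodes ordered by community, the matrix shows two dense diagonal blocks and a nearly empty off-diagonal block.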

Graph theory 2: basic concepts

- min-Distance:
Path-length 2:
Path-length 3:
- Node degree: number of adjacent nodes
- Path: ordered set of edges that join two nodes
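The concepts above can be sketched on a small unweighted, undirected graph. The example graph and the function names are assumptions for illustration; the minimum distance is computed with breadth-first search.

```python
from collections import deque

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def degree(W, i):
    """Number of nodes adjacent to node i."""
    return sum(1 for w in W[i] if w != 0)


def min_distance(W, source, target):
    """Length (in edges) of the shortest path, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v, w in enumerate(W[u]):
            if w != 0 and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


print(degree(W, 2), min_distance(W, 0, 3), min_distance(W, 0, 5))
```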
Graph theory 3: counting walks of length-2

on undirected unweighted graphs
Graph theory 3: counting walks of length-n

on undirected unweighted graphs
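For an unweighted graph, entry (i, j) of the matrix power W^n equals the number of walks of length n from node i to node j. A minimal sketch on an assumed toy graph, using plain nested lists:

```python
# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_walks(W, length):
    """Return W**length; entry [i][j] counts walks of that length from i to j."""
    n = len(W)
    P = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(length):
        P = matmul(P, W)
    return P


W2 = count_walks(W, 2)
W3 = count_walks(W, 3)
print([W2[i][i] for i in range(6)])  # closed walks of length 2 = node degrees
print(W3[0][5])                      # walks of length 3 from node 0 to node 5
```

On an undirected graph the diagonal of W^2 reproduces the node degrees, since every closed walk of length 2 goes out along an edge and straight back.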
Graph theory 4: Centrality, Communicability and Betweenness
Which vertices/edges are important?
- Centrality: importance of a node
- Communicability: how well connected two nodes are
- Betweenness: how much information flows through a node or edge

well-communicated
high centrality
high betweenness

Graph theory 4: Centrality, Communicability and Betweenness
- Centrality of node i:
- Communicability between nodes i and j:
- Betweenness of node r:
We can define in terms of walks
down-weighting parameter
walks of length n
Graph theory 4: Centrality, Communicability and Betweenness
Special case:
Then..
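A minimal sketch of walk-based measures, assuming the weighted sum of walks has the form F = sum_n c_n W^n, with centrality read off the diagonal entry F[i][i] and communicability off the entry F[i][j]. The factorial down-weighting c_n = 1/n! used here is one common choice (it makes F the matrix exponential of W) and is an assumption, as are the function names; betweenness is not covered.

```python
from math import factorial


def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def weighted_walk_sum(W, weights):
    """Return sum_n weights[n] * W**n for n = 0 .. len(weights) - 1."""
    n = len(W)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [[0.0] * n for _ in range(n)]
    for c in weights:
        for i in range(n):
            for j in range(n):
                total[i][j] += c * power[i][j]
        power = matmul(power, W)
    return total


# Assumed down-weighting: c_n = 1/n!, truncated after 30 terms.
WEIGHTS = [1.0 / factorial(k) for k in range(30)]

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
F = weighted_walk_sum(W, WEIGHTS)
centrality = [F[i][i] for i in range(len(W))]
print(centrality)
print(F[0][1], F[0][5])  # communicability inside vs. across the triangles
```

Long walks contribute less and less because of the factorial weights, so truncating the series after a few dozen terms is adequate for small graphs.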
Graph theory 5: the graph-Laplacian
Very nice properties:
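- For any vector $x$: $x^\top L x = \frac{1}{2}\sum_{i,j} W_{ij}(x_i - x_j)^2$
- L is symmetric positive semidefinite: $L = L^\top$, $L \succeq 0$
- $\mathbf{1}$ is always an eigenvector with eigenvalue 0: $L\mathbf{1} = 0$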
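These properties can be checked numerically. The sketch below assumes the usual unnormalized Laplacian L = D - W of an undirected graph, where D is the diagonal matrix of node degrees; the toy graph and function names are illustrative.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W (D = diagonal of row sums)."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]


def quadratic_form(L, x):
    """Compute x^T L x."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


def edge_sum(W, x):
    """Compute 1/2 * sum_ij W_ij (x_i - x_j)^2."""
    n = len(W)
    return 0.5 * sum(W[i][j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
L = laplacian(W)
x = [1, -2, 3, 0, 5, -1]
print(quadratic_form(L, x), edge_sum(W, x))  # the two values agree
print([sum(row) for row in L])               # L applied to the all-ones vector
```

Because the quadratic form is a sum of squares with non-negative weights, it can never be negative, which is exactly positive semidefiniteness.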
Back to our problem: Community detection

In terms of the graph-Laplacian
has a trivial solution (all nodes in one community)
We have to introduce a balancing constraint

but it becomes difficult to solve...
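A brute-force sketch of the balanced problem, assuming labels x_i in {-1, +1}, the objective x^T L x with L = D - W, and the balancing constraint sum(x) = 0 (which needs an even number of nodes). The names are illustrative. It enumerates every balanced split, which is only viable for tiny graphs.

```python
from itertools import combinations


def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]


def objective(L, x):
    """x^T L x, which equals 4 times the number of cut edges for +-1 labels."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


def best_balanced_split(W):
    """Try every split into two equal halves and keep the cheapest one."""
    n = len(W)
    L = laplacian(W)
    best_value, best_x = None, None
    for group in combinations(range(n), n // 2):
        x = [1 if i in group else -1 for i in range(n)]
        value = objective(L, x)
        if best_value is None or value < best_value:
            best_value, best_x = value, x
    return best_value, best_x


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(objective(laplacian(W), [1] * 6))  # unconstrained trivial solution
print(best_balanced_split(W))
```

The number of balanced splits grows combinatorially with the number of nodes, so this enumeration stops being feasible well before graphs of practical size.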
Spectral Relaxation



Expanded feasible set for
but still, non-convex...
Why is this non-convex relaxation good?



Eigendecomposition and :


From properties of L:

The eigenvector of the second-smallest eigenvalue (the Fiedler vector) is the solution of the relaxed problem!
Orthogonal matrix
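A minimal sketch of spectral bisection, assuming L = D - W and that communities are read off the signs of the second eigenvector. To stay within the standard library it uses shifted power iteration with the constant vector projected out, rather than a library eigensolver; the function name and toy graph are assumptions.

```python
import math


def fiedler_vector(W, iterations=500):
    """Approximate the eigenvector of L = D - W with second-smallest eigenvalue.

    Power iteration on (shift * I - L), restricted to vectors orthogonal to
    the all-ones vector, converges to the eigenvector of L with the smallest
    eigenvalue in that subspace.
    """
    n = len(W)
    degrees = [sum(row) for row in W]
    shift = 2 * max(degrees) + 1  # strictly above the largest eigenvalue of L
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # remove the constant (eigenvalue 0) component
        # y = (shift * I - L) x = (shift - d_i) x_i + sum_j W_ij x_j
        y = [(shift - degrees[i]) * x[i] + sum(W[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
v = fiedler_vector(W)
labels = [1 if value >= 0 else -1 for value in v]
print(labels)
```

Rounding by sign is the simplest way back to discrete labels; it does not enforce exact balance, so thresholding at the median is a possible alternative.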
Semidefinite relaxation (SDR)




change of variables

equivalent
relaxed problem
Convex problem!
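The change of variables can be written out as follows. This is a sketch that assumes the balanced problem has objective $x^\top L x$, labels $x_i \in \{-1, 1\}$ and balancing constraint $\mathbf{1}^\top x = 0$; the exact constraint set on the slide is not known.

```latex
\begin{aligned}
&\text{Change of variables: } Y = x x^\top
  \;\Rightarrow\; x^\top L x = \operatorname{tr}(L\, x x^\top) = \operatorname{tr}(L Y) \\[4pt]
&\text{Equivalent problem:}\quad
  \min_{Y}\ \operatorname{tr}(L Y)
  \quad \text{s.t.}\quad Y_{ii} = 1,\ \ \mathbf{1}^\top Y \mathbf{1} = 0,\ \
  Y \succeq 0,\ \ \operatorname{rank}(Y) = 1 \\[4pt]
&\text{Relaxed problem:}\quad
  \min_{Y}\ \operatorname{tr}(L Y)
  \quad \text{s.t.}\quad Y_{ii} = 1,\ \ \mathbf{1}^\top Y \mathbf{1} = 0,\ \
  Y \succeq 0
\end{aligned}
```

Here $Y_{ii} = x_i^2 = 1$ encodes the $\pm 1$ labels and $\mathbf{1}^\top Y \mathbf{1} = (\mathbf{1}^\top x)^2 = 0$ encodes balance. The rank constraint is the only non-convex part, so dropping it leaves a linear objective over a convex set.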
Extracting solutions from SDR 1:
Low rank approximation + k-means


Low-rank approximation of Y
An optimal Y
Ordered spectrum of optimal Y
K-means with V rows as features
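A minimal sketch of the clustering step only, assuming the low-rank factor V (one row per node) has already been computed from the optimal Y. The matrix V below is a made-up example, and the farthest-point initialization is a choice made here to keep the run deterministic; random initialization is more common.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(rows, k, iterations=100):
    """Lloyd's algorithm on the rows of a feature matrix; returns labels."""
    # Farthest-point initialization (deterministic, an assumption of this sketch).
    centers = [list(rows[0])]
    while len(centers) < k:
        farthest = max(rows, key=lambda r: min(squared_distance(r, c)
                                               for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(r, centers[c]))
                      for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [r for r, label in zip(rows, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical rows of V for six nodes, two features per node.
V = [[0.90, 0.10], [1.00, 0.00], [0.95, 0.05],
     [-1.00, 0.10], [-0.90, 0.00], [-0.95, 0.05]]
print(kmeans(V, 2))
```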

Extracting solutions from SDR 2: Randomization


X as a random variable
Stochastic Optimization Problem

equivalent to SDR
1. Sample $\xi \sim \mathcal{N}(0, Y)$
2. Make $x$ feasible, e.g. $x = \operatorname{sign}(\xi)$
3. Reject unbalanced samples
4. Evaluate in objective
5. Repeat from 1
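The procedure above can be sketched as follows, assuming samples are Gaussian with covariance Y, rounding is by sign, the objective is x^T L x with L = D - W, and "balanced" means sum(x) = 0. Y is passed as a factor V with Y = V V^T, so that multiplying V by a standard normal vector gives the required covariance. The factor V below is made up for illustration, not the output of an SDP solver.

```python
import random


def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]


def objective(L, x):
    """Compute x^T L x."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


def randomized_rounding(W, V, trials=200, seed=0):
    """Sample xi ~ N(0, V V^T), round by sign, keep the best balanced sample."""
    rng = random.Random(seed)
    L = laplacian(W)
    n, r = len(V), len(V[0])
    best_value, best_x = None, None
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(r)]
        xi = [sum(V[i][k] * g[k] for k in range(r)) for i in range(n)]
        x = [1 if value >= 0 else -1 for value in xi]
        if sum(x) != 0:  # reject unbalanced samples
            continue
        value = objective(L, x)
        if best_value is None or value < best_value:
            best_value, best_x = value, x
    return best_value, best_x


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Hypothetical factor with unit-norm rows, so diag(V V^T) = 1.
V = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.6],
     [-0.8, 0.6], [-1.0, 0.0], [-1.0, 0.0]]
best_value, best_x = randomized_rounding(W, V)
print(best_value, best_x)
```

Each trial costs only a matrix-vector product and one objective evaluation, so many samples can be drawn cheaply once Y is available.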


Augmented Adjacency Matrix 1: the idea

Recall
should :

- Encourage pairing together alike nodes
- Discourage pairing together dissimilar nodes

e.g
Augmented adjacency matrix 2: Communicability


Augmented adjacency matrix 3: Distance



Synthetic data: Stochastic Block Model (SBM)
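A minimal sketch of sampling from a stochastic block model, assuming the simplest variant: each pair of nodes in the same block is connected with probability `p_in`, each pair in different blocks with probability `p_out`, independently. The parameter names and values are illustrative.

```python
import random


def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected, unweighted SBM graph.

    Returns the adjacency matrix W and the ground-truth block label per node.
    """
    rng = random.Random(seed)
    labels = [block for block, size in enumerate(block_sizes)
              for _ in range(size)]
    n = len(labels)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                W[i][j] = W[j][i] = 1
    return W, labels


W, labels = sample_sbm([5, 5], p_in=0.8, p_out=0.1)
for row in W:
    print(row)
```

The closer `p_out` gets to `p_in`, the weaker the block structure in W and the harder the communities are to recover.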




Synthetic data: degree-corrected (DC) SBM


Results on synthetic data

Experiments on real datasets


Zachary Karate club
Bottlenose Dolphins network



Results on real datasets
Silhouette index:
Modularity:
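A minimal sketch of the modularity score, assuming the standard Newman definition Q = (1/2m) * sum_ij (W_ij - k_i k_j / 2m) * [c_i = c_j], where k_i is the degree of node i, m the number of edges and c_i the community of node i. The silhouette index is not covered here.

```python
def modularity(W, labels):
    """Newman modularity of a partition of an undirected graph."""
    n = len(W)
    degrees = [sum(row) for row in W]
    two_m = sum(degrees)  # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += W[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(modularity(W, [0, 0, 0, 1, 1, 1]))  # split along the bridge
print(modularity(W, [0, 0, 0, 0, 0, 0]))  # everything in one community
```

Higher values mean more edges fall inside communities than a random graph with the same degrees would give.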

Conclusions
- SDP relaxations can approximate hard clustering problems, making them computationally feasible while keeping high performance
- Different definitions of the node connections can enhance separability, e.g. communicability or distance
- Different metrics lead to different partitions. There is no universal definition of a community.
sdp-programming
By Arturo Arranz