TUTORIAL 2b

Community detection and image segmentation


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 21, 2020
Copyright: © 2020 Sebastien Roch


Recap from the lectures

The setup A convenient way of specifying a graph is the following matrix representation.

Definition (Adjacency Matrix): Assume the undirected graph $G = (V,E)$ has $n = |V|$ vertices. The adjacency matrix $A$ of $G$ is the $n\times n$ symmetric matrix defined as

$$ \begin{align} A_{xy} = \begin{cases} 1 & \text{if $\{x,y\} \in E$}\\ 0 & \text{o.w.} \end{cases} \end{align} $$

$\lhd$

Our key definition is the following.

Definition (Laplacian Matrix): Let $G = (V,E)$ be a graph with vertices $V = \{1, \ldots, n\}$ and adjacency matrix $A \in \mathbb{R}^{n \times n}$. Let $D = \mathrm{diag}(\delta(1), \ldots, \delta(n))$ be the degree matrix. The Laplacian matrix associated to $G$ is defined as $L = D - A$. Its entries are

$$ l_{ij} = \begin{cases} \delta(i) & \text{if $i = j$}\\ -1 & \text{if $\{i,j\} \in E$}\\ 0 & \text{o.w.} \end{cases} $$

$\lhd$

The theory As a convention, we denote that eigenvalues of a Laplacian matrix $L$ by $0 \leq \mu_1 \leq \mu_2 \leq \cdots \leq \mu_n$.


Lemma ($\mu_1 = 0$): Let $G = (V, E)$ be a graph with $n = |V|$ vertices and Laplacian matrix $L$. The constant unit vector

$$ \mathbf{y}_1 = \frac{1}{\sqrt{n}} (1, \ldots, 1)^T $$

is an eigenvector of $L$ with eigenvalue $0$.



Corollary (Extremal Characterization of $\mu_2$): Let $G = (V, E)$ be a graph with $n = |V|$ vertices. Assume the Laplacian $L$ of $G$ has spectral decomposition $L = \sum_{i=1}^n \mu_i \mathbf{y}_i \mathbf{y}_i^T$ with $0 = \mu_1 \leq \mu_2 \leq \cdots \leq \mu_n$ and $\mathbf{y}_1 = \frac{1}{\sqrt{n}}(1,\ldots,1)^T$. Then

$$ \mu_2 = \min\left\{ \frac{\sum_{\{u, v\} \in E} (x_u - x_v)^2}{\sum_{u = 1}^n x_u^2} \,:\, \mathbf{x} = (x_1, \ldots, x_n)^T \neq \mathbf{0}, \sum_{u=1}^n x_u = 0 \right\}. $$

The algorithm Let $G=(V, E)$ be a graph. Imagine that we are interested in finding a good cut. That is, roughly speaking, we seek to divide it into two disjoint subsets of vertices to achieve two goals simultaneously:

  1. the two sets have relatively few edges between them
  2. neither set is too small.

Definition (Isoperimetric Number): Let $G=(V, E)$ be a graph. A cut is a bipartition $(S, S^c)$ of the vertices of $G$, where $S$ and $S^c$ are non-empty subsets of $V$. For short, we refer to $\emptyset \neq S \subset V$ as a cut. The size of the cut $|\partial S|$ is the number of edges from $S$ to $S^c$, where the edge boundary is defined as $\partial S = \{\{i,j\} \in E : i \in S, j \in S^c\}$. The cut ratio of $S$ is

$$ \phi(S) = \frac{|\partial S|}{\min\{|S|, |S^c|\}} $$

and the isoperimetric number (or Cheeger constant) of $G$ is the minimum cut ratio over all of its cuts, that is,

$$ \phi_G = \min_{\emptyset \neq S \subset V} \phi(S). $$

$\lhd$

A key result of spectral graph theory establishes a quantitative relation between the isoperimetric number and the second smallest Laplacian eigenvalue.


Theorem (Cheeger Inequality): Let $G = (V, E)$ be a graph with $n = |V|$ vertices and maximum degree $\bar{\delta}$. Let $0 = \mu_1 \leq \mu_2 \leq \cdots \leq \mu_n$ be its Laplacian eigenvalues. Then

$$ \frac{\phi_G^2}{2 \bar{\delta}} \leq \mu_2 \leq 2 \phi_G. $$

Let $\mathbf{y}_2 \in \mathbb{R}^n$ be the unit-norm eigenvector of $L$ associated to $\mu_2$. There is one entry of $\mathbf{y}_2 = (y_{2,1}, \ldots, y_{2,n})^T$ for each vertex of $G$. We use these entries to embed the graph $G$ in $\mathbb{R}$: vertex $i$ is mapped to $y_{2,i}$. Now order the entries $y_{2,\pi(1)}, \ldots, y_{2,\pi(n)}$, where $\pi$ is a permutation, that is, a re-ordering of $1,\ldots,n$. Specifically, $\pi(1)$ is the vertex corresponding to the smallest entry of $\mathbf{y}_{2}$, $\pi(2)$ is the second smallest, and so on. We consider only cuts of the form

$$ S_k = \{\pi(1), \ldots, \pi(k)\} $$

and we choose the $k \leq n/2$ that minimizes the cut ratio

$$ \phi(S_k) = \frac{|\partial S_k|}{|S_k|}. $$

1 Community detection

We return to the Karate Club dataset from the lecture.

In [3]:
g = smallgraph(:karate)
A = Array(adjacency_matrix(g))
n = nv(g)
gplot(g, nodelabel=1:n)
Out[3]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
In [4]:
degrees = reshape(sum(A,dims=2), n)
D = Diagonal(degrees)
L = D - A
Out[4]:
34×34 Array{Int64,2}:
 16  -1  -1  -1  -1  -1  -1  -1  -1  …   0   0   0   0   0   0  -1   0   0
 -1   9  -1  -1   0   0   0  -1   0      0   0   0   0   0  -1   0   0   0
 -1  -1  10  -1   0   0   0  -1  -1      0   0  -1  -1   0   0   0  -1   0
 -1  -1  -1   6   0   0   0  -1   0      0   0   0   0   0   0   0   0   0
 -1   0   0   0   3   0  -1   0   0      0   0   0   0   0   0   0   0   0
 -1   0   0   0   0   4  -1   0   0  …   0   0   0   0   0   0   0   0   0
 -1   0   0   0  -1  -1   4   0   0      0   0   0   0   0   0   0   0   0
 -1  -1  -1  -1   0   0   0   4   0      0   0   0   0   0   0   0   0   0
 -1   0  -1   0   0   0   0   0   5      0   0   0   0   0  -1   0  -1  -1
  0   0  -1   0   0   0   0   0   0      0   0   0   0   0   0   0   0  -1
 -1   0   0   0  -1  -1   0   0   0  …   0   0   0   0   0   0   0   0   0
 -1   0   0   0   0   0   0   0   0      0   0   0   0   0   0   0   0   0
 -1   0   0  -1   0   0   0   0   0      0   0   0   0   0   0   0   0   0
  ⋮                   ⋮              ⋱   ⋮                   ⋮          
  0   0   0   0   0   0   0   0   0      0   0   0   0   0   0   0  -1  -1
  0   0   0   0   0   0   0   0   0     -1   0  -1   0  -1   0   0  -1  -1
  0   0   0   0   0   0   0   0   0     -1   0  -1   0   0   0  -1   0   0
  0   0   0   0   0   0   0   0   0  …   3   0   0   0   0   0  -1   0   0
  0   0   0   0   0   0   0   0   0      0   2   0   0  -1   0   0   0  -1
  0   0  -1   0   0   0   0   0   0      0   0   4   0   0   0   0   0  -1
  0   0  -1   0   0   0   0   0   0      0   0   0   3   0   0  -1   0  -1
  0   0   0   0   0   0   0   0   0      0  -1   0   0   4   0   0  -1  -1
  0  -1   0   0   0   0   0   0  -1  …   0   0   0   0   0   4   0  -1  -1
 -1   0   0   0   0   0   0   0   0     -1   0   0  -1   0   0   6  -1  -1
  0   0  -1   0   0   0   0   0  -1      0   0   0   0  -1  -1  -1  12  -1
  0   0   0   0   0   0   0   0  -1      0  -1  -1  -1  -1  -1  -1  -1  17
In [5]:
lambda = eigvals(L) # eigenvalues of L 
plot(lambda, legend=false, marker=:diamond)
Out[5]:
In [6]:
v = eigvecs(L) # eigenvectors of L
scatter(v[:,2], zeros(n), legend=false, ylims=(-1,1))
Out[6]:
In [7]:
(s, sc) = spectral_cut2(A)
Out[7]:
([27, 30, 21, 15, 16, 19, 23, 26, 24, 25, 33, 28, 34, 32, 29, 10, 31, 9], [3, 20, 14, 2, 8, 4, 18, 22, 13, 1, 12, 5, 11, 6, 7, 17])
In [8]:
viz_cut(g, s; layout=spring_layout)
Out[8]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

It is not trivial to assess the quality of the resulting cut. But this particular example has a known ground-truth community structure (which partly explains its widespread use). Quoting from Wikipedia:

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.

This ground truth is the following.

In [9]:
truth = [2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 
    1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1];
nodecolor = [colorant"lightseagreen", colorant"orange"]
gplot(g, nodelabel=1:n, nodefillc=nodecolor[truth])
Out[9]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

You can check that our cut perfectly matches the ground truth.