TOPIC 0

Introduction

2 Clustering: an objective and an algorithm


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 8, 2020
Copyright: © 2020 Sebastien Roch


Consider the following fundamental problem in data science. We are given $n$ vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ in $\mathbb{R}^d$. Our goal is to find a good clustering: loosely speaking, we want to partition these data points into $k$ disjoint subsets -- or clusters -- with small pairwise distances within clusters and large pairwise distances across clusters. To make this rather imprecise problem more precise, we consider a specific objective function known as the $k$-means objective.

Fix a number of clusters $k$. Formally, we think of a clustering as a partition.

Definition (Partition): A partition of $[n] = \{1,\ldots,n\}$ of size $k$ is a collection of non-empty subsets $C_1,\ldots,C_k \subseteq [n]$ that:

  • are pairwise disjoint, i.e., $C_i \cap C_j = \emptyset$, $\forall i \neq j$
  • cover all of $[n]$, i.e., $\cup_{i=1}^k C_i = [n]$. $\lhd$

2.1 The $k$-means objective

Given $n$ vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ in $\mathbb{R}^d$ and a partition $C_1, \ldots, C_k \subseteq [n]$, it will be useful to have notation for the corresponding cluster assignment: we define $c(i) = j$ if $i \in C_j$.

Under the $k$-means objective, the cost of $C_1,\ldots,C_k$ is then defined as

$$ \mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) $$

where

$$ \begin{align} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) &= \sum_{j=1}^k \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 = \sum_{i=1}^n \|\mathbf{x}_i - \boldsymbol{\mu}_{c(i)}\|^2. \end{align} $$

Here $\boldsymbol{\mu}_i \in \mathbb{R}^d$ is the representative -- or center -- of cluster $C_i$. Note that $\boldsymbol{\mu}_i$ need not be one of the $\mathbf{x}_j$'s.

Our goal is to find a partition that minimizes $\mathcal{G}(C_1,\ldots,C_k)$, i.e., to solve the problem:

$$ \min_{C_1,\ldots,C_k} \mathcal{G}(C_1,\ldots,C_k) $$

over all partitions of $[n]$ of size $k$.
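
To make the objective concrete, here is a tiny worked example (a sketch with made-up data, not from the notes): three points in the plane, a partition of $[3]$ into $k=2$ clusters, and candidate representatives.

using LinearAlgebra

x = [[0.0, 0.0], [1.0, 0.0], [10.0, 1.0]]  # x_1, x_2, x_3 in R^2
C = [[1, 2], [3]]                          # partition of [3] into k=2 clusters
mu = [[0.5, 0.0], [10.0, 1.0]]             # candidate representatives mu_1, mu_2
G = sum(sum(norm(x[i] .- mu[j])^2 for i in C[j]) for j = 1:2)
@show G                                    # 0.25 + 0.25 + 0.0 = 0.5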

To quote Wikipedia:

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

In general, the problem is NP-hard, that is, no fast algorithm is expected to exist to solve it. The $k$-means algorithm is a popular heuristic. It is based on the idea that the following two sub-problems are easy to solve:

  1. finding the optimal representatives for a fixed partition
  2. finding the optimal partition for a fixed set of representatives.

One then alternates between the two (perhaps until progress falls below a tolerance). To elaborate on the first step, we review an elementary fact about quadratic functions.

2.1.1 Preliminary result: minimizing a quadratic function

A key step of the algorithm involves minimizing a quadratic function. Consider the function

$$q(x) = a x^2 + b x + c.$$

When $a > 0$, $q$ has a unique minimum.


Lemma (Minimum of Quadratic Function): Let $q(x) = a x^2 + b x + c$ where $a > 0$ and $x \in \mathbb{R}$. The unique minimum of $q$ is achieved at $x^* = -b/2a$.


Proof: By the First-Order Necessary Condition, a global minimizer of $q$ (which is necessarily a local minimizer) satisfies the condition

$$ \frac{\mathrm{d}}{\mathrm{d}x} q(x) = 2 a x + b = 0, $$

whose unique solution is

$$ x^*= -\frac{b}{2a}. $$

To see that $x^*$ is indeed a global minimizer, we re-write $q$ as

$$ \begin{align} q(x) &= a \left(x^2 + 2 \left[\frac{b}{2a}\right] x\right) + c\\ &= a \left(x^2 + 2 \left[\frac{b}{2a}\right] x + \left[\frac{b}{2a}\right]^2\right) - a \left[\frac{b}{2a}\right]^2 + c\\ &= a (x - x^*)^2 + \left[c - \frac{b^2}{4a}\right]. \end{align} $$

Clearly, any other $x$ gives a higher value for $q$. $\square$
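
As a quick numerical sanity check of this lemma (an illustrative sketch; the coefficients below are arbitrary), one can compare $-b/2a$ to a brute-force minimum over a fine grid:

a, b, c = 2.0, -3.0, 1.0
q(x) = a*x^2 + b*x + c
xstar = -b/(2a)                        # minimizer predicted by the lemma
grid = LinRange(xstar - 5, xstar + 5, 10001)
qmin, imin = findmin(q.(grid))         # brute-force minimum over the grid
@show xstar, grid[imin]                # the two should agree up to the grid resolution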


Lemma (Optimal Representatives): Fix a partition $C_1,\ldots,C_k$. The optimal representatives under $G$ are the centroids

$$ \boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j\in C_i} \mathbf{x}_j. $$

Proof idea: The objective $G$ can be written as a sum in which each term is a quadratic function of one component of one of the $\boldsymbol{\mu}_i$'s. Each of these terms is minimized by the average of the corresponding components of the $\mathbf{x}_j$'s belonging to $C_i$.

Proof: Note that we can expand the $k$-means objective as

$$ \begin{align} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 &= \sum_{i=1}^k \sum_{j \in C_i} \sum_{m=1}^d (x_{j,m} - \mu_{i,m})^2\\ &= \sum_{m=1}^d \sum_{i=1}^k \left[\sum_{j \in C_i} (x_{j,m} - \mu_{i,m})^2\right]. \end{align} $$

The expression in square brackets is a quadratic function in $\mu_{i,m}$:

$$ q_{i,m}(\mu_{i,m}) = \left\{\sum_{j \in C_i} x_{j,m}^2\right\} + \left\{- 2 \sum_{j \in C_i} x_{j,m}\right\} \mu_{i,m} + \left\{|C_i| \right\} \mu_{i,m}^2, $$

and therefore, by the formula for the minimum of a quadratic function, is minimized at

$$ \mu_{i,m}^* = - \frac{- 2 \sum_{j \in C_i} x_{j,m}}{2 |C_i|} = \frac{1}{|C_i|} \sum_{j \in C_i} x_{j,m}. $$

Since each term in the sum making up $G$ is strictly minimized at $\boldsymbol{\mu}_1^*,\ldots, \boldsymbol{\mu}_k^*$, so is $G$. $\square$
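
As an illustrative check of this lemma (a sketch with made-up points, not part of the notes), perturbing the centroid of a cluster can only increase its contribution to the objective:

using LinearAlgebra

pts = [1.0 2.0; 3.0 0.0; 2.0 4.0]                 # three points in R^2, all in one cluster
centroid = vec(sum(pts, dims=1) ./ size(pts, 1))  # the optimal representative
cost(mu) = sum(norm(pts[i,:] .- mu)^2 for i = 1:size(pts, 1))
@show cost(centroid) <= cost(centroid .+ [0.1, -0.2])  # true, by the lemma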


Lemma (Optimal Clustering): Fix the representatives $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k$. An optimal partition under $G$ is obtained as follows. For each $j$, find the $\boldsymbol{\mu}_i$ that minimizes $\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$ (picking one arbitrarily in the case of ties) and assign $\mathbf{x}_j$ to $C_i$.


Proof: By definition, each term in

$$ G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) = \sum_{i=1}^n \|\mathbf{x}_i - \boldsymbol{\mu}_{c(i)}\|^2 $$

is, when the $\boldsymbol{\mu}_j$'s are fixed, minimized by the assignment in the statement. Hence so is $G$. $\square$

2.2 The $k$-means algorithm

We are now ready to describe the $k$-means algorithm. We start from a random assignment of clusters. We then alternate between the optimal choices in the lemmas. In lieu of pseudo-code, we write out the algorithm in Julia.

The input X is assumed to be a collection of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ stacked as the rows of an $n \times d$ matrix. The other input, k, is the desired number of clusters. There is an optional input maxiter for the maximum number of iterations, which is set to $10$ by default.

We first define separate functions for the two main steps. To find the minimum of an array, we use the function findmin, which returns both the minimum value and the index at which it is achieved.
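
The code also uses norm from the LinearAlgebra standard library and, further down, plotting routines such as scatter and histogram from the Plots package; these are presumably loaded in earlier cells of the notebook (not shown here), e.g. via:

using LinearAlgebra, Plots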

In [3]:
function opt_reps(X, k, assign)
    n, d = size(X)
    reps = zeros(Float64, k, d) # rows are representatives
    for j = 1:k
        in_j = [i for i=1:n if assign[i] == j] # points currently assigned to cluster j (assumed non-empty)
        reps[j,:] = sum(X[in_j,:],dims=1) ./ length(in_j) # centroid of cluster j
    end
    return reps
end
Out[3]:
opt_reps (generic function with 1 method)
In [4]:
function opt_clust(X, k, reps)
    n, d = size(X) # n=number of rows, d=number of columns 
    dist = zeros(Float64, n) # distance to rep
    assign = zeros(Int64, n) # cluster assignments
    for i = 1:n
        # findmin returns the distance to the nearest representative and its index
        dist[i],assign[i] = findmin([norm(X[i,:].-reps[j,:]) for j=1:k])
    end
    @show G = sum(dist.^2)
    return assign
end
Out[4]:
opt_clust (generic function with 1 method)
In [5]:
function mmids_kmeans(X, k; maxiter=10)
    n, d = size(X)
    assign = [rand(1:k) for i=1:n] # start with random assignments
    reps = zeros(Float64, k, d) # initialization of reps (overwritten at the first iteration)
    for iter = 1:maxiter
        # Step 1: Optimal representatives for fixed clusters
        reps = opt_reps(X, k, assign) 
        # Step 2: Optimal clusters for fixed representatives
        assign = opt_clust(X, k, reps) 
    end
    return assign
end
Out[5]:
mmids_kmeans (generic function with 1 method)

The $k$-means algorithm is only a heuristic. In particular, it is not guaranteed to find the global minimum of the $k$-means objective. However, it is guaranteed to improve the objective at every iteration, or more precisely, not to make it worse.


Theorem (Convergence of $k$-means): The sequence of objective function values produced by the $k$-means algorithm is non-increasing.


Proof idea: By the Optimal Representatives and Optimal Clustering lemmas, each step does not increase the objective.

Proof: Let $C_1',\ldots,C_k'$ be the current clusters, with representatives $\boldsymbol{\mu}_1',\ldots,\boldsymbol{\mu}_k'$. After Step 1, the new representatives are $\boldsymbol{\mu}_1'',\ldots,\boldsymbol{\mu}_k''$. By the Optimal Representatives Lemma, they satisfy

$$ \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i'\|^2. $$

After Step 2, the new clusters are $C_1'',\ldots,C_k''$. By the Optimal Clustering Lemma, they satisfy

$$ \sum_{i=1}^k \sum_{j \in C_i''} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2. $$

Combining these two inequalities gives

$$ \sum_{i=1}^k \sum_{j \in C_i''} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i'\|^2, $$

as claimed. $\square$
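
The following sketch (not part of the original notebook; kmeans_objective_trace is a hypothetical helper that reuses opt_reps and opt_clust above) records the objective value after each iteration, so the non-increasing behavior guaranteed by the theorem can be checked directly on any data matrix X:

function kmeans_objective_trace(X, k; maxiter=10)
    n, d = size(X)
    assign = [rand(1:k) for i=1:n]      # random initial assignment, as in mmids_kmeans
    trace = Float64[]                   # objective value after each iteration
    for iter = 1:maxiter
        reps = opt_reps(X, k, assign)   # Step 1: optimal representatives
        assign = opt_clust(X, k, reps)  # Step 2: optimal clusters
        push!(trace, sum(norm(X[i,:] .- reps[assign[i],:])^2 for i=1:n))
    end
    return trace
end

By the theorem, all(diff(trace) .<= 1e-8) should then evaluate to true (up to floating-point error).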

2.3 Simulated data

We will test our implementation of $k$-means on a simple simulated dataset.

Recall that a standard Normal variable $X$ has PDF

$$ f_X(x) = \frac{1}{\sqrt{2 \pi}} \exp\left( - x^2/2 \right). $$

Its mean is $0$ and its variance is $1$.

This is what its PDF looks like:

[Figure: PDF of the standard Normal distribution.]

To construct a $d$-dimensional version, we take $d$ independent standard Normal variables $X_1, X_2, \ldots, X_d$ and form the vector $\mathbf{X} = (X_1,\ldots,X_d)$. We will say that $\mathbf{X}$ is a standard Normal $d$-vector. By independence, its joint PDF is given by the product of the PDFs of the $X_i$'s, that is,

$$ \begin{align} f_{\mathbf{X}}(\mathbf{x}) &= \prod_{i=1}^d \frac{1}{\sqrt{2 \pi}} \exp\left( - x_i^2/2 \right)\\ &= \frac{1}{\prod_{i=1}^d \sqrt{2 \pi}} \exp\left( - \sum_{i=1}^d x_i^2/2 \right)\\ &= \frac{1}{(2 \pi)^{d/2}} \exp(-\|\mathbf{x}\|^2/2). \end{align} $$
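
A quick numerical check of this formula (an illustrative sketch; the point x below is arbitrary):

x = [0.3, -1.2, 2.0]
d = length(x)
f_product = prod(exp(-xi^2/2) / sqrt(2pi) for xi in x)  # product of univariate PDFs
f_vector  = exp(-sum(abs2, x)/2) / (2pi)^(d/2)          # d-dimensional formula
@show f_product ≈ f_vector                              # true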

We can also shift and scale it.

Definition (Spherical Gaussian): Let $\mathbf{Z}$ be a standard Normal $d$-vector, let $\boldsymbol{\mu} \in \mathbb{R}^d$ and let $\sigma \in \mathbb{R}_+$. Then we will refer to the transformed random variable $\mathbf{X} = \boldsymbol{\mu} + \sigma \mathbf{Z}$ as a spherical Gaussian with mean $\boldsymbol{\mu}$ and variance $\sigma^2$. We use the notation $\mathbf{X} \sim N_d(\boldsymbol{\mu}, \sigma^2 I)$. $\lhd$
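
In Julia, a draw from $N_d(\boldsymbol{\mu}, \sigma^2 I)$ can be simulated directly from this definition (a minimal sketch; the values of mu and sigma below are arbitrary):

d = 3
mu, sigma = [1.0, -2.0, 0.5], 2.0
x = mu .+ sigma .* randn(d)   # mu + sigma*Z, with Z = randn(d) a standard Normal d-vector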

The following function generates $n$ data points from each of two spherical $d$-dimensional Gaussians with variance $1$, one with mean $-w\mathbf{e}_1$ and the other with mean $w \mathbf{e}_1$. Below, randn(d) generates a standard Normal $d$-vector.

In [6]:
function two_clusters(d, n, w)
    X1 = reduce(hcat, [vcat(-w, zeros(d-1)) .+ randn(d) for i=1:n])'
    X2 = reduce(hcat, [vcat(w, zeros(d-1)) .+ randn(d) for i=1:n])'
    return X1, X2
end
Out[6]:
two_clusters (generic function with 1 method)

We will mix these two datasets to form an interesting case for clustering.

2.3.1 Two dimensions

We start with $d=2$.

In [7]:
d, n, w = 2, 100, 3.
X1, X2 = two_clusters(d, n, w)
X = vcat(X1, X2)
scatter(X[:,1], X[:,2], legend=false, size=(700,350))
Out[7]:
In [8]:
assign = mmids_kmeans(X, 2);
G = sum(dist .^ 2) = 1642.660544012831
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
G = sum(dist .^ 2) = 412.35579408034704
In [9]:
scatter(X[:,1], X[:,2], marker_z=assign, legend=false, size=(700,350))
Out[9]:

2.3.2 General dimension

Let's see what happens in higher dimensions. We repeat our experiment with $d=1000$.

In [10]:
d, n, w = 1000, 100, 3
X1, X2 = two_clusters(d, n, w)
X = vcat(X1, X2)
scatter(X[:,1], X[:,2], legend=false, size=(700,350))
Out[10]:
In [11]:
scatter(X[:,2], X[:,3], legend=false, size=(350,350))
Out[11]:
In [12]:
assign = mmids_kmeans(X, 2);
G = sum(dist .^ 2) = 199775.6175486944
G = sum(dist .^ 2) = 199754.31646446598
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
G = sum(dist .^ 2) = 199717.27042900465
In [13]:
scatter(X[:,1], X[:,2], marker_z=assign, legend=false, size=(700,350))
Out[13]:

Our attempt at clustering does not appear to have been successful. What happened? While these clusters are easy to tease apart if we know to look at the first coordinate only, in the full space the within-cluster and between-cluster distances become harder to distinguish: the noise overwhelms the signal.

The function below plots the histograms of within-cluster and between-cluster distances for a sample of size $n$ in $d$ dimensions with a given offset. As $d$ increases, the two distributions become increasingly indistinguishable. Later in the course, we will develop dimension-reduction techniques that help deal with this issue.

In [14]:
function highdim_2clusters(d, n, w)
    # generate datasets
    X1, X2 = two_clusters(d, n, w)
    
    # within-cluster distances for X1
    intra = vec([norm(X1[i,:] .- X1[j,:]) for i=1:n, j=1:n if j>i])
    h = histogram(intra, normalize=:probability,
        label="within-cluster", alpha=0.75, title="dim=$d") # alpha=transparency
 
    # between-cluster distances
    inter = vec([norm(X1[i,:] .- X2[j,:]) for i=1:n, j=1:n])
    histogram!(inter, normalize=:probability,
        label="between-cluster", alpha=0.75)
end
Out[14]:
highdim_2clusters (generic function with 1 method)
In [15]:
highdim_2clusters(3, 100, 3)
Out[15]:
In [16]:
highdim_2clusters(100, 100, 3)
Out[16]:
In [17]:
highdim_2clusters(1000, 100, 3)
Out[17]:

(1) Expectation of $\Delta$: Here $\mathbf{X}_1, \mathbf{X}_2$ denote two independent data points from the same cluster, $\mathbf{Y}_1, \mathbf{Y}_2$ denote two independent data points from different clusters, and $\Delta = \|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 - \|\mathbf{X}_1 - \mathbf{X}_2\|^2$ is the difference between the squared between-cluster and within-cluster distances. By definition, the random variables $X_{1,i} - X_{2,i}$, $i = 1,\ldots, d$, and $Y_{1,i} - Y_{2,i}$, $i = 2,\ldots, d$, are identically distributed, since only the first coordinate carries the offset. So, by linearity of expectation, the terms for $i = 2,\ldots,d$ cancel and

$$ \begin{align} \mathbb{E}[\Delta] &= \sum_{i=1}^d \mathbb{E}[(Y_{1,i} - Y_{2,i})^2] - \sum_{i=1}^d \mathbb{E}[(X_{1,i} - X_{2,i})^2]\\ &= \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] - \mathbb{E}[(X_{1,1} - X_{2,1})^2]. \end{align} $$

Further, we can write $Y_{1,1} - Y_{2,1} \sim (Z_1 - w) - (Z_2 + w)$ where $Z_1, Z_2 \sim N(0,1)$ are independent and $\sim$ denotes equality in distribution. Hence, we have

$$ \begin{align} \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] &= \mathbb{E}[(Z_1 - Z_2 - 2w)^2]\\ &= \mathbb{E}[(Z_1 - Z_2)^2] - 4w \,\mathbb{E}[Z_1 - Z_2] + 4 w^2. \end{align} $$

Similarly, $X_{1,1} - X_{2,1} \sim Z_1 - Z_2$, so $\mathbb{E}[(X_{1,1} - X_{2,1})^2] = \mathbb{E}[(Z_1 - Z_2)^2]$. Since $\mathbb{E}[Z_1 - Z_2] = 0$, we finally get $\mathbb{E}[\Delta] = 4 w^2$.
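
A Monte Carlo sanity check of this calculation (a sketch, not part of the notes; it reuses two_clusters above with arbitrary values of d and w):

using Statistics

d, w, N = 10, 3.0, 100_000
X1, X2 = two_clusters(d, N, w)
within  = [sum((X1[i,:] .- X1[i+1,:]).^2) for i = 1:2:N-1]  # pairs from the same cluster
between = [sum((X1[i,:] .- X2[i,:]).^2) for i = 1:N]        # pairs from different clusters
@show mean(between) - mean(within)   # should be close to 4w^2 = 36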