*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Sep 8, 2020

*Copyright:* © 2020 Sebastien Roch

Consider the following fundamental problem in data science. We are given $n$ vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ in $\mathbb{R}^d$. Our goal is to find a good clustering: loosely speaking, we want to partition these data points into $k$ disjoint subsets -- or clusters -- with small pairwise distances within clusters and large pairwise distances across clusters. To make this rather imprecise problem more precise, we consider a specific objective function known as the $k$-means objective.

Fix a number of clusters $k$. Formally, we think of a clustering as a partition.

**Definition (Partition):** A partition of $[n] = \{1,\ldots,n\}$ of size $k$ is a collection of non-empty subsets $C_1,\ldots,C_k \subseteq [n]$ that:

- are pairwise disjoint, i.e., $C_i \cap C_j = \emptyset$, $\forall i \neq j$
- cover all of $[n]$, i.e., $\cup_{i=1}^k C_i = [n]$. $\lhd$

Under the $k$-means objective, the cost of $C_1,\ldots,C_k$ is then defined as

$$ \mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) $$where

$$ \begin{align} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) &= \sum_{j=1}^k \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 = \sum_{i=1}^n \|\mathbf{x}_i - \boldsymbol{\mu}_{c(i)}\|^2. \end{align} $$Here $\boldsymbol{\mu}_i \in \mathbb{R}^d$ is the representative -- or center -- of cluster $C_i$. Note that $\boldsymbol{\mu}_i$ need not be one of the $\mathbf{x}_j$'s.

Our goal is to find a partition that minimizes $\mathcal{G}(C_1,\ldots,C_k)$, i.e., to solve the problem:

$$ \min_{C_1,\ldots,C_k} \mathcal{G}(C_1,\ldots,C_k) $$over all partitions of $[n]$ of size $k$.

To quote Wikipedia:

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

In general, the problem is NP-hard, that is, no fast algorithm is expected to exist to solve it. The $k$-means algorithm is a popular heuristic. It is based on the idea that the following two sub-problems are easy to solve:

- finding the optimal representatives for a fixed partition
- finding the optimal partition for a fixed set of representatives.

One then alternates between the two (perhaps until progress falls below a tolerance). To elaborate on the first step, we review an elementary fact about quadratic functions.

A key step of the algorithm involves minimizing a quadratic function. Consider the function

$$q(x) = a x^2 + b x + c.$$When $a > 0$, $q$ has a unique minimum.

**Lemma (Minimum of Quadratic Function):** Let $q(x) = a x^2 + b x + c$ where $a > 0$ and $x \in \mathbb{R}$. The unique minimum of $q$ is achieved at $x^* = -b/2a$.

*Proof:* By the *First-Order Necessary Condition*, a global minimizer of $q$ (which is necessarily a local minimizer) satisfies the condition

whose unique solution is

$$ x^*= -\frac{b}{2a}. $$To see that $x^*$ is indeed a global minimizer, we re-write $q$ as

$$ \begin{align} q(x) &= a \left(x^2 + 2 \left[\frac{b}{2a}\right] x\right) + c\\ &= a \left(x^2 + 2 \left[\frac{b}{2a}\right] x + \left[\frac{b}{2a}\right]^2\right) - a \left[\frac{b}{2a}\right]^2 + c\\ &= a (x - x^*)^2 + \left[c - \frac{b^2}{4a}\right]. \end{align} $$Clearly, any other $x$ gives a higher value for $q$. $\square$

**Lemma (Optimal Representatives):** Fix a partition $C_1,\ldots,C_k$. The optimal representatives under $G$ are the centroids

*Proof idea:* The objective $G$ can be written as a sum, where each term is a quadratic function in one component of one of the $\boldsymbol{\mu}_i$'s. Each of these terms is minimized by the average of the corresponding components of the $\mathbf{x}_j$'s belonging $C_i$.

*Proof:* Note that we can expand the $k$-means objective as

The expression in square brackets is a quadratic function in $\mu_{i,m}$:

$$ q_{i,m}(\mu_{i,m}) = \left\{\sum_{j \in C_i} x_{j,m}^2\right\} + \left\{- 2 \sum_{j \in C_i} x_{j,m}\right\} \mu_{i,m} + \left\{|C_i| \right\} \mu_{i,m}^2, $$and therefore, by the formula for the minimum of a quadratic function, is minimized at

$$ \mu_{i,m}^* = - \frac{- 2 \sum_{j \in C_i} x_{j,m}}{2 |C_i|} = \frac{1}{|C_i|} \sum_{j \in C_i} x_{j,m}. $$Since each term in the sum making up $G$ is strictly minimized at $\boldsymbol{\mu}_1^*,\ldots, \boldsymbol{\mu}_k^*$, so is $G$. $\square$

**Lemma (Optimal Clustering):** Fix the representatives $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k$. An optimal partition under $G$ is obtained as follows. For each $j$, find the $\boldsymbol{\mu}_i$ that minimizes $\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$ (picking one arbitrarily in the case of ties) and assign $\mathbf{x}_j$ to $C_i$.

*Proof:* By definition, each term in

is, when the $\boldsymbol{\mu}_j$'s are fixed, minimized by the assignment in the statement. Hence so is $G$. $\square$

We are now ready to describe the $k$-means algorithm. We start from a random assignment of clusters. We then alternate between the optimal choices in the lemmas. In lieu of pseudo-code, we write out the algorithm in Julia.

The input `X`

is assumed to be a collection of $n$ vectors $\mathbf{x_1}, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ stacked into a matrix. The other input, `k`

, is the desired number of clusters. There is an optional input `maxiter`

for the maximum number of iterations, which is set to $10$ by default.

We first define separate functions for the two main steps. To find the minimum of an array, we use the function `findmin`

.

In [3]:

```
function opt_reps(X, k, assign)
n, d = size(X)
reps = zeros(Float64, k, d) # rows are representatives
for j = 1:k
in_j = [i for i=1:n if assign[i] == j]
reps[j,:] = sum(X[in_j,:],dims=1) ./ length(in_j)
end
return reps
end
```

Out[3]:

opt_reps (generic function with 1 method)

In [4]:

```
function opt_clust(X, k, reps)
n, d = size(X) # n=number of rows, d=number of columns
dist = zeros(Float64, n) # distance to rep
assign = zeros(Int64, n) # cluster assignments
for i = 1:n
dist[i],assign[i] = findmin([norm(X[i,:].-reps[j,:]) for j=1:k])
end
@show G = sum(dist.^2)
return assign
end
```

Out[4]:

opt_clust (generic function with 1 method)

In [5]:

```
function mmids_kmeans(X, k; maxiter=10)
n, d = size(X)
assign = [rand(1:k) for i=1:n] # start with random assignments
reps = zeros(Int64, k, d) # initialization of reps
for iter = 1:maxiter
# Step 1: Optimal representatives for fixed clusters
reps = opt_reps(X, k, assign)
# Step 2: Optimal clusters for fixed representatives
assign = opt_clust(X, k, reps)
end
return assign
end
```

Out[5]:

mmids_kmeans (generic function with 1 method)

The $k$-means algorithm is only a heuristic. In particular, it is not guaranteed to find the global minimum of the $k$-means objective. However, it is guaranteed to improve the objective at every iteration, or more precisely, not to make it worse.

**Theorem (Convergence of $k$-means):** The sequence of objective function values produced by the $k$-means algorithm is non-increasing.

*Proof idea:* By the *Optimal Representatives* and *Optimal Clustering* lemmas, each step does not increase the objective.

*Proof:* Let $C_1',\ldots,C_k'$ be the current clusters, with representatives $\boldsymbol{\mu}_1',\ldots,\boldsymbol{\mu}_k'$. After Step 1, the new representatives are $\boldsymbol{\mu}_1'',\ldots,\boldsymbol{\mu}_k''$. By the *Optimal Representatives Lemma*, they satisfy

After Step 2, the new clusters are $C_1'',\ldots,C_k''$. By the *Optimal Clustering Lemma*, they satisfy

Combining these two inequalities gives

$$ \sum_{i=1}^k \sum_{j \in C_i''} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i'\|^2, $$as claimed. $\square$

We will test our implementation of $k$-means on a simple simulated dataset.

Recall that a standard Normal variable $X$ has PDF

$$ f_X(x) = \frac{1}{\sqrt{2 \pi}} \exp\left( - x^2/2 \right). $$Its mean is $0$ and its variance is $1$.

This is what its PDF looks like:

(Source)

To construct a $d$-dimensional version, we take $d$ independent standard Normal variables $X_1, X_2, \ldots, X_d$ and form the vector $\mathbf{X} = (X_1,\ldots,X_d)$. We will say that $\mathbf{X}$ is a standard Normal $d$-vector. By independence, its joint PDF is given by the product of the PDFs of the $X_i$'s, that is,

$$ \begin{align} f_{\mathbf{X}}(\mathbf{x}) &= \prod_{i=1}^d \frac{1}{\sqrt{2 \pi}} \exp\left( - x_i^2/2 \right)\\ &= \frac{1}{\prod_{i=1}^d \sqrt{2 \pi}} \exp\left( - \sum_{i=1}^d x_i^2/2 \right)\\ &= \frac{1}{(2 \pi)^{d/2}} \exp(-\|\mathbf{x}\|^2/2). \end{align} $$We can also shift and scale it.

**Definition (Spherical Gaussian):** Let $\mathbf{Z}$ be a standard Normal $d$-vector,
let $\boldsymbol{\mu} \in \mathbb{R}^d$ and let $\sigma \in \mathbb{R}_+$. Then we will refer to the transformed random variable $\mathbf{X} = \boldsymbol{\mu} + \sigma \mathbf{Z}$ as a spherical Gaussian with mean $\boldsymbol{\mu}$ and variance $\sigma^2$. We use the notation $\mathbf{Z} \sim N_d(\boldsymbol{\mu}, \sigma^2 I)$. $\lhd$

The following function generates $n$ data points from two spherical $d$-dimensional Gaussians with variance $1$, one with mean $-w\mathbf{e}_1$ and one with mean $w \mathbf{e}_1$. Below, `randn(d)`

generates a `d`

-dimensional spherical Gaussian.

In [6]:

```
function two_clusters(d, n, w)
X1 = reduce(hcat, [vcat(-w, zeros(d-1)) .+ randn(d) for i=1:n])'
X2 = reduce(hcat, [vcat(w, zeros(d-1)) .+ randn(d) for i=1:n])'
return X1, X2
end
```

Out[6]:

two_clusters (generic function with 1 method)

We will mix these two datasets to form an interesting case for clustering.

We start with $d=2$.

In [7]:

```
d, n, w = 2, 100, 3.
X1, X2 = two_clusters(d, n, w)
X = vcat(X1, X2)
scatter(X[:,1], X[:,2], legend=false, size=(700,350))
```

Out[7]:

In [8]:

```
assign = mmids_kmeans(X, 2);
```

In [9]:

```
scatter(X[:,1], X[:,2], marker_z=assign, legend=false, size=(700,350))
```

Out[9]:

Let's see what happens in higher dimension. We repeat our experiment with $d=1000$.

In [10]:

```
d, n, w = 1000, 100, 3
X1, X2 = two_clusters(d, n, w)
X = vcat(X1, X2)
scatter(X[:,1], X[:,2], legend=false, size=(700,350))
```

Out[10]:

In [11]:

```
scatter(X[:,2], X[:,3], legend=false, size=(350,350))
```

Out[11]:

In [12]:

```
assign = mmids_kmeans(X, 2);
```

In [13]:

```
scatter(X[:,1], X[:,2], marker_z=assign, legend=false, size=(700,350))
```

Out[13]:

Our attempt at clustering does not appear to have been successful. What happened? While these clusters are easy to tease apart if we know to look at the first coordinate only, in the full space the within-cluster and between-cluster distances become harder to distinguish: the noise overwhelms the signal.

The function below plots the histograms of within-cluster and between-cluster distances for a sample of size $n$ in $d$ dimensions with a given offset. As $d$ increases, the two distributions become increasingly indistinguishable. Later in the course, we will develop dimension-reduction techniques that help deal with this issue.

In [14]:

```
function highdim_2clusters(d, n, w)
# generate datasets
X1, X2 = two_clusters(d, n, w)
# within-cluster distances for X1
intra = vec([norm(X1[i,:] .- X1[j,:]) for i=1:n, j=1:n if j>i])
h = histogram(intra, normalize=:probability,
label="within-cluster", alpha=0.75, title="dim=$d") # alpha=transparency
# between-cluster distances
inter = vec([norm(X1[i,:] .- X2[j,:]) for i=1:n, j=1:n])
histogram!(inter, normalize=:probability,
label="between-cluster", alpha=0.75)
end
```

Out[14]:

highdim_2clusters (generic function with 1 method)

In [15]:

```
highdim_2clusters(3, 100, 3)
```

Out[15]:

In [16]:

```
highdim_2clusters(100, 100, 3)
```

Out[16]:

In [17]:

```
highdim_2clusters(1000, 100, 3)
```

Out[17]:

*(1) Expectation of $\Delta$:* By defintion, the random variables $X_{1,i} - X_{2,i}$, $i = 1,\ldots, d$, and $Y_{1,i} - Y_{2,i}$, $i = 2,\ldots, d$, are identically distributed. So, by linearity of expectation,

Further, we can write $Y_{1,1} - Y_{1,2} \sim (Z_1 -w) - (Z_2+w)$ where $Z_1, Z_2 \sim N(0,1)$ are independent, where here $\sim$ indicates equality in distribution. Hence, we have

$$ \begin{align} \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] &= \mathbb{E}[(Z_1 - Z_2 - 2w)^2]\\ &= \mathbb{E}[(Z_1 - Z_2)^2] - 4w \,\mathbb{E}[Z_1 - Z_2] + 4 w^2. \end{align} $$Similarly, $X_{1,1} - X_{1,2} \sim Z_1 - Z_2$ so $\mathbb{E}[(X_{1,1} - X_{2,1})^2] = \mathbb{E}[(Z_1 - Z_2)^2]$. Since $\mathbb{E}[Z_1 - Z_2] = 0$, we finally get $\mathbb{E}[\Delta] = 4 w^2$.