$k$-means clustering¶

Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 1, 2020

Recap from the lectures¶

The setup We are given $n$ vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ in $\mathbb{R}^d$. Our goal is to find a good clustering: loosely speaking, we want to partition these data points into $k$ disjoint subsets -- or clusters -- with small pairwise distances within clusters and large pairwise distances across clusters. To make this rather imprecise problem more precise, we consider a specific objective function known as the $k$-means objective.

Fix a number of clusters $k$. A partition of $[n] = \{1,\ldots,n\}$ of size $k$ is a collection of disjoint non-empty subsets $C_1,\ldots,C_k \subseteq [n]$ that covers all of $[n]$, that is, such that $\cup_{i=1}^k C_i = [n]$. The cost of $C_1,\ldots,C_k$ is then defined as

$$\mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2.$$

We think of $\boldsymbol{\mu}_i$ as the representative -- or center -- of cluster $C_i$. Note that $\boldsymbol{\mu}_i$ need not be one of the $\mathbf{x}_j$'s. Our goal is to find a partition that minimizes $\mathcal{G}(C_1,\ldots,C_k)$.
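As a quick illustration of the objective, the cost of a fixed partition with its centroids as centers can be computed directly (the four points and the partition below are a hypothetical toy example, not part of the dataset analyzed later):

```julia
using LinearAlgebra

# toy data: four points in R^2 (hypothetical example)
X = [0.0 0.0; 0.0 1.0; 5.0 0.0; 5.0 1.0]

# a partition of [4] into k = 2 clusters
C = [[1, 2], [3, 4]]

# centers: the centroid of each cluster
mu = [vec(sum(X[Ci, :], dims=1) ./ length(Ci)) for Ci in C]

# k-means cost: sum of squared distances of points to their cluster's center
G = sum(sum(norm(X[j, :] .- mu[i])^2 for j in C[i]) for i in 1:2)
```

Here each cluster's centroid sits midway between its two points, so each point contributes $(1/2)^2 = 0.25$ to the cost.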

The $k$-means algorithm is a popular heuristic. It is based on the idea that the following two sub-problems are easy to solve:

1. finding the optimal representatives for a fixed partition
2. finding the optimal partition for a fixed set of representatives.

One then alternates between the two.

The theory We proved the following.

Lemma (Optimal Representatives): Fix a partition $C_1,\ldots,C_k$. The optimal representatives are the centroids

$$\boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j\in C_i} \mathbf{x}_j.$$
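The lemma can be sanity-checked numerically: perturbing the centroid in any direction can only increase the within-cluster sum of squared distances. (The three points below are a hypothetical example.)

```julia
using LinearAlgebra

# hypothetical cluster of three points in R^2
pts = [1.0 0.0; 2.0 0.0; 3.0 3.0]

# centroid: coordinate-wise average of the points
centroid = vec(sum(pts, dims=1) ./ size(pts, 1))   # [2.0, 1.0]

# within-cluster sum of squared distances for a candidate center mu
cost(mu) = sum(norm(pts[j, :] .- mu)^2 for j in 1:size(pts, 1))
```

For instance, `cost(centroid .+ [0.1, 0.0])` exceeds `cost(centroid)`, consistent with the lemma.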

Lemma (Optimal Clustering): Fix the representatives $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k$. An optimal partition is obtained as follows. For each $j$, find the $\boldsymbol{\mu}_i$ that minimizes $\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$ (picking one arbitrarily in the case of ties) and assign $\mathbf{x}_j$ to $C_i$.

Theorem (Convergence of $k$-means): The sequence of objective function values produced by the $k$-means algorithm is non-increasing.
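The theorem can be checked on synthetic data with a minimal, self-contained sketch of the alternating scheme (the blob data and the function name `kmeans_costs` are hypothetical, for illustration only):

```julia
using LinearAlgebra, Random

# run k-means for maxiter rounds and record the objective value after each round
function kmeans_costs(X, k; maxiter=10)
    n = size(X, 1)
    assign = rand(1:k, n)           # random initial clustering
    Gs = Float64[]
    for _ in 1:maxiter
        # Step 1: optimal representatives = centroids of the current clusters
        reps = vcat([sum(X[assign .== i, :], dims=1) ./ max(count(assign .== i), 1)
                     for i in 1:k]...)
        # Step 2: optimal clusters = assign each point to its nearest representative
        res = [findmin([norm(X[j, :] .- reps[i, :]) for i in 1:k]) for j in 1:n]
        assign = [r[2] for r in res]
        push!(Gs, sum(r[1]^2 for r in res))
    end
    return Gs
end

Random.seed!(535)
X = vcat(randn(20, 2), randn(20, 2) .+ 5.0)   # two hypothetical blobs
Gs = kmeans_costs(X, 2)
```

The recorded sequence `Gs` is non-increasing, as the theorem guarantees.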

In [2]:
using LinearAlgebra # for norm

function opt_clust(X, k, reps)
    n, d = size(X) # n = number of rows, d = number of columns
    dist = zeros(Float64, n) # distance to assigned representative
    assign = zeros(Int64, n) # cluster assignments
    for i = 1:n
        dist[i], assign[i] = findmin([norm(X[i,:] .- reps[j,:]) for j=1:k])
    end
    @show G = sum(dist.^2)
    return assign
end

Out[2]:
opt_clust (generic function with 1 method)
In [3]:
function opt_reps(X, k, assign)
    n, d = size(X)
    reps = zeros(Float64, k, d) # rows are representatives
    for j = 1:k
        in_j = [i for i=1:n if assign[i] == j]
        reps[j,:] = sum(X[in_j,:], dims=1) ./ length(in_j)
    end
    return reps
end

Out[3]:
opt_reps (generic function with 1 method)
In [4]:
function mmids_kmeans(X, k; maxiter=10)
    n, d = size(X)
    assign = rand(1:k, n) # random initial assignment
    for iter = 1:maxiter
        # Step 1: Optimal representatives for fixed clusters
        reps = opt_reps(X, k, assign)
        # Step 2: Optimal clusters for fixed representatives
        assign = opt_clust(X, k, reps)
    end
    return assign
end

Out[4]:
mmids_kmeans (generic function with 1 method)

1 Species delimitation¶

We will look again at the classical iris dataset first analyzed by Fisher. We will upload the data in the form of a DataFrame -- similar to a spreadsheet -- where the columns are different measurements (or features) and the rows are different samples.

In [5]:
df = CSV.read("iris-measurements.csv")
first(df, 5)

Out[5]:

5 rows × 5 columns

│ Row │ Id    │ PetalLengthCm │ PetalWidthCm │ SepalLengthCm │ SepalWidthCm │
│     │ Int64 │ Float64       │ Float64      │ Float64       │ Float64      │
│ 1   │ 1     │ 1.4           │ 0.2          │ 5.1           │ 3.5          │
│ 2   │ 2     │ 1.4           │ 0.2          │ 4.9           │ 3.0          │
│ 3   │ 3     │ 1.3           │ 0.2          │ 4.7           │ 3.2          │
│ 4   │ 4     │ 1.5           │ 0.2          │ 4.6           │ 3.1          │
│ 5   │ 5     │ 1.4           │ 0.2          │ 5.0           │ 3.6          │
In [6]:
describe(df)

Out[6]:

5 rows × 8 columns

│ Row │ variable      │ mean    │ min │ median │ max │ nunique │ nmissing │ eltype   │
│     │ Symbol        │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
│ 1   │ Id            │ 75.5    │ 1   │ 75.5   │ 150 │         │          │ Int64    │
│ 2   │ PetalLengthCm │ 3.75867 │ 1.0 │ 4.35   │ 6.9 │         │          │ Float64  │
│ 3   │ PetalWidthCm  │ 1.19867 │ 0.1 │ 1.3    │ 2.5 │         │          │ Float64  │
│ 4   │ SepalLengthCm │ 5.84333 │ 4.3 │ 5.8    │ 7.9 │         │          │ Float64  │
│ 5   │ SepalWidthCm  │ 3.054   │ 2.0 │ 3.0    │ 4.4 │         │          │ Float64  │
In [7]:
X = reduce(hcat, [df[:,:PetalLengthCm], df[:,:PetalWidthCm],
                  df[:,:SepalLengthCm], df[:,:SepalWidthCm]]);
scatter(X[:,1], X[:,2],
        legend=false, xlabel="PetalLength", ylabel="PetalWidth")

Out[7]: (scatter plot of PetalLength vs. PetalWidth)
In [8]:
assign = mmids_kmeans(X, 2);

G = sum(dist .^ 2) = 622.7644623999738
G = sum(dist .^ 2) = 156.29984835441667
G = sum(dist .^ 2) = 152.53336910043723
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903

In [9]:
scatter(X[:,1], X[:,2], marker_z=assign, legend=false)

Out[9]: (scatter plot colored by the 2-cluster assignment)
In [10]:
assign = mmids_kmeans(X, 3; maxiter=20);

G = sum(dist .^ 2) = 543.1623859092733
G = sum(dist .^ 2) = 101.63572149600577
G = sum(dist .^ 2) = 88.35799644444447
G = sum(dist .^ 2) = 85.04157943238867
G = sum(dist .^ 2) = 84.10217888865148
G = sum(dist .^ 2) = 83.13638186876973
G = sum(dist .^ 2) = 81.8390020677262
G = sum(dist .^ 2) = 80.895776
G = sum(dist .^ 2) = 79.96297983461302
G = sum(dist .^ 2) = 79.43376414532675
G = sum(dist .^ 2) = 79.01070972222222
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728

In [12]:
p3clust = scatter(X[:,1], X[:,2],
                  marker_z=assign, legend=false)

Out[12]: (scatter plot colored by the 3-cluster assignment)
In [13]:
df_truth = CSV.read("iris-species.csv")
first(df_truth, 5)

Out[13]:

5 rows × 2 columns

│ Row │ Id    │ Species     │
│     │ Int64 │ String      │
│ 1   │ 1     │ Iris-setosa │
│ 2   │ 2     │ Iris-setosa │
│ 3   │ 3     │ Iris-setosa │
│ 4   │ 4     │ Iris-setosa │
│ 5   │ 5     │ Iris-setosa │
In [14]:
species = df_truth[:,:Species]
unique(species)

Out[14]:
3-element Array{String,1}:
"Iris-setosa"
"Iris-versicolor"
"Iris-virginica"
In [15]:
species2number = Dict("Iris-setosa"=>1, "Iris-versicolor"=>2, "Iris-virginica"=>3)
truth = map(i -> get(species2number,i,0), species)
ptruth = scatter(X[:,1], X[:,2],
                 marker_z=truth, legend=false, seriescolor=:lightrainbow, title="truth")
plot(ptruth, p3clust, layout=(1,2), size=(750,375))

Out[15]: (side-by-side scatter plots: true species labels vs. the 3-cluster assignment)