TUTORIAL 0

$k$-means clustering


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 1, 2020
Copyright: © 2020 Sebastien Roch


Recap from the lectures

The setup. We are given $n$ vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ in $\mathbb{R}^d$. Our goal is to find a good clustering: loosely speaking, we want to partition these data points into $k$ disjoint subsets -- or clusters -- with small pairwise distances within clusters and large pairwise distances across clusters. To make this rather imprecise problem more precise, we consider a specific objective function known as the $k$-means objective.

Fix a number of clusters $k$. A partition of $[n] = \{1,\ldots,n\}$ of size $k$ is a collection of disjoint non-empty subsets $C_1,\ldots,C_k \subseteq [n]$ that covers all of $[n]$, that is, such that $\cup_{i=1}^k C_i = [n]$. The cost of $C_1,\ldots,C_k$ is then defined as

$$ \mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2. $$

We think of $\boldsymbol{\mu}_i$ as the representative -- or center -- of cluster $C_i$. Note that $\boldsymbol{\mu}_i$ need not be one of the $\mathbf{x}_j$'s. Our goal is to find a partition that minimizes $\mathcal{G}(C_1,\ldots,C_k)$.
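
For example, take $k=1$ and two points $\mathbf{x}_1 = 0$, $\mathbf{x}_2 = 2$ in $\mathbb{R}$: the optimal center is the midpoint $\boldsymbol{\mu}_1 = 1$, giving cost $\mathcal{G}(C_1) = (0-1)^2 + (2-1)^2 = 2$, and any other choice of $\boldsymbol{\mu}_1$ gives a strictly larger cost.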

The $k$-means algorithm is a popular heuristic. It is based on the idea that the following two sub-problems are easy to solve:

  1. finding the optimal representatives for a fixed partition
  2. finding the optimal partition for a fixed set of representatives.

One then alternates between the two.

The theory. We proved the following.

Lemma (Optimal Representatives): Fix a partition $C_1,\ldots,C_k$. The optimal representatives are the centroids

$$ \boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j\in C_i} \mathbf{x}_j. $$
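
As a quick numerical check of this lemma (an addition to this recap, not part of the lectures), the snippet below compares the cost of a random cluster at its centroid to the cost at randomly perturbed centers; the centroid should never lose. Here cluster_cost is a helper name introduced just for this check.

using LinearAlgebra, Random

# cost of a single cluster for a given center mu
cluster_cost(Xc, mu) = sum(norm(Xc[i,:] .- mu)^2 for i in 1:size(Xc,1))

Random.seed!(535) # arbitrary seed, for reproducibility
Xc = randn(20, 3) # one cluster of 20 points in R^3
mu_star = vec(sum(Xc, dims=1)) ./ size(Xc,1) # centroid of the cluster
c_star = cluster_cost(Xc, mu_star)
# perturbing the centroid should never decrease the cost
all(cluster_cost(Xc, mu_star .+ 0.1 .* randn(3)) >= c_star for _ = 1:100)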

Lemma (Optimal Clustering): Fix the representatives $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k$. An optimal partition is obtained as follows. For each $j$, find the $\boldsymbol{\mu}_i$ that minimizes $\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$ (picking one arbitrarily in the case of ties) and assign $\mathbf{x}_j$ to $C_i$.

Theorem (Convergence of $k$-means): The sequence of objective function values produced by the $k$-means algorithm is non-increasing.
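
This follows by combining the two lemmas: replacing the representatives with the centroids of the current clusters cannot increase the objective, and reassigning each point to its closest representative cannot increase it either. Note that the theorem guarantees only that the objective does not increase; the algorithm may still converge to a local solution far from the global optimum, which is why in practice one often reruns it from several random initializations.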

In [1]:
# packages used throughout the notebook (the original import cell is not
# reproduced above; these are the packages the code below relies on)
using CSV, DataFrames, LinearAlgebra, Plots
In [2]:
function opt_clust(X, k, reps)
    n, d = size(X) # n=number of rows, d=number of columns 
    dist = zeros(Float64, n) # distance to the closest representative
    assign = zeros(Int64, n) # cluster assignment of each point
    for i = 1:n
        # distance to, and index of, the closest representative
        dist[i], assign[i] = findmin([norm(X[i,:] .- reps[j,:]) for j=1:k])
    end
    # current value of the k-means objective (printed at each call)
    @show G = sum(dist.^2)
    return assign
end
Out[2]:
opt_clust (generic function with 1 method)
In [3]:
function opt_reps(X, k, assign)
    n, d = size(X)
    reps = zeros(Float64, k, d) # rows are representatives
    for j = 1:k
        in_j = [i for i=1:n if assign[i] == j] # points assigned to cluster j
        # centroid of cluster j (assumes every cluster is non-empty)
        reps[j,:] = sum(X[in_j,:], dims=1) ./ length(in_j)
    end
    return reps
end
Out[3]:
opt_reps (generic function with 1 method)
In [4]:
function mmids_kmeans(X, k; maxiter=10)
    n, d = size(X)
    assign = [rand(1:k) for i=1:n] # start with random assignments
    reps = zeros(Float64, k, d) # initialization of reps (overwritten in the first iteration)
    for iter = 1:maxiter
        # Step 1: Optimal representatives for fixed clusters
        reps = opt_reps(X, k, assign) 
        # Step 2: Optimal clusters for fixed representatives
        assign = opt_clust(X, k, reps)     
    end
    return assign
end
Out[4]:
mmids_kmeans (generic function with 1 method)
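
Before applying the algorithm to real data, here is a small synthetic sanity check (an addition, not part of the original notebook): two well-separated Gaussian blobs in the plane, which mmids_kmeans should recover almost perfectly. The objective values printed by opt_clust are omitted here.

using Random
Random.seed!(535) # arbitrary seed, for reproducibility
# 50 points around (-2, 0) and 50 points around (2, 0)
Xtoy = vcat(randn(50, 2) .+ [-2.0 0.0], randn(50, 2) .+ [2.0 0.0])
assign_toy = mmids_kmeans(Xtoy, 2; maxiter=5)
# each blob should end up almost entirely in one cluster
[sum(assign_toy[1:50] .== j) for j=1:2], [sum(assign_toy[51:100] .== j) for j=1:2]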

1 Species delimitation

We will look again at the classical iris dataset first analyzed by Fisher. We will load the data in the form of a DataFrame -- similar to a spreadsheet -- where the columns are different measurements (or features) and the rows are different samples.

In [5]:
df = CSV.read("iris-measurements.csv", DataFrame)
first(df, 5)
Out[5]:

5 rows × 5 columns

 Row │ Id     PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm
     │ Int64  Float64        Float64       Float64        Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 1      1.4            0.2           5.1            3.5
   2 │ 2      1.4            0.2           4.9            3.0
   3 │ 3      1.3            0.2           4.7            3.2
   4 │ 4      1.5            0.2           4.6            3.1
   5 │ 5      1.4            0.2           5.0            3.6
In [6]:
describe(df)
Out[6]:

5 rows × 8 columns

 Row │ variable       mean     min  median  max  nunique  nmissing  eltype
     │ Symbol         Float64  Real Float64 Real Nothing  Nothing   DataType
─────┼──────────────────────────────────────────────────────────────────────
   1 │ Id             75.5     1    75.5    150                     Int64
   2 │ PetalLengthCm  3.75867  1.0  4.35    6.9                     Float64
   3 │ PetalWidthCm   1.19867  0.1  1.3     2.5                     Float64
   4 │ SepalLengthCm  5.84333  4.3  5.8     7.9                     Float64
   5 │ SepalWidthCm   3.054    2.0  3.0     4.4                     Float64
In [7]:
X = reduce(hcat, [df[:,:PetalLengthCm], df[:,:PetalWidthCm], 
        df[:,:SepalLengthCm], df[:,:SepalWidthCm]]);
scatter(X[:,1], X[:,2], 
    legend=false, xlabel="PetalLength", ylabel="PetalWidth")
Out[7]: [scatter plot of the data in the PetalLength-PetalWidth plane]
In [8]:
assign = mmids_kmeans(X, 2);
G = sum(dist .^ 2) = 622.7644623999738
G = sum(dist .^ 2) = 156.29984835441667
G = sum(dist .^ 2) = 152.53336910043723
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
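As guaranteed by the convergence theorem, the objective values are non-increasing; here they stabilize after four iterations.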
In [9]:
scatter(X[:,1], X[:,2], marker_z=assign, legend=false)
Out[9]: [the same scatter plot, colored by the 2-cluster assignment]
In [10]:
assign = mmids_kmeans(X, 3; maxiter=20);
G = sum(dist .^ 2) = 543.1623859092733
G = sum(dist .^ 2) = 101.63572149600577
G = sum(dist .^ 2) = 88.35799644444447
G = sum(dist .^ 2) = 85.04157943238867
G = sum(dist .^ 2) = 84.10217888865148
G = sum(dist .^ 2) = 83.13638186876973
G = sum(dist .^ 2) = 81.8390020677262
G = sum(dist .^ 2) = 80.895776
G = sum(dist .^ 2) = 79.96297983461302
G = sum(dist .^ 2) = 79.43376414532675
G = sum(dist .^ 2) = 79.01070972222222
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
In [12]:
p3clust = scatter(X[:,1], X[:,2], marker_z=assign, legend=false,
    title="3 clusters") # titled for the side-by-side comparison below
Out[12]: [scatter plot colored by the 3-cluster assignment]
In [13]:
df_truth = CSV.read("iris-species.csv", DataFrame)
first(df_truth, 5)
Out[13]:

5 rows × 2 columns

 Row │ Id     Species
     │ Int64  String
─────┼────────────────────
   1 │ 1      Iris-setosa
   2 │ 2      Iris-setosa
   3 │ 3      Iris-setosa
   4 │ 4      Iris-setosa
   5 │ 5      Iris-setosa
In [14]:
species = df_truth[:,:Species]
unique(species)
Out[14]:
3-element Array{String,1}:
 "Iris-setosa"
 "Iris-versicolor"
 "Iris-virginica"
In [15]:
species2number = Dict("Iris-setosa"=>1, "Iris-versicolor"=>2, "Iris-virginica"=>3)
truth = map(i -> get(species2number,i,0), species)
ptruth = scatter(X[:,1], X[:,2], 
    marker_z=truth, legend=false, seriescolor=:lightrainbow, title="truth")
plot(ptruth, p3clust, layout=(1,2), size=(750,375))
Out[15]: [side-by-side scatter plots: true species (left), 3-cluster output (right)]
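
To quantify the agreement between the 3-cluster output and the true species (a check added here, not part of the original notebook), we can count how many samples of each species land in each cluster. Since $k$-means numbers its clusters arbitrarily, the columns of this table are meaningful only up to permutation; a good clustering produces a matrix with one large entry per row.

# counts[s, c] = number of samples of species s assigned to cluster c
counts = [sum((truth .== s) .& (assign .== c)) for s=1:3, c=1:3]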